Planet Code4Lib

Clear Vision Is Vital to Digital Transformation Success / Lucidworks

Our research finds that digital transformation projects must be based on a clear vision, propagated by the CEO, in order to succeed.

In a review of articles and studies on the keys to digital transformation success, there’s near uniformity around the idea that the C-suite in general, and the CEO in particular, must lead the vision.

Tom Puthiyamadam of PwC said recently the CEO is increasingly at the forefront of digital transformation. “I think the concept of the Chief Digital Officer at some point needs to fade because I don’t think a CEO can at any point outsource the most critical movement in their company,” he said. “Leveraging AI, automation, all this new technology innovation—I don’t know how you outsource that to somebody else.”

An MIT Sloan Management Review survey of more than 4,300 managers, executives, and analysts worldwide found that more digitally mature organizations are likely to have the CEO, rather than the CIO, lead digital transformation efforts. This may be because digital transformation is as much a cultural and operational change as it is a technological one.

As this ZDNet article points out, CEOs have the institutional gravitas to push the entire organization in the direction digital transformation requires.

A Digital Transformation Plan Is Key to Success

A clearly articulated plan for transformation is vital to success. A McKinsey report found that companies whose management teams put in place a change narrative for the digital transformation were more than three times more likely to succeed than those that did not.

Additionally, companies were twice as likely to have positive results if senior managers “fostered a sense of urgency” to make the transformation occur.

Finally, companies with clear metrics to assess implementation of the vision had twice the success rate of companies that lacked these measurements.

A comprehensive Harvard Business Review article on digital transformation asserts that strategy must come before anything else, saying, “Figure out your business strategy before you invest in anything.”

While this might seem obvious, many companies skip straight to purchasing technology without first understanding what they will use it for. Establishing a “Business Why” for the digital transformation is likewise key to long-term success.

This ‘why’ is aligned to business objectives such as making the organization more agile or adaptable, keeping up with competitors, and attaining greater profits.

Once the vision is in place, the CEO and the rest of the company’s leadership must show clear support for its implementation. As Justin Grossman, CEO of meltmedia writes in this piece, thoughtful planning is the basis of long-term success. He explains, “Savvy teams assess organizational goals, analyze integration needs and evaluate impact before designing (or changing) their digital roadmaps.”  

Digital Transformation Is a Long-Term Process

Many technology experts point out the need to view digital transformation as a process or a journey that will change the business permanently, rather than an instantaneous change.

Deloitte’s “Strategy, Not Technology, Drives Digital Transformation” sums up this idea nicely. The report’s authors found that 80 percent of digitally mature organizations had a clear strategy compared to 15 percent of companies that were still in the nascent stages of digital maturation. “The power of a digital transformation strategy lies in its scope and objectives. Less digitally mature organizations tend to focus on individual technologies and have strategies that are decidedly operational in focus. Digital strategies in the most mature organizations are developed with an eye on transforming the business.”

Thus, before being wowed by vendor stories of the capabilities of the newest technologies, companies must turn inward and establish a clear understanding of why they want to digitally transform in the first place. Without such a vision, companies will be relying more on hope and luck to drive change rather than asserting control over the process.

Dan Woods is a Technology Analyst, Writer, IT Consultant, and Content Marketer based in NYC.

The post Clear Vision Is Vital to Digital Transformation Success appeared first on Lucidworks.

Ten Hot Topics / David Rosenthal

The topic of scholarly communication has received short shrift here for the last few years. There has been too much to say about other topics, and developments such as Plan S have been exhaustively discussed elsewhere. But I do want to call attention to an extremely valuable review by Jon Tennant and a host of co-authors entitled Ten Hot Topics around Scholarly Publishing.

The authors pose the ten topics as questions, which allows for a scientific experiment. My hypothesis is that all these questions, while strictly speaking not headlines, will nevertheless obey Betteridge's Law of Headlines: the answer will be "No". Below the fold, I try to falsify my hypothesis.


The ten topics, with illustrative quotes from their sections of the paper, are:
  1. Will preprints get your research scooped? No:
    there is virtually no evidence that ‘scooping’ of research via preprints exists, not even in communities that have broadly adopted the use of the arXiv server for sharing preprints since 1991.
  2. Do the Journal Impact Factor and journal brand measure the quality of authors and their research? No:
    About ten years ago, national and international research funding institutions pointed out that numerical indicators such as the JIF should not be deemed a measure of quality. In fact, the JIF is a highly-manipulated metric, and justifying its continued widespread use beyond its original narrow purpose seems due to its simplicity (easily calculable and comparable number), rather than any actual relationship to research quality
  3. Does approval by peer review prove that you can trust a research paper, its data and the reported conclusions? No:
    Multiple examples across several areas of science find that scientists elevated the importance of peer review for research that was questionable or corrupted. ... At times, peer review has been exposed as a process that was orchestrated for a preconceived outcome. ... Another problem that peer review often fails to catch is ghostwriting, a process by which companies draft articles for academics who then publish them in journals, sometimes with little or no changes.
  4. Will the quality of the scientific literature suffer without journal-imposed peer review? No:
    the credibility conferred by the "peer-reviewed" label diminishes what Feynman calls the culture of doubt necessary for science to operate [as] a self-correcting, truth-seeking process. The troubling effects of this can be seen in the ongoing replication crisis, hoaxes, and widespread outrage over the inefficacy of the current system. ... the issue is not the skepticism shared by the select few who determine whether an article passes through the filter. It is the validation and accompanying lack of skepticism—from both the scientific community and the general public—that comes afterwards. Here again more oversight only adds to the impression that peer review ensures quality, thereby further diminishing the culture of doubt and counteracting the spirit of scientific inquiry.
  5. Is Open Access responsible for creating predatory publishers? No:
    A recent study has shown that Beall’s criteria of “predatory” publishing were in no way limited to OA publishers and that, applying them to both OA and non-OA journals in the field of Library and Information Science, even top tier non-OA journals could be qualified as predatory; ... If a causative connection is to be made in this regard, it is thus not between predatory practices and OA. Instead it is between predatory publishing and the unethical use of one of the many OA business models adopted by a minority of DOAJ registered journals.
  6. Is copyright transfer required to publish and protect authors? No:
    not only does it appear that in scientific research, copyright is largely ineffective in its proposed use, but also perhaps wrongfully acquired in many cases, and goes practically against its fundamental intended purpose of helping to protect authors and further scientific research. ... we are unaware of a single reason why copyright transfer is required for publication, or indeed a single case where a publisher has exercised copyright in the best interest of the authors.
  7. Does gold Open Access have to cost a lot of money for authors, and is it synonymous with the APC business model? No:
    Some journals, such as the Journal of Machine Learning Research which costs between $6.50–$10 per article, demonstrate that the cost of publication can be far more efficient than what is often spent. Usually, the costs of publishing and the factors contributing to APCs are completely concealed. The publishers eLife and Ubiquity Press are transparent about their direct and indirect costs; the latter levies an APC of $500.
    And No:
    these data show that the APC model is far from hegemonic in the way it is often taken to be. For example, most APC-free journals in Latin America are funded by higher education institutions and are not conditional on institutional affiliation for publication.
  8. Are embargo periods on ‘green’ OA needed to sustain publishers? No:
    In 2013 the UK House of Commons Select Committee on Business, Innovation and Skills already concluded that “there is no available evidence base to indicate that short or even zero embargoes cause cancellation of subscriptions”. ... In a reaction to Plan S, Highwire suggested that three of their society publishers make all author manuscripts freely available upon submission and state that they do not believe this practice has contributed to subscription decline. Therefore, there is little evidence or justification supporting the need for embargo periods.
  9. Are Web of Science and Scopus global platforms of knowledge? No:
    Both are commercial enterprises, whose standards and assessment criteria are mostly controlled by panels of gatekeepers in North America and Western Europe. The same is true for more comprehensive databases such as Ulrich’s Web which lists as many as 70,000 journals, but crucially Scopus has fewer than 50% of these, while WoS has fewer than 25%. While Scopus is larger and geographically broader than WoS, it still only covers a fraction of journal publishing outside North America and Europe. For example, it reports a coverage of over 2000 journals in Asia (“230% more than the nearest competitor”) which may seem impressive until you consider that in Indonesia alone there are more than 7000 journals listed on the government’s Garuda portal (of which more than 1300 are currently listed on DOAJ); while at least 2500 Japanese journals listed on the J-Stage platform.
  10. Do publishers add value to the scholarly communication process? Yes:
    publishers do add value to current scholarly communication. Kent Anderson has listed many things that journal publishers do which currently contains 102 items and has yet to be formally contested from anyone who challenges the value of publishers.
    Many items on the list could be argued to be of value primarily to the publishers themselves, e.g., “Make money and remain a constant in the system of scholarly output”. ... It could be questioned though, whether these functions are actually necessary to the core aim of scholarly communication, namely, dissemination of research to researchers and other stakeholders such as policy makers, economic, biomedical and industrial practitioners as well as the general public. Above, for example, we question the necessity of the current infrastructure for peer review, and if a scholar-led crowdsourced alternative may be preferable.
    Importantly, this section ignores the fact that, as established in the previous sections, publishers also subtract value. The ways they subtract value include the massive annual amounts they transfer from research budgets to their executives and shareholder's wallets, the delays they impose on the communication of results, their impact-factor based marketing, their co-option of senior researchers via editorial board slots, and so on and on.


Nine out of ten (technically ten out of eleven) topics conform to Betteridge's Law. The remaining one is a "Yes, but ...", because in my view it was poorly stated. It should have been "Do publishers add more value to the scholarly communication process than they subtract?" Thus whether question #10 falsifies my hypothesis is open to some doubt.

Offweb / Ed Summers

I’m having a bit of fun experimenting with FlexSearch, which provides search capabilities much like Solr and Elasticsearch but runs entirely in the browser. There’s also a short InfoQ piece about it. It looks like you can migrate your index to the server when it gets too big for the browser, which is nice to have as a Plan B.

One of the impressive things about the FlexSearch project is how much it focuses on performance compared to other JavaScript search options. For example, check out the index size comparisons in the project's README.

I’m mostly interested in trying to keep the index client-side to reduce the infrastructure costs of app deployment. If an app can be deployed as a static site with one less dependent service that’s great. But relocating the index to the browser also could allow the app to circulate off the HTTP web, on thumb drives, Dat sites, etc.

After spending the last few decades developing for the web, I never thought I would be focusing so much on off-web circulation. Off-web is increasingly important as owning your data, and being able to share it in particular social settings, become priorities.

It’s not so much a concern about a decentralized or distributed web as it is about a disconnected web: the idea that there is value in not having a globally addressed namespace. This is an idea I’ve been slow to appreciate. It’s also one of the things that keeps me coming back to Scuttlebutt.

Fedora 6: Application-independent Storage to Ensure the Persistence of Digital Data Through Time / DuraSpace News

The Fedora community has long been at the forefront of developing an open source repository framework that supports digital preservation efforts. Fedora 6 aims to integrate a standardized structure for persisting and delivering the essential characteristics of digital objects in Fedora with the emerging Oxford Common File Layout (OCFL) specification. The OCFL defines a shared approach to file hierarchy for long-term preservation, describing application-independent storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.

Overview of Fedora 6 planned features:

  • Replace current Modeshape back-end
  • Implement the OCFL for persistence
  • Add a native, synchronous query interface
  • Improve performance and scale

Fedora 6 digital preservation enhancements are of interest to institutions because digital content stored with Fedora 6 will be transparently readable by both humans and machines, thereby making it accessible with or without Fedora. This application-agnostic approach to preservation is an essential characteristic of ensuring the persistence of digital data through time.

Johns Hopkins University (JHU) and Penn State have joined community efforts to support the development of Fedora 6 by increasing their Fedora membership contributions in 2019.  JHU is now a platinum member of DuraSpace in support of Fedora and Platinum member Penn State has contributed an additional $22,000 in support of Fedora for 2019.

Sayeed Choudhury, Associate Dean for Research Data Management; Hodson Director of the Digital Research and Curation Center, Johns Hopkins University, explains, “We are developing open infrastructure at Johns Hopkins including the Public Access Submission System, or PASS, and an updated data archive specifically designed for health sciences data. Additionally, we plan to migrate our digital collections from DSpace to Islandora 8. Our increased commitment to Fedora in terms of financial resources, software developers, and user feedback reflects its importance for each of these components of our evolving open infrastructure. Fedora 6’s increased support for preservation and migration are ideal next steps in terms of capabilities.”


The post Fedora 6: Application-independent Storage to Ensure the Persistence of Digital Data Through Time appeared first on DuraSpace News.

ARL+DLF Forum Fellowships / Digital Library Federation

This year, the Association of Research Libraries (ARL) and Digital Library Federation (DLF) will again sponsor a number of fellowships meant to foster a more diverse and inclusive practitioner community in digital libraries and related fields. We are proud to be strengthening the networking opportunities available to our Fellows, both at and beyond the Forum, in collaboration with colleagues across our two associations.

ARL+DLF Forum Fellowships, which have been offered since 2013, are designed to offset or completely cover travel and lodging expenses associated with attending the annual DLF Forum, which will be held October 14-16, 2019 in Tampa, Florida. ARL+DLF Forum Fellows additionally receive a complimentary full registration to the Forum (up to a $750 value) and an invitation to special networking events. Fellows will be required to write a blog post about their experiences at the Forum, to be published on the DLF blog.

Eligibility Requirements

Applicants should identify as members of a group (or groups) underrepresented among digital library and cultural heritage practitioners. These include—but are not limited to—people of Hispanic or Latino, Black or African-American, Asian, Middle Eastern, Native Hawaiian or Pacific Islander, First Nations, American Indian, or Alaskan Native descent. Applications from people who could contribute to the diversity of the Forum in other ways are also warmly welcomed.

To Apply

For full details and how to apply, visit:

Applications are due by Monday, June 10 at 11:59 pm Eastern Time. Applicants will be notified of their status in early-to-mid July.

About ARL

The Association of Research Libraries (ARL) is a nonprofit organization of 124 research libraries in the US and Canada. ARL’s mission is to influence the changing environment of scholarly communication and the public policies that affect research libraries and the diverse communities they serve. ARL is on the web and on Twitter at @ARLnews.

About DLF

The Digital Library Federation (DLF) is a robust and diverse community of practitioners dedicated to the advancement of research, learning, social justice, and the public good through the creative design and wise application of digital library technologies. DLF serves as a resource and catalyst for collaboration among the staff of its 190 institutional members and all who are invested in digital library issues. DLF can be found on the web and on Twitter at @CLIRDLF.

The post ARL+DLF Forum Fellowships appeared first on DLF.

For a fair, free and open future: celebrating 15 years of the Open Knowledge Foundation / Open Knowledge Foundation

Fifteen years ago, the Open Knowledge Foundation was launched in Cambridge by entrepreneur and economist Rufus Pollock.

At the time, open data was an entirely new concept. Worldwide internet users were barely above the 10 per cent mark, and Facebook was still in its infancy.

But Rufus foresaw both the massive potential and the huge risks of the modern digital age. He believed in access to information for everyone about how we live, what we consume, and who we are – for example, how our tax money gets spent, what’s in the food we eat or the medicines we take, and where the energy comes from to power our cities.

From humble beginnings, the Open Knowledge Foundation grew across the globe and pioneered the way that we use data today, striving to build open knowledge in government, business and civil society – and creating the technology to make open material useful.

We created the Open Definition that is still the benchmark today – that open data and content can be freely used, modified, and shared by anyone for any purpose.

With staff on six continents, we became known as Open Knowledge International and launched projects in dozens of countries.

As we celebrate our 15th anniversary today, our world has changed dramatically. Large unaccountable technology companies have monopolised the digital age, and an unsustainable concentration of wealth and power has led to stunted growth and lost opportunities. When that happens it is consumers, future innovators and society that lose out.

We live in powerful times, where the greatest danger is not chaos but resting in the past. So as we reach an important milestone in our organisation’s own journey, we recognise it is time for new rules for this new digital world.

We have decided to re-focus our efforts on why we were created in 2004, ‘to promote the openness of all forms of knowledge’, and return to our name as the Open Knowledge Foundation.

Our vision is for a future that is fair, free and open. That will be our guiding principle in everything we do.

Our mission is to create a more open world – a world where all non-personal information is open, free for everyone to use, build on and share; and creators and innovators are fairly recognised and rewarded.

We understand that phrases like ‘open data’ and ‘open knowledge’ are not widely understood. It is our job to change that.

The next 15 years and beyond are not to be feared. We live in a time when technological advances offer incredible opportunities for us all.

This is a time to be hopeful about the future, and to inspire those who want to build a better society.

We want to see enlightened societies around the world, where everyone has access to key information and the ability to use it to understand and shape their lives; where powerful institutions are comprehensible and accountable; and where vital research information that can help us tackle challenges such as poverty and climate change is available to all.

Our work will focus on health, where access to medicines requires new thinking, and on education where new EU-wide copyright law impacts on both academic research and on people’s ability to access knowledge.

We will also concentrate on employment, including tackling the growing inequality from working patterns and conditions, and the ability for creators and innovators to be fairly compensated. This reaches to the heart of a fair, free and open future where there is opportunity for all.

We have also set out five demands for this week’s European elections and will push for MEPs from across Europe to prioritise these when the European Parliament returns in summer.

Firstly, we will fight the introduction of Article 17 of the EU’s copyright reforms which threatens to restrict the sharing of data and other content on the internet for half-a-billion people in Europe.

We also want to see improved transparency measures at social media companies like Facebook to prevent the spread of disinformation and fake news.

We recognise the concerns that people have about the misuse of data, so we will champion ‘responsible data’ to ensure that data is used ethically and legally, and protects privacy.

We also want to persuade governments and organisations to use established and recognised open licences when releasing data or content; and we will aim to build a network of open advocates in the European Parliament who will push for greater openness in their own nations.

We live in a knowledge society where we face two different futures: one which is open and one which is closed.

An open future means knowledge is shared by all – freely available to everyone, a world where people are able to fulfil their potential and live happy and healthy lives.

A closed future is one where knowledge is exclusively owned and controlled, leading to greater inequality.

With inequality rising, our vision of a fair, free and open future has never been more important to realising our mission of an open world in these complex times.

Improving pip-compile --generate-hashes / Harvard Library Innovation Lab

Recently I landed a series of contributions to the Python package pip-tools.

pip-tools is a "set of command line tools to help you keep your pip-based [Python] packages fresh, even when you've pinned them." My changes help the pip-compile --generate-hashes command work for more people.

This isn't a lot of code in the grand scheme of things, but it's the largest set of contributions I've made to a mainstream open source project, so this blog post is a celebration of me! 🎁💥🎉 yay. But it's also a chance to talk about package manager security and open source contributions and stuff like that.

I'll start high-level with "what are package managers" and work my way into the weeds, so feel free to jump in wherever you want.

What are package managers?

Package managers help us install software libraries and keep them up to date. If I want to load a URL and print the contents, I can add a dependency on a package like requests:

$ echo 'requests' > requirements.txt
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
  Downloading (439kB)
     |████████████████████████████████| 440kB 4.1MB/s
Installing collected packages: requests
Successfully installed requests-2.0.1

… and let requests do the heavy lifting:

>>> import requests
>>> requests.get('').text
'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title> ...'

But there's a problem – if I install exactly the same package later, I might get a different result:

$ echo 'requests' > requirements.txt
$ pip install -r requirements.txt
Collecting requests (from -r requirements.txt (line 1))
  Downloading (57kB)
     |████████████████████████████████| 61kB 3.3MB/s
Collecting certifi>=2017.4.17 (from requests->-r requirements.txt (line 1))
  Using cached
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (from requests->-r requirements.txt (line 1))
  Downloading (150kB)
     |████████████████████████████████| 153kB 10.6MB/s
Collecting idna<2.9,>=2.5 (from requests->-r requirements.txt (line 1))
  Using cached
Collecting chardet<3.1.0,>=3.0.2 (from requests->-r requirements.txt (line 1))
  Using cached
Installing collected packages: certifi, urllib3, idna, chardet, requests
Successfully installed certifi-2019.3.9 chardet-3.0.4 idna-2.8 requests-2.22.0 urllib3-1.25.2

I got a different version of requests than last time, and I got some bonus dependencies (certifi, urllib3, idna, and chardet). Now my code might not do the same thing even though I did the same thing, which is not how anyone wants computers to work. (I've cheated a little bit here by showing the first example as though pip install had been run back in 2013.)

So the next step is to pin the versions of my dependencies and their dependencies, using a package like pip-tools:

$ echo 'requests' >
$ pip-compile
$ cat requirements.txt
# This file is autogenerated by pip-compile
# To update, run:
#    pip-compile
certifi==2019.3.9         # via requests
chardet==3.0.4            # via requests
idna==2.8                 # via requests
requests==2.22.0
urllib3==1.25.2           # via requests

(There are other options I could use instead, like pipenv or poetry. For now I still prefer pip-tools, for roughly the reasons laid out by Hynek Schlawack.)

Now when I run pip install -r requirements.txt I will always get the same version of requests, and the same versions of its dependencies, and my program will always do the same thing.

… just kidding.

The problem with pinning Python packages

Unfortunately, pip-compile doesn't quite lock down our dependencies the way we would hope! In Python land you don't necessarily get the same code by asking for the same version number. That's because of binary wheels.

Up until 2015, it was possible to change a package's contents on PyPI without changing the version number, simply by deleting the package and reuploading it. That no longer works, but there is still a loophole: you can delete and reupload binary wheels.

Wheels are a new-ish binary format for distributing Python packages, including any precompiled programs written in C (or other languages) used by the package. They speed up installs and avoid the need for users to have the right compiler environment set up for each package. C-based packages typically offer a bunch of wheel files for different target environments – here's bcrypt's wheel files for example.
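Those target environments are encoded directly in the wheel's filename, per PEP 427 (name-version[-build]-python tag-abi tag-platform tag). As a rough illustration (the parse_wheel helper below is my own and ignores the optional build-tag edge case), here is how one of those bcrypt filenames breaks down:

```python
def parse_wheel(filename):
    """Split a wheel filename into its PEP 427 fields.
    Package names in wheel filenames are normalized (hyphens become
    underscores), so splitting on '-' is safe."""
    stem = filename[:-len(".whl")]
    parts = stem.split("-")
    python_tag, abi_tag, platform_tag = parts[-3:]  # always the last three
    return {"name": parts[0], "version": parts[1],
            "python": python_tag, "abi": abi_tag, "platform": platform_tag}

# The cp37/cp37m/manylinux1_x86_64 tags identify the target environment.
print(parse_wheel("bcrypt-3.1.7-cp37-cp37m-manylinux1_x86_64.whl"))
```

pip picks whichever wheel's tags match the installing interpreter, falling back to the source distribution if none do.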

So what happens if a package was originally released as source, and then the maintainer wants to add binary wheels for the same release years later? PyPI will allow it, and pip will happily install the new binary files. This is a deliberate design decision: PyPI has "made the deliberate choice to allow wheel files to be added to old releases, though, and advise folks to use --no-binary and build their own wheel files from source if that is a concern."
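One way to notice that this has happened is to compare upload timestamps within a single release: PyPI's JSON API (/pypi/&lt;name&gt;/&lt;version&gt;/json) reports an upload_time for every file. The sketch below is my own, not a pip-tools feature, and the 30-day threshold is arbitrary:

```python
from datetime import datetime, timedelta

def late_files(release_urls, threshold_days=30):
    """Given the "urls" list from PyPI's release JSON, return filenames
    uploaded long after the earliest file in the release -- a hint that
    binaries were added to an old release."""
    times = {u["filename"]: datetime.fromisoformat(u["upload_time"])
             for u in release_urls}
    cutoff = min(times.values()) + timedelta(days=threshold_days)
    return sorted(name for name, t in times.items() if t > cutoff)

# A source release from 2015 followed by a wheel three years later:
urls = [
    {"filename": "pkg-0.2.0.tar.gz",
     "upload_time": "2015-04-03T00:00:00"},
    {"filename": "pkg-0.2.0-cp27-none-win32.whl",
     "upload_time": "2018-08-16T00:00:00"},
]
print(late_files(urls))  # the 2018 wheel is flagged
```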

That creates room for weird situations, like this case where wheel files were uploaded for the hiredis 0.2.0 package on August 16, 2018, three years after the source release on April 3, 2015. The package had been handed over without announcement from Jan-Erik Rediger to a new volunteer maintainer, ifduyue, who uploaded the binary wheels. ifduyue's personal information on GitHub consists of: a new moon emoji; an upside down face emoji; the location "China"; and an image of Lanny from the show Lizzie McGuire with spirals for eyes. In a bug thread opened after ifduyue uploaded the new files for hiredis 0.2.0, Jan-Erik commented that users should "please double-check that the content is valid and matches the repository."

ifduyue's user account on GitHub

The problem is that I can't do that, and most programmers can't do that. We can't just rebuild the wheel ourselves and expect it to match, because builds are not reproducible unless one goes to great lengths like Debian does. So verifying the integrity of an unknown binary wheel requires rebuilding the wheel, comparing a diff, and checking that all discrepancies are benign – a time-consuming and error-prone process even for those with the skills to do it.
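Hashing at least pins down which artifact you are auditing. A minimal sketch of the mechanics (the function name is mine, but pip performs an equivalent streaming SHA-256 comparison internally when hashes are pinned):

```python
import hashlib

def sha256_of(path, chunk_size=64 * 1024):
    """Stream a downloaded archive through SHA-256 in chunks, so even
    large wheels can be hashed without loading them into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Comparing this digest to the one published on PyPI, or pinned in requirements.txt, confirms that the file on disk is the file everyone else sees; it can't tell you whether that file was built from the source you expect.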

So the story of hiredis looks a lot like a new open source developer volunteering to help out on a project and picking off some low-hanging fruit in the bug tracker, but it also looks a lot like an attacker using the perfect technique to distribute malware widely in the Python ecosystem without detection. I don't know which one it is! As a situation it's bad for us as users, and it's not fair to ifduyue if in fact they're a friendly newbie contributing to a project.

(Is the hacking paranoia warranted? I think so! As Dominic Tarr wrote after inadvertently handing over control of an npm package to a bitcoin-stealing operation, "I've shared publish rights with other people before. … open source is driven by sharing! It's great! it worked really well before bitcoin got popular.")

This is a big problem with a lot of dimensions. It would be great if PyPI packages were all fully reproducible and checked to verify correctness. It would be great if PyPI didn't let you change package contents after the fact. It would be great if everyone ran their own private package index and only added packages to it that they had personally built from source that they personally checked, the way big companies do it. But in the meantime, we can bite off a little piece of the problem by adding hashes to our requirements file. Let's see how that works.

Adding hashes to our requirements file

Instead of just pinning packages like we did before, let's try adding hashes to them:

$ echo 'requests==2.0.1' >
$ pip-compile --generate-hashes
# This file is autogenerated by pip-compile
# To update, run:
#    pip-compile --generate-hashes
requests==2.0.1 \
    --hash=sha256:8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c698b \

Now when pip-compile pins our package versions, it also fetches the currently-known hashes for each requirement and adds them to requirements.txt (an example of the crypto technique of "TOFU" or "Trust On First Use"). If someone later comes along and adds new packages, or if the https connection to PyPI is later insecure for whatever reason, pip will refuse to install and will warn us about the problem:

$ pip install -r requirements.txt
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
    requests==2.0.1 from (from -r requirements.txt (line 7)):
        Expected sha256 8cfddb97667c2a9edaf28b506d2479f1b8dc0631cbdcd0ea8c8864def59c698b
        Expected     or f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f61
             Got        f4ebc402e0ea5a87a3d42e300b76c292612d8467024f45f9858a8768f9fb6f6e
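The check pip performs here can be sketched in a few lines: hash the downloaded artifact and compare it against the pinned hashes from the requirements file. The file contents and pins below are made up for illustration, not real PyPI values:

```python
import hashlib

def artifact_matches(data: bytes, pinned_hashes: list) -> bool:
    """Return True if the artifact's sha256 matches any pinned hash."""
    digest = "sha256:" + hashlib.sha256(data).hexdigest()
    return digest in pinned_hashes

wheel_bytes = b"pretend wheel contents"
pins = ["sha256:" + hashlib.sha256(wheel_bytes).hexdigest()]
print(artifact_matches(wheel_bytes, pins))   # untampered artifact
print(artifact_matches(b"tampered!", pins))  # tampered artifact
```

Because the comparison happens locally against hashes committed to your repository, a compromised network path or a swapped-out artifact on the index can't slip past it.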

But there are problems lurking here! If we have packages that are installed from Github, then pip-compile can't hash them and pip won't install them:

$ echo '-e git+' >
$ pip-compile --generate-hashes
# This file is autogenerated by pip-compile
# To update, run:
#    pip-compile --generate-hashes
-e git+
certifi==2019.3.9 \
    --hash=sha256:59b7658e26ca9c7339e00f8f4636cdfe59d34fa37b9b04f6f9e9926b3cece1a5 \
chardet==3.0.4 \
    --hash=sha256:84ab92ed1c4d4f16916e05906b6b75a6c0fb5db821cc65e70cbd64a3e2a5eaae \
idna==2.8 \
    --hash=sha256:c357b3f628cf53ae2c4c05627ecc484553142ca23264e593d327bcde5e9c3407 \
urllib3==1.25.2 \
    --hash=sha256:a53063d8b9210a7bdec15e7b272776b9d42b2fd6816401a0d43006ad2f9902db \
$ pip install -r requirements.txt
Obtaining requests from git+ (from -r requirements.txt (line 7))
ERROR: The editable requirement requests from git+ (from -r requirements.txt (line 7)) cannot be installed when requiring hashes, because there is no single file to hash.

That's a serious limitation, because -e requirements are the only way pip-tools knows to specify installations from version control, which are useful while you wait for new fixes in dependencies to be released. (We mostly use them at LIL for dependencies that we've patched ourselves, after we send fixes upstream but before they are released.)

And if we have packages that rely on dependencies pip-tools considers unsafe to pin, like setuptools, pip will refuse to install those too:

$ echo 'Markdown' >
$ pip-compile --generate-hashes
# This file is autogenerated by pip-compile
# To update, run:
#    pip-compile --generate-hashes
markdown==3.1 \
    --hash=sha256:fc4a6f69a656b8d858d7503bda633f4dd63c2d70cf80abdc6eafa64c4ae8c250 \
$ pip install -r requirements.txt
Collecting markdown==3.1 (from -r requirements.txt (line 7))
  Using cached
Collecting setuptools>=36 (from markdown==3.1->-r requirements.txt (line 7))
ERROR: In --require-hashes mode, all requirements must have their versions pinned with ==. These do not:
    setuptools>=36 from (from markdown==3.1->-r requirements.txt (line 7))

This can be worked around by adding --allow-unsafe, but (a) that sounds unsafe (though it isn't), and (b) it won't pop up until you try to set up a new environment with a low version of setuptools, potentially days later on someone else's machine.

Fixing pip-tools

Those two problems meant that, when I set out to convert our Caselaw Access Project code to use --generate-hashes, I did it wrong a few times in a row, leading to multiple hours spent debugging problems I created for myself and other team members (sorry, Anastasia!). I ended up needing a fancy wrapper script around pip-compile to rewrite our requirements in a form it could understand. I wanted it to be a smoother experience for the next people who try to secure their Python projects.

So I filed a series of pull requests:

Support URLs as packages

Support URLs as packages #807 and Fix --generate-hashes with bare VCS URLs #812 laid the groundwork for fixing --generate-hashes, by teaching pip-tools to do something that had been requested for years: installing packages from archive URLs. Where before, pip-compile could only handle Github requirements like this:

-e git+

It can now handle requirements like this:

And zipped requirements can be hashed, so the resulting requirements.txt comes out looking like this, and is accepted by pip install:

This was a long process, and began with resurrecting a pull request from 2017 that had first been worked on by nim65s. I started by just rebasing the existing work, fixing some tests, and submitting it in the hopes the problem had already been solved. Thanks to great feedback from auvipy, atugushev, and blueyed, I ended up making 14 more commits (and eventually a follow-up pull request) to clean up edge cases and get everything working.

Landing this resulted in closing two other pip-tools pull requests from 2016 and 2017, and feature requests from 2014 and 2018.

Warn when --generate-hashes output is uninstallable

The next step was Fix pip-compile output for unsafe requirements #813 and Warn when --generate-hashes output is uninstallable #814. These two PRs allowed pip-compile --generate-hashes to detect and warn when a file would be uninstallable for hashing reasons. Fortunately pip-compile has all of the information it needs at compile time to know that the file will be uninstallable and to make useful recommendations for what to do about it:

$ pip-compile --generate-hashes
# This file is autogenerated by pip-compile
# To update, run:
#    pip-compile --generate-hashes
# WARNING: pip install will require the following package to be hashed.
# Consider using a hashable URL like
-e git+
click==7.0 \
    --hash=sha256:2335065e6395b9e67ca716de5f7526736bfa6ceead690adf616d925bdc622b13 \
first==2.0.2 \
    --hash=sha256:8d8e46e115ea8ac652c76123c0865e3ff18372aef6f03c22809ceefcea9dec86 \
markdown==3.1 \
    --hash=sha256:fc4a6f69a656b8d858d7503bda633f4dd63c2d70cf80abdc6eafa64c4ae8c250 \
six==1.12.0 \
    --hash=sha256:3350809f0555b11f552448330d0b52d5f24c91a322ea4a15ef22629740f3761c \

# WARNING: The following packages were not pinned, but pip requires them to be
# pinned when the requirements file includes hashes. Consider using the --allow-unsafe flag.
# setuptools==41.0.1        # via markdown

Hopefully, between these two efforts, the next project to try using --generate-hashes will find it a shorter and more straightforward process than I did!

Things left undone

Along the way I discovered a few issues that could be fixed in various projects to help the situation. Here are some pointers:

First, the warning to use --allow-unsafe seems unnecessary – I believe that --allow-unsafe should be the default behavior for pip-compile. I spent some time digging into the reasons that pip-tools considers some packages "unsafe," and as best I can tell it is because it was thought that pinning those packages could potentially break pip itself, and thus break the user's ability to recover from a mistake. This seems to no longer be true, if it ever was. Instead, failing to use --allow-unsafe is unsafe, as it means different environments will end up with different versions of key packages despite installing from identical requirements.txt files. I started some discussion about that on the pip-tools repo and the pip repo.

Second, the warning not to use version control links with --generate-hashes is necessary only because of pip's decision to refuse to install those links alongside hashed requirements. That seems like a bad security tradeoff for several reasons. I filed a bug with pip to open up discussion on the topic.

Third, PyPI and binary wheels. I'm not sure if there's been further discussion on the decision to allow retrospective binary uploads since 2017, but the example of hiredis makes it seem like that has some major downsides and might be worth reconsidering. I haven't yet filed anything for this.

Personal reflections (and, thanks Jazzband!)

I didn't write a ton of code for this in the end, but it was a big step for me personally in working with a mainstream open source project, and I had a lot of fun – learning tools like black and git multi-author commits that we don't use on our own projects at LIL, collaborating with highly responsive and helpful reviewers (thanks, all!), learning the internals of pip-tools, and hopefully putting something out there that will make people more secure.

pip-tools is part of the Jazzband project, which is an interesting attempt to make the Python package ecosystem a little more sustainable by lowering the bar to maintaining popular packages. I had a great experience with the maintainers working on pip-tools in particular, and I'm grateful for the work that's gone into making Jazzband happen in general.

RA21: Technology is not the problem. / Eric Hellman

RA21 vows to "improve access to institutionally-provided information resources". The barriers to access are primarily related to the authorization of such access in the context of licensing agreements. In a perfect world, trust and consensus between licensors and licensing communities would render authorization technology irrelevant. In the real world, technological controls need to build upon good-faith agreements and the consent of community members. Also in the real world, poorly implemented technology erodes that good-faith and consent.

The RA21 draft recommended practice focuses on technology and technology implementations, all the while failing to consider how to build the trust that underpins good-faith and consent. Service providers need to trust that identity providers faithfully facilitate authorized users and that the communities that identity providers serve will adhere to licensing agreements; users of information resources need to trust that their usage data will not be tracked and sold to the highest bidder.

Trust is not created out of thin air and certainly not by software. Technology can provide tools that facilitate trust, but shared values and communication between parties is the raw material of trust. An effective program to improve access must include processes and procedures that develop shared values and promote cooperation.

I recognize that RA21 has chosen to consider only the authentication intercourse as in-scope. But the draft recommendation has identified several areas of "further work". Included in this further work should be areas where community standards and best practices can enhance trust around authentication and authorization. To name two examples:
  1. A set of best practices around "incident response" would in practice work much better than a "guiding principle" of "end-to-end traceability".
  2. A set of best practices around auditing of security and privacy procedures and technology at service providers and identity providers would materially address the privacy and security concerns that the draft recommendation punts over to cited reports and studies.

This is the fifth and last of my comments submitted as part of the NISO standards process. The 102+ comments that have been submitted so far represent a great deal of expertise and real-world experience. My previous comments were about secure communication channels, potential phishing attacks, the incompatibility of the recommended technical approach with privacy-enhancing browser features, and the need for radical inclusiveness. I've posted the comments here so you can easily comment.

The Shocking Truth About RA21: It's Made of People! / Eric Hellman

Useful Utilities logo from 2004
When librarian (and programmer) Chris Zagar wrote a modest URL-rewriting program almost 20 years ago, he expected the little IP authentication utility would be useful to libraries for a few years and would be quickly obsoleted by more sophisticated and powerful access technologies like Shibboleth. He started selling his program to other libraries for a pittance, naming this business "Useful Utilities", fully expecting that it would not disrupt his chosen profession of librarianship.

He was wrong. IP address authentication and EZProxy, now owned and managed by OCLC, are still the access management mainstays for libraries in the age of the internet. IP authentication allows for seamless access to licensed resources on a campus, while EZProxy allows off-campus users to log in just once to get similar access. Meanwhile, Shibboleth, OpenAthens and similar solutions remain feature-rich systems with clunky UIs and little mainstream adoption outside big rich publishers, big rich universities and the UK, even as more distributed identity technologies such as OAuth and OpenID have become ubiquitous thanks to Google, Facebook, Twitter etc.

from My Book House, Vol. I: In the Nursery, p. 197.
So how long will the little engines that could keep chugging? Not long, if the folks at RA21 have their way. Here are some reasons why the EZProxy/IP authentication stack needs replacement:

  1. IP authentication imposes significant administrative burdens on both libraries and publishers. On the library side, EZProxy servers need a configuration file that knows about every publisher supplying the library. It contains details about the publisher's website that the publisher itself is often unaware of! On the publisher side, every customer's IP address range must be accounted for and updated whenever changes occur. Fortunately, this administrative burden scales with the size of the publisher and the library, so small publishers and small institutions can (and do) implement IP authentication with minimal cost. (For example, I wrote a Django module that does it.)
  2. IP Addresses are losing their grounding in physical locations. As IP address space fills up, access at institutions increasingly uses dynamic IP addresses in local, non-public networks. Cloud access points and VPN tunnels are now common. This has caused publishers to blame IP address authentication for unauthorized use of licensed resources, such as that by Sci-Hub. IP address authentication will most likely get leakier and leakier.
  3. Monsters in the middle are dangerous, and the web is becoming less tolerant of them. EZProxy acts as a "Monster in the Middle", intercepting web traffic and inserting content (rewritten links) into the stream. This is what spies and hackers do, and unfortunately the threat environment has become increasingly hostile. In response, publishers that care about user privacy and security have implemented website encryption (HTTPS) so that users can be sure that the content they see is the content they were sent.

    In this environment, EZProxy represents an increasingly attractive target for hackers. A compromised EZProxy server could be a potent attack vector into the systems of every user of a library's resources. We've been lucky that (as far as is known) EZProxy is not widely used as a platform for system compromise, probably because other targets are softer.

    Looking into the future, it's important to note that new web browser APIs, such as service workers, are requiring secure channels. As publishers begin to make use of these APIs, it's likely that EZProxy's rewriting will irreparably break new features.
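Part of why IP authentication has persisted (point 1 above) is that its core mechanism is tiny. A minimal sketch of the publisher-side check with Python's ipaddress module, using RFC 5737 documentation ranges as stand-ins for a real institution's registered blocks:

```python
import ipaddress

# Example address blocks standing in for an institution's licensed ranges
# (these are documentation ranges, not real campus networks).
licensed_ranges = [ipaddress.ip_network(n)
                   for n in ("192.0.2.0/24", "198.51.100.0/24")]

def is_authorized(client_ip: str) -> bool:
    """Publisher-side check: is this request coming from a licensed range?"""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in licensed_ranges)

print(is_authorized("192.0.2.14"))   # on-campus address
print(is_authorized("203.0.113.9"))  # off-campus address
```

The administrative burden comes not from this check but from keeping every customer's range list current, which is exactly the fragility described above.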

So RA21 is an effort to replace IP authentication with something better. Unfortunately, the discussions around RA21 have been muddled because it's being approached as if RA21 is a product design, complete with use cases, technology pilots, and abstract specifications. But really, RA21 isn't a technology, or a product. It's a relationship that's being negotiated.

What does it mean that RA21 is a relationship? At its core, the authentication function is an expression of trust between publishers, libraries and users. Publishers need to trust libraries to "authenticate" the users for whom the content is licensed. Libraries need to trust users that the content won't be used in violation of their licenses. So for example, users are trusted to keep their passwords secret. Publishers also have obligations in the relationship, but the trust expressed by IP authentication flows entirely in one direction.

I believe that IP Authentication and EZProxy have hung around so long because they have accurately represented the bilateral, asymmetric relationships of trust between users, libraries, and publishers. Shibboleth and its kin imperfectly insert faceless "Federations" into this relationship while introducing considerable cost and inconvenience.

What's happening is that publishers are losing trust in libraries' ability to secure IP addresses. This is straining and changing the relationship between libraries and publishers. The erosion of trust is justified, if perhaps ill-informed. RA21 will succeed only if it creates and embodies a new trust relationship between libraries, publishers, and their users. Where RA21 fails, solutions from Google/Twitter/Facebook will succeed. Or, heaven help us, Snapchat.

Whatever RA21 turns out to be, it will add capability to the user authentication environment. IP authentication won't go away quickly - in fact the shortest path to RA21 adoption is to slide it in as a layer on top of EZProxy's IP authentication. But capability can be good or bad for parties in a relationship. An RA21 beholden to publishers alone will inevitably be used for their advantage. For libraries concerned with privacy, the scariest prospect is that publishers could require personal information as a condition for access. Libraries don't trust that publishers won't violate user privacy, nor should they, considering how most of their websites are rife with advertising trackers.

It needn't be that way. RA21 can succeed by aligning its mission with that of libraries and earning their trust. It can start by equalizing representation on its steering committee between libraries and publishers (currently there are 3 libraries, 9 publishers, and 5 other organizations represented; all three of the co-chairs represent STEM publishers.) The current representation of libraries omits large swaths of libraries needing licensed resources. MIT, with its huge IP address block, has little in common with my public library, the local hospital, or our community colleges. RA21 has no representation of Asia, Africa, or South America, even on the so-called "outreach" committee. The infrastructure that RA21 ushers in could exert a great deal of power; it will need to do so wisely for all to benefit.

To learn more...
Thanks to Lisa Hinchliffe and Andromeda Yelton for very helpful background.

Would you let your kids see an RA21 movie?

Update 5/17/2019: A year later, the situation is about the same

Using Solr Tagger for Improving Relevancy / Lucidworks

Full-text documents and user queries usually contain references to things we already recognize, such as names of people, places, colors, brands, and other domain-specific concepts. Many systems overlook this important, explicit, and specific information and end up treating the corpus as just a bag of words. Detecting and extracting these important entities in a semantically richer way improves document classification, faceting, and relevancy.

Solr includes a powerful feature, the Solr Tagger. When the system is fed a list of known entities, the tagger provides details about these things in text. This article steps through using the Solr Tagger to tag cities, colors, and brands, illustrating how it could be adapted to your own search projects.

What is the Solr Tagger?

Entity recognition isn’t new, but prior to Solr 7.4, it involved a lot of Python and hand-wavy stuff. That got the job done, but it wasn’t easy. Then the Solr Tagger was introduced with the release of Solr 7.4, an incredible piece of work by Solr committer extraordinaire David Smiley.

The Solr Tagger takes in a collection, field name, and a string and returns occurrences of tags that occur in a piece of text. For instance, if I ask the tagger to process “I left my heart in San Francisco but I have a New York state of mind” and I’ve defined “San Francisco” and “New York” both as cities, Solr will say so:

        "name":["San Francisco"],
        "type": "city",
        "name":["New York"],

excerpt of sample output

Although Solr’s tagger is a naive tagger and doesn’t actually do any Natural Language Processing (NLP), it can still be used as part of a complete NER or ERD (Entity Recognition and Disambiguation) system, or even for building a question answering or virtual assistant implementation.

How the Solr Tagger Works

The Solr Tagger is a Solr endpoint that uses a specialized collection containing “tags” that are defined text strings. In this case, “tags” are pointers to text ranges (substrings; start and end offsets) within the provided text. These text ranges match _documents_, by way of the specified tagging field. Documents, in the general Solr sense, are simply a collection of fields. In addition to tags, users can define Metadata associated with the tag. For example, metadata for tagged “cities” might include the country code, population, and latitude/longitude.

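A toy sketch helps make the idea concrete: a dictionary of known entities is matched against input text, and each match is reported with start/end offsets and its type metadata. This imitates only the shape of the tagger's behavior (the real tagger matches against Lucene index structures, not naive substring search), and the entity list here is made up:

```python
# Known entities and their type metadata; made up for illustration.
entities = {
    "san francisco": {"type": "city"},
    "new york": {"type": "city"},
}

def tag(text: str) -> list:
    """Report each entity occurrence with offsets, like a tagger response."""
    lowered = text.lower()
    tags = []
    for surface, meta in entities.items():
        start = lowered.find(surface)
        while start != -1:
            tags.append({"startOffset": start,
                         "endOffset": start + len(surface),
                         "matchText": text[start:start + len(surface)],
                         "type": meta["type"]})
            start = lowered.find(surface, start + 1)
    return sorted(tags, key=lambda t: t["startOffset"])

for t in tag("I left my heart in San Francisco but I have a New York state of mind"):
    print(t["matchText"], t["type"])
```

The offsets are what make the response useful downstream: callers can highlight, filter, or strip the tagged spans without re-parsing the text.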
The Solr Reference Guide section for the Solr Tagger has an excellent tutorial that is easy to run on a fresh Solr instance. We encourage you to go through that thorough tutorial first since we’ll be building upon that same tutorial data and configuration below.

For a deep dive into the inner workings of the Solr Tagger, tune into David Smiley’s presentation from a few years ago:

Tagging Entities

Expanding upon the Solr Tagger tutorial with cities, now we’re going to add additional kinds of entities.

Adding a single type field to the documents in the tagger collection gives us the ability to tag other types of things, like colors, brands, people names, and so on. Having this and other type-specific information on tagger documents also allows filtering the set of documents available for tagging. For example, the type, and type-specific fields can facilitate tagging of only cities within a short distance of a specified location.

Using Fusion, we first get the basic geonames Solr Tagger tutorial data placed into Fusion’s blob store and create a datasource to index it (there are a few other sample entities in another datasource that we’ll explore next):

Before starting this datasource we modify the schema and configuration using Fusion’s Solr Config editor, according to the Solr Tagger documentation. While the Geonames cities data contains latitude and longitude, the tutorial’s simple indexing strategy leaves them as separate fields. Solr’s geospatial capabilities work with a special combined “lat/lon” field, which must be provided as a single combined comma-separated string. Solr’s out of the box schema provides a dynamic *_p (for “point”) field. In Fusion, the handy field mapping stage provides a way to set a field with a template (StringTemplate) trick, as shown for location_p here:

Also, the type field is set to literally “city” for every document in this data source, containing only cities.  We’ll bring in other “types” of entities shortly, but let’s first see what the type and location_p fields give us now:


“san francisco” is the only string in that text known to the `entities` collection. There are 42 cities known exactly as “San Francisco” in this collection. (There are many more that have “San Francisco” as part of their names, as shown in the Fusion Query Workbench below.)

Lots of “San Francisco”s, but only 42 with exact match, and thus taggable with the current configuration.

Visualize Tags While Typing

The final touches here are to allow tagging of text while typing it. A simple (VelocityResponseWriter) proof of concept interface was created to quickly wire together a text box, a call to the Solr Tagger endpoint, and display the results. Typing that same phrase, the Solr Tagger response is shown diagnostically, along with a color-coded view of the tagged string. Tagged cities are color-coded yellow:

Assuming we are in San Francisco, California, at the Lucidworks headquarters, and we want to locate nearby sushi restaurants, we provide a useful bit of context: our location. Narrowing the tagging of locations to a geographic distance from a particular point, we can narrow down the available cities used to tag the string. When the location-aware context is provided, by checking the “In SF?” box, the tagger is sent a filter (fq) of:

(type:city AND {!geofilt sfield=location_p}) OR (type:* -type:city)

This geo-filters on cities within 10km (d=10) of the location provided (the pt parameter, specifying a lat/lon point).

Tagging More Than Locations

So far we’ve still only tagged cities. But we’ve configured the Tagger collection to support any “type” of thing. To show how additional types work, a few additional types are brought in similarly (a CSV in the Fusion blobstore, and a basic data source to index it):

brand-1,brand,White Linen

Suppose we now tag the phrase “Blue White Linen sheets” (add “in san francisco” for even more color):

The sub-string “white” is ambiguous here, as it is a color and also part of “white linen” (a brand). There’s a parameter to the tagger (overlaps) that controls how to handle overlapping tagged text. The default is NO_SUB, so that no overlapping tags are returned, only the outermost ones, which causes “white linen” to be tagged as a brand, not as a brand with a color name inside it.
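The NO_SUB rule itself is easy to sketch: drop any candidate tag whose span falls inside another candidate's span, keeping only the outermost tags. The candidate offsets below are hand-built to mirror the “Blue White Linen sheets” example:

```python
def no_sub(candidates: list) -> list:
    """Keep only outermost tags: drop spans contained in another span."""
    return [c for c in candidates
            if not any(o is not c
                       and o["start"] <= c["start"]
                       and c["end"] <= o["end"]
                       for o in candidates)]

# Candidate tags for the text "Blue White Linen sheets"
candidates = [
    {"start": 0, "end": 4,  "match": "Blue",        "type": "color"},
    {"start": 5, "end": 10, "match": "White",       "type": "color"},
    {"start": 5, "end": 16, "match": "White Linen", "type": "brand"},
]
print([c["match"] for c in no_sub(candidates)])  # → ['Blue', 'White Linen']
```

“White” is swallowed by the longer “White Linen” span, which is exactly the brand-over-color behavior described above.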

End Offset – What’s Next?

Well, that’s cool – we’re able to tag substrings in text and associate them with documents.  The real magic materializes when the tagged information and documents are put to use. Using the tagged information, for example, we can extract the entities from a query string and turn them into filters.

For example, if someone searched for “blue shoes,” and blue is tagged as a color, we can find it with the tagger, extract it from the query string, and add it as a filter query. Effectively, we can convert “blue shoes” to q=shoes&fq=color_s:blue with just a few lines of string manipulation code. This changes the structure of the query and makes it more relevant. For one, we’re looking for blue in a color field instead of in the document or product description generally. For two, it can be more efficient for longer queries, which otherwise get carried away spreading terms across all fields regardless of whether the query terms actually make sense for those fields.
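Here is a sketch of that string manipulation. The tag data is hand-built rather than taken from a live tagger response; in practice the offsets and field names would come from the Solr Tagger and your schema (color_s matches the example above):

```python
def rewrite(query: str, tags: list) -> dict:
    """Turn tagged spans into filter queries, leaving the rest as q."""
    fqs = []
    # Remove tagged spans from the query, right to left so offsets stay valid.
    for t in sorted(tags, key=lambda t: t["start"], reverse=True):
        fqs.append(f'{t["field"]}:{query[t["start"]:t["end"]].lower()}')
        query = query[:t["start"]] + query[t["end"]:]
    return {"q": " ".join(query.split()), "fq": fqs}

print(rewrite("blue shoes", [{"start": 0, "end": 4, "field": "color_s"}]))
# → {'q': 'shoes', 'fq': ['color_s:blue']}
```

The remaining q terms search the main text fields while each fq becomes a cached, precise filter, which is where the relevancy and efficiency gains come from.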

Taking tagging to another level, Lucidworks Chief Algorithms Officer Trey Grainger presented Natural Language Search with Knowledge Graphs at the recent Haystack Conference. In this presentation, Trey leveraged the Solr Tagger to tag not only things (“haystack” in his example), but also command/operator words (“near”). The Solr Tagger provides a necessary piece of natural language search.

Natural Language Search with Knowledge Graphs from Trey Grainger

Try It Now on Lucidworks Labs

The Solr Tagger example described above has been packaged into a Lucidworks Lab, making it launchable from

Erik explores search-based capabilities with Lucene, Solr, and Fusion. He co-founded Lucidworks, and co-authored ‘Lucene in Action.’

The post Using Solr Tagger for Improving Relevancy appeared first on Lucidworks.

Deploying with shiv / Brown University Library Digital Technologies Projects

I recently watched a talk called “Containerless Django – Deploying without Docker”, by Peter Baumgartner. Peter lists some benefits of Docker: that it gives you a pipeline for getting code tested and deployed, the container adds some security to the app, state can be isolated in the container, and it lets you run the exact same code in development and production.

Peter also lists some drawbacks to Docker: it’s a lot of code that could slow things down or have bugs, Docker artifacts can be relatively large, and it adds extra abstractions to the system (e.g. filesystem, network). He argues that an ideal deployment would include downloading a binary, creating a configuration file, and running it (like one can do with compiled C or Go programs).

Peter describes a process of deploying Django apps by creating a zipapp using shiv and goodconf, and deploying it with systemd constraints that add to the security. He argues that this process achieves most of the benefits of Docker, but more simply, and that there’s a sweet spot for application size where this type of deploy is a good solution.
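Shiv builds on the same single-file idea as the standard library's zipapp module (shiv adds dependency bundling and caching on top). A minimal sketch of the underlying mechanism, with a made-up app, shows the whole pack-then-run cycle:

```python
import pathlib
import subprocess
import sys
import tempfile
import zipapp

def build_and_run() -> str:
    """Pack a tiny package into a .pyz archive, execute it, return its output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "app"
        src.mkdir()
        (src / "__main__.py").write_text('print("hello from a zipapp")\n')
        target = pathlib.Path(tmp) / "app.pyz"
        zipapp.create_archive(src, target, interpreter="/usr/bin/env python3")
        result = subprocess.run([sys.executable, str(target)],
                                capture_output=True, text=True)
        return result.stdout.strip()

print(build_and_run())
```

The resulting .pyz is the "download a binary, create a config file, run it" artifact Peter describes, with no container runtime in between.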

I decided to try using shiv with our image server Loris. I ran the shiv command “shiv -o loris.pyz .”, and I got the following error:

User “loris” and or group “loris” do(es) not exist.
Please create this user, e.g.:
`useradd -d /var/www/loris -s /sbin/false loris`

The issue is that in the Loris file, the install process not only checks for the loris user as shown in the error, but it also sets up directories on the filesystem (including setting the owner and permission, which requires root permissions). I submitted a PR to remove the filesystem setup from the Python package installation (and put it in a script the user can run), and hopefully in the future it will be easier to package up Loris and deploy it different ways.

Shaping an Applied Research Agenda / HangingTogether

In late April 2019 a group of librarians, technologists, and OCLC staff gathered in Dublin, Ohio to help shape an applied research agenda that charts a path for engagement with data science and a range of computational methods. The applied research agenda will be released this summer as a free resource. This resource will add to a growing conversation and series of efforts focused on ethically grounded library engagement in these spaces.

Participants at Shaping an Applied Research Agenda came from a diverse set of institutions including, but not limited to, Drexel University, Indiana University, Montana State University, the National Library of Medicine, Yale University, and the Universities of Houston, Illinois, Maryland, and Rhode Island. Participants had a range of motivations for taking part in the event and held in common a desire to generously share expertise and learn from their colleagues.

responses to the question, “what brought you to this event?”

Over the course of one-and-a-half days, participants took part in a series of human-centered design exercises facilitated by OCLC staff. These exercises worked to surface challenges, opportunities, and recommendations.

Identifying challenges

Prior to the event, participants received a list of seventy-four high level challenges that applied research agenda development had surfaced to that point. These challenges were the product of nearly twenty hours of engagement with an advisory group and thirty hours of engagement with a growing landscape group. Challenges to date are diverse, exhibiting shifting focus on technical, social, and organizational issues.

A sampling of pre-event, high level challenges are included below:

  • Investigate what skills, competencies, and organizational support are needed for library staff to critically engage with and advance machine learning, AI, and data science in a library context.
  • Develop resources that foster critical engagement with and ethical use of machine learning, artificial intelligence, and more.
  • Develop means to accommodate uncertain and/or probabilistic data generated via machine learning back into library catalogs.

Sharing challenges ahead of time was meant to help event participants begin thinking about what they might contribute to the applied research agenda. From all accounts this seeding activity appears to have been a success. Event participants iteratively developed six discrete challenges for the applied research agenda.

Iterative development

Like the seventy-four challenges that were shared prior to the event, these six challenges span technical, social, and organizational issues:

  • Applying data science and data analytics in library services and operations
  • Identifying high level computational/data science problem spaces that can be used as use cases
  • Developing strategies for engaging bias
  • Re-skilling librarians and preparing new librarians to serve 21st century library needs
  • Establishing communities of practice connected to professional networks (new or existing)
  • Demonstrating the value proposition

The group generated no fewer than 114 prioritized recommendations (impact × difficulty) for making progress on each challenge.

[Image: event participants prioritize recommendations for each challenge]

Moving forward, these generous contributions will be integrated into the applied research agenda, which will be published openly as an OCLC Research Report. While the applied research agenda is slated for release this summer, the work is still very much in process. If you would like to learn more about the agenda or contribute your perspective to it please do reach out to me, Thomas Padilla, Practitioner Researcher in Residence.

The post Shaping an Applied Research Agenda appeared first on Hanging Together.

Open Data Day 2019: it’s a wrap! / Open Knowledge Foundation

On Saturday 2nd March 2019, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. In this blog, we wrap up the ninth edition of Open Data Day with a summary of all that happened across the globe, as well as a look ahead to the future.

This year’s edition saw a total of 325 events registered on the map around the world, with a record 26 events in Nepal and 57 events in Japan! This shows that Open Data Day is well established and that its community keeps growing.

With so much happening, the online spread of news on Open Data Day was also impressive: the hashtag #OpenDataDay was used extensively to share live updates on events via Twitter. We summarised some of the highlights during the day itself: you can check up on what happened on the different continents here:

Open Data Day survey

To prepare well for next year’s edition, we want to learn more about the people behind these events. We know open data looks different from place to place, and the needs to make Open Data Day happen differ as well. That is why we created a brief survey to learn a bit more about these needs and be able to foster and support all of you better – many thanks in advance for your contributions!

Mini-grant scheme

This year, 40 events received funding through the Open Data Day mini-grants scheme, funded by the Open Contracting Program of Hivos, Mapbox, Frictionless Data for Reproducible Research, the Foreign and Commonwealth Office of the United Kingdom, the Latin American Initiative for Open Data (ILDA) and the Open Contracting Partnership. This year, the focus was on four key areas where we think open data can help: Follow Public Money Flows (particularly focusing on Open Contracting), Open Mapping, Open Science and Equal Development.

Following the success of the 2018 edition, we set up a blogging schedule that connected the different events. Mini-grantees were linked to each other based on similarity in topic, location or type of event. This resulted in a series of Open Data Day blogs reporting on activities from different angles, and also in more contact between the different organisers – something we hope will extend beyond the event itself. Below is the list of all blogs of this edition, per topic, for easy future reference:

Follow Public Money Flows

Open Mapping

Open Science

Equal Development

Many thanks to everyone who contributed to making this Open Data Day a success!

Review Of Data Storage In DNA / David Rosenthal

Luis Ceze, Jeff Nivala and Karin Strauss of the University of Washington and Microsoft Research team have published a fascinating review of the history and state-of-the-art in Molecular digital data storage using DNA. The abstract reads:
Molecular data storage is an attractive alternative for dense and durable information storage, which is sorely needed to deal with the growing gap between information production and the ability to store data. DNA is a clear example of effective archival data storage in molecular form. In this Review, we provide an overview of the process, the state of the art in this area and challenges for mainstream adoption. We also survey the field of in vivo molecular memory systems that record and store information within the DNA of living cells, which, together with in vitro DNA data storage, lie at the growing intersection of computer systems and biotechnology.
They include a comprehensive bibliography. Below the fold, some commentary and a few quibbles.

At this stage of the technology's development, having an authoritative review of the field is very useful, especially to push back against the hype that DNA storage always seems to attract. The UW/MSFT team's credentials for writing such a review are unmatched.

Some may have assumed that I was exaggerating the difficulty of getting a DNA storage product into the market when I wrote:
Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.
But one of the things the UW/MSFT team has always been impressively realistic about is the scale of the technological problem they face. They roughly agree with me when they write:
The current overall writing throughput of DNA data storage is likely to be in the order of kilobytes per second. We estimate that a system competitive with mainstream cloud archival storage systems in 10 years will need to offer writing and reading throughput of gigabytes per second. This is a 6 orders-of-magnitude gap for synthesis and approximately 2–3 orders of magnitude for sequencing. On the cost gap, tape storage cost about US$16 per terabyte in 2016 and is going down approximately 10% per year. DNA synthesis costs are generally confidential, but leading industry analyst Robert Carlson estimates the array synthesis cost to be approximately US$0.0001 per base, which amounts to US$800 million per terabyte or 7–8 orders of magnitude higher than tape.
Sadly, just getting the write cost competitive with tape isn't enough to displace tape from the market. DNA storage would need to be significantly cheaper.
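The gaps quoted above are easy to sanity-check with a quick back-of-the-envelope calculation. This sketch uses only the figures from the quoted passage; the 2-bits-per-base assumption is mine, not the review's:

```python
import math

# Figures quoted in the review (approximate)
write_today = 1e3      # bytes/s: DNA write throughput now, ~kilobytes/s
write_target = 1e9     # bytes/s: competitive archival target, ~gigabytes/s
cost_per_base = 1e-4   # US$ per base synthesized (Carlson's estimate)
tape_per_tb = 16.0     # US$ per terabyte of tape (2016)

# Synthesis throughput gap: six orders of magnitude
print(math.log10(write_target / write_today))   # 6.0

# Cost per terabyte at the theoretical 2 bits per base (4 DNA symbols)
bases_per_tb = 8e12 / 2                   # 1 TB = 8e12 bits
dna_per_tb = bases_per_tb * cost_per_base  # ~US$400 million per terabyte
print(round(math.log10(dna_per_tb / tape_per_tb), 1))   # ~7.4

# Real encodings carry closer to 1 bit per base after coding overhead,
# which is where the quoted US$800 million/TB (7-8 orders) comes from.
```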

A review in an academic journal is not the place for the kind of marketing analysis I undertook in DNA's Niche in the Storage Market, so the following three quibbles are just that, quibbles. First, Ceze et al omit the most important cost factor when they write:
Density, durability and energy cost at rest are primary factors for archival storage, which aims to store vast amounts of data for long-term, future use.
One of the fundamental economic problems that I didn't discuss in Archival Media: Not A Good Business is the barrier caused by the epidemic of short-termism in society. As our work on economic models of long-term storage showed, long-lived media with high capital and write costs but low running costs are at a huge disadvantage compared to short-lived media with low capital but higher running costs (including costs of regular migration to successor media). This barrier is a big part of the reason tape is such a small part of the overall storage market. The authors should have included "system capital cost" in the quote above.

Second, the cost of the robotic and fluidic write and read hardware for DNA storage is likely to be quite high. As with Facebook's Blu-Ray cold storage, this cost needs to be amortized across a large amount of stored data. Thus the economics of DNA storage are likely best suited to data-center scale systems, making the marketing problems even more difficult because there are only a few potential customers. They are all much bigger than any storage device vendor, and thus able to squeeze the device vendor's margins, as they have been doing in the markets for hard disk and flash.

Third, Ceze et al write:
using DNA for data storage offers density of up to 10¹⁸ bytes per mm³, approximately six orders of magnitude denser than the densest media available today
In DNA's Niche in the Storage Market I compared the density of the DNA medium with the density of the hard disk medium:
State-of-the-art hard disks store 1.75TB of user data per platter in a 20nm-thick magnetic layer on each side. The platter is 95mm in diameter with a 25mm diameter hole in the middle, so the volume of the layer that actually contains the data is π*(95²-25²)*40*10⁻⁶ ≅ 1mm³. This volume contains 1.4*10¹³ usable bits, so each bit occupies about 7*10⁻¹⁴mm³.
The real comparison is thus 1.25*10⁻¹⁹mm³/bit vs. 7*10⁻¹⁴mm³/bit, or a factor of about 5.6*10⁵. Six orders of magnitude is plausible, but misleading in two ways. First, it compares the "up to" theoretical density of the DNA medium with the actual media density of 2018 hard disks in volume production. Second, as I described in the same post, the density of storage devices is far lower than the density of the raw medium. For hard disk:
The overhead of the packaging surrounding the raw bits is about half a million times bigger than the bits themselves. If this overhead could be eliminated, we could store 7 exabytes in a 3.5" disk form factor.

DNA storage systems will impose a similar overhead. Exactly how big the packaging overhead will be depends on a range of system design issues yet to be addressed, but eventual DNA storage systems are unlikely to be a million times denser than their competition.
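For reference, the density comparison in this quibble can be reproduced directly from the numbers quoted earlier:

```python
import math

# Hard-disk medium, per the figures quoted above
layer_volume = math.pi * (95**2 - 25**2) * 40e-6   # mm^3, ~1.06
bits = 1.75e12 * 8                                  # 1.4e13 usable bits
disk_mm3_per_bit = layer_volume / bits              # ~7.5e-14 mm^3/bit

# DNA medium at the "up to" density of 1e18 bytes per mm^3
dna_mm3_per_bit = 1 / (1e18 * 8)                    # 1.25e-19 mm^3/bit

ratio = disk_mm3_per_bit / dna_mm3_per_bit
print(f"{ratio:.1e}")   # ~6e5; the post's 5.6e5 uses the rounded 7e-14 figure
```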

A growing data community in Paraguay / Open Knowledge Foundation

This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. Girolabs from Paraguay received funding through the mini-grant scheme by Hivos / Open Contracting Partnership, to organise events under the Open Contracting theme. The Spanish version of this blog is available at the Girolabs blog. 

For the sixth year in a row, we organised an Open Data Day event in Paraguay, as part of the Open Gov Week. This initiative was born 9 years ago and has become a world event with more than 250 events in hundreds of cities. In Asunción, this meetup was led by Girolabs and Fundación CIRD on 14 March, at Loffice Las Mercedes.

The meetup was a chance to bring together people that are passionate and interested in the philosophy: to connect with other people and organisations, hear about projects, experiences and exchange ideas.

For the previous edition, we had an unconference format, where the participants designed the agenda for the conversations. This year, the goal was to make more of the community's projects visible. For this reason, we selected nine initiatives (through submissions and invites) to present their work linked to Open Data.

We were surprised by the response of the community: more than 160 people signed up for the event, more than ever before. Despite the rain in Asunción over the previous days (similar to London lol), approximately 70 people attended Open Data Day Asunción 2019.

The methodology for this edition was to run nine different sessions: we set up three spaces in three different locations, so people could attend based on their interests. Loffice Las Mercedes was an ideal place to do this.

Room 1

CEAMSO (Center for Environmental and Social Studies, for its acronym in Spanish), represented by Raúl Quiñonez, shared their Observatory of Political Financing (ONAFIP).

The Paraguayan Government was also there. Irina Vologdina from the office of Electoral Justice led the conversation about their Open Data Portal.

At the same time, Carlos Magnone from Wild-Fi Paraguay shared his experience with Frutos de Asunción (Fruits of Asunción).

Room 2

Afterwards, Juntos por la Educación represented by Maria Fe Dos Santos, Oscar Charotti and Santiago García presented the website of the Citizen Observatory of Education.

Roy Ramirez from Fundación CIRD with his initiative A Quienes Elegimos shared an analysis of data of public funds destined to political parties and the spending of the Electoral Justice department on marketing and advertising.

In parallel, Fernando Maidana of Info Paraguay shared his portal on places and activities in the country.

After an hour with a lot of inspiration, we had a break with mingling and networking. We hosted an open mic where everyone could share and hear about the ideas in the room.

Room 3

In the third and last conversation, Julio Pacielo and Juan Pane of the Centro de Desarrollo Sostenible (CDS) shared some open data on Open Contracting.

Katrina Nichuk of Maps Py talked about OpenStreetMap and the OSM community in Paraguay.

Lastly, Luis Pablo Alonzo of TEDIC presented the Observatorio Anti-Pyrawebs, an initiative that opposes the law to track and store data from IP traffic.

Looking ahead

For another year, Paraguay showcased the many projects that use open data, showing that society recognises the importance of the people and organisations who can transform these data into valuable information for decision-making. For 2020 we want to make this meeting even bigger and have greater impact, proudly carrying the ODD flag in the Mandi’obyte version.

You can see all the event photos here.


Sharing the Fedora Conferences Page / Islandora

For quite some time now, the Fedora community has maintained a public list of conferences that community members want to attend, particularly those that will have Fedora-related content or workshops (including Islandora and Samvera):
The Islandora Coordinating Committee recently took up a discussion about whether the Islandora community could benefit from such a list, and landed on the idea that rather than duplicating what Fedora has been doing so well, we'd all benefit from collaborating and making more use of the same list. 
To that end, the Fedora list has added and edited a few fields to make things more general, including a new column to briefly describe the conference/event so that newcomers can quickly review what's relevant for them, and a column for linking to individual sessions or workshops at an event that are relevant for Fedora/Islandora/Samvera. 
The page is run wiki-style, so anyone who would like to add or edit an event is welcome to go ahead and make changes. All you need is a Duraspace account. You are also welcome to send your list entries/updates to me, and I'll make the changes for you.

User Intent Steers AI-Powered Search / Lucidworks

Consumers expect personalized experiences on retail sites. After all, they get them on social media, in entertainment, on mobile devices, and while they search the web — why settle for less when shopping?

Marketers agree. A recent survey of 700 U.S. marketers conducted by ClickZ and Chatmeter found that fully half of respondents identified “changing customer behavior driven by new technology” as the number one search trend of 2019.

In ecommerce, the shopper’s search experiences can vary widely. Amazon’s search and recommendation engines, powered by artificial intelligence (AI) and machine learning (ML), provide highly relevant and personalized results based on a shopper’s prior purchases and browsing, among other factors, but the search experience on most other retail sites is often disappointing.

The situation is beginning to change, though. Trends in distributed computing and open-source programming are making it possible for organizations without enormous scale and resources to dramatically improve ecommerce search and recommendations for shoppers — when they work with the right partner.

To find out why search is challenging from a technical perspective and how that’s changing, Lucidworks tapped the expertise of Peter Curran, president and cofounder of Cirrus10 and a consultant on the business and technology of ecommerce, in a question-and-answer session.

Curran has been a featured speaker in two recent Lucidworks webinars, Why Your Customers Can’t Find What They Are Looking For and Create the Ideal Customer Experience—Without Giving Up Control. He has overseen the implementation of Lucidworks’ Fusion search application with three clients, but he also works with other solution providers and on projects other than search.

Q: What technologies are making it possible for ecommerce companies to improve their search significantly?

A: With the emergence of ML and distributed computing, companies can do things they couldn’t do before without writing a bunch of rules. For example, Lucidworks Fusion has a distributed computing architecture based on Apache Spark that is integrated with a search platform. Distributed computing allows search engines to handle more scenarios algorithmically, and Fusion can do a lot of complicated ML calculations that would be too computationally expensive without this marriage.

Q: Why has search been such a challenge for e-retailers?

A: E-commerce search is hard to manage because there are a million different ways people can tell you what they want, and it’s hard to predict the words they’re going to use. Then, the language itself is complicated. For example, if you search for shorts, you might get a pair of short pants, but you might also see a short-sleeve shirt, a short skirt, a “skort” or any number of other things. This is happening because the words short and shorts have a common root, and the search engine can’t distinguish between short as an adjective and shorts as a noun. To make matters worse, what we normally call a pair of shorts is technically called just a “short.”
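The short/shorts collision Curran describes is classic stemming behaviour. A toy illustration — a naive suffix-stripping stemmer standing in for whatever a real engine uses, not any particular retail platform:

```python
def naive_stem(word):
    """Toy stemmer: strip a trailing plural 's' -- roughly what a fuller
    stemmer (e.g. Porter's) does for this word family."""
    word = word.lower()
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word

# "shorts", "short skirt" and "short-sleeve shirt" all collapse to the
# stem "short", so a query for one matches products from all of them.
for query in ["shorts", "short skirt", "short-sleeve shirt"]:
    print([naive_stem(t) for t in query.replace("-", " ").split()])
```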

To deal with these issues, retailers set up rules, but manual rules might cause different problems. An example is a search for “leopard print.” In the results from a website I showed in one of the Lucidworks webinars, I get prints, as in wall art, with various animal images in my search results, but I don’t get any garments with a leopard pattern, which is what I wanted. This started when someone — probably someone focused on home décor — decided that the word leopard should be equivalent to the word animal but didn’t think about “leopard print” fabrics.

Tuning searches manually not only takes a long time to do in the first place, but once you’ve put in a manual curation rule, you have to maintain it as time goes on. Rules don’t last forever because products and searches change. You also have limitations on the number of rules you can put in. A manual process is painful and just not sustainable.

Q: AI and machine learning can process much higher volumes of data much faster than humans, but why don’t they make the same kinds of mistakes?

A: Machine learning is about pattern recognition and drawing conclusions based on patterns. Lucidworks uses what’s called signal ingestion or signal processing. Signals are all the user interactions — queries, clicks, add-to-cart, checkout, etc. — as well as all the profile and/or third-party data you have about a session. A signal is a broad concept, but there is a clear set of best practices.

The system isn’t “understanding” the meaning of the words; it is learning through search behaviors which words and combinations of words are important to determine the searcher’s intent.

Let’s say people are searching a site like Best Buy or Walmart for “dysentery.” The system doesn’t know that it makes no sense for a person to search for dysentery, which is an intestinal infection, on a normal big-box-store ecommerce site. But by going through enough sessions, the ML begins to see that people who search dysentery are buying Dyson vacuums. No matter how informed they are about the product or the customer, a human merchandiser would never think to create a rule like “dysentery=Dyson.” The pairing is not logical to human intelligence, but it works for real-life searches because the iPhone’s internal spelling correction changes the word Dyson to dysentery.
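A pairing like dysentery → Dyson is learnable from nothing but co-occurrence counts over session signals. A minimal sketch with invented data (real systems such as Fusion use far richer signal models than this):

```python
from collections import Counter, defaultdict

# Invented session signals: (query issued, product the session bought)
signals = [
    ("dysentery", "Dyson V8 vacuum"),
    ("dysentery", "Dyson V8 vacuum"),
    ("dysentery", "Dyson Ball vacuum"),
    ("leopard print", "Leopard-print dress"),
]

# Aggregate purchases per query -- pure behaviour, no semantics.
purchases = defaultdict(Counter)
for query, product in signals:
    purchases[query][product] += 1

# The system "learns" dysentery -> Dyson without understanding either word.
top_product, count = purchases["dysentery"].most_common(1)[0]
print(top_product, count)   # Dyson V8 vacuum 2
```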


Q: Retailers have inventory they need to sell and promotional commitments to meet. They need to match products to the right buyer, not necessarily match the buyer to the product. How can AI help with that?

A: Those types of business needs are just another feature to engineer, and it can be done either at the ML model level or it can be done as a human override. When you build an algorithmic search experience, you simply want to reduce the number of decisions that humans are making.

With Fusion, the system sometimes suggests new rules based on the data, and the user can decide whether to accept it. Fusion also allows merchants to come up with rules so that they can achieve business goals, but they have to test them and experiment with actual searches. As a fellow speaker in one webinar put it, “Experimentation is the best way to make a data-driven decision.”

Fusion is powerful because it gives a retailer just-add-water-type solutions to tune common ecommerce searches, and those have been tested over millions of searches with numerous retailers. Fusion 4.2 provides a business rules manager that gives retailers the ability to build their own models and plug them into the Lucidworks framework to customize it.

Q: We’ve just touched the surface of how AI and ML can transform search for consumers and retailers. What are your final thoughts?

A: I spent the beginning of my career in web content management, where we were always trying to make companies more efficient. I like working in ecommerce search because the work tends to make companies more money. The reason is that search shows intentionality. People who search generally want to buy, and the old saying is absolutely true: If they can’t find it, they can’t buy it.

So, if your search is hiding relevant results, or if your relevant results are lost in a sea of irrelevant results, or if your search UX is messed up, then an investment in addressing these things will almost always pay off. Some of the companies that have implemented Fusion have been able to return their investment in a matter of days. It can be quite an enormous and quick return.

Marie Griffin is a writer and editor with extensive experience covering retail, technology, media and other B2B topics. She has held multiple editorial leadership positions, including editor/associate publisher of Drug Store News, and has been freelancing for web/print publications and marketers since 2001.

The post User Intent Steers AI-Powered Search appeared first on Lucidworks.

Naturalist Datathon: Bogotá (Datatón Naturalista) / Open Knowledge Foundation

This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. The Karisma Foundation from Colombia received funding through the mini-grant scheme by the Frictionless Data for Reproducible Research project to organise an event under the Open Science theme. This report was written by Karen Soacha: her biography is included at the bottom of this post. The original Spanish version is published at

Open data, naturalists and pizza were part of the Open Data Day celebration in Bogotá

Why and how to improve the quality of open data on biodiversity available in citizen science platforms were the questions that brought together more than 40 naturalists at the event organized by the Karisma Foundation, the Humboldt Institute and the Biodiversity Information System of Colombia (SiB Colombia) on 2nd of March 2019 as part of the global celebration of Open Data Day.

Expert naturalists, amateurs and those interested in citizen science came together to review the open data generated for Bogotá through the City Nature Challenge 2018. The City Nature Challenge is an annual event that invites city-dwellers across the world to hit the streets for two days to capture and catalogue nature which they might be too occupied to notice otherwise.  Using their smartphones, hundreds of people generate thousands of observations of plants, birds, insects and more, which they share through citizen science platforms such as iNaturalist. Generating the data is just the beginning of the process: improving its quality, so that they have the greatest possibility of being used, is the next step.

During the Naturalist Datathon we shared guides to facilitate the identification of species, tips to review observations, as well as good practices for users and reviewers to improve the quality of the data. After a morning of collaborative work, the groups shared their learning and engaged in a discussion about the importance of data quality and its potential use in environmental monitoring especially in the context of environmental issues in Bogotá.

[Photos: 1. Introduction and guides for the activity; 2. Roles of the participants; 4. Organization of work groups; 5. Collaborative review of observations; 6. Discussion; 7. Naturalist Kit for all the participants]

The Datatón Naturalista left us with a set of outputs, specific lessons learned and a set of good practices for the participants, the organizers and the community of naturalists and open data. To begin with, this activity contributed to increasing the community of experts who actively participate in the “curation” of observations published in Naturalista Colombia, which is necessary in order to improve the quality of the data. At the end of the datathon, the quality of the data the participants worked on was vastly improved — so much so that the data will be integrated into the SiB Colombia (the official national continental biodiversity portal). As a result of this datathon, more Colombians were encouraged to participate as urban/rural naturalists.

Participants also shared good practices for taking photographs and collecting the data needed for observations to be useful for multiple purposes, and they mentioned the importance of use licenses (Creative Commons) in facilitating the reuse and sharing of the information. They also gave recommendations for the 2019 City Nature Challenge (CNC), such as the need for guides in easy-to-consume formats (such as short videos) that ought to be shared in advance of the CNC. These guides should go beyond basic information on data capture and include good practices, as well as ethical recommendations for the creators, curators, and users of information. One of the challenges the participants highlighted was the need to recognize and integrate citizen science data as a source of information for the environmental management of the city.

For the organizers, the datatón turned out to be an effective way to create conversation, connections and reflection on the how and why of open data, while also strengthening capacities and contributing quality open data.

Finally, this event showed that more and more citizens are becoming involved in citizen science, actively contributing to our knowledge of biodiversity, and are working collaboratively to further understand their environment and to generate information that is useful for decision-making. Therefore, it is necessary to continue promoting spaces that allow community-building and facilitate networking around open science and citizen science.  For that reason, we in Bogotá are looking forward to the next Open Data Day.



Karen Soacha is interested in the connection between knowledge management, citizen science, governance and nature. She’s been working with environmental organizations for over 10 years, in the management of data and information networks, especially with open data on biodiversity. She is convinced that science is a way to build dialogue within the society. She is also a teacher, an amateur dancer, and an apprentice naturalist.


Records & Representations / Ed Summers

One of the interesting topics that came up for me at the recent Congrés d’Arxivística de Catalunya was the distinction between records and representations in archival work. It was most clearly expressed by Greg Rolan who said that in his recent work on Participatory Information Governance (Evans, McKemmish, & Rolan, 2018) the notion of the record of an event is displaced by a representation of that event.

This renaming does a few interesting things for archival theory and practice. Most importantly it allows for multiple perspectives of an event to coexist, and for them to contradict, support or supplement what would have otherwise been a singular narrative. This is particularly important in situations where there is a power imbalance between the documenter and the documented, such as the case they explore: documentation of out-of-home care in Australia where long-term living arrangements are made for young people who “cannot live in their family home because of concerns regarding physical, sexual and emotional abuse or neglect.”

Allowing for multiple representations of an event then allows a plurality of stakeholders to become part of, and participate in, the information system. Record-centric models tend to privilege organizational accounts of events, where what counts as “the record” is a point of contention, or even political struggle.

Using this approach, a record may represent many activities and any activity may be represented by many records. Such conceptualisation is less concerned with the nature of records as physical artefacts, than with facilitating the expression of human activity and involvement. Furthermore, this approach also allows for other types of documentality such as personal memory and performance (Rolan, 2017).

This is doubly interesting to me because it folds in with my own research interest in the architecture of the web, and how it is used as a system of documentation, or dare I say “archive” (Summers & Salo, 2013). The idea of a Representation was introduced into web architecture to resolve some of the tension around the role of Documents in the web (Fielding, 2000), their behavior over time, and the preferences of clients (web browsers). A web server makes Representations of Resources available, and depending on who you are, how you ask, and when you ask, you may receive a different Representation of that Resource. The Document is a convenient illusion…not a material object.
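Fielding's one-Resource-many-Representations idea is visible in ordinary HTTP content negotiation. A minimal sketch (the resource data and the choice of media types are invented for illustration):

```python
import json

# One Resource, identified once; its data is not any single Document.
RESOURCE = {"title": "Annual report", "year": 2019}

def representation(accept):
    """Many Representations: what the client receives depends on how it
    asks (here, its Accept preference), not on a fixed stored document."""
    if accept == "application/json":
        return accept, json.dumps(RESOURCE)
    # Default representation for everyone else
    return "text/plain", f"{RESOURCE['title']} ({RESOURCE['year']})"

print(representation("application/json")[1])
print(representation("text/plain")[1])   # Annual report (2019)
```

The same pattern carries over to time: a server may hand back different representations of the same resource at different moments, which is exactly the behaviour web archiving has to grapple with.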

In the same way that Fielding’s notion of Representational State Transfer (REST) decenters the web from being a system for document transfer, Rolan et al. are asking us to consider archives as less about collections of records, and more as processual systems where events, their representations, and their stakeholders exist in time and in relation to each other. I just got a brief look into this “meta-model for recordkeeping metadata” at the conference, but see that he develops the idea in Rolan (2017) which I will be taking a closer look at.


Evans, J., McKemmish, S., & Rolan, G. (2018). Participatory information governance: Transforming recordkeeping for childhood out-of-home care. Records Management Journal, 29(1/2).

Fielding, R. (2000). Architectural styles and the design of network-based software architectures (PhD thesis). University of California, Irvine. Retrieved from

Rolan, G. (2017). Towards interoperable recordkeeping systems: A meta-model for recordkeeping metadata. Records Management Journal, 27(2), 125–148.

Summers, E., & Salo, D. (2013). Linking things on the web: A pragmatic examination of linked data for libraries, archives and museums (No. arXiv:1302.4591). arXiv. Retrieved from

Library Metadata Evolution: The Final Mile / Richard Wallis

I am honoured to have been invited to speak at the CILIP Conference 2019, in Manchester UK, July 3rd.  My session, in the RDA, ISNIs and Linked Data track, shares the title of this post: Library Metadata Evolution: The Final Mile?

Why this title, and why the question mark?

I have been rattling around the library systems industry/sector for nigh on thirty years, with metadata always being somewhere near the core of my thoughts. From my time as a programmer wrestling with the (new to me) ‘so-called’ standard MARC; reacting to meet the challenges of the burgeoning web; through the elephantine gestation of RDA; several experiments with linked library data; and the false starts of BIBFRAME, library metadata has consistently failed to scale to the real world beyond the library walls and firewall.

When it arrived on the scene, I thought we might have reached the point where library metadata could finally blossom; adding value outside of library systems to help library-curated resources become first-class citizens, and hence results, in the global web we all inhabit.  But as yet it has not happened.

Why it has not yet happened is something that has been concerning me of late.  The answer, I believe, is that we do not yet have a simple, mostly automatic process to take data from MARC, process it to identify entities (Work, Instance, Person, Organisation, etc.) and deliver it as Linked Data (BIBFRAME), supplemented with the open vocabulary used on the web (

Many of the elements of this process are, or nearly are, in place.   The Library of Congress site provides conversion scripts from MARCXML to BIBFRAME, the group are starting to develop conversion processes to take that BIBFRAME output and add

So what’s missing?   A key missing piece is entity reconciliation.

The BIBFRAME conversion scripts identify the Work for the record instance being processed.  However, they do not recognise Works they have previously processed. What is needed is a repeatable, reliable, mostly automatic process for reconciling the many unnecessarily created duplicate Work descriptions in the current processes.  Reconciliation is also important for other entity types, but with the aid of VIAF, ISNI, LC’s name and other identifier authorities, most of the groundwork for those is well established.
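To illustrate the shape of the problem (and nothing more), here is a deliberately naive reconciliation sketch: collapsing duplicate Work descriptions under a normalised title-plus-creator key. Real reconciliation would need fuzzier matching and authority lookups; all names here are invented:

```python
# Naive entity reconciliation sketch: not a production process, just the
# shape of the problem of duplicate Work descriptions.
import re

def work_key(title: str, creator: str) -> str:
    # Normalise: lowercase, replace punctuation runs with spaces, trim.
    norm = lambda s: re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()
    return f"{norm(title)}|{norm(creator)}"

# Two MARC-derived records describing the same Work, differently:
records = [
    {"title": "Moby-Dick", "creator": "Melville, Herman"},
    {"title": "Moby Dick.", "creator": "MELVILLE, HERMAN"},
]

works: dict = {}
for rec in records:
    works.setdefault(work_key(rec["title"], rec["creator"]), []).append(rec)

print(len(works))  # the two descriptions reconcile to a single Work
```

In practice the hard cases are the ones a key like this misses (variant titles, translations, uniform titles), which is why a repeatable, mostly automatic process is the missing piece rather than a solved one.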

There are a few other steps required to achieve the goal of being able to deliver search-engine-targeted structured library metadata to the web at scale, such as crowbarring the output into our ageing user interfaces, but none I believe more important than giving the search engines data about reconciled entities.

Although my conclusions here may seem a little pessimistic, I am convinced we are on that last mile which will take us to a place and time where library curated metadata is a first class component of search engine knowledge graphs, and is as likely to lead a user to a library-enabled fulfilment as to a commercial one.


Storing Data In Oligopeptides / David Rosenthal

Bryan Cafferty et al. have published a paper entitled Storage of Information Using Small Organic Molecules. There's a press release from Harvard's Wyss Institute at Storage Beyond the Cloud. Below the fold, some commentary on the differences and similarities between this technique and using DNA to store data.

The paper's abstract reads:
Although information is ubiquitous, and its technology arguably among the highest that humankind has produced, its very ubiquity has posed new types of problems. Three that involve storage of information (rather than computation) include its usage of energy, the robustness of stored information over long times, and its ability to resist corruption through tampering. The difficulty in solving these problems using present methods has stimulated interest in the possibilities available through fundamentally different strategies, including storage of information in molecules. Here we show that storage of information in mixtures of readily available, stable, low-molecular-weight molecules offers new approaches to this problem. This procedure uses a common, small set of molecules (here, 32 oligopeptides) to write binary information. It minimizes the time and difficulty of synthesis of new molecules. It also circumvents the challenges of encoding and reading messages in linear macromolecules. We have encoded, written, stored, and read a total of approximately 400 kilobits (both text and images), coded as mixtures of molecules, with greater than 99% recovery of information, written at an average rate of 8 bits/s, and read at a rate of 20 bits/s. This demonstration indicates that organic and analytical chemistry offer many new strategies and capabilities to problems in long-term, zero-energy, robust information storage.
The press release explains the basic idea:
Oligopeptides also vary in mass, depending on their number and type of amino acids. Mixed together, they are distinguishable from one another, like letters in alphabet soup.

Making words from the letters is a bit complicated: In a microwell—like a miniature version of a whack-a-mole but with 384 mole holes—each well contains oligopeptides with varying masses. Just as ink is absorbed on a page, the oligopeptide mixtures are then assembled on a metal surface where they are stored. If the team wants to read back what they “wrote,” they take a look at one of the wells through a mass spectrometer, which sorts the molecules by mass. This tells them which oligopeptides are present or absent: Their mass gives them away.

Then, to translate the jumble of molecules into letters and words, they borrowed the binary code. An “M,” for example, uses four of eight possible oligopeptides, each with a different mass. The four floating in the well receive a “1,” while the missing four receive a “0.” The molecular-binary code points to a corresponding letter or, if the information is an image, a corresponding pixel.

With this method, a mixture of eight oligopeptides can store one byte of information; 32 can store four bytes; and more could store even more.
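The press release's scheme is easy to sketch: with eight distinguishable oligopeptides per well, the presence (1) or absence (0) of each encodes one bit, so a well holds one byte. The peptide labels below are stand-ins; in reality each is identified by its mass:

```python
# Sketch of the encoding as described in the press release: one well,
# eight distinguishable oligopeptides, presence/absence = one bit each.
PEPTIDES = [f"peptide-{i}" for i in range(8)]  # 8 peptides = 8 bits = 1 byte

def write_byte(value: int) -> set:
    # "Write" a byte: include peptide i in the well iff bit i is set.
    return {PEPTIDES[i] for i in range(8) if value & (1 << i)}

def read_byte(mixture: set) -> int:
    # "Read" a byte: mass spectrometry reveals which peptides are present.
    return sum(1 << i for i in range(8) if PEPTIDES[i] in mixture)

well = write_byte(ord("M"))             # 'M' = 0b01001101: four bits set
print(len(well), chr(read_byte(well)))  # → 4 M
```

This matches the press release's example: an "M" is written with four of the eight possible oligopeptides present and four absent.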
The idea of encoding data using a "library" of previously synthesized chemical units is similar to the one Catalog uses with fragments of DNA. Catalog claims to be able to write data to DNA at 1.2 Mb/s, which makes the press release's claim that the Wyss technique's:
“writing” speed far outpaces writing with synthetic DNA
misleading to say the least. Also misleading is this claim from the press release:
DNA synthesis requires skilled and often repetitive labor. If each message needs to be designed from scratch, macromolecule storage could become long and expensive work.
The team from Microsoft Research and U.W. have a paper and a video describing a fully-automated write-store-read pipeline for DNA. As I understand it, Catalog's approach is also fully automated.

Both the Wyss Institute and Catalog approaches can readily expand their "libraries" to increase the raw bit density of the medium, but again this is misleading. As I explained in detail in DNA's Niche in the Storage Market, the data density of an actual storage device is controlled not by the size of a bit on the medium, but by the infrastructure that has to surround the medium in order to write, preserve and read it.

Like all publications about chemical storage, the Wyss technique's prospects are hyped by referring to the "data tsunami" or the "data apocalypse", with the demand for data storage being insatiable. This merely demonstrates that the writers don't understand the storage business, because they uncritically accept the bogus IDC numbers. The idea that data will only be stored, especially for the long term, if the value to be extracted from it justifies the expense seems not to occur to them, nor that the price/performance of storage devices, rather than the density of storage media, is critical to their market penetration.

The details of the competing chemical storage technologies are actually not very relevant to their commercial prospects. All the approaches are restricted to archival data storage, which as I explained in Archival Media: Not A Good Business, is a niche market with low margins. There is some demand to lock data away for the long term in low-maintenance, long-access-latency media, but it is a long way from insatiable.

Open Data Day: Open Science events in the Democratic Republic of the Congo and Costa Rica / Open Knowledge Foundation

This report is part of the event report series on International Open Data Day 2019. On Saturday 2nd March, groups from around the world organised over 300 events to celebrate, promote and spread the use of open data. AfricaTech from the Democratic Republic of the Congo and the Society for Open Science and Biodiversity Conservation (SCiAC) from Costa Rica received funding through the mini-grant scheme by the Frictionless Data for Reproducible Research project and by the Latin American Initiative for Open Data (ILDA), to organise events under the Open Science theme. This report was written by Stella Agama Mbiyi and Diego Gómez Hoyos.


On March 2, 2019, we organized the Open Data Day 2019 event at the UCC in Kinshasa. Our event was focused on Open Science in the Democratic Republic of the Congo. We had about 50 participants, especially students and some researchers, who participated actively in the different sessions and discussions on Open Science in the Democratic Republic of the Congo and its implications for sustainable development.

Five speakers, four of them women, presented various concepts related to Open Science to participants. The conference started at 8:00 and ended at 17:30. Several participants made positive comments about the event, such as Florent Nday, a biology student at the University of Kinshasa, who said: “This is my first time to hear about Open Science, it’s a huge opportunity for us students from developing countries. Because we will have access to a wide range of knowledge easily.”

The social science researcher at Kinshasa’s Institute of Social Science, Mr. Jiress Mbumba, commented: “It’s time for us Congolese researchers to promote Open Science in the Democratic Republic of the Congo, we have an interest to share our researches, and findings with everyone to spur the development of science.” The event ended with a dinner offered to all participants.

Society for Open Science and Biodiversity Conservation (SCiAC)

The training workshop on Reproducibility in Science as a link between Open Data, Open Science and Open Education, was organized by SCiAC (Society for Open Science and Biodiversity Conservation) in collaboration with the Biology Department of the University of Costa Rica, ProCAT International, Abriendo Datos Costa Rica and CR Wildlife Foundation.

The workshop included general presentations on open ecosystems and data management plans during research projects, as well as training in the use of GitHub and R language for data release and data analysis code in a context of Open Science practices. The four speakers in the workshop were Diego Gómez Hoyos and Rocío Seisdedos from SCiAC, Susana Soto from Abriendo Datos Costa Rica and Ariel Mora from the University of Costa Rica. Fifteen people (66% women) from different provinces of Costa Rica (Puntarenas, Guanacaste, Heredia and San José) participated in the activity.

In Central America, and especially in Costa Rica, considerable advances have been made on open data and open government issues. Our workshop has been one of the first efforts to offer researchers tools for open science and open education practices. This workshop was inspired by the Open Science MOOC project and the “Panama Declaration for Open Science”, led by Karisma Foundation and in which SCiAC took part.

From this experience we see great potential and strong interest among researchers in learning about the tools with which they can share the elements of their research processes. We also recognize that open science practices could have a significant impact on the teaching of scientific practice. Finally, we identify the need to continue these training activities as a way to democratize access to, and generation of, knowledge in order to confront the environmental, social and economic problems faced by our society.

The CAP Roadshow / Harvard Library Innovation Lab

In 2019 we embarked on the CAP Roadshow. This year, we shared the Caselaw Access Project at conferences and workshops with new friends and colleagues.

Between February and May 2019, we made the following stops at conferences and workshops:

Next stop on the road will be UNT Open Access Symposium from May 17 - 18 at University of North Texas College of Law. See you there!

On the road we were able to connect the Caselaw Access Project with new people. We shared where the data comes from, what kinds of questions we can ask when we have machine-readable data to work with, and all the new ways that you’re building and learning with Caselaw Access Project data to see the landscape of U.S. legal history in new ways.

The CAP Roadshow doesn’t stop here! Share Caselaw Access Project data with a colleague to keep the party going.


RA21 doesn't address the yet-another-WAYF problem. Radical inclusiveness would. / Eric Hellman

The fundamental problem with standards is captured by XKCD 927.
Single sign-on systems have the same problem. The only way for a single sign-on system to deliver a seamless user experience is to be backed by a federated identity system that encompasses all use cases. For RA-21 to be the single button that works for everyone, it must be radically inclusive. It must accommodate a wide variety of communities and use cases.

Unfortunately, the draft recommended practice betrays no self-awareness about this problem. Mostly, it assumes that there will be a single "access through your institution" button. While it is certainly true that end-users have more success when presented with a primary access method, the draft does not address how RA-21 might reach that state.

Articulating a radical inclusiveness principle would put the goal of single-button access within reach. Radical inclusiveness means bringing IP-based authentication, anonymous access, and access for walk-ins into the RA-21 tent. Meanwhile the usability and adoption of SAML-based systems would be improved; service providers who require "end-to-end traceability" could achieve this in the context of their customer agreements; it needn't be a requirement for the system as a whole.

Radical inclusiveness would also broaden the user base and thus financial support for the system as a whole. We can't expect a 100,000 student university library in China to have the same requirements or capabilities as a small hospital in New Jersey or a multinational pharmaceutical company in Switzerland, even though all three might need access to the same research article.

This is my fourth comment on the RA-21 draft "Recommended Practices for Improved Access to Institutionally-Provided Information Resources". The official comment period ends Friday. This comment, 57 others, and the add-comment form can be read here. My comments so far are about secure communication channels, potential phishing attacks, and the incompatibility of the recommended technical approach with privacy-enhancing browser features. I'm posting the comments here so you can easily comment. I'll have one more comment, and then a general summary.

What’s New Season 2 Wrap-up / Dan Cohen

With the end of the academic year at Northeastern University, the library wraps up our What’s New podcast, an interview series with researchers who help us understand, in plainspoken ways, some of the latest discoveries and ideas about our world. This year’s slate of podcasts, like last year’s, was extraordinarily diverse, ranging from the threat of autonomous killer robots to the wonders of tactile writing systems like Braille, and from the impact of streaming music on the recording industry to the disruption and meaning of Brexit. I’ve enjoyed producing and being the interviewer on these podcasts, and since I like to do my homework in addition to conversing with the guests live, I’ve learned an enormous amount from What’s New.

I hope you have too, whether you’re a subscriber to the podcast or just an occasional listener, and I would love your feedback about what we can do better and topics you would like to hear us cover in the future. One surprising and rewarding thing we’ve noticed about the podcast is how new subscribers are going back and listening to the show from Episode 1. Podcasts do seem to encourage binging, and the fact that we keep our podcasts to roughly 30 minutes means that you can easily go through both Seasons 1 and 2 during a relatively short timespan while commuting, walking your dog, or relaxing this summer.

The overall audience for What’s New has also grown considerably over the last year. In the last 12 months we’ve had about 150,000 streams, and each episode now receives 5,000-10,000 listeners. These are not chart-topping numbers, but for a fairly serious educational podcast (with, I hope, intermittent humor) it’s good to find a decent-sized niche that continues to grow.

If you haven’t had a chance to listen yet, you can subscribe to What’s New on Apple Podcasts, Google Play, Stitcher, Overcast, or wherever you get your podcasts, or simply stream episodes from the What’s New website. Word of mouth has been the primary way new listeners have heard about the podcast, so if you like what we’re doing, please tell others or leave a review on iTunes, as that remains the starting point for most podcast listeners.

And as a jumping off point for new listeners or those who may have missed a few shows during the school year, here’s a summary of this year’s episodes:

Episode 17: Remaking the News – how consolidation in the news industry and the rise of the internet has changed professional journalism, with Dan Kennedy

Episode 18: Making Artificial Intelligence Fairer – exploring the biases endemic to AI, which come from its creators, with Tina Eliassi-Rad

Episode 19: The Shifting Landscape of Music – how the music industry moved from vinyl records to cassettes, CDs, downloads, and now streaming, and what this evolution has meant for musicians, with David Herlihy

Episode 20: A New Way to Scan the Human Body – pioneering the use of nanosensors within the body and its potential applications, with Heather Clark

Episode 21: Election Day Special: Michael Dukakis – on 2018’s Election Day, the three-term governor and presidential candidate spoke candidly about the state of politics

Episode 22: Bridging the Academic-Public Divide Through Podcasts – a recording of yours truly giving a keynote at the Sound Education conference at Harvard, which brought together hundreds of educational and academic podcasters and podcast listeners

Episode 23: The Regeneration of Body Parts – new research and techniques for stimulating the growth of limbs, eyes, and organs, with Anastasiya Yandulskaya, Brian Ruliffson, and Alex Lovely

Episode 24: The Urban Commons – how 311 systems, which allow citizens to provide feedback to municipalities, have changed our knowledge of cities and the ways residents and governments interact, with Dan O’Brien

Episode 25: Touch This Page – the history and future of tactile writing systems, and what they tell us about the act of reading, with Sari Altschuler

Episode 26: Seeking Justice for Hidden Deaths – between 1930 and 1970 there were thousands of racially motivated homicides in the U.S., and one project is attempting to document them all, with Margaret Burnham

Episode 27: Tracing the Spread of Fake News – looking carefully at the impact of untrustworthy online sources in the election of 2016, with David Lazer

Episode 28: How College Students Get the News – the surprising results of a large study of the news consumption habits of college students, with Alison Head and John Wihbey

Episode 29: The Web at 30 – celebrating the 30th anniversary of the founding of the World Wide Web with a discussion of how it has reshaped our world for better and worse, with Kyle Courtney

Episode 30: Controlling Killer Robots – how major advances in robotics and artificial intelligence have led to the dawn of deadly, independent machines, and how an international coalition is trying to prevent them from taking over warfare, with Denise Garcia

Episode 31: European Disunion – how Europe has regularly escaped the fate of dissolution, and what Brexit means in this longer history, with Mai’a Cross

Thanks for tuning in!

Learn How the Fortune 500 Combine Search and AI to Build a Better Online Shopping Experience / Lucidworks

On Wednesday May 15th, Lucidworks is hosting its first Activate Now event in New York City. Register now to hear from leading digital commerce companies on their strategies for building a hyper-personalized shopping experience for customers and the impact artificial intelligence is having on the role of merchandising. The event will include talks and panels from retail industry leaders and networking opportunities for attendees.

Activate Now is Lucidworks’ first one-day digital commerce conference, an off-shoot of the annual Activate search and AI conference. This event is a localized opportunity to learn from and network with digital commerce leaders. Attendees will hear how brands create personalized shopping experiences using search, AI, and data analytics to understand customer intent, deliver relevant recommendations, and increase add-to-cart and conversion rates.

“Our retail customers are committed to delivering the same level of expert suggestions and personal customer experience online that they do in our store to drive revenue and loyalty,” said Will Hayes, Lucidworks CEO. “How we get them there continues to evolve and grow. Relying on machine learning to augment our merchandisers’ capabilities and deploying AI that enables smarter recommendations for shoppers is one way to create that personal customer experience and drive purchases. We’re looking forward to sharing more best practices from some of the biggest brands at Activate Now.”

The event will be held at the Helen Mills Event Space and Theater at 137 West 26th Street in New York City. Full schedule is below:

Registration & Networking: 8:30-9:00am

Delivering a Hyper-Personalized Customer Experience

  • Will Hayes, CEO, Lucidworks

How AI Might Impact the Role of Merchandising (panel)

  • Ronak Shah, Architect, major homeware brand
  • Katharine McKee, Founder, Digital Consultancy
  • Liz O’Neill, Sr. Digital Commerce Mgr, Lucidworks
  • Diane Burley, VP Content, Lucidworks

Why Customers Can’t Find What They Are Looking For

  • Peter Curran, President and CEO, Cirrus10

Enhancing Search Experience at Speed and Scale

  • Pavan Baruri, Director – Core Services (Customer Experience), Foot Locker

AI-powered Search for Increased Conversions

  • Grant Ingersoll, CTO, Lucidworks

Networking Lunch, 12:30-2:00pm

Join e-commerce leaders for education, inspiration, and networking. Tickets are $199 for attendance and lunch. Press passes are available with valid accreditation. Full details at

The post Learn How the Fortune 500 Combine Search and AI to Build a Better Online Shopping Experience appeared first on Lucidworks.

Two years on, little action from the EU on public country-by-country reporting / Open Knowledge Foundation

Two years ago, members of the European Parliament voted to force large multinational corporations registered in Europe to reveal how much tax they pay, how many people they employ and what profits they make in every country where they work.

The transparency measure – known as public country-by-country reporting (or public CBCR) – was first proposed by the Tax Justice Network in 2003 and has gained prominence in recent years following international tax scandals including the Panama Papers.

MEPs approved the introduction of the measure by 534 votes to 98 votes and also mandated that the information should be published by corporations as open data to allow anyone to freely use, modify and share it.

Calls for action on this issue had come from campaigners including the Tax Justice Network, Tax Justice UK and Transparency International, and were echoed by politicians from Labour leader Jeremy Corbyn and a cross-party selection of UK MPs to the European Commission’s Taxation and Customs Union. Some large companies and investors also spoke out in favour. According to a 2017 YouGov poll conducted for Oxfam, 78% of British voters would be in favour of public CBCR for multinationals present in the UK. Oxfam called on the UK government to enforce comprehensive public CBCR for UK companies by the end of 2019.

But, ahead of the European parliamentary elections due next week, little progress has been made towards introducing public CBCR across the continent, with legislation being blocked by members of the EU Council. So what will it take for public CBCR to become law?

Public CBCR by Financial Transparency Coalition is licensed under CC BY-NC-ND 3.0

The European Union already requires companies in the extractive, logging and banking sectors to publish public CBCR information on a regular basis, albeit not as open data. These measures were introduced following the 2007/08 financial crash and in line with the Extractive Industries Transparency Initiative.

Using this information, researchers have revealed the extent to which the top 20 EU banks are using tax havens, and shown how CBCR requirements have forced some banks to change their behaviour. But academics have also shown that better data is needed, and efforts to understand the data have been hampered by the need to extract, structure and clean it from tables or text in companies’ annual reports.

Since MEPs voted in 2017, the case for the EU to act to introduce public CBCR across more sectors and industries has only grown stronger. The final report of the European Parliament’s Panama Papers committee adopted in December 2017 called for “ambitious public country-by-country reporting in order to enhance tax transparency and the public scrutiny of multinational enterprises” noting that public CBCR is “one of the key measures for achieving greater transparency in relation to companies’ tax information for all citizens”.

Investors and those promoting business sustainability have recognised the importance of understanding more about corporations’ tax affairs as well as structuring this information in more of a standardised way. Some businesses have even gone so far as to publish their own CBCR reports ahead of legislation coming into force.

In February 2017, as part of our Open Data for Tax Justice project, the Open Knowledge Foundation published a white paper which examined the prospects for creating a global database on the tax contributions and economic activities of multinationals as measured by public CBCR.

This found that such a public database was possible and that a pilot could be created by bringing together the best existing source of public CBCR information – disclosures made by European Union banking institutions in line with the Capital Requirements Directive IV (CRD IV) passed in 2013. In July 2017, we took steps towards the creation of this pilot.

As European parliamentary candidates enter the final stretch of campaigning, we urge those elected to return to Brussels in July to arrive with a renewed sense of urgency in this area and to focus efforts on making sure public CBCR becomes law before the public’s trust is rocked by yet another international tax scandal.

Immutability FTW! / David Rosenthal

There's an apparently apocryphal story that when Willie Sutton, the notorious bank robber of the 1930s to 1950s, was asked why he robbed banks, he answered:
Because that's where the money is!
Today's Willie Suttons don't need a disguise or an (unloaded) Thompson submachine gun, because they rob cryptocurrency exchanges. As David Gerard writes:
Crypto exchange hacks are incredibly rare, and only happen every month or so.
Yesterday Bloomberg reported:
Binance, one of the world’s largest cryptocurrency exchanges, said hackers withdrew 7,000 Bitcoins worth about $40 million via a single transaction in a “large scale security breach,” the latest in a long line of thefts in the digital currency space.
Below the fold, a few thoughts:

BTC-USD on Coinbase 5/8/19
First, "7,000 Bitcoins worth about $40 million" implies 1 BTC ≅ $5,700. Just before the news 1 BTC ≅ $5,900 on Coinbase, an exchange where it is possible to sell BTC for USD. On the news it dropped to $5,700 before recovering to around $5,800.

But in order to allow customers to withdraw USD, Coinbase conforms to the Know Your Customer/Anti-Money-Laundering laws. So it is unlikely that the perpetrators could use any of the BTC-USD exchanges to turn their ill-gotten BTC into USD. They would have to use a less scrupulous exchange, which means they'd end up with USDT (Tether) not USD.

Since the New York State Attorney General sued Bitfinex, the exchange that sponsors Tether, and revealed an $850M hole in Tether's reserve, Tether was forced to admit that it was not backed 100% by USD. They now claim only 74%, but there has never been an independent audit to confirm their USD holdings. Despite this, the USDT-USD rate has held up well although customers are fleeing Bitfinex. Because USDT is so central to cryptocurrency trading, it has become too big to fail. But the converse of this is that if it does fail, the whole house of cards collapses.

Even assuming the perpetrators could trade their BTC for USD, what effect would selling 7,000BTC have on the price? Cryptocurrency markets are heavily manipulated; around 95% of all cryptocurrency trades are fake. Apart from the fake trades, the markets are not very liquid, as the Mt. Gox bankruptcy trustee found out:
An upset Mt. Gox creditor analyses the data from the bankruptcy trustee’s sale of bitcoins. He thinks he’s demonstrated incompetent dumping by the trustee — but actually shows that a “market cap” made of 18 million BTC can be crashed by selling 60,000 BTC, over months, at market prices, which suggests there is no market.
So the stolen 7,000 BTC are in practice unlikely to end up worth anything close to $40M. Still good for the perpetrators, but not so good for the journalists reporting on the theft.

Second, the initial response from the CEO of Binance is revealing:
In the wake of a multimillion-dollar hack Tuesday, Changpeng Zhao, the CEO of cryptocurrency exchange startup Binance publicly discussed whether the company might seek to encourage bitcoin miners and node operators to “rollback” the bitcoin blockchain, reversing transactions confirmed by the network to return the funds. ... Zhao said:
“To be honest, we can actually do this probably within the next few days. But there are concerns that if we do a rollback on the bitcoin network at that scale, it may have some negative consequences, in terms of destroying the credibility for bitcoin.”
Mining Pools 05/08/19
Zhao is right that Binance could have paid for a rollback. They would only have to have persuaded 4 mining pools to do it. 7,000 BTC is the reward for nearly 4 days of mining, so they would have had a lot of BTC to do the persuasion with.

The whole point of the blockchain technology underlying cryptocurrencies is to implement an "immutable public ledger", and thus make transactions irreversible. But, as we saw with the great DAO heist, the first reaction to a major theft is to consider a "hard fork" to reverse the transaction. Because immutability is for the little guys, not for us:
  • Immutability of blockchain ledgers is sold as being enforced by the technology, but as we see it is really enforced socially.
  • If a ledger really is immutable, it is a "be careful what you wish for" thing, because in the real world it works well until it doesn't.
Binance have suspended withdrawals, but have promised to make good customer losses. However, as David Gerard reports, there is much less to this promise than meets the eye:
Binance has reassured customers that their SAFU insurance fund fully covers the loss, and customers will not be out anything.

The SAFU fund was created after July 2018 irregularities on the exchange involving Syscoin, a minor altcoin. SAFU contains only Binance’s own BNB on-exchange token — and the July 2018 compensation to affected traders was paid out in BNB.
That’s at least 188,000 bitcoins that can’t be sold on real markets for a week. The Bitcoin price on Binance is likely to diverge wildly from the prices on exchanges still open to withdrawals.

Binance itself — and insiders — would be able to move their coins off just fine. The price differences would create remarkable arbitrage opportunities, and ability to capture what liquidity exists in the markets — for those privileged few who can still deposit and withdraw.
The transaction that removed the 7,000 BTC was confirmed with 7.5 BTC, or about 0.1% of its value. Recent game-theoretic analysis suggests that there are strong economic limits to the security of cryptocurrency-based blockchains. For safety, the total value of transactions in a block needs to be less than the value of the block reward. Which kind of spoils the whole idea, doesn't it?
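The fee ratio and the security condition can be sketched in a few lines (the figures are from this post; the function names and the simplified "value at stake versus reward" rule are illustrative, not taken from the cited game-theoretic analysis):

```python
# Toy version of the economic-security argument: the fee paid on the
# stolen transfer, and the rule of thumb that the value moved in a block
# should stay below the reward for mining it.
BLOCK_REWARD_BTC = 12.5  # May 2019 block subsidy

def fee_fraction(fee_btc, amount_btc):
    return fee_btc / amount_btc

def economically_safe(block_tx_value_btc, block_reward_btc=BLOCK_REWARD_BTC):
    # If the value at stake in a block exceeds the mining reward, paying
    # miners to reorganize the chain can beat mining honestly.
    return block_tx_value_btc < block_reward_btc

print(f"fee paid: {fee_fraction(7.5, 7000):.2%}")        # about 0.11%
print(f"7,000 BTC transfer safe? {economically_safe(7000)}")
```

By this rule of thumb, a 7,000 BTC transfer exceeds the 12.5 BTC block reward by more than two orders of magnitude.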

Panoptikum: exploring new ways to categorize a collection of various unusual and unique objects / Open Knowledge Foundation

This blog has been reposted from

For the past two and a half years, the artist Jürg Straumann has been working on a digital retrospective of his life’s work, spanning over four decades of visual art. The latest stage of this project involved creating an interactive way to browse this unique and very personalized database. During our workshop on Open Data Day, March 3 – while Rufus Pollock’s book The Open Revolution was passed around the room – I introduced a gathering of collectors and art experts to Open Knowledge and OpenGLAM. We discussed the question of how new channels and terms like Creative Commons support both the artwork and the artist in a digital economy. And we got lots of great feedback for our project together, which you can read about in this post.

The image above is a style transfer by Oleg Lavrovsky from Der Raub der Deianira durch den Zentauren Nessus by Jürg Straumann (nach Damià Campeny, 2012) to La muse by Pablo Picasso (1935)

Wahnsinnig viel Züg, es isch e wahri Freud! (Swiss German, approx. translation: So much stuff, a true delight!)

Oleg’s story

Over my years as a web developer I have worked on several collaborations with artists like Didier Mouron/Don Harper or Roland Zoss/Rene Rios, and on various ‘code+art’ projects like Portrait Domain with the #GLAMhack and demoscene community. I’m drawn to this kind of project both from a personal interest in art and its many incarnations, as well as from the fascinating opportunity to get to know the artist and their work.

When Jürg approached me with his request, I quickly recognized that this was a person engaged at the intersection of traditional and digital media, who explores the possibilities of networked and remixed art, who is meticulous, scientific, excited by the possibilities and committed to the archiving and preservation of work in the digital commons. I was very impressed with the ongoing efforts to digitize his life’s work on a large scale, and jumped in to help bring it to an audience.

During this same time, I’ve been working on implementing the Frictionless Data standards in various projects. Since he gave me complete freedom to propose the solution, the first thing I did was to use Data Package Pipelines to implement a converter for the catalogue, which was in Microsoft Excel format as shown in the screenshot below. In this process we identified various data issues, slightly improved the schema, and created a reliable conversion process which connected the dataset to the image collection. The automatic verifications in this process started helping to accelerate the digitization efforts.
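The conversion step can be pictured with a hand-rolled miniature. The real project used Data Package Pipelines on the Excel catalogue; the inline CSV sample, column names, and naive field inference below are invented purely for illustration:

```python
import csv
import io
import json

# Hypothetical miniature of the catalogue conversion: take tabular rows
# (inline here; an Excel export in the real project) and emit a
# Frictionless Data Package descriptor referencing the CSV resource.
CATALOGUE = """id,title,year,technique
1,Deianira,2012,drawing
2,Muse,1985,print
"""

def infer_descriptor(csv_text, resource_name):
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Naive inference: treat every column as a string; the real tooling
    # also guesses types and validates each row against the schema.
    fields = [{"name": h, "type": "string"} for h in header]
    return {
        "name": resource_name,
        "resources": [{
            "name": resource_name,
            "path": resource_name + ".csv",
            "schema": {"fields": fields},
        }],
    }

descriptor = infer_descriptor(CATALOGUE, "catalogue")
print(json.dumps(descriptor, indent=2))
```

The point of the descriptor is that every downstream step, from validation to the web gallery, can work from one machine-readable description of the catalogue rather than from the spreadsheet itself.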


Together with Rebekka Gerber, an art historian who works at the Museum für Gestaltung Zürich, we reviewed various systems used for advanced web galleries and museum websites, such as:

While they all had their advantages and disadvantages, we remained unsure which one to commit to: budget and time constraints led us to take the “lowest hanging fruit”, and …not use any backend at all. Our solution, inspired by the csvapi project by Open Data Team, is an instant JSON API. Like their csvapi, ours works directly from the CSV files, which are first referenced from the Data Package generated by our pipeline using the Python Data Package library. Based on this API, I wrote a simple frontend using the Twitter Bootstrap framework I’m used to hacking on for short term projects.
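The core of the "instant JSON API" idea can be sketched in a few lines. The sample rows and the `query` helper below are hypothetical; the real csvapi-inspired service wires this kind of filter to HTTP endpoints and reads the CSV paths from the Data Package:

```python
import csv
import io
import json

# Backend-free querying: answer a filter request straight from the CSV
# catalogue, with no database in between.
CSV_DATA = """id,title,year,technique
1,Deianira,2012,drawing
2,Muse,1985,print
3,Panoptikum,2000,painting
"""

def query(csv_text, **filters):
    """Return the rows matching all exact-match filters, as a JSON string."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for key, value in filters.items():
        rows = [r for r in rows if r.get(key) == value]
    return json.dumps(rows)

print(query(CSV_DATA, technique="drawing"))  # just the Deianira row
```

Because the CSV files are already the canonical output of the pipeline, serving them directly keeps the whole stack down to a thin API layer plus a static frontend.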


Et voilà! A powerful search interface in the hands of one of our first beta-testers. When you see it – and I hope pretty soon at least a partial collection will be available online – you’ll notice a ton of options. Three screen-fulls of various filters and settings to delight the art collector, exploring the collection of nearly 7’000 images with carefully nuanced features.


If you’ve been reading this blog, you can imagine that it is a collection that could also delight a Data Scientist. If there is interest, I am happy to separately open source the API generator that was made in this project. And our goal is to get this API out there in the hands of fellow artists and remixers. For now, you can check out the code in

The open source project is available at, and we are going to continue working on future developments in this repository. The content is not yet available to the public, since we are still working out the copyright conditions and practical questions. Nevertheless, we wish to share some insight into this project with more people through workshops, exhibitions and this blog.

More on all that in future posts. In the meantime, I’ll let Jürg share more background on the project in his own words. Subscribe to our GitHub repository to be notified of progress – and stay tuned!


Wenn Kunst vergrabe isch und vergässe gaht, isch es es Problem für alli Aghörige, e furchtbari Belastig für d Nachkomme. (When art is buried and is lost, it is a problem for all involved, a terrible weight for the next generation.)


(This is the story of the project written by Jürg and translated with DeepL‘s help. You can read the German original at the bottom of this page.)

In a good 40 years of work as a visual artist (in the conventional media of drawing, printmaking and painting), over 6,600 smaller and larger works have accumulated in my collection. In retrospect, these prove to be unusually diverse, but with sporadically recurring elements, somehow connected by a personal “sound”. Very early on I tried to systematize the spontaneous development of my work in different directions. This is the basic idea of the project PANOPTIKUM (since 2000), whereby the categorizations of the whole uncontrolled growth are only the basis for further artistic works – which should, ironically, dissolve the whole again.

In the middle of 2016, with the help of numerous experts, I began to compile a catalogue of my works, i.e. to scan or photograph them and then to index them in a differentiated way in an Excel spreadsheet. In 2018, Oleg Lavrovsky agreed to make the collected data accessible as desired, i.e. after entering search terms, to display the matching images both numerically and visually on the screen by means of a filter function. This is a prerequisite for being able, in the coming years, to continue working with the image material in a variety of creative ways. Our project takes the form of an application, which can also be reviewed and further developed by other people (open source). The copyright and publication rights for all content remain with me; the created app can be freely used as a structure for other projects. In the longer term, general accessibility via the Internet is planned. At the moment, however, all content should only be available to individual interested parties.

After the completion of this basic work, whereby the directory is to be supplemented about every six months, the task now is to concretize my own artistic projects: digital graphics and an interactive work, as well as possibly videos, are pending. For this I depend on expert support; the search for interested persons continues. Commissioned works as well as forms of egalitarian cooperation are possible. In addition, the image material may also be made available for independent projects of third parties.

The starting point and pivotal point of the PANOPTIKUM project is in any case the question of what can be done with a catalogued visual work. A wide variety of sub-projects can be created over an unlimited period of time (artistically, art historically, statistically, literarily, musically, didactically, psychologically, parodistically… depending on the point of view and interests of the participants).

The central idea is to make a visual body of work accessible in an unusual and entertaining way, and to capture additional public benefit through its reworking. Potential goals include:

  • Unusual: the very differentiated formal and content-related recording of one’s own work, which becomes the basis for further creations (self-reflexiveness and reference to the outside world).
  • Entertaining: exploring in a playful way (e.g. searching for the unknown author of this picture pool, memory, domino, competition, etc.) by means of interactive functions, games, VR applications.
  • Artistic work: my own works (approx. 6,600 drawings, paintings and prints), which are presented anonymously and questioned with a good pinch of irony.
  • Making accessible: multimedia, on various channels: exhibition spaces (also improvised and private), internet, cinema. The target audience is as broad as possible, especially outside the usual art scene.
  • Stimulating: the desire to look, the pleasure of pleasurable immersion (flood of images!). On the other hand, thoughts about identity, freedom, openness.
  • Useful: sustainability material: ecological aspects in production and presentation. Social sustainability: smaller events, e.g. with the sale of the works at very favourable conditions in favour of “Public Eye” (instead of a rubble dump at the end of life!). Thus discussion about artist’s estates, archiving, economic aspects (art trade). Any visual material for teaching (art history, art mediation)?

Next steps: Work on the overall concept, on a “story” with scriptwriters, event managers, advertisers, etc. One idea we call the Kunstfund would ask: who is the author? Take the role of art historians, amateurs, gallery owners, art critics and collectors, and speculate; picture disputes, questions of taste; search for meaning; models for political systems – all slightly spunky and ironic.

Parallel to this, experimenting with concrete formal implementations:

  • How can my very sensually influenced, conventionally designed images be staged and brought into a visually attractive contrast with digitally generated elements? For example, by means of split screens, transparencies, animated lettering, infographics, or combinations with photo and video material from the “outside world”, whereby my collage books could serve as a bridge.
  • A function which continuously (anonymously if desired) records all activities and creations of the users – for example, in the design of virtual exhibition spaces with my pictures.

Visit Jürg’s website for glimpses into his work and contact options.

Trans-inclusive design at A List Apart / Erin White

I am thrilled and terrified to say that I have an article on Trans-inclusive design out on A List Apart today.

I have read A List Apart for years and have always seen it as The Site for folks who make websites, so it is an honor to be published there.

RA21's recommended technical approach is broken by emerging browser privacy features / Eric Hellman

This is my third comment about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources" a. k. a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th.  My first comment concerned the use of secure communication channels. The second looked at potential phishing attacks on the proposed system. I'm posting the comments here so you can easily comment.

RA21's recommended technical approach is broken by emerging browser privacy features

Third-party cookies are widely used on the web as trackers, or "web bugs", by advertising networks wishing to target users with advertising. The impact of these trackers on privacy has been widely reported and decried. Browser local storage deployed using third-party iframes is similarly employed for user tracking by ad networks. Browser vendors, led by Apple, have fought back against user tracking by providing user options to limit third-party information sharing. Apple's "Intelligent Tracking Prevention" has progressively increased the barriers to cross-site information storage, for example by partitioning local storage according to third-party context.

Unfortunately for RA21, the draft recommended practice (RP) has endorsed a technical approach which mirrors the tactics used for user tracking by the advertising industry. For this reason, users of Safari who choose to enable the "prevent cross-site tracking" option may not benefit from the "seamless" access promised by RA21 if implemented with the endorsed technical approach.

The optimistically acronymed "P3W" pilot used a javascript library called "Krakenjs/zoid" (According to the Norse sagas, the kraken is a squidlike monster that terrorizes voyagers) to exchange data between cross-domain contexts. The limitations on krakenjs in Safari are acknowledged by the library's developer.  It works by having the host webpage create an iframe loaded from a P3W website. With privacy controls off, the web page posts to the iframe, which answers with a reference to the user's identity provider. The service provider website uses that information to help the user authenticate without having to search through a huge list of identity providers. With Safari privacy features turned on, the search process must be repeated for each and every service provider domain.

Other browser vendors have also moved towards restricting tracking behaviour: Firefox has announced that it will phase in "enhanced tracking protection", and even Google's Chrome browser is moving towards restrictions on tracking technologies.

The bottom line is that if RA21 is implemented with the recommended technical approach, library users will probably be required to turn off privacy enhancing features of their browser software to use resources in their library. As a result, RA21 will have difficulty moving forward with community consensus on this technical approach.

Browser software is much more tolerant of cross-domain communication when the information "hub" is a first-party context (i.e. a window of its own, not an embedded iframe), as is done in more established authentication schemes such as OpenID Connect and SAML flow. RA21 should refocus its development effort on these technical approaches.

Open call: become a Frictionless Data Reproducible Research Fellow / Open Knowledge Foundation

The Frictionless Data Reproducible Research Fellows Programme, supported by the Sloan Foundation, aims to train graduate students, postdoctoral scholars, and early career researchers how to become champions for open, reproducible research using Frictionless Data tools and approaches in their field.

Fellows will learn about Frictionless Data, including how to use Frictionless tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content. In addition to mentorship, we are providing Fellows with stipends of $5,000 to support their work and time during the nine-month long Fellowship. We welcome applications using this form from 8th May 2019 until 30th July 2019, with the Fellowship starting in the fall. We value diversity and encourage applicants from communities that are under-represented in science and technology, people of colour, women, people with disabilities, and LGBTI+ individuals.

Frictionless Data for Reproducible Research

The Fellowship is part of the Frictionless Data for Reproducible Research project at the Open Knowledge Foundation. Frictionless Data aims to reduce the friction often found when working with data, such as when data is poorly structured, incomplete, hard to find, or is archived in difficult-to-use formats. This project, funded by the Sloan Foundation, applies our work to data-driven research disciplines, in order to help researchers and the research community resolve data workflow issues. At its core, Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The core specification, the Data Package, is a simple and practical “container” for data and metadata. The Frictionless Data approach aims to address identified needs for improving data-driven research such as generalised, standard metadata formats, interoperable data, and open-source tooling for data validation.

Fellowship programme

During the Fellowship, our team will be on hand to work closely with you as you complete the work. We will help you learn Frictionless Data tooling and software, and provide you with resources to help you create workshops and presentations. Also, we will announce Fellows on the project website and will be publishing your blogs and workshops slides within our network channels.  We will provide mentorship on how to work on an Open project, and will work with you to achieve your Fellowship goals.

How to apply

We welcome applications using this form from 8th May 2019 until 30th July 2019, with the Fellowship starting in the fall. The Fund is open to early career research individuals, such as graduate students and postdoctoral scholars, anywhere in the world, and in any scientific discipline. Successful applicants will be enthusiastic about reproducible research and open science, have some experience with communications, writing, or giving presentations, and have some technical skills (basic experience with Python, R, or Matlab for example), but do not need to be technically proficient. If you are interested, but do not have all of the qualifications, we still encourage you to apply.

If you have any questions, please email the team at, ask a question on the project’s gitter channel, or check out the Fellows FAQ section. Apply soon, and share with your networks!

Outcomes from the 2019 OCLC RLP Research Retreat / HangingTogether

At the 2019 OCLC RLP Research Retreat we heard many themes that will be familiar to those working in research libraries. As an affiliate group that seeks to return value to our Partners for their investment, this was an opportunity to confirm and reshape our program of work.

The library is one of many stakeholders

As research libraries respond to rapid changes in scholarly outputs and workflows and seek to provide additional support in areas like research data management and research information management, they must increasingly partner with a multitude of stakeholders from across the institution to implement enterprise-wide services to support the entire research life cycle and address multiple, complex institutional challenges. Our RLP members attending the Research Retreat described successful cross-campus collaborations and communication of the library’s value proposition as pressing challenges. And OCLC Research has also surfaced this in several of its recent research reports, which have described the importance of communication and coordination with other internal stakeholders for institutional success.

Libraries need to better understand and be more responsive to campus stakeholders. Librarians are skilled collaborators but frequently lack the knowledge they need of other institutional functional areas to most effectively engage and collaborate with other professionals across the institution. This particularly rings true for units the library may previously have had little interaction with, such as the Research Office, Institutional Research, and Campus Communications.

Beginning later this year, Senior Program Officer Rebecca Bryant will be leading an effort to better understand the operations, goals, and pain points of research university stakeholders to inform library communications and partnership. She will be looking to RLP members for their assistance in identifying articulate informants from numerous campus units, such as:

  • Sponsored programs
  • Institutional research
  • Campus communications
  • Information services/data warehouse
  • Academic affairs
  • Graduate school

RLP members are encouraged to stay tuned via the Announce listserv for more information soon.

Library skills development

At the Research Retreat, we heard that research libraries are challenged to acquire and develop new skills for staff in key areas. We also learned in a post-event survey that RLP members want to know what colleagues and peer institutions are doing. In response to the need to learn about new areas of practice, as well as about what others are doing in the many emergent areas of library practice, the OCLC RLP has established the Works in Progress Webinar series as a place to explore and expose new areas of work. The WIP webinars are a lightweight way for library professionals to learn about emergent areas of practice, as well as to learn about work at OCLC Research while it is in a formative state. We have learned that many institutions organize group viewing parties — we love this, because learning is better when done with others.

We have also developed more in-depth discussion and study groups for areas like Research Data Management and Library Assessment. And we will be continuing to offer quarterly discussions on research support topics such as how institutions are implementing FAIR practices.

Additionally, our long running communities of practice such as the Metadata Managers Focus Group and SHARES (the resource sharing arm of the RLP) serve as fertile discussion forums. These groups support active dialog to identify shared areas of need and support best practices.

Bringing people together

Although many of our learning opportunities are done virtually, people value the opportunity to come together. We are always trying to improve our in-person events, so here are some of our takeaways on that score.

Play Doh creations from the Research Retreat
  • People value the opportunity to meet with one another, particularly at gatherings that are held in conjunction with other meetings. We plan to continue to hold RLP events at OCLC Regional Council meetings and alongside other important meetings such as the RLUK Conference. We are also considering other meetings where an RLP “pop up” meeting might make sense.
  • We tried to inject a sense of play into our meeting by supplying Play Doh at all the tables. This was a hit with Research Retreat participants who expressed their creativity in a variety of ways.
  • To keep attention focussed on the meeting, we asked attendees to observe a “no device” policy. Not only did meeting participants kindly comply with our request, but we also received positive feedback.
  • We had a suggestion for stretch breaks, as well as to make the role of “host” more clear in our Art of Hosting activities. We’re also trying to be even more mindful of the importance of using the mic.

If you attended the Research Retreat, what were your “lessons learned?” Does your institution have someone at the table for the Metadata Managers Focus Group, or on one of our working groups? Are you taking advantage of our webinars? What are your tips for running a great meeting? Let us know, we love to hear from you!

The post Outcomes from the 2019 OCLC RLP Research Retreat appeared first on Hanging Together.

2019 DSpace North American Users Group Call for Proposals / DuraSpace News

The 2019 DSpace North American Users Group planning committee invites proposals for the upcoming meeting which will be held September 23 & 24, 2019 at the Elmer L. Andersen Library at the University of Minnesota in Minneapolis.

This meeting will provide opportunities to discuss ideas, strategies, best practices, use cases, and the future development of DSpace 7 with members of the DSpace community including repository developers, technology directors, and institutional repository managers.

We are looking for proposals to cover a variety of topics including, but not limited to:

  • DSpace 7 development and integration
  • Upgrading or migrating to DSpace
  • Accessibility
  • DSpace for research data
  • DSpace for cultural heritage
  • Analytics and assessment
  • Institutional repositories / scholarly communication issues
  • “Show and Tell” – share your success and challenges
  • Anything else you would like to share with the community!

We are seeking proposals in the following formats:

  • Lightning Talk (5-10 min) – a brief, freestanding presentation, with or without slides, including Q&A
  • Presentation (20 min) – a more comprehensive, freestanding presentation, including Q&A
  • Discussion Panel (45 min) – a collection of brief presentations on a topic or area, including a moderated Q&A or open discussion
  • Workshop – an instructor-led workshop on a topic or tool
  • Birds-of-a-feather – breakout sessions for attendees to engage in a particular topic

Submit a proposal by June 21, 2019.

Notifications of acceptance will be sent by July 15, 2019.

Need some ideas? Check out past North American user group meeting programs on the conference wiki!

Questions? Contact us at

The 2019 DSpace North American User Group Meeting is jointly sponsored by the University of Minnesota Libraries and the Texas Digital Library.

We encourage members of the wider open repository community and those interested in learning more about the open source DSpace repository platform to participate. More information about accommodations, registration, and schedule will be made available on the conference website.

The post 2019 DSpace North American Users Group Call for Proposals appeared first on

Demand Is Even Less Insatiable Than It Used To Be / David Rosenthal

In Demand Is Far From Insatiable I looked at Chris Mellor's overview of the miserable Q2 numbers from Seagate, Nearline disk drive demand dip dropkicks Seagate: How deep is the trough, how deep is the trough?, and Western Digital,  Weak flash demand and disk sales leave Western Digital scrabbling to claw back $800m a year. This quarter was equally dismal. Below the fold, the gory details.

The details are in Paul Kunert's ironically headlined WD, Seagate romp over Q3 finishing with glorious sales and profit. The sub-head is more accurate:
Oh no, sorry that was another quarter in a different year - top and bottom lines wobble amid corp procurement slowdown
As in the previous quarter, Western Digital had the worst of it:
sales for the quarter ended 29 March fell 26.7 per cent year-on-year to $3.7bn, with hard disks down 20.4 per cent to $2.1bn and flash sliding by a third to $1.6bn. As a result the bit biz lost $374m for the quarter.

The sequential average selling price per gigabyte declined 23 per cent. WD now expects total disk exabyte shipments in capacity to be flat or slightly up this year.

Gross profit for the quarter fell to $579m from $602m a year earlier, operating expenses dropped by $40m to $973m, leaving a loss from operations of $394m, versus an operating profit of $914m in Q3 ’18.

Interest expenses and income, along with tax, resulted in a net loss of $581m, versus a net profit of $61m.
Seagate was less unhappy:
revenues came in at $2.31bn, down 17.5 per cent year-on-year. ... Demand for product from the hyperscale vendors was up “slightly” but not enough to “fully offset the slower demand from OEM and other global cloud customers”.

“As a reminder, demand for our Nearline drives began to slow in the December quarter, as cloud service providers work through the inventory build-up during calendar 2018. However, we anticipate this pause to be short lived”.

Seagate shipped 77 exabytes of capacity in the quarter, down 12 per cent sequentially.

As for the rest of its results, operating expenses fell 12 per cent to $2.01bn. Operating profit was $236m, versus a profit from ops of $441m a year earlier. Interest income and expenses left net profit at $195m, down from $381m. Earnings per share were just $0.69, less than half compared to this time last year.
Why did "cloud service providers" have an "inventory build-up during calendar 2018"? Because the demand for storage from their customers was even further from insatiable than the drive vendors expected. Even the experts fall victim to the "insatiable demand" myth.

In Apple and Samsung feel the pain as smartphone market slumps to lowest shipments in 5 YEARS, Kunert reports that the demand for flash memory is equally far from insatiable:
Figures from analyst Canalys show that 313.9 million handsets were shifted in the three months, down 6.8 per cent - the sixth straight quarter of shrinkage in a sector that has given consumers little reason to upgrade.

Chaebol Samsung, which filed its calendar Q1 results yesterday, held onto the top spot by some margin - though a 10 per cent decline in sales to 71.5 million clipped its market share from 23.6 per cent a year ago to 22.8 per cent. Huawei snuck into second spot with a 50 per cent climb in shipments to 59.1 million, giving it an 18.8 per cent share of sales, versus 11.7 per cent in Q1 '18.

The troubles continued at Apple, following a disappointing Xmas holidays quarter, as it reported 40.2 million unit shipments, down 23.2 per cent, reducing its share to 12.8 per cent from 15.5 per cent.
These high-end smartphones are a big part of the demand for flash memory.

RA21 Draft RP session timeout recommendation considered harmful / Eric Hellman

Hey everybody, I implemented RA21 for access to the blog!

Well, that was fun.

I'm contributing comments about the recently published NISO draft "Recommended Practice" (RP) on "Improved Access to Institutionally-Provided Information Resources" a. k. a. "Resource Access in the 21st Century" (RA21). Official comments can be submitted until May 17th. The draft has much to recommend it, but it appears to have flaws that could impair the success of the effort. My first comment concerned the use of secure communication channels. I expect to write two more. I'm posting the comments here so you can easily comment.

RA21 Draft RP session timeout recommendation considered harmful

RA21 hopes to implement a user authentication environment which allows seamless single sign-on to a large number of service provider websites. Essential to RA21's vision is to replace a hodge-podge of implementations with a uniform, easily recognizable user interface.

While a uniform sign-in flow will be a huge benefit to end users, it introduces an increased vulnerability to an increasingly common type of compromise, credential phishing.  A credential phishing attack exploits learned user behavior by presenting the user with a fraudulent interface cloned from a legitimate service. The unsuspecting user enters credentials into the fraudulent website without ever being aware of the credential theft. RA21 greatly reduces the difficulty of a phishing attack in three ways:
  1. Users will learn and use the same sign-in flow for many, perhaps hundreds, of websites. Most users will occasionally encounter the RA21 login on websites they have never used before.
  2. The uniform visual appearance of the sign-in button and identity provider selection step will be trivial to copy. Similarly, a user's previously selected identity provider will often be easy for an attacker to guess, based on the user's IP address.
  3. If successful, RA21 may be used by millions of authorized users, making it difficult to detect unauthorized use of stolen credentials.
If users are trained to enter password credentials even once per day, they are unlikely to notice when they are asked for identity provider credentials by a website crafted to mimic a real identity provider.

For this very reason, websites commonly used for third party logins, such as Google and Facebook, use session timeouts much longer than the 24 hours recommended by the RA21 draft RP. To combat credential theft, they add tools such as multi-factor authentication and insert identity challenges based on factors such as user behavior and the number of devices used by an account.

Identity providers participating in RA21 need to be encouraged to adopt these and other anti-phishing security measures; the RA21 draft's recommended identity provider session timeout (section 2.7) is not in alignment with these measures and is thus counterproductive. Instead, the RP should encourage long identity provider session timeouts, advanced authentication methods, and should clearly note the hazard of phishing attacks on the system. Long-lived sessions will result in better user experience and promote systemic security. While the RP cites default values used in Shibboleth, there is no published evidence that these parameters have suppressed credential theft; the need for RA21 suggests that the resulting user experience has been far from "seamless".

BC Digitized Collections: Towards a Microservices-based Solution to an Intractable Repository Problem / Code4Lib Journal

Our Digital Repository Services department faced a crisis point in late 2017. Our vendor discontinued support for our digital repository software, and an intensive, multi-department, six-month field survey had not turned up any potential replacements that fully met our needs. We began to experiment with a model that, rather than migrating to a new monolithic system, would more closely integrate multiple systems that we had already implemented—ArchivesSpace, Alma, Primo, and MetaArchive—and introduce only one new component, namely Mirador. We determined that this was the quickest way to meet our needs, and began a full migration in spring of 2018. The primary benefit of a microservices-based solution for our collections was the potential for customization; we therefore present our experiences in building and migrating to this system not as a blueprint but as a case study with lessons learned. Our hope is that in sharing our experience, we can help institutions in similar situations determine 1) whether a microservices-based solution is a feasible approach to their problem, 2) which services could and should be integrated and how, and 3) whether the trade-offs inherent in this architectural approach are worth the flexibility it offers.

Building a better book widget: Using Alma Analytics to automate new book discovery / Code4Lib Journal

Are we doing enough to market newly acquired book titles? Libraries purchase and subscribe to many new book titles each year, both print and electronic. However, we rely on the expectation that users will periodically search our systems to discover newly acquired titles. Static lists and displays have been traditional marketing methods for libraries, but require considerable time and effort to maintain. Finding no practical existing solution for an academic library, East Tennessee State University developed an automated process to generate book widgets utilizing data from Alma Analytics. These widgets are now deployed in our subject guides, website, and on our digital displays. This article outlines the development and implementation of these widgets. We also discuss the challenges we encountered, such as finding cover images and custom subject tagging.
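
As a rough illustration of the approach the abstract describes (not ETSU's actual implementation), the sketch below turns rows like those an Alma Analytics "new titles" report might return into a simple HTML widget. The field names are assumptions, and Open Library's cover API stands in as one possible source of cover images:

```python
# Hypothetical sketch: render rows from a "new titles" report as an HTML
# widget. The 'title'/'isbn'/'link' keys are assumed field names; Open
# Library's cover service is one possible cover-image source.

def cover_url(isbn):
    """Open Library cover image for an ISBN (medium size)."""
    return f"https://covers.openlibrary.org/b/isbn/{isbn}-M.jpg"

def render_widget(rows):
    """rows: iterable of dicts with 'title', 'isbn', and 'link' keys."""
    items = []
    for row in rows:
        items.append(
            f'<li><a href="{row["link"]}">'
            f'<img src="{cover_url(row["isbn"])}" alt="">{row["title"]}</a></li>'
        )
    return "<ul class='new-books'>" + "".join(items) + "</ul>"

widget = render_widget([
    {"title": "Example Title", "isbn": "9780316769488",
     "link": "https://example.edu/record/1"},
])
```

A scheduled job could regenerate markup like this from a fresh Analytics export and push it to subject guides or digital displays.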

Managing Discovery Problems with User Experience in Mind / Code4Lib Journal

Williams Libraries recently developed a system for users to report problems they encountered while using the library catalog/discovery layer (Primo). Building on a method created by the Orbis Cascade Alliance, we built a Google form that allows users to report problems connecting to full text (or any other issue) and automatically includes the permalink in their response. We soon realized that we could improve the user experience by automatically forwarding these reports into our Ask a Librarian email service (LibAnswers) so we could offer alternative solutions while we worked on fixing the initial issue. The article will include an explanation of the process, reactions from public service staff, methods for managing the problems once submitted, and code shared on GitHub for those interested in implementing the tool at their own library.
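
The automatic permalink inclusion described above can be approximated with Google Forms' pre-filled link mechanism, where `entry.<field id>` query parameters pre-populate answers. This is a sketch, not Williams' actual code; the form ID and entry ID are placeholders you would obtain from the form's "Get pre-filled link" option:

```python
from urllib.parse import urlencode

# Placeholders: substitute your own form ID and the entry ID of the
# question that should receive the record permalink.
FORM_ID = "FORM_ID_GOES_HERE"
PERMALINK_ENTRY = "entry.123456789"

def report_problem_url(permalink):
    """Build a pre-filled Google Form URL carrying the catalog permalink."""
    base = f"https://docs.google.com/forms/d/e/{FORM_ID}/viewform"
    return base + "?" + urlencode({PERMALINK_ENTRY: permalink})

url = report_problem_url("https://search.library.example.edu/permalink/abc123")
```

A discovery-layer customization could generate this URL for the record the user is currently viewing, so the report arrives with the problem context attached.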

Responsive vs. Native Mobile Search: A Comparative Study of Transaction Logs / Code4Lib Journal

The Consortium of Academic and Research Libraries in Illinois (CARLI) comprises 130 libraries, a majority of which participate in the union catalog I-Share for resource sharing. The consortium implemented VuFind 4, a responsive web interface, as their shared union catalog in December 2017. This study compared search transaction logs from a native mobile app that serves the consortium with search transactions in the responsive mobile browser. Library professionals in the consortium sought to understand the nature of mobile search features by evaluating the relative popularity of mobile devices used, search terms, and search facets within the two mobile search options. The significance of this research is that it provides comparative data on mobile search features to the library UX community.

Large-Scale Date Normalization in ArchivesSpace with Python, MySQL, and Timetwister / Code4Lib Journal

Normalization of legacy date metadata can be challenging, as standards and local practices for formulating dates have varied widely over time. With the advent of archival management systems such as ArchivesSpace, structured, machine-actionable date metadata is becoming increasingly important for search and discovery of archival materials. This article describes a recent effort by a group of Yale University archivists to add ISO 8601-compliant dates to nearly 1 million unstructured date records in ArchivesSpace, using a combination of Python, MySQL, and Timetwister, a Ruby gem developed at the New York Public Library (NYPL).
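
Timetwister itself is a Ruby gem, but the kind of structured output such a pipeline produces can be illustrated with a much-simplified Python sketch. The patterns and output shape below are illustrative assumptions, not Yale's actual pipeline:

```python
import re

# Map a few common legacy date expressions to ISO 8601 begin/end pairs.
# Real legacy data is far messier; this only shows the structured target.
MONTHS = {m: i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

def normalize(expression):
    expression = expression.strip()
    m = re.fullmatch(r"(\d{4})-(\d{4})", expression)           # "1861-1865"
    if m:
        return {"begin": m.group(1), "end": m.group(2)}
    m = re.fullmatch(r"(\w+) (\d{1,2}), (\d{4})", expression)  # "July 4, 1776"
    if m and m.group(1) in MONTHS:
        iso = f"{m.group(3)}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}"
        return {"begin": iso, "end": iso}
    m = re.fullmatch(r"circa (\d{4})", expression)             # "circa 1900"
    if m:
        return {"begin": m.group(1), "end": m.group(1),
                "certainty": "approximate"}
    return None  # leave unparseable expressions for human review
```

Returning `None` for anything unrecognized matters at this scale: with nearly a million records, the unparsed remainder defines the manual-cleanup queue.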

Visualizing Fedora-managed TEI and MEI documents within Islandora / Code4Lib Journal

The Early Modern Songscapes (EMS) project [1] represents a development partnership between the University of Toronto Scarborough’s Digital Scholarship Unit (DSU), the University of Maryland, and the University of South Carolina. Developers, librarians and faculty from these institutions have collaborated on an intermedia online platform designed to support the scholarly investigation of early modern English song. The first iteration of the platform, launched at the Early Modern Songscapes Conference, held February 8-9, 2019 at the University of Toronto’s Centre for Reformation and Renaissance Studies, serves Fedora-held Text Encoding Initiative (TEI) and Music Encoding Initiative (MEI) documents through a JavaScript viewer capable of being embedded within the Islandora digital asset management framework. The viewer presents versions of a song’s musical notation and textual underlay followed by the entire song text. This article reviews the status of this technology, and the process of developing an XML framework for TEI and MEI editions that would serve the requirements of all stakeholder technologies. Beyond the applicability of this technology in other digital scholarship contexts, the approach may serve others seeking methods for integrating technologies into Islandora or working across institutional development environments.

Creating a Low-cost, DIY Multimedia Studio in the Library / Code4Lib Journal

This case study will explain steps in creating a multimedia studio inside a health sciences library with existing software and a minimal budget. From ideation to creation to assessment, the process will be outlined in development phases and include examples of documentation, user feedback, lessons learned, and future considerations. We’ll explore multimedia software like One Button Studio, GameCapture, Kaltura, Adobe Creative Cloud, Garage Band, and others and compare their effectiveness when working on audio and visual projects in the library.

Testing Sprint Wrap Up - Save the Date! / Islandora

We've wrapped up yet another amazing community sprint. This time around, volunteers from 9 different organizations put Islandora 8 through its paces. Bugs and documentation gaps were uncovered as community members worked through an ever-expanding list of test cases. The Islandora 8 committers have responded to testing feedback, and we've already seen improvements roll in. There was even a special guest appearance by a wild Ruebot!

The Islandora Foundation would like to thank everyone who generously donated their time to critically reviewing Islandora 8 as it makes its way to release. Individuals from the following institutions actively took part in testing:

  • University of Tennessee
  • UNLV
  • SFU
  • Islandora Foundation
  • UNC Charlotte
  • Arizona State University
  • UTSC
  • York University

We'd also like to give a very special thanks to our committers, and in particular, to Natkeeran Kanthan, who showed tremendous initiative in collecting and documenting test cases.

Now that we've been through both a documentation and a testing sprint, we are making our final preparations for release. So mark your calendars, because we're releasing Islandora 8 on May 31st! We have a Github milestone set up that contains all the issues we'd like to resolve before then, so keep an eye on it to track our progress. When released, you can expect to see the following features in Islandora 8:

  • Object types
    • Collections
    • Images (Basic and Large)
    • Audio
    • Video
    • Binaries
    • PDFs
  • Multiple file systems
    • Fedora
    • Public
    • Private
    • And many more…
  • REST API
    • View entities with GET
    • Create entities with POST
    • Update entities with PATCH
    • Remove entities with DELETE
    • Add files to objects with PUT
  • Solr search
    • Configure search index through the UI
  • Custom viewers:
    • Openseadragon
    • PDF.js
  • Custom field types:
    • Extended Date Time Format (EDTF)
    • Typed Relation
    • Authority Link
  • Custom entities for:
    • People
    • Families
    • Organizations
    • Locations
    • Subjects
  • Derivatives for:
    • Image
    • Audio
    • Video
  • Access control
    • Hide content from users and search
    • Hide sensitive fields from users
  • Control repository events through the UI
    • Index RDF in Fedora
    • Index RDF in a Triplestore
    • Derivatives
    • Switching themes
    • Switching displays/viewers
    • Switching forms
  • Bulk ingest using CSV
  • Migration tools for Islandora 7
  • Views
    • Configure lists of content
    • Perform actions in bulk on lists of content
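
As a rough sketch of what the REST verbs in the feature list amount to, here is a hypothetical mapping of repository operations to HTTP methods and paths. The paths are placeholders, not Islandora 8's actual routes; consult the Islandora documentation for the real endpoints:

```python
# Hypothetical operation -> (HTTP method, path template) mapping.
# Path templates are illustrative placeholders only.
OPERATIONS = {
    "view":        ("GET",    "/node/{id}"),
    "create":      ("POST",   "/node"),
    "update":      ("PATCH",  "/node/{id}"),
    "delete":      ("DELETE", "/node/{id}"),
    "attach_file": ("PUT",    "/node/{id}/media/{field}"),
}

def request_for(operation, **params):
    """Return the HTTP method and filled-in path for a repository operation."""
    method, template = OPERATIONS[operation]
    return method, template.format(**params)
```

The point of the table is that each repository action maps onto a standard HTTP verb, so generic HTTP clients can drive the repository without Islandora-specific tooling.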

And if what you're looking for doesn't happen to be on that list, we will be actively soliciting community input on which features to tackle first from our proposed technical roadmap. So be on the lookout for more from us at the Islandora Foundation as we work our way through releasing Islandora 8!

How Employees Are Key to Digital Transformation / Lucidworks

The momentum for companies to undergo digital transformation continues to grow. However, too often, digital transformation is thought of as just a technological change, forgetting that we need stewards of change. It turns out that for companies to succeed with digital transformation, they have to shift focus beyond technology and invest in their people.

Telstra, Australia’s largest telecommunications carrier, offers digital transformation services worldwide. As the backbone of countless enterprises and organizations around the globe, Telstra surveyed 3,810 respondents from 12 industries in 14 markets on where they were on their transformation journey.

As highlighted in the resulting Disruptive Decision-Making Report (as well as other research), businesses are actually further along with transforming their technology than they are with changing the corporate culture and individual mindsets of employees — both of which are necessary to profit from digital transformation.

In fact, Telstra found that organizations currently have the lowest confidence in how their people are contributing to digital transformation decision-making, when compared to facets such as technology and processes.

It’s More Than Dropping in Technology

According to the survey, organizations that have devoted the most attention to people and processes are more digitally mature and more advanced in their transformation journeys. In short, companies that overlook or ignore the people side of digital transformation will fail.

But like so many aspects of digital transformation, achieving this focus is easier said than done. The survey’s authors recommend three steps companies can take to improve the people dimension of digital transformation:

  1. Understand what digital transformation means for your organization
  2. Empower people and strengthen processes
  3. Be confident in your technology so you concentrate on driving change

These recommendations support the idea that the only way to thrive in the people dimension is by ensuring that digital transformation isn’t seen as an outside force that is inflicted on staff. Instead, it must be done in partnership, with those at the C-level setting a vision that has buy-in at all levels of the organization.

To gain a better sense of how companies are navigating these changes in practice, we looked at some best practices and case studies. We summarize them below.

Change Is Anathema to Organizations

“We know from neuroscience – and from everyday life – that groups tend to resist change,” says Ellen Leanse, chief people officer at Lucidworks, who has taught neuroscience and innovation at Stanford University. “Employees join an organization, and when it comes time to change, they tend to cling to the status quo rather than look objectively at what the change will bring about.”

Leanse, who speaks of the brain as “an ancient technology we use to navigate modern life,” explains that the brain tends to resist change and unknowns as part of its default fast thinking mode.

“The brain, ultimately, is a survival device. It optimizes for familiar behavior and often resists objective, open-minded, curious thinking. After all, it has no proof that the ‘new’ will keep it, or us, safe and alive. Research shows that shared beliefs and behaviors stay even more entrenched in group settings. It’s easy to see some evolutionary benefits for that sort of psychology. Yet we all know the price when ‘groupthink’ sets in.”

To address this, Leanse suggests getting buy-in on a shared vision: something the group is willing to agree is a worthy goal. Then, emphasize cooperation and curiosity as paths to achieving this vision. This helps guide groups out of familiar, comfortable behaviors and find motivators (such as social connection and a sense of learning) that increase the satisfaction associated with the goal.

Help them feel like owners, empowered to meet their own (and shared) goals, as well as the organization’s. “When transformation is seen as a shared journey to a result a group agrees to,” says Leanse, “discomfort and resistance can fall away.”

Culture and People Are as Important as Technology

As new technology is being inserted into nearly all aspects of the business, a profound shift is taking place. As an article in the Enterprisers Project points out, this shift has a significant impact on company culture.

“Digital transformation initiatives often reshape workgroups, job titles, and longtime business processes. When people fear their value and perhaps their jobs are at risk, IT leaders will feel the pushback,” write the article’s authors.

To overcome this pushback, the article suggests IT leaders should start with empathy to fully understand the anxieties and fears staff might have about integrating new technology.

A 2018 survey from McKinsey & Company came to similar conclusions. While emphasizing the importance of technology in transformations, the report also found that companies with strong leadership and enterprise-wide workforce planning and talent-development practices had higher success rates than those that did not.

Empowering workers. In an MIT Sloan Management Review analysis of digital transformation, George Westerman, Didier Bonnet, and Andrew McAfee argue that worker empowerment is vital to positive digital transformations. This can involve showing workers how the new technology will help them do their jobs more efficiently or the benefits of a more digital workplace that allows staff greater freedom to work when, where, and how they want to.

The McKinsey survey also reinforces the idea that workers should be empowered. “Another key is giving employees a say on where digitization could and should be adopted. When employees generate their own ideas about where digitization might support the business, respondents are 1.4 times more likely to report success,” it says. The report also recommended that workers play a key role in enacting changes.

Prioritizing culture. One of the key traps many companies fall into is underestimating the importance of culture in digital transformation. As Greg Satell argues in a piece for Inc., digital transformation is human transformation. To gain buy-in, a strong vision that aligns with business objectives is crucial. But so is starting transformation with automation that improves the day-to-day lives of staff, reducing the need for tedious tasks.

Articles in Jabil and CIO also point out that companies must engage their staff from the outset in the transformation process to overcome employee pushback and larger organizational resistance.

These articles and the Telstra survey illustrate that technology can take an organization only so far when it comes to digital transformation. People play a key part, as technology can’t replace the need for organizational cohesion that comes only when everyone from the C-suite on down understands why change is necessary, what the changes will be, and how the business will improve as a result.

But it all starts with changing what we look for in employees, says Leanse. “By changing our perspective from looking for people who will just follow orders, we need to make sure that we hire people who have excelled in cooperation and who embrace empowerment.”

Cooperation says we are all in it together, she explains. And empowerment means you are giving your employees the tools to learn, grow and control their autonomy — while they help the organization.

Dan Woods is a Technology Analyst, Writer, IT Consultant, and Content Marketer based in NYC.

The post How Employees Are Key to Digital Transformation appeared first on Lucidworks.

Library concerns in an evolving landscape: highlights from the 2019 OCLC RLP Research Retreat / HangingTogether

On 23 and 24 April, members of the OCLC Research Library Partnership gathered together with OCLC staff in Dublin, Ohio for the OCLC RLP Research Retreat. This event was conceived as an opportunity to “dig in” to some of the research around transformations in scholarly communication and the evolving research library landscape.

To help frame that discussion, we invited two speakers to provide their insights into how libraries are evolving in response to demographic, social, and technological changes.

Generational change and the decline of lower skill employment in academic libraries

Stanley Wilder (Louisiana State University) gave a fascinating presentation in which he revealed some key changes in the library workforce:

  • Impending rapid turnover in library dean positions
  • The evolution of skills required for librarianship
  • The decline of lower-skilled staff library positions as the number of skilled positions increases

Wilder’s analysis was drawn from twenty years of rich data collected by the Association of Research Libraries (ARL). The OCLC RLP is a diverse and transnational membership organization, with many non-ARL libraries; however, the peerless data set that ARL has been collecting for decades gives valuable insights into professional shifts. [download slides]

Stanley Wilder presenting at the OCLC RLP Research Retreat

The evolving scholarly record and the utility of OCLC Research models

Keith Webster (Carnegie Mellon University) presented a roundup of OCLC Research “greatest hits” as viewed through a Carnegie Mellon lens. He particularly articulated how OCLC Research models help support his high-level understanding of the signals and drivers acting on the library landscape. For instance,

  • Keith specifically called out the utility of the University Futures, Library Futures report, describing it as a “wakeup call” for libraries, forcing us to see that the goals and mission of the parent institution are shaping our libraries. We should no longer be defining ourselves by the size of our collections, and he finds the library services framework detailed in that report offers new ways for us to consider how library services align with institutional goals.
  • He frequently uses the Evolving Scholarly Record and Stewardship of the Evolving Scholarly Record reports to help him explain the rapidly changing scholarly communications landscape to others at his institution, and he has taken action to deploy tools to support all parts of the evolving scholarly record. “As the ESR report model points out, we are in an environment where it’s not just the published outcome of the research process that is relevant.”
  • He also specifically cited how OCLC Research has helped to think about the local CMU collection as a subset of the larger facilitated collection.

Discussion themes

But most of the day was spent in small group interactions, as RLP staff led participants in discussions focused on the major future challenges and opportunities facing research libraries, using Art of Hosting techniques to facilitate meaningful activities. Discussions were lively, and we captured some high level themes focusing on:

Library value proposition

  • Libraries need to better understand and be more responsive to campus stakeholders. These stakeholders not only represent campus collaborators but frequently control funding sources. This helped to confirm work we have planned.
  • Low expectations of libraries, created by a misunderstanding of the value we actually offer, are a threat.
  • Libraries touch everyone on campus. We should be thinking of ourselves as a force for removing friction – that could be through technology, or it could be through building partnerships.
  • For the library to lead, we need to be the envy of others on campus. 

Library skills development

  • Research libraries are challenged to acquire and develop new skills for staff in key areas.
  • Staff turnover creates opportunity, but that opportunity for fresh perspectives can be thwarted by a culture that is resistant to change.
  • Libraries must invest in leadership development and change management.  
  • We are switching from a “pay to read” to a “pay to publish” model: how does this shift impact our work?

Intra-institutional collaboration

  • Libraries need to collaborate around further consolidating shared services, and then invest in what is important locally, which no one else can do.
  • University mandates create opportunities, giving us a moment to change discussions and reframe roles.

Extra-institutional collaboration

  • We need to figure out how to assess the viability of organizations we are asked to invest in. 
  • We need to appropriately balance investments and effort between “local” and “legacy”.

Participants at the Research Retreat

OCLC Research staff also presented on some of our own models. The presentation was intended to unpack the models, which are inspired by the OCLC RLP and created for adaptation and use in planning. We also wanted to expose our way of working — how research questions can lead to further explorations and discussions and can inspire more work. [download slides]

Event attendees gave the Research Retreat high marks. This was an invaluable experience for OCLC Research as well, helping to direct our future investigations in areas of need for research libraries. A blog post describing our takeaways and next steps will follow.

The post Library concerns in an evolving landscape: highlights from the 2019 OCLC RLP Research Retreat appeared first on Hanging Together.

Making blog posts count as part of a not-so-secret feminist agenda / Mita Williams

Secret Feminist Agenda & Masters of Text

I am an academic librarian who has earned permanence – which is the word we use at the University of Windsor to describe the librarian-version of tenure. When I was hired, there was no explicit requirement for librarians to publish in peer-reviewed journals. Nowadays, newly hired librarians at my place of work have an expectation to create peer-reviewed scholarship, although the understanding of how much and what kinds of scholarship count has not been strictly defined.

While I have written a few peer reviewed articles, most of my writing for librarians has been on this particular blog (since 2016) and for ten years prior to that at New Jack Librarian (with one article, in between blogs, hosted on Medium).

On my official CV that I update and submit to my institution every year, these peer-reviewed articles are listed individually. Under “Non-refereed publications”, I have a single line for each of my blogs. And yet, I have done far more writing on blogging platforms than in my peer-reviewed work (over 194K words from 2006-2016 alone). And my public writing has been shared, saved, and read many, many more times than my peer-reviewed scholarship.

Now, as I have previously stated, I already have permanence. So why should I care if my blog writing counts in my work as an academic librarian?

That was my thinking, so I didn’t care. That is, until a couple of weeks ago when a podcast changed my mind.

That podcast was Hannah McGregor’s Secret Feminist Agenda.

Secret Feminist Agenda is a weekly podcast about the insidious, nefarious, insurgent, and mundane ways we enact our feminism in our daily lives.

About, Secret Feminist Agenda

McGregor’s podcast is part of a larger SSHRC-funded partnership called Spoken Web that “aims to develop a coordinated and collaborative approach to literary historical study, digital development, and critical and pedagogical engagement with diverse collections of spoken recordings from across Canada and beyond”.

The episode that changed my mind was 3-26, which is dedicated largely to a conversation between McGregor and another academic and podcaster, Ames Hawkins. In it, there were two particular moments that reconfigured my thinking.

The value of creative practice

The first was when their conversation turned to scholarly creative practice (around the 16 minute mark):

What did we learn about scholarly podcasting… How and when and where we create new knowledge, that’s what we call scholarship, generally, right?

Secret Feminist Agenda, 3.26

Their conversation about what counts as scholarship and how it can be valued is a great listen. And it opened the possibility in my mind to consider this writing a form of creative, critical work.

While most of my public writing is explanatory or persuasive in nature, there is definitely a subset of my work that I would consider a form of creative practice. I know that these works are creative because when I sit down to write them, I don’t have an idea of the final form of the text until it is finished. I am compelled to work through ideas that I feel might have something to them, but the only way to tell is to get closer.

These more creative writings tend to be my least-popular works that are never shared by others on social media. Examples include How Should Reality Be (my own version of Reality Hunger) and G H O S T S T O R I E S.

And yet, those writings were necessary precursors to later works that were built from those first iterations and have ended up being well-received: Libraries are for use. And by use, I mean copying and Haunted libraries, invisible labour, and the librarian as an instrument of surveillance. These second iterations are more formal but not works of formal scholarship. I still think they fall under the category of creative, critical work.

Both writing and conversation can act as a practice to discover and uncover ideas in a way that feels very different than the staking of intellectual territory and making claims, like so much scholarship.

Using tenure to break space open

The second passage that struck me comes in at the 52:26 mark, when Hannah tells this story:

Hannah: I met a prof at the Modernist Studies Association Conference a few years ago who was telling me that he does a comic book podcast with a friend of his and they’ve been doing it for years and it has quite a popular following, and I was like, “Oh, awesome! Do you count that as your scholarly output?” and he said “No, I don’t need to. I have tenure.” And I was like, “Well, but, couldn’t you use tenure as a way to break space open for those who don’t but want to be doing that kind of work? Isn’t there another way to think about what it means to have security as a position from which you can radicalize?”, but that so often doesn’t seem to prove to be the case.

Ames: “Well, and now we’re back to that’s feminist thinking – what you said there and what that person is illustrating is not feminist thinking…”

Secret Feminist Agenda, 3.26

Oof. Hearing that was a bit of a gut-punch.

I can and will do better.

That being said, I’m not entirely sure how my corpus of public writing should be accounted for. Obviously, the volume of words produced is not an appropriate measure. Citation counts from scholarly works might be deemed a valuable measure, but because many scholars deliberately exclude public writing from their bibliographies, I feel this metric systematically undervalues this type of writing. And while page views and social media counts should stand for something, I don’t think you can make the case that popularity is equivalent to quality.

Secret Feminist Agenda goes through its own form of peer review.

I would love to see something similar for library blogs such as my own. There is work that needs to be done.