Planet Code4Lib

Auditing The Integrity Of Multiple Replicas / David Rosenthal

The fundamental problem in the design of the LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:
  • Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were under copyright, and the copyrights were owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
  • Unreliable, mutually untrusting systems was a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
  • Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation and subjected us to criminal liability under the DMCA.
Our approach, published now more than 16 years ago, was to have each node in the network compare its content with the consensus among a randomized subset of the other nodes holding the same content. They did so via a peer-to-peer protocol based on proof-of-work, in some respects one of the many precursors of Satoshi Nakamoto's Bitcoin protocol.

Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many for obvious economic reasons. Back then there were integrity audit systems developed that didn't need an excess of replicas, including work by Mehul Shah et al, and Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.

A decade and a half later, the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's Multiple‐replica integrity auditing schemes for cloud data storage provides a useful review of the current state-of-the-art. Below the fold, a discussion of their work and some related research.

Their abstract reads:
Cloud computing has been an essential technology for providing on‐demand computing resources as a service on the Internet. Not only enterprises but also individuals can outsource their data to the cloud without worrying about purchase and maintenance cost. The cloud storage system, however, is not fully trustable. Cloud data integrity auditing is crucial for defending against the security threats of data in the untrusted multicloud environment. Storing multiple replicas is a commonly used strategy for the availability and reliability of critical data. In this paper, we summarize and analyze the state‐of‐the‐art multiple‐replica integrity auditing schemes in cloud data storage. We present the system model and security threats of outsourcing data to the cloud with classification of ongoing developments. We also summarize the existing data integrity auditing schemes for multicloud data storage. The important open issues and potential research directions are addressed.

System Architecture

There are three possible system architectures for auditing the integrity of multiple replicas:
  • As far as I'm aware, LOCKSS is unique in using a true peer-to-peer architecture, in which nodes storing content mutually audit each other.
  • In another possible architecture the data owner (DO in Yangfei Lin et al's nomenclature) audits the replicas.
  • Yangfei Lin et al generally consider an architecture in which a trusted third party audits the replicas on behalf of the DO.

Proof-of-Possession vs. Proof-of-Retrievability

There are two kinds of audit:
  • A Proof-of-Retrievability (PoR) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed and every bit was intact.
  • A Proof-of-Possession (PoP) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed, but not that every bit was intact. The paper uses the acronym PDP for Provable Data Possession.

Immutable, Trustworthy Storage

The reason integrity audits are necessary is that storage systems are neither reliable nor trustworthy, especially at scale. Some audit systems depend on storing integrity tokens, such as hashes, in storage which has to be assumed reliable. If the token storage is corrupted, it may be possible to detect but not recover from the corruption. It is generally assumed that, because the tokens are much smaller than the content to whose integrity they attest, they are correspondingly more reliable. But it is easy to forget that both the tokens and the content are made of the same kind of bits, and that even storage protected by cryptographic hardware has vulnerabilities.

Encrypted Replicas

In many applications of cloud storage it is important that confidentiality of the data is preserved by encrypting it. In the digital preservation context, encrypting the data adds a significant single point of failure, the loss or corruption of the key, so is generally not used. If encryption is used, some means for ensuring that the ciphertext of each replica is different is usually desirable, as is the use of immutable, trustworthy storage for the decryption keys. The paper discusses doing this via probabilistic encryption using public/private key pairs, or via symmetric encryption using random noise added to the plaintext.

If the replicas are encrypted they are not bit-for-bit identical and thus their hashes will be different whether they are intact or corrupt. Thus a homomorphic encryption algorithm must be used:
Homomorphic encryption is a form of encryption with an additional evaluation capability for computing over encrypted data without access to the secret key. The result of such a computation remains encrypted.
In Section 3.3 Yangfei Lin et al discuss two auditing schemes based on homomorphic encryption:
One of the schemes they discuss uses Paillier encryption, another homomorphic technique. Paillier is additively homomorphic, where unpadded RSA is only multiplicatively homomorphic, which is why Paillier suits schemes that must sum over encrypted blocks.
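
Paillier's defining property, that the product of two ciphertexts decrypts to the sum of the plaintexts, can be sketched in a few lines. This is a toy illustration of the primitive only, with small primes, not an implementation of any of the audit schemes the paper reviews:

```python
import math
import random

# Toy Paillier keypair with small, well-known primes. Illustration only;
# real deployments use primes of 1024+ bits.
p, q = 999983, 1000003
n = p * q
n2 = n * n
g = n + 1                                           # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(u):
    return (u - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)   # modular inverse (Python 3.8+)

def encrypt(m):
    r = random.randrange(2, n)        # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

a, b = 12345, 67890
ca, cb = encrypt(a), encrypt(b)
# Additive homomorphism: multiplying ciphertexts adds plaintexts.
assert decrypt((ca * cb) % n2) == a + b
```

Because encryption is randomized, two replicas encrypted under the same key are not bit-for-bit identical, yet an auditor can still combine ciphertexts and check the decrypted sum.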


If an audit operation is not to involve downloading the entire content, it must involve the auditor requiring the system storing the replica to perform a computation that:
  • The storage system does not know the result of ahead of time.
  • Takes as input part (PoP) or all (PoR) of the replica.
Thus, for example, asking the replica store for the hash of the content is not adequate, since the store could have pre-computed and stored the hash, rather than the content.

PoP systems can, for example, satisfy these requirements by requesting the hash of a random range of bytes within the content. PoR systems can, for example, satisfy these requirements by providing a random nonce that the replica store must prepend to the content before hashing it. It is important that, if the auditor pre-computes and stores these random values, they be kept secret from the replica stores. If the replica store discovers them, it can pre-compute the responses to future audit requests and discard the content without detection.
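
Both challenge styles can be sketched in a few lines. In this minimal model the auditor holds its own copy of the content, as LOCKSS nodes do; an auditor without a replica would have to pre-compute the expected responses instead:

```python
import hashlib
import secrets

# Content held by both the auditor and the replica store in this toy model.
content = b"some preserved content " * 1000

def por_response(replica, nonce):
    # PoR-style: hash a fresh nonce prepended to the whole replica,
    # so the store cannot answer from a pre-computed hash alone.
    return hashlib.sha256(nonce + replica).hexdigest()

def pop_response(replica, start, length):
    # PoP-style: hash a randomly chosen byte range of the replica.
    return hashlib.sha256(replica[start:start + length]).hexdigest()

# PoR audit: the nonce is generated afresh at audit time.
nonce = secrets.token_bytes(32)
claimed = por_response(content, nonce)                  # computed by the store
expected = hashlib.sha256(nonce + content).hexdigest()  # computed by the auditor
assert claimed == expected

# PoP audit: challenge a random 100-byte range.
start = secrets.randbelow(len(content) - 100)
assert pop_response(content, start, 100) == \
    hashlib.sha256(content[start:start + 100]).hexdigest()
```

Note that the PoR response covers every bit of the replica, while the PoP response only demonstrates possession of the challenged range.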

Unfortunately, it is not possible to completely exclude the possibility that a replica store, or a conspiracy among the replica stores, has compromised the storage holding the auditor's pre-computed values. An ideal auditor would generate the random values afresh at each audit, rather than pre-computing them. Alas, this is typically possible only if the auditor has access to a replica stored in immutable, trustworthy storage (see above). In the mutual audit architecture used by the LOCKSS system, the nodes do have access to a replica, albeit not in reliable storage, so the random nonces the system uses are generated afresh for each audit.

It is an unfortunate reality of current systems that, over long periods, preventing secrets from leaking and detecting in a timely fashion that they have leaked are both effectively impossible.

Auditing Dynamic vs. Static Data

In the digital preservation context, the replicas being audited can be assumed to be static, or at least append-only. The paper addresses the much harder problem of auditing replicas that are dynamic, subject to updates through time. In Section 3.2 Yangfei Lin et al discuss a number of techniques for authenticated data structures (ADS) to allow efficient auditing of dynamic data:
There are three main ADSs: rank-based authenticated skip list (RASL), Merkle hash tree (MHT), and map version table (MVT).
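
Of these, the Merkle hash tree is the easiest to sketch: the auditor keeps only the root hash, and a store proves possession of any one block with a logarithmic-size path of sibling hashes. A minimal illustration with toy block data (no dynamic updates, no block signatures):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Build a Merkle tree bottom-up; returns the levels, leaves first."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:                 # duplicate the last node if odd
            level = level + [level[-1]]
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def prove(levels, index):
    """Collect sibling hashes from leaf to root for one block."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2:
            level = level + [level[-1]]
        sib = index ^ 1
        proof.append((level[sib], sib < index))  # (hash, sibling-is-left?)
        index //= 2
    return proof

def verify(root, block, proof):
    node = h(block)
    for sib, is_left in proof:
        node = h(sib + node) if is_left else h(node + sib)
    return node == root

blocks = [b"block-%d" % i for i in range(5)]
levels = build_tree(blocks)
root = levels[-1][0]                       # the only state the auditor keeps
proof = prove(levels, 3)
assert verify(root, b"block-3", proof)     # intact block verifies
assert not verify(root, b"tampered", proof)  # corrupted block does not
```

Dynamic schemes like MR-MHT and MuR-DPA layer replica and version information onto this basic structure so that block updates only touch one root-to-leaf path.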


Yangfei Lin et al summarize the seven methods they describe in Table 4:

Adapted from Yangfei Lin et al: Table 4
| Scheme | Data Structure | Public Verifiability | Dynamic Data | Notes |
|---|---|---|---|---|
| MR-PDP | - | NO | NO | Pseudo-random functions for masking, BLS signatures |
| EMC-PDP | - | YES | NO | Symmetric encryption diffusion, BLS signatures |
| DMR-PDP | - | NO | YES | Paillier encryption |
| FHE-PDP | - | YES | YES | Fully homomorphic encryption |
| MB-PMDDP | Map version table | YES | YES | Requires extra storage but less computation |
| MR-MHT | Merkle hash tree | YES | YES | Replica as a subtree |
| MuR-DPA | Merkle hash tree | YES | YES | Block replica as a subtree |

As you can see, only four of the seven satisfy both of their requirements and if I interpret this (my emphasis):
an overall comparison summary among the existing replicated data possession auditing schemes ... is shown in Table 4.
correctly all are PoP not PoR.

But What About Blockchain?

In discussions of integrity verification these days, the idea of using a blockchain is inevitably raised. A recent example is R&D: Blockchain-Based Publicly Verifiable Cloud Storage by Nachiket Tapas et al. Their abstract reads:
Cloud storage adoption, due to the growing popularity of IoT solutions, is steadily on the rise, and ever more critical to services and businesses. In light of this trend, customers of cloud-based services are increasingly reliant, and their interests correspondingly at stake, on the good faith and appropriate conduct of providers at all times, which can be misplaced considering that data is the "new gold", and malicious interests on the provider side may conjure to misappropriate, alter, hide data, or deny access. A key to this problem lies in identifying and designing protocols to produce a trail of all interactions between customers and providers, at the very least, and make it widely available, auditable and its contents therefore provable. This work introduces preliminary results of this research activity, in particular including scenarios, threat models, architecture, interaction protocols and security guarantees of the proposed blockchain-based solution.
[Image: Ether miners, 7/9/19]
As usual, Tapas et al simply assume the Ethereum blockchain is immutable, secure and always available, despite the fact that its security depends on its being decentralized, which it isn't. Since the infrastructure on which their edifice is based does not in practice provide the conditions their system depends upon, it would not in practice provide the security guarantees they claim for it.

Everything they want from the Ethereum blockchain could be provided by the same kind of verifiable logs as are used in Certificate Transparency, thereby avoiding the problems of public blockchains. But doing so would face insuperable scaling problems under the transaction rates of industrial cloud deployments.

WikiConference North America 2019: Reliability / HangingTogether

Attendees of WikiConference North America 2019

Over the weekend I attended WikiConference North America in Cambridge, Massachusetts. This was my fourth time participating in this meeting, which is a wonderful gathering for Wikimedians as well as librarians, educators and others interested in open access to information. This meeting is purposefully expansive, including colleagues from Mexico, the Caribbean and Canada. It was wonderful to see so many participants from the Caribbean, possibly even more than from Mexico or Canada, likely due to the emergence of a new and lively Wikimedians of the Caribbean User Group.

This year the conference was held in conjunction with the Credibility Coalition and featured a “credibility summit” with participants from Google, Facebook and Microsoft alongside members of the Wikimedia Movement. This convergence facilitated necessary and timely discussions on credibility, reliability and the role that these organizations and communities play in combating fake information on the internet.

There was a contingent of librarians / Wikibrarians attending the conference, several talks that touched on Wikicite, and many talks on Wikidata.  From my perspective, notable talks included:

  • Presenters from Vanderbilt University Library walked through how they are considering using Wikidata (or Wikibase) as a Research Information Management (RIM) system. [notes]
  • Former OCLC Wikipedian-in-Residence and OCLC WebJunction instructor Monika Sengul-Jones gave a lightning talk on our recent Wikipedia + Libraries: Health and Medical Information course.
  • Will Kent of WikiEdu facilitated a discussion on Building a Wikidata Curriculum. This was fed by lessons learned, both by Will and his WikiEdu collaborators and others in the audience who have been teaching Wikidata to others. The session etherpad has a number of good resources for teaching Wikidata.
  • A Harvard libraries panel on digital humanities resources / “non traditional scholarship” included discussion about whether / how these resources might be used in a Wikipedia article and how you might use Wikidata to describe / model them. [notes]
  • A presentation from University of New Mexico explored how Wikipedia might stand in for electronic resources (specifically, to let selectors do analysis on children’s literature). Could they use Wikipedia articles instead of subscription resources, particularly when selectors are interested in finding books that center non-dominant cultures? [slides | article]. I’ve been looking at lists like 1000 Black Girl Books and those featured on We Need Diverse Books and have been thinking about how these resources show up (or, mostly don’t) in Wikipedia and Wikidata, so this is definitely a topic I’m interested in.
  • CiteUnseen is a Wikipedia user plugin that shows what sources are used to cite a Wikipedia article (books, websites, newspaper articles, etc.), and also flags sources that are from questionable sources.

I facilitated a discussion on “gap” projects and what tools / techniques those projects use. Some projects create simple crowd-sourced lists. Others leverage out-of-copyright topical encyclopedias or biographical dictionaries, push structured data gleaned from transcriptions of these sources into Wikidata, and then use Listeriabot to generate lists from them (here is one example from the Women in Red project). Other projects like Art+Feminism are focusing effort on articles that already exist but that are at risk, using a combination of Wikidata and information from Wikipedia articles. Overall I have been impressed by how many gap projects leverage Wikidata to identify and prioritize work.

The event was funded by the Credibility Coalition, Craig Newmark Philanthropies, and the Craig Newmark School of Journalism, along with support from the Knowledge Futures Group, MIT Open Learning, and many greater Boston area arts organizations. Many thanks are due to the funders, the many volunteers and the program committee for making this fun and thought-provoking meeting possible. I look forward to next year’s event, which will be hosted by Wikimedia Canada.

The post WikiConference North America 2019: Reliability appeared first on Hanging Together.

Guest Post: Is the US Supreme Court in lockstep with Congress when it comes to abortion? / Harvard Library Innovation Lab

This guest post is part of the CAP Research Community Series. This series highlights research, applications, and projects created with Caselaw Access Project data.

Abdul Abdulrahim is a graduate student at the University of Oxford completing a DPhil in Computer Science. His primary interests are in the use of technology in government and law and developing neural-symbolic models that mitigate the issues around interpretability and explainability in AI. Prior to the DPhil, he worked as an advisor to the UK Parliament and a lawyer at Linklaters LLP.

The United States of America (U.S.) has seen declining public support for major political institutions, and a general disengagement with the processes or outcomes of the branches of government. According to Pew's Public Trust in Government survey earlier this year, "public trust in the government remains near historic lows," with only 14% of Americans stating that they can trust the government to do "what is right" most of the time. We believed this falling support could affect the relationship between the branches of government and the independence they might have.

One indication of this was a study on congressional law-making which found that Congress was more than twice as likely to overturn a Supreme Court decision when public support for the Court was at its lowest compared to its highest level (Nelson & Uribe-McGuire, 2017). Furthermore, another study found that it was more common for Congress to legislate against Supreme Court rulings that ignored legislative intentions or rejected positions taken by federal, state, or local governments, due to ideological differences (Eskridge Jr, 1991).

To better understand how the interplay between the U.S. Congress and Supreme Court has evolved over time, we developed a method for tracking the ideological changes in each branch using word embeddings trained on text corpora generated from each. For the Supreme Court, we used the opinions for the cases provided in the CAP dataset, though we extended this to include other federal court opinions to ensure our results were stable. As for Congress, we used the transcribed speeches of the Congress from Stanford's Social Science Data Collection (SSDS) (Gentzkow & Taddy, 2018). We use the case study of reproductive rights (particularly, the target word "abortion"), which is arguably one of the more contentious topics ideologically divided Americans have struggled to agree on. Over the decades, we have seen shifts in the interpretation of rights by both the U.S. Congress and Supreme Court that have arguably led to the expansion of reproductive rights in the 1960s and a contraction in the subsequent decades.

What are word embeddings? To track these changes, we use a quantitative method for tracking semantic shift from computational linguistics, based on the co-occurrence statistics of words in our corpora of Congress speeches and the Court's judicial opinions. The resulting representations are known as word embeddings: distributed representations of text that are perhaps one of the key breakthroughs behind the impressive performance of deep learning methods on challenging natural language processing problems. Using the text corpus as a proxy, word embeddings allow us to see how each institution has ideologically leaned over the years on the issue of abortion, and whether any particular case led to an ideological divide or alignment.

For a more detailed account on word embeddings and the different algorithms used, I highly recommend Sebastian Ruder's "On word embeddings".

Our experimental setup In tracking the semantic shifts, we evaluated a couple of approaches using the word2vec algorithm. Conceptually, we formulate the task of discovering semantic shifts as follows. Given a time-sorted corpus: corpus 1, corpus 2, …, corpus n, we locate our target word and its meanings in the different time periods. We chose word2vec based on comparisons of the performance of different algorithms (count-based, prediction-based, and hybrids of the two) on a corpus of U.S. Supreme Court opinions. We found that although the choice of algorithm introduces variability in coherence and stability, the word2vec models show the most promise in capturing the wider interpretation of our target word. Between the two word2vec algorithms, Continuous Bag of Words (CBOW) and Skip-Gram Negative Sampling (SGNS), we observed similar performance; however, the latter showed more promising results in capturing case law related to our target word at a specific time period.

We tested one algorithm, a low-dimensional representation learned with SGNS, with both the incremental updates method (IN) and the diachronic alignment method (AL), giving results for two models: SGNS (IN) and SGNS (AL). In our implementation, we use parts of the Python library gensim and supplement this with implementations by Dubossarsky et al. (2019) and Hamilton et al. (2016b) for tracking semantic shifts. For the SGNS (AL) model, we extract regular word-context pairs (w,c) for each time slice and train SGNS on these separately. For the SGNS (IN) model, we similarly extract the regular word-context pairs (w,c), but rather than dividing the corpus and training on separate time bins, we train on the first time period and then incrementally add new words, update and save the model.

To tune our algorithm, we performed two main evaluations (intrinsic and extrinsic) on samples of our corpora, comparing the performance across different hyperparameters (window size and minimum word frequency). Based on these results, the parameters used were MIN = 200 (minimum word frequency), WIN = 5 (symmetric window cut-off), DIM = 300 (vector dimensionality), CDS = 0.75 (context distribution smoothing), K = 5 (number of negative samples) and EP = 1 (number of training epochs).

What trends did we observe in our results? We observed some notable trends from the changes in the nearest neighbours to our target word. The nearest neighbours to abortion indicate how the speakers or writers who generated our corpus associate the word and what connotations it might have within the group.
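
The ranking step behind these nearest-neighbour lists is simply cosine similarity between the target word's vector and every other word's vector in a given time slice. A toy sketch, with hand-written 3-dimensional vectors standing in for the trained 300-dimensional word2vec vectors (the words and values here are invented for illustration):

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product over the product of the norms.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest_neighbours(embeddings, target, k=3):
    """Rank all other words by cosine similarity to the target word."""
    tv = embeddings[target]
    scored = [(w, cosine(v, tv)) for w, v in embeddings.items() if w != target]
    return sorted(scored, key=lambda x: -x[1])[:k]

# Invented 3-d embedding for one time slice, for illustration only.
slice_1975 = {
    "abortion":  [0.9, 0.1, 0.3],
    "roe":       [0.8, 0.2, 0.4],
    "trimester": [0.7, 0.1, 0.2],
    "tariff":    [0.0, 0.9, 0.1],
}
top = nearest_neighbours(slice_1975, "abortion", k=2)
print([w for w, _ in top])   # topically related words rank above unrelated ones
```

Repeating this ranking per time slice, and comparing the lists across slices, is what surfaces the semantic shifts discussed below.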

To better assess our results, we conducted an expert interview with a Women's and Equalities Specialist to categorise the words as: (i) a medically descriptive word, i.e., it relates to common medical terminology on the topic; (ii) a legally descriptive word, i.e., it relates to case, legislation or opinion terminology; and (iii) a potentially biased word, i.e., it is not a legal or medical term and thus was chosen by the user as a descriptor.

Nearest Neighbours Table Key. Description of keys used to classify words in the nearest neighbours by type of terminology. These were based on the insights derived from an expert interview.

Table showing "Category", "Colour Code", and "Description" for groups of words.

A key observation we made on the approaches to tracking semantic shifts is that depending on what type of cultural shift we intend to track, we might want to pick a different method. The incremental updates approach helps identify how parts of a word's sense from preceding time periods change in response to cultural developments in the new time period. For example, we see how the relevance of Roe v. Wade (1973) changes across all time periods in our incremental updates model for the judicial opinions.

In contrast, the diachronic alignment approach better reflects what the issues of that specific period are in the top nearest neighbours. For instance, the case of Roe v. Wade (1973) appears in the nearest neighbours for the judicial opinions shortly after it is decided in the decade up to 1975 but drops off our top words until the decades up to 1995 and 2015, where the cases of Webster v. Reproductive Health Services (1989), Planned Parenthood v. Casey (1992) and Gonzales v. Carhart (2007) overrule aspects of Roe v. Wade (1973) — hence, the new references to it. This is useful for detecting the key issues of a specific time period and explains why it has the highest overall detection performance of all our approaches.

Local Changes in U.S. Federal Court Opinions. The top 10 nearest neighbours to the target word "abortion" ranked by cosine similarity for each model.

Table displaying "Incremental Updates" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

Table displaying "Diachronic Alignment" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

Local Changes in U.S. Congress Speeches. The top 10 nearest neighbours to the target word "abortion" ranked by cosine similarity for each model.

Table displaying "Incremental Updates" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

Table displaying "Diachronic Alignment" for the years 1965, 1975, 1985, 1995, 2005, and 2015.

These preliminary insights allow us to understand some of the interplay between the Courts and Congress on the topic of reproductive rights. The method also offers a way to identify bias and how it may feed into the process of lawmaking. As such, for future work, we aim to refine the methods to serve as a guide for operationalising word embedding models to identify bias, as well as the issues that arise when they are applied to legal or political corpora.

Librarianship at the Crossroads of ICE Surveillance / In the Library, With the Lead Pipe

In Brief
Information capitalism, the system where information, a historically, largely free and ubiquitous product of basic communication, is commodified by private owners for profit, is entrenched in our society. Information brokers have consolidated and swallowed up huge amounts of data, in a system that leaves data purchase, consumption, and use largely unregulated and unchecked. This article focuses on librarian ethics in the era of information capitalism, focusing specifically on an especially insidious arena of data ownership: surveillance capitalism and big data policing. While librarians value privacy and intellectual freedom, librarians increasingly rely on products that sell personal data to law enforcement, including Immigration and Customs Enforcement (ICE). Librarians should consider how buying and using these products in their libraries comports with our privacy practices and ethical standards.

By Sarah Lamdan


As a fellow librarian, I’m here to warn you: ICE is in your library stacks. Whether directly or indirectly, some of the companies that sell your library research services also sell surveillance data to law enforcement, including ICE (U.S. Immigration and Customs Enforcement). Companies like Thomson Reuters and RELX Group (formerly Reed Elsevier) are supplying billions of data points, bits of our personal information, updated in real time, to ICE’s surveillance program.1 Our data is being collected by library vendors and sold to the police, including immigration enforcement officers, for millions of dollars.

This article examines the privacy ethics conundrum raised by contemporary publishing models, where the very services libraries depend upon to fill their collections endanger patron privacy. In the offline world of paper collections and library stacks, librarians adhere to privacy ethics and practices to ensure intellectual freedom and prevent censorship. But librarians are unprepared to apply those same ethical requirements to digital libraries. As our libraries transition to largely digital collections2, we must critically assess our privacy ethics for the digital era.3 Where are the boundaries of privacy in libraries when several “data services”4 corporations that also broker personal data own the lion’s share of libraries’ holdings?

After describing library vendors’ data selling practices and examining how those practices affect privacy in libraries, this article concludes by suggesting that library professionals organize beyond professional organizations. Librarians can demand vendor accountability and insist that vendors be transparent about how they use, repackage, and profit from personal data.

An Overview of Vendors’ Data Brokering Work

The consolidation of library vendors in the digital age has created a library services ecosystem where several vendors own the majority of databases and services upon which libraries rely.5 This puts libraries at the whim of publishing giants like Elsevier, Springer, and Taylor and Francis. This article uses Thomson Reuters and RELX Group, major publishing corporations that own Westlaw and Lexis6 , as case studies to demonstrate how information consolidation and the rise of big data impact library privacy. Thomson Reuters and RELX Group do not just duopolize the legal research market, they are powerful players in many library collections. They own a bevy of news sources and archives, academic collections including ScienceDirect, Scopus, and ClinicalKey, and all of the Reed Elsevier journals.7 Companies like Thomson Reuters and RELX Group are gradually buying up information collections that libraries and their patrons depend upon.

In addition to selling research products, both Thomson Reuters and RELX are data brokers, companies that sell personal data to marketing entities and law enforcement including ICE.8 Data brokering is fast becoming a billion dollar industry. Personal information fuels the “Big Data economy,” a system that monetizes our data by running it through algorithm-based analyses to predict, measure, and govern peoples’ behavior.9 While data brokering for commercial gain (to predict peoples’ shopping habits and needs) is insidious, the sale of peoples’ data to law enforcement is even more dangerous. Brokering data to law enforcement fuels a policing regime that tracks and detains people based not on human investigation, but on often erroneous pools of data traded between private corporations and sorted by discriminatory algorithms.10 Big data policing disparately impacts minorities, creating surveillance dragnets in Muslim communities, overpolicing in black communities, and sustaining biases inherent in the U.S. law enforcement system.11 In the immigration context, big data policing perpetuates problematic biases with little oversight12 , resulting in mass surveillance, detention, and deportation.13

ICE pays RELX Group and Thomson Reuters millions of dollars for the personal data it needs to fuel its big data policing program.14 Thomson Reuters supplies the data used in Palantir’s controversial FALCON program, which fuses together a multitude of databases full of personal data to help ICE officers track immigrant targets and locate them during raids.15 LexisNexis provides ICE data that is “mission critical”16 to the agency’s program tracking immigrants and conducting raids at peoples’ homes and workplaces.17

Information Capitalism Drives Data Brokering

The new information economy is drastically changing vendors’ and libraries’ information acquisition, sales, and purchasing norms. For Thomson Reuters and RELX Group, data brokering diversifies profit sources as the companies transition their services from traditional publishing to become “information analytics” companies.18 These corporations are no longer the publishers that librarians are used to dealing with, the kind that focus on particular data types (academic journals, scientific data, government records, and other staples of academic, public, and specialized libraries). Instead, the companies are data barons, sweeping up broad swaths of data to repackage and sell. Libraries have observed drastic changes in vendor services over the last decade.

New business models are imperative for publishing companies that must maintain profits in a changing information marketplace. They are competing to remain profitable enterprises in an era where their traditional print publishing methods are less lucrative. To stay afloat financially, publishers are becoming predictive data analytics corporations.19 Publishers realize that the traditional publishing revenue streams from books and journals are unsustainable as those items become digital and open access.20 Reed Elsevier, one of the top five largest academic publishers, has been “managing down” its print publishing services to focus on more lucrative online data analytics products.21 Reed Elsevier’s corporation rebranded itself RELX Group and Morgan Stanley recategorized RELX Group as a “business company” instead of a “media group.”22

For publishers, changing their business models is imperative to survive in a world where information access is changing dramatically, and publishers are learning to maintain their market share in the new digital information regime.23 While print materials are less lucrative, publishers build technology labs, developing tools that stream and manipulate digital materials. Publishers like Thomson Reuters and RELX Group are finding new opportunities to consolidate and sell digital materials.24 Where information used to come in different shapes and sizes (papers, books, cassette tapes, photographs, paintings, newspapers, blueprints, and other disparate, irregular formats), it now flows in a single form, transmitted through fiber optic cables. Thomson Reuters and RELX Group are capitalizing on this new information form, buying up millions of published materials and storing them electronically to create digital data warehouses25 housed on servers. These new publishing enterprises are data hungry and do not discriminate between different types of data, be it academic, government, or personal. They want every data type, to compile as bundles of content to sell. Today’s library vendors are less like local bookstores and more like Costcos stocked with giant buckets of information.26 The new publishing company structure is a “big box” data store of library resources. Libraries buy bundles of journals, databases, ebooks, and other mass-packaged materials in “big deals.”27

The Costco-ization of publishing drives publishers to collect tons of data, and to make systems that will slice and dice the data into new types of saleable bundles. Thus, publishers morph into data analytics corporations, developing AI systems that parse huge datasets to gather statistics (“How many times does Ruth Bader Ginsburg say ‘social justice’ in her Supreme Court opinions?”) and predict trends (“How many three-pointers will Stephen Curry make in 2019?”).28

As their vendors’ service models shift, librarians have also shifted from being information owners whose collection development focuses on purchasing materials to information borrowers who rent pre-curated data bundles shared through subscription databases. In 2019, Roxanne Shirazi, a librarian at CUNY’s Grad Center, described the phenomenon of “borrowing” information from gigantic data corporations in a blog post titled The Streaming Library.29 Shirazi compares the modern library to a collection of video subscription streaming services (Hulu, Netflix, Amazon). Libraries subscribe to online collections, “streaming” resources that live within various corporate data collections without owning them. “…Libraries used to purchase materials for shared use […] those materials used to live on our shelves.” But libraries no longer own all of their research materials; they temporarily subscribe to them. Vendors can provide library resources, and make them disappear, at their whim.30

As lenders, library vendors do not end their relationships with libraries when they complete a sale. Instead, as streaming content providers, vendors become embedded in libraries. They are able to follow library patrons’ research activities, storing data about how people are using their services. When companies like Thomson Reuters and RELX Group are simultaneously library service providers and data brokers, they can access library patron data and repackage that data for profit.31 Library vendors collect more and more patron data as they develop services to track patron preferences and make collection development decisions.32 Librarians have long been concerned with the privacy implications of digital authentication features vendors put in products to help verify patron identities and track their use of online databases.33 When vendors that track library patrons also participate in data brokering, it is entirely possible that patron data is in the mix of personal data the companies sell as data brokers.34 Neither Thomson Reuters nor RELX Group has denied doing so.35 Furthermore, in 2018, both Thomson Reuters and RELX Group modified their privacy statements to clarify that they use personal data across their platforms, with business partners, and with third party service providers.36

In the current information economy, librarians increasingly lack leverage to confront powerful corporate vendors like Thomson Reuters and RELX Group.37 Information capitalism, the transition of industrialist capitalism to an economic system that assigns commercial value to knowledge, information, and data38, simultaneously intensifies privacy concerns in our libraries and empowers data corporations. As publishing conglomerates buy more and more data, libraries have little choice but to purchase their research products from these information monopolies. Data brokering is an especially threatening form of information capitalism, but other manifestations of information capitalism have also seeped into librarianship. When information sellers limit access to online content, put up paywalls, and charge exorbitant article processing charges (APCs), they profit from our patrons’ information needs and our roles as information providers.

We are beholden to information capitalism39, and our profession is captured by this new brand of digital warehouse-style publishing. If we want information, we must pay a premium to wealthy data barons. The power wielded by huge publishing companies makes it hard for the librarians who negotiate contracts with those companies to demand accountability. Librarians are in the awkward role of being, simultaneously, both “the biggest consumer of the materials [the corporations] sell as well as their biggest critics.”40 When librarians and their patrons try to bypass library vendors and provide open access to information, vendors have the power to stifle those demands. For instance, vendors sued the computer programmer who developed Sci-Hub, a website providing free access to scientific research and texts, forcing the website offline.41 Librarians envision a world where information is free, but live in a reality where they are largely captive to giant publishing companies.

Because personal data is the “big data” empire’s most valuable currency, sought by companies like Thomson Reuters and RELX Group, librarians should be especially concerned about vendors gathering personal data in libraries. Data brokering is a multi-billion dollar industry.42 Data brokering capitalizes on lax software and online platform privacy policies,43 scraping and saving troves of personal data to analyze or repackage it for sale. Thus, as publishers become data analytics firms, it is useful for libraries to consider whether they unwittingly fuel the data brokering industry.

Librarians’ Roles in Data Brokering

It is important to begin the discussion about librarians’ roles in patron privacy by drawing a line between privacy ethics and the “vocational awe” that pervades our profession.44 The idea that certain parts of librarians’ work and values are sacred and beyond critique45 is harmful to our profession. We are certainly not obligated to consider ourselves the lone fighters at the front lines of academic freedom or bold crusaders for a larger cause. Much of what librarians have written about protecting patrons’ digital privacy focuses on librarians’ responsibilities, saddling libraries and their staffs with the burden of privacy requirements.46 Library professional education programs teach librarians that they must protect their patrons from online research platforms (clearing caches, erasing patron profiles, logging out of online systems, and other custodial tasks) rather than demanding that corporations stop tracking and collecting data from library patrons. It is not a librarian’s responsibility to save patrons from digital surveillance; rather, it is incumbent upon software developers to protect user privacy in the research tools they create.

Rather than considering libraries the ultimate digital privacy saviors and library ethics as some glowing bastion that librarians are burdened with protecting, we can think of intellectual freedom and privacy ethics as one of many factors to consider when we choose which resources and tools to implement in our libraries. Library ethics are points upon which we should hold our vendors accountable, not obligations to internalize and carry on our backs. While there may be no absolute, ideal privacy solution for our libraries, privacy is something to keep in mind and add to the list of concerns we have about the form and function of modern publishing and research.

Indeed, it is not the job of libraries, but the obligation of library vendors, to ensure that patrons are not surveilled by library products. Beyond unfairly burdening librarians, post hoc efforts to contain invasive digital research tools in libraries are not as effective as preemptively incorporating privacy into library products. Libraries’ digital hygiene activities are mere attempts to clean up after library vendors that breach patron privacy. When patrons use library vendors’ products, librarians follow behind, erasing profiles, clearing personal data from vendor systems, and trying to erase patrons’ digital footprints. We take on the work of cleaning up after our vendors.

Instead, our vendors should be proactively protecting our patrons’ privacy. Privacy expert Ann Cavoukian coined the concept of “privacy by design” for the knowledge economy, believing that in the age of information capitalism, information capitalists should build privacy measures into their products by default. Cavoukian set out seven principles that have been adopted into law in other jurisdictions, including by the European Union (EU) in its General Data Protection Regulation (GDPR).47 The principles require that online services, including research tools and resources, be designed to proactively protect privacy. According to the principles, research products should default to privacy. Privacy should be embedded into research products’ design, with “end to end” privacy throughout the entire data lifecycle, from the moment data is created to its eventual disposal.48 These privacy measures should be transparent and clear to the end user. For instance, users should know where their data will end up, especially if their data may be packaged and resold in a data brokering scheme.49

While the EU has embraced privacy by design and required companies doing business in its member nations to adhere to the seven principles, there is no privacy by design requirement for research services in the U.S. This leaves U.S. librarians in an ethically complicated role as major information technology users who adhere to patron privacy standards. Librarians’ information access roles keep us at the forefront of technological advancement, as most information access occurs online.50 We are information technology’s early adopters,51 and we serve as gatekeepers to troves of online data collections. Oftentimes our role makes us information technology’s first critics, sounding warnings about products and practices that are oppressive to our patrons and that violate our ethical duties to protect patron privacy and intellectual freedom.52

As technology critics, we tend to focus on technologies a la carte, on a product-by-product basis.53 By homing in on specific products, companies, and practices, we’ve been able to condemn specific problems. We speak out against subscription fees and paywalls54 and e-book publishers’ give and take of online book collections.55 But scrutinizing specific products forecloses a holistic critique of library vendors. When we step back and view our vendors as a class, we can see a large-scale issue that looms over our profession: all of the world’s information is being consolidated by several gigantic data corporations. We must consider how vendors becoming “technology, content, and analytics” businesses56 threatens the daily work of libraries and the privacy of those we serve.

Even as library privacy is threatened by vendors, librarians’ abilities to influence vendors’ privacy practices are decreasing as publishing companies change their business models. Publishing and data companies’ new data products and new, non-library-based data access points (including websites and apps) have created scores of new, non-library customers. Our vendors depend less on library customers as they diversify their customer base and recognize that they can sell directly to researchers without relying on library gatekeepers. In the last decade, Thomson Reuters has been criticized for trying to work around law librarians. The company even issued a controversial ad saying that patrons on a first-name basis with their librarians are “spending too much time at the library” when they should use Westlaw from their offices instead.57 Through anti-competitive pricing schemes and sales practices, Lexis has similarly demonstrated its decreasing consideration of librarians in its marketing and sales plans.58 Librarians and their needs are getting pushed towards the back of the customer service queue. Declining library-vendor relations59 reduce librarians’ opportunities to participate in vendor decision making.

Librarians cannot count on government intervention to protect library privacy in the digital age. While most states officially recognize and regulate library privacy,60 the information capitalism that incentivizes data brokering has gone largely unchecked. Federal and state governments do little to regulate information capitalism. The Federal Trade Commission has tried to break RELX Group’s monopoly on data brokering,61 but there is no comprehensive regulatory scheme in place to prevent the consolidation of information by several private entities or the unauthorized sale of personal data to law enforcement. Without regulation, library professionals are left to deal with vendors who flout privacy best practices and threaten patron privacy. Librarians should not be responsible for fixing vendor privacy practices. Instead, they should condemn them.

Solutions: Organizing Against Library Surveillance

While librarians’ relationships with their vendors may be changing, librarians still wield power as information consumers. Librarians can organize to 1) demand accountability from our vendors, and 2) insist on transparency to ensure that vendors comply with our ethics.

There are two major privacy issues raised by data brokers working as library vendors, and librarians can organize around both. The first is that the money libraries pay for products helps vendors develop surveillance products. The second is that the data patrons provide while using vendors’ products in libraries could be sold to law enforcement. These are two discrete problems that impact patron privacy, and vendors should be prepared to address both with librarians. Together, they could be the difference between library privacy and libraries as surveillance hubs.

If library vendors sell our patron data to the government, we are essentially inviting surveillance into our libraries. When libraries pay data brokering publishing giants to enter their libraries and serve their patrons without ensuring that patron data will not be included in data brokering products, the government does not even have to ask librarians to track researchers. Government agencies can enter libraries electronically, inserting government surveillance in the Trojan horse of online research tools. Or they can buy the data collected by the information companies, as ICE does with Thomson Reuters and RELX Group.

If libraries are funding the research and development of surveillance products with their subscription fees, they are spending money, often provided by patrons’ membership fees or taxes, on companies that use the income to build surveillance infrastructure targeting people and communities that may include library patrons. For instance, in law librarianship, law libraries collectively pay millions of dollars for Lexis and Westlaw each year. According to Thomson Reuters and RELX Group’s annual reports, that money is not kept in a separate pool of profits. It ostensibly funds their growing technology labs that create data analytics products and helps the companies afford scores of private data caches sold by smaller data brokering services. Especially in the post-9/11 surveillance regime, information vendors have been fighting for spots in the booming surveillance data markets.62

Publishers like RELX Group are experts at cornering information markets. They’ve already bought the lion’s share of our academic publishing resources,63 from products where scholars incubate their research to the journals that publish the research after peer review, and even the post-publication “research evaluation” products and archives. The companies cash in at every step of academic research, profiting from academics’ free labor.64 Thomson Reuters and Reed Elsevier are similarly cornering the legal information market. Beyond owning legal research products, they’re selling the surveillance products that help law enforcement track, detain, and charge people with crimes. When those swept up in law enforcement surveillance inevitably need lawyers, the lawyers use Westlaw and Lexis to represent them. The publishing companies transform legal research profits into products that help law enforcement create more criminal and immigration law clients.65

Librarians have the right to demand accountability from vendors about where patron data and subscription fees are being used. As major customers, libraries can demand that the products they purchase maintain their ethical standards. Libraries do not have to sacrifice ethics and privacy norms for corporations like RELX Group and other information capitalists. We can research and learn about our products and their corporate purveyors and consider our privacy and intellectual freedom principles in relation to the things we buy. We should be able to discover what information our products are collecting about our patrons and who, if anyone, is using that personal data. We should also be able to find out what types of products our subscription fees support. Is the money we pay for library services supporting the research and development of police surveillance products? If it is, we should be able to make purchasing decisions with that surveillance relationship in mind.

To facilitate informed purchasing decisions, libraries can demand information about vendors’ practices. Requiring disclosures about our vendors’ research and product infrastructure should be part of doing business with data companies. With more transparency, librarians can assess which products are better at ensuring patron privacy and supporting intellectual freedom. The ethical conundrums raised by these products are multifaceted: Are we risking privacy and breaking our own ethical code? Are we funding unethical supply chains that harm people and violate ethics in the production of their products? If we are betraying the tenets of intellectual freedom, we must divest. Some library patrons, including University of California San Francisco faculty66 and thousands of mathematicians, have already advocated for boycotting and divesting from companies like RELX Group over pricing practices.67 Universities are beginning to drop their Elsevier contracts, and thousands of scholars are protesting Elsevier over the company’s “exorbitantly high prices.”68 Activism around pricing suggests that, rather than relying on corporations with sketchy practices, librarians can support alternative companies and startups, or create their own resources, open access consortia, and search options as alternatives to companies involved in ICE surveillance. When powerful academic institutions like the University of California divest from RELX Group’s Elsevier products, it shows that large libraries can lead the way in pushing back against problematic vendor practices.

Importantly, holding vendors accountable should happen beyond the confines of library professional organizations, which are largely funded by the very vendors we need to hold accountable. Organizations that usually serve as librarians’ organizing hubs depend so thoroughly on funding from corporate vendors that they are not the best venues for criticizing library products and the corporations that sell them. Although the connections between research products and law enforcement surveillance unearth huge privacy concerns for libraries, professional library organizations are loath to discuss those concerns. Fighting corporate privacy issues may look the same as fighting FBI or other government surveillance to library professionals in their daily work (surveillance is surveillance whether it’s being conducted by the FBI or through RELX Group), but our professional organizations treat corporate and government practices very differently.

Historically, library organizations have fought alongside librarians against government surveillance in libraries. The American Library Association (ALA) has protested government surveillance in libraries, decrying the PATRIOT Act’s Sections 215 and 505, provisions that give the federal government sweeping authority to surveil people and obtain people’s library records.69 In fact, ALA and its members’ protests were so persistent that FBI agents called librarians “radical” and “militant,” and U.S. Attorney General John Ashcroft decried librarians as “hysterical.”70 ALA pushed back, partnering with the American Civil Liberties Union (ACLU) to deploy anonymous browsing tools and other resources to protect library patrons’ privacy.71

Library organizations’ reactions to corporate surveillance, so far, have been much different. A blog post about library privacy and research vendors’ participation in ICE surveillance titled “LexisNexis’s Role in ICE Surveillance and Librarian Ethics” was taken down from the American Association of Law Libraries (AALL) website within minutes of being posted, replaced by a message stating: “This post has been removed on the advice of AALL General Counsel.”72 While professional library organizations are comfortable standing up to the government when it threatens library patron privacy, the same organizations are not prepared to stand up to library vendors for the same privacy invasions.

There are several reasons for the disparate ways library organizations react to government surveillance versus vendor surveillance. The main rationale offered by AALL when it removed the blog post critiquing legal research vendors was that vendors are equal members in the organization and that the critique of their relationships with ICE amounted to “collective member actions” that raise antitrust issues. This rationale is nonsensical, implying that librarians voicing concerns about Thomson Reuters’ and RELX Group’s ICE contracts is akin to a group boycott designed to stifle competition among legal research vendors.73 This improbable excuse was likely a smokescreen designed to stop AALL members from upsetting the organization’s key donors. AALL relies on Thomson Reuters and RELX Group to sponsor its activities and scholarship programs. When library vendors are middlemen between library patrons and government surveillance, librarians may be prohibited from critiquing vendor practices in professional organizations’ forums.

The next wave of privacy concerns will come from our vendors and information sources, and confronting them will require librarians to organize resistance outside of their professional organizations. As we begin to do this organizing work, we should keep track of the ways our vendors are changing and what that means for our ethical standards. This article focuses on surveillance, but it’s not the only issue that arises when publishers become data corporations. Librarians must either drop their privacy pretenses or create privacy policies that push back against information capitalism and data barons. Privacy is a new supply chain ethics problem, and librarians are caught in its wake as major information technology purchasers and providers, promoters and gatekeepers. Privacy settings in digital products should be the default.74 Unfortunately, privacy defaults are aspirational, and largely unimplemented. When dealing with information corporations hungry for data to put on their warehouse shelves, for bundling and selling to new customers, librarians can make it clear that the surveillance work these companies do is forbidden in our stacks.


The author would like to thank Kellee Warren, Scott Young, and Ian Beilin for their thoughtful edits and for sagely shepherding this article through the peer review process. She would also like to thank Yasmin Sokkar Harker, Nicole Dyszlewski, Julie Krishnaswami, Rebecca Fordon, and the many other law librarians who have offered feedback, advice, and support throughout this research process.

The author also recognizes and applauds the critical work and purpose of In the Library with the Lead Pipe. Its role as an open access, peer reviewed library journal that supports creative solutions for major library issues makes the publication a vital part of our profession. The volunteer efforts of those who take on the challenge to “improve libraries, professional organizations, and their communities of practice by exploring new ideas, starting conversations, documenting our concerns, and arguing for solutions” are necessary for our sustenance and growth as information specialists and make discussions like the one in this article possible.


Amin, Kemi. (2019) “UCSF Faculty Launch Petition to Boycott Elsevier in Support of Open Access,” UCSF Library Blog (March 11, 2019).

Barclay, Donald A. (2015) “Academic Print Books are Dying. What’s the Future?” The Conversation (November 10, 2015).

Beverungen, Armin et al. (2012) “The Poverty of Journal Publishing.” Organization 19:6, 929-938.

Bintliff, Barbara et al. (2015) Fundamentals of Legal Research. Tenth Edition. Foundation Press.

Bowen Ayre, Lori. (2017) “Protecting Patron Privacy: Vendors, Libraries, and Patrons Each Have a Role to Play.” Collaborative Librarianship 9:1, Article 2.

Buranyi, Stephen. (2017) “Is the Staggeringly Profitable Business of Scientific Publishing Bad for Science?” The Guardian (June 27, 2017).

Bureau of Labor Statistics. (2016) Employment Trends in Newspaper Publishing and Other Media.

Carcamo, Cindy. (2019) “For ICE, Business as Usual is Never Business as Usual in an Era of Trump.” Los Angeles Times (Nov. 4, 2019).

Cavoukian, Ann. (n.d.) Privacy by Design: The 7 Foundational Principles.

Cohen, Dan. (2019) “The Books of College Libraries Are Turning Into Wallpaper.” The Atlantic (May 26, 2019).

Cookson, Robert. (2015) “Reed Elsevier to Rename Itself RELX Group.” Financial Times (February 26, 2015).

Craig, Brian P. (2009) “Law Firm Reference Librarian, a Dying Breed?” LLAGNY Law Lines Summer 2009.

Crotty, David. (2019) “Welcome to The Great Acceleration.” The Scholarly Kitchen (January 2, 2019).

Davis, Caroline. (2019) Print Cultures: A Reader in Theory and Practice. Red Globe Press.

Dixon, Pam. (2008) “Ethical Issues Implicit in Library Authentication and Access Management: Risks and Best Practices.” Journal of Library Administration 47:3-4, 141-162.

Dooley, Jim. (2016) “University of California, Merced: Primarily an Electronic Library.” In Suzanne M. Ward et al., eds., Academic E-Books: Publisher, Librarians, and Users, 93-106. Purdue University Press.

Dunie, Matt. (2015) “Negotiating With Content Vendors: An Art or A Science?” E-content Quarterly 1:4.

Durrance, Joan C. (2004) “Competition or Convergence? Library and Information Science Education at a Critical Crossroad.” Advances in Librarianship 28 (December 29, 2004), 171-198.

Elsevier Inc. et al v. Sci-Hub et al, No. 1:2015 cv 04282 (S.D.N.Y. 2015).

Elsevier Labs.

Enis, Matt. (2019) “Publishers Change Ebook and Audiobook Models; Libraries Look for Answers.” Library Journal (July 17, 2019).

Ettarh, Fobazi. (2018) “Vocational Awe and Librarianship: The Lies We Tell Ourselves.” In the Library with the Lead Pipe (January 10, 2018).

EU General Data Protection Regulation.

Federal Trade Commission. (2014) “Data Brokers: A Call For Transparency and Accountability.”

Federal Trade Commission. (2008) “FTC Challenges Reed Elsevier’s Proposed $4.1 Billion Acquisition of ChoicePoint, Inc.” (September 16, 2008).

Federal Trade Commission. “Group Boycotts.”

Funk, McKenzie. (2019) “How ICE Picks Its Targets in the Surveillance Age.” New York Times (October 2, 2019).

Gardiner, Carolyn C. (2016) “Librarians Find Themselves Caught Between Journal Pirates and Publishers.” Chronicle of Higher Education (February 18, 2016).

Government Security News. (2013) “ICE Will Utilize LexisNexis Databases to Track Down Fugitive […]” (September 11, 2013).

Gressel, Michael. (2014) “Are Libraries Doing Enough to Safeguard Their Patrons’ Digital Privacy?” The Serials Librarian 67:2, 137-42.

Guthrie Ferguson, Andrew. (2017) The Rise of Big Data Policing: Surveillance, Race, and the Future of Law Enforcement. NYU Press.

Hines, Shawnda. (2019) “ALA Launches National Campaign Against Ebook Embargo.” (September 11, 2019).

Hodnicki, Joe. (2018) “Does WEXIS Use Legal Research User Data in Their Surveillance Search Platforms?” Law Librarian Blog (July 16, 2018).

Hodnicki, Joe. (2018) “Early Coverage of AALL-LexisNexis Anticompetitive Tying Controversy.” Law Librarian Blog (June 15, 2018).

Hodnicki, Joe. (2017) “LexisNexis’s Role in ICE Surveillance and Librarian Ethics.” Law Librarian Blog (December 11, 2017).

Ignatow, Gabe. (2017) Information Capitalism, The Wiley‐Blackwell Encyclopedia of Globalization, G. Ritzer, Ed.

Inmon, Bill. (2005) Building the Data Warehouse. 4th Ed. Wiley.

Johnson, Peggy. (2018) Fundamentals of Collection Development and Management 4th ed. ALA Editions.

Joseph, George. (2019) “Data Company Directly Powers Immigration Raids in Workplace.” WNYC (July 16, 2019).

Justification for Other than Full and Open Competition, Solicitation No. HSCECR-13-F-00032.

Kalhan, Anil. (2014) “Immigration Surveillance.” Maryland Law Review 74:1, Article 2.

Kennedy, Bruce M. (1989) “Confidentiality of Library Records: A Survey of Problems, Policies, and Laws.” Law Library Journal 81:4, 733-767.

Kulp, Patrick. (2018) “Here’s How Publishers Are Opening Their Data Toolkits to Advertisers.” AdWeek (May 29, 2018).

Lambert, April et al. (2016) “Library patron privacy in jeopardy: an analysis of the privacy policies of digital content vendors.” Proceedings of the Association for Information Science and Technology (February 24, 2016).

Lamdan, Sarah. (2015) “Social Media Privacy: A Rallying Cry to Librarians.” The Library Quarterly 85:3, 261-277 (July 2015).

Lamdan, Sarah. (2019) “When Westlaw Fuels ICE Surveillance: Legal Ethics in the Era of Big Data Policing.” N.Y.U. Review of Law and Social Change 43:2, 255-293.

LexisNexis Privacy Statement.

Lipscomb, Carolyn E. (2001) “Mergers in the Publishing Industry.” Bulletin of the Medical Library Association 89:3, 307-308.

Michael, Mike and Deborah Lupton. (2015) “Toward a Manifesto for the ‘Public Understanding of Big Data.’” Public Understanding of Science 25(1), 104–116.

Mijente. (2018) “Who’s Behind ICE: The Tech and Data Companies Fueling Deportations.”

Morris-Suzuki, Tessa. (1986) “Capitalism in the Computer Age.” New Left Review 1:160, 81-91.

NARA Records Management Key Terms and Acronyms.

Noble, Safiya Umoja. (2018) Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.

Opher, Albert et al. (2014) The Rise of the Data Economy: Driving Value Through Internet of Things Data Monetization.

Palfrey, John and Urs Gasser. (2010) Born Digital: Understanding the First Generation of Digital Natives. Basic Books.

Peterson, Andrea. (2014) “Librarians won’t stay quiet about government surveillance.” Washington Post (October 3, 2014).

Pollock, Dan and Ann Michael. (2019) “Open Access Mythbusting: Testing Two Prevailing Assumptions About the Effects of Open Access Adoption.” Learned Publishing, 32: 7-12.

Posada, Alejandro and George Chen. (2017) “Publishers are increasingly in control of scholarly infrastructure and why we should care: A Case Study of Elsevier.” The Knowledge G.A.P.

Resnick, Brian and Julia Belluz. (2019) “The War To Free Science: How Librarians, Pirates, and Funders Are Liberating the World’s Academic Research from Paywalls.” Vox (July 10, 2019),

RIPS Law Librarian Blog. (2017) “Post Removed.” (December 5, 2017)

Roberts, Sarah T. (2019) Behind the Screen: Content Moderation in the Shadows of Social Media. Yale University Press.

Routley, Nick. (2018) “The Multi-Billion Dollar Industry That Makes Its Living From Your Data.” Visual Capitalist (April 4, 2018),

Selbst, Andrew D. (2017) “Disparate Impact in Big Data Policing.” Georgia Law Review 52:109, 109-195.

Shirazi, Roxanne. (2019) “The Streaming Library.” (August 27, 2019)

Shirazi, Roxanne (@RoxanneShirazi). (2019) Twitter (August 26, 2019, 7:08 PM),

SPARC*. (n.d.) “Big Deal Cancellation Tracking.”

Thomson Reuters. (2015) “Thomson Reuters to Launch Data and Innovation Lab in Waterloo, Ontario.” (September 16, 2015)

Thomson Reuters. (2018) “Privacy Statement.”

van Loon, Ronald. “RELX Group: A Transformation Story, Our Stories.”

van Loon, Ronald. “The Data Driven Lawyer: How RELX is Using AI to Help Transform the Legal Sector.”

Yeh, Chih-Liang. (2018) “Pursuing Consumer Empowerment in the Age of Big Data: A Comprehensive Regulatory Framework for Data Brokers.” Telecommunications Policy 42, 282–92.

Zhang, Sarah. (2019) “The Real Cost of Knowledge.” The Atlantic (March 4, 2019),

  1. Sarah Lamdan, When Westlaw Fuels ICE Surveillance: Legal Ethics in the Era of Big Data Policing, 43 NYU Review of Law & Social Change 255, 277 (2019),
  2. Libraries are trending towards digitized collections. For instance, University of California’s Merced campus transitioned to a 90% digital library according to its 2003 development plans. Jim Dooley, “University of California, Merced: Primarily an Electronic Library,” in Academic E-Books: Publishers, Librarians, and Users 93-106 (Suzanne M. Ward, et al. eds. 2016).
  3. April Lambert, et al., “Library patron privacy in jeopardy: an analysis of the privacy policies of digital content vendors,” Proceedings of the Association for Information Science and Technology (February 24, 2016).
  4. Alejandro Posada & George Chen, “Publishers are increasingly in control of scholarly infrastructure and why we should care: A Case Study of Elsevier,” The Knowledge G.A.P. (September 20, 2017),
  5. The phenomenon of library services consolidation is not new, but it has increased as library services move to online platforms. See Carolyn E. Lipscomb, “Mergers in the Publishing Industry,” Bulletin of the Medical Library Association (2001). Consolidation among vendors has changed the way libraries approach collection development and acquisition, pushing librarians from an a la carte model, where librarians pick their collection based on specific needs and titles, to a “big deal” model, where librarians buy huge bundles of information from only several publishers that own the lion’s share of library materials and user platforms. See Peggy Johnson, Fundamentals of Collection Development and Management 10-11 (4th ed. 2018).
  6. Westlaw and Lexis are the go-to digital collections and research tools of law librarianship. Barbara Bintliff, et al., Fundamentals of Legal Research, Tenth Edition (April 7, 2015).
  7. All Elsevier Digital Solutions,
  8. Federal Trade Commission, Data Brokers: A Call For Transparency and Accountability (2014),
  9. Albert Opher, et al., The Rise of the Data Economy: Driving Value Through Internet of Things Data Monetization (2014); Mike Michael & Deborah Lupton, Toward a Manifesto for the ‘Public Understanding of Big Data’, Public Understanding of Science (2015).
  10. Andrew Guthrie Ferguson, The Rise of Big Data Policing: Surveillance, Race, and the Future of Law Enforcement (2018).
  11. Andrew D. Selbst, Disparate Impact in Big Data Policing, 52 Georgia Law Review 109 (2017),
  12. Safiya Umoja Noble, Algorithms of Oppression (2018).
  13. Anil Kalhan, Immigration Surveillance, 74 Maryland Law Review 1, 6 (2014),
  14. McKenzie Funk, How ICE Picks Its Targets in the Surveillance Age, NY Times (October 2, 2019),
  15. George Joseph, Data Company Directly Powers Immigration Raids in Workplace, WNYC (July 16, 2019),
  16. Justification for Other than Full and Open Competition, Solicitation No. HSCECR-13-F-00032,
  17. ICE’s Fugitive Operations Support Center has contracted with LexisNexis for its Accurint databases since 2013. Government Security News, ICE Will Utilize LexisNexis Databases to Track Down Fugitive […] (September 11, 2013). ICE’s Fugitive Operations target and surveil immigrants, apprehending people in large sweeps. See Cindy Carcamo, For ICE, Business as Usual is Never Business as Usual in an Era of Trump, Los Angeles Times (November 4, 2019),
  18. Sarah Lamdan, When Westlaw Fuels ICE Surveillance: Legal Ethics in the Era of Big Data Policing, 43 N.Y.U. Review of Law and Social Change 255, 275 (2019).
  19. Donald A. Barclay, “Academic print books are dying. What’s the future?,” The Conversation (November 10, 2015); Dan Cohen, “The Books of College Libraries Are Turning Into Wallpaper,” The Atlantic (May 26, 2019),
  20. In 2016, the Bureau of Labor Statistics reported on the bleak employment outlook in the traditional publishing industry, showing that employment in the book, traditional news, and periodical industries had declined since 1990 as employment in the online and movie industries soared. Bureau of Labor Statistics, Employment Trends in Newspaper Publishing and Other Media (2016). Meanwhile, open access has become an increasingly normalized part of our information ecosystem. See Dan Pollock & Ann Michael, “Open Access Mythbusting: Testing Two Prevailing Assumptions About the Effects of Open Access Adoption,” The Association of Learned & Professional Society Publishers (January 2019).
  21. Ronald van Loon, RELX Group: A Transformation Story, Our Stories,
  22. Ibid.
  23. Caroline Davis, Print Cultures: A Reader in Theory and Practice 267 (2019).
  24. In 2015, Thomson Reuters opened its data and innovation lab in Ontario, Canada to develop machine learning and AI products. See Thomson Reuters, Thomson Reuters to Launch Data and Innovation Lab in Waterloo, Ontario (September 16, 2015). Similarly, RELX Group is developing laboratories to study and develop data analytics technologies. See Elsevier Labs,
  25. The concept of the “data warehouse” was originally conceived by computer scientist Bill Inmon. He envisioned data warehouses as centralized storage for large collections of data integrated from various sources. Bill Inmon, Building the Data Warehouse (4th ed. 2005).
  26. The National Archives calls large, aggregated datasets “big buckets.” NARA Records Management Key Terms and Acronyms,
  27. Big deal purchasing began in the late 1990s, when large publishers began offering libraries deals on aggregated bundles of content at a discount compared to purchasing titles individually. SPARC*, Big Deal Cancellation Tracking,
  28. One example of the ways publishers are monetizing their collections via data analytics is news media’s use of data analytics to gauge and analyze readers’ emotional reactions to news articles and monitor which topics resonate most with readers to better target marketing campaigns. See Patrick Kulp, “Here’s How Publishers Are Opening Their Data Toolkits to Advertisers,” AdWeek (May 29, 2018). RELX Group describes its big data technology as leveraging user data to analyze its digital collections, creating commercially viable analyses for profit. See Ronald van Loon, The Data Driven Lawyer: How RELX is Using AI to Help Transform the Legal Sector,
  29. Roxanne Shirazi, “The Streaming Library” (August 27, 2019),
  30. As libraries transition from paper collections to electronic collections, eBook vendors control collection access through pricing, limited time contracts, and other tactics made possible in a system where libraries do not own their materials but purchase licenses to stream materials from vendor databases. Matt Enis, Publishers Change Ebook and Audiobook Models; Libraries Look for Answers, Library Journal (July 17, 2019),
  31. Roxanne Shirazi (@RoxanneShirazi), Twitter (August 26, 2019, 7:08 PM). See also Sarah Lamdan, “When Westlaw Fuels ICE Surveillance: Legal Ethics in the Era of Big Data Policing,” 43 N.Y.U. Review of Law and Social Change 255, 290 (2019).
  32. Lori Bowen Ayre, “Protecting Patron Privacy: Vendors, Libraries, and Patrons Each Have a Role to Play,” 9 Collaborative Librarianship 1 (March, 2017),
  33. Pam Dixon, “Ethical Issues Implicit in Library Authentication and Access Management: Risks and Best Practices,” 47 Journal of Library Administration 141 (2008).
  34. Chih-Liang Yeh, “Pursuing Consumer Empowerment in the Age of Big Data: A Comprehensive Regulatory Framework for Data Brokers,” 42 Telecommunications Policy 282–92 (2018),
  35. Joe Hodnicki, Does WEXIS Use Legal Research User Data in Their Surveillance Search Platforms?, Law Librarian Blog (July 16, 2018),
  36. Thomson Reuters Privacy Statement; LexisNexis Privacy Statement,
  37. Matt Dunie, “Negotiating With Content Vendors: An Art or A Science?,” E-content in Libraries, A Marketplace Perspective (Sue Polanka, ed.), Library Technology Reports, ALA. This writing describes how libraries struggle to cover the rising cost of data bundles with decreasing budgets. Libraries must pay for the collections that patrons require by decreasing spending in other categories, like personnel and even the number of library branches. Yet, the vendors have discovered that their content is so critical that, despite rising prices, libraries continue to acquire content at the same rate. The reliance of libraries on vendor content gives vendors the leverage to set prices ever higher.
  38. Gabe Ignatow, “Information Capitalism” (2017).
  39. Tessa Morris-Suzuki describes the ways the growth of the “information economy” limits access to freely available information, placing once-accessible research and reporting behind paywalls and monetizing information that used to be considered a public good. Morris-Suzuki identifies libraries as the former hub for free information “paid for by society as a whole”, and describes how the commodification of information alters both the concept of libraries as spaces where information is not commodified, and also libraries’ access to information collections. Tessa Morris-Suzuki, “Capitalism in the Computer Age,” The New Left Review (1986),
  40. Carolyn C. Gardiner, “Librarians Find Themselves Caught Between Journal Pirates and Publishers,” Chronicle of Higher Education (February 18, 2016),
  41. Elsevier Inc. et al v. Sci-Hub et al, No. 1:2015 cv 04282 (S.D.N.Y. 2015).
  42. Nick Routley, “The Multi-Billion Dollar Industry That Makes Its Living From Your Data,” Visual Capitalist (April 4, 2018),
  43. Ann Cavoukian, Privacy by Design: The 7 Foundational Principles,
  44. Fobazi Ettarh, “Vocational Awe and Librarianship: The Lies We Tell Ourselves,” In the Library With a Lead Pipe (January 10, 2018),
  45. Ibid.
  46. See Michael Gressel, “Are Libraries Doing Enough to Safeguard Their Patrons’ Digital Privacy?,” 67 The Serials Librarian 137 (2014). Librarians are tasked with educating patrons about digital privacy hygiene and ensuring that every public access computer in their libraries be properly set up to protect patrons against software practices that violate privacy. However, it is not the responsibility of librarians to be masters of digital privacy; rather, corporations should be held accountable and made to design their products in a way that protects everyone, including library patrons. See Sarah Lamdan, “Social Media Privacy: A Rallying Cry to Librarians,” 85 The Library Quarterly (July 2015).
  47. EU General Data Protection Regulation. See also Ann Cavoukian, Privacy by Design: The 7 Foundational Principles,
  48. Ibid.
  49. Ibid.
  50. Most modern information is born-digital, and librarians are pivoting from paper collections to online collection curation/building/digitization. John Palfrey & Urs Gasser, Born Digital: Understanding the First Generation of Digital Natives (2010).
  51. Joan C. Durrance, Competition or Convergence? Library and Information Science Education at a Critical Crossroad, 28 Advances in Librarianship 171 (December 29, 2004),
  52. For instance, Safiya Umoja Noble warns about bias in search algorithms in her book, Algorithms of Oppression: How Search Engines Reinforce Racism, and Sarah T. Roberts writes about how social media moderation, behind the scenes, takes an emotional toll on its workers in Behind the Screen: Content Moderation in the Shadows of Social Media.
  53. We protest individual contracting schemes by various vendors but we do not examine information capitalism as its own structure.
  54. Brian Resnick & Julia Belluz, “The War To Free Science: How Librarians, Pirates, and Funders Are Liberating the World’s Academic Research from Paywalls”, Vox (July 10, 2019),
  55. For example, librarians organized against Macmillan’s e-book embargo in 2019. See Shawnda Hines, “ALA Launches National Campaign Against Ebook Embargo” (September 11, 2019)
  56. Robert Cookson, “Reed Elsevier to Rename Itself RELX Group,” Financial Times (February 26, 2015),
  57. Brian P. Craig, “Law Firm Reference Librarian, a Dying Breed?,” LLAGNY Law Lines (Summer 2009),
  58. Joe Hodnicki, Early Coverage of AALL-LexisNexis Anticompetitive Tying Controversy, Law Librarian Blog (June 15, 2018),
  59. In 2011, investment analyst Claudio Aspesi asked an Elsevier CEO about “the deteriorating relationship with the libraries” and the CEO declined to respond about the relationship between the major publisher and its library customers. Stephen Buranyi, “Is the Staggeringly Profitable Business of Scientific Publishing Bad for Science?,” The Guardian (June 27, 2017),
  60. Bruce M. Kennedy, “Confidentiality of Library Records: A Survey of Problems, Policies, and Laws,” 81 Law Library Journal 733 (1989),
  61. In 2008, the Federal Trade Commission ordered Reed Elsevier to divest part of its ChoicePoint acquisition to Thomson Reuters to ensure competition between the two data brokers, but no overarching law or particular action has broken up the data companies’ duopoly or significantly regulated privacy in the data broker industry. See Federal Trade Commission, “FTC Challenges Reed Elsevier’s Proposed $4.1 Billion Acquisition of ChoicePoint, Inc.” (September 16, 2008),
  62. Mijente, “Who’s Behind ICE: The Tech and Data Companies Fueling Deportations” (2018),
  63. David Crotty, “Welcome to The Great Acceleration,” The Scholarly Kitchen (January 2, 2019),
  64. Armin Beverungen, et al., “The Poverty of Journal Publishing,” 19 Organization 929 (2012).
  65. This cycle of publishing and surveillance is described in Sarah Lamdan, “When Westlaw Fuels ICE Surveillance: Legal Ethics in the Era of Big Data Policing,” 43 N.Y.U. Review of Law & Social Change 255 (2019),
  66. Kemi Amin, “UCSF Faculty Launch Petition to Boycott Elsevier in Support of Open Access,” UCSF Library Blog (March 11, 2019),
  67. Sarah Zhang, “The Real Cost of Knowledge,” The Atlantic (March 4, 2019),
  68. Ibid.
  69. Sarah Lamdan, “Social Media Privacy: A Rallying Cry to Librarians,” 85 The Library Quarterly 5-6 (July 2015).
  70. Ibid.
  71. Andrea Peterson, “Librarians won’t stay quiet about government surveillance,” Washington Post (October 3, 2014),
  72. RIPS Law Librarian Blog, “Post Removed” (December 5, 2017). The post was shared by a law librarian on his personal blog. See Joe Hodnicki, “LexisNexis’s Role in ICE Surveillance and Librarian Ethics,” Law Librarian Blog (December 11, 2017),
  73. Federal Trade Commission, Group Boycotts,
  74. Ann Cavoukian, Privacy by Design: The 7 Foundational Principles,

Fellow Reflection: Sarah Nguyen / Digital Library Federation

This post was written by Sarah Nguyen, who received a Students & New Professionals Fellowship to attend this year’s DLF Forum.

Sarah (@snewyuen) is an advocate for open, accessible, and secure technologies. While studying as a Master of Library and Information Science candidate at the University of Washington iSchool, she is expressing research interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, Instructional Design Technologist at CUNY City Tech Open Education Resources Program, and Archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance.

DLF Forum 2019 took place during a lucky (if climate-change-questionable) pocket when hurricane season had left and sunny Tampa, FL welcomed us onto the Seminole and Tocobaga lands—it reminded me of the Saved by the Bell: Palm Springs Weekend (1991) set.

The view from outside of the DLF Forum Conference Hotel. Taken by me.

I joined my first Forum as a DLF Student & New Professional Fellow, and I left with new perspectives to explore. Topics ranged widely, but three sessions shared a theme that stood out to me, which I will highlight in this #DLFForum recap post.

As a current MLIS student, the questions that keep me up at night are: where do I fit in, what can I make of this, and what can the field offer me when I graduate? For me, the week started off with the tone that Dr. Marisa Duarte set during her opening plenary, “Beautiful Data: Justice, Code, and Architectures of the Sublime” (shared notes). She told us about her journey in and out of LIS and about her advocacy around algorithmically-delivered data, its effects on society and justice, and how libraries have the knowledge and social power to do “reflexive justice work”. Duarte emphasized how we, as trained and/or degreed information professionals, can educate our diverse population about digital biases and spread awareness on how to approach today’s “pan-capitalist algorithmic domination.” Duarte’s keynote went beyond basic proactive equity, diversity, and inclusion (EDI) efforts to show that exposing the truth of current digital environments is an intentional practice that will not happen overnight. Her realistic but optimistic attitude toward digital librarianship gave me a better idea of how I’d like to represent and carry my digital library practices into the future.

Later that morning, I attended “I’m not an archivist, but…”: Working “Archives-Adjacent” to Transcend Traditional Library Roles” (slides & #m1c tweets), a panel of trained archivists, Dinah Handel, Monique Lassere, Jenna Freedman, Mary Kidd, and Stefanie Ramsay, who do not touch the actual objects but keep the archives’ other operational lifelines well-oiled: systems and operation coordination, digitization service management, digital project librarianship, and communication (respectively). This panel brought me back to Duarte’s message about the power that librarians have in building accessible digital infrastructures amongst Big Tech products, or in this panel’s case, within the bureaucracy of the ivory towers. It’s not just about building an unbiased database, but also about receiving proper credit for the undefined, immeasurable, extra-mile work these colleagues contribute to the archives. There’s a person behind that database infrastructure and that GitHub pull request, and they need to be acknowledged to the same equitable standards as other traditional archives positions (Mirza & Seale, 2017). I would be remiss not to mention the interactive slides, which were the best and further demonstrate how much creative care these archives-adjacent colleagues put into their daily work:

A slide with a spinning card from Mary Kidd’s presentation. Taken by me.

On the last morning, Chela Scott-Weber’s portion of #w1d, “The Story Disrupted: Memory Institutions and Born Digital Collecting” (shared notes & #w1d tweets), resonated with me. She emphasized that “we have been and are collecting reactively within systems of traditionally white institutions. Consider power models, think less about format specific collections and build proactive, collaborative practices around community, people, and phenomena.” Her push for EDI representation in born-digital collections made me proud to be starting my professional experience with projects centered around communities of born-digital creators, instead of putting collectors’ desires for admiration first.

The post Fellow Reflection: Sarah Nguyen appeared first on DLF.

Fellow Reflection: Doyin Adenuga / Digital Library Federation

This post was written by Doyin Adenuga, who received a Focus Fellowship to attend this year’s DLF Forum.

Doyin has an MLIS degree from the University of British Columbia (UBC), Canada, and has been a librarian for four years. He spent ten years in computer/systems support in Nigeria, as well as five years as technology coordinator/assistant editor at a textbook publishing center at the University of Wisconsin-Madison, where he became interested in providing support to users of information. He is interested primarily in the world of digital librarianship. Doyin started his library career during an iSchool @ UBC co-op term as a DSpace Cataloguer, and is currently the Electronic Resources Librarian at Houghton College, where he has installed and maintains a DSpace system. The public interface of DSpace @ Houghton College only provides searching and browsing of its multiple collections, while other features of this institutional repository system are disabled. You can find him on Twitter at @nugadoy.

I thank the 2019 Digital Library Federation (DLF) Forum committee for the honor of being selected as one of the fellows and being fully sponsored to attend the Forum. I have attended a number of library meetings and conferences, but not one focused on digital libraries, and I appreciate the opportunity. Though the DLF Forum brings together digital library professionals specifically, I would encourage everyone in any area of the information profession to attend at least one DLF Forum.

The two areas that drew my interest even before the forum were sessions on open source repository solutions (other than DSpace) and privacy concerns with digital collections. I attended the session “Samvera – Sustainable Digital Repository Solutions & Community of Practice” and was intrigued by what Samvera has to offer, such as its digital object viewer option. The need to digitize institutional records in service of “advancing research, learning, social justice, & the public good” keeps increasing, and having options for open digital library solutions will surely help. The presenters rightly said that “no one system fits all,” but with the limited time for the presentation, I became interested in investigating Samvera further.

As much work as goes into setting up a digital repository system, even more attention is needed for the ongoing privacy concerns around its digital contents. The three presentations at the “privacy” session gave practical examples of privacy concerns and showed how the issues that arose were handled professionally. I would summarize this session into three areas: pre-archiving privacy concerns and the institutional workflows intended to minimize them, post-archiving scenarios and the actions taken to resolve them, and privacy pedagogy at the institutional level.

Finally, the Forum gave me the opportunity to connect with other information professionals, some of the DLF working groups, and a locally based digital humanities group, which I otherwise might not have known existed, and I am really grateful.

The post Fellow Reflection: Doyin Adenuga appeared first on DLF.

Academic Publishers As Parasites / David Rosenthal

This is just a quick post to draw attention to From symbiont to parasite: the evolution of for-profit science publishing by UCSF's Peter Walter and Dyche Mullins in Molecular Biology of the Cell. It is a comprehensive overview of the way the oligopoly publishers obtained and maintain their rent-extraction from the academic community:
"Scientific journals still disseminate our work, but in the Internet-connected world of the 21st century, this is no longer their critical function. Journals remain relevant almost entirely because they provide a playing field for scientific and professional competition: to claim credit for a discovery, we publish it in a peer-reviewed journal; to get a job in academia or money to run a lab, we present these published papers to universities and funding agencies. Publishing is so embedded in the practice of science that whoever controls the journals controls access to the entire profession."
My only criticisms are a lack of cynicism about the perks publishers distribute:
  • They pay no attention to the role of librarians, who after all actually "negotiate" with the publishers and sign the checks.
  • They write:
    we work for them for free in producing the work, reviewing it, and serving on their editorial boards
    We have spoken with someone who used to manage top journals for a major publisher. His internal margins were north of 90%, and the single biggest expense was the care and feeding of the editorial board.
And they are insufficiently skeptical of claims as to the value that journals add. See my Journals Considered Harmful from 2013.

Despite these quibbles, you should definitely go read the whole paper.

LITA Opens Call for Innovative LIS Student Writing Award for 2020 / LITA

The Library and Information Technology Association (LITA), a division of the American Library Association (ALA), is pleased to offer an award for the best unpublished manuscript submitted by a student or students enrolled in an ALA-accredited graduate program. Sponsored by LITA and Ex Libris, the award consists of $1,000, publication in LITA’s refereed journal, Information Technology and Libraries (ITAL), and a certificate. The deadline for submission of the manuscript is February 28, 2020.

The award recognizes superior student writing and is intended to enhance the professional development of students. The manuscript can be written on any aspect of libraries and information technology. Examples include, but are not limited to, digital libraries, metadata, authorization and authentication, electronic journals and electronic publishing, open source software, distributed systems and networks, computer security, intellectual property rights, technical standards, desktop applications, online catalogs and bibliographic systems, universal access to technology, and library consortia.

To be eligible, applicants must follow these guidelines and fill out the application form (PDF). Send the signed, completed forms electronically no later than February 28, 2020, to the Award Committee Chair, Julia Bauder, at

The winner will be announced in May 2020.

About LITA

The Library and Information Technology Association (LITA) is the leading organization reaching out across types of libraries to provide education and services for a broad membership of nearly 2,400 systems librarians, library technologists, library administrators, library schools, vendors, and many others interested in leading edge technology and applications for librarians and information providers. LITA is a division of the American Library Association. Follow us on our Blog, Facebook, or Twitter.

About Ex Libris
Ex Libris, a ProQuest company, is a leading global provider of cloud-based solutions for higher education. Offering SaaS products for the management and discovery of the full spectrum of library and scholarly materials, as well as mobile campus solutions driving student engagement and success, Ex Libris serves thousands of customers in 90 countries. For more information about Ex Libris, see our website, and join us on Facebook, YouTube, LinkedIn, and Twitter.


Jenny Levine

Executive Director

Library and Information Technology Association


2019 AMIA Cross-Pollinator: Marlo Longley / Digital Library Federation


The Association of Moving Image Archivists (AMIA) and DLF will be sending Marlo Longley to attend the 2019 DLF/AMIA Hack Day and AMIA conference in Baltimore, Maryland! During the event, Marlo will collaborate on projects with other attendees to develop solutions for digital audiovisual preservation and access.

About the Awardee

Marlo Longley is a Digital Repository Developer for the Metropolitan New York Library Council (METRO). At METRO he is working on the 2020 release of Archipelago, an open source repository system committed to flexible metadata. Last year he helped build Canyon Cinema’s new catalog search and website, and got interested in technical issues facing moving image archives from there. Marlo is based in Oakland, CA.

About Hack Day and the Award

The sixth AMIA+DLF Hack Day (November 13 at the Renaissance Baltimore Harborplace) will be a unique opportunity for practitioners and managers of digital audiovisual collections to join with developers and engineers for an intense day of collaboration to develop solutions for digital audiovisual preservation and access.

The goal of the AMIA + DLF Award is to bring “cross-pollinators”–developers and software engineers who can provide unique perspectives to moving image and sound archivists’ work with digital materials, share a vision of the library world from their perspective, and enrich the Hack Day event–to the conference.

Interested in participating in Hack Day either virtually or in person? Registration is free! Sign up now:

The post 2019 AMIA Cross-Pollinator: Marlo Longley appeared first on DLF.

Using Machine Learning to Extract Nuremberg Trials Transcript Document Citations / Harvard Library Innovation Lab

In Harvard's Nuremberg Trials Project, being able to link to cited documents in each trial's transcript is a key feature of site navigation. Each document submitted into evidence by prosecution and defense lawyers is introduced and discussed in the transcript, and at each mention the site user can click open the document to view its contents and attendant metadata. While document references generally follow standard patterns, deviations large and small are numerous, and correctly identifying the type of document reference – is this a prosecution or defense exhibit, for example? – can be quite tricky, often requiring teasing out contextual clues.

While manual linkage is highly accurate, over a corpus of 153,000 transcript pages and more than 100,000 document references it is infeasible to manually tag and classify each mention of a document, whether a prosecution or defense trial exhibit, or a source document from which the former were often chosen. Automated approaches offer the most promising path to a scalable solution, with strategic, manual, final-mile workflows responsible for cleanup and optimization.

Initial prototyping by Harvard of automated document-reference capture focused on pattern matching with regular expressions. Targeting only the most frequently found patterns in the corpus, Harvard was able to extract more than 50,000 highly reliable references. While continuing with this strategy could have found significantly more references, it was not clear that, once identified, a document reference could be accurately typed without manual input.
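A regular-expression pass of this kind can be sketched as follows. The pattern and function names here are illustrative only; the actual patterns Harvard targeted are not given in the post:

```python
import re

# Hypothetical pattern in the spirit of the approach described: catch
# mentions such as "Prosecution Exhibit 123" or "Defense Exhibit 45-A".
# The real corpus required many more patterns plus contextual rules.
EXHIBIT_RE = re.compile(
    r"\b(?P<party>Prosecution|Defense)\s+Exhibit\s+(?P<number>\d+(?:-[A-Z])?)",
    re.IGNORECASE,
)

def extract_references(page_text):
    """Return (party, exhibit number) pairs found on one transcript page."""
    return [(m.group("party").lower(), m.group("number"))
            for m in EXHIBIT_RE.finditer(page_text)]

print(extract_references(
    "The witness was shown Prosecution Exhibit 123 and defense exhibit 45-A."
))
# → [('prosecution', '123'), ('defense', '45-A')]
```

A pass like this is precise on the patterns it targets but blind to everything else, which is exactly the recall gap the machine learning work described next was meant to close.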

At this point Harvard connected with Tolstoy, a natural language processing (NLP) AI startup, to ferret out the rest of the tags and identify them by type. Employing a combination of machine learning and rule-based pattern matching, Tolstoy was able to extract and classify the bulk of remaining document references.

Background on Machine Learning

Machine learning is a comprehensive branch of artificial intelligence. It is, essentially, statistics on steroids. Working from a “training set” – a set of human-labeled examples – a machine learning algorithm identifies patterns in the data that allow it to make predictions. For example, a model that is supplied many labeled pictures of cats and dogs will eventually find features of the cat images that correlate with the label “cat,” and likewise, for “dog.” Broadly speaking, the same formula is used by self-driving cars learning how to respond to traffic signs, pedestrians, and other moving objects.

In Harvard’s case, a model was needed that could learn to extract and classify, using a labeled training set, document references in the court transcripts. To enable this, one of the main features used was surrounding context, including possible trigger words that can be used to determine whether a given trial exhibit was submitted by the prosecution or defense. To be most useful, the classifier needed to be accurate (correctly labeling references as either prosecution or defense), precise (minimal false positives), and have high recall (few missing references).

Feature Engineering

The first step in any machine learning project is to produce a thorough, unbiased training set. Since Harvard staff had already identified 53,000 verified references, Tolstoy used those, along with an additional set generated using more precise heuristics, to train a baseline model.

The model is the predictive algorithm. There are many different families of models a data scientist can choose from. For example, one might use a support vector machine (SVM) if there are fewer examples than features, a convolutional neural net (CNN) for images, or a recurrent neural net (RNN) for processing long passages requiring memory. That said, the model is only a part of the entire data processing pipeline, which includes data pre-processing (cleaning), feature engineering, and post-processing.

Here, Tolstoy used a "random forest" algorithm. This method builds an ensemble of decision-tree classifiers whose nodes, or branches, represent points at which the training data is subdivided based on feature characteristics. The random forest classifier aggregates the final decisions of a suite of decision trees, predicting the class most often output by the trees. The entire process is randomized: each tree trains on a random subset of the training data and considers a random subset of features at each node.
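To make the mechanics concrete, here is a toy, standard-library-only sketch of the random-forest idea, using single-feature "decision stumps" in place of full trees. A real project would reach for a library implementation such as scikit-learn's RandomForestClassifier; the two-feature data below is invented.

```python
import random
from collections import Counter

def train_stump(rows, labels):
    """Fit a one-feature threshold classifier on a bootstrap sample."""
    idx = [random.randrange(len(rows)) for _ in rows]   # bootstrap: sample with replacement
    feat = random.randrange(len(rows[0]))               # random feature for this stump
    sample = [(rows[i][feat], labels[i]) for i in idx]
    thresh = sum(v for v, _ in sample) / len(sample)
    above = Counter(l for v, l in sample if v > thresh)
    below = Counter(l for v, l in sample if v <= thresh)
    hi = above.most_common(1)[0][0] if above else labels[0]
    lo = below.most_common(1)[0][0] if below else labels[0]
    return feat, thresh, hi, lo

def forest_predict(stumps, row):
    """Aggregate the stumps' votes and return the majority class."""
    votes = Counter(hi if row[feat] > thresh else lo
                    for feat, thresh, hi, lo in stumps)
    return votes.most_common(1)[0][0]

random.seed(0)
# Invented data: defense rows have low feature 0 and high feature 1.
rows = [[0, 10], [1, 11], [0, 10], [9, 2], [8, 1], [9, 2]]
labels = ["defense", "defense", "defense",
          "prosecution", "prosecution", "prosecution"]
stumps = [train_stump(rows, labels) for _ in range(25)]
print(forest_predict(stumps, [0, 10]))  # → defense
print(forest_predict(stumps, [9, 2]))   # → prosecution
```

The key design choice mirrors the description above: each "tree" sees a randomized view of the data, and the forest's answer is the majority vote, which makes the ensemble far more robust than any single tree.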

Models work best when they are trained on the right features of the data. Feature engineering is the process by which one chooses the most predictive parts of available training data. For example, predicting the price of a house might take into account features such as the square footage, location, age, amenities, recent remodeling, etc.

In this case, we needed to predict the type of document reference involved: was it a prosecution or defense trial exhibit? The exact same sequence of characters, say "Exhibit 435," could be either defense or prosecution, depending on – among other things – the speaker and how they introduced it. Tolstoy used features such as the speaker, the presence or absence of prosecution or defense attorneys' names (or that of the defendant), and the presence or absence of country name abbreviations to classify the references.


Machine learning is a great tool in a predictive pipeline, but in order to gain very high accuracy and recall rates, one often needs to combine it with heuristics-based methods as well. For example, in the transcripts, phrases like “submitted under” or “offered under” may precede a document reference. These phrases were used to catch references that had previously been missed. Other post-processing included catching and removing tags from false positives, such as years (e.g. January 1946) or descriptions (300 Germans). These techniques allowed us to preserve high precision while maximizing recall.
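A post-processing filter of that kind might look like the following sketch; the patterns are hypothetical stand-ins for the project's actual rules, covering only the two false-positive shapes mentioned above (dates like "January 1946" and counts like "300 Germans").

```python
import re

# Hypothetical cleanup pass: drop candidate "references" that are
# really dates or plain counts rather than exhibit numbers.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
FALSE_POSITIVE = re.compile(
    rf"^(?:(?:{MONTHS})\s+\d{{4}}|\d+\s+[A-Z][a-z]+)$")

def keep_reference(text):
    """Return True if the candidate should survive post-processing."""
    return not FALSE_POSITIVE.match(text)

candidates = ["Exhibit 435", "January 1946", "300 Germans"]
print([c for c in candidates if keep_reference(c)])  # → ['Exhibit 435']
```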

Collaborative, Iterative Build-out

In the build-out of the data processing pipeline, it was important for both Tolstoy and Harvard to carefully review interim results, identify and discuss error patterns and suggest next-step solutions. Harvard, as a domain expert, was able to quickly spot areas where the model was making errors. These iterations allowed Tolstoy to fine-tune the features used in the model, and amend the patterns used in identifying document references. This involved a workflow of tweaking, testing and feedback, a cycle repeated numerous times until full process maturity was reached. Ultimately, Tolstoy was able to successfully capture more than 130,000 references throughout the 153,000 pages, with percentages in the high 90s for accuracy and low 90s for recall. After final data filtering and tuning at Harvard, these results will form the basis for the key feature enabling interlinkage between the two major data domains of the Nuremberg Trials Project: the transcripts and evidentiary documents. Working together with Tolstoy and machine learning has significantly reduced the resources and time otherwise required to do this work.

Getting Started with Caselaw Access Project Data / Harvard Library Innovation Lab

Today we’re sharing new ways to get started with Caselaw Access Project data using tutorials from The Programming Historian and more.

The Caselaw Access Project makes 360 years of U.S. case law available as a machine-readable text corpus. In developing a research community around the dataset, we’ve been creating and sharing resources for getting started.
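As a taste of working with the corpus, here is a minimal sketch of reading CAP-style bulk data, where each line of a file is one case serialized as JSON. The field names used here (name, decision_date, casebody) follow the published case format, but treat them as assumptions and check the project's current documentation; the case itself is invented.

```python
import json

# One invented case, serialized the way a line of CAP bulk data
# (JSON lines) is shaped. Field names are assumptions for illustration.
sample_line = json.dumps({
    "name": "Hypothetical v. Example",
    "decision_date": "1892-05-10",
    "casebody": {"data": {"opinions": [{"text": "Opinion text ..."}]}},
})

def case_summary(line):
    """Parse one JSON line and return a short human-readable summary."""
    case = json.loads(line)
    year = case["decision_date"][:4]
    return f'{case["name"]} ({year})'

print(case_summary(sample_line))  # → Hypothetical v. Example (1892)
```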

In our gallery, we’ve been developing tutorials and our examples repository for working with our data alongside research results, applications, fun stuff, and more:

The Programming Historian shares peer-reviewed tutorials for computational workflows in the humanities. Here are a group of their guides for working with text data, from processing to analysis:

We want to share and build ways to start working with Caselaw Access Project data. Do you have an idea for a future tutorial? Drop us a line to let us know!

Meet Monica Granados, one of our Frictionless Data for Reproducible Research Fellows / Open Knowledge Foundation

The Frictionless Data for Reproducible Research Fellows Programme is training early career researchers to become champions of the Frictionless Data tools and approaches in their field. Fellows will learn about Frictionless Data, including how to use Frictionless Data tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content.

Hello there! My name is Monica Granados and I am a food-web ecologist, science communicator and a champion of open science. There are not too many times or places in my life where it is so easy to demarcate a “before” and an “after.” In 2014, I travelled to Raleigh, North Carolina to attend the Open Science for Synthesis (OSS) course co-facilitated by the National Center for Ecological Analysis and Synthesis and the Renaissance Computing Institute.

I was there to learn more about the R statistical programming language to aid my quest for a PhD. At the conclusion of the course I did come home with more knowledge about R and programming but what I couldn’t stop thinking about was what I learned about open science. I came home a different scientist, truth be told a different person. You see at OSS I learned that there was a different way to do science – an approach so diametrically opposite to what I had been taught in my five years in graduate school. Instead of hoarding data and publishing behind paywalls, open science asks – wouldn’t science be better if our data, methods, publications and communications were open?

When I returned back from Raleigh, I uploaded all of my data to GitHub and sought out open access options for my publications. Before OSS I was simply interested in contributing my little piece to science, but after OSS I dedicated my career to the open science movement. In the years since OSS, I have made all my code, data and publications open and I have delivered workshops and designed courses for others to work in the open. I now run a not-for-profit that teaches researchers how to do peer-review using open access preprints and I am a policy analyst working on open science at Environment and Climate Change Canada. I wanted to become a Frictionless Data fellow because open science is continually evolving. I wanted to learn more about reproducible research. When research is reproducible, it is more accessible and that sets off a chain reaction of beneficial consequences. Open data, methods and publications mean that if you were interested in knowing more about the course of treatment your doctor prescribed or you are in doctor in the midst of an outbreak searching for the latest data on the epidemic, or perhaps you are a decision maker looking for guidance on what habitat to protect, this information is available to you. Easily, quickly and free of charge.

I am looking forward to building some training materials and data packages to make it easier for scientists to work in the open through the Frictionless Data fellowship. And I look forward to updating you on my and my fellow fellows’ progress.

Frictionless Data for Reproducible Research Fellows Programme

More on Frictionless Data

The Fellows programme is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. Frictionless Data’s other current projects include the Tool Fund, in which four grantees are developing open source tooling for reproducible research. The Fellows programme will be running until June 2020, and we will post updates to the programme as they progress.


Wikipedia and the deep backfile / John Mark Ockerbloom

We’re continuing to add serial information to the Deep Backfile project that I announced here last month.  I’m adding some of the existing information in The Online Books Page serials listings and our first serial renewals listings that hadn’t initially been linked in when I made the first announcement.  I’ve added journals with deep backfiles from a couple more publishers (Oxford and Cambridge). I’ve started adding some new information on a few journals that I’ve heard people be interested in.  And I’ve heard from some librarians who are interested in contributing more information, which I welcome, since there are a lot of journals with information still to fill in.

But we needn’t stop with librarians and journals.  I’ve seen many kinds of serials written about online that potentially have public domain content, or that otherwise offer free online issues.  Many of them have articles about them in Wikipedia, sometimes a short summary stub, and sometimes a more extensive write-up.  I’m most familiar with English Wikipedia, the largest and oldest edition, and recently wondered how many serials covered there had free online issues or were old enough to potentially have public domain issues.  So I decided to answer that question by building a table for that set of serials.

It turns out to be a very big table:  over 10,000 serials with English Wikipedia articles that have free or potentially public domain content.  That’s bigger than the combination of all the other publisher and provider tables I currently link to from the Deep Backfile page.  There are lots of serials in it with no copyright or free issue information available, and it would take any single person a very long time to find such information, verify it, and fill it in.

But I think it’s still useful in its current state.  You can use it to find out about a lot of public domain and open access serials you’ve probably never heard of, as well as many that you have.  You can click through serial titles to see their Wikipedia articles, and improve on them if you have more information.  You can click on their Wikidata IDs to see and add to their metadata.  (As you can see from the relatively small number of end dates shown in the “coverage” column, there is limited information currently in Wikidata for many of the serials.)  You can see what we know about their copyrights, and about free online issue availability, and follow the “Contact us” links if you want to contribute more information about either of those.  (Last month’s post included instructions on how to research serial copyrights. Links to the two main resources you need to research them– the first renewals listing and the Copyright Office database— are now provided directly from the form you get to when you select a “Contact us” link.)  And when a new English Wikipedia article and Wikidata entry on a serial gets added that shows it was published before 1964, it will be automatically added to this table the next time we generate it.

Whether you’re a Wikipedian, a librarian, or just a reader interested in journals, magazines, newspapers, comics, or other serials, I hope you find this information useful, and I invite you to help fill it in as your interests and time permit.  Let me know or comment here if you have any questions, comments, or suggestions.


What is the Distant Reader and why should I care? / Eric Lease Morgan

The Distant Reader is a tool for reading. [1]

Wall Paper by Eric

The Distant Reader takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process.

The Distant Reader empowers one to use & understand large amounts of textual information both quickly & easily. For example, the Distant Reader can consume the entire issue of a scholarly journal, the complete works of a given author, or the content found at the other end of an arbitrarily long list of URLs. Thus, the Distant Reader is akin to a book’s table-of-contents or back-of-the-book index but at scale. It simplifies the process of identifying trends & anomalies in a corpus, and then it enables a person to further investigate those trends & anomalies.

The Distant Reader is designed to “read” everything from a single item to a corpus of thousands of items. It is intended for the undergraduate student who wants to read the whole of their course work in a given class, the graduate student who needs to read hundreds (thousands) of items for their thesis or dissertation, the scientist who wants to review the literature, or the humanist who wants to characterize a genre.

How it works

The Distant Reader takes five different forms of input:

  1. a URL – good for blogs, single journal articles, or long reports
  2. a list of URLs – the most scalable, but creating the list can be problematic
  3. a file – good for that long PDF document on your computer
  4. a zip file – the zip file can contain just about any number of files from your computer
  5. a zip file plus a metadata file – with the metadata file, the reader’s analysis is more complete

Once the input is provided, the Distant Reader creates a cache — a collection of all the desired content. This is done via the input or by crawling the ‘Net. Once the cache is collected, each & every document is transformed into plain text, and along the way basic bibliographic information is extracted. The next step is analysis against the plain text. This includes rudimentary counts & tabulations of ngrams, the computation of readability scores & keywords, basic topic modeling, parts-of-speech & named entity extraction, summarization, and the creation of a semantic index. All of these analyses are manifested as tab-delimited files and distilled into a single relational database file. After the analysis is complete, two reports are generated: 1) a simple plain text file which is very tabular, and 2) a set of HTML files which are more narrative and graphical. Finally, everything that has been accumulated & generated is compressed into a single zip file for downloading. This zip file is affectionately called a “study carrel“. It is completely self-contained and includes all of the data necessary for more in-depth analysis.
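The pipeline above ends with tab-delimited files, and those are easy to post-process. For instance, a carrel's parts-of-speech file might be mined for frequent nouns like this; the token/pos column names are assumptions for illustration, not the Reader's documented layout.

```python
import csv
import io
from collections import Counter

# A tiny stand-in for a study carrel's tab-delimited parts-of-speech
# file. Column names (token, pos) are hypothetical.
pos_tsv = "token\tpos\nwar\tNOUN\nking\tNOUN\nfought\tVERB\nwar\tNOUN\n"

def top_nouns(tsv_text, n=2):
    """Count NOUN-tagged tokens and return the n most frequent."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(r["token"] for r in rows if r["pos"] == "NOUN")
    return counts.most_common(n)

print(top_nouns(pos_tsv))  # → [('war', 2), ('king', 1)]
```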

What it does

The Distant Reader supplements the traditional reading process. It does this in the way of traditional reading apparatus (tables of content, back-of-book indexes, page numbers, etc), but it does it more specifically and at scale.

Put another way, the Distant Reader can answer a myriad of questions about individual items or the corpus as a whole. Such questions are not readily apparent through traditional reading. Examples include but are not limited to:

  • How big is the corpus, and how does its size compare to other corpora?
  • How difficult (scholarly) is the corpus?
  • What words or phrases are used frequently and infrequently?
  • What statistically significant words characterize the corpus?
  • Are there latent themes in the corpus, and if so, then what are they and how do they change over both time and place?
  • How do any latent themes compare to basic characteristics of each item in the corpus (author, genre, date, type, location, etc.)?
  • What is discussed in the corpus (nouns)?
  • What actions take place in the corpus (verbs)?
  • How are those things and actions described (adjectives and adverbs)?
  • What is the tone or “sentiment” of the corpus?
  • How are the things represented by nouns, verbs, and adjective related?
  • Who is mentioned in the corpus, how frequently, and where?
  • What places are mentioned in the corpus, how frequently, and where?

People who use the Distant Reader look at the reports it generates, and they often say, “That’s interesting!” This is because it highlights characteristics of the corpus which are not readily apparent. If you were asked what a particular corpus was about or what are the names of people mentioned in the corpus, then you might answer with a couple of sentences or a few names, but with the Distant Reader you would be able to be more thorough with your answer.

The questions outlined above are not necessarily apropos to every student, researcher, or scholar, but the answers to many of these questions will lead to other, more specific questions. Many of those questions can be answered directly or indirectly through further analysis of the structured data provided in the study carrel. For example, each & every feature of each & every sentence of each & every item in the corpus has been saved in a relational database file. By querying the database, the student can extract every sentence with a given word or matching a given grammar to answer a question such as “How was the king described before & after the civil war?” or “How did this paper’s influence change over time?”
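A query along those lines might look like the following sketch against a hypothetical sentences table; the schema is invented for illustration and is not the study carrel's actual layout.

```python
import sqlite3

# Build a toy in-memory database standing in for a carrel's relational
# database file. The "sentences" table and its columns are hypothetical.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sentences (item TEXT, sentence TEXT)")
db.executemany("INSERT INTO sentences VALUES (?, ?)", [
    ("memoir", "The king ruled wisely before the civil war."),
    ("memoir", "After the war the king grew harsh."),
    ("history", "The harvest failed that year."),
])

# Pull every sentence mentioning a given word, with its source item.
hits = db.execute(
    "SELECT item, sentence FROM sentences WHERE sentence LIKE ?",
    ("%king%",)).fetchall()
for item, sentence in hits:
    print(item, "->", sentence)
```

Against a real carrel the same pattern scales to the whole corpus: one LIKE (or full-text) query, and every matching sentence comes back with enough context to keep investigating.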

A lot of natural language processing requires pre-processing, and the Distant Reader does this work automatically. For example, collections need to be created, and they need to be transformed into plain text. The text will then be evaluated in terms of parts-of-speech and named-entities. Analysis is then done on the results. This analysis may be as simple as the use of concordance or as complex as the application of machine learning. The Distant Reader “primes the pump” for this sort of work because all the raw data is already in the study carrel. The Distant Reader is not intended to be used alone. It is intended to be used in conjunction with other tools, everything from a plain text editor, to a spreadsheet, to database, to topic modelers, to classifiers, to visualization tools.


I don’t know about you, but now-a-days I can find plenty of scholarly & authoritative content. My problem is not one of discovery but instead one of comprehension. How do I make sense of all the content I find? The Distant Reader is intended to address this question by making observations against a corpus and providing tools for interpreting the results.


[1] Distant Reader –

Getting ready for Open Data Day 2020 on Saturday 7th March / Open Knowledge Foundation

Open Data Day 2020

Next year marks the 10th anniversary of Open Data Day! Open Data Day is the annual event where we gather to reach out to new people and build new solutions to issues in our communities using open data.

The next edition will take place on Saturday 7th March 2020.

Over the last decade, this event has evolved from a small group of people in a few cities trying to convince their governments about the value of open data, to a full-grown community of practitioners and activists around the world working on putting data to use for their communities. 

Like in previous years, the Open Knowledge Foundation will continue with the mini-grants scheme giving between $200 and $300 USD to support great Open Data Day events across the world, so stay tuned for that. 

In the meantime, you can collaborate on the website, which is on Github. Pull requests are welcome and we have a bunch of issues we’d love to get through. 

If coding is not your thing but you know a language besides English, you can translate the website into your language, or update one of the other nine languages available so far.

If you have started planning your Open Data Day event for next year, the new form to start populating the map will be available soon. You can also connect with others and spread the word about Open Data Day using the #OpenDataDay or #ODD2020 hashtags. Alternatively you can join the Google Group to ask for advice or share tips.

To get inspired, you can read more about everything from this year’s edition on our wrap-up blog post.

It’s the Most Wonderful Time of the Year … / HangingTogether

It’s World Digital Preservation Day – a time of celebration and good cheer with friends and family! Not to mention a great way to raise awareness about the importance of preserving digital materials.

To get into the spirit of the day, take some time to check out the series of blog posts sponsored by the Digital Preservation Coalition, featuring thoughts, opinions, and stories from around the world on the theme “At-Risk Digital Materials”.

The posts are being published on a rolling basis throughout the day. Have a look – it’s a great opportunity to take the temperature of current thinking on digital preservation.

I was pleased to make a contribution to the posts, which is here.

Happy World Digital Preservation Day to you and yours!

The post It’s the Most Wonderful Time of the Year … appeared first on Hanging Together.

Twitter / pinboard

RT @kiru: The #Code4Lib Journal's issue 46 (2019/4) has been just published: . Worldcat Search API, Go…

eResearch Australasia 2019 trip report / Peter Sefton

By Mike Lynch and Peter Sefton

I'm re-posting / self-archiving this from the UTS eResearch Blog.

Mike Lynch and Peter Sefton attended the 2019 eResearch Australasia conference in Brisbane from 22-24 October 2019, where we presented a few things - and a pre-conference summit on the 21st held by the Australian Research Data Commons, where Mike presented our report from our small discovery project on scalable repository technology. UTS paid for the trip.

What we presented - our work on Simple Scalable Research Data Repositories

We've posted fleshed-out versions of our conference papers as usual. Mike presented a short version of ARDC-funded work on data repositories at both the summit and the conference, and Peter had also put in an abstract for a longer version which is less technically focussed and gives more of the context for why this work is important.

Peter presented an update on our ongoing work on describing and packaging research data – this time focussing on the new merged standard Research Object Crate (RO-Crate) and looking at what's coming next.

Diversity breakfast

I (Mike) went to the breakfast given in honour of the late Dr Jacky Pallas, a senior figure in the eResearch community who had given a keynote at last year's eResearch Australasia.

The speaker was Dr Toni Collis, a research software engineer and director of Women in High Performance Computing, on how lack of diversity is damaging your research, making the point that diverse research and support teams can be demonstrated to be more effective in terms of performance, and the importance of equity, diversity and inclusivity in attracting and retaining talent.

Research as a primary function of Electronic Health Records

Prof Nikolajz Zeps' presentation was one of a couple of talks which sold themselves as being provocative, in that they were arguing for a loosening of traditional, restrictive health care ethics and consent practices so that data could be made more readily available for research. He made the point that data sharing is very difficult in the Australian healthcare system not just because of ethical restrictions but because the system is so fragmented. He argued for an integration of research and clinical data consent and management practices, which would allow the information flows required for medical research to be used for the health care system itself to better monitor the effectiveness of treatments and patient outcomes. Moving the consent process from one of research ethics to clinical ethics can make things simpler, in terms of administration.

(I was less impressed by the other provocative keynote in which the speaker said "no-one ever died as the result of a health-care data breach", which I thought was a bit of posturing, even though it got applause from some of the audience.)

Notable presentations

Galaxy Australia and the Australian Bioinformatic Commons

These two presentations were part of the bioinformatics stream, about the Australian node of the global Galaxy workflow and computational platform, and the Australian BioCommons Pathfinder Project.

9 Reproducible Research Things

Amanda Miotto from Griffith University's eResearch team presented on running workshops to introduce researchers to good reproducibility practice which focussed on immediate benefits, like safeguarding a research team against individual members leaving or falling ill, and practical steps.

The workshop materials are available on GitHub

Physiome Journal

An interesting presentation by Karin Lundengård of the Auckland Bioengineering Institute about Physiome Journal, which will publish validated and reproducible mathematical models of physiological processes. The models will also be made available as Jupyter notebooks and shared on the Gigantum platform.


A lot of what we spoke about in this BOF, which was chaired by Ingrid Mason, echoed what I (Mike) had heard the week before at a Big Data for the Digital Humanities symposium in Canberra which was organised by the ARDC and AARNet (and which I should write a blog post about).

The challenges faced by digital humanities researchers and support staff mirror one another - researchers are unsure of the right way to engage with technical staff and vice versa, and good collaboration is too labour-intensive to be sustainable if it's going to be spread beyond a minority of researchers who are already linked in to support networks.

This gave me a kind of wistful feeling about an earlier keynote from Dell about machine learning in medical science, because the speaker was very enthusiastic about moving HPC tools out of the realm where researchers needed to become technology experts to use them at all, into something more like commodity software. There are some areas of the humanities where this sort of thing is starting to happen, though – transcription is one which came up in both Canberra and Brisbane.

Data Discovery

I (Peter) chaired a session on Data Discovery with a couple of lead-in talks that outlined what's going on in the world of generic research data discovery, leading into a discussion.

From our viewpoint at UTS it was useful to get confirmation that discovery services are converging on using schema.org for high-level description of data sets, for indexing by other services. Which is good, because that's the horse we bet on at UTS. It's being used by both Research Data Australia (RDA) and the new player Google dataset search (that's run by a tiny team apparently, but it will have a huge impact on how everyone has to structure their metadata).
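The markup in question is a schema.org Dataset description, typically embedded in a dataset's landing page as JSON-LD so that services like Google dataset search can index it. The sketch below generates a minimal example; all the field values are placeholders, not a real UTS record.

```python
import json

# A minimal schema.org Dataset description. @context and @type are the
# standard JSON-LD keys; the remaining values are invented placeholders.
dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example sensor readings",
    "description": "Hourly readings from a hypothetical field deployment.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "identifier": "https://example.org/dataset/1234",
}
print(json.dumps(dataset, indent=2))
```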

Amir Aryani (Swinburne) and Melroy Almeida (Australian Access Federation) presented on an ORCID graph looking at collaboration networks. This is testament to the power of using strong, URI-based identifiers: once you start doing that, metadata changes from an unreliable soup of differently spelled, ambiguous names to something you can do real analytics on.
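The point about identifiers can be shown with a toy example: three spellings of one researcher look like three people until you group by the ORCID iD instead of the name string. The iD and names below are made up.

```python
from collections import defaultdict

# Invented records: one researcher, three name spellings, one ORCID iD.
records = [
    {"name": "J. Smith",       "orcid": "https://orcid.org/0000-0000-0000-0001"},
    {"name": "Jane Smith",     "orcid": "https://orcid.org/0000-0000-0000-0001"},
    {"name": "Smith, Jane A.", "orcid": "https://orcid.org/0000-0000-0000-0001"},
]

by_name = defaultdict(int)
by_orcid = defaultdict(int)
for r in records:
    by_name[r["name"]] += 1     # grouping by name string: ambiguous
    by_orcid[r["orcid"]] += 1   # grouping by URI identifier: unambiguous

print(len(by_name), "apparent authors by name string")  # → 3
print(len(by_orcid), "actual authors by ORCID iD")      # → 1
```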

Adrian Burton (ARDC, ex ANDS) has been dealing with metadata for a long time – he put it that the schema.org approach had won, and suggested that this might be a bit of a loss. The RIF-CS standard that ANDS inherited and built RDA around had an entity-based model, with Collections, Parties, Activities and Services (based on ISO 2146), rather than simple flat name-value metadata. I agree that the entity model was a strength of RIF-CS. But actually, for those who want to convey rich context about data, schema.org with linked data can do everything RIF-CS can, more elegantly and with more detail. See the work we've been doing on RO-Crate, which takes things to a deeper level with descriptions of files and (soon) even variables inside files, including provenance chains (what people and equipment did to make those files from observations, or other files).

The leaders of that session have done a follow up survey, so I think they'll be putting out more info soon.


The RO-Crate talk I gave (Peter here) was in a stream on Digital Preservation and data packaging.

I was on a panel about digital preservation with Erin Gallant and Gavin Kennedy, also from AARNet.

picture of the panel

I was talking, not taking notes, but we discussed what the research community and cultural collections folks can learn from each other. Actually, I think we made some of the same mistakes: both the eResearch community and the GLAM sector invested in big silos which ended up not just storing data but making it difficult to move and re-use it. To labour the metaphor a bit, silos have small holes in the bottom, so getting data in and out is slow.

Mike's diagram of an OCFL repository shows an alternative approach – instead of putting data in a container with constricted ingress and egress, lay it all out in the open. I'm not an expert in preservation systems, but I do know that that's the approach taken by the open source Archivematica preservation system (note: I've done a bit of work for Artefactual Systems, which looks after it); it works as an application that sits beside a set of files on disk, so if needed you can use the grandparent of all APIs, that is, file operations, to fetch data. All of the talks we gave, linked above, were about this idea in one way or another.

Picture of data laid out and labelled in rows (like a field, not a silo) – by Mike Lynch

Trusted Repository certification - one for the UTS Roadmap

I (Peter again) attended a session MCed by Richard Ferrers from ARDC with contributions from people from a range of institutions and repositories who are part of an ARDC community of practice.

They talked about the Core Trust Seal repository certification program - and the process of getting certified.

Here's some background on CTS:

Core Certification and its Benefits

Nowadays certification standards are available at different levels, from a core level to extended and formal levels. Even at the core level, certification offers many benefits to a repository and its stakeholders. Core certification involves a minimally intensive process whereby data repositories supply evidence that they are sustainable and trustworthy. A repository first conducts an internal self-assessment, which is then reviewed by community peers. Such assessments help data communities—producers, repositories, and consumers—to improve the quality and transparency of their processes, and to increase awareness of and compliance with established standards. This community approach guarantees an inclusive atmosphere in which the candidate repository and the reviewers closely interact.

In addition to external benefits, such as building stakeholder confidence, enhancing the reputation of the repository, and demonstrating that the repository is following good practices, core certification provides a number of internal benefits to a repository. Specifically, core certification offers a benchmark for comparison and helps to determine the strengths and weaknesses of a repository.

Right now at UTS we're in the process of making a new Digital Strategy, aligned with the UTS 2027 Strategy - one of the core goals (which are still evolving, so we can't link to them just yet) is to have trusted systems. CTS would be a great way for the IT Department (that's us) to demonstrate to the organisation that we have the governance, technology and operational model in place to run a repository.

We're talking now about getting at least the first step (self-certification) on the 2021 Roadmap - but before that, we'll see if we can join the community discussion and start planning.

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Australia License.

Jobs in Information Technology: November 6, 2019 / LITA

New This Week

Visit the LITA Jobs Site for additional job listings and information on submitting your own job posting.

Invitation from the Mexican DSpace Users Group: “Installation, Configuration and Customization DSpace 6.3” / DuraSpace News

The Mexican DSpace Users Group is pleased to offer the community the opportunity to register for a virtual course on “Installation, Configuration and Customization DSpace 6.3” with Sheyla Salazar Waldo and Julian Timal Tlachi from the Mexican DSpace Users Group.

The objective of this course is to explain how the DSpace software (version 6.3) is installed on an operating system, in this case Windows. The user will be able to configure and personalize DSpace, as well as test and follow the complete installation process.

The course will begin on Friday, November 8; please register:

En español

El Grupo Mexicano de Usuarios de DSpace se complace en ofrecer a la comunidad la oportunidad de registrarse para un curso virtual sobre “Instalación, Configuración y Personalización de DSpace 6.3” con Sheyla Salazar Waldo y Julian Timal Tlachi del Grupo Mexicano de Usuarios de DSpace.

El objetivo de este curso es explicar cómo se instala el software DSpace (versión 6.3) en un sistema operativo, en este caso Windows. El usuario podrá configurar y personalizar DSpace, así como tener el control de las pruebas y ver el proceso completo de instalación.

El curso comenzará el viernes 8 de noviembre, por favor regístrese:


Islandora/Fedora Camp in Arizona - Call for Proposals / Islandora

Doing something great with Islandora and/or Fedora that you want to share with the community? Have a recent project that the world just needs to know about? Send us your proposals to present at the joint Islandora and Fedora Camp in Arizona! Presentations should be roughly 20-25 minutes in length (with time after for questions) and deal with Islandora and/or Fedora in some way. The camp will be focussed on the latest versions of Islandora and Fedora, so preference will be given to sessions that relate to Islandora 8 and Fedora 4 and higher, but we still welcome proposals relating to earlier versions.

All we need is a session title and a brief abstract. Submit your proposal here.

Project Gutenberg and the Distant Reader / Eric Lease Morgan

The venerable Project Gutenberg is perfect fodder for the Distant Reader, and this essay outlines how & why. (tl;dnr: Search my mirror of Project Gutenberg, save the result as a list of URLs, and feed them to the Distant Reader.)

Project Gutenberg


Wall Paper by Eric

A long time ago, in a galaxy far far away, there was a man named Michael Hart. Story has it he went to college at the University of Illinois, Urbana-Champaign. He was there during a summer, and the weather was seasonably warm. The computer lab, on the other hand, was cool; after all, computers run hot, and air conditioning is a must. So to cool off, Michael went into the computer lab.† While he was there he decided to transcribe the United States Declaration of Independence, ultimately in the hopes of enabling people to use computers to “read” this and additional transcriptions. That was in 1971. One thing led to another, and Project Gutenberg was born. I learned this story while attending a presentation by the now late Mr. Hart on Saturday, February 27, 2010 in Roanoke (Indiana). As it happened, it was also Mr. Hart’s birthday. [1]

To date, Project Gutenberg is a corpus of more than 60,000 freely available transcribed ebooks. The texts are predominantly in English, but many languages are represented. Many academics look down on Project Gutenberg, probably because it is not as scholarly as they desire, or maybe because the provenance of the materials is in dispute. Despite these things, Project Gutenberg is a wonderful resource, especially for high school students, college students, or life-long learners. Moreover, its transcribed nature eliminates any problems of optical character recognition, such as one encounters with the HathiTrust. The content of Project Gutenberg is all but perfectly formatted for distant reading.

Unfortunately, the interface to Project Gutenberg is less than desirable; the index to Project Gutenberg is limited to author, title, and “category” values. The interface does not support free text searching, and there is limited support for fielded searching and Boolean logic. Similarly, the search results are neither very interactive nor faceted. Nor is there any application programming interface to the index. With so much “clean” data, so much more could be implemented. In order to demonstrate the power of distant reading, I endeavored to create a mirror of Project Gutenberg while enhancing the user interface.

To create a mirror of Project Gutenberg, I first downloaded a set of RDF files describing the collection. [2] I then wrote a suite of software which parses the RDF, updates a database of desired content, loops through the database, caches the content locally, indexes it, and provides a search interface to the index. [3, 4] The resulting interface is ill-documented but 100% functional. It supports free text searching, phrase searching, fielded searching (author, title, subject, classification code, language) and Boolean logic (using AND, OR, or NOT). Search results are faceted enabling the reader to refine their query sans a complicated query syntax. Because the cached content includes only English language materials, the index is only 33,000 items in size.
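To give a flavour of what parsing those RDF files involves, here is a minimal, self-contained sketch using only Python's standard library. The sample record below is a simplified stand-in for a real Project Gutenberg catalog entry (the real files are far richer and use additional pgterms elements), so treat the element names and structure as illustrative assumptions rather than the actual schema my suite of software handles:

```python
# Sketch: pull title and language out of a (simplified) Project
# Gutenberg-style RDF/XML record using only the standard library.
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
}

# A hypothetical, simplified catalog record for illustration only.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="ebooks/1">
    <dcterms:title>The Declaration of Independence</dcterms:title>
    <dcterms:language>en</dcterms:language>
  </rdf:Description>
</rdf:RDF>"""

def parse_record(xml_text):
    """Return (title, language) from one catalog record."""
    root = ET.fromstring(xml_text)
    desc = root.find("rdf:Description", NS)
    title = desc.findtext("dcterms:title", namespaces=NS)
    language = desc.findtext("dcterms:language", namespaces=NS)
    return title, language

print(parse_record(SAMPLE))
```

Records extracted this way could then feed the database of desired content described above, with a language filter keeping only the English-language items.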

Project Gutenberg & the Distant Reader

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process. Project Gutenberg and the Distant Reader can be used hand-in-hand.

As described in a previous posting, the Distant Reader can take five different types of input. [5] One of those inputs is a file where each line in the file is a URL. My locally implemented mirror of Project Gutenberg enables the reader to search & browse in a manner similar to the canonical version of Project Gutenberg, but with two exceptions. First & foremost, once a search has been run against my mirror, one of the resulting links is “only local URLs”. For example, below is an illustration of the query “love AND honor AND truth AND justice AND beauty”, and the “only local URLs” link is highlighted:


Search result

By selecting the “only local URLs” link, a list of… URLs is returned, like this:



This list of URLs can then be saved as a file, and any number of things can be done with it. For example, there are Google Chrome extensions for the purposes of mass downloading. The file of URLs can be fed to command-line utilities (e.g. curl or wget), also for the purposes of mass downloading. In fact, assuming the file of URLs is named love.txt, the following command will download the files in parallel and really fast:

cat love.txt | parallel wget
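If GNU parallel is not available, the same mass download can be sketched in Python using only the standard library. This is an illustrative sketch, not part of the Reader's toolchain, and its filename logic (taking the last path segment of each URL) is a naive assumption:

```python
# Sketch: download every URL listed in a file (one per line), in parallel.
# Hypothetical helper, not part of the Distant Reader.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def read_urls(path):
    """Return the non-blank lines of a file of URLs."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

def fetch(url):
    """Save one URL to the current directory; return the local filename."""
    # Naive assumption: name the file after the last path segment.
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
    with urllib.request.urlopen(url) as response, open(name, "wb") as out:
        out.write(response.read())
    return name

def fetch_all(urls, workers=8):
    """Download URLs concurrently, like `cat love.txt | parallel wget`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))

# Usage (requires network access):
# fetch_all(read_urls("love.txt"))
```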

This same file of URLs can be used as input against the Distant Reader, and the result will be a “study carrel” where the whole corpus could be analyzed — read. For example, the Reader will extract all the nouns, verbs, and adjectives from the corpus. Thus you will be able to answer what and how questions. It will pull out named entities and enable you to answer who and where questions. The Reader will extract keywords and themes from the corpus, thus outlining the aboutness of your corpus. From the results of the Reader you will be set up for concordancing and machine learning (such as topic modeling or classification), thus enabling you to search for more narrow topics or “find more like this one”. The search for love, etc. returned more than 8000 items. Just under 500 of them were returned in the search result, and the Reader empowers you to read all 500 of them at one go.


Project Gutenberg is a very useful resource because the content is: 1) free, and 2) transcribed. Mirroring Project Gutenberg is not difficult, and by doing so its interface can be enhanced. Project Gutenberg items are perfect fodder for reading & analysis by the Distant Reader. Search Project Gutenberg, save the results as a file, feed the file to the Reader and… read the results at scale.

Notes and links

† All puns are intended.

[1] Michael Hart in Roanoke (Indiana) – video:; blog posting:

[2] The various Project Gutenberg feeds, including the RDF is located at

[3] The suite of software to cache and index Project Gutenberg is available on GitHub at

[4] My full text index to the English language texts in Project Gutenberg is available at

[5] The Distant Reader and its five different types of input –

Editorial / Code4Lib Journal

If you build it, I'll probably come.

MatchMarc: A Google Sheets Add-on that uses the WorldCat Search API / Code4Lib Journal

Lehigh University Libraries has developed a new tool for querying WorldCat using the WorldCat Search API.  The tool is a Google Sheet Add-on and is available now via the Google Sheets Add-ons menu under the name “MatchMarc.” The add-on is easily customizable, with no knowledge of coding needed. The tool will return a single “best” OCLC record number, and its bibliographic information for a given ISBN or LCCN, allowing the user to set up and define “best.” Because all of the information, the input, the criteria, and the results exist in the Google Sheets environment, efficient workflows can be developed from this flexible starting point. This article will discuss the development of the add-on, how it works, and future plans for development.

Designing Shareable Tags: Using Google Tag Manager to Share Code / Code4Lib Journal

Sharing code between libraries is not a new phenomenon and neither is Google Tag Manager (GTM). GTM launched in 2012 as a JavaScript and HTML manager with the intent of easing the implementation of different analytics trackers and marketing scripts on a website. However, it can be used to load other code using its tag system onto a website. It’s a simple process to export and import tags facilitating the code sharing process without requiring a high degree of coding experience. The entire process involves creating the script tag in GTM, exporting the GTM content into a sharable export file for someone else to import into their library’s GTM container, and finally publishing that imported file to push the code to the website it was designed for. This case study provides an example of designing and sharing a GTM container loaded with advanced Google Analytics configurations such as event tracking and custom dimensions for other libraries using the Summon discovery service. It also discusses processes for designing GTM tags for export, best practices on importing and testing GTM content created by other libraries and concludes with evaluating the pros and cons of encouraging GTM use.

Reporting from the Archives: Better Archival Migration Outcomes with Python and the Google Sheets API / Code4Lib Journal

Columbia University Libraries recently embarked on a multi-phase project to migrate nearly 4,000 records describing over 70,000 linear feet of archival material from disparate sources and formats into ArchivesSpace. This paper discusses tools and methods brought to bear in Phase 2 of this project, which required us to look closely at how to integrate a large number of legacy finding aids into the new system and merge descriptive data that had diverged in myriad ways. Using Python, XSLT, and a widely available if underappreciated resource—the Google Sheets API—archival and technical library staff devised ways to efficiently report data from different sources and present it in an accessible, user-friendly way. Responses were then fed back into automated data remediation processes to keep the migration project on track and minimize manual intervention. The scripts and processes developed proved very effective, and moreover, show promise well beyond the ArchivesSpace migration. This paper describes the Python/XSLT/Sheets API processes developed and how they opened a path to move beyond CSV-based reporting with flexible, ad-hoc data interfaces easily adaptable to meet a variety of purposes.

Natural Language Processing in the Humanities: A Case Study in Automated Metadata Enhancement / Code4Lib Journal

The Black Book Interactive Project at the University of Kansas (KU) is developing an expanded corpus of novels by African American authors, with an emphasis on lesser known writers and a goal of expanding research in this field. Using a custom metadata schema with an emphasis on race-related elements, each novel is analyzed for a variety of elements such as literary style, targeted content analysis, historical context, and other areas. Librarians at KU have worked to develop a variety of computational text analysis processes designed to assist with specific aspects of this metadata collection, including text mining and natural language processing, automated subject extraction based on word sense disambiguation, harvesting data from Wikidata, and other actions.

“With One Heart”: Agile approaches for developing Concordia and crowdsourcing at the Library of Congress / Code4Lib Journal

In October 2018, the Library of Congress launched its crowdsourcing program By the People. The program is built on Concordia, a transcription and tagging tool developed to power crowdsourced transcription projects. Concordia is open source software designed and developed iteratively at the Library of Congress using Agile methodology and user-centered design. Applying Agile principles allowed us to create a viable product while simultaneously pushing at the boundaries of capability, capacity, and customer satisfaction. In this article, we share more about the process of designing and developing Concordia, including our goals, constraints, successes, and next steps.

Talking Portraits in the Library: Building Interactive Exhibits with an Augmented Reality App / Code4Lib Journal

With funding from multiple sources, an augmented-reality application was developed and tested by researchers to increase interactivity for an online exhibit. The study found that augmented reality integration into a library exhibit resulted in increased engagement and improved levels of self-reported enjoyment. The study details the process of the project including describing the methodology used, creating the application, user experience methods, and future considerations for development. The paper highlights software used to develop 3D objects, how to overlay them onto existing exhibit images and added interactivity through movement and audio/video syncing.

Factor Analysis For Librarians in R / Code4Lib Journal

This paper offers a primer in the programming language R for library staff members to perform factor analysis. It presents a brief overview of factor analysis and walks users through the process from downloading the software (R Studio) to performing the actual analysis. It includes limitations and cautions against improper use.

Announcing the First International DSpace-CRIS User Group Meeting / DuraSpace News

From Susanna Mornati, 4Science

4Science, together with The Library Code and other organizations in the DSpace-CRIS community - such as the Hamburg University of Technology, the Fraunhofer Gesellschaft, the Georg-August-University Goettingen, the Otto-Friedrich-University Bamberg, the University of Bern, and the University of Trieste - invites all institutions interested in DSpace-CRIS, the free open-source Research Information Management System (aka CRIS/RIMS), to join the first International DSpace-CRIS User Group Meeting to be held in Muenster (Germany, EU) on November 18, 2019.

The event is free and is organized in the framework of the euroCRIS Membership Meeting that will be held on November 18-20, 2019 at the University of Münster, Germany, EU. Details about the euroCRIS registration, program, venue, transport, and accommodation are available online. Participation in the euroCRIS event itself is also strongly encouraged.

The DSpace-CRIS User Group Meeting is scheduled for Monday, November 18 from 14:00 to 17:00 at Johannisstr. 8-10, room KTh IV, Muenster, with thanks to the generous hospitality of the University of Muenster.

14:00 – 14:10 Introduction
14:10 – 14:30 DSpace-CRIS roadmap
14:30 – 15:30 Experiences from participants
15:30 – 15:40 Break
15:40 – 16:40 Discussion with participants
16:40 – 17:00 Future plans and conclusions


• Introduction and DSpace-CRIS roadmap, future plans and conclusions: Susanna Mornati and Andrea Bollini, 4Science (Italy), Pascal Becker, The Library Code (Germany)

• Experiences from participants and discussion:

Beate Rajski and Oliver Goldschmidt, Hamburg University of Technology (Germany)
Daniel Beucke, Georg-August University of Goettingen (Germany)
Michael Erndt and Dirk Eisengräber-Pabst, Fraunhofer Gesellschaft (Germany)
Steffen Illig, Otto-Friedrich-University Bamberg (Germany)
Anna Keller, University of Bern (Switzerland)
Jordan Piščanc, University of Trieste (Italy)

Other participants are invited to share their experiences and wish-list (open discussion)

Registration and details at:

Looking forward to meeting you at the DSpace-CRIS User Group Meeting!


Nominate a Colleague Doing Cutting Edge Work in Tech Education for the LITA Library Hi Tech Award / LITA

Nominations are open for the 2020 LITA/Library Hi Tech Award, which is given each year to an individual or institution for outstanding achievement in educating the profession about cutting edge technology within the field of library and information technology. Sponsored by the Library and Information Technology Association (LITA) and Library Hi Tech, the award includes a citation of merit and a $1,000 stipend provided by Emerald Publishing, publishers of Library Hi Tech. The deadline for nominations is December 31, 2019.

The award, given to either a living individual or an institution, may recognize a single seminal work or a body of work created during or continuing into the five years immediately preceding the award year. The body of work need not be limited to published texts but can include course plans or actual courses and/or non-print publications such as visual media. Awards are intended to recognize living persons rather than to honor the deceased; therefore, awards are not made posthumously. More information and a list of previous winners can be found on the LITA website.

Nominations must include the name(s) of the recipient(s), basis for nomination, and references to the body of work and should be submitted using the online nomination form.

The award will be presented at the LITA President’s Program during the 2020 Annual Conference of the American Library Association in Chicago, IL.

About LITA

The Library and Information Technology Association (LITA) is the leading organization reaching out across types of libraries to provide education and services for a broad membership of nearly 2,400 systems librarians, library technologists, library administrators, library schools, vendors, and many others interested in leading edge technology and applications for librarians and information providers. Follow us on our Blog, Facebook, or Twitter.

About Emerald Publishing

Founded in 1967, Emerald Publishing today manages a range of digital products, a portfolio of nearly 300 journals, more than 2,500 books and over 450 teaching cases. More than 3,000 Emerald articles are downloaded every hour of every day. The network of contributors includes over 100,000 advisers, authors and editors. Globally, Emerald has an extraordinary reach with 12 offices worldwide and more than 4,000 customers in over 120 countries. Emerald is COUNTER 4 compliant. It is also a partner of the Committee on Publication Ethics (COPE) and works with Portico and the LOCKSS initiative for digital archive preservation. It also works in close collaboration with a number of organizations and associations worldwide.


Jenny Levine

Executive Director

Library and Information Technology Association

Meet RO-Crate / Peter Sefton

By Peter Sefton

This presentation was given by Peter Sefton at the eResearch Australasia 2019 Conference in Brisbane, on the 24th of October 2019.

Meet RO-Crate

This presentation is part of a series of talks delivered here at eResearch Australasia - so it won’t go back over all of the detail already covered - see the introduction of DataCrate in 2017 and the 2018 update. The standard formerly known as DataCrate has been subsumed into a new standard called Research Object Crate - RO-Crate for short.

Eoghan Ó Carragáin (chair), Peter Sefton (co-chair), Stian Soiland-Reyes (co-chair), Oscar Corcho, Daniel Garijo, Raul Palma, Frederik Coppens, Carole Goble, José María Fernández, Kyle Chard, Jose Manuel Gomez-Perez, Michael R Crusoe, Ignacio Eguinoa, Nick Juty, Kristi Holmes, Jason A. Clark, Salvador Capella-Gutierrez, Alasdair J. G. Gray, Stuart Owen, Alan R Williams

This is a recent snapshot of the makeup of the current RO-Crate team- compiled by Stian.

What is RO-Crate? RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.

The website says: RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.
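To make "annotations in JSON-LD" concrete, here is a rough sketch of a minimal RO-Crate metadata file: a flattened @graph containing a metadata descriptor, the root Dataset, and one file entry. The context URL, names and file paths below are illustrative assumptions based on the draft spec, not a normative example; consult the specification for the exact form required by each version.

```
{
  "@context": "https://w3id.org/ro/crate/1.0/context",
  "@graph": [
    {
      "@id": "ro-crate-metadata.jsonld",
      "@type": "CreativeWork",
      "about": {"@id": "./"}
    },
    {
      "@id": "./",
      "@type": "Dataset",
      "name": "Example dataset",
      "description": "A folder of data described as an RO-Crate (illustrative)."
    }
  ]
}
```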

2017-06-16 Cry for help! Cameron Neylon: As a researcher...
2017-07-02 Research Data Crate started
2017-10-12 DataCrate 0.1
2018-03-22 DataCrate 0.2
2018-03-22 RDA BoF: Approaches to Research Data Packaging
2018-08-06 DataCrate 0.3
2018-09-11 Calcyte 0.3.0
2018-09-27 DataCrate 1.0
2018-10-02 npm install calcyte@1.0.0
2018-10-29 Workshop on Research Object RO2018
2019-02-13 RO Lite 0.1
2019-03-28 First RO-Lite community call
2019-05-02 RO-Crate use case gathering
2019-05-30 Google Docs-mode
2019-06-07 Open Repositories workshop: Research Data Packaging
2019-08-23 npm install calcyte@1.0.6
2019-09-12 RO-Crate 0.2
2019-09-24 Workshop on Research Object RO2019
2019-11-?? RO-Crate 1.0

This is a timeline for the merging of the Research Object packaging work with DataCrate - again compiled by Stian. While our DataCrate work was driven by practical concerns and a desire to describe research data with high-quality metadata Research Object shared those concerns but with more of a focus on reproducibility and detailed provenance for research data.


This is what an RO-Crate looks like if you open the HTML file that’s in the root directory (or you see one on the web).


This is the home page for RO-Crate.


Where did RO-Crate come from? RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see background.

(Slide: a grid of profession emoji, from Man Health Worker through to Mrs. Claus.)

Who is it for?

The RO-Crate effort brings together practitioners from very different backgrounds, and with different motivations and use-cases. Among our core target users are: a) researchers engaged in computational, data-intensive, workflow-driven analysis; b) digital repository managers and infrastructure providers; c) individual researchers looking for a straightforward tool or how-to guide to “FAIRify” their data; d) data stewards supporting research projects in creating and curating datasets.


RO-Crate is a collaboration between people all over the world, but the Editors are from Cork, Manchester and Katoomba. Version one of the standard will be out by Summer. But which summer? Standard reference points are important. Standards are important.

Which brings us to the benefits of standards. Without this standardised date format, chaos would reign. What if that date had been written 05/08 or 08/05 - someone might end up eating food from May in August, or worse, eating last August’s food in May.

Anyway, if you find a partner who’ll adopt the ISO 8601 date standard then ...

… you should marry them.

Like how we married the Research Object and DataCrate - we bonded over standardisation.


Let’s explore standards a bit more. If you see this in metadata - what does it mean?

Is it a name given to the resource? URI:

An honorific like Ms, or Dr? As it would be in the FOAF ontology.

Or a very specific meaning relating to job titles? As in

In RO-Crate there’s an HTML page which ships with each dataset, allowing you to browse the object in as much detail as the author described it, and we are careful to avoid ambiguity by adding help links to each metadata term so you can see its definition.


Just wanted to shout out to ResearchGraph - led by Amir Aryani at Swinburne Uni - they are also using

🖥️ 👩🏾‍🔬

RO-Crates ship with two files, a human readable one and a machine readable JSON file. The two views (human and machine) of the data are equivalent - in fact the HTML version is generated from the JSON-LD version, via the DataCrate nodejs library.


And here’s an automatically generated diagram extracted from the sample DataCrate showing how two images were created. The first result was an image file taken by me (as an agent) using two instruments (my camera and lens), of a place (the object: Catalina park in Katoomba). A sepia toned version was the result of a CreateAction, with the instrument this time being the ImageMagick software. The DataCrate also contains information about that CreateAction such as the command used to do the conversion and the version of the software-as-instrument.

convert -sepia-tone 80% test_data/sample/pics/2017-06-11\ 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg

This way of representing file provenance is Action-centred - the focus is on the action that creates a file, rather than the more usual metadata approach of having the file at the centre with properties for “Author” and the like. The action-based approach is MUCH more flexible as it can model the contribution of multiple agents and instruments separately at the expense of being somewhat counter-intuitive to those of us who are used to a library-card approach to metadata where the work is at the centre and has simple properties.
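As a hedged sketch of what such an Action-centred description might look like in the crate's JSON-LD, the sepia conversion above could be expressed roughly as follows; the @id values ("#sepia_conversion", "#pt", "#ImageMagick") are illustrative inventions, not taken from the actual sample DataCrate:

```
{
  "@id": "#sepia_conversion",
  "@type": "CreateAction",
  "name": "Sepia-tone conversion",
  "agent": {"@id": "#pt"},
  "instrument": {"@id": "#ImageMagick"},
  "object": {"@id": "test_data/sample/pics/2017-06-11 12.56.14.jpg"},
  "result": {"@id": "test_data/sample/pics/sepia_fence.jpg"},
  "description": "convert -sepia-tone 80% test_data/sample/pics/2017-06-11 12.56.14.jpg test_data/sample/pics/sepia_fence.jpg"
}
```

Note how the agent (a person), the instrument (the software), the input object and the output result each hang off the action itself, rather than off the file.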

There was a question after this presentation about whether I had the arrows in this diagram pointing in the right direction. Yes, I do! The convention here is the standard way of representing a subject-predicate-object semantic triple with the subject as the source of the arrow, the predicate (in this case property) as a label, and the pointy end pointing at the object.

🐥

What’s new / developing at the moment in the RO-Crate world? I will illustrate by looking at recent activity on our Github project.


We’re working on ways to describe not just files, but the CONTENTS of files - using properties like variableMeasured.


We have a way to describe a workflow


and actions that can be performed on data such as firing up a computational environment to re-run the workflow.


You too can add Use Cases like this one about software containers.


Breaking news: in the last couple of months Marco La Rosa, an independent developer working for PARADISEC, has ported 10,000 data and collection items into RO-Crate format, AND built a portal which can display them. This means that ANY repository with a similar structure of Items in Collections could easily re-use the code and the viewers for various file types.


This shows an interlinear transcription where you can play various segments of a recording and see the transcription.


The .eaf files in the previous example are produced using ELAN software. Marco has done the groundwork for a system that could work across multiple repositories and for stand-alone RO-Crates - the crate metadata describes the files, and what format they’re in, and the viewer which is an HTML page either served by a repository or possibly just off your hard disk, can use that information to load an appropriate viewer.


RO-Crate version 1 will be released in November 2019 - we were aiming for October, but missed that.

We will publish the parts that are well-tested and stable, and immediately start on a new version with bleeding-edge cases.

We want input from potential users and from current and prospective implementers, and help drafting new parts of the spec is welcome.

You can join the team

This work is licensed under a Creative Commons Attribution 3.0 Australia License.

FAIR Simple Scalable Static Research Data Repository / Peter Sefton

This presentation was given by Peter Sefton & Michael Lynch at the eResearch Australasia 2019 Conference in Brisbane, on the 24th of October 2019.

FAIR Simple Scalable Static Research Data Repository Dr Peter Sefton and Michael Lynch University of Technology Sydney

Welcome - we’re going to share this presentation. Peter/Petie will talk through the two major standards we’re building on, and Mike will talk about the software stack we ended up with.

The project in a nutshell: a static, file-based research data repository platform using open standards and off-the-shelf web technology.
  • OCFL – versioned file storage
  • RO-Crate – dataset / object metadata
  • Solr – index and discovery
  • nginx – baked-in access control

This project is about building highly scalable research data repositories quickly, cheaply and above all sustainably by using Standards for organizing and describing data.


We had a grant to continue our OCFL work from the Australian Research Data Commons. (I’ve used the new Research Organisation Registry (ROR) ID for ARDC, just because it’s new and you should all check out the ROR).


OCFL Specifications

This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories. Specifically, the benefits of the OCFL include:
  • Completeness, so that a repository can be rebuilt from the files it stores

  • Parsability, both by humans and machines, to ensure content can be understood in the absence of original software

  • Robustness against errors, corruption, and migration between storage technologies

  • Versioning, so repositories can make changes to objects allowing their history to persist

  • Storage diversity, to ensure content can be stored on diverse storage infrastructures including conventional filesystems and cloud object stores


TODO  Progressively drill down on the architecture until it’s “just”  a partition.

Here’s a screenshot of what an OCFL object looks like - it’s a series of versioned directories, each with a detailed inventory.
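As a much-simplified illustration of the inventory idea (this is not a spec-complete OCFL 1.0 inventory, and the digest values are truncated placeholders, not real sha512 sums): each version records a state mapping content digests to logical file names, while the manifest maps digests to the actual content paths on disk.

```python
import json

# A toy sketch of an OCFL object's inventory. Here v2 replaces the
# content of data.csv, so the logical name stays the same but the
# digest (and the stored content path) changes.
inventory = {
    "id": "ark:/example/obj1",
    "digestAlgorithm": "sha512",
    "head": "v2",
    "manifest": {
        "cf83e1...": ["v1/content/data.csv"],
        "9b71d2...": ["v2/content/data.csv"],
    },
    "versions": {
        "v1": {"state": {"cf83e1...": ["data.csv"]}},
        "v2": {"state": {"9b71d2...": ["data.csv"]}},
    },
}

def files_in(version):
    """List the logical files present in a given version."""
    state = inventory["versions"][version]["state"]
    return sorted(p for paths in state.values() for p in paths)

print(json.dumps(files_in("v2")))
```

Because every version's state survives in the inventory, the whole history of the object can be rebuilt from the files alone — the "Completeness" benefit above.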


One of the standards we are using is RO-Crate - for describing research data sets. I presented this at eResearch as well [TODO - link]


This is an example of an RO-Crate showing that each Crate has a human-readable HTML view as well as a machine readable view.

🖥️ 👩🏾‍🔬

The two views (human and machine) of the data are equivalent - in fact the HTML version is generated from the JSON-LD version using a tool called CalcyteJS.


This is a screenshot of work very much in progress - it shows an example of the repository system working at the smallest scale: a single collection, “Farms to Freeways”, a social history project from Western Sydney, which we have exported into RO-Crate format as a demonstration. Each of the participants has been indexed for discovery. In a more typical deployment for an institutional repository, datasets would be indexed at the top level only. The point is to show that this software will be highly configurable.


OCFL needs some explaining. I’ve had a couple of conversations with developers where it takes them a little while to get what it’s for.


But they DO get it: the standard is well designed.

Over to Mike ...

Off-the-shelf components

Solr is an efficient search engine.

nginx is an industry-standard scalable web server, used by companies like DropBox and Netflix

Both are standard, open-source, easy to deploy and keep patched: unlike dedicated data repositories, which tend to be fussy and make your server team swear.

ocfl-nginx resolves a URL:

https://my.repo/3eacb986d1.v4/PATH/TO/FILE.html

to a file in an OCFL repository:

/mnt/ocfl/3e/ac/b9/86/d1/v4/content/PATH/TO/FILE.html

Mapping incoming URLs to the right file in the ocfl repository is straightforward and done with an extension in nginx’s minimal flavour of JavaScript. This slide simplifies things a bit: in real life we have URL ids which solr maps to OIDs and then to a pairtree path.
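Our real mapping is done in nginx's njs, but the pairtree idea can be sketched in a few lines of Python (the mount point and function name here are illustrative, not part of the actual implementation):

```python
import textwrap

OCFL_ROOT = "/mnt/ocfl"  # assumed mount point, as on the slide

def pairtree_path(oid, version, filepath):
    """Map an object id + version + file path to its OCFL content path.

    The pairtree splits the object id into two-character directory
    pairs, so ids spread evenly across the filesystem.
    """
    pairs = textwrap.wrap(oid, 2)  # "3eacb986d1" -> ["3e","ac","b9","86","d1"]
    return "/".join([OCFL_ROOT, *pairs, version, "content", filepath])

print(pairtree_path("3eacb986d1", "v4", "PATH/TO/FILE.html"))
```

Spreading objects across shallow two-character directories keeps any single directory from accumulating thousands of entries, which matters on conventional filesystems.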

Todo: we want to use the Memento standard so that clients can request versioned resources.

We are also looking at versioned DOIs pointing to versioned URLs and resources

Code

  • ocfl-js - a Node library for building and updating OCFL repositories
  • ro-crate-js - a Node library for working with RO-Crates
  • ocfl-nginx - an extension to nginx allowing it to serve versioned OCFL content
  • a Docker image for ocfl-nginx
  • solr-catalog - a Node library for indexing an OCFL repository into a Solr index
  • data-portal - a single-page application for searching a Solr index

The codebase is spread across a lot of repositories, but that's consistent with the approach - they are all just components which we can deploy as we need them.

The nginx extension is very small and would be easy to reimplement against another server

Access control:
  • licences on datasets in RO-Crate
  • nginx authenticates users
  • a local group service maps users to licences
  • nginx enforces access on search results and payloads

This is the most prototypical / primitive part of what we’ve got so far.

Licences on RO-Crate are indexed in the solr index. nginx authenticates web users, looks up which licences they can access, and applies access control to both search results and payloads.

At the moment, we’ve got a test server which doesn’t authenticate but which only serves datasets with a public licence and denies access to everything else.
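That guest-only behaviour amounts to a simple licence filter. A toy sketch follows; the licence URLs, dataset ids and function names are invented for illustration and this is not the actual UTS implementation:

```python
# An unauthenticated guest holds only the public licence; anything
# carrying a different licence is filtered out of results entirely.
PUBLIC = "https://creativecommons.org/licenses/by/4.0/"

def visible_datasets(datasets, user_licences):
    """Keep only datasets whose licence the user is allowed to see."""
    return [d for d in datasets if d["licence"] in user_licences]

datasets = [
    {"id": "farms-to-freeways", "licence": PUBLIC},
    {"id": "internal-survey", "licence": "https://example.org/internal"},
]

print(visible_datasets(datasets, {PUBLIC}))
```

Once authentication is wired in, the guest's single-element licence set is simply replaced by whatever set the group service returns for that user.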

Access control at its most basic

The screenshot on the left is a Solr query showing public and internal licences

The screenshot on the right is a basic web view of what nginx serves to an unauthenticated guest user - datasets with internal licences aren’t shown

Development strategy:
  • Agile development around well-designed data standards pays off
  • Successful collaborations with PARADISEC and the State Library of New South Wales, showing the feasibility and ease-of-use of both OCFL and RO-Crate
  • It’s worth engaging with standards at the new/evolving stage, even if this requires a bit of running around to keep up

Good data standards make incremental development much easier.

We were able to get real results in one- and two-day workshops with teams from PARADISEC and the State Library of New South Wales, both with large, structured digital humanities collections behind APIs.

Both the OCFL and RO-Crate standards are new and changing, but agile development means that it’s OK and even productive to keep pace with this and feed back into community consultation.


In the last couple of months Marco La Rosa, an independent developer working for PARADISEC, has ported 10,000 data and collection items into RO-Crate format, AND built a portal which can display them. This means that ANY repository with a similar structure (Items in Collections) could easily re-use the code and the viewers for various file types.


The Mitchell Collection - digitised public domain books with detailed metadata in METS and specialised OCR standards. We spent a day at the State Library and were able to successfully extract books into directories of JPEGs and metadata, package these using RO-Crate and start building an OCFL repository.

Acknowledgements and links
UTS: Moises Sacal Bonequi
PARADISEC: Marco De La Rosa, Nick Thieberger
State Library of New South Wales: Euwe Ermita
Docker: mikelynch/nginx-ocfl

Creative Commons Licence
This work is licensed under a Creative Commons Attribution 3.0 Australia License.

When libraries and librarians pretend to be neutral, they often cause harm / Meredith Farkas

libraries are not neutral pin

Two recent events made me think (again) about the toxic nature of “library neutrality” and the fact that, more often than not, neutrality is whiteness/patriarchy/cis-heteronormativity/ableism/etc. parading around as neutrality and causing harm to folks from historically marginalized groups. The insidious thing about whiteness and these other dominant paradigms is that they are largely invisible to people in the dominant groups. It’s depressing to say this, but I sometimes feel grateful for the antisemitic macroaggressions and microaggressions I’ve been a victim of over the years because they opened my eyes to what it feels like to be othered and bullied and made me more sensitive to when it happens to others. That doesn’t mean I don’t get things wrong plenty of the time and cause harm unintentionally (we all do), but I am trying to be better because I don’t want anyone to feel the way I did when I was a target.

The first event that got me thinking about this is the fact that the Toronto Public Library, against a flurry of opposition, allowed the feminist and transphobic Megan Murphy to give a talk in one of their meeting rooms entitled “Gender Identity: What does It Mean for Society, the Law and Women?” Murphy is on a crusade to “protect” women and children from transwomen who seek to use women-only facilities like bathrooms or locker rooms. She has been banned on Twitter for her transphobia and misgendering in the past. TPL already has a pretty robust room booking policy that says —

Contracting Party’s event will not promote, or have the effect of promoting, discrimination, contempt or hatred for any group or person on the basis of race, ethnic origin, place of origin, citizenship, colour, ancestry, language, creed (religion), age, sex, gender identity, gender expression, marital status, family status, sexual orientation, disability, political affiliation, membership in a union or staff association, receipt of public assistance, level of literacy or any other similar factor.

But TPL didn’t see this event as something that promoted discrimination, contempt, or hatred. According to the City Librarian of TPL, Vickery Bowles, their stated purpose was “to have an educational and open discussion on the concept of gender identity and its legislation ramifications on women in Canada.” Now, let’s imagine that we could go in a time machine to the past. Can you imagine some of these titles being discussed in libraries?

“to have an educational and open discussion on the concept of blacks living in white neighborhoods and its ramifications on the safety of white women in the United States.”

“to have an educational and open discussion on the concept of Jews as teachers and its ramifications on our impressionable children in Germany.”

Clearly I’m dense because I can’t see a difference between any of these lecture topics. It is treating the existence and/or civil rights of one group as something that is 1) up for debate and 2) a danger to others. I’m baffled how anyone could not see such a talk as something “promoting, discrimination, contempt or hatred,” and yet Vickery Bowles is being treated like a hero for standing up against censorship in a number of publications (see Kris Joseph’s excellent blog post for links to a few of them). For more on the TPL controversy, other excellent blog posts you may want to consult are authored by —

Another thing came up this week on an occasion that should have been such a positive one. OLA Quarterly, the official publication of the Oregon Library Association (of which I’m a member and served on its Board last year) came out with a mostly fantastic issue focused on Equity, Diversity, and Inclusion. I’ve read it cover to cover and was so impressed with the way library workers in our state and in all sorts of positions in their organizations have made efforts (big and small) to improve diversity, equity, and inclusion. There’s some great stuff in the issue. Unfortunately, it ended with an article entitled “Yes, but … One Librarian’s Thoughts About Doing It Right” by Heather McNeil. I’m sure most of you can guess that with a title like that, no good can come, and you’d be so very right.

Honestly, the only positive thing I can see ever coming from this article is that when someone asks in the future what people mean by white fragility or by the idea of white people centering themselves in conversations about diversity, I have something to point to. Truly, I’ve seen no clearer example. It’s hard for me to imagine what would possess a librarian with a long and celebrated career as a children’s librarian to write something so uncollegial, offensive, and dismissive of diversity (not to mention poorly written and supported) as her parting gift to the profession upon her retirement. I can only imagine that her feeling that we have “overcorrect[ed] ourselves” on issues of diversity was so strong that she believed she was doing us all a favor in sharing it. And if that isn’t whiteness in its purest form, I don’t know what is. Her misrepresentations of criticisms of Dr. Seuss books, of Dr. Debbie Reese’s speech (the text of which is available so you can form your own conclusions), of the blog Reading While White, and of others trying to improve the diversity of books in libraries, celebrate diverse books, and critique whiteness in libraries were egregious and mostly unsupported.

Like others, I wrote a letter to the editors of OLA Quarterly, which I also shared on Twitter and on our state library listserv. My hope is that the editors will address this issue publicly and revisit their editorial standards so something this unprofessional is never published in OLA Quarterly again. However, what troubles me most is that lots of people read this article prior to its inclusion in the issue and thought it appropriate for publication. Again, clear evidence of how invisible whiteness can be to people who are white.

McNeil argues in her article that the Caldecott Committee does not consider the race or ethnicity of the author in their voting, but that’s pretty much impossible in a racist society. What we find beautiful and touching and important is very much based on our worldview, which, when we’ve been baked in a racist society, is influenced by whiteness. And based on McNeil’s article, it’s clear that some people are more aware of their problematic biases than others. It left me wondering whether members of the Newbery and Caldecott Committees are given implicit bias training so they can be more aware of how their biases impact their views of each book. If not, they absolutely should.

What strikes me about both of these issues is the fundamental lack of empathy expressed for people from historically marginalized groups. McNeil seems to worry much more about libraries with limited budgets (who might not want to buy diverse books that she believes won’t circulate) and Dr. Seuss lovers than about young children of color who might be impacted by racist caricatures or a lack of books in their library’s collection featuring protagonists who look like them. In the case of Toronto, even if the Library decided to hold firm on allowing the meeting room to be used on intellectual freedom grounds, they could have provided affirmation for their trans patrons in the form of statements and programming. That City Librarian Bowles would not even deign to acknowledge that trans women are women suggests to me that there is nothing “neutral” about the library’s stance. The fact that they see the question of whether trans women are women as an academic question that could reasonably be up for debate speaks volumes.

I can’t even fathom what all this feels like for LGBTQ+ staff at the Toronto Public Library who are not only being harmed by this, but in whose names these harms are being perpetrated. I felt angry about the article in OLA Quarterly on behalf of those whose needs and legitimate claims were being minimized and dismissed by McNeil, but I also felt like it made all Oregon library workers look bad. It made me feel embarrassed to be an OLA member.

In both of these cases, supporting diversity, equity, and inclusion are seen as things that are nice to do, but are secondary to other values libraries hold, like intellectual freedom. I wrote about the tension between access & diversity and intellectual freedom in American Libraries and while I was not allowed to take a strong stand in that publication, I can say here that I unequivocally put people over ideals (especially people who are frequently victimized by institutions). To me, events by white supremacists or TERFs (trans-exclusionary radical feminists) are designed to repudiate the dignity and existence of marginalized groups and to make those groups feel unsafe. How can we say we welcome everyone into our libraries if we welcome folks who explicitly make people from marginalized groups feel unwelcome? But instead, libraries hide behind the idea of neutrality and not taking sides when clearly, TPL did choose a side. So did McNeil. So did I. And hanging onto your supposed neutrality only ensures that your behavior and choices are going to be influenced by whiteness/patriarchy/cis-heteronormativity/ableism/etc.

Key to stopping situations like this from happening is helping people become aware of their own biases and privilege, but clearly that is a difficult pill for many white library workers to swallow. I was asked last Spring to serve on an Oregon Library Association Equity, Diversity, and Inclusion (EDI) Task Force that is going to have its first meeting soon. I was originally really excited to serve on this group because I could see that libraries and library workers in the state needed educational tools that facilitate open discussions and encourage critical reflection about EDI issues and privilege. I could imagine creating a multi-modal learning program where people read articles, watch videos, critically reflect on their own blogs, and participate in F2F or virtual group discussions. After this week, that need is even more glaring. When I saw that our charge was focused on creating an EDI plan, I worried that we would be simply creating a meaningless document that the OLA Board will file away and maybe develop a few long-term goals around. I hope I’m wrong and we really move the needle on EDI in the state. I think I’ve just been burned too many times when working to create transformative planning documents that administrators just file away and ignore. I want to support meaningful work and I don’t want to feel so cynical about it.

What makes me hopeful is reading the other articles in this OLA Quarterly issue where library workers are moving the needle on making their libraries, collections, and the information ecosystem more diverse, equitable, and inclusive in ways large and small. There is great work happening in Oregon. I hope you’ll take the time to read some of their stories too and will amplify them more than McNeil’s terrible contribution.


Libraries are not neutral image credit: Zines by JC. The image is available here (where you can buy the button and some pretty great zines!)

Islandora/Fedora Camp in Arizona - Instructors Announced! / Islandora

The first Islandora event of 2020 will also be our first joint event with Fedora! From February 24 - 26, we will be partnering with LYRASIS and hosted by Arizona State University to bring you a three day camp packed with the latest in both Islandora and Fedora. Registration is now open!

Our focus will be on the latest versions of each, so this is an excellent opportunity to learn all about Islandora 8 and get some hands-on experience. The camp will be led by a group of experienced instructors with expertise spanning the front-end and code base of both platforms:

Melissa Anez has been working with Islandora since 2012 and has been the Community and Project Manager of the Islandora Foundation since it was founded in 2013. She has been a frequent instructor in the Admin Track and developed much of the curriculum, refining it with each new Camp. Lately she has been enjoying the challenge of fitting two versions of Islandora into a single day of workshops!

Danny Lamb has his B.Sc. in Mathematics and has been programming since before he could drive. He is currently serving as the Islandora Foundation's Technical Lead, and hopes to promote a collaborative and respectful environment where constructive criticism is encouraged and accepted. He is married with two children, and lives on beautiful Prince Edward Island, Canada. If he had free time, he'd be spending it in front of his kamado style grill.

Bethany Seeger is a software developer in the library at Amherst College, a liberal arts college in Massachusetts. She’s a Fedora committer, and is also the lead committer and release manager of the Islandora Enterprise (ISLE) project. She was an instructor at Fedora Camp in Austin, TX, and co-led the ISLE workshop at Islandoracon 2019. Bethany has lurked in the Islandora community for a while, watching Islandora 8 develop; during this time, she’s installed Islandora 7 (manually, and then using ISLE) and Islandora 8 (using Ansible). Currently she is working on migrating a custom Fedora 3 repository to Islandora 7 (using ISLE) with the hope of adopting Islandora 8 in the very near future. Bethany enjoys explaining complicated processes in plain English.

Seth Shaw jumped directly into developing with Islandora 8, and became a committer in 2018. He developed the Controlled Access Terms module and an ArchivesSpace integration module. He has been teaching workshops for over a decade but this will be his first Islandora Camp. His day job is as an Application Developer for Special Collections at the University of Nevada, Las Vegas.

David Wilcox is the Product Manager for Fedora at LYRASIS. He has been working with the Fedora and Islandora communities since 2011. David organizes camps and workshops for Fedora, where he is also frequently an instructor. The Arizona camp will be the first to feature both Fedora and Islandora, and he is excited for the opportunity to bring this new, combined camp to the community.


Meet Daniel Ouso, one of our Frictionless Data for Reproducible Research Fellows / Open Knowledge Foundation

The Frictionless Data for Reproducible Research Fellows Programme is training early career researchers to become champions of the Frictionless Data tools and approaches in their field. Fellows will learn about Frictionless Data, including how to use Frictionless Data tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content.

You can call me Daniel Ouso. My roots trace to the lake basin county of Homabay in the equatorial country in the east of Africa: Kenya. Currently, I live in its capital Nairobi – once known as “The Green City in the Sun”, although thanks to poor stewardship of Mother Nature this is now debatable. The name is Maasai for a place of cool waters.

But enough of beautiful Kenya. I work at the International Centre of Insect Physiology and Ecology as a bioinformatics expert within the Bioinformatics Unit, involved in bioinformatics training and genomic data management. I hold a Master of Science in Molecular Biology and Bioinformatics (2019) from Jomo Kenyatta University of Agriculture and Technology, Kenya. My previous work is in infectious disease management and a bit of conservation. My long-term interest is in disease genomics research.

I am passionate about research openness and reproducibility, which I gladly noticed as a common interest in the Frictionless Data Fellowship (FDF). I have had previous experience working on a Mozilla Open Science project that really piqued my interest in wanting to learn skills and to expand my knowledge and perspective in the area. To that destination, this fellowship advertised itself as the best vehicle, and it was a frictionless decision to board. My goal is to become a better champion for open-reproducible research by learning data and metadata specifications for interoperability, the associated programmes/libraries/packages and data management best practices. Moreover, I hope to discover additional resources, to network and exchange with peers, and ultimately share the knowledge and skills acquired.

Knowledge is cumulative and progressive, an infinite cycle, akin to a corn plant, which grows from a seed into a seed, helped in between by the effort of the farmer and other factors. Whether or not the subsequent seed will be replanted depends, among other things, on its quality. You may wonder where I am going with this, so here is the point: for knowledge to bear fruit it must be shared promiscuously, to be verified and to be built upon. The rate of research output is very fast, and so is the need for advancement of the research findings. However, the conclusions may at times be wrong. To improve knowledge, the goal of research is to deepen understanding and confirm findings and claims through reproduction. This depends on the contribution of many people from diverse places, so there is an obvious need to remove or minimise obstacles in the quest for research excellence. As a researcher, I believe that to keep up with the rate of research production, findings and the data behind them must be made available in a form that doesn’t antagonise their re-use and/or validation for further research. It means reducing friction on the research wheel by making research easier, cheaper and quicker to conduct, which will increase collaboration and prevent the reinvention of the wheel. To realise this, it is incumbent on me (and others) to make my contribution both as a producer and an affected party, especially seeing that exponentially huge amounts of biological data continue to be produced. Simply, improving research reproducibility is the right science of this age.

I am a member of The Carpentries community as an instructor and currently also in the task force planning the CarpentryCon2020, and hope to meet some of OKF community members there. I am excited to join this community as a Frictionless Data Fellowship! You can find important links and follow my fellowship here.

Frictionless Data for Reproducible Research Fellows Programme

More on Frictionless Data

The Fellows programme is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. Frictionless Data’s other current projects include the Tool Fund, in which four grantees are developing open source tooling for reproducible research. The Fellows programme will be running until June 2020, and we will post updates to the programme as they progress.

• Originally published at

Twitter / pinboard

RT @mjingle: Who's excited for the next #code4lib conference?! It will be in Pittsburgh, PA from March 8-11. Is your org interes…

Propose a Topic for the ITAL “Public Libraries Leading the Way” Column / LITA

Information Technology and Libraries (ITAL), the quarterly open-access journal published by ALA’s Library and Information Technology Association, is looking for contributors for its regular “Public Libraries Leading the Way” column. This column highlights a technology-based innovation or approach to problem solving from a public library perspective. Topics we are interested in include the following, but proposals on any other technology topic are welcome.

  • 3-D printing and makerspaces
  • Civic technology
  • Drones
  • Diversity, equity, and inclusion and technology
  • Privacy and cyber-security
  • Virtual and augmented reality
  • Artificial intelligence
  • Big data
  • Internet of things
  • Robotics
  • Geographic information systems and mapping
  • Library analytics and data-driven services
  • Anything else related to public libraries and innovations in technology

To propose a topic, use this brief form, which will ask you for three pieces of information:

  • Your name
  • Your email address
  • A brief (75-150 word) summary of your proposed column that describes your library, the technology you wish to write about and your experience with it.

Columns are in the 1,000-1,500 word range and may include illustrations. These will not be research articles, but are meant to share practical experience with technology development or uses within the library. Proposals are due by November 30, and selections will be made by December 15.

If you have questions, contact Ken Varnum, Editor, at

Anna Goben: Evergreen Contributor of the Month / Evergreen ILS

The Evergreen Outreach Committee is pleased to announce that October’s Contributor of the Month is Anna Goben of the Indiana State Library (ISL). Anna serves as the Evergreen Indiana Program Director and Associate Database Analyst at ISL, and has been involved with Evergreen Indiana since its earliest days in 2007. Anna oversees daily operations of the Evergreen Indiana consortium. 

“Because I work with staff fighting through workflow and policy issues daily, I am especially focused on monitoring and helping to fund development which will enhance the daily experience of Evergreen for patrons and staff,” Anna tells us. She has coordinated with other Evergreen consortia to collectively fund development and also brings her workflow knowledge to Launchpad, where she has filed 33 bugs.

Anna has been involved with the international Evergreen community since 2013. She credits attending the Evergreen Conference with “fir[ing] me up to get involved in the wider community,” and encourages new members to attend community events. “I would suggest that any meeting that involves members of multiple Evergreen communities will get you excited,” she says, “As you learn that they have the same enthusiasms, frustrations, and experiences that you deal with regularly.”

Anna and ISL have served as hosts of the community Hack-a-way in 2016, 2017, and 2019. Those community members who have attended the Hack-a-way can attest to the stellar hosting capabilities Anna brings to this event, which includes a homemade first-night welcome dinner for attendees.

Anna has been heavily involved with two large transitions affecting the Evergreen community in recent years – as current President of the Evergreen Project Board, she is part of the ongoing process to establish Evergreen as its own 501(c)(3) organization. This started in 2017 with the initial transition from the Software Freedom Conservancy to MOBIUS as the project’s “home”, and Anna and the current Board are working hard to permanently establish Evergreen as its own organization.

Additionally, Anna coordinated the creation of the Evergreen Community Development Initiative (ECDI) under the aegis of ISL. ECDI has assumed all development contracts from the former MassLNC. As MassLNC did, ECDI will serve as a clearinghouse for Evergreen community development funds and will continue managing cooperative development projects for the benefit of the community at large. 

This cooperative spirit is something Anna embodies. “I’m always so excited to be part of a community that makes the changes they want to see rather than just feeling like they’re stuck with what they have.”

Do you know someone in the community who deserves a bit of extra recognition? Please use this form to submit your nominations. We ask for your email in case we have any questions, but all nominations will be kept confidential.

Any questions can be directed to Andrea Buntz Neiman via email or abneiman in IRC.

Aviation's Groundhog Day / David Rosenthal

Searching for 40-year old lessons for Boeing in the grounding of the DC-10 by Jon Ostrower is subtitled An eerily similar crash in Chicago 40-years ago holds lessons for Boeing and the 737 Max that reverberate through history. Ostrower writes that it is:
The first in a series on the historical parallels and lessons that unite the groundings of the DC-10 and 737 Max.
I hope he's right about the series, because this first part is a must-read account of the truly disturbing parallels between the dysfunction at McDonnell-Douglas and the FAA that led to the May 25th 1979 Chicago crash of a DC-10, and the dysfunction at Boeing (whose management is mostly the result of the merger with McDonnell-Douglas) and the FAA that led to the two 737 MAX crashes. Ostrower writes:
The grounding of the DC-10 ignited a debate over system redundancy, crew alerting, requirements for certification, and insufficient oversight and expertise of an under-resourced regulator — all familiar topics that are today at the center of the 737 Max grounding. To revisit the events of 40 years ago is to revisit a safety crisis that, swapping a few specific details, presents striking similarities four decades later, all the way down to the verbiage.
Below the fold, some commentary with links to other reporting.

The Regulators

The DC-10 crashed because one of the pylons holding the under-wing engines broke, with massive damage to the wing. Despite this obvious mechanical failure, it took 12 days for the FAA to ground DC-10s:
On June 6, all 138 DC-10s at eight U.S. airlines were ordered grounded by the FAA when it revoked the jet’s airworthiness certificate and would stay that way for 37 days in 1979. The FAA initially opposed the grounding and the crash forced a legal battle with the American Airline Passengers Association, which sought an injunction to halt DC-10 flying in the U.S. “pending fuller analysis,” according to coverage in Flight. Inspections in the days that followed the Chicago crash revealed cracks on the engine pylons on other aircraft. FAA Administrator Langhorne Bond had no choice but to withdraw the roughly 275-seat jet’s airworthiness certificate. Carriers and regulators around the world — totaling some 274 aircraft, including 74 in Europe — followed suit.

McDonnell Douglas called the order “an extreme and unwarranted act.”
The cause of the Lion Air crash wasn't clear, but after the Ethiopian Airlines crash it took China and Indonesia less than a day to ground the 737 MAX:
Responding to the second crash of a Boeing 737 Max 8 soon after takeoff in less than five months, China and Indonesia ordered their airlines on Monday to ground all of these aircraft that they operate.

The Civil Aviation Administration of China noted in its announcement on Monday morning of the grounding that both the Ethiopian Airlines crash on Sunday and a Lion Air crash in Indonesia in late October had involved very recently delivered Boeing 737 Max 8 aircraft that crashed soon after takeoff.

Indonesia joined China about nine hours later in also ordering its airlines to stop operating their Boeing 737 Max 8 aircraft.
It took the FAA three days to realize that they couldn't allow the 737 MAX to continue flying:
On Wednesday, when announcing the grounding of the 737 MAX, the FAA cited similarities in the flight trajectory of the Lion Air flight and the crash of Ethiopian Airlines Flight 302 last Sunday.
It is doubtful whether the FAA would have acted so fast had they not been preempted. In both cases, the FAA set up a blue-ribbon review board. The 1980 board concluded:
“The committee finds that, as the design of airplanes grows more complex, the FAA is placing greater reliance on the manufacturer,” the blue-ribbon panel wrote in 1980. “The FAA’s human resources are not remotely adequate to the enormous job of certifying an airliner,” wrote Newhouse, and said the lure of more attractive salaries in the private sector meant 94% of approval work was delegated to the manufacturers. “The committee finds that the technical competence and up-to-date knowledge required of people in the FAA have fallen behind those in industry.”
Dominic Gates reports this is still the case today:
The FAA, citing lack of funding and resources, has over the years delegated increasing authority to Boeing to take on more of the work of certifying the safety of its own airplanes.

Early on in certification of the 737 MAX, the FAA safety engineering team divided up the technical assessments that would be delegated to Boeing versus those they considered more critical and would be retained within the FAA.

But several FAA technical experts said in interviews that as certification proceeded, managers prodded them to speed the process. Development of the MAX was lagging nine months behind the rival Airbus A320neo. Time was of the essence for Boeing.

A former FAA safety engineer who was directly involved in certifying the MAX said that halfway through the certification process, "we were asked by management to re-evaluate what would be delegated. Management thought we had retained too much at the FAA."

"There was constant pressure to re-evaluate our initial decisions," the former engineer said. "And even after we had reassessed it ... there was continued discussion by management about delegating even more items down to the Boeing Company."

Even the work that was retained, such as reviewing technical documents provided by Boeing, was sometimes curtailed.
The 2019 review agreed:
The New York Times would call the panel’s findings “damning” for Boeing and the FAA. The JATR, which included regulators from nine countries along with the U.S., found “signs of undue pressure” on the delegated Boeing staff responsible for regulatory approvals of the MCAS system, which it said (without elaborating) “may be attributed to conflicting priorities and an environment that does not support FAA requirements.”
The 1980 review pointed to the problem that the regulations were treated as the maximum the manufacturer needed to do:
Maynard Pennell, retired Boeing executive and aerodynamicist drafted to the blue-ribbon commission for the review of the FAA and DC-10 told Newhouse: “Douglas met the letter of the FAA regulations, but it did not build as safe an airplane as it could have. This was not a deliberate policy on its part…Douglas was determined not to over-run or do more than required by regulation to do.”
The 2019 review agreed:
The JATR concluded Boeing broadly met every regulation, but raised “the foundational issue” of whether or not regulations can go far enough to foster a safety culture without creating complacency. “To the extent they do not address every scenario, compliance with every applicable regulation and standard does not necessarily ensure safety.
Ostrower ends by explicitly linking the two reviews:
On the closing page of its 1980 report, the blue-ribbon committee made a recommendation stemming directly from the lessons it saw as crucial from the 1979 DC-10 crash. The report recommended that each commercial aircraft manufacturer “consider having an internal aircraft safety organization to provide additional assurance of airworthiness to company management.” [Emphasis theirs] McDonnell Douglas had created roving non-advocate review boards to assess program safety, according to a former Douglas executive, but it stopped short of a central organization. But the virtue of the recommendation didn’t end in 1980. Whether it realized it or not, Boeing’s Board of Directors on September 30, 2019 adopted the committee’s suggestion, forty years later.

The Manufacturer

It is important to note that the current Boeing management evolved from the McDonnell-Douglas management of 1997. As I wrote in Boeing 737 MAX: Two Competing Views:
[Maureen Tkacik] recounts how Boeing bought the failing McDonnell-Douglas in 1997 and basically handed management of the combined company to the team that had driven McDonnell-Douglas into the ditch:
The line on Stonecipher was that he had “bought Boeing with Boeing’s money.” Indeed, Boeing didn’t ultimately get much for the $13 billion it spent on McDonnell Douglas, which had almost gone under a few years earlier. But the McDonnell board loved Stonecipher for engineering the McDonnell buyout, and Boeing’s came to love him as well.
In fact, the Stonecipher-engineered buyout closed on 1st August 1997, a mere 18 years and 68 days after the DC-10 crash.

Work on the DC-10 started in 1968 and it entered service in 1971. Ostrower writes:
Douglas was determined to beat the L-1011 Tristar to the sky in 1970 and did so 10 weeks before Lockheed. The externally similar looking tri-jet occupied an identical spot in the market. And arriving first would be part of the competitive advantage, Douglas surmised. That expediency by Douglas (recently merged in 1967 with McDonnell) would invite some of the withering criticism from those tasked with officially evaluating the jet after Flight 191, including that its design might’ve met the letter of the law, but fell far short of its spirit of safety.
They were right about the "competitive advantage":
Although the L-1011 was more technologically advanced, the DC-10 would go on to outsell the L-1011 by a significant margin due to the DC-10's lower price and earlier entry into the market.
In Flawed analysis, failed oversight: How Boeing and FAA certified the suspect 737 MAX flight control system Dominic Gates of the Seattle Times explains that Boeing was desperate to get the 737 MAX in the air because the Airbus A320neo had a 9-month lead in the market. And that Boeing also had a serious competitive disadvantage against Airbus. Airbus's planes are fly-by-wire, and the flight control software minimizes the differences between different models, reducing the need for pilot training. Boeing was also desperate to ensure that pilots certified for earlier 737 versions would not need significant training to fly the MAX. Gates writes that Boeing:
had promised Southwest Airlines Co., the plane’s biggest customer, to keep pilot training to a minimum so the new jet could seamlessly slot into the carrier’s fleet of older 737s, according to regulators and industry officials.

[Former Boeing engineer] Mr. [Rick] Ludtke [who worked on 737 MAX cockpit features] recalled midlevel managers telling subordinates that Boeing had committed to pay the airline $1 million per plane if its design ended up requiring pilots to spend additional simulator time. “We had never, ever seen commitments like that before,” he said.

What we see in these two cases, just as we saw in the Global Financial Crisis, and are seeing now with the FAANGS, is that the bigger the company, the easier time it has gaming the regulatory system in its favor. The emasculation of anti-trust enforcement has let Boeing, the Wall Street banks, the cable companies, and the FAANGS get too-big-to-fail. And in the absence of effective anti-trust remedies, these companies need not fear the regulators. They can be ignored, strong-armed, or as we see with the cable companies' FCC and Boeing's FAA, captured.

ALCTS, LITA and LLAMA collaborate for virtual forum / LITA

The Association for Library Collections & Technical Services (ALCTS), the Library and Information Technology Association (LITA) and the Library Leadership & Management Association (LLAMA) have collaborated to create The Exchange, an interactive, virtual forum designed to bring together experiences, ideas, expertise and individuals from these American Library Association (ALA) divisions. Modeled after the 2017 ALCTS Exchange, the Exchange will be held May 4, May 6 and May 8 in 2020 with the theme “Building the Future Together.”

As a fully online interactive forum, the Exchange will give participants the opportunity to share the latest research, trends and developments in collections, leadership, technology, innovation, sustainability and collaborations. Participants from diverse areas of librarianship will find the three days of presentations, panels and activities both thought-provoking and highly relevant to their current and future career paths. The Exchange will engage an array of presenters and participants, facilitating enriching conversations and learning opportunities. Everyone, members and non-members alike, is encouraged to register and bring their questions, experiences and perspectives to the events. Registration opens Nov. 4.

The Exchange Working Group welcomes proposals for the May 2020 forum that highlight the innovation happening in the profession and that span across areas that may have been traditionally siloed. Proposal topics should be relevant to the overarching theme, as well as the daily themes for each session. Daily themes include leadership and change management, continuity and sustainability, and collaborations and cooperative endeavors. The deadline for proposals is Dec. 6. Proposals can be submitted using the Presentation Proposal Form.

Before submitting a proposal, visit the Exchange website to learn what makes a strong proposal, view the success criteria, and check out the session formats.

The Exchange is presented by the Association for Library Collections & Technical Services (ALCTS), the Library and Information Technology Association (LITA) and the Library Leadership & Management Association (LLAMA), divisions of the American Library Association. To get more information about the proposed future for joint projects such as the Exchange, join the conversation about #TheCoreQuestion.


Brooke Morris-Chott

Program Officer, Communications