Planet Code4Lib

Filling our cups at the home office water cooler / HangingTogether

Image of a multi-person virtual meeting on a home laptop.

A few weeks ago I learned that Crystal, an OCLC colleague I barely knew, loves to play board games, that she and her husband had just become empty nesters (like my wife and me), and that she had just moved to a new place not far from my house. With both of us vaccinated, a game night seemed within our grasp, and a new friendship had already formed.

People who know me know that I’m a pretty extroverted and gregarious person. I take delight in meeting new people and working on projects elbow-to-elbow. I enjoy a good whiteboard throw-down. I relish the passion of intellectual debate and peer review. And I love all these things even more when they happen in the office with my colleagues.

Since that fateful Friday the 13th in March 2020, my own team at OCLC has adopted a practice of meeting online three times a week for 30 minutes just to touch base, crack jokes, put funny comments in the chat, and generally check in on each other’s physical and mental well-being. Work topics only sneak in about 30% of the time. Some of my colleagues have called it a “lifeline” in troubling times.

Prior to the pandemic, our team had worked with folks outside the Dublin HQ, and it includes one member who works remotely, so we’re adept at working virtually. Even so, working in regular proximity to colleagues can power collaboration. But my relatively small team already knows each other quite well. With about 1,200 global colleagues at OCLC, how do we boost collaboration and infuse remote work with the relationship building we used to rely on serendipity and the shared coffee station to provide?

Several aspects of remote work, from reclaimed commuting time to new work attire, have created more balance for many of us. But a virtual environment makes it challenging to form and foster social connections: chats about vacation plans (remember those?), new pets, family updates, your latest Netflix binge, and, oh yeah, even work stuff—the water cooler conversations.

Image of a person holding a cell phone about to start Microsoft Teams.

So, imagine my delight when my colleagues in OCLC Human Resources launched the (opt-in) OCLC WaterCooler. Each week, a bot called Icebreaker (part of our Microsoft Teams platform) pairs you with a random participating staff member, and participants can choose how and when to connect with their match. So far, I’ve had ten matches, and only three have been with someone I already knew.

Managing a data science team in OCLC Research, I’m naturally skeptical and critical of the claims of Artificial Intelligence (see our report, Responsible Operations). But the Teams Icebreaker algorithm doesn’t suffer from conscious or unconscious bias, so it’s also a great tool for supporting equity, diversity, and inclusion in the workplace by helping us get out of our well-trodden networks. All of my meetups have been with people with varied expertise and experiences, folks from different cultural backgrounds and different OCLC departments. These diverse interactions have not only changed and enriched my perspective, they have also made for much more lively and interesting conversations.

I know that my mental and emotional health thrives when I take time to fill my cup, and my work is more innovative and efficient when I connect with people outside of my own team, even for a few minutes each week. Getting out of our ruts is important—the WaterCooler and Icebreaker make this happen.

The WaterCooler is a godsend. In a year of feeling disconnected and isolated or being myopically focused on my own team’s priorities and work, I’m now meeting with OCLC colleagues in a way that has energized and inspired me.

The other day I chatted with Pradeep. We’d worked in the same building for three years but never had a conversation. We talked about family, moving to Ohio, data science, and the illumination at the tunnel’s end created by the vaccine. I could have seen Pradeep in the OCLC cafeteria or ridden the elevator with him 50 times and not made the same kind of connection. I could have even been in a face-to-face meeting with him and not had the opportunity to break the ice in the way this tool has. Maybe that’s on me, but I’m happy for the new awareness. My modest case of agoraphobia is waning now that I’m vaccinated and I’m looking forward to returning to the office. When I’m back, I will keep using this tool. I can’t wait to chat with my new friends and colleagues online and in person at a real—not virtual—water cooler.

The post Filling our cups at the home office water cooler appeared first on Hanging Together.

Mempool Flooding / David Rosenthal

In Unstoppable Code? I discussed Joe Kelly's suggestion for how governments might make it impossible to transact Bitcoin by mounting a 51% attack using seized mining rigs. That's not the only way to achieve the same result, so below the fold I discuss an alternative approach that could be used alone or in combination with Kelly's concept.

The Lifecycle Of The Transaction

The goal is to prevent transactions in a cryptocurrency based on a permissionless blockchain. We need to understand how transactions are supposed to work in order to establish their attack surface:
  • Transactions transfer cryptocurrency between inputs and outputs identified by public keys (or typically hashes of the keys).
  • The owner of the inputs creates a proposed transaction specifying the amount for each output and a miner fee, then signs it with their private key.
  • The proposed transaction is broadcast to the mining pools, typically by what amounts to a Gossip Protocol.
  • Mining pools validate the proposed transactions they receive and add them to a database of proposed transactions, typically called the "mempool".
  • When a mining pool starts trying to mine a block, they choose some of the transactions from their mempool to include in it. Typically, they choose transactions (or sets of dependent transactions) that yield the highest fee should their block win.
  • Once a transaction is included in a winning block, or more realistically in a sequence of winning blocks, it is final.
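The selection step in the lifecycle above is, in effect, a fee auction. Here is a toy sketch of that greedy choice (names and numbers are illustrative; real pools also account for ancestor fee rates, dependent transaction chains, and other policy rules):

```python
from dataclasses import dataclass

@dataclass
class Tx:
    txid: str
    size_bytes: int   # serialized transaction size
    fee_satoshi: int  # fee offered to the winning miner

def select_for_block(mempool, max_block_bytes):
    """Greedily fill a block with the highest fee-rate transactions."""
    chosen, used = [], 0
    for tx in sorted(mempool, key=lambda t: t.fee_satoshi / t.size_bytes,
                     reverse=True):
        if used + tx.size_bytes <= max_block_bytes:
            chosen.append(tx)
            used += tx.size_bytes
    return chosen

mempool = [
    Tx("a", 250, 50_000),   # 200 sat/byte
    Tx("b", 250, 2_500),    #  10 sat/byte
    Tx("c", 500, 150_000),  # 300 sat/byte
]
block = select_for_block(mempool, max_block_bytes=750)
print([t.txid for t in block])  # → ['c', 'a']: "b" is priced out
```

Because inclusion is purely fee-driven, anyone willing to outbid ordinary users controls what gets confirmed; that is the lever a flooding attack pulls.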

Attack Surface

My analysis of the transaction lifecycle's attack surface may well not be complete, but here goes:
  • The security of funds before a proposed transaction depends upon the private key remaining secret. The DarkSide ransomware group lost part of their takings from the Colonial Pipeline compromise because the FBI knew the private key of one of their wallets.
  • The gossip protocol makes proposed transactions public. A public "order book" is necessary because the whole point of a permissionless blockchain is to avoid the need for trust between participants. This leads to the endemic front-running I discussed in The Order Flow, and which Naz automated (see How to Front-run in Ethereum).
  • The gossip protocol is identifiable traffic, which ISPs could be required to block.
  • The limited blocksize and fixed block time limit the rate at which transactions can leave the mempool. Thus when the transaction demand exceeds this rate the mempool will grow. Mining pools have limited storage for their mempools. When the limit is reached mining pools will drop less-profitable transactions from their mempools. Like any network service backed by a limited resource, the mempool is vulnerable to a Distributed Denial of Service (DDoS) attack.
  • Each mining pool is free to choose transactions to include in the blocks they try to mine at will. Thus a transaction need not appear in the mempool to be included in a block. For example, mining pools' own transactions or those of their friends could avoid the mempool, the equivalent of "dark pools" in equity markets.
  • Once a transaction is included in a mined block, it is vulnerable to a 51% attack.

Flooding The Mempool

Let's focus on the idea of DDoS-ing the mempool. As John Lewis of the Bank of England wrote in 2018's The seven deadly paradoxes of cryptocurrency:
Bitcoin has an estimated maximum of 7 transactions per second vs 24,000 for visa. More transactions competing to get processed creates logjams and delays. Transaction fees have to rise in order to eliminate the excess demand. So Bitcoin’s high transaction cost problem gets worse, not better, as transaction demand expands.
Worse, pending transactions are in a blind auction to be included in the next block. Because users don't know how much to bid to be included, they either overpay, suffer a long delay, or possibly fail completely. The graph shows this effect in practice. As the price of Bitcoin crashed on May 18th and HODL-ers rushed to sell, the average fee per transaction spiked to over $60.

The goal of the attack is to make victims' transactions rare, slow and extremely expensive by flooding the mempool with attackers' transactions. Cryptocurrencies have no intrinsic value, their value is determined by what the greater fool will pay. If HODL-ers find it difficult and expensive to unload their HODL-ings, and traders find it difficult and expensive to trade, the "price" of the currency will decrease. This attack isn't theoretical, it has already been tried. For example, in June 2018 Bitcoin Exchange Guide reported:
What appears to be happening is a bunch (possibly super spam) of 1 satoshi transactions (smallest unit in bitcoin) which will put a decent stress test if sustained. Some are saying near 4,500 spam transactions and counting.
This is obviously not an effective attack. There is no incentive for the mining pools to prefer tiny unprofitable transactions over normal user transactions. Unless it were combined with a 51% attack, an effective flooding attack needs to incentivize mining pools that are not part of the attack to prefer the attackers' transactions to those of victims. The only way to do this is to make the attackers' transactions more profitable, which means they have to come with large fees.
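The difference between 1-satoshi spam and a well-funded flood can be seen in a toy model (all numbers illustrative): since pools take the highest-fee transactions, cheap spam displaces nothing, while flooding with above-market fees displaces everything.

```python
def victim_inclusion_rate(victim_fees, attacker_fee, attacker_count, block_slots):
    """Fraction of victim transactions included when a block has
    block_slots slots and pools take the highest-fee transactions."""
    pool = [("victim", f) for f in victim_fees] \
         + [("attacker", attacker_fee)] * attacker_count
    pool.sort(key=lambda t: t[1], reverse=True)
    winners = pool[:block_slots]
    return sum(1 for who, _ in winners if who == "victim") / len(victim_fees)

victims = [10_000] * 2_000  # 2,000 victim txs bidding 10,000 satoshi each

# 1-satoshi spam, as in the 2018 episode: victims are unaffected.
print(victim_inclusion_rate(victims, 1, 5_000, 2_500))        # → 1.0
# Flooding with above-market fees: victims are shut out entirely.
print(victim_inclusion_rate(victims, 200_000, 3_000, 2_500))  # → 0.0
```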

If a major government wanted to mount a flooding attack on, for example, Bitcoin they would need a lot of Bitcoin as ammunition. Fortunately, at least the US government has seized hundreds of millions of dollars of cryptocurrencies:
Mr. Raimondi of the Justice Department said the Colonial Pipeline ransom seizure was the latest sting operation by federal prosecutors to recoup illicitly gained cryptocurrency. He said the department has made “many seizures, in the hundreds of millions of dollars, from unhosted cryptocurrency wallets” used for criminal activity.
If they needed more, they could always hack one of the numerous vulnerable exchanges.

With this ammunition the government could generate huge numbers of one-time addresses and huge numbers of valid transactions among them with fees large enough to give them priority. The result would be to bid up the Bitcoin fee necessary for victim transactions to get included in blocks. It would be hard for mining pools to identify the attackers' transactions, as they would be valid and between unidentifiable addresses. As the attack continued this would ensure that:
  • The minimum size of economically feasible transactions would increase, restricting trading to larger and larger HODL-ers, or to exchanges.
  • The visible fact that Bitcoin was under sustained, powerful attack would cause HODL-ers to sell for fiat or other cryptocurrencies. This would depress the "price" of Bitcoin, as the exchanges would understand the risk that the attack would continue and further depress the price.
  • Mining pools, despite receiving their normal rewards plus increased fees in Bitcoin, would suffer reduction of their income in fiat terms.
  • Further, the mining pools need transactions to convert their rewards and fees to fiat to pay for power, etc. With transactions scarce and expensive, and reduced fiat income, the hash rate would decline, making a 51% attack easier.

How Feasible Are Flooding Attacks?

Back on May 18th, as the Bitcoin "price" crashed to $30K, its blockchain was congested and average fees spiked to $60. Clearly, the distribution of fees would have been very skewed, with a few fees well above $60 and most well below; the median fee was around $26. Fees are measured in Satoshi, 10^-8 of a BTC, so at that time the average fee was 60/(3*10^4) BTC or 2*10^5 Satoshi. Let's assume that ensuring that no transactions with less than 2*10^5 Satoshi as a fee succeed is enough to keep the blockchain congested.

Let's assume that when the Feds claim to have seized hundreds of millions of dollars of cryptocurrencies they mean $5*10^8, or 5*10^8/3*10^4 BTC, or 1.67*10^12 Satoshi. That would be enough to pay the 2*10^5 Satoshi fee for 6*10^7 transactions. At 6 transactions/second that would keep the blockchain congested for nearly 116 days, or nearly 4 months. In practice, the attack would last much longer, since the attackers could dynamically adjust the fees they paid to keep the blockchain congested as, inevitably, the demand for transactions from victims declined as they realised it was futile.
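These back-of-the-envelope numbers are easy to check mechanically. The sketch below simply restates the post's assumptions (a $30K "price", 6 transactions/second, 144 blocks/day with a 6.25 BTC block subsidy, and the $5*10^8 budget spread over 6*10^7 transactions):

```python
BTC_USD = 30_000          # assumed Bitcoin "price"
SAT_PER_BTC = 10**8

budget_usd = 5 * 10**8                        # seized funds used as ammunition
budget_btc = budget_usd / BTC_USD             # ≈ 16,667 BTC
budget_satoshi = budget_btc * SAT_PER_BTC     # ≈ 1.67e12 satoshi

attack_txs = 6 * 10**7    # transactions funded by the budget
tx_per_sec = 6            # Bitcoin's effective throughput
days = attack_txs / tx_per_sec / 86_400       # ≈ 116 days, nearly 4 months

blocks_mined = days * 144                     # one block every ~10 minutes
rewards_btc = blocks_mined * 6.25             # ≈ 104,000 BTC in subsidies

print(f"fees: {budget_btc:,.0f} BTC over {days:.0f} days; "
      f"block rewards: {rewards_btc:,.0f} BTC")
```

At a $30K "price", the combined pool income of roughly 121,000 BTC is the approximately $3.6B figure discussed next.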

Ensuring that almost no victim transactions succeeded for 4 months would greatly reduce the BTC "price". Thus the 16,700 BTC the mining pools would have earned in fees, plus the 104,400 BTC they would have earned in block rewards during that time, would be worth much less than the $3.6B they would represent at a $30K "price". Funding the mining pools is a downside of this attack, but the increment is only about 14% in BTC terms, so it is likely to be swamped by the decrease in fiat terms.

Potential Defenses

Blockchain advocates argue that one of the benefits of the decentralization they claim for the technology is "censorship resistance". This is a problem for them because defending against a mempool flooding attack requires censorship. The mining pools need to identify and censor (i.e. drop) the attackers' transactions. Fortunately for the advocates, the technology is not actually decentralized (3-4 mining pools have dominated the hash rate for the last 7 years), so does not actually provide "censorship resistance". The pools could easily conspire to implement the necessary censorship. Unfortunately for the advocates, the attackers would be flooding with valid transactions offering large fees, so the pools would find it hard to, and not be motivated to, selectively drop them.

16,700 BTC is only about half of Tesla's HODL-ings, so it would be possible for a whale, or a group of whales, to attempt to raise the cost of the attack, or equivalently reduce its duration, by mounting a simultaneous flood themselves. The attackers would respond by reducing their flood, since the whales were doing their job for them. This would be expensive for the whales and wouldn't be an effective defense.

Since it is possible for mining pools to include transactions in blocks they mine, and the attack would render the mempool effectively useless, one result of the attack would be to force exchanges and whales to establish "dark pool" type direct connections to the mining pools, allowing the mining pools to ignore the mempool and process transactions only from trusted addresses. This would destroy the "decentralized" myth, completing the transition of the blockchain into a permissioned one run by the pools, and make legal attacks on the exchanges an effective weapon. Also, the mining pools would be vulnerable to government-controlled "trojan horse" exchanges, as the bad guys were to ANOM encrypted messaging.


If my analysis is correct, it would be feasible for a major government to mount a mempool flooding attack that would seriously disrupt, but not totally destroy, Bitcoin and, by extension, other cryptocurrencies. The attack would amplify the effect of using seized mining power, as I discussed in Unstoppable Code?. Interestingly, the mempool flooding attack is effective irrespective of the consensus mechanism underlying the cryptocurrency. It depends only upon a public means of submitting transactions.

Editorial: Closer to 100 than to 1 / Code4Lib Journal

With the publication of Issue 51, the Code4Lib Journal is now closer to Issue 100 than we are to Issue 1. Also, we are developing a name change policy.

Adaptive Digital Library Services: Emergency Access Digitization at the University of Illinois at Urbana-Champaign During the COVID-19 Pandemic / Code4Lib Journal

This paper describes how the University of Illinois at Urbana-Champaign Library provided access to circulating library materials during the 2020 COVID-19 pandemic. Specifically, it details how the library adapted existing staff roles and digital library infrastructure to offer on-demand digitization of and limited online access to library collection items requested by patrons working in a remote teaching and learning environment. The paper also provides an overview of the technology used, details how dedicated staff with strong local control of technology were able to scale up a university-wide solution, reflects on lessons learned, and analyzes nine months of usage data to shed light on library patrons’ changing needs during the pandemic.

Assessing High-volume Transfers from Optical Media at NYPL / Code4Lib Journal

NYPL’s workflow for transferring optical media to long-term storage was met with a challenge: an acquisition of a collection containing thousands of recordable CDs and DVDs. Many programs take a disk-by-disk approach to imaging or transferring optical media, but to deal with a collection of this size, NYPL developed a workflow using a Nimbie AutoLoader and a customized version of KBNL’s open-source IROMLAB software to batch disks for transfer. This workflow prioritized quantity, but, at the outset, it was difficult to tell if every transfer was as accurate as it could be. We discuss the process of evaluating the success of the mass transfer workflow, and the improvements we made to identify and troubleshoot errors that could occur during the transfer. A background of the institution and other institutions’ approaches to similar projects is given, then an in-depth discussion of the process of gathering and analyzing data. We finish with a discussion of our takeaways from the project.

Better Together: Improving the Lives of Metadata Creators with Natural Language Processing / Code4Lib Journal

DC Public Library has long held digital copies of the full run of local alternative weekly, Washington City Paper, but had no official status as a rights grantor to enable use. That recently changed due to a full agreement being reached with the publisher. One condition of that agreement, however, was that issues become available with usable descriptive metadata and subject access in time to celebrate the upcoming 40th anniversary of the publication, which at that time was in six months. One of the most time intensive tasks our metadata specialists work on is assigning description to digital objects. This paper details how we applied Python’s Natural Language Toolkit and OpenRefine’s reconciliation functions to the collection’s OCR text to simplify subject selection for staff with no background in programming.
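The paper should be consulted for the actual NLTK/OpenRefine pipeline; as a rough, standard-library-only illustration of the underlying idea (surface frequent content words from OCR text as candidate subjects, which would then be reconciled against a controlled vocabulary), with made-up sample text:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "on", "for", "is", "was"}

def candidate_subjects(ocr_text, top_n=5):
    """Return the most frequent non-stopword terms as candidate subjects.

    A stand-in for an NLTK tokenize/filter pipeline; in practice the
    candidates would be reconciled against a subject vocabulary.
    """
    words = re.findall(r"[a-z]+", ocr_text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [term for term, _ in counts.most_common(top_n)]

sample = ("The council debated the metro budget. Metro riders protested "
          "the budget cuts, and the council delayed the metro vote.")
print(candidate_subjects(sample, top_n=3))  # → ['metro', 'council', 'budget']
```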

Choose Your Own Educational Resource: Developing an Interactive OER Using the Ink Scripting Language / Code4Lib Journal

Learning games are games created with the purpose of educating, as well as entertaining, players. This article describes the potential of interactive fiction (IF), a type of text-based game, to serve as learning games. After summarizing the basic concepts of interactive fiction and learning games, the article describes common interactive fiction programming languages and tools, including Ink, a simple markup language that can be used to create choice based text games that play in a web browser. The final section of the article includes code putting the concepts of Ink, interactive fiction, and learning games into action using part of an interactive OER created by the author in December of 2020.

Enhancing Print Journal Analysis for Shared Print Collections / Code4Lib Journal

The Western Regional Storage Trust (WEST) is a distributed shared print journal repository program serving research libraries, college and university libraries, and library consortia in the Western Region of the United States. WEST solicits serial bibliographic records and related holdings biennially, which are evaluated and identified as candidates for shared print archiving using a complex collection analysis process. California Digital Library’s Discovery & Delivery WEST operations team (WEST-Ops) supports the functionality behind this collection analysis process used by WEST program staff (WEST-Staff) and members. For WEST, proposals for shared print archiving have historically been predicated on what is known as an Ulrich’s journal family, which pulls together related serial titles, for example, succeeding and preceding serial titles, their supplements, and foreign language parallel titles. Ulrich’s, while it has been invaluable, proves problematic in several ways, resulting in the omission of approximately half of the journal titles submitted for collection analysis. Part of WEST’s effectiveness in archiving hinges upon its ability to analyze local serials data across its membership as holistically as possible. The process that enables this analysis, and subsequent archiving proposals, is dependent on Ulrich’s journal family, for which ISSN has traditionally been used to match and cluster all related titles within a particular family. As such, the process is limited in that many journals have never been assigned ISSNs, especially older publications, or member bibliographic records may lack an ISSN, though the ISSN may exist in an OCLC primary record. Building a mechanism for matching on ISSNs that goes beyond the base set of primary, former, and succeeding titles expands the number of eligible ISSNs that facilitate Ulrich’s journal family matching.
Furthermore, when no matches in Ulrich’s can be made based on ISSN, other types of control numbers within a bibliographic record may be used to match with records that have been previously matched with an Ulrich’s journal family via ISSN, resulting in a significant increase in the number of titles eligible for collection analysis. This paper will discuss problems in Ulrich’s journal family matching, improved functional methodologies developed to address those problems, and potential strategies to improve serial title clustering in the future.

How We Built a Spatial Subject Classification Based on Wikidata / Code4Lib Journal

From the fall of 2017 to the beginning of 2020, a project was carried out to upgrade spatial subject indexing in the North Rhine-Westphalian Bibliography (NWBib) from uncontrolled strings to controlled values. For this purpose, a spatial classification with around 4,500 entries was created from Wikidata and published as a SKOS (Simple Knowledge Organization System) vocabulary. The article gives an overview of the initial problem and outlines the different implementation steps.

Institutional Data Repository Development, a Moving Target / Code4Lib Journal

At the end of 2019, the Research Data Service (RDS) at the University of Illinois at Urbana-Champaign (UIUC) completed its fifth year as a campus-wide service. In order to gauge the effectiveness of the RDS in meeting the needs of Illinois researchers, RDS staff developed a five-year review consisting of a survey and a series of in-depth focus group interviews. As a result, our institutional data repository developed in-house by University Library IT staff, Illinois Data Bank, was recognized as the most useful service offering by our unit. When launched in 2016, storage resources and web servers for Illinois Data Bank and supporting systems were hosted on-premises at UIUC. As anticipated, researchers increasingly need to share large and complex datasets. In a responsive effort to leverage the potentially more reliable, highly available, cost-effective, and scalable storage accessible to computation resources, we migrated our item bitstreams and web services to the cloud. Our efforts have met with success, but also with painful bumps along the way. This article describes how we supported data curation workflows through transitioning from on-premises to cloud resource hosting. It details our approaches to ingesting, curating, and offering access to dataset files up to 2TB in size, which may be archive-type files (e.g., .zip or .tar) containing complex directory structures.

On the Nature of Extreme Close-Range Photogrammetry: Visualization and Measurement of North African Stone Points / Code4Lib Journal

Image acquisition, visualization, and measurement are examined in the context of extreme close-range photogrammetric data analysis. Manual measurements commonly used in traditional stone artifact investigation are used as a starting point to better gauge the usefulness of high-resolution 3D surrogates and the flexible digital tool sets that can work with them. The potential of various visualization techniques are also explored in the context of future teaching, learning, and research in virtual environments.

Optimizing Elasticsearch Search Experience Using a Thesaurus / Code4Lib Journal

The Belgian Art Links and Tools (BALaT) is the continuously expanding online documentary platform of the Royal Institute for Cultural Heritage (KIK-IRPA), Brussels (Belgium). BALaT contains over 750,000 images of KIK-IRPA’s unique collection of photo negatives on the cultural heritage of Belgium, but also the library catalogue, PDFs of articles from KIK-IRPA’s Bulletin and other publications, an extensive persons and institutions authority list, and several specialized thematic websites, each of those collections being multilingual, as Belgium has three official languages. All these are interlinked to give the user easy access to freely available information on the Belgian cultural heritage. In recent years, KIK-IRPA has been working on a detailed and inclusive data management plan. Through this data management plan, a new project, HESCIDA (Heritage Science Data Archive), will upgrade BALaT to BALaT+, enabling access to searchable registries of KIK-IRPA datasets and data interoperability. BALaT+ will be a building block of DIGILAB, one of the future pillars of the European Research Infrastructure for Heritage Science (E-RIHS), which will provide online access to scientific data concerning tangible heritage, following the FAIR principles (Findable-Accessible-Interoperable-Reusable). It will include and enable access to searchable registries of specialized digital resources (datasets, reference collections, thesauri, ontologies, etc.). In the context of this project, Elasticsearch has been chosen as the technology empowering the search component of BALaT+. An essential feature of this search functionality of BALaT+ is the need for linguistic equivalencies, meaning a term query in French should also return the matching results containing the equivalent term in Dutch. Another important feature is to offer a mechanism to broaden the search with elements of more precise terminology: a term like "furniture" could also match records containing chairs, tables, etc.
This article will explain how a thesaurus developed in-house at KIK-IRPA was used to obtain these functionalities, from the processing of that thesaurus to the production of the configuration needed by Elasticsearch.
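The article explains the real configuration; as a rough, standalone illustration of the two behaviors described (cross-language equivalence and broadening a term like "furniture" to chairs and tables), here is a toy thesaurus expansion. The terms are invented, not KIK-IRPA's actual vocabulary, and in Elasticsearch this kind of expansion is typically compiled into a synonym token filter rather than application code:

```python
# Toy thesaurus: cross-language equivalents plus narrower terms.
THESAURUS = {
    "furniture": {"equivalents": ["meubilair"],   # Dutch equivalent
                  "narrower": ["chair", "table"]},
    "chair":     {"equivalents": ["stoel"], "narrower": []},
    "table":     {"equivalents": ["tafel"], "narrower": []},
}

def expand_query(term):
    """Expand a term with its equivalents and (recursively) narrower terms."""
    entry = THESAURUS.get(term, {})
    expanded = {term}
    expanded.update(entry.get("equivalents", []))
    for narrower in entry.get("narrower", []):
        expanded |= expand_query(narrower)
    return expanded

print(sorted(expand_query("furniture")))
# → ['chair', 'furniture', 'meubilair', 'stoel', 'table', 'tafel']
```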

Pythagoras: Discovering and Visualizing Musical Relationships Using Computer Analysis / Code4Lib Journal

This paper presents an introduction to Pythagoras, an in-progress digital humanities project using Python to parse and analyze XML-encoded music scores. The goal of the project is to use recurring patterns of notes to explore existing relationships among musical works and composers. An intended outcome of this project is to give music performers, scholars, librarians, and anyone else interested in digital humanities new insights into musical relationships as well as new methods of data analysis in the arts.

Reflection: My third year at GitLab and becoming a non-manager leader / Cynthia Ng

Wow, 3 years at GitLab. Since I left teaching, because almost all my jobs were contracts, I haven’t been anywhere for more than 2 years, so I find it interesting that my longest tenure is not only post-librarian-positions, but at a startup! Year 3 was a full-on pandemic year, and it was a busy one. Due to the travel restrictions, I took less vacation than in previous years, and I’ll be trying to make up for that a little by taking this week off.

Meta: Apology To Commentors / David Rosenthal

I thought that the blizzard of spam comments had miraculously stopped, but no. What appears to have happened is that Blogger stopped sending me mail for each comment. Although this greatly helped my peace of mind, it meant that actual relevant comments sat in the queue being ignored along with the spam. I've put a reminder in my calendar to check the queue every few days, and rescued some comments from purgatory.

All Aboard for Fedora 6.0 / DuraSpace News

As you may have heard, earlier this month the Fedora 6.0 Release Candidate was announced, which means we are moving full steam ahead toward an official full production release of the software. After 2 long years of laying down the tracks to guide us toward a shiny new Fedora, this train is nearly ready to leave the station and we couldn’t be more excited. We fully expect to see a production release in the coming weeks and want the community to climb on board as we charge ahead.

So what’s with all the train metaphors? Allow me to explain. It is my pleasure to finally be unveiling the Fedora train!

This image was a collaboration between the Fedora team and the Communication Outreach Marketing and Community Sub-Group and was created by Sam Mitchell, the graphic designer at LYRASIS. We wanted to create something that was representative of the progress we have made in reaching Fedora 6.0 as well as our desire to bring all community members forward to the latest version of the software. Enter the Fedora train.

With its sleek, modern design and capacity to hold many passengers, the train felt like the ideal symbol for Fedora. You may notice the hex shapes integrated into the image; this was another intentional piece we wanted to incorporate. Much like the hex stickers that are often seen decorating the fronts of laptops, they fit together seamlessly and in infinite configurations. This is how we feel about Fedora. Fedora is the base hex, on which you can build your repository. The hex is also a subtle nod to Fedora 6.0 – the hex being a 6-sided figure to denote the most recent version without being overly literal.

If you’ve been following along with any of our Fedora presentations on the conference circuit this year, you will have seen that we are always using the train metaphor in our presentation titles – Fedora Software & Community Update: All Aboard for Fedora 6.0. In the development of Fedora 6.0 we recognized a need to bring the community forward and as a result have invested in the creation of extensive tooling, documentation and support for the migration process.

Where can you find the train? We have made a small store on Red Bubble where you can purchase a few items with the new image on it. You can access the store here. At present, it is set up as at-cost pricing, so there is no profit being made by Fedora on the sale of these items. In the future we hope to have the train image available on more merchandise when we get back to in-person conferences and events.


A special thanks goes out to Sam Mitchell at LYRASIS, the Communication Outreach Marketing & Communications Sub-Group, and the Fedora staff team for all their work in developing this image. We are excited to be able to share it with you today and hope you love what it represents as much as we do.


The post All Aboard for Fedora 6.0 appeared first on

Don't Think / Ed Summers

As part of some research I’ve been involved in, I’ve recently had the opportunity to do a bit of reading and chatting about the special role of theorizing in research. For some context, Jess, Shawn and I have spent the past half a year or so talking about different ways to examine the use of web archives, mostly by looking at links on the web that point at web archives. If you are interested, we’re talking about some of this next week as part of RESAW2021 (which is free and entirely online).

Part of this work has been trying to generate a set of categories of use that we’ve been observing in our data. We started out with a predefined set of categories that were derived from our own previous research and reading. We started looking for evidence of these categories in our link data. But we found over the course of doing this work and talking about it that what we were actually engaged in wasn’t really developing a theory per se but theorizing. Recognizing the difference was extremely helpful (thanks Jess!)

Jess introduced us to a couple texts that were very helpful in helping me distinguish how theorizing is related to theory. Interestingly they also connected with some previous work I’ve been doing in the classroom. I just wanted to note these papers here for Future Me, with a few notes.

The Hammond paper was the gateway to the Swedberg, but I read them the other way around. It didn’t matter too much because both papers are about the role of theorizing in research, and how it’s related to but distinct from research involving a theory. If you read them, definitely read the Hammond first because it’s shorter, and it sets things up nicely for the deeper dive that Swedberg provides.

Hammond reviews the literature around theorizing (which includes Swedberg) and also discusses some interviews he conducted with researchers at his university to better understand the role of theory in their work. These first-person accounts were helpful because they highlighted the degree of uncertainty and anxiety that researchers felt around their use of theory. Hammond and his participants noted the connection to grounded theory (Glaser & Strauss, 1967), and the role of reflexivity in qualitative methods more generally. But he suggests that a broader discussion of theorizing takes it out of the realm of what happens when asking specific research questions, and into the exploratory work that needs to happen before.

Hammond’s basic point is that theorizing is the search for explanations, not the explanation itself. It is the process of identifying ‘patterns and regularities’ that can sometimes lead to hypotheses and theories. The Swedberg piece does many things, but its basic point is that this theorizing work is crucial for theory, and that it’s not really talked about enough. The premium is on the theory, but we are often left in the dark about how that theory was generated, because all the focus goes into its verification. Swedberg has this idea of the prestudy, which is work that happens before a theory is expressed and tested/verified. The prestudy is empirical work that is creative and aimed at making a discovery. He draws quite a bit on Peirce’s idea of abductive reasoning, or the practice of guessing right. He pushes back on the idea that empirical data should only be collected in the context of theory justification, arguing that data is extremely valuable in exploratory work.

In addition to Peirce, Swedberg also draws on Wittgenstein’s philosophy of “Don’t think but look!” which highlights how existing concepts (theories) can actually be a barrier to insight. Sometimes simply restating phenomena without using the name of the concept can unlock things. He also mentions Bachelard’s (1984) “epistemological obstacles” to theorizing, such as managing data when theorizing, and over-reliance on existing theory instead of engaging in theorizing. I’ve definitely experienced the managing-data problem as we’ve been looking at links to web archives at multiple levels of abstraction, trying to keep them organized for recall without overly prescribing what they signify. He also cites Mills (2000) quite a bit, who stressed that researchers should strive to be their own theorist in addition to being their own methodologist - and that theorizing can be learned. One method Mills suggested for gaining insight and bypassing existing theories is to dump out all the data (folders in his case) and sort them again.

Swedberg’s idea of the prestudy is compelling, I think, because in part it is a call for more writing about the prestudy so that we can learn how to do it. This reminds me a bit of what I really liked about Law (2004), which attempted to look at where social science gets done. If we don’t know where our ideas come from, how will we be able to recognize what they contain, and what they might be missing? For Swedberg theorizing can be neatly summarized as:

  1. Observing and choosing something to investigate
  2. Naming the central concept(s)
  3. Building out a theory (metaphors, comparisons, diagrams)
  4. Completing the tentative theory

The goal of theorizing is to build heuristic tools, tools that are unpolished, a bit messy, fit to purpose and non-formalized rather than definitive explanations.

… concepts should primarily be used as heuristic tools at the stage of theorizing, that is, to discover something new, and not to block the discovery process by forcing some interesting observation into some bland category. Insisting on exact operational definitions is usually not helpful at this stage. According to a well-known formulation, concepts should at this stage be seen as sensitizing and not as definitive.

I’m just grateful for this connection between theorizing and some elements of pragmatic philosophy that I’ve been drawn to for some time … and also to have some new people to read: Mills and Bachelard. Practical advice for theorizing, and how to do it, is especially important when starting new projects, and seems like an essential ingredient for staying happy and productive in this line of work. It should be hard, but it should be fun too, right?


Bachelard, G. (1984). The New Scientific Spirit. Boston: Beacon Press.

Glaser, B., & Strauss, A. (1967). The discovery of grounded theory: Strategies for qualitative research. Aldine.

Law, J. (2004). After method: Mess in social science research. Routledge.

Mills, C. W. (2000). The sociological imagination. Oxford [England] New York: Oxford University Press.

Research Object Crate (RO-Crate) Update / Peter Sefton

Research Object Crate (RO-Crate) Update Peter Sefton & Stian Soiland-Reyes

This was presented by Peter Sefton at the Open Repositories 2021 conference on 2021-06-10 (in Australia). RO-Crate has been presented at Open Repositories several times, including a workshop in 2019, so we won’t go through a very detailed introduction, but we WILL start with a quick one for those who have not seen it before.

[Slide: a dataset shown as a cloud and a folder, annotated with questions: ID? Title? Description? Who created this data? What parts does it have? When? What is it about? How can it be reused? As part of which project? Who funded it? How was it made? - covering both addressable resources and local data.]

RO-Crate is a method for describing a dataset as a digital object using a single linked-data metadata document, which can have descriptions of files and resources that are local or remote, and can contain discipline-appropriate context for the data.


The dataset may contain any kind of data resource about anything, in any format as a file or URL

[Slide: a folder tree (Folder1/ and Folder2/, each containing file1.this and file2.that, plus 2021-04-08 07.58.17.jpg) alongside the JSON-LD entry describing the image:]

{
  "@id": "2021-04-08 07.58.17.jpg",
  "@type": "File",
  "contentSize": 3271409,
  "dateModified": "2021-04-08T07:58:17+10:00",
  "description": "",
  "encodingFormat": [ { "@id": "" }, "image/jpeg" ],
  "name": "Cute puppy"
}

Each resource can have a machine readable description in JSON-LD format
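To make that machine-readable description concrete, here is a minimal sketch, in standard-library Python only, of the kind of JSON-LD document an RO-Crate carries, following the RO-Crate 1.1 conventions (metadata descriptor, root dataset, one file entity echoing the "Cute puppy" example above). The dataset name is illustrative, and a real crate would carry more properties than shown.

```python
import json

# A minimal RO-Crate-style metadata document: a JSON-LD graph with
# the metadata file descriptor, the root dataset, and one file entity.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            # The metadata file describes the root dataset ("./").
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            # The root data entity aggregates the crate's parts.
            "@id": "./",
            "@type": "Dataset",
            "name": "Example dataset",  # illustrative name
            "hasPart": [{"@id": "2021-04-08 07.58.17.jpg"}],
        },
        {
            # One described file, as on the slide above.
            "@id": "2021-04-08 07.58.17.jpg",
            "@type": "File",
            "name": "Cute puppy",
            "encodingFormat": "image/jpeg",
        },
    ],
}

# Serialise the crate metadata as it would be written to disk.
metadata = json.dumps(crate, indent=2)
```

Because the whole description is one JSON document, any tool that can parse JSON can walk the graph and find the root dataset and its parts.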

[Slide: the same folder tree - Folder1/ and Folder2/ with their files, plus 2021-04-08 07.58.17.jpg.]

A human-readable description and preview can be in an HTML file that lives alongside the metadata

What does this mean for repositories? It means that a repository can show the contents of a digital object using either a standard display library, or a customised one.

[Slide: a provenance example - a CreateAction (Date: 2021-04-01) connecting a software/workflow (Name: My Workflow), an instrument, an agent, and a result (a chart).]

Provenance and workflow information can be included - to assist in data and research-process re-use.

What does this mean for repositories? Repositories will be able to launch software environments; if the digital object can be run in an emulator, or a notebook environment then there is potential to launch that.


RO-Crate Digital Objects may be packaged for distribution, e.g. via Zip, BagIt, and OCFL Objects.
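As a sketch of the Zip option, using only the Python standard library: the directory name and stub metadata file below are hypothetical, and a real crate would of course contain data files and a full metadata document.

```python
import pathlib
import shutil
import tempfile
import zipfile  # used afterwards to inspect the archive

def zip_crate(crate_dir: str) -> str:
    """Package a crate directory as a Zip file; returns the archive path."""
    return shutil.make_archive(crate_dir, "zip", root_dir=crate_dir)

# Build a throwaway crate directory with a stub metadata file, then zip it.
tmp = tempfile.mkdtemp()
crate = pathlib.Path(tmp) / "my-crate"          # hypothetical crate name
crate.mkdir()
(crate / "ro-crate-metadata.json").write_text("{}")  # stub metadata
archive = zip_crate(str(crate))
```

Because the metadata document travels inside the package at a well-known name, the receiver can unzip and immediately locate the crate's self-description.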


Since last Open Repositories we have reached V1.1. The main changes are tidying up the file extension and making it clear that RO-Crates are not just packages of files - they are aggregations of local and remote objects. We’ll cover some other changes as well in the rest of the talk.


RO-Crate Tools keep coming.


RO-Crate is being adopted in a number of projects


And RO-Crate is a foundation standard of the Arkisto platform - which was covered in the presentation before this one.


The RO-Crate team is now working on profiles - these will be guidance for humans and validation tools who want to use RO-Crate for specific purposes.

Image by Bryan Derksen - Original image Cup or faces paradox.jpg uploaded by Guam on 28 July 2005, SVG conversion by Bryan Derksen, CC BY-SA 3.0,

Machine and human readable, search-engine friendly and developer familiar. FAIR Object middleware: standard, web-native - PIDs + JSON-LD + off-the-shelf archiving formats. Self-describing, typed by profiles plus domain ontologies. Extensible, descriptive and open about content - honouring legacy, diversity, and known and unknown unknowns; one size does not fit all. A valid RO-Crate JSON-LD graph MUST describe:

  • The RO-Crate Metadata File Descriptor
  • The Root Data Entity
  • Zero or more Data Entities
  • Zero or more Contextual Entities

We are working on aligning RO-Crate with the work going on internationally on FAIR Digital Objects - coming from the standpoint of having a working FAIR-inspired way to create digital objects already.


From a forthcoming paper by Soiland-Reyes et al

WorkflowHub is a European cross-domain registry of computational workflows, supported by European Open Science Cloud projects, e.g. EOSC-Life, and research infrastructures including the pan-European bioinformatics network ELIXIR. As part of promoting workflows as reusable tools, WorkflowHub includes documentation and high-level rendering of the workflow structure independent of its native workflow definition format. The rationale is that a domain scientist can browse all relevant workflows for their domain before narrowing down their workflow engine requirements. As such, WorkflowHub is intended largely as a registry of workflows already deposited in repositories specific to particular workflow languages and domains, such as UseGalaxy.eu and Nextflow nf-core.


RO-Crate is featuring in discussions with Dataverse as a way of packaging data.


RO-Crate is going to be integrated with Zenodo as part of the CS3MESH4EOSC project, and by extension presumably the Invenio digital library framework. Of course, RO-Crates can be deposited as they are or wrapped as Zip files.

Other discussions / work going on:

  • Ecological data description (via University of Queensland)
  • Machine-actionable Research Data Management Plans - e.g. mapping to RO-Crate
  • BioExcel - discussions are taking place
  • Via the Australian Research Data Commons: the Australian Text Analytics Platform - data object description for Jupyter notebooks and other workspaces - and the Language Data Commons, potentially building on techniques used in PARADISEC
  • BioCompute Objects (BCO), a community-led effort to standardise submissions of computational workflows to biomedical regulators
  • IBISBA, ELIXIR, the EOSC-Life Cluster project, the DISSCo Synthesis+ SDR pipelines and the EOSC Reliance project in geosciences
  • A major Japanese institute (via Paul Walk)

There are now enough things happening with RO-Crate that it is getting hard to keep track of it all - this slide is an incomplete view of what’s happening now.


RO-Crate is an open group - anyone can sign up - and we have meetings twice a month that alternate between the European morning and late evening / Australian late afternoon and early morning.

Arkisto: a repository based platform for managing all kinds of research data / Peter Sefton

Arkisto: a repository based platform for managing all kinds of research data. Peter Sefton (University of Technology Sydney), Marco La Rosa (The University of Melbourne), Michael Lynch (University of Technology Sydney)

This presentation by Peter Sefton, Marco La Rosa and Michael Lynch was delivered at Open Repositories 2021 conference on 2021-06-10 (Australian time) - Marco La Rosa did most of the talking, with help from Michael Lynch.

This presentation is  FAIR driven

We want to emphasise that this presentation is based on the FAIR principles that data should be Findable, Accessible, Interoperable and Reusable.

[Slide: schematic (V1.1 © Marco La Rosa, Peter Sefton 2021) of workspaces - working storage, domain-specific tools and services, used to collect, describe, and analyse - linked to repositories (institutional, domain, or both) with find/access services and a Research Data Management Plan. Reusable, interoperable data objects are deposited early and often; findable, accessible, reusable data objects are reused. Workspaces are considered ephemeral, with active cleanup processes and policy-based data management.]

This schematic is a high level view of research data management showing workspaces enabling research activities (collect, analyse, describe) linked to repositories in a continuous cycle. This is similar to the software development model of commit early and commit often but in this case, deposit well described objects often and re-use as required. Workspaces can include systems like Redcap, OwnCloud and other active work systems and they should be treated as ephemeral and dirty. Repositories can include systems like Zenodo and Figshare where FAIR objects are managed for long term preservation and re-use.


[Slide:]

  • Data must be well described in open standards
  • Data not locked up
  • We know it's portable between applications
  • Data storage layer is COMPLETELY separate from the services layer(s)

So how? Use standards... Weren't you wondering about the picture of a STANDARD Poodle?

ANSWER: OCFL

UTS has been an early adopter of the OCFL (Oxford Common File Layout) specification - a way of storing files sustainably on a file system (coming soon: S3 cloud storage) so that data does not need to be migrated. I presented on this at the Open Repositories conference. PARADISEC has built a scalable and performant demonstrator using OCFL. The specification aims for:

  • Completeness, so that a repository can be rebuilt from the files it stores
  • Parsability, both by humans and machines, to ensure content can be understood in the absence of original software
  • Robustness against errors, corruption, and migration between storage technologies
  • Versioning, so repositories can make changes to objects allowing their history to persist
  • Storage diversity, to ensure content can be stored on diverse storage infrastructures including conventional filesystems and cloud object stores
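To make the "rebuilt from the files it stores" idea concrete, here is a rough Python sketch of the shape of an OCFL object inventory: content is addressed by digest in a manifest, and each version's state maps digests to logical file names. The object identifier, file name, content, and timestamp are invented for illustration, and a real inventory carries more detail than shown.

```python
import hashlib
import json

# Digest the (illustrative) content; OCFL addresses files by digest.
content = b"hello, repository"
digest = hashlib.sha512(content).hexdigest()

# A minimal inventory in the spirit of OCFL 1.0: the manifest maps
# digests to stored paths, and each version's state maps digests to
# the logical file names in that version.
inventory = {
    "id": "urn:example:object-1",  # hypothetical object identifier
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v1",
    "manifest": {digest: ["v1/content/file1.txt"]},
    "versions": {
        "v1": {
            "created": "2021-06-10T00:00:00Z",
            "state": {digest: ["file1.txt"]},
        }
    },
}

# The inventory is plain JSON on disk, parsable by humans and machines.
serialised = json.dumps(inventory, indent=2)
```

Because every digest in a version's state must resolve through the manifest to a stored file, the object's full history can be reconstructed from the files alone, with no original software required.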

[Slide: (V1.0 © Marco La Rosa, Peter Sefton 2021) a CS3MESH4EOSC FAIR Description Service sitting between repository (find/access) services and workspaces - describing data sets and publishing stable FAIR digital objects (Research Object Crates), with future connectors.]

This is an example of service connection in the CS3MESH4EOSC. In this schematic we can see a FAIR Description Service forming the bridge between the CS3MESH4EOSC services (workspaces) and various repositories. A FAIR description service uses linked-data to describe data and its context. The next slide shows some of what you might want to add to a Digital Object (data package) to make it Findable, Interoperable and Reusable.

[Slide: a dataset shown as a cloud and a folder, annotated with questions: ID? Title? Description? Who created this data? What parts does it have? When? What is it about? How can it be reused? As part of which project? Who funded it? How was it made? - covering both addressable resources and local data.]

RO-Crate is a method for describing a dataset as a digital object using a single linked-data metadata document.

  • Lightweight approach to packaging research data with metadata - the examples here aid in the F, A, I and R in FAIR. For example “how can it be reused” means a data licence that specifies who can access, use and distribute data. And “How was it made” is important for reuse and interoperability - think file formats, resolution etc.
  • Community effort - 40+ contributors from AU, EU and US.
  • Can aggregate files and/or any URI-addressable content, with contextual information to aid decisions about re-use. (Who What When Where Why How).
  • Uses schema.org as the main ontology, with domain-specific extensions
  • Has human readable summaries of datasets

Sefton is an editor of the specification.

[Slide: (V1.1 © Marco La Rosa, Peter Sefton 2021) a FAIR Digital Object Export Service and FAIR Description Service connecting CS3MESH4EOSC workspaces to repositories - specific export coded for each service, export for curation, describing FAIR Digital Object packages (RO-Crate), publishing stable FAIR Digital Objects (RO-Crate), and check-out for reuse.]

When your data is well described you can start thinking about higher-level processes and workflows connecting workspaces to repositories. And your services can evolve over time as requirements change and systems improve, without needing to transform your data first. Too often research data management infrastructures get caught up in the specific technologies / systems to be implemented without considering how an ecosystem of services can work as a whole. If the previous architecture slide was a low-level view of the CS3MESH4EOSC implementation, then this is a higher-level view of a possible architecture that connects workspaces to repositories.

[Slide: (V0.1 DRAFT © Marco La Rosa, Peter Sefton 2021) identity services - authentication, authorisation and group services - plus provisioning, spanning workspaces and repositories.]

Going up another few levels we can see that the picture is incomplete. The environment of repositories and workspaces needs more services to actually form a functional system for end users. Going forward we need to think about how to do cross-service authentication of parties and authorization of access to resources, and group membership; licensing, environment provisioning etc. In this way we can tie together the active workspaces and repository services into a cohesive application for end users.

(See a blog post from Peter Sefton floating ideas about how we might close this gap specifically for data-access licenses).

Here’s a schematic of just such an environment at UTS - this shows how the Stash research data management system, which is an instance of ReDBox, orchestrates workspaces and connects them to a research data catalogue (which is actually now a repository).

Who is doing this?

So who is doing this? The PARADISEC project has built a demonstrator using these technologies that is scalable and performant, with approximately 70TB of data!

TOOLS 🧰 ⚒️ and SPECS:

  • OCFL Spec
  • Research Object Crate (RO-Crate) Spec
  • UTS OCFL JS implementation
  • CoEDL OCFL JS implementation
  • UTS RO-Crate / SOLR portal
  • Describo
  • CoEDL Modern PARADISEC
  • CoEDL OCFL tools

And there’s an ever growing ecosystem of tools and libraries.

Describo is an application to build RO-Crates. Installable as a desktop application it simplifies the process of packaging up data as RO-Crates.

The Arkisto website covers all the things we talked about here and more; it has links to all the standards used, a growing number of case studies, abstract use cases, and links to tools.

FIIR Data Management; Findable Inaccessible Interoperable and Reusable? / Peter Sefton

This is a work in progress post. I'm looking for feedback on the substance - there's a comment box below, email me, or see me on twitter: @ptsefton.

I am posting this now because I have joined a pair of related projects as a senior technical advisor, and we will have to look at access-authorization to data on both - licences will vary from open, to click-through agreements, to complex cultural restrictions such as TK Licenses:

  1. Australian Text Analytics Platform (ATAP)
  2. Language Data Commons for Australia (LDaCA)

Summary: Not all research data can be made openly available (for ethical, safety, privacy, commercial or other reasons), but a lot can reasonably be sent over the web to trusted parties. If we want to make it accessible (as per the "A" in the FAIR data principles), then at present each data repository/service has to handle its own access controls. In this post I argue that if we had a Group Service or Licence Service that allowed research teams to build their own groups and/or licences, then the service could issue Group Access Licence URLs. Other services - such as repositories in a trusted relationship with the Group/Licence Service, holding content with digital licences that had such URLs - could do a redirect dance (as with OAuth and other authentication protocols), sending users who request access to digital objects to the Group/Licence Service, which could authenticate them and check whether they have access rights, then let the repository know whether or not to give them access.

In this post I will look at some missing infrastructure for doing FAIR data (Reminder: FAIR is Findable, Accessible, Interoperable, Reusable data) - and will cite the FAIR principles.

If a dataset can be released under an open licence then that's no problem, but if data is only available for reuse under some special circumstances, to certain users, for certain purposes, then the research sector lacks general-purpose infrastructure to support this. Tech infrastructure aside, we do have a way of handling this legally: you specify these special conditions using a licence, as per the FAIR principles.

R1.1. (Meta)data are released with a clear and accessible data usage licence

The licence might say (in your local natural language) "Members of international research project XYX can access this dataset". Or "contact us for a specific licence (and we'll add you to a license-holder group if approved)".

Now the dataset can be deposited in a repository, which will take care of some of the FAIR principles for you including the F-word stuff.


The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.

F1. (Meta)data are assigned a globally unique and persistent identifier

F2. Data are described with rich metadata (defined by R1 below)

F3. Metadata clearly and explicitly include the identifier of the data they describe

F4. (Meta)data are registered or indexed in a searchable resource

Yes, you could somehow deal with all that with some bespoke software service, but the simplest solution is to use a repository - or, if there isn't one, work with infrastructure people to set one up; there are a number of software solutions that can help provide all the needed services. The repository will typically issue persistent identifiers for digital objects and serve up data using a standardised communication protocol, usually HTTP(S).


Once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation.

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol

A1.1 The protocol is open, free, and universally implementable

A1.2 The protocol allows for an authentication and authorization procedure, where necessary

A2. Metadata are accessible, even when the data are no longer available

But repository software cannot be trusted to understand licence text and thus cannot work out who to make non-open data available to - so what will (usually) happen is that it will just make the non-open data available only to the depositor and administrators. The default is to make it Inaccessible via what repository people call "mediated access" - i.e. you have to contact someone to ask for access and then they have to figure out how to get the data to you.

At the Australian Data Archive they have the "request access" part automated:


To download open access data and documentation, click on the “Download” button next to the file you are interested in. Much of the data in the ADA collection has controlled access, denoted by a red lock icon next to the file. Files with controlled access require you to request access to the data, by clicking on the “Request Access” button.

In some cases the repository itself will have some kind of built in access control using groups, or licences or some-such. For example, the Alveo virtual lab funded by NeCTAR in Australia, on which I worked, has a local licence checker, as each collection has a licence. Some licences just require a click-through agreement, others are associated with lists of users who have paid money, or are blessed by a group-owner.

I'm not citing Alveo as a much-used or successful service - it was not, overall, a great success in terms of uptake - but I think it has a good data-licence architecture: there is a licence component that is separate from the rest of the system. The licence checking sits in front of the "Findability" part of the data and the API - not much of that data is available without at least some kind of licence that users have to agree to.


This pattern makes a clear separation between the licence as an abstract, identifiable thing, and a service to keep track of who holds the licence.

Question is, could we do something like this at national or global scale?

We are part of the way there - we can authenticate users in a number of ways, eg by the Australian Access Federation (AAF) and equivalents around the world, and there are protocols that allow a service to authenticate using Google, Facebook, Github et al. These all rely on variants of a pattern where a user of service A is redirected to an authentication service B where they put in their password or a one-time key, and whatever other mechanism the IT department deem necessary, and then are redirected back to service A with an assurance from B that this person is who they say they are.

What we don't have (as far as I'm aware) is a general-purpose protocol for checking whether someone holds a licence. A repository could redirect a web user to a Group Licence Server; the user could transact with the licence service, authenticate themselves (in whatever way that licence service supports), and then the licence service could check its internal lists of who has what licence and return the result. If the licence is just a click-through then the user could do the clicking - or request access, or pay money, or whatever is required.
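To sketch the core bookkeeping such a Group Licence Server might keep behind that redirect - every class name, URL, and identifier here is invented for illustration - something as simple as a mapping of licence URLs to holders would do:

```python
class GroupLicenceService:
    """Hypothetical sketch: tracks which authenticated users hold
    which licence (identified by its URL), and answers yes/no for a
    repository doing the redirect dance."""

    def __init__(self):
        # licence URL -> set of user identifiers (e.g. ORCIDs, emails)
        self.holders = {}

    def grant(self, licence_url, user_id):
        """Record that a user holds a licence (e.g. after a
        click-through agreement or approval by a group owner)."""
        self.holders.setdefault(licence_url, set()).add(user_id)

    def holds_licence(self, licence_url, user_id):
        """The question the repository ultimately needs answered."""
        return user_id in self.holders.get(licence_url, set())

# Illustrative use: a project lead adds a collaborator to a group.
service = GroupLicenceService()
service.grant("https://example.org/licences/xyz-members",
              "0000-0002-1825-0097")
```

The point of the separation is that the repository never needs to understand the licence text - it only asks the service whether this authenticated person holds this licence URL.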

(We are aware of the work on FAIR Digital Objects and the FDO Forum - it does say there that:

FAIR Digital Objects (FDO) provide a conceptual and implementation framework to develop scalable cross-disciplinary capabilities, deal with the increasing data volumes and their inherent complexity, build tools that help to increase trust in data, create mechanisms to efficiently operate in the domain of scientific assertions, and promote data interoperability.

Colleagues and I have started discussions with the folks there.)

Those of us who were around higher-ed tech in the '00s in Australia will remember MAMS - the Meta Access Management System. Its leader, James Dalziel, was at all the eResearch-ish conferences talking about this shared federation that would allow you to log in to other people's systems (we got that - it's the aforementioned AAF), with fantastic user stories about being able to log into a data repository and, by virtue of the fact that you're a female anthropologist, gain access to some cultural resources (we didn't get that bit). I remember Kent Fitch, then from the National Library and one of the team that built the national treasure Trove 😀, bursting that particular bubble over beers after one such talk. He asked: how do you identify an anthropologist? Answer: a university authentication system certainly can't.

I realised a long, long time later that while you can't identify the anthropologists, or tell 'em apart from the ethnographers or ethnomusicologists etc., they can and do make their own groups, via research projects, collaborations and scholarly societies. You could have a group that listed the members of a scholarly society and use that for certain kinds of access control, and you could, of course, let researchers self-select the people they want to share with - let them set up their own groups.

What if we had a class of stand-alone service where anyone could set up a group and add users to it? A project lead could decide on what is an acceptable way to authenticate - via academic federations like AAF, or ORCID, or public services like Github or Facebook - and then add a list of users via email addresses or other IDs. And what if there were a way to auto-populate that group by linking through to OSF groups, or Github organisations, Slack etc. (all of which use different APIs and none of which know about licences in this sense as far as I know)? This would be useful for groups of researchers who need access to storage, compute, and yes, datasets with particular licence provisions. There could be free-to-use group access for individuals and paid services for orgs like learned societies, who could use the list to make deals with infrastructure providers, for example. And there need not be only one of these services; they'd work well at a national level, I think, but could be more granular or discipline-based.

(Does such a thing already exist? Did I miss it? Let me know in the comments below or on twitter - I'm @ptsefton)

We could do this in something like the way modern authentication services work - a simple hand-off of a user to an authentication service, but with the addition of a licence URL - to a service that says: yep, I vouch for this person, they have a licence to see the data.


The above interaction diagram is purely a fantasy of mine. I'm not an Internet Engineer - so I have probably made some horrible errors, please let me know.

Obviously this requires a trust-framework; repositories would have to trust the licence servers and vice-versa and these relationships would have to be time-limited and renewable. You wouldn't want to trust a service for longer than their domain registration for example in case someone else you don't trust buys the domain, that kind of thing. And you'd want some public key stuff happening so that transactions are signed (a further mitigation against domain squatters - they would presumably not have your private key).
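The signing step might be sketched like this. A shared-secret HMAC (Python standard library) stands in here for the public-key signatures suggested above, and every value shown - the secret, the user ID, the licence URL - is illustrative, not a real protocol:

```python
import hashlib
import hmac
import json

# Hypothetical shared secret between the licence server and repository.
SECRET = b"shared-secret-between-services"

def sign_assertion(user_id, licence_url):
    """Licence server side: produce a payload asserting that a user
    holds a licence, plus a signature the repository can verify."""
    payload = json.dumps({"user": user_id, "licence": licence_url},
                         sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload, sig

def verify_assertion(payload, sig):
    """Repository side: recompute the signature and compare in
    constant time, so a forged assertion is rejected."""
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)

payload, sig = sign_assertion("0000-0002-1825-0097",
                              "https://example.org/licences/xyz-members")
```

With public keys instead of a shared secret, the repository would verify against the licence server's published key, which also mitigates the domain-squatter problem mentioned above.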

And this is not an overly complicated application we're talking about - all access-controlled APIs already have to do some of this locally. It's the governance - the trust federations - that will take time and significant resources (so let's start now :-).

And while we're on the subject of trust - this scheme would work in the same way most research does - with trust in the people working on the projects - typically they have access to data and are trusted to keep it safe. Being a member of a project team in health research, for example often involves joining a host organization as an honorary staff member, and being subject to all its policies and procedures. Some research groups have high levels of governance; people are identified using things like nursing registrations, and other certifications; some are ad-hoc collections of people identified by a PI using any old-email address.

NOTE: for data that needs to be kept really, really secure, data that can never even be put on a hard drive and left on a bus, this proposal is not the right scheme. That's where you'd be looking at a Secure eResearch Platform (SeRP), where the data lives in a walled garden and can be inspected only via a locked-down terminal application; or, even stricter, you might only have secure on-site access to data that's air-gapped from any network.

Here's a sketch of some infrastructure. Essentially this is what happened inside Alveo; the question is whether it can be distributed so that repositories can be de-coupled from authorization services.


Service Ceiling: The High Cost of Professional Development for Academic Librarians / In the Library, With the Lead Pipe

By Bridgette Comanda, Jaci Wilkinson, Faith Bradham, Amanda Koziura, and Maura Seale

In Brief

Academic librarian salaries are shrinking, but conference and professional membership fees are increasing. How is this impacting our field and our colleagues? During early 2020, we fielded a national survey of academic librarians about their professional development and service costs that gathered over 600 responses. The results of this survey reveal the inequitable landscape of professional development funding for academic librarians in the United States prior to the COVID-19 pandemic, which has likely served to exacerbate those inequities. In this paper, we describe the results of the survey and the various inequities the responses reveal. We illustrate how the cost of professional development and service functions as a “service ceiling”, a barrier to inclusion for many in our field, and we consider the implications of this exclusion.

About Us

The members of this research team are all academic librarians in a wide range of roles. We are all white women. This scholarship is a strategic act to leverage our privilege as white women to reveal and challenge the stark racial and economic inequities in librarianship, specifically those around professional development and service. We take to heart Megan Watson’s (2017) argument that white women librarians have the opportunity and obligation to examine both whiteness in librarianship and our complicity in maintaining and reproducing it. She calls for white librarians to “unflinchingly deconstruct the inequitable foundations upon which our work is built. In short, we must transform our culture, not simply our demographics, if we wish to become truly inclusive organizations” (159). In this spirit, we aim to open the dialogue about financial realities and how they shape our field through the group we founded: Librarians for Equitable Professional Development (LEPD). For more information about LEPD, including salary disclosures, please visit the Librarians for Equitable Professional Development website.


As we gathered around a display case at the Rock & Roll Hall of Fame, drinking complimentary alcohol, the Association of College & Research Libraries’ (ACRL) 2019 All-Conference Reception struck us as…over the top. We had each spent over $1,000 to attend, but mused that much of ACRL felt bloated, generic, and irrelevant to us despite its reputation as the most important conference in academic librarianship. We felt, and still feel, obligated to attend and to submit to present at ACRL. The counterweight to this professional obligation? Low salaries and hefty student loan payments. Did anyone else feel this financial strain and professional pressure to attend the biggest, and most expensive, conferences in our field?

Two months later, we received a very public “yes” to that question as an anonymous Open Letter about Financial Inequalities at Academic Conferences circulated widely on social media. In it, the writer censures the decision by library conferences not to waive registration fees for presenters. They write, “by ‘saving’ money in not reducing or waiving conference fees for presenters, presenting and attendance at the conference is only for those peers who can afford it” (Anonymous, 2019). Financial inequities of salary or institutional benefits cut our field off from new ideas and reinforce lines of exclusion. Beyond creating this “service ceiling”, the refusal to remunerate the labor of creating presentations or chairing committees reflects, as the anonymous writer (2019) puts it, “a purely loathsome sentiment” that devalues the labor and scholarship of academic librarians.

The American Library Association, of which ACRL is a division, is the parent organization to the largest and most expensive conferences and provides access to much service work in academic librarianship, yet there is a distinct lack of meaningful analysis and reflection, internally or externally, about the cost of these activities and how it impacts academic librarians and librarianship. Professional development and service are frequently tied to promotion and tenure, leading us to ask: how much does it cost, out-of-pocket, for a faculty librarian to gain tenure and promotion? For an academic librarian who is not classified as faculty (and thus whose institution might not necessarily provide support), what impact does this high price tag to participation in professional development and service have? And finally, if presenting at and attending the premier professional development opportunities in our field is increasingly out of reach due to a high price tag, whose voices are left out of the conversation? To answer these questions, we formed Librarians for Equitable Professional Development, an informal organization dedicated to the study and advocacy of more equitable professional development in libraries.

From February-May 2020, we conducted a survey that asked academic librarians in the United States to share how much they spend on service and professional development, and if/how their institutions pay for these obligations. We received 626 responses. Using descriptive and qualitative analysis, we’ve compiled a snapshot of how academic librarians think and make decisions about service and professional development activities, as well as what sort of financial support they receive from their institutions. While the survey was open, COVID-19 grew into a global pandemic. As higher education grapples with the impact of COVID-19 on budgets, we feel that this survey will provide a meaningful look into the “pre-pandemic” professional development and service financial landscape for academic librarians. We acknowledge that as COVID-19 pandemic-related budget shortfalls loom (if they aren’t already affecting us), the conversation around affordability in professional development and service will become even more relevant.

Literature Review

The (lack of) research on financial inequity in academic librarianship

We begin by exploring the limited research on our topic, and then examine salary, student loan debt, and the cost of professional association memberships and conferences to provide context to our analysis. We then look more broadly at how academia’s investment in white supremacy, capitalism and the patriarchy reinforces silence around affordability and financial equity in academic librarianship. This is made evident by the lack of scholarship, conversation, and transparency around the financial realities of professional development.  

Although there is much research around the status and professional identities of academic librarians, there is little work on financial inequities in academic librarians’ professional development. We self-published our own analysis of the change in conference and membership costs related to ALA and ACRL from 1999-2020; results show that during a period when academic librarian wages were stagnant, some conferences raised prices significantly (Wilkinson et al., 2021). Mr. Library Dude wrote several extensively researched, and entertainingly irritated, blog posts in 2015 about ALA membership costs (Hardenbrook, 2015). Blessinger and Kelly (2011) considered the effect of the 2008 recession on funding for professional development for tenure track librarians at Association of Research Libraries (ARL) member libraries, and found that they experienced a reduction in funding for professional development but no corresponding decrease in tenure requirements. Smigielski, Laning, and Daniels (2014) found that professional development funding played a role in the promotion and tenure of ARL librarians, but did not investigate levels of funding. The most detailed research we found, Vilz and Dahl Poremski (2015), investigated support structures for tenure including professional development funding for tenure track librarians broadly, and found that while 97% received at least some funding for various activities, only 48% were satisfied with the amount. The authors do note, however, that with low levels of funding, “tenure criteria are essentially an unfunded de facto mandate” (p. 162). ALA’s “Guidelines for Appointment, Promotion, and Tenure of Academic Librarians” (2010) includes the need for professional development, while the ALA-APA Advocating For Better Salaries Toolkit (Dorning et al., 2014) suggests advocating for faculty status, but neither identifies a need for financial support for professional development.
Not all academic librarians have faculty status or opportunities for promotion, however, and might not have explicit professional development and service expectations. Leebaw and Logsdon’s (2020) recent survey found that 60% of academic librarians identified as “faculty or faculty-like.” Moreover, faculty status for librarians is declining (Walters, 2016). Finally, we note that understandings of “professionalism” within librarianship are inextricably connected to socioeconomics, but this topic is outside the scope of this essay.

The economics of becoming an academic librarian, briefly

At $64,750, the median yearly salary for academic librarians is higher than for other categories of librarians (Bureau of Labor Statistics (BLS), 2020c), and salaries have remained static when accounting for inflation over the past twenty years. However, related macroeconomic factors weigh heavily on librarians who have graduated in the past twenty years. Graduate student debt has increased dramatically since 2000; then, graduate degree seekers borrowed an average of $27,800. In 2016, the average amount increased to $37,270 (Webber & Burns, 2020). Graduate students also may have cumulative debt from their undergraduate education (Webber & Burns, 2020). And the cost of gaining an undergraduate degree has also increased: the published in-state tuition and fee price at the average public four-year institution has increased 278% in the past thirty years; at private nonprofit four-year and public two-year institutions, average tuition and fees have doubled (Ma et al., 2020). The 2017 Student Loan Debt and Housing Report found that the median total debt for borrowers between the ages of 22 and 35 was $41,200. In a 2016 study of librarians, 30.6% had over $25,000 of library school debt, and 52.7% from that group had received no financial aid (Halperin, 2018). Beginning librarians with an MLS working in an academic library made an average salary of $53,953 in 2019; in 2006 (the earliest data available from the ALA-APA salary database) librarians in the same category made $40,761, or $52,625 in 2019 dollars (ALA, 2020a). Current salaries have kept up with inflation, but have grown only marginally. This means a third of academic librarians begin their careers with debt the equivalent of half or more of their annual salary. The debt associated with obtaining the required degrees has risen over recent years, but salaries have not kept pace.

The economics of white supremacy

The persistence of whiteness in librarianship has been a source of concern for LIS scholars for decades (see, for example, this extensive and ever-expanding bibliography (Strand, 2019)). The latest Bureau of Labor Statistics (2019) data describe the librarian workforce as 79.9% women and 87.8% white. Black people represent between 6-10%, Asian Americans 3-5%, and Latinx 9.8-10% of library workers, depending on the data source (Department for Professional Employees, 2020; Household Data National Averages, 2019). ALA’s most recent demographic data from 2012 reveals that academic librarianship is also overwhelmingly white (ALA, 2020b).

There is no data that breaks down academic librarianship salaries by race and gender, but wage and net worth gaps between people who identify as white, Black, and Latinx in the United States are well documented and intersectional (Parker et al., 2016). Current median weekly earnings for white women, regardless of occupation, are 81.5% of those of white men (BLS, 2019). Black women earn 69.9 cents and Latinas earn 63.8 cents on the dollar compared to white men (BLS, 2019). Jennifer Vinopal has argued that these gaps are crucial: “the library staffing pipeline is rooted in the discrepancies in socioeconomic status based on race and ethnicity, discrepancies which are inherited generationally” (2016). Although academic librarians are likely to have one or more graduate degrees, educational attainment does not guarantee upward mobility for Black people and “the benefits of schooling often flow in unequal measure to Blacks relative to whites” (Parker et al., 2016). Black and Latinx people are more likely to have to borrow money to pursue graduate education (Webber & Burns, 2020). The COVID-19 pandemic has further exacerbated racial inequities in the workplace; data shows Black and Latinx people are more likely to have experienced loss of employment or a wage cut due to COVID-19 (Parker et al., 2020). Academic librarians experience rising student loan debt and stagnant salaries, but for systematically marginalized library workers, this is compounded by these broader financial inequities, which impact their ability not only to participate in the development of LIS, but also to enter it in the first place.

The intersection of economic inequities, race, and vocational awe

Academic librarianship does not sit outside of the social inequities mentioned above, but academic librarians frequently refuse to acknowledge them. This is reflected in the lack of salary information in job postings, the precarious “diversity” residency, the lack of meaningful salary data by race, and the absence of both data and research around funding for professional development in academic libraries. These gaps can be tied directly to the interlocking influences of vocational awe and white supremacy in academic librarianship. Jones and Okun (2001) identify “Fear of Open Conflict,” “Individualism,” and “Right to Comfort” as characteristics of organizational white supremacy. Organizational white supremacy scorns any critique of the status quo as complaint, but particularly that which is expressed by workers from marginalized groups, and shuts down open conflict, in order to preserve the comfort of the powerful, while promoting competition among staff. Promoting competition among workers fosters distrust (often especially toward marginalized workers) and makes it difficult for workers to challenge inequities. Drawing on the work of Diane Gusa, Nataraj, Hampton, Matlin, and Meulemans (2020) identify “white institutional presence” in common academic library practices and norms, including, we suggest, silence regarding economic inequities in the workplace. Vocational awe, as theorized by Ettarh (2018), similarly leads librarians to avoid conflict by turning work into a calling and obfuscating issues of low salaries and burnout. Vocational awe binds librarians to “absolute obedience to a prescribed set of rules and behaviors, regardless of any negative effect on librarians’ own lives” (Ettarh, 2018) and is foundational to the profession (Nataraj et al., 2020; Stahl, 2020). 
Drawing on Victor Ray’s (2019) work, which argues that organizations are inherently racialized, Jennifer Ferretti (2020) has recently articulated the need to bring the insights of critical librarianship to bear on the power dynamics, culture, and organizational structures of our workplaces. Economic transparency, she argues, is a key element of more equitable and antiracist workplaces.

In her 1984 book Sister Outsider, Audre Lorde uses the academic conference as the exemplar of privilege. White women fought for the right to be included in these gatherings while ignoring intersectional factors that left out women of color: “If white American feminist theory need not deal with the differences between us, and the resulting difference in our oppressions, then how do you deal with the fact that the women who clean your houses and tend your children while you attend conferences on feminist theory are, for the most part, poor women and women of color?” ([1984] 2007, p. 112) A similar lack of understanding plagues academic librarianship. Intertwining inequalities work together to create a powerful barrier that blocks marginalized groups from becoming or advancing in their careers as academic librarians, a few of which we’ve highlighted in this literature review. The high price to participate in professional organizations and attend conferences, the scarcity of adequately funded professional positions (tenured or not), the pay gap between white men and all others experienced across all types of work in the United States, and the expense of obtaining graduate degrees combine to form a significant negative financial impact for academic librarians from systematically marginalized communities. This paper seeks to begin a conversation about financial barriers to academic librarians’ participation in professional development and service. With this focus, we hope to begin to explicate how these financial barriers serve as gatekeepers for a majority white and largely homogeneous field.


Our survey sought information about academic librarians in the United States: their participation in professional development and service, the level of institutional support they received, and what barriers to participating they had encountered. The survey asked respondents to focus on the past five years at their current institution and was approved by the Indiana University Institutional Review Board (Protocol # 2001988850). The complete survey is available in Appendix A. 

We primarily recruited participants through email and social media. We posted the recruitment letter and a link to the survey via email within relevant, national professional organizations and interest group listservs. Additionally, the researchers recruited participants via their personal social media and groups specific to librarianship. The survey sample was not meant to be a representative sample of academic librarians in the United States but was rather a convenience sample, and through social media, a snowball sample as the link was shared and retweeted. By focusing on recruitment through organizational listservs, recruited participants are likely to be active in professional development and service, which is the focus of the survey. However, there were some pitfalls in our recruitment strategy. We were limited to distributing our survey to listservs to which we had access, which meant that the listservs we focused on were more heavily related to certain subgroups of academic librarians. As all of us are white, middle-class women, we are not members of any spaces for marginalized groups within librarianship, and we did not intentionally distribute our survey to these spaces (with the caveat that our Tweets about the survey may have been shared within such spaces). However, this limitation is not an excuse for not working harder to reach marginalized groups within librarianship with our survey. We do not recommend this approach to recruitment. For future projects, we will use an equity lens in our recruitment process and pursue the intentional inclusion of the experiences of marginalized groups. We also plan to broaden the perspective of our research group as a whole through the addition of new member(s) who identify as marginalized.

Qualitative Data

The survey included three optional open-ended questions. Using an inductive and grounded theory approach, two researchers read through the responses to each of the three questions and engaged in initial coding of the responses. The two researchers then discussed, refined, and created definitions, resulting in a codebook for each question. The two researchers then returned to the questions and selectively coded each response; each response could have between one and three codes, and codes were generally applied in the order they appeared in the responses. A few responses were not coded, because they were unclear or seemed to be responses to a different question. We used SPSS to check inter-rater reliability on the codes, and most were in substantial agreement (see Appendix B). Our analysis relies on primary codes and secondary codes if they are in substantial agreement. Because responses could receive multiple codes, our analysis will discuss response codes rather than number (or percentage) of respondents.
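Inter-rater agreement of the kind described can also be computed without SPSS. Here is a minimal Cohen's kappa for two coders, written from the standard formula (this is not the authors' analysis code, and the example labels are invented for illustration):

```python
# Minimal Cohen's kappa for two coders applying categorical codes.
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)

from collections import Counter


def cohens_kappa(coder_a, coder_b):
    assert len(coder_a) == len(coder_b), "coders must rate the same items"
    n = len(coder_a)
    # Proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Agreement expected by chance, from each coder's label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in labels) / (n * n)
    return (observed - expected) / (1 - expected)


a = ["austerity", "low funding", "austerity", "reimbursement", "austerity"]
b = ["austerity", "low funding", "low funding", "reimbursement", "austerity"]
print(cohens_kappa(a, b))
```

Values above roughly 0.6 are conventionally read as "substantial agreement," the threshold the analysis above refers to.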


Descriptive Analysis

Overall Funding

The total number of survey respondents was 626. 80% of our respondents identified as women, and 85% as white, which corresponds nearly exactly to the gender and racial breakdown for librarians as reported by the Bureau of Labor Statistics (Household Data National Averages, 2019). Of the 15% of respondents who did not identify as white, 1.7% were Black, 3.8% were Asian American, and 5.3% were Latinx. American Indian/Alaskan Native and Native Hawaiian/Pacific Islander each constituted less than a percentage point of our respondents. Respondents were then asked a series of questions about their institution and role. Most respondents were at large, urban, public research institutions, about half were early career librarians, and about half supervise others. In addition, 40% of respondents were classified as tenure-track faculty, 23% were non-tenure track faculty, and 28% were classified as staff. 61% of respondents work at institutions with a promotion system for librarians, and of those, 89% stated that their promotion was contingent on professional development or service (Figure A).

Is librarian promotion contingent upon professional development or service?

Figure A

We asked participants to disclose the amount of funding received per year for professional development and service. 6% of participants received no institutional funding, 13% received less than $500, 61% received $500-$2000, and 20% received over $2000 per year. When receiving funding, 27% are reimbursed by their institutions following the event, while 64% receive a combination of pre-payment in advance by their institution and reimbursement.

To get a sense of what costs are covered, we asked respondents to report how frequently their institutions defrayed some portion of the costs for common expenses (Table A). Notably, 76% of respondents said that their institutions never cover the cost of association membership dues, while 15% reported sometimes, and only 8% reported that they were always covered. 66% of respondents said that their institutions always defrayed conference registration and 33% reported that registration costs were sometimes defrayed.

What costs does your institutional funding defray?

Expense | Always | Sometimes | Never | Total responses
Conference registration | 67% (371) | 33% (186) | 0.2% (1) | 558
Training/workshop/webinar (in-person or virtual) | 47% (260) | 50% (275) | 3% (17) | 552
Professional organization membership dues | 8% (46) | 15% (84) | 76% (419) | 549
Travel | 54% (299) | 46% (254) | 0.5% (3) | 556
Accommodations | 54% (300) | 45% (253) | 0.7% (4) | 557
Meals | 40% (224) | 52% (289) | 8% (42) | 555
Other | 27% (12) | 47% (21) | 27% (12) | 45

Table A

Nearly 38% of our participants reported that they face barriers in accessing their institutional funding (Figure B). A full 81% have self-funded professional development or service within the past 5 years (Figure C). Of these, 84% have spent up to $1000 on self-funding, while 14% have self-funded over $1000. Interestingly, women were more likely to report self-funding than men: 84% of women versus 72% of men. Finally, 86% of our respondents stated that they have abstained from professional development and service opportunities due to the cost (Figure D). 

Figure B

Figure C

Figure D

Funding by Position, Institution Type, and Institution Size

Our survey included three options for position type: staff, non-tenure-track faculty, and tenure-track faculty. As seen in Table B, 9.1% of staff receive no funding for professional development and service, which is about double the number of faculty-status librarians with no funding. 38.7% of tenure-track faculty receive $1000 or less.

How much funding do you receive per year from your institution? (Position)

Position | No funding | $1-$500 | $501-$1000 | $1001-$2000 | $2001-$3000 | $3000+
Tenure track | 5% | 14% | 24% | 37% | 12% | 7%
Non-tenure track | 4% | 9% | 21% | 45% | 16% | 5%

Table B

We grouped institutions into four categories: 4-year undergraduate colleges, community colleges, comprehensive colleges (undergraduate and master’s programs), and research universities. 14.6% of community college librarians receive no funding, while 5.3% of comprehensive college and 4.2% of research university librarians receive no funding; the overall percentage of respondents receiving no funding is 6% (Table C). Community college librarians receive the least funding overall, with 61.8% of community college librarians receiving $1000 or less per year from their institution. In contrast, 66.2% of research university librarians receive $1001-$3000 in funding per year, with an additional 10.4% of these receiving over $3000 per year. 

How much funding do you receive per year from your institution? (Institution Type)

Institution type | No funding | $1-$500 | $501-$1000 | $1001-$2000 | $2001-$3000 | $3000+
4 Year Undergraduate College | 0% | 20% | 21% | 39% | 18% | 2%
Community College | 15% | 36% | 26% | 18% | 3% | 3%
Comprehensive College | 5% | 16% | 28% | 41% | 5% | 5%
Research University | 4% | 2% | 17% | 46% | 20% | 10%

Table C

Our survey asked participants to categorize their institution as large, medium, small, or very small, according to the Carnegie classification for higher education (n.d.). There were too few responses in the “very small” category to ensure anonymity, so those numbers are not reported here. However, as seen in Table D, librarians at large and medium institutions received more funding than those at small institutions. 54.3% of librarians at small institutions receive $1000 or less annually, compared to 26.1% at large and 23.9% at medium institutions who receive $1000 or less.

How much funding do you receive per year from your institution? (Institution size)

No Funding | $1-$500 | $501-$1000 | $1001-$2000 | $2001-$3000 | $3000+

Table D

These numbers are similar when looking at self-funding. 80.8% of all respondents reported self-funding occasionally, with an additional 1.9% responsible for self-funding all of their professional development (for a total of 82.7% of respondents funding some or all of their professional development). Table E breaks this down by institution type. 93% of librarians at 4-year undergraduate colleges either self-fund occasionally or entirely, followed by 85.1% at community colleges, 83.6% at comprehensives, and 81.3% at research universities. Nearly 8% of community college librarians are entirely or mostly self-funded, compared to 1.9% of all respondents.

Have you ever had to self-fund? (Institution type)

Institution type | Yes | Entirely self-funded | No
4 Year Undergraduate College | 91% | 2% | 7%
Community College | 77% | 8% | 15%
Comprehensive College | 84% | 0% | 16%
Research University | 81% | 0.3% | 19%

Table E

Qualitative Analysis

Funding Decisions

The first open-ended question was a follow up to the survey question, “How is funding made available to you by your institution?” If respondents answered “It depends,” we asked them to use the text box to explain further. Many respondents clarified in the text box that despite having a set amount each year, they still had to apply for funding. The most common codes we assigned to this question are in Table F, which includes our definitions and representative responses. The coding for this question includes both primary and secondary codes, both of which were in substantial agreement.

Code | Codebook Definition | Representative Response
Admin decides | Library administration determines whether or not to fund something. | “My dean provides a set amount to each librarian that changes wildly from year to year and isn’t based on rank or the individual opportunities”
Amount based | The library funds a set amount of the request based on costs, role (e.g. presenter v. attendee), job description, rank, tenure status, performance evaluation, etc. | “Amount varies depending on requirements to participate (such as committee membership), presentations, applicability to essential job functions–all at the discretion of the dean.”
Set amount determined annually | A set amount of money determined annually regardless of process. This can include annual cap amounts based on annual projected needs. This often requires an application and/or approval process. | “I receive a set amount every year, but can submit requests for funding for individual opportunities if particular things come up that would put me over my set amount.”
Ad hoc additional funds | When people are granted amounts of money on a case by case basis beyond their usual yearly allotment. This often includes an application process for approval. | “Set amount budgeted every year, with extra absorbed into general library budget if individual librarians don’t use. Also options to seek extra funding if we can justify it is good for our library to be represented at an event.”

Table F

While policies for accessing funding are varied, most response codes (69%) described their institution as having an annual set amount earmarked for professional development by the department or library, which may have to be applied for. Some said the amount was inconsistent from year to year, or was so low that it did not meaningfully contribute. Almost 10% of the response codes reported that their administration decided whether or not to fund on a case-by-case basis. Some respondents who described an annually set amount noted that additional funds beyond this amount were available on an ad hoc basis. Amount-based and percentage-based funding policies were mentioned by some participants, where the amount or percentage funded was determined by the type of cost (e.g., 50% of lodging might be covered) or by the role of the staff member requesting funds (e.g., presenting at a conference would receive more funding).

Barriers to Funding

We also asked: “Please describe any barriers that prevent you from accessing institutional funding for professional development or service activities.” The most common primary codes were:

Code | Codebook Definition | Representative Response
Austerity | Not enough money overall, if there is any at all. This can include campus austerity measures and rotating who is eligible for funding at any one time. | “Limited budget. I am one of six new library faculty (some positions are new) and it seems that the budget for PD has not increased accordingly.”
Low funding | Limits on the amount of money provided that often preclude all but the least expensive offerings from being fully covered. These include low funding caps. | “If the event goes a single penny over we are required to pay for it out of pocket. It means that not everything gets covered and we have to often pay for things out of pocket. I have to pay out of pocket for something I’m required to go to, to keep my job. It’s insane.”
Reimbursement | Can be hard to cover funds upfront. Reimbursement processes are lengthy (to the point of accumulating interest on credit cards), difficult, delayed, or otherwise burdensome. Reimbursement systems are hard to use and not flexible. | “Our state reimbursement process is ludicrous. Up until last year, we were required to submit itemized receipts for every single meal. Hotels *must* be the conference hotel or you can’t go over the state rate — even if the hotel you are choosing is cheaper than conference hotel. I use my funds for membership dues, conference registration fees, and plane tickets. And for that last, I’ve had to write paragraphs explaining why I won’t take a flight with a connection that will put me in transit for 12 hours when a direct flight costing $25 more will get me there in 3. I just pay for food/hotel myself.”
Unclear, inconsistent, burdensome logistics | Includes opaque, inconsistent, or burdensome processes, unclear, inconsistent, or burdensome criteria, general lack of transparency, lack of financial office/personnel, and general difficulty in applying and/or getting approval. | “The largest barrier is the lack of transparency. On any [sic] given year we don’t know how much money will be granted or available and it is often disbursed somewhat arbitrarily, or at least that’s how it feels since the librarians are left out of the process entirely. Some librarians don’t apply for funding because they don’t think it’s worth their time given how it’s distributed.”

Table G

Unclear, inconsistent, and burdensome logistics for accessing funding were the most frequently cited barrier. 18% of response codes identified austerity-related barriers, meaning that there was little to no funding available, while other respondents reported low caps on funding. Reimbursement issues, such as an inability to pay costs upfront and lengthy reimbursement processes afterward, also appeared as barriers. Finally, some of our respondents stated that their position itself was a barrier to accessing funding, with their institutions funding some positions but not others, as one respondent described: “The funding isn’t equitable across departments; our department is told we’ll be reimbursed for one national conference per fiscal year, while others have no such restrictions. Still, if you ask the right person and they’re in a good mood, they might approve your travel for a second conference.” Some respondents said that funding was claimed too quickly for them to access it. Other barriers to accessing funding brought up by our respondents included internal and external competition for funding; lack of staffing for coverage; and internal inequities within the library.

Lost Opportunities and Missing Voices

The final open-ended question asked respondents to “Please describe the types of professional development and service opportunities you’ve chosen not to pursue due to their cost.” Many respondents gave specific examples of conferences, organizational memberships, and learning opportunities that they had not pursued. The most common codes were:

Code: Attending conferences
Codebook definition: Includes international conferences, distant/national, not local conferences, and multiple conferences in one year.
Representative response: "A lot of my conferences I like to attend happen every other year or less frequently than that. Sometimes, all of these conferences fall in the same fiscal year. To make these more affordable for me, I can’t attend three conferences that require flights and hotels. I avoid most ALA and ACRL committees because I can’t afford to attend every year/every conference."

Code: Memberships
Codebook definition: Paid memberships in professional organizations, both within librarianship and outside.
Representative response: "Also, I have chosen to not join national professional associations as the yearly dues are also a little more than my budget can handle right now."

Code: Online learning
Codebook definition: Webinars, online courses, certificates, seminars, continuing education.
Representative response: "Courses, webinars, seminars, conferences basically anything that costs more than $50."

Code: Training
Codebook definition: Vendor trainings, technical training, bootcamps, workshops.
Representative response: "I chose not to pursue attending the national user group conference for our ILS (part of my job is to support this system). I chose not to attend any training beyond that which is free from vendors. I chose not to pursue training related to our ILS, which would have aided in implementing new system features."

Table H

When asked to describe opportunities they’ve chosen not to pursue due to costs, our respondents frequently reported foregoing conferences at the national, international, and, for some, even the local level. Leadership and training institutes, such as ACRL’s Immersion Program, and association memberships were also named as opportunities not taken by our participants. Twenty-seven respondents walked away from presentation opportunities due to the costs involved, even after their proposals had been accepted or they’d received a scholarship or award. As one respondent described, “I was selected to attend as a scholarship recipient and because the hotel and flight were outside of my price range, I was unable to attend… I would love to be a member of more organizations but I physically can’t be.” Service and leadership opportunities were also foregone by some respondents due to cost.


In many cases, academic librarians who wish to retain their jobs cannot choose whether or not to engage in professional development and service: 89% of survey respondents said promotion was based at least in part on these activities, yet that participation is often unsupported, and 85% have at some point chosen not to pursue a professional development opportunity due to cost. In this section, we consider the implications of the high cost of academic librarians’ professional development. In her essay on critical approaches to quantitative research, Selinda Berg (2018) suggests that we also examine the outliers, underrepresented, and statistical minorities in order to develop a more holistic understanding. We agree emphatically with Berg when she states that “outliers are no less important despite their smaller numbers” (p. 231). Although some of the responses referenced here were not representative, we wish to draw attention to them, as they reflect broader societal inequities.

True Costs

While ALA membership fees have remained stagnant over the past twenty years (Wilkinson et al., 2021), data from our survey leads us to believe that affording membership fees is difficult for some academic librarians. 76% of respondents reported that their funding could not be used to pay membership fees; perhaps not coincidentally, ALA membership has dropped 11% in the past decade. One survey respondent wrote that they had given up their ALA membership:

“Since my institution doesn’t cover membership costs, I would have to pay my own dues…The tiers, the additional costs for interest groups, then paying even more for access to professional development resources, the costs simply aren’t worth the return.” 

Members also receive lower conference registration rates, and while registration rates for ACRL haven’t increased more than inflation over the past 20 years, registration for ALA conferences has (ALA Annual’s cost has increased 30% and ALA Midwinter’s 40%) (Wilkinson et al., 2021). Survey respondents often talked about their inability to attend major conferences, which affected their decisions to pursue professional service:

“I elected to not pursue higher level service within ACRL to avoid having to attend ALA… I can’t save for retirement, pay off my student loans, and defray my medical expenses and spend a lot of time or money to engage beyond what I do now.”

Other respondents directly connected their inability to pay for professional opportunities with other economic stressors in their lives, such as a high cost of living (notably, the presence of academic institutions often directly affects the cost of housing; see Sundstrom & Sokoloff, 2021, for selected librarian salaries adjusted for local cost of living):

“The main barrier is that there is no institutional funding for faculty, except through grants that won’t cover librarian-specific professional development (but may cover sending a cohort to a particular conference focused on teaching, for example)… Given the very high cost of living in my area, I have decided that I simply cannot afford to spend thousands of dollars of my own money traveling to conferences such as ALA, ACRL, or even the CARL conference in California anymore. Instead, beginning this year, I’ve decided to rely on free online webinars.”

The cost to attend a conference extends beyond registration. One respondent described barriers related to travel: “The primary barrier is travel costs – coming from a rural college, the closest airports are 2 hours away, and most destinations require more than one flight to transfer. This typically requires an additional travel day with an additional night of lodging.” Additionally, many ancillary costs, such as gas and lodging, have risen in the past twenty years (Wilkinson et al., 2021). Hidden costs exist for librarians with physical disabilities attending in-person conferences. One survey respondent explained, “I have a disability which sometimes requires that I pay for things like rental mobility devices (if not supplied by the conference), or more expensive transportation options. These extra costs are not accounted for so sometimes I have to skip opportunities because the out of pocket costs would be too much for me to eat.” Women are more likely than men to be unpaid primary caregivers (CDC, 2018) and family care was mentioned multiple times as an additional cost: “Given my salary and my family obligations, I cannot afford anything more than a couple hours from home, and I cannot stay overnight ever. So if it isn’t local, I don’t go.” Financial norms around professional development assume a certain type of academic librarian: middle-class or wealthy, abled, without caregiving responsibilities, partnered or married, and living in an easily accessible metro area. Given the systemic financial inequities experienced by library workers from the marginalized groups we described earlier, we suggest that these norms also assume white librarians.

Inequitable Structures

Laborious, complicated, or opaque practices and policies for obtaining funding or reimbursement were a key theme in our survey results. These opaque processes are part and parcel of a system that privileges and reinforces white, middle-class values, as A. S. Galvan (2015) points out. This creates a closed and biased system in which funding is reserved for those who know how to navigate these processes, namely those who are white and financially secure, and shuts out everyone else. It is essentially the hidden curriculum of academic bureaucracy. One respondent explains:

“We have a percentage-based reimbursement model (e.g., 30% of conference registration covered) that is applied inconsistently and capriciously across staff. The overall effect is discouraging…” 

Another describes how reimbursement after the event acts as a barrier:

“The mental gymnastics of planning is a barrier. It’s difficult to plan conference/event expenses when I most [sic] expenses will not be reimbursed until AFTER the conference/event. I try to save money by booking accommodations and travel in advance, but often times [sic] this means I am on the hook for interest accrued on my credit card for 2-3 months…”

The practice of reimbursement instead of direct payment emerged as a key barrier in our survey. Only 9% reported that their institution paid all costs up front. For the remaining 91%, this means possibly accruing credit card interest, yet another cost and burden for librarians experiencing financial precarity. This model also assumes an open line of credit or extra cash on hand. But this isn’t the case for all academic librarians: “I do not have a credit card that can float thousands of dollars at a time while I wait to be reimbursed. I have been told to get a credit card or ask family for money to pay for professional development opportunities.” Macroeconomic factors such as student loan debt and the racialized and gendered financial inequities that pervade U.S. society also affect academic librarians. But our policies and systems around professional development and service fail to acknowledge this, instead assuming an ideal academic librarian who doesn’t exist.

Ultimately, our survey and subsequent analysis found that levels of institutional support are often insufficient given the economic constraints faced by many academic librarians. This is a formidable obstacle for those who need or want to participate in professional development and service, particularly early-career librarians and those facing a tenure/promotion process. COVID-19 is already affecting academic libraries, with budgets preemptively cut to address revenue loss and anticipated enrollment drops (Friga, 2020), and will likely exacerbate the inequities we’ve described.

Gaps and Next Steps

This paper relies on descriptive statistics and qualitative coding, as we experienced some issues in accessing SPSS remotely during the pandemic; a next step might include a more complex statistical analysis. We also hoped to be able to analyze the survey results through the lens of racial equity, but realized that we did not have enough data from librarians of marginalized groups to draw meaningful conclusions without compromising the anonymity of the respondents, although our survey sample is representative of the profession in terms of race and gender (Bureau of Labor Statistics, 2019). We do feel this question is incredibly important, however, and hope to work with librarian colleagues from marginalized groups in the future. 


“I find it really ironic and disheartening that a profession so supposedly devoted to equality enforces such financially restrictive and exclusionary practices. Attending expensive conferences all but insures [sic] greater recognition and faster advancement for those in more comfortable financial positions.”  

We, too, are disheartened and disappointed. The results of our survey are clear: there are significant barriers for librarians to participate in professional development and service. The impact of these barriers? A service ceiling that promotes homogeneity (specifically, whiteness, economic security, and ableism) and suppresses diversity. How many librarians are priced out of contributing to our field due to a lack of financial support from their institutions, the high costs of participation in associations, and an overall bleak economic landscape? We fear that without structural change, the privilege of working to shape our field through professional development and service will continue to be available only to a small, elite subsection of our colleagues. Indeed, librarianship continues to fail in recruiting and retaining workers who are from systematically marginalized communities. This, and the inequities we’ve identified in funding for professional development, narrows both the discussions within and future visions of academic librarianship. We do not hear from so many colleagues, including those who work at community colleges, in rural areas, at institutions that serve marginalized groups, or at under-resourced institutions, and so our initiatives and outcomes often leave them out.

ACRL 2021 is an apt example. ACRL announced in fall 2020 that its biennial conference, recognized as the most prestigious event for US-based academic librarians, would be held exclusively online in April 2021. In December 2020, accepted presenters discovered that the speaker agreement forms required them to register for the conference at the $289 early bird rate. ACRL adapted its conference format in acknowledgment of the global pandemic, yet its pricing structure reflects a refusal to acknowledge that the finances of its constituents may have been affected by the same pandemic. This inconsiderate approach was underscored by the disappointed responses to this news on social media (Figure E) and by ACRL-New Jersey’s letter to ACRL. Our research tells us, with resounding clarity, that those in our field desire to participate in professional development and service, but employers and professional associations deny us this opportunity.

Figure E

Yet, we see promising avenues in our work and elsewhere to combat the service ceiling. To start this process, we must reckon with the ways that white supremacy is embedded in our profession’s professional development and service norms. We erect barriers that (often indirectly) prevent women, academic librarians from marginalized groups, and less well-resourced colleagues from participating in professional development and service opportunities, which in turn negatively impacts their ability to gain promotion or tenure. White supremacy creates a professional culture that silences conversations about financial equity that might shed light on the issues presented in our research. An important step in breaking the silence around finances in our field is insisting on financial transparency and accountability at the professional association level. The ALA Midwinter 2020 budget fiasco particularly highlights the need for transparency from organizations that continually raise prices without disclosing how those costs are distributed across the organization (Schwartz, 2020). Sliding scale pricing, reparations-aligned pricing, simplified membership fees, virtual participation, and incentives in the form of reduced or comped registration and/or membership for service or participation would all go a long way toward making professional development in our field more accessible and equitable. At the personal level, creating a more transparent atmosphere about our individual financial realities can help reduce the implicit pressure to self-fund. We’re taking a small step in this direction by making our salaries public, and we encourage our colleagues to be similarly open about their financial privilege and struggles.

Because financial inequity in LIS professional development and service is linked so heavily to systems of oppression, we recognize the need for this conversation to continue in multiple directions. Our group, Librarians for Equitable Professional Development, hopes to build on this exploratory research, moving from a single project to a collection of strategic endeavors. More research on this topic needs to be undertaken with marginalized communities in LIS; LEPD would like to partner with these communities and collect data about their professional development and service experiences. We also envision advocating for our major professional organizations to “open their books” to allow us, or other researchers, to audit how membership and conference fees are spent. The end goal isn’t just questioning the fiscal soundness of elaborate receptions at ACRL. Ultimately, we need our workplaces and professional organizations in librarianship to recognize and dismantle their oppressive, inequitable professional development and service practices. Failure to eliminate the service ceiling is racist and ableist, and it propagates a scholarly environment of homogeneity and mediocrity. Our entire field suffers because of these exclusionary practices.


The authors would like to thank our reviewers, Lalitha Nataraj and Kellee Warren, and Ian Beilin for preparing the article for publication. We would also like to thank Craig Smith, the Assessment Specialist at the University of Michigan Library, for his help with drafting and revising the survey.


American Library Association. (2020a). ALA-APA library salary database.

American Library Association. (2020b). ALA personal membership: Benefits & types. 

American Library Association. (2010). A guideline for the appointment, promotion and tenure of academic librarians.

Anonymous. (2019, May 2). Open letter about financial inequalities at academic conferences. 

Berg, S. (2018). Quantitative researchers, critical librarians: Potential allies in pursuit of a socially just praxis. In K. P. Nicholson & M. Seale (Eds.), The Politics of Theory and the Practice of Critical Librarianship (pp. 225-235). Library Juice Press.

Blessinger, K., & Costello, G. (2011). The effect of economic recession on institutional support for tenure-track librarians in ARL institutions. The Journal of Academic Librarianship, 37(4), 307–311.

Bureau of Labor Statistics. (2019). Household data annual averages: Median weekly earnings of full-time wage and salary workers by selected characteristics. U.S. Department of Labor.

Bureau of Labor Statistics. (2020, November 11). Occupational outlook handbook, librarians and library media specialists. U.S. Department of Labor.

Carnegie Classification of Institutions of Higher Education. (n.d.). Basic classification description.

Department for Professional Employees. (2020). Library professionals: Facts & figures. AFL-CIO.

Dorning, J., Dunderdale, T., Farrell, S. L., Geraci, A., Rubin, R., & Storrs, J. (2014). Advocating for better salaries toolkit. American Library Association.

Ettarh, F. (2018). Vocational awe and librarianship: The lies we tell ourselves. In the Library with the Lead Pipe. 

Executive Board of ACRL-NJ. (2021). New Jersey librarians response to ACRL 2021 conference fees.

Ferretti, J. (2020). Building a critical culture: How critical librarianship falls short in the workplace. Communications in Information Literacy, 14(1), 134–152.

Galvan, A. (2015). Soliciting performance, hiding bias: Whiteness and librarianship. In the Library with the Lead Pipe.

Halperin, J. R. (2018). A contract you have to take: Debt, sacrifice, and the library degree. In J. Percell, L. C. Sarin, P. T. Jaeger, & J. Carlo Bertot (Eds.), Re-envisioning the MLS: Perspectives on the Future of Library and Information Science Education (Vol. 44A, pp. 25–43). Emerald Publishing Limited.

Hardenbrook, J. (2015, February 15).  ALA: The membership cost is too damn high? Mr. Library Dude.

Jones, K. & Okun, T. (2001). The characteristics of white supremacy culture from Dismantling Racism: A Workbook for Social Change Groups. Showing Up for Racial Justice. 

Leebaw, D., & Logsdon, A. (2020). Power and status (and lack thereof) in academe: Academic freedom and academic librarians. In the Library with the Lead Pipe.

Lorde, A. (2007). Sister outsider: Essays and speeches (Rev. ed.). Crossing Press.

Ma, J., Pender, M., & Libassi, CJ. (2020). Trends in college pricing and student aid, 2020. Trends in higher education series. College Board.

Nataraj, L., Hampton, H., Matlin, T. R., & Meulemans, Y. N. (2020). “Nice white meetings”: Unpacking absurd library bureaucracy through a Critical Race Theory lens. Canadian Journal of Academic Librarianship, 6, 1–15.

Parker, K., Horowitz, J. M., & Brown, A. (2020, April 21). About half of lower-income Americans report household job or wage loss due to COVID-19. Pew Research Center.

Pew Research Center. (2016, June 27). On views of race and inequality, Blacks and whites are worlds apart.

Ray, V. (2019). A theory of racialized organizations. American Sociological Review, 84(1), 26–53.

Schwartz, M. (2020, February 14). American Library Association’s $2 million shortfall prompts demands for transparency, reform: ALA Midwinter 2020. Library Journal. 

Smigielski, E. M., Laning, M. A., & Daniels, C. M. (2014). Funding, time, and mentoring: A study of research and publication support practices of ARL member libraries. Journal of Library Administration, 54(4), 261–276.

Stahl, L. (2020, October 23). Librarian, read thyself. The Rambling.

Strand, K. J. (2019). Disrupting whiteness in libraries and librarianship: A reading list. University of Wisconsin-Madison Libraries, Office of the Gender and Women’s Studies Librarian.

Sundstrom, P., & Sokoloff, J. (2021). Librarian salaries adjusted for local cost of living. Tableau Public.

Vilz, A. J., & Poremski, M. D. (2015). Perceptions of support systems for tenure-track librarians. College & Undergraduate Libraries, 22(2), 149–166.

Vinopal, J. (2016). The quest for diversity in library staffing: From awareness to action. In the Library with the Lead Pipe.

Watson, M. (2017). White feminism and distributions of power in academic libraries. In G. Schlesselman-Tarango (Ed.), Topographies of whiteness: Mapping whiteness in library and information science (pp. 143–174). Library Juice Press.

Walters, W. H. (2016). Faculty status of librarians at U.S. research universities. The Journal of Academic Librarianship, 42(2), 161–171.

Webber, K. L., & Burns, R. (2020). Increases in graduate student debt in the US: 2000 to 2016. Research in Higher Education, 1–24.

Wilkinson, J., Bradham, F., Comanda, B., Koziura, A., & Seale, M. (2021, January 22). LIS Conference & Membership Costs 1999-2020. Librarians for Equitable Professional Development.

Appendix A

Academic Librarian Professional Development and Service Costs

Informed Consent Statement for Research Protocol # 2001988850, Indiana University 


You are invited to participate in an online survey on the financial costs related to academic librarian professional development and service in the United States. This survey’s purpose is to investigate financial equity in LIS academic libraries professional development. Professional development includes conference attendance, professional service and its ties to tenure/promotion, and presenting at conferences & meetings. Read the full recruitment letter.

PARTICIPATION

Your participation in this survey is voluntary. You may refuse to take part in the research or exit the survey at any time without penalty.


The purpose of this study is to investigate financial equity in LIS academic libraries professional development. This research project is conducted by: Jaci Wilkinson, Head, Discovery and User Experience, Indiana University (Principal Investigator).


We are hoping to collect responses from approximately fifty participants, but we are not capping the number of responses.


You will be asked to take a survey asking about your professional development costs and your position. It should take approximately 10 minutes to complete.


The possible risks or discomforts of the study are minimal. This survey asks questions about the support available for librarians’ professional development at your institution, which may cause you to feel a little uncomfortable answering these more sensitive survey questions.

BENEFITS

You will receive no direct benefits from participating in this research study. However, your responses will help us learn more about financial equity for professional development across a wide range of academic libraries, which is currently very underrepresented in LIS research.

WILL MY INFORMATION BE PROTECTED?

Your survey answers are confidential. No identifying information will be asked of you. As such, no names or other identifying information would be included in any publications or presentations based on these data.


You will not be paid for participating in this study.


There is no cost to you for taking part in this study.


If you have questions at any time about the study or the procedures, you may contact the principal investigator, Jaci Wilkinson, at . For questions about your rights as a research participant, to discuss problems, complaints, or concerns about a research study, or to obtain information or to offer input, please contact the IU Human Subjects Office at 800-696-2949 or at .


You may print a copy of this form for your records. Clicking the Next button in the survey indicates that you have read the above information, you are 18 years of age or older, and you voluntarily agree to participate in this study.

Q2 Demographics

Q30 Are you employed as a librarian at an academic institution in the United States?

o Yes (1)

o No (3)  

Skip To: End of Survey If Are you employed as a librarian at an academic institution in the United States? = No

Q3 How long have you been at your current institution? 

o 0-5 years (1)

o 6-10 years (2)

o 11+ years (3)

Q4 Are you in a supervisory role at your current institution? 

o Yes: supervising mostly staff/faculty (1)  

o Yes: supervising mostly students (2)  

o No (3)

Q5 What best describes the institution where you currently work? 

o Community college (1)  

o 4-year undergraduate college (2)

o Comprehensive college (undergraduate and master’s programs) (3)

o Research university (4)

o Other (5) ________________________________________________ 

Display This Question: 

If What best describes the institution where you currently work? != Community college 

Q9 What is the estimated size of your current institution? (From the Carnegie Classification description for size and setting: four-year category.)

o Very small (fewer than 1,000 enrolled) (1)  

o Small (1,000 – 3,000 enrolled) (2)  

o Medium (3,001 – 9,999 enrolled) (3)  

o Large (at least 10,000 enrolled) (4)

Display This Question: 

If What best describes the institution where you currently work? = Community college

Q33 What is the estimated size of your institution? (From the Carnegie Classification description for size and setting: two-year category.)

o Very small (fewer than 500 enrolled) (1)  

o Small (500 – 1,999 enrolled) (2)  

o Medium (2,000 – 4,999 enrolled) (3)  

o Large (5,000 – 9,999 enrolled) (4)

o Very large (at least 10,000 enrolled) (5)  

Q6 Select the description which best describes your institution: 

o Public (1)  

o Private (2)  

o For-profit (3)  

Q8 What setting best describes your institution? 

o Rural (1)  

o Suburban (2)  

o Urban (3)

Q10 How is your current position classified? 

o Staff (1)  

o Tenure track faculty (2)  

o Non-tenure track faculty (4)  

o Other (3) ________________________________________________ 

Display This Question: 

If How is your current position classified? != Tenure track faculty 

Q26 Does your institution have a promotion system for librarians?

o Yes (1)

o No (2)  

o Not sure (3)  

Display This Question: 

If How is your current position classified? = Tenure track faculty 

Or Does your institution have a promotion system for librarians? = Yes 

Q27 Is librarian promotion contingent upon professional development or service?

o Yes (1)

o No (2)  

o Not sure (3)

Q11 Are you a member of any library or subject-area professional associations? Select all that apply below and list any additional organizations in the “Other” area.

▢ ALA (1)  

▢ ACRL (2)  

▢ AASL (3)  

▢ A state library association (4)  

▢ AACU (5)  

▢ SLA (6)  

▢ Other (7) ________________________________________________

Q12 How do you identify?  

This question is being asked to identify any potential association between gender identity and professional development or service funding.

▢ Woman (1)  

▢ Man (2)  

▢ Transgender (3)  

▢ Gender non-conforming (4)  

▢ Genderqueer (6)  

▢ Non-binary (7)  

▢ Prefer to self-describe: (5)  


▢ Prefer not to say (8)

Q31 Select one or more racial/ethnic categories to describe yourself: 

This question is being asked to identify any potential association between race/ethnicity and professional development or service funding.

▢ White (1)  

▢ Black or African American (2)  

▢ American Indian or Alaska Native (3)  

▢ Asian (4)  

▢ Native Hawaiian or other Pacific Islander (5)  

▢ Hispanic or Latinx origin (A person of Cuban, Mexican, Puerto Rican, South or Central American descent, or other Spanish culture or origin, regardless of race) (8)

▢ Prefer to self-describe: (7)  


▢ Prefer not to say (6)  

Q13 Professional Development and Service Support  

This section of the survey asks about your experience with institutionally-provided financial support for professional development and service opportunities from the past five years or your most current position (in case you’ve been in your current position less than five years).

Q15 How does your institution provide professional development funding?

o Institution directly pays (4)

o Reimbursement in advance (2)

o Reimbursement AFTER the professional development or service event/opportunity occurs (1)

o A combination of reimbursement and payment in advance (3)

Q16 How much funding for professional development and service do you receive from your institution annually, on average?

o No funding (1)  

o $1 – $250 (2)

o $251 – $500 (3)

o $501 – $1,000 (4)

o $1,001 – $2,000 (5)

o $2,001 – $3,000 (6)

o More than $3,000 (7)

Skip To: Q18 If How much funding for professional development and service do you receive from your institution an… = No funding

Q17 How is funding made available to you by your institution? 

o I receive a set amount every year. (1)  

o I submit requests for funding for individual opportunities. (2)  

o It depends (use text area to explain further). (3)  


o Not sure (4)  

Q29 What costs does your institutional funding defray? 

(For each item: Always (1), Sometimes (2), Never (3))

o Conference registration (6)  

o Training/workshop/webinar (in-person or virtual) (7)  

o Professional organization membership dues (8)  

o Travel (1)  

o Accommodations (2)  

o Meals (4)  

o Other (5)  

Q18 Are there any barriers that prevent you from accessing institutional funding for professional  development or service activities? 

o Yes (1)  

o No (2)  

Display This Question: 

If Are there any barriers that prevent you from accessing institutional funding for professional dev… =  Yes 

Q19 Please describe any barriers that prevent you from accessing institutional funding for professional development or service activities. 


Q20 Self-funded Professional Development and Service  

This section of the survey asks about your experience with self-funding professional development and service opportunities over the past five years, or in your current position if you have held it for less than five years.

Q21 Have you ever had to self-fund a professional development activity or service? Please consider any time you have self-funded, even those beyond a 5-year time frame.

o No (1)  

o Yes, on occasion. (2)  

o Yes, my professional development and service is completely self-funded. (3)  

Skip To: Q24 If Have you ever had to self-fund a professional development activity or service? Please  consider a… = No 

Q22 Which activities have you self-funded? (Select all options below that apply.)

▢ Conference (presenting + attending) (1)  

▢ Conference (just attending) (2)  

▢ In-person training (3)  

▢ Distance training (e.g. webinar) (4)  

▢ Professional membership dues (5)  

▢ Other (6) ________________________________________________

Q23 How much have you spent of your own money on professional development and service activities in the past year? 

o $0 – $250 (2)  

o $250 – $500 (3)  

o $501 – $1,000 (4)  

o $1,001 – $2,000 (5)  

o $2,001 – $3,000 (6)  

o More than $3,000 (7)  

Q24 Are there professional development and service opportunities you’ve chosen not to pursue due to their cost? 

o Yes (1)  

o No (2)  

o Not sure (3)  

Display This Question: 

If Are there professional development and service opportunities you’ve chosen not to pursue due to  t… = Yes 

Q25 Please describe the types of professional development and service opportunities you’ve chosen not to pursue due to their cost. 


Appendix B


· Primary code: .702 (Substantial agreement)

· Secondary code: .619 (Substantial agreement)


· Primary code: .778 (Substantial agreement)

· Secondary code: .571 (Moderate agreement)


· Primary code: .931 (Almost perfect agreement)

· Secondary code: .631 (Substantial agreement)

NDSA Adds Three New Members / Digital Library Federation

As of 8 June 2021, the NDSA Leadership unanimously voted to welcome its three most recent applicants into the membership.

Each new member brings a host of skills and experience to our group. UCLA, the largest digital library in the University of California system, collects content in every format and preserves and publishes petabytes of data. The Crowley Company has provided commercial and archival digitization and micrographic services to those needing to digitally preserve content since 1980. Matthew Breitbart, NDSA’s first Affiliate Member, is an expert in FADGI standards and an M-19-21 subject-matter expert who has worked with digital information in museums and the private sector.

Each organization participates in one or more of our various interest and working groups – so keep an eye out for them on your calls and be sure to give them a shout-out. Please join me in welcoming our new members. The full list of NDSA members is available on our website.

~ Nathan Tallman, NDSA Vice Chair

The post NDSA Adds Three New Members appeared first on DLF.


The MPG/SFX server will undergo scheduled maintenance due to a hardware upgrade. The downtime will start at 5 pm. Services are expected to be back after approximately one hour.

We apologize for any inconvenience.

Unreliability At Scale / David Rosenthal

Thomas Claburn's FYI: Today's computer chips are so advanced, they are more 'mercurial' than precise – and here's the proof discusses two recent papers that are relevant to the extraordinary levels of reliability needed in long-term digital preservation at scale. Below the fold, some commentary on both papers.

I started writing about the way keeping large numbers of bits for long periods of time posed a fundamental engineering problem with 2007's A Petabyte For A Century. A little later I summarized my argument:
The basic point I was making was that even if we ignore all the evidence that we can't, and assume that we could actually build a system reliable enough to preserve a petabyte for a century, we could not prove that we had done so. No matter how easy or hard you think a problem is, if it is impossible to prove that you have solved it, scepticism about proposed solutions is inevitable.
The "koan" I used was to require that the storage system have a 50% probability that every bit survived the century unchanged. The test to confirm in one year that the requirement was met would be economically infeasible, costing around three orders of magnitude more than the system itself. These papers study a problem with similar characteristics, silent data corruption during computation rather than during storage.


The Facebook team introduce their paper thus:
Facebook infrastructure initiated investigations into silent data corruptions in 2018. In the past 3 years, we have completed analysis of multiple detection strategies and the performance cost associated. For brevity, this paper does not include details on the performance vs cost tradeoff evaluation. A follow up study would dive deep into the details. In this paper, we provide a case study with an application example of the corruption and are not using any fault injection mechanisms. This corruption represents one of the hundreds of CPUs we have identified with real silent data corruption through our detection techniques.
In other words, they tell the story of how a specific silent data corruption was detected and the root cause determined:
In one such computation, when the file size was being computed, a file with a valid file size was provided as input to the decompression algorithm, within the decompression pipeline. The algorithm invoked the power function provided by the Scala library ... Interestingly, the Scala function returned a 0 size value for a file which was known to have a non-zero decompressed file size. Since the result of the file size computation is now 0, the file was not written into the decompressed output database.

Imagine the same computation being performed millions of times per day. This meant for some random scenarios, when the file size was non-zero, the decompression activity was never performed. As a result, the database had missing files. The missing files subsequently propagate to the application. An application keeping a list of key value store mappings for compressed files immediately observes that files that were compressed are no longer recoverable. This chain of dependencies causes the application to fail. Eventually the querying infrastructure reports critical data loss after decompression. The problem’s complexity is magnified as this manifested occasionally when the user scheduled the same workload on a cluster of machines. This meant the patterns to reproduce and debug were non-deterministic.
Even at first sight this is a nightmare to debug. And so it turned out to be:
With concerted debugging efforts and triage by multiple engineering teams, logging was enabled across all the individual worker machines at every step. This helped narrow down the host responsible for this issue. The host had clean system event logs and clean kernel logs. From a system health monitoring perspective, the machine showed no symptoms of failure. The machine sporadically produced corrupt results which returned zero when the expected results were non-zero.
Once they had a single machine on which it was possible to reproduce the data corruption they could investigate in more detail:
From the single machine workload, we identified that the failures were truly sporadic in nature. The workload was identified to be multi-threaded, and upon single threading the workload, the failure was no longer sporadic but consistent for a certain subset of data values on one particular core of the machine. The sporadic nature associated with multi-threading was eliminated but the sporadic nature associated with the data values persisted. After a few iterations, it became obvious that the computation of
Int(1.1^53) = 0
as an input to the math.pow function in Scala would always produce a result of 0 on Core 59 of the CPU. However, if the computation was attempted with a different input value set
Int(1.1^52) = 142
the result was accurate.
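The healthy-core arithmetic is easy to check. The sketch below translates the computation into Python for illustration (the original was Scala's math.pow, and the faulty Core 59 behaviour of course cannot be reproduced in software):

```python
import math

# On a healthy core, truncating pow(1.1, n) to an integer gives:
print(int(math.pow(1.1, 52)))  # 142 -- the "accurate" case above
print(int(math.pow(1.1, 53)))  # 156 -- what Core 59 should have produced

# The defective Core 59 instead sporadically returned 0 for the second
# computation, so a non-zero decompressed file size was silently
# recorded as zero downstream.
```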
Next they needed to understand the specific sequence of instructions causing the corruption. This turned out to be as much of a nightmare as anything else in the story. The application, like most similar applications in hyperscale environments, ran in a virtual machine that used Just-In-Time compilation, rendering the exact instruction sequence inaccessible. They had to use multiple tools to figure out what the JIT compiler was doing to the source code before finally arriving at an assembly-language test:
The assembly code accurately reproducing the defect is reduced to a 60-line assembly level reproducer. We started with a 430K line reproducer and narrowed it down to 60 lines.
The Facebook team conclude:
Silent data corruptions are real phenomena in datacenter applications running at scale. We present an example here which illustrates one of the many scenarios that we encounter with these data dependent, reclusive and hard to debug errors. Understanding these corruptions helps us gain insights into the silicon device characteristics; through intricate instruction flows and their interactions with compilers and software architectures. Multiple strategies of detection and mitigation exist, with each contributing additional cost and complexity into a large-scale datacenter infrastructure. A better understanding of these corruptions has helped us evolve our software architecture to be more fault tolerant and resilient. Together these strategies allow us to mitigate the costs of data corruption at Facebook’s scale.


The Google team's paper takes a broader view of the problem that they refer to as corrupt execution errors (CEEs):
As fabrication pushes towards smaller feature sizes and more elaborate computational structures, and as increasingly specialized instruction-silicon pairings are introduced to improve performance, we have observed ephemeral computational errors that were not detected during manufacturing tests. These defects cannot always be mitigated by techniques such as microcode updates, and may be correlated to specific components within the processor, allowing small code changes to effect large shifts in reliability. Worse, these failures are often “silent” – the only symptom is an erroneous computation.

We refer to a core that develops such behavior as “mercurial.” Mercurial cores are extremely rare, but in a large fleet of servers we can observe the disruption they cause, often enough to see them as a distinct problem – one that will require collaboration between hardware designers, processor vendors, and systems software architects.
They recount how CEEs were initially detected:
Imagine you are running a massive-scale data-analysis pipeline in production, and one day it starts to give you wrong answers – somewhere in the pipeline, a class of computations are yielding corrupt results. Investigation fingers a surprising cause: an innocuous change to a low-level library. The change itself was correct, but it caused servers to make heavier use of otherwise rarely-used instructions. Moreover, only a small subset of the server machines are repeatedly responsible for the errors.

This happened to us at Google. Deeper investigation revealed that these instructions malfunctioned due to manufacturing defects, in a way that could only be detected by checking the results of these instructions against the expected results; these are “silent" corrupt execution errors, or CEEs. Wider investigation found multiple different kinds of CEEs; that the detected incidence is much higher than software engineers expect; that they are not just incremental increases in the background rate of hardware errors; that these can manifest long after initial installation; and that they typically afflict specific cores on multi-core CPUs, rather than the entire chip. We refer to these cores as “mercurial."

Because CEEs may be correlated with specific execution units within a core, they expose us to large risks appearing suddenly and unpredictably for several reasons, including seemingly-minor software changes. Hyperscalers have a responsibility to customers to protect them against such risks. For business reasons, we are unable to reveal exact CEE rates, but we observe on the order of a few mercurial cores per several thousand machines – similar to the rate reported by Facebook [8]. The problem is serious enough for us to have applied many engineer-decades to it.
The "few mercurial cores per several thousand machines" have caused a wide range of problems:
Some specific examples where we have seen CEE:
  • Violations of lock semantics leading to application data corruption and crashes.
  • Data corruptions exhibited by various load, store, vector, and coherence operations.
  • A deterministic AES mis-computation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
  • Corruption affecting garbage collection, in a storage system, causing live data to be lost.
  • Database index corruption leading to some queries, depending on which replica (core) serves them, being non-deterministically corrupted.
  • Repeated bit-flips in strings, at a particular bit position (which stuck out as unlikely to be coding bugs).
  • Corruption of kernel state resulting in process and kernel crashes and application malfunctions.
Be thankful it isn't your job to debug these problems! Each of them is likely as much of a nightmare as the Facebook example.


The similarity between CEEs and silent data corruption in at-scale long-term storage is revealed when the Google team ask "Why are we just learning now about mercurial cores?":
There are many plausible reasons: larger server fleets; increased attention to overall reliability; improvements in software development that reduce the rate of software bugs. But we believe there is a more fundamental cause: ever-smaller feature sizes that push closer to the limits of CMOS scaling, coupled with ever-increasing complexity in architectural design. Together, these create new challenges for the verification methods that chip makers use to detect diverse manufacturing defects – especially those defects that manifest in corner cases, or only after post-deployment aging.
In other words, it is technically and economically infeasible to implement tests for both storage systems and CPUs that assure errors will not occur in at-scale use. Even if hardware vendors can be persuaded to devote more resources to reliability, that will not remove the need for software to be appropriately skeptical of the data that cores are returning as the result of computation, just as the software needs to be skeptical of the data returned by storage devices and network interfaces.
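What might "appropriately skeptical" software look like? One family of mitigations is end-to-end checking: re-executing critical computations and checksumming data at rest, then treating any mismatch as possible corruption. A minimal sketch of the idea (illustrative only; as the Facebook paper notes, real datacenter detection strategies are far more elaborate and each carries its own cost):

```python
import hashlib

def checked(compute, *args):
    """Run a computation twice -- ideally scheduled on different cores --
    and compare the results; a mismatch flags possible silent corruption."""
    first = compute(*args)
    second = compute(*args)
    if first != second:
        raise RuntimeError("divergent results: possible silent data corruption")
    return first

def store_with_checksum(data: bytes):
    """Pair data with a digest so later reads can be verified end-to-end."""
    return data, hashlib.sha256(data).hexdigest()

def verified_read(data: bytes, digest: str) -> bytes:
    """Refuse to return data whose checksum no longer matches."""
    if hashlib.sha256(data).hexdigest() != digest:
        raise IOError("checksum mismatch: data corrupted at rest")
    return data
```

The cost of such redundancy is exactly the detection-versus-overhead trade-off both teams wrestle with: re-executing everything is safe but multiplies the compute bill.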

Both papers are must-read horror stories.

Open Data Day 2022 will take place on Saturday 5th March 2022 / Open Knowledge Foundation

Open Data Day 2022

We are pleased to announce that Open Data Day 2022 will take place on Saturday 5th March 2022.

Open Data Day is an annual, global celebration of open data. Each year, groups from around the world create local events to show the benefits of open data in their local community and encourage the adoption of open data policies in government, business and civil society. All outputs are open for everyone to use and re-use.

In March 2021, we registered 327 Open Data Day events from across the world. Discover more about these events on the Open Data Day map, or search for an event in your area.

Thanks to the generous support of our partners – Microsoft, UK Foreign, Commonwealth & Development Office, Mapbox, Latin American Open Data Initiative (ILDA), Open Contracting Partnership and GFDRR – we gave out more than 60 mini-grants to support the running of great community events on Open Data Day 2021. Check out our blog post Open Data Day 2021 – it’s a wrap to find out what events received funding, and what they achieved at their Open Data Day event.

Open Data Day 2021 Funding Partners

If you or your organisation would like to partner with Open Knowledge Foundation for Open Data Day 2022, please get in touch by email. We will announce more details about the 2022 mini-grant scheme in the coming months.

For Open Data Day 2022, you can connect with others and spread the word using the #OpenDataDay or #ODD2022 hashtags. Alternatively you can join the Open Data Day Google Group to ask for advice or share tips.

Find out more by visiting the Open Data Day website.

Introducing Archives Unleashed Cohorts / Archives Unleashed Project

Photo by 🇸🇮 Janko Ferlič on Unsplash

Earlier this year, the Archives Unleashed Project announced the Cohort Program, which aims to support web archives research by providing resource support and mentorship. Starting in July, cohorts will engage in collaborative activities and conduct focused research that explores a variety of web archive collections.

We are pleased to introduce the five teams that will make up our inaugural Cohort program.

AWAC2 — Analysing Web Archives of the COVID Crisis through the IIPC Novel Coronavirus dataset

Valérie Schafer, University of Luxembourg (LU)

Karin De Wild, Leiden University (NL)

Frédéric Clavert, University of Luxembourg (LU)

Niels Brügger, Aarhus University (DK)

Susan Aasman, University of Groningen (NL)

Sophie Gebeil, University of Aix-Marseille (FR)

Investigating transnational events through web archive collections, the AWAC2 team will focus on a distant reading of the IIPC COVID-19 web archival collection to understand actors, content types and interconnectivity throughout it.

Everything Old is New Again: A Comparative Analysis of Feminist Media Tactics between the 2nd- to 4th Waves

Shana MacDonald, University of Waterloo

Aynur Kadir, University of Waterloo

Brianna Wiens, York University

Sid Heeg, University of Waterloo

Project members will explore web archive collections to conduct a comparative analysis of the history of feminist media practices across interdisciplinary multi-media sources. The team expects to produce a timeline of issue responses from different historical moments and map different feminist media practices over this timeline to determine overlaps. The project’s key outcome will be to recover earlier feminist media practices and contextualize them in the digital present.

Mapping and tracking the development of online commenting systems on news websites between 1996–2021

Anne Helmond, University of Amsterdam/University of Siegen

Johannes Paßmann, University of Siegen

Robert Jansma, University of Siegen

Luca Hammer, University of Siegen

Lisa Gerzen, University of Siegen

This project aims to reconstruct a history of online commenting by examining the role of commenting technologies in the popularisation of commenting practices. It will do so by examining the distribution and evolution of commenting technologies on the top 25 Dutch, German, and world news websites from 1996–2021, to understand how they have shaped the practices of users. This will allow them to explore the interplay between technologies and practices of the past and to investigate histories of natively-born technologies and practices.

Crisis Communication in the Niagara Region during the COVID-19 Pandemic

Tim Ribaric, Brock University

David Sharron, Brock University

Cal Murgu, Brock University

Karen Louise Smith, Brock University

Duncan Koerber, Brock University

Using web archives collected by Brock University, this project will examine how organizations in the Niagara region have responded to government COVID-19 mandates. Analysis will focus on investigating three types of entities: local government, non-profit organizations, and major private entities. Findings from this research aim to inform future crisis communication organizational planning, specifically at the local and municipal level. The project will also create several open computational notebooks to support teaching, learning, and research.

Viral health misinformation from Geocities to COVID-19

Shawn Walker, Arizona State University

Michael Simeone, Arizona State University

Kristy Roschke, Arizona State University

Anna Muldoon, Arizona State University

This project will examine and compare two case studies of health misinformation: HIV mis/disinformation circulating on Geocities in the mid-1990s to early 2000s with the role of official COVID-19 Dashboards in COVID mis/disinformation. This work contributes to our understanding of current and historical health misinformation as well as the connections between them, and will also garner insights into how historical narratives of health misinformation have been recycled and repurposed.

Introducing Archives Unleashed Cohorts was originally published in Archives Unleashed on Medium, where people are continuing the conversation by highlighting and responding to this story.

Working across campus is like herding flaming cats / HangingTogether

Toffee needs to work on his social interoperability skills

Do you work at a university? If so, did you know that you work in a complex, adaptive system? And did you know that that makes it hard to build productive working relationships across campus?

This was the starting point of the first session of the joint OCLC-LIBER online workshop Building Strategic Relationships to Advance Open Scholarship at your Institution, based on the findings of the recent OCLC Research report Social Interoperability in Research Support: Cross-Campus Partnerships and the University Research Enterprise. This three-part workshop brought together a dynamic group of international participants to examine the challenges of working across the institution, identify strategies and tactics for cross-unit relationship building, and develop a plan for increasing their own social interoperability. In this post, we share some of the great insights and perspectives offered by our participants in the workshop’s first session: Understanding Social and Structural Norms that Shape Academic Institutional Collaboration.

We began with a brief tutorial on the concept of social interoperability – the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding – and why it is difficult to achieve in a university environment.

Working across campus is challenging for everyone

According to systems engineering expert and former university leader William Rouse, universities are examples of complex adaptive systems, characterized by, among other things, highly decentralized decision-making authority, independent agents that sometimes work at cross-purposes, lots of self-organization that occurs outside existing hierarchies, and limited ability to elicit desired behaviors through “top-down” direction or mandates.[1] This can make coordinated decision-making and collective effort a challenging proposition – like “herding flaming cats,” as one of our interviewees for the Social Interoperability report described their cross-unit collaboration experience. The missing ingredient? Social interoperability.

Figure 1: The key elements of a complex, adaptive system … or herding cats!

During small-group breakout discussions, our workshop participants endorsed the relevance of Rouse’s description of the university environment to their own experiences of working on campus. Responses to hearing that their work environment was a complex, adaptive system ranged from “Not surprised” to “Revelatory!” Some went even further than Rouse, describing the university environment as “anarchistic” and a venue where “everyone does what they want.” One participant observed that universities do not function from the top down: instead, things seem to just happen on their own. Overall, the breakout group discussions confirmed that the idea of universities as complex, adaptive systems resonated deeply as a description of the environment in which relationship-building across campus took place. And for some, there was a sense of relief to learn that achieving social interoperability under these conditions is hard for everyone!

Although Rouse’s model emerged from his experiences in US higher education, there was widespread agreement among our workshop participants that his ideas applied quite well to universities in other national contexts. Several participants from European higher education systems noted that while their university environments tended to be more hierarchical than those of their US counterparts, the general characteristics of a complex, adaptive system that Rouse described were still evident on their campuses.  

Participants in the breakout discussions shared a number of obstacles that, in their experience, can stand in the way of building productive relationships across campus. Many participants cited the complexity of the university as a fundamental challenge: there can be a profusion of stakeholder roles relevant to a given project, there is often no single point of contact even within units, and it may not be clear whom to approach. One participant suggested that too often “the arm doesn’t know what the hand is doing.” Difficulties in aligning priorities was mentioned frequently as forestalling productive partnerships; everyone has their interests (and bosses) to consider. Related to this was concern about “treading on others’ toes,” and the need to clarify roles and responsibilities – as one participant observed, someone may take the responsibility to move a collaborative effort forward, but then find themselves responsible for most of the work, too. And some participants expressed uncertainty over the authority they possessed to initiate partnerships with other departments or offices, or how best to reach out to colleagues elsewhere on campus (e.g., is a direct email sufficient for initial contact? If a decision is needed, is it better to work through regular channels?).

Lack of speed also made cross-unit partnerships difficult: participants noted that it was often difficult to get things moving, and once begun, to proceed expeditiously. As one person put it, there is a need to learn “what buttons to press” to get something done. Staff or leadership turnover in other units presents further complications, as does a lack of clear communication lines across units. And of course, many of these obstacles have been amplified by the COVID-19 pandemic: as one person described it, Microsoft Teams is not as effective at bringing together a group for the first time as getting everyone in the same room.

There are also library-specific challenges

Participants noted some obstacles distinct to the library that tended to stand in the way of building cross-unit relationships. One challenge mentioned frequently was the visibility and perception of the library on campus. For example, one participant observed that the biggest obstacle was lack of awareness about what libraries have to offer: all too often, librarians hear the refrain “I did not know the library could do that!” On some campuses, the library may not be seen as a leader or even a prospective partner for research support, and the full value of its services and expertise may be hidden. One participant suggested that librarians make themselves less visible because of the efficiency of access and delivery systems, while another believed that not everyone who uses systems in the library recognizes they are available only because of the library. One person remarked that people in other campus units may be unaccustomed to librarians with functional roles, rather than as subject specialists. Some suggested that it was important for libraries to become more vocal in overcoming “traditional perceptions”, and to remind campus partners that “we can be part of a project from the beginning, rather than just at the end.”

Several participants mentioned that, in their experience, the library was seen as a partner to come to with existing ideas or projects, but not as a co-creator. Similarly, one person remarked that the library needs to be seen not as a support service, but as a collaborator and equal partner. Another emphasized that the library might sometimes seem “invisible,” but it is an important part of the university and can serve as a “strong community hub.” Nevertheless, several participants observed that libraries tend to have no overarching strategy about forming relationships across campus – as one person put it, connections with stakeholders seem to form “by accident.” To remedy this, participants suggested finding ways to embed the library in the university’s formal inter-unit structures, like standing committees, and making sure to present the full range of library skills and capacities in those venues.

A framework for campus stakeholders in research support

Another topic discussed in the breakout was the types of stakeholders that participants commonly worked with on campus, with the Campus Stakeholder Model from the Social Interoperability report serving as a reference point. Participants noted relationships with a wide range of stakeholders that touched on every component of the model: for example, the research office, communications staff, the graduate school, campus computing, faculty affairs units, research centers, digital humanities institutes, post-doctoral students, and so on. One participant remarked that they aspired to work with all stakeholders in the model, but recognized that achieving this is a long journey and takes time. Another noted that they work with many stakeholders in different ways, but this results in a fragmented network of relationships.

Figure 2: Campus stakeholder categories … which ones do you work with?

The discussion highlighted difficulties in establishing productive relationships with some campus units. For example, research centers sometimes present challenges, because there are often many on campus, with each wanting to organize activities in their own way, such as managing their own data. Units on campus offering similar services to libraries can also be difficult to partner with: a data visualization librarian, for example, may find similar capacities offered in discipline-specific centers providing discipline-specific support. It can be difficult to learn who is doing what, and how it relates to what the library is offering – especially when the fabric of services across campus is changing all the time. And as one participant observed, another problem with maintaining productive relationships with units around campus is that often, stakeholders want “centralized services without centralized interference.”

Coming next: Strategies and tactics for improving social interoperability in research support

One theme seemed to pervade the discussion: building bridges to units across campus is all about personal relationships. We frequently heard comments like “it’s all about people” and “relationship building is one to one” and done with “people, not units”, like the “individual researcher, or the individual member of faculty”, or “someone who can represent the other unit.” Of course, the personal element of social interoperability can present challenges as well, ranging from uncertainty about who to approach, to the necessity of starting all over again if an established partner leaves their position. Meeting these challenges requires a toolbox of strategies and tactics for building social interoperability. Want to know more? Watch for another post soon summarizing the highlights from the second session of our workshop series, where we’ll show you that working across campus does not have to feel like herding flaming cats!

Thanks to my colleague Rebecca Bryant for providing helpful suggestions for improving this post!

[1] Rouse, William B. 2016. Universities as Complex Enterprises: How Academia Works, Why It Works These Ways, and Where the University Enterprise Is Headed. New York: Routledge.

The post Working across campus is like herding flaming cats appeared first on Hanging Together.

Saying goodbye to American Libraries magazine / Meredith Farkas

My first column in American Libraries

I’ve been pretty good about not making big life changes during the pandemic. We didn’t get a pandemic dog, even after finally getting our yard completely fenced-in last August. I’ve tamed many, many impulses I had during the pandemic because it seemed like the wrong time to make or unmake big commitments. I didn’t want to regret deciding something that was motivated by my desire not to feel so depressed anymore.*

But the one decision I did make during the pandemic was to stop writing my column for American Libraries. I’ve been writing it so long that I’d actually forgotten what year it started (2007 — wow!). This column in the June issue is my 100th and final article for American Libraries. It was interesting while writing my last column to go back and read issues of the magazine (as well as other writings about librarianship) from back in 2007. The technosaviorism was strong back then along with the idea that libraries were going to go extinct if we didn’t embrace Web/Library 2.0 (barf). I actually looked back on my blog posts from back then and I’m pleased to see that I was almost as skeptical of technosaviorism as I am now. My column was part of a new greater focus on technology in the magazine, and I’m grateful that I was mostly supported in moving the column towards aspects of slow librarianship instead in the final five or so years in which I wrote it. I always struggled with the brevity of the medium (usually my first drafts were at least twice as long as the final product), but I feel good about the 100 articles I put out there.

I don’t know if I’ll ever regret giving up the column, but my discomfort with the whole idea of being a columnist greatly outweighed that concern. I started to feel increasingly weird about the very idea of columnists. Why do I deserve this platform more than others? Why does anyone? I’ve always been uncomfortable with the idea of being an expert; I’m not dispensing wisdom from a mountaintop. I don’t have all the answers. I frequently get things wrong. Having a few voices put on a pedestal above all the others feels like a tool of white supremacy and of the individualistic achievement culture I’ve been railing against. I’ve also been thinking about the space I take up in the profession as a white woman. While I’ve had opportunities to turn down speaking gigs (and all-White panels) and suggest BIPOC women for them, the column loomed large in my mind as a big chunk of real estate I was taking up that could be better filled by BIPOC authors. Of course I have no say in the future of the magazine, but I did suggest to the editors that giving space to a variety of diverse authors would be a lot better than having monolithic columnists. And honestly, I think this is the direction they’ve been moving in over the past couple of years, which is awesome. American Libraries was extremely White when I started writing for it.

Just after I submitted my final column to my editor, Leonard Kniffel passed away. Leonard was the first editor-in-chief I worked with at American Libraries and he was the one who decided to give me a chance as a columnist (which seems even more bonkers to me now than it did then). Like most brand-new things, I had no idea how extraordinary he was as an editor-in-chief until I had others to compare him to. Instead of only getting in touch when unhappy with something I’d written or done (which characterized my subsequent experiences), Leonard was in frequent contact. He’d email to tell me how much he liked a particular column, ask my opinion on things, invite me to events, and he always kept the columnists apprised of upcoming themes or new directions for the magazine. He was so inclusive and made me feel like part of a cherished family of writers and editorial staff rather than simply part of a transaction. He was a really, really decent and kind person and the world is less bright and beautiful without him. I’m also incredibly grateful to Leonard’s friend and my former fellow-columnist, Andrew Pace, who championed me for the columnist gig. I really credit Andrew and Roy Tennant for helping me develop the confidence to really put myself out there with my writing and my speeches. I am so fortunate for the faith they had in me, especially given that the literature shows that men rarely mentor or champion women.

I’m so grateful for the opportunity I had to speak to such a large audience over the years. In 9th grade, a columnist from the Palm Beach Post, Frank Cerabino (who is still writing for them almost 30 years later!), came to speak to my class. He inspired me and I decided then and there that I wanted to be a newspaper columnist. While I didn’t get to write for a newspaper, I feel like I achieved that dream. I really tried to bring my voice and my values into what I wrote and I hope it was useful to others. And I’m grateful to the American Libraries staff I’ve had a chance to work closely with and to everyone there who have made the magazine better reflect the diversity in our profession. And thank you to the people who read my column over the years and got in touch with me, challenged me, and made me better. I loved this job and I wish everyone who has something to say gets the opportunities and encouragement they need to share their unique perspectives widely.

*I feel weird even admitting that I feel depressed because I know I’ve been so lucky compared to what so many people have been through since the start of March 2020. Between work stress, surprise expensive and scary house problems, relationship issues, local weather/environmental disasters, really bad health news, family drama, and cycles of COVID-related anxiety/panic, it’s just been a series of stressors one after the other after the other that haven’t allowed me to feel like I could take a breath. I think individually, I could have handled these issues, but it felt like just as one thing resolved, another took its place. There was no time to just breathe. I feel like I’ve been either hypervigilant and panicked or paralyzed with exhaustion and apathy. I’m not the most resilient person or the most ok with feeling out of control, so maybe others could have handled all of it better. But instead of my usual coping mechanism (bury myself in work with hyperproductivity and a heavy dash of denial), I’ve tried to honor my feelings and my lack of capacity and give myself the rest and gentle treatment I feel I need and I’m proud of that. My apologies to folks who haven’t heard from me in a while or who have not received my best work.

“Building Community at a Distance” makes the DOLS Short List / Archives Unleashed Project

Photo by Susan Q Yin on Unsplash

As North America heads into the 15th month of the COVID pandemic, we are all too aware of the depth and breadth of impacts the pandemic has had on every sector.

In March 2020, our project planned to host our final datathon event at Columbia University, New York, NY. Having monitored ever-growing outbreaks and witnessing the domino effect of cancelled scholarly events, the team proactively shifted our event online.

In reflecting on this experience, our open-access article, Building community at a distance: A datathon during COVID-19, discusses the implications of transitioning events online, provides practical recommendations for event organizers, and offers a reflection on the role of digital libraries during the COVID-19 pandemic.

We are thrilled to announce our article was selected as one of the Top 5 Articles on COVID-19 Pandemic by the Distance and Online Learning Section (DOLS) of the Association of College and Research Libraries. Erica Getts and Ruth Monnie compiled and annotated scholarship from the past year that examines the impact and lessons learned by libraries during the pandemic.

Our sincerest thanks and gratitude to the DOLS Research and Publication Committee!

We are continually inspired by the scholarship and efforts that have come out of the past year to support and engage members and users of the GLAM (Galleries, Libraries, Archives, and Museums) communities.

The DOLS committee will be leading a Twitter chat this fall to engage with the Top 5 Articles, so be sure to keep an eye out! You can keep on top of all the latest news and publications by visiting the DOLS Section site.

“Building Community at a Distance” makes the DOLS Short List was originally published in Archives Unleashed on Medium, where people are continuing the conversation by highlighting and responding to this story.

Unstoppable Code? / David Rosenthal

This is the website of DeFi100, a "decentralized finance" system running on the Binance Smart Chain, after the promoters pulled a $32M exit scam. Their parting message summed up the ethos of cryptocurrencies.
Governments around the world have started to wake up to the fact that this message isn't just for the "muppets"; it is also the message of cryptocurrencies for governments and civil society. Below the fold I look into how governments might respond.

The externalities of cryptocurrencies are many; ransomware is currently the most visible.
It seems that recent ransomware attacks, including the May 7th one on Colonial Pipeline and the less publicized May 1st one on Scripps Health La Jolla, have reached a tipping point. Kat Jerick's Scripps Health slowly coming back online, 3 weeks after attack reports:
"It’s likely that it’s taking a long time because of negotiations going on with the perpetrators, and the prevailing narrative is that they have the contents of the electronic health records system that are being used for 'double extortion,'" said Michael Hamilton, former chief information security officer for the city of Seattle and CISO of healthcare cybersecurity firm CI Security, in an email to Healthcare IT News.

If that's true, Scripps certainly wouldn't be alone: The healthcare industry saw a number of high-profile ransomware incidents in the last year, including a cyberattack on Universal Health Services that led to a lengthy network shutdown and a $67 million loss.

More recently, customers of the electronic health record vendor Aprima also reported weeks of security-related outages.
In response governments are trying to regulate (US) or ban (China, India) cryptocurrencies. The libertarians who designed the technology believed they had made governments irrelevant. For example, the Decentralized Autonomous Organization (DAO)'s home page said:
The DAO's Mission: To blaze a new path in business organization for the betterment of its members, existing simultaneously nowhere and everywhere and operating solely with the steadfast iron will of unstoppable code.
This was before a combination of vulnerabilities in the underlying code was used to steal its entire contents, about 10% of all the Ether in circulation.

If cryptocurrencies are based on the "iron will of unstoppable code" how would regulation or bans work?

Nicholas Weaver explains how his group stopped the plague of Viagra spam in The Ransomware Problem Is a Bitcoin Problem:
Although they drop-shipped products from international locations, they still needed to process credit card payments, and at the time almost all the gangs used just three banks. This revelation, which was highlighted in a New York Times story, resulted in the closure of the gangs’ bank accounts within days of the story. This was the beginning of the end for the spam Viagra industry.
Subsequently, any spammer who dared use the “Viagra” trademark would quickly find their ability to accept credit cards irrevocably compromised as someone would perform a test purchase to find the receiving bank and then Pfizer would send the receiving bank a nastygram.
Weaver draws the analogy with cryptocurrencies and "big-game" ransomware:
These operations target companies instead of individuals, in an attempt to extort millions rather than hundreds of dollars at a time. The revenues are large enough that some gangs can even specialize and develop zero-day vulnerabilities for specialized software. Even the cryptocurrency community has noted that ransomware is a Bitcoin problem. Multimillion-dollar ransoms, paid in Bitcoin, now seem to be commonplace.

This strongly suggests that the best way to deal with this new era of big-game ransomware will involve not just securing computer systems (after all, you can’t patch against a zero-day vulnerability) or prosecuting (since Russia clearly doesn’t care to either extradite or prosecute these criminals). It will also require disrupting the one payment channel capable of moving millions at a time outside of money laundering laws: Bitcoin and other cryptocurrencies.
There are only three existing mechanisms capable of transferring a $5 million ransom — a bank-to-bank transfer, cash or cryptocurrencies. No other mechanisms currently exist that can meet the requirements of transferring millions of dollars at a time.

The ransomware gangs can’t use normal banking. Even the most blatantly corrupt bank would consider processing ransomware payments as an existential risk. My group and I noticed this with the Viagra spammers: The spammers’ banks had a choice to either unbank the bad guys or be cut off from the financial system. The same would apply if ransomware tried to use wire transfers.

Cash is similarly a nonstarter. A $5 million ransom is 110 pounds (50 kilograms) in $100 bills, or two full-weight suitcases. Arranging such a transfer, to an extortionist operating outside the U.S., is clearly infeasible just from a physical standpoint. The ransomware purveyors need transfers that don’t require physical presence and a hundred pounds of stuff.

This means that cryptocurrencies are the only tool left for ransomware purveyors. So, if governments take meaningful action against Bitcoin and other cryptocurrencies, they should be able to disrupt this new ransomware plague and then eradicate it, as was seen with the spam Viagra industry.

For in the end, we don’t have a ransomware problem, we have a Bitcoin problem.
I agree with Weaver that disrupting the ransomware payment channel is an essential part of a solution to the ransomware problem. It would require denying cryptocurrency exchanges access to the banking system, and global agreement to do this would be hard. Given the involvement of major financial institutions and politicians, it would be hard even in the US. So what else could be done?

Nearly a year ago Joe Kelly wrote a two-part post explaining how governments could take action against Bitcoin (and by extension any Proof-of-Work blockchain). In the first part, How To Kill Bitcoin (Part 1): Is Bitcoin ‘Unstoppable Code’? he summarized the crypto-bro's argument:
They say Bitcoin can’t be stopped. Just like there’s no way you can stop two people sending encrypted messages to each other, so — they say — there’s no way you can stop the Bitcoin network.

There’s no CEO to put on trial, no central server to seize, and no organisation to put pressure on. The Bitcoin network is, fundamentally, just people sending messages to each other, peer to peer, and if you knock out 1 node on the network, or even 1,000 nodes, the honey badger don’t give a shit: the other 10,000+ nodes keep going like nothing happened, and more nodes can come online at any time, anywhere in the world.

So there you have it: it’s thousands of people running nodes — running code — and it’s unstoppable… therefore Bitcoin is unstoppable code; Q.E.D.; case closed; no further questions Your Honour. This money is above the law, and governments cannot possibly hope to control it, right?
The problem with this, as with most of the crypto-bros arguments, is that it applies to the Platonic ideal of the decentralized blockchain. In the real world economies of scale mean things aren't quite like the ideal, as Kelly explains:
It’s not just a network, it’s money. The whole system is held together by a core structure of economic incentives which critically depends on Bitcoin’s value and its ability to function for people as money. You can attack this.

It’s not just code, it’s physical. Proof-of-work mining is a real-world process and, thanks to free-market forces and economies of scale, it results in large, easy-to-find operations with significant energy footprints and no defence. You can attack these.

If you can exploit the practical reality of the system and find a way to reduce it to a state of total economic dysfunction, then it doesn’t matter how resilient the underlying peer-to-peer network is, the end result is the same — you have killed Bitcoin.
Kelly explains why the idea of regulating cryptocurrencies is doomed to failure:
The entire point of Bitcoin is to neutralise government controls on money, which includes AML and taxes. Notice that there’s no great technological difficulty in allowing for the completely unrestricted anonymous sending of a fixed-supply money — the barrier is legal and societal, because of the practical consequences of doing that.

So the cooperation of the crypto world with the law is a temporary arrangement, and it’s not an honest handshake. The right hand (truthfully) expresses “We will do everything we can to comply” while the left hand is hard at work on the technology which makes that compliance impossible. Sure, Bitcoin is pretty traceable now, and sometimes it even helps with finding criminals who don’t have the technical savvy to cover their tracks, but you’ll be fighting a losing battle over time as stronger, more convenient privacy tooling gets added to the Bitcoin protocol and the wider ecosystem around it.
So yeah: half measures like AML and censorship aren’t going to cut it. If you want to kill Bitcoin, that means taking it out wholesale; it means forcing the system into disequilibrium and inducing economic collapse.
In How To Kill Bitcoin (Part 2): No Can Spend Kelly explains how a group of major governments could seize control of the majority of the mining power and mount a specific kind of 51% attack. The basic idea is that governments ban businesses, including exchanges, from transacting in Bitcoin, and seize 80% of the hash rate to mine empty blocks:
As it stands, after seizing 80% of the active hash rate, you can generate proof-of-work hashes at 4x the speed of the remaining miners around the world.
  • You control ~80 exahashes/sec, they control ~20 exahashes/sec
  • For every valid block that rebel miners, collectively, can produce on the Bitcoin blockchain, you can produce 4
You use your limitless advantage to execute the following strategy:
  1. Mine an empty block — i.e. a block which is perfectly valid but contains no transactions
  2. Keep 5–10 unannounced blocks to yourself — i.e. mine 5–10 ‘extra’ empty blocks ahead of where the chain tip is now, but don’t actually share any of these blocks with the network
  3. Whenever a rebel miner announces a valid block, orphan it (override it) by announcing a longer chain with more cumulative proof-of-work — i.e. announce 2 of your blocks
  4. Repeat (go back to 2)
The result of this is that Bitcoin transactions are no longer being processed, and you’ve created a black hole of expenditure for rebel miners.
  • Every time a rebel miner spends $ to mine a block, it’s money down the drain: they don’t earn any block rewards for it
  • All transactions just sit in the mempool, being (unstoppably) messaged back and forth between nodes, waiting to be included in a block, but they never make it in
In other words, no-one can spend their bitcoin, no matter who they are or where they are in the world.
Empty blocks wouldn't be hard to detect and ignore, but it would be easy for the government miners to fill their blocks with valid transactions between addresses that they control.
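Kelly's empty-block strategy can be sketched as a toy Monte Carlo simulation. This is my own illustration, not from his post; the simplification that a one-block private lead is simply abandoned when a rebel block appears is an assumption of the model:

```python
import random

def simulate_attack(p_gov, n_blocks=100_000, seed=42):
    """Toy model of the empty-block attack described above.

    Each round one block is found somewhere on the network; the government
    miners find it with probability p_gov. The government hoards a private
    lead of empty blocks and, whenever a rebel block appears, releases two
    of them to orphan it. If the lead is exhausted the rebel block sticks
    and the government restarts from the new tip. Returns the fraction of
    rounds in which a rebel block survives into the canonical chain.
    """
    random.seed(seed)
    lead = 0
    rebels_survived = 0
    for _ in range(n_blocks):
        if random.random() < p_gov:
            lead += 1             # extend the private chain of empty blocks
        elif lead >= 2:
            lead -= 2             # announce 2 blocks, orphaning the rebel block
        else:
            rebels_survived += 1  # lead too short: the rebel block sticks
            lead = 0              # restart from the new public tip
    return rebels_survived / n_blocks

# With Kelly's 80% seizure the rebels essentially never get a block through;
# with only 2/3 the attack leaks occasionally but still starves the chain.
print(simulate_attack(0.8))
print(simulate_attack(2 / 3))
```

The interesting design point is the drift of the private lead: at 80% it grows by +0.4 blocks per round on average, so the government almost never loses its buffer; at exactly 2/3 the drift is zero and rebel blocks occasionally slip through, which is why the confirmation-waiting argument below matters.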

Things have changed since Kelly wrote, in ways that complicate his solution. When he wrote, it was estimated that 75% of the Bitcoin mining power was located in China; China could have implemented Kelly's attack unilaterally. But since then China has been gradually increasing the pressure on cryptocurrency miners. This has motivated new mining capacity to set up elsewhere. The graph shows a recent estimate, with only 65% in China. As David Gerard reports:
Bitcoin mining in China is confirmed to be shutting down — miners are trying to move containers full of mining rigs out of the country as quickly as possible. It’s still not clear where they can quickly put a medium-sized country’s worth of electricity usage. [Reuters]
Here’s a Twitter thread about the miners getting out of China before the hammer falls. You’ll be pleased to hear that this is actually good news for Bitcoin. [Twitter]
If the recent estimate is correct, Kelly's assumption that a group of governments could seize 80% of the mining power looks implausible. The best that could be done would be an unlikely agreement between the US and China for 72%. So for now let's assume that the Chinese government is doing this alone with only 2/3 of the mining power. Because mining is a random process, they can only mine, on average, 2 blocks for every one from the "rebel" miners.

Because Bitcoin users would know the blockchain was under attack, they would need to wait several blocks (the advice is 6) before regarding a transaction as final. The rebels would have to win six times in a row, with probability 0.14%, for a transaction to go through. The Bitcoin network can normally sustain a transaction rate of around 15K/hr. Waiting 6 block times would reduce this to about 20/hr. Even if the requirement was only to wait 3 block times, the rate would be degraded to about 550/hr, so the attack would greatly reduce the supply of transactions and greatly increase their price.
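The throughput arithmetic can be checked directly (the 15K transactions/hour figure is the post's own estimate):

```python
# Rebel miners control 1/3 of the hash rate, so the chance of them mining
# k blocks in a row is (1/3)^k, and only that fraction of the normal
# transaction throughput survives the attack.
rebel_share = 1 / 3
normal_rate = 15_000  # transactions/hour the network normally sustains

for confirmations in (3, 6):
    p = rebel_share ** confirmations
    print(f"{confirmations} blocks in a row: p = {p:.3%}, "
          f"throughput ~ {normal_rate * p:.0f} tx/hr")
```

This reproduces the numbers in the text: roughly 0.14% and about 20 tx/hr for six confirmations, and about 550 tx/hr for three.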

The recent cryptocurrency price crash caused average transaction fees to spike to about $60. In the event of an attack like this HODL-ers would be desperate to sell their Bitcoin, so bidding for the very limited supply of transactions would be intense. Anyone on the buy-side of these transactions would be making a huge bet that the attack would fail, so the "price" of Bitcoin in fiat currencies would collapse. Thus the economics for the rebel miners would look bleak. Their chances of winning a block reward would be greatly reduced, and the value of any reward they did win would be greatly reduced. There would be little incentive for the rebels to continue spending power to mine doomed blocks, so the cost of the attack for the government would drop rapidly once in place.

The cost of the attack is roughly 2/3 of 6.25 BTC/block times 144 blocks/day. Even making the implausible assumption that the price didn't collapse from its current $35K/BTC, the cost is $21M/day or $7.6B/yr, a drop in the bucket for a major government. Thus it appears that, until the concentration of mining power in China decreases further, the Chinese government could kill Bitcoin using Kelly's attack. For an analysis of an alternate attack on the Bitcoin blockchain, see The Economic Limits of Bitcoin and the Blockchain by Eric Budish.
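The cost estimate, spelled out (all figures are the post's own assumptions):

```python
# The government mines ~2/3 of the 144 blocks/day, at a 6.25 BTC subsidy
# per block, valued (implausibly generously) at $35K/BTC.
gov_share = 2 / 3
subsidy_btc = 6.25
blocks_per_day = 144
btc_price_usd = 35_000

btc_per_day = gov_share * subsidy_btc * blocks_per_day  # 600 BTC/day
cost_per_day = btc_per_day * btc_price_usd
print(f"${cost_per_day / 1e6:.0f}M/day, ${cost_per_day * 365 / 1e9:.2f}B/yr")
```

That is $21M/day, or about $7.66B over a year.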

Kelly addresses the government attacker:
Normally what keeps the core structure of incentives in balance in the Bitcoin system, and the reason why miners famously can’t dictate changes to the protocol, or collude to double-spend their coins at will, is the fact that for-profit miners have a stake in Bitcoin’s future, so they have a very strong disincentive towards using their power to attack the network.

In other words, for-profit miners are heavily invested in and very much care about the future value of bitcoin, because their revenue and the value of their mining equipment critically depends on it. If they attack the network and undermine the integrity of Bitcoin and its fundamental value proposition to end users, they’re shooting themselves in the foot.

You don’t have this problem.

In fact this critical variable is flipped on its head: you have a stake in the destruction of Bitcoin’s future. You are trying to get the price of BTC to $0, and the value of all future block rewards along with it. Attacking the network to undermine the integrity of Bitcoin and its value proposition to end users is precisely your goal.

This fundamentally breaks the game theory and the balance of power in the system, and the result is disequilibrium.

In short, Bitcoin is based on a Mexican Standoff security model which only works as a piece of economic design if you start from the assumption that every actor is rational and has a stake in the system continuing to function.

That is not a safe assumption.
There are two further problems. First, Bitcoin is only one, albeit the most important, of almost 10,000 cryptocurrencies. Second, some of these other cryptocurrencies don't use Proof-of-Work. Ethereum, the second most important cryptocurrency, after nearly seven years' work, is promising shortly to transition from Proof-of-Work to Proof-of-Stake. The resources needed to perform a 51% attack on a Proof-of-Stake blockchain are not physical, and thus are not subject to seizure in the way Kelly assumes. I have written before on possible attacks in Economic Limits Of Proof-of-Stake Blockchains, but only in the context of double-spending attacks. I plan to do a follow-up post discussing sabotage attacks on Proof-of-Stake blockchains once I've caught up with the literature.

Infrastructure for heritage institutions – Open and Linked Data / Lukas Koster


In my June 2020 post in this series, “Infrastructure for heritage institutions – change of course”, I said:

“The results of both Data Licences and the Data Quality projects (Object PID’s, Controlled Vocabularies, Metadata Set) will go into the new Data Publication project, which will be undertaken in the second half of 2020. This project is aimed at publishing our collection data as open and linked data in various formats via various channels. A more detailed post will be published separately.”

In November 2020 we implemented ARK Persistent Identifiers for the central catalogue of the Library of the University of Amsterdam (see Infrastructure for heritage institutions – ARK PID’s). And now, in May 2021, we present our open and linked data portals:

Here I will provide some background information on the choices made in the project, the workflow, the features, and the options made possible by publishing the data.

General approach

The general approach for publishing collection data is: determine data sources, define datasets, select and apply data licences, determine publication channels, define the applicable data, record and syntax formats, apply transformations to obtain the desired formats, and publish. A mix of expertise types is required: content, technical and communication. In the current project this general approach is implemented in a very pragmatic manner. This means that we haven’t always taken the ideal paths forward, but mainly the best possible options at the time. There will be shortcomings, but we are aware of them and they will be addressed in due course. It is also a learning project.

Data sources

The Library maintains a number of systems/databases containing collection data, although the intention is to eventually minimize the number of systems and data sources in the context of the Digital Infrastructure programme. The bulk of the collection data is managed in the central Ex Libris Alma catalogue. Besides that, there is an ArchivesSpace installation, as well as several Adlib systems and a KOHA installation originating from the adoption of other organisations' collections. Most of these databases will probably be incorporated into the central catalogue in the near future.

In this initial project we have focused on the Alma central catalogue of the University only (and not yet of our partner the Amsterdam University of Applied Sciences).

Data licences

According to the official policy, the Library strives to make its collections as open as possible, including the data and metadata required to access these. For this reason the standard default licence for collection data is the Open Data Commons Public Domain Dedication and License (PDDL), which applies to databases as a whole.

However, there is one important exception. A large part of the metadata records in the central catalogue originates from OCLC WorldCat. This situation is inherited from the good old Dutch national PICA shared catalogue. Of course there is nothing wrong with shared cataloguing, but unfortunately OCLC requires attribution to the WorldCat community, using an Open Data Commons Attribution License (ODC-BY), according to the WorldCat Community Norms. In practice you have to be able to distinguish between metadata originating from WorldCat and metadata not originating from WorldCat. On the bright side: OCLC considers referencing a WorldCat URI in the data sufficient attribution in itself.

In the open data from the central catalogue canonical WorldCat URI’s are present when applicable, so the required licence is implied. But on the dark side: especially in the case of linked data (to which the OCLC ODC-BY licence explicitly applies), “the attribution requirement makes reuse and data integration more difficult in linked open data scenarios like WikiData”, as Olaf Janssen of the National Library of the Netherlands noted on Twitter, citing the DBLP Data Licence Change discussion. An attribution licence might make sense if the database is reused as a whole, or if, in the case of the implicit URI reference, full database records are reused. But especially in linked data contexts it is not uncommon to reuse and combine individual data elements or properties, leaving out the URI references. This makes an ODC-BY licence practically unworkable. It is time that OCLC reconsider their licence policy and adapt it to the modern world.


Datasets

The central catalogue contains descriptions of over four million items, of which more than three million are books. The rest consists of maps, images, audio, video, sheet music, archaeological objects, museum objects etc. For various practical reasons it is not feasible to make the full database available as one large dataset. That is why it was decided to split the database into smaller segments and publish datasets for physical objects by material type. A separate dataset was defined for digitised objects (images) published in the Image Repository. Because of the large number of books and manuscripts, only datasets of incunabula and letters are published for these two material types. Other book datasets are available on demand.

In Alma these datasets are defined as “Logical Sets”, which are basically saved queries with dynamic result records. These Logical Sets serve as input for Alma Publishing Profiles, used for creating export files and harvesting endpoints (see below).


Data format: the published datasets only contain public metadata from Alma; internal business and confidential data are filtered out before publishing. Creator/contributor and subject fields are enriched with URIs, based on available identifiers from external authority files (the Library of Congress Name Authority File and OCLC FAST for more recent records, the Dutch National Authors and Subjects Thesaurus for older records). Through these URIs, relations to other global authority files such as VIAF, Wikidata and AAT can be established. This is especially important for linked data (see below).

If these fields only contain textual descriptions without identifiers, enrichment is not applied. This lack of identifiers is input for the data quality improvement activities currently taking place. Available OCLC numbers are converted to canonical WorldCat URIs, as mentioned in the Licences section. These data format transformations are performed using Alma Normalization Rules Sets, from within the Publishing Profiles.

Record and syntax formats: currently the datasets are made available in MARC-XML and Dublin Core Unqualified, two of the standard Alma export formats. For linked data formats, see below.

Publication channels

Downloadable files 

For each Alma Logical Set, two export files are generated once a month and written to a separate server. Two separate Alma Publishing Profiles are needed, one for each output format (MARC-XML and Dublin Core). The file names are generated using the syntax [institution-data source-dataset-format], for instance “uva_alma_maps_marc” and “uva_alma_maps_dc”. Alma automatically adds “_new” and zips the files, so the results are, for instance, “uva_alma_maps_marc_new.tar.gz” and “uva_alma_maps_dc_new.tar.gz”. A shell script moves these export files to a publicly accessible directory on the same server, replacing the existing files there. Links to all (currently twenty) files are published on a static webpage on the Library Open Data website.
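The naming convention and the monthly move to the public directory can be sketched roughly as follows. This is a hypothetical illustration, not the Library's actual script (which is a shell script): the directory paths, the dataset list, and the helper names are assumptions.

```python
# Sketch of the monthly publication step: Alma writes e.g.
# "uva_alma_maps_marc_new.tar.gz" to an export directory, and a script
# moves it into the public directory, replacing the previous file.
import shutil
from pathlib import Path

DATASETS = ["maps", "images", "incunabula"]  # illustrative subset of the twenty
FORMATS = ["marc", "dc"]

def export_filename(institution: str, source: str, dataset: str, fmt: str) -> str:
    """Build the [institution-data source-dataset-format] name, with the
    "_new" suffix and .tar.gz extension that Alma adds automatically."""
    return f"{institution}_{source}_{dataset}_{fmt}_new.tar.gz"

def publish(export_dir: Path, public_dir: Path) -> list[str]:
    """Move freshly exported files into the public directory."""
    moved = []
    for dataset in DATASETS:
        for fmt in FORMATS:
            name = export_filename("uva", "alma", dataset, fmt)
            src = export_dir / name
            if src.exists():
                # Overwrites any existing file of the same name.
                shutil.move(str(src), str(public_dir / name))
                moved.append(name)
    return moved
```

The static webpage then simply links to the fixed file names, so consumers always fetch the most recent export.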

OAI-PMH Harvesting 

OAI-PMH harvesting endpoints are created using the same Alma Publishing Profiles, one for each output format. The set_spec and set_name are [dataset-format] and [dataset] respectively. The set_spec is used in the Alma system OAI-PMH call, for instance:
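With a hypothetical base URL (the real Alma OAI-PMH endpoint differs per institution, and the default metadataPrefix here is an assumption), such a call can be assembled like this:

```python
from urllib.parse import urlencode

# Hypothetical Alma OAI-PMH base URL -- placeholder, not the Library's real endpoint.
BASE = "https://example.alma.exlibrisgroup.com/view/oai/EXAMPLE_INST/request"

def oai_list_records(set_spec: str, metadata_prefix: str = "marc21") -> str:
    """Build an OAI-PMH ListRecords request URL for one published set.

    The set_spec follows the [dataset-format] convention described above,
    e.g. "maps_marc" for the maps dataset in MARC-XML.
    """
    params = {
        "verb": "ListRecords",
        "set": set_spec,
        "metadataPrefix": metadata_prefix,
    }
    return f"{BASE}?{urlencode(params)}"
```

A harvester would issue this request, then follow the resumptionToken in each response until the set is exhausted.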

The harvesting links for all datasets/formats are also published on the same static webpage.


For the Alma data source, Ex Libris provides a number of APIs, both for the Alma backend and for the Alma Discovery/Primo frontend. However, these have some serious limitations. The Alma APIs can only be used against the raw data of the full Alma database: Logical Sets cannot be used, nor data transformations using Normalization Rules. This means that data can't be enriched with PIDs and URIs, non-public data can't be hidden, and individual datasets can't be addressed. For our business case the Alma APIs are therefore not useful. Alternatively, the Primo APIs could be used, where the display data is enriched with PIDs and URIs. However, it is again not possible to publish only specific sets or to filter out private data, and the internal local field labels (“lds01”, “lds02”, etc.) can't be replaced by more meaningful labels. Moreover, all these APIs require API keys and impose call limits.

For our business case an alternative API approach is required: either developing and maintaining our own APIs, or using a separate data and/or API platform.

Linked Data 

Just like the APIs, the linked data features Ex Libris provides for Alma and Primo are not (yet) useful for implementing real linked data. Linked data is essentially a combination of specific formats (RDF) and publication channels (SPARQL, content negotiation). Alma provides specific RDF formats (BIBFRAME, JSON-LD, RDA-RDF) with URI enrichment, but it is not possible to publish the RDF with your own PID-URIs (in our case ARKs and Handles); instead, internal system-dependent URIs are used. The Alma RDF formats can be used in the Alma Publishing Profiles to generate downloadable files, and in the Alma APIs, which as we have seen have serious limitations. Moreover, Ex Libris currently supports neither SPARQL endpoints nor content negotiation, although these features appear to be on the roadmap. It is a pity that I have not been able to use the Ex Libris Alma and Primo linked data features that ultimately resulted from the first linked data session I helped organise at the 2011 IGeLU annual conference, and from the subsequent establishment of the IGeLU/ELUNA Linked Open Data Working Group ten years ago.

Anyway, we ended up implementing a separate linked data platform that serves as an API platform at the same time: Triply. In order to publish the collection data on this new platform, another separate tool is required for transforming the collection’s MARC data to RDF. For this we currently use Catmandu. We have had previous experience with both tools during the AdamLink project some years ago.

RDF transformation with Catmandu

Catmandu is a multipurpose data transformation toolset, maintained by an international open source community. It provides import and export modules for a large number of formats, not only RDF. Between import and export, the data can be transformed using all kinds of “fix” commands. In our case we depend heavily on the Catmandu MARC modules library, with Patrick Hochstenbach's example fix file MARC2RDF as a starting point.

The full ETL process makes use of the MARC-XML dataset files exported by the Alma Publishing profiles. These MARC-XML files are transformed to RDF using Catmandu, and the resulting RDF files are then imported into the Triply data platform using the Triply API.
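One way to orchestrate this pipeline is sketched below. The exact Catmandu command-line shape, the fix file name, the dataset id, and the Triply API endpoint are illustrative assumptions, not the Library's actual configuration.

```python
# Sketch of the MARC-XML -> RDF -> Triply ETL described above.

def catmandu_command(marc_xml: str, fix_file: str, rdf_out: str) -> str:
    """Assemble a Catmandu convert command line: MARC-XML in, RDF out.

    The option layout is an approximation of Catmandu's CLI; the fix file
    holds the MARC-to-RDF mapping rules.
    """
    return (
        f"catmandu convert MARC --type XML --fix {fix_file} "
        f"to RDF --file {rdf_out} < {marc_xml}"
    )

def triply_upload_request(dataset: str, rdf_file: str) -> dict:
    """Describe the upload call to a (hypothetical) Triply API endpoint."""
    return {
        "method": "POST",
        "url": f"https://api.example-triply.example/datasets/{dataset}/jobs",
        "files": {"file": rdf_file},
    }
```

In production the first step would run via the shell, and the second via an HTTP client with the platform's API token.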

RDF model

The pragmatic approach resulted in the adoption of a simplified version of the Europeana Data Model (EDM) as the local RDF model for the library collection metadata. EDM largely fits the MARC21 record format used in the catalogue for all material types. EDM is based on Qualified Dublin Core. A MARC to DC mapping is used based on the official Library of Congress MARC to DC mapping, adapted to our own situation.

The original three EDM RDF core classes (Provided Cultural Heritage Object, Web Resource, Aggregation) are merged into one, Provided Cultural Heritage Object, with additional subclasses for the individual material types. The Library RDF model description is available from the Open Data website.

Triply data platform

Triply is a platform for storing, publishing and using linked data sets. It offers various ways of using the data: the Browser for browsing through the data, Table for presenting triples in tabular form, SPARQL endpoints with the Yasgui query editor, ElasticSearch full-text search, APIs (Triple Pattern Fragments, a JavaScript client) and Stories (articles incorporating interactive visualisations of the underlying data using SPARQL queries).

The Library of the University of Amsterdam Triply platform currently presents each of the ten datasets defined in Alma separately, as well as one combined Catalogue dataset covering the nine physical material types. SPARQL and ElasticSearch endpoints are defined only for this Catalogue dataset and for the Image Repository dataset.

Content negotiation

Content negotiation is the differentiated resolution of a PID-URI to different targets based on the requested response format. In this way one PID-URI for a specific item can lead to different representations of that item, for instance a record display for human consumption in a web interface, or a data representation for machine-readable interfaces. The Triply API supports a number of response formats (such as Turtle, JSON, JSON-LD and N3), both via HTTP headers and via HTTP URL parameters.

We have implemented content negotiation for our ARK PIDs as simple redirect rules to these Triply API response formats.
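A minimal sketch of such redirect rules follows. The MIME-type-to-format mapping and the resolver URL pattern are assumptions for illustration, not the Library's actual rules.

```python
# Map an HTTP Accept header to a redirect target on a linked data platform.
FORMAT_MAP = {
    "text/turtle": "ttl",
    "application/ld+json": "jsonld",
    "application/n-triples": "nt",
}

def resolve_pid(ark: str, accept: str) -> str:
    """Resolve a PID to an HTML record display or a data representation,
    depending on the requested response format (content negotiation)."""
    base = f"https://data.example.org/resource/{ark}"  # hypothetical resolver
    for mime, fmt in FORMAT_MAP.items():
        if mime in accept:
            return f"{base}?format={fmt}"
    return base  # default: human-readable record display
```

A resolver like this sits in front of the data platform: one persistent identifier, several representations.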


Data publication workflows can differ greatly in composition and maintenance, depending on the underlying systems, metadata quality and infrastructure. The extent of dependency on specific systems is an important factor.

For the central catalogue certain required actions and transformations are performed using internal Alma system utilities: Logical Sets, Publishing Profiles, Normalization Rules, harvesting endpoints. This way basic transformations and publication channels are implemented with system specific utilities.

The more complicated transformations and publication channels (linked data, API’s, etc.) are implemented using generic external tools. In time, it might become possible to implement all data publication features with the Ex Libris toolset. When that time comes the Library should have decided on their strategic position in this matter: implement open data as much as possible with the features of the source systems in use, or be as system independent as possible? Depending fully on system features means that overall maintenance is easier but in the case of a system migration everything has to be developed from scratch again. Depending on generic external utilities means that you have to develop everything from scratch, but in the case of a system migration most of the externally implemented functionality will continue working.

Follow up

After delivering these first open data utilities, the time has come for evaluation, improvement and expansion. Shortcomings, as mentioned above, can be identified based on user feedback and analysis of usage statistics. Datasets and formats can be added or updated, based on user requests and communication with target audiences. New data sources can be published, with appropriate workflows, datasets and formats. The current workflow can be evaluated internally and adapted if needed. The experiences with the data publication project will also have a positive influence on the internal digital infrastructure, data quality and data flows of the library. Last but not least, time will tell if and in what expected and unexpected ways the library collections open data will be used.

Elon Musk Disrupts Cryptocurrencies? / David Rosenthal

Recently there has been a regrettable failure of Bitcoin and most other cryptocurrencies to proceed in an orderly fashion moon-wards. I had great timing. As I started work on this post on Tuesday, Bitcoin was at $43,629, down 27% from its May 9th high of $59,592, which was itself already down 8% from the all-time high of $64,899 on April 14th. Yesterday I woke to find it had bottomed out at $30,000, down 49.7% from the May 9th peak. It bounced back to $42,434 before sliding again to $38,914 and recovering to $41,899. On April 21st the average fee per transaction spiked to $63.78. There is no way this makes sense as a store of value or a medium of exchange, only as a vehicle for speculation.

As Jemima Kelley notes in Crypto bros take the fight back to Elon as prices tank the speculators aren't happy and they know who is to blame:
Up until recently utterly enamoured with the self-stylised technoking, the brodom has become incensed after Musk sent the price of bitcoin sliding by tweeting first that Tesla would no longer be accepting bitcoin (because of its environmental impact), and then sent it down further over the weekend after appearing to suggest that Tesla would dump its bitcoin holdings because of the way he was being treated by the bros. (He later clarified that Tesla had not, at this point, sold any bitcoin.) Of course, he also found time to boast about his superior knowledge of money due to his time at PayPal.
Below the fold, I try to dispassionately assign blame for this flaw in the natural order of things.

The immediate reaction of the crypto-bros was, wait for it, to create the 9,857th (F---ELON) and 9,858th (F---ELONTWEET) cryptocurrencies and pump them! That'll show Elon who's boss!

The crypto-bros aren't completely wrong. In Elon Musk’s Bitcoin Fun Continues Matt Levine describes Musk's grip on the cryptocurrency markets, comparing it to a money machine based on a magic lamp with a genie inside who could move prices up or down:
  1. He definitely has the power to move Bitcoin and Dogecoin prices whenever he feels like it.
  2. He definitely exercises that power with some frequency and with no apparent pattern.
  3. He definitely has the money to buy lots of Bitcoin and Dogecoin before making them go up.
  4. If he did do that, he probably would not have much in the way of disclosure obligations, at least not real-time disclosure obligations.
  5. Similarly he could easily sell them before making them go down, without much in the way of disclosure obligations.
  6. The regulation and policing of Bitcoin and Dogecoin trading are rather less comprehensive and aggressive than the regulation and policing of stock trading.
So if he was doing the magic-lamp trading strategy — which, again, I don’t think he is — it would look, to an outside observer, more or less exactly like what he’s currently doing.

I just think that if you presented this possibility to any famous investor throughout history they would absolutely faint with excitement. And here Musk is, the second-richest person in the world, either doing it — in which case it’s one of the greatest trades, and also one of the greatest trolls, in history — or not doing it, rubbing the lamp and making Bitcoin and Dogecoin go up and down and up and down and up and down with his whims, just for fun, leaving billions of dollars of profit on the table because he doesn’t need the money and has the purest imaginable commitment to internet trolling.
Levine is right about the lack of regulation. Hamza Shaban's As cryptocurrency goes wild, fear grows about who might get hurt reports on a recent Congressional hearing:
The SEC’s recent warnings of the dangers of bitcoin follow calls for more muscular government action, establishing a federal watchdog with a clear mandate to oversee cryptocurrency’s regulatory gray area.

“Right now the exchanges trading in these crypto assets do not have a regulatory framework, either at the SEC or our sister agency, the Commodity Futures Trading Commission,” SEC Chair Gary Gensler told Congress earlier this month in one of his first remarks on cryptocurrency regulation. “Right now there’s not a market regulator around these crypto exchanges. And thus there’s really not protection against fraud or manipulation.”
"Not protection against" is an English-level understatement. "Rife with fraud and manipulation" would be nearer the mark. David Gerard is blunter in Bitcoin crashed today — it was market shenanigans, not China:
Today’s trigger won’t have been China, the Treasury, external real-world events, news announcements or macroeconomic considerations. It’ll be shenanigans internal to a thinly-traded, heavily-manipulated, largely unregulated commodity market.
Michael Batnick explains that, in fact, the crypto-bros are mostly wrong in blaming Musk. In Bitcoin is Crashing. This is What it Does he writes:
This is the 10th time since 2017 that Bitcoin was at an all-time high and then fell 30%.
  • -35%, January 2017
  • -33%, March 2017
  • -32%, May 2017
  • -40%, July 2017
  • -41%, September 2017
  • -30%, November 2017
  • -21%, December 2017
  • -23%, December 2017
  • -84%, December 2018
  • -31%, January 2021
  • -26%, February 2021
  • -32%, May 2021
And yet despite these brutal declines, the price is up ~50x over the same time. Bitcoin is the definition of no pain, no gain. Did you think getting hilariously rich was easy?
Well, yes, but the risk of losing it all is real. Shaban also quotes Andrew Bailey, governor of the Bank of England:
“Buy them only if you’re prepared to lose all your money.”
As I described in The Bitcoin "Price", cryptocurrencies are not really priced in fiat currencies, but in a "stablecoin" called Tether (USDT) which, like money market funds, is supposed to be tied to the US Dollar (USD). In Tether's case it was originally claimed that it was tied because it was backed 1-for-1 by USD in a bank account.

There are many reasons for skepticism about USDT's tie, among them Tether's admission in court two years ago that it was only 74% backed by USD, its loss of $850M in the Crypto Capital scam, its $18.5M settlement with the NY Attorney General over false claims, and its recent pathetic attempt to satisfy the transparency requirements of the settlement by publishing two pie charts.

Frances Coppola makes an important point in Tether’s smoke and mirrors. Tether's terms of service place them under no obligation to redeem Tethers for fiat or indeed for anything at all:
Tether reserves the right to delay the redemption or withdrawal of Tether Tokens if such delay is necessitated by the illiquidity or unavailability or loss of any Reserves held by Tether to back to Tether Tokens, and Tether reserves the right to redeem Tether Tokens by in-kind redemptions of securities and other assets held in the Reserves.
Coppola points out that this means the fact that only 2.94% of the backing is USD in a bank account isn't the real reason for skepticism:
if Tether is simply going to refuse redemption requests or offer people tokens it has just invented instead of fiat currency, it wouldn’t matter if the entire asset base was junk, since it will never have any significant need for cash.

So whether Tether’s "reserves" are cash equivalents doesn't matter. But what does matter is capital.

For banks, funds and other financial institutions, capital is the difference between assets and liabilities. It is the cushion that can absorb losses from asset price falls, whether because of fire sales to raise cash for redemption requests or simply from adverse market movements or creditor defaults.

The accountant's attestations reveal that Tether has very little capital. The gap between assets and liabilities is paper-thin: on 31st March 2021 (pdf), for example, it was 0.36% of total consolidated assets, on a balance sheet of more than $40bn in size. Stablecoin holders are thus seriously exposed to the risk that asset values will fall sufficiently for the par peg to USD to break – what money market funds call “breaking the buck”.

The money market fund Reserve Primary MMF broke the buck in 2008 due to significant losses from its holdings of Lehman paper. Its net asset value (NAV) only fell to 97 cents, but that was enough to trigger a rush for the exit. Reserve Primary became insolvent and was eventually wound up.
The Federal Reserve had to step in to prevent a run on all the other money market funds, promising, as with bank accounts, to make sure they didn't lose investors' money. It is not going to do that if Tether breaks the buck. And if it does, Tether can simply refuse to redeem USDT, which would make the "prices" of cryptocurrencies entirely untethered from fiat.
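The thinness of the cushion Coppola describes is easy to quantify with the figures quoted above:

```python
def capital_ratio(assets: float, liabilities: float) -> float:
    """Capital (assets minus liabilities) as a fraction of total assets."""
    return (assets - liabilities) / assets

# On a balance sheet of more than $40bn, a 0.36% capital ratio leaves
# roughly $144M of cushion -- a fall of less than half a percent in asset
# values would be enough to break the par peg to USD.
cushion = 0.0036 * 40e9
```

Compare that with Reserve Primary, whose NAV only fell to 97 cents before the run began.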

David Gerard notices Tether's involvement in today's action:
But someone seems worried about Tether. There are a few thinly-traded USDT/USD pairs on exchanges; Coinbase just launched theirs. Usually, the peg is solid at $1.00 plus-or-minus a tenth of a cent.

Today, the peg slipped — and tethers have been bouncing between $0.90 and $1.10 on different exchanges. This sort of thing hasn’t been seen since October 2018, when Bitfinex found out that $850 million of Tether’s backing had disappeared with their dodgy payments processor.
Bitcoin went from $30,000.00 (precisely) back up over $42,000 this afternoon, coincidentally with Tether sending 500 million USDT to Binance. Tether’s still here, with their thumb on the scale.
Bitcoin is a market with a very visible hand.

Update 31st May 2021: As regards pump-and-dump schemes, Izabella Kaminska's Goldman’s not giving up hope on Bitcoin includes this comment on the crash on Wednesday May 19th:
Take the following Nostradamus-esque note which was supposedly posted to 4chan last Tuesday, May 18 predicting the entire ugly trading episode of last week:
How did this insider really know what was coming? Or was it simply a lucky coincidence? Who was his group trying to shake out anyway? (Was it Elon?) How does that bode for other sizeable and clearly transparent corporate treasury holders of bitcoin in the future? And why did they share this information on 4chan?

NFTs and Web Archiving / David Rosenthal

One of the earliest observations of the behavior of the Web at scale was "link rot". There were a lot of 404s, broken links. Research showed that the half-life of Web pages was alarmingly short. Even in 1996 this problem was obvious enough for Brewster Kahle to found the Internet Archive to address it. From the Wikipedia entry for Link Rot:
A 2003 study found that on the Web, about one link out of every 200 broke each week,[1] suggesting a half-life of 138 weeks. This rate was largely confirmed by a 2016–2017 study of links in Yahoo! Directory (which had stopped updating in 2014 after 21 years of development) that found the half-life of the directory's links to be two years.[2]
One might have thought that academic journals were a relatively stable part of the Web, but research showed that their references decayed too, just somewhat less rapidly. A 2013 study found a half-life of 9.3 years. See my 2015 post The Evanescent Web.
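The half-life figures quoted above follow directly from a constant per-period loss rate; a quick check:

```python
import math

def half_life(loss_rate_per_period: float) -> float:
    """Periods until half the links survive, given a per-period loss rate."""
    return math.log(0.5) / math.log(1.0 - loss_rate_per_period)

# One link in 200 breaking per week gives a half-life of about 138 weeks,
# matching the 2003 study quoted above.
weeks = half_life(1 / 200)
```

The same formula applied to the 9.3-year journal-reference half-life implies an annual loss rate of about 7%.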

I expect you have noticed the latest outbreak of blockchain-enabled insanity, Non-Fungible Tokens (NFTs). Someone "paying $69M for a JPEG" or $560K for a New York Times column attracted a lot of attention. Follow me below the fold for the connection between NFTs, "link rot" and Web archiving.

Kahle's idea for addressing "link rot", which became the Wayback Machine, was to make a copy of the content at some URL, say:
keep the copy for posterity, and re-publish it at a URL like:
What is the difference between the two URLs? The original is controlled by Example.Com, Inc.; they can change or delete it on a whim. The copy is controlled by the Internet Archive, whose mission is to preserve it unchanged "for ever". The original is subject to "link rot", the second is, one hopes, not subject to "link rot". The Wayback Machine's URLs have three components:
  • locates the archival copy at the Internet Archive.
  • 19960615083712 indicates that the copy was made on 15th June, 1996 at 8:37:12.
  • is the URL from which the copy was made.
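Putting the three components together, a Wayback Machine URL has the well-known shape https://web.archive.org/web/&lt;timestamp&gt;/&lt;original-url&gt;, which can be taken apart like this:

```python
import re
from datetime import datetime

# The standard Wayback Machine URL shape: host, 14-digit capture
# timestamp, then the original URL.
WAYBACK = re.compile(r"^https://web\.archive\.org/web/(\d{14})/(.+)$")

def parse_wayback(url: str) -> tuple[datetime, str]:
    """Split an archival URL into its capture time and original URL."""
    m = WAYBACK.match(url)
    if not m:
        raise ValueError("not a Wayback Machine URL")
    return datetime.strptime(m.group(1), "%Y%m%d%H%M%S"), m.group(2)
```

For the timestamp in the example above, 19960615083712, this yields a capture time of 15th June 1996 at 8:37:12.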
The fact that the archival copy is at a different URL from the original causes a set of problems that have bedevilled Web archiving. One is that, if the original goes away, all the links that pointed to it break, even though there may be an archival copy to which they could point to fulfill the intent of the link creator. Another is that, if the content at the original URL changes, the link will continue to resolve but the content it returns may no longer reflect the intent of the link creator, although there may be an archival copy that does. Even in the early days of the Web it was evident that Web pages changed and vanished at an alarming rate.

The point is that the meaning of a generic Web URL is "whatever content, or lack of content, you find at this location". That is why URL stands for Uniform Resource Locator. Note the difference from URI, which stands for Uniform Resource Identifier. Anyone can create a URL or URI linking to whatever content they choose, but doing so provides no rights in or control over the linked-to content.

In People's Expensive NFTs Keep Vanishing. This Is Why, Ben Munster reports that:
over the past few months, numerous individuals have complained about their NFTs going “missing,” “disappearing,” or becoming otherwise unavailable on social media. This despite the oft-repeated NFT sales pitch: that NFT artworks are logged immutably, and irreversibly, onto the Ethereum blockchain.
So NFTs have the same problem that Web pages do. Isn't the blockchain supposed to make things immortal and immutable?

Kyle Orland's Ars Technica’s non-fungible guide to NFTs provides an over-simplified explanation:
When NFT’s are used to represent digital files (like GIFs or videos), however, those files usually aren’t stored directly “on-chain” in the token itself. Doing so for any decently sized file could get prohibitively expensive, given the cost of replicating those files across every user on the chain. Instead, most NFTs store the actual content as a simple URI string in their metadata, pointing to an Internet address where the digital thing actually resides.
NFTs are just links to the content they represent, not the content itself. The Bitcoin blockchain actually does contain some images, such as this ASCII portrait of Len Sassaman and some pornographic images. But the blocks of the Bitcoin blockchain were originally limited to 1MB and are now effectively limited to around 2MB, enough space for small image files. What’s the Maximum Ethereum Block Size? explains:
Instead of a fixed limit, Ethereum block size is bound by how many units of gas can be spent per block. This limit is known as the block gas limit ... At the time of writing this, miners are currently accepting blocks with an average block gas limit of around 10,000,000 gas. Currently, the average Ethereum block size is anywhere between 20 to 30 kb in size.
That's a little out-of-date. Currently the block gas limit is around 12.5M gas per block and the average block is about 45KB. Nowhere near enough space for a $69M JPEG. The NFT for an artwork can only be a link. Most NFTs are ERC-721 tokens, providing the optional Metadata extension:
/// @title ERC-721 Non-Fungible Token Standard, optional metadata extension
/// @dev See
/// Note: the ERC-165 identifier for this interface is 0x5b5e139f.
interface ERC721Metadata /* is ERC721 */ {
    /// @notice A descriptive name for a collection of NFTs in this contract
    function name() external view returns (string _name);

    /// @notice An abbreviated name for NFTs in this contract
    function symbol() external view returns (string _symbol);

    /// @notice A distinct Uniform Resource Identifier (URI) for a given asset.
    /// @dev Throws if `_tokenId` is not a valid NFT. URIs are defined in RFC
    ///  3986. The URI may point to a JSON file that conforms to the "ERC721
    ///  Metadata JSON Schema".
    function tokenURI(uint256 _tokenId) external view returns (string);
}
The Metadata JSON Schema specifies an object with three string properties:
  • name: "Identifies the asset to which this NFT represents"
  • description: "Describes the asset to which this NFT represents"
  • image: "A URI pointing to a resource with mime type image/* representing the asset to which this NFT represents. Consider making any images at a width between 320 and 1080 pixels and aspect ratio between 1.91:1 and 4:5 inclusive."
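A token's metadata object can be checked against this schema shape. The sketch below is illustrative: real-world metadata often carries extra fields, and the example values are hypothetical.

```python
# The three string properties the ERC-721 Metadata JSON Schema describes.
REQUIRED = {"name", "description", "image"}

def check_erc721_metadata(meta: dict) -> bool:
    """True if the object has the three required string properties."""
    return REQUIRED.issubset(meta) and all(
        isinstance(meta[k], str) for k in REQUIRED
    )

example = {
    "name": "Example Art",
    "description": "An illustrative token",
    # A URI only -- the image itself is off-chain (placeholder CID).
    "image": "ipfs://<CID>/art.png",
}
```

Note that passing this check says nothing about whether the image URI still resolves, which is the crux of what follows.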
Note that the JSON metadata is not in the Ethereum blockchain, it is only pointed to by the token on the chain. If the art-work is the "image", it is two links away from the blockchain. So, given the evanescent nature of Web links, the standard provides no guarantee that the metadata exists, or is unchanged from when the token was created. Even if it is, the standard provides no guarantee that the art-work exists or is unchanged from when the token is created.

Caveat emptor — Absent unspecified actions, the purchaser of an NFT is buying a supposedly immutable, non-fungible object that points to a URI pointing to another URI. In practice both are typically URLs. The token provides no assurance that either of these links resolves to content, or that the content they resolve to at any later time is what the purchaser believed at the time of purchase. There is no guarantee that the creator of the NFT had any copyright in, or other rights to, the content to which either of the links resolves at any particular time.

There are thus two issues to be resolved about the content of each of the NFT's links:
  • Does it exist? I.e. does it resolve to any content?
  • Is it valid? I.e. is the content to which it resolves unchanged from the time of purchase?
These are the same questions posed by the Holy Grail of Web archiving, persistent URLs.

Assuming existence for now, how can validity be assured? There have been a number of systems that address this problem by switching from naming files by their location, as URLs do, to naming files by their content by using the hash of the content as its name. The idea was the basis for Bram Cohen's highly successful BitTorrent — it doesn't matter where the data comes from provided its integrity is assured because the hash in the name matches the hash of the content.
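Content addressing is easy to sketch: the name is the hash of the bytes, so integrity checking falls out of the lookup. This toy store uses a plain SHA-256 hex digest as the name; IPFS actually uses multihash-based CIDs.

```python
import hashlib

STORE: dict[str, bytes] = {}  # toy in-memory content-addressable store

def put(content: bytes) -> str:
    """Store content under the hex digest of its SHA-256 hash."""
    name = hashlib.sha256(content).hexdigest()
    STORE[name] = content
    return name

def get(name: str) -> bytes:
    """Fetch content and verify it matches its name -- it does not matter
    where the bytes came from, only that the hash checks out."""
    content = STORE[name]
    if hashlib.sha256(content).hexdigest() != name:
        raise ValueError("content does not match its name")
    return content
```

Rename the file, and the name no longer matches the bytes; change the bytes, and every existing link to the old name keeps meaning exactly what it meant before.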

The content-addressable file system most used for NFTs is the Interplanetary File System (IPFS). From its Wikipedia page:
As opposed to a centrally located server, IPFS is built around a decentralized system[5] of user-operators who hold a portion of the overall data, creating a resilient system of file storage and sharing. Any user in the network can serve a file by its content address, and other peers in the network can find and request that content from any node who has it using a distributed hash table (DHT). In contrast to BitTorrent, IPFS aims to create a single global network. This means that if Alice and Bob publish a block of data with the same hash, the peers downloading the content from Alice will exchange data with the ones downloading it from Bob.[6] IPFS aims to replace protocols used for static webpage delivery by using gateways which are accessible with HTTP.[7] Users may choose not to install an IPFS client on their device and instead use a public gateway.
If the purchaser gets both the NFT's metadata and the content to which it refers via IPFS URIs, they can be assured that the data is valid. What do these IPFS URIs look like? The (excellent) IPFS documentation explains, using gateway URLs of the form <gateway host>/ipfs/<CID>:

Browsers that support IPFS can redirect these requests to your local IPFS node, while those that don't can fetch the resource from the gateway.

You can swap out for your own http-to-ipfs gateway, but you are then obliged to keep that gateway running forever. If your gateway goes down, users with IPFS aware tools will still be able to fetch the content from the IPFS network as long as any node still hosts it, but for those without, the link will be broken. Don't do that.
Note the assumption here that the gateway will be running forever. Note also that only some browsers are capable of accessing IPFS content without using a gateway. Thus the gateway is a single point of failure, although the failure is not complete. In practice NFTs using IPFS URIs are dependent upon the continued existence of Protocol Labs, the organization behind IPFS. The URIs in the NFT metadata are actually URLs; they don't point to IPFS, but to a Web server that accesses IPFS.

Pointing to the NFT's metadata and content using IPFS URIs assures their validity but does it assure their existence? The IPFS documentation's section Persistence, permanence, and pinning explains:
Nodes on the IPFS network can automatically cache resources they download, and keep those resources available for other nodes. This system depends on nodes being willing and able to cache and share resources with the network. Storage is finite, so nodes need to clear out some of their previously cached resources to make room for new resources. This process is called garbage collection.

To ensure that data persists on IPFS, and is not deleted during garbage collection, data can be pinned to one or more IPFS nodes. Pinning gives you control over disk space and data retention. As such, you should use that control to pin any content you wish to keep on IPFS indefinitely.
To assure the existence of the NFT's metadata and content they must both be not just written to IPFS but also pinned to at least one IPFS node.
To ensure that your important data is retained, you may want to use a pinning service. These services run lots of IPFS nodes and allow users to pin data on those nodes for a fee. Some services offer free storage-allowance for new users. Pinning services are handy when:
  • You don't have a lot of disk space, but you want to ensure your data sticks around.
  • Your computer is a laptop, phone, or tablet that will have intermittent connectivity to the network. Still, you want to be able to access your data on IPFS from anywhere at any time, even when the device you added it from is offline.
  • You want a backup that ensures your data is always available from another computer on the network if you accidentally delete or garbage-collect your data on your own computer.
Thus, to assure the existence of the NFT's metadata and content, pinning must be rented from a pinning service, another single point of failure.

In summary, it is possible to take enough precautions and pay enough ongoing fees to be reasonably assured that your $69M NFT and its metadata and the JPEG it refers to will remain accessible. In practice, however, these precautions are definitely not always taken. David Gerard reports:
But functionally, IPFS works the same way as BitTorrent with magnet links — if nobody bothers seeding your file, there’s no file there. Nifty Gateway turn out not to bother to seed literally the files they sold, a few weeks later. [Twitter; Twitter]
Anil Dash claims to have invented, with Kevin McCoy, the concept of NFTs referencing Web URLs in 2014. He writes in his must-read NFTs Weren’t Supposed to End Like This:
Seven years later, all of today’s popular NFT platforms still use the same shortcut. This means that when someone buys an NFT, they’re not buying the actual digital artwork; they’re buying a link to it. And worse, they’re buying a link that, in many cases, lives on the website of a new start-up that’s likely to fail within a few years. Decades from now, how will anyone verify whether the linked artwork is the original?

All common NFT platforms today share some of these weaknesses. They still depend on one company staying in business to verify your art. They still depend on the old-fashioned pre-blockchain internet, where an artwork would suddenly vanish if someone forgot to renew a domain name. “Right now NFTs are built on an absolute house of cards constructed by the people selling them,” the software engineer Jonty Wareing recently wrote on Twitter.
My only disagreement with Dash is that, as someone who worked on archiving the "old-fashioned pre-blockchain internet" for two decades, I don't believe that there is a new-fangled post-blockchain Internet that makes the problems go away. And neither does David Gerard:
The pictures for NFTs are often stored on the Interplanetary File System, or IPFS. Blockchain promoters talk like IPFS is some sort of bulletproof cloud storage that works by magic and unicorns.
Update 22nd May 2021: At least one artist understands NFTs. Tip of the hat to David Gerard.

Update 28th May, 2021: Kimberly Parker's Most artists are not making money off NFTs and here are some graphs to prove it has some numbers on the NFT market during the mania which illustrate the extreme Gini coefficients endemic to cryptocurrencies:
The first time you sell an NFT it’s called a Primary Sale. Everything after that is called a Secondary Sale. According to this, 67.6% of Sales have not had a Secondary Sale and 19.5% have had one Secondary Sale. ...
  • 33.6% of Primary Sales were $100 or less
  • 20.0% of Primary Sales were $100-$200
  • 11.1% of Primary Sales were $200-$300
  • 7.7% of Primary Sales were $300-$400
  • 3.9% of Primary Sales were $400–500
  • 3.3% of Primary Sales were $500-$600
  • 2.5% of Primary Sales were $600–$700
Most NFT sites will recommend you set your sale price at 0.5 ETH, which was about $894 USD on March 19th. The number of Primary Sales that ended up selling for the recommended price was a whopping 1.8%.

The largest group of Primary Sales (34%) were for $100 or less. For $100, you can expect to have 72.5% — 157.5% of your Sale deducted by fees*. That’s an average(!) of 100.5%, leaving you with a $0.50 deficit or more.

The next biggest group of Primary Sales (20%) were for $100-$200. For $200, you can expect to have 37.5–80% of your Sale deducted by fees*. That’s an average(!) of 54%, leaving you with $92 or less.

The next biggest group of Primary Sales (11%) were for $200-$300. For $300, you can expect to have 25.8% — 54.2% of your Sale deducted by fees*. That’s an average(!) of 38.5%, leaving you with $223 or less.

Thoughts on Growing Up / Peter Murray

’Tis the season for graduations, and this year my nephew is graduating from high school. My sister-in-law created a memory book—”a surprise Book of Advice as he moves to the next phase of his life.” What an interesting opportunity to reflect! This is what I came up with:

Sometime between when I became an adult and now, the word “adulting” was coined. My generation just called it “growing up.” The local top-40 radio station uses “hashtag-adulting” to mean all of those necessary life details that now become your own responsibility. (“Hashtag” is something new, too, for what that’s worth.)

Growing up is more than life necessities, though. This is an exciting phase of life that you’ve built up to—many more doors of possibilities are opening and now you get to pick which ones to go through. Pick carefully. Each door you go through starts to close off others. Pick many. Use this life stage to try many things to find what is fun and what is meaningful (and aim for both fun and meaningful). You are on a solid foundation, and I’m eager to see what you discover “adulting” means to you.

Fedora Migration Paths and Tools Project Update: May 2021 / DuraSpace News

This is the eighth in a series of monthly updates on the Fedora Migration Paths and Tools project – please see last month’s post for a summary of the work completed up to that point. This project has been generously funded by the IMLS.

The University of Virginia has completed their data migration and successfully indexed the content into a new Fedora 6.0 instance deployed in AWS using the fcrepo-aws-deployer tool. They have also tested the fcrepo-migration-validator tool and provided some initial feedback to the team for improvements. Some work remains to update front-end indexes for the content in Fedora 6.0. The team will also investigate some performance issues encountered while migrating and indexing content in the AWS environment, in order to document relevant recommendations for institutions wishing to migrate to a similar environment.

Based on this work, we will be offering an initial online workshop on Migrating from Fedora 3.x to Fedora 6.0. This workshop is free to attend with limited capacity so please register in advance. This is a technical workshop pitched at an intermediate level. Prior experience with Fedora is preferred, and participants should be comfortable using a Command Line Interface and Docker. The workshop will take place on June 22 at 11am ET.

The Whitman College team has been busy iterating on test migrations of representative collections into a staging server using the islandora_workbench tool. The team has been making updates to the migration tool, site configuration, and documentation along the way to better support future migrations. In particular, the work the team has done to iterate on the spreadsheets until they were properly configured for ingest will be very useful to other institutions interested in following a similar path. Once the testing and validation of functional requirements is complete we will begin the full migration into the production site.

We are nearing the end of the pilot phase of the grant, after which we will finalize a draft of the migration toolkit and share it with the community for feedback. While this toolkit will be openly available for anyone who would like to review it, we are particularly interested in working with institutions with existing Fedora 3.x repositories that would like to test the tools and documentation and provide feedback to help us improve the resources. If you would like to be more closely involved in this effort please contact David Wilcox <> for more information.


Storage Update / David Rosenthal

It has been too long since I wrote about storage technologies, so below the fold I comment on a keynote and three papers of particular interest from Usenix's File and Storage Technologies conference last February, and a selection of other news.

DNA Storage

I have been blogging about DNA data storage since 2012, and last blogged about the groundbreaking work of the U. Washington/Microsoft Molecular Information Systems Lab in January 2020. For FAST's closing keynote, entitled DNA Data Storage and Near-Molecule Processing for the Yottabyte Era, Karin Strauss gave a comprehensive overview of the technology for writing and reading data in DNA, and its evolution since the 1980s. Luis Ceze described how computations can be performed on data in DNA, ending with a vision of how quantum and molecular computing can be integrated at the system level. Their abstract reads:
DNA data storage is an attractive option for digital data storage because of its extreme density, durability, eternal relevance and environmental sustainability. This is especially attractive when contrasted with the exponential growth in world-wide digital data production. In this talk we will present our efforts in building an end-to-end system, from the computational component of encoding and decoding to the molecular biology component of random access, sequencing and fluidics automation. We will also discuss some early efforts in building a hybrid electronic/molecular computer system that can offer more than data storage, for example, image similarity search.
The video of their talk is here, and it is well worth watching.

Facebook's File System

I wrote back in 2014 about Facebook's layered storage architecture, with Haystack as the hot layer, f4 as the warm layer, and optical media as the cold layer. Now, in Facebook's Tectonic Filesystem: Efficiency from Exascale, Satadru Pan et al. describe how Facebook realized many advantages by combining the hot and warm layers in a single infrastructure. Their abstract reads:
Tectonic is Facebook’s exabyte-scale distributed filesystem. Tectonic consolidates large tenants that previously used service-specific systems into general multitenant filesystem instances that achieve performance comparable to the specialized systems. The exabyte-scale consolidated instances enable better resource utilization, simpler services, and less operational complexity than our previous approach. This paper describes Tectonic’s design, explaining how it achieves scalability, supports multitenancy, and allows tenants to specialize operations to optimize for diverse workloads. The paper also presents insights from designing, deploying, and operating Tectonic.
They explain how these advantages are generated:
Tectonic simplifies operations because it is a single system to develop, optimize, and manage for diverse storage needs. It is resource-efficient because it allows resource sharing among all cluster tenants. For instance, Haystack was the storage system specialized for new blobs; it bottlenecked on hard disk IO per second (IOPS) but had spare disk capacity. f4, which stored older blobs, bottlenecked on disk capacity but had spare IO capacity. Tectonic requires fewer disks to support the same workloads through consolidation and resource sharing.
The paper is well worth reading; the details of the implementation are fascinating and, as the graphs show, the system achieves performance comparable with Haystack and f4 with higher efficiency.

Caching At Scale

A somewhat similar idea underlies The Storage Hierarchy is Not a Hierarchy: Optimizing Caching on Modern Storage Devices with Orthus by Kan Wu et al:
We introduce non-hierarchical caching (NHC), a novel approach to caching in modern storage hierarchies. NHC improves performance as compared to classic caching by redirecting excess load to devices lower in the hierarchy when it is advantageous to do so. NHC dynamically adjusts allocation and access decisions, thus maximizing performance (e.g., high throughput, low 99%-ile latency). We implement NHC in Orthus-CAS (a block-layer caching kernel module) and Orthus-KV (a user-level caching layer for a key-value store). We show the efficacy of NHC via a thorough empirical study: Orthus-KV and Orthus-CAS offer significantly better performance (by up to 2x) than classic caching on various modern hierarchies, under a range of realistic workloads.
They use an example to motivate their approach:
consider a two-level hierarchy with a traditional Flash-based SSD as the capacity layer, and a newer, seemingly faster Optane SSD as the performance layer. As we will show, in some cases, Optane outperforms Flash, and thus the traditional caching/tiering arrangement works well. However, in other situations (namely, when the workload has high concurrency), the performance of the devices is similar (i.e., the storage hierarchy is actually not a hierarchy), and thus classic caching and tiering do not utilize the full bandwidth available from the capacity layer. A different approach is needed to maximize performance.
To over-simplify, their approach satisfies requests using the performance layer until it appears to be saturated, then satisfies as much of the remaining load as it can from the capacity layer. The saturation may be caused simply by excess load, or it may be caused by the effects of concurrency.
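The load-shedding idea can be illustrated with a toy sketch. This is not the Orthus code; the saturation threshold here is a made-up constant, whereas NHC measures and adjusts it dynamically:

```python
def route(requests, perf_limit):
    """Split a batch of requests between the performance and capacity layers.

    Classic caching would send everything to the performance layer and queue
    the excess; non-hierarchical caching redirects the excess to the capacity
    layer to use its spare bandwidth.
    """
    to_perf = requests[:perf_limit]       # up to the saturation point
    to_capacity = requests[perf_limit:]   # excess load bypasses the cache
    return to_perf, to_capacity

# Ten outstanding requests against a performance layer that saturates at six:
# four requests are served directly from the capacity layer instead of queueing.
perf, cap = route(list(range(10)), 6)
print(len(perf), len(cap))
```

The payoff is that when the two layers perform similarly under high concurrency, the capacity layer's bandwidth is no longer left idle behind a queue at the cache.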

3D XPoint

One of the media Kan Wu et al evaluated in their storage hierarchies was Intel's Optane, based on the Intel/Micron 3D XPoint Storage-Class Memory (SCM) technology. Alas, Chris Mellor's Micron: We're pulling the plug on 3D XPoint. Anyone in the market for a Utah chip factory? doesn't bode well for its future availability:
Intel and Micron started 3D XPoint development in 2012, with Intel announcing the technology and its Optane brand in 2015, claiming it was 1,000 times faster than flash, with up to 1,000 times flash's endurance. That speed claim was not the case for block-addressable Optane SSDs, which used a PCIe interface. However bit-addressable Optane Persistent Memory (PMEM), which comes in a DIMM form factor, was much faster than SSDs but slower than DRAM. It is a form of storage-class memory, and required system and application software changes for its use.

These software changes were complex and Optane PMEM adoption has been slow, with Intel ploughing resources into building an ecosystem of enterprise software partners that support Optane PMEM.

Intel decided to make Optane PMEM a proprietary architecture with a closed link to specific Xeon CPUs. It has not made this CPU-Optane DIMM interconnect open and neither AMD, Arm nor any other CPU architectures can use it. Nor has Intel added CXL support to Optane.

The result is that, more than five years after its introduction, it is still not in wide scale use.
Intel priced Optane too high for the actual performance it delivered, especially considering the programming changes it needed to achieve full performance. And proprietary interfaces in the PC and server space face huge difficulties in gaining market share.

Hard Disk Stagnation

As Chris Mellor reports in Seagate solidifies HDD market top spot as areal density growth stalls, the hard disk market remains a two-and-a-half-player market focused on nearline, with Toshiba inching up in third place by taking market share from Western Digital, pushing the latter a bit further behind Seagate. The promised technology transition to HAMR and MAMR is more than a decade late and progressing very slowly. Mellor quotes Tom Coughlin:
“The industry is in a period of extended product and laboratory areal density stagnation, exceeding the length of prior stagnations.”

The problem is that a technology transition from perpendicular magnetic recording (PMR), which has reached a limit in terms of decreasing bit area, to energy-assisted technologies – which support smaller bit areas – has stalled.

The two alternatives, HAMR (Heat-Assisted Magnetic Recording) and MAMR (Microwave-Assisted Magnetic Recording) both require new recording medium formulations and additional components on the read-write heads to generate the heat or microwave energy required. That means extra cost. So far none of the three suppliers: Seagate (HAMR), Toshiba and Western Digital (MAMR), have been confident enough in the characteristics of their technologies to make the switch from PMR across their product ranges.
Wikibon analyst David Floyer said: “HDD vendors of HAMR and MAMR are unlikely to drive down the costs below those of the current PMR HDD technology.”

Due to this: “Investments in HAMR and MAMR are not the HDD vendors’ main focus. Executives are placing significant emphasis on production efficiency, lower sales and distribution costs, and are extracting good profits in a declining market. Wikibon would expect further consolidation of vendors and production facilities as part of this focus on cost reduction.”
It has been this way for years. The vendors want to delay the major costs of the transition as long as possible. In the meantime, since the areal density of the PMR platters isn't growing, the best they can do is to add platters for capacity, and add a second set of arms and heads for performance, and thus add costs. Jim Salter reports on an example in Seagate’s new Mach.2 is the world’s fastest conventional hard drive:
Seagate has been working on dual-actuator hard drives—drives with two independently controlled sets of read/write heads—for several years. Its first production dual-actuator drive, the Mach.2, is now "available to select customers," meaning that enterprises can buy it directly from Seagate, but end-users are out of luck for now.

Seagate lists the sustained, sequential transfer rate of the Mach.2 as up to 524MBps—easily double that of a fast "normal" rust disk and edging into SATA SSD territory. The performance gains extend into random I/O territory as well, with 304 IOPS read / 384 IOPS write and only 4.16 ms average latency. (Normal hard drives tend to be 100/150 IOPS and about the same average latency.)
It is a 14TB, helium-filled PMR drive. It isn't a surprise that the latency doesn't improve; the selected head still needs to seek to where the data is, so dual actuators don't help.

Seagate's Roadmap

At the 2009 Library of Congress workshop on Architectures for Digital Preservation, Dave Anderson of Seagate presented the company's roadmap for hard disks. He included a graph projecting that the next recording technology, Heat Assisted Magnetic Recording (HAMR), would take over in the next year, and would be supplanted by a successor technology called Bit Patterned Media around 2015. I started writing skeptically about industry projections of technology evolution the next year, in 2010.

Whenever hard disk technology stagnates, as it has recently, the industry tries to distract the customers, and delight the good Dr. Pangloss, by publishing roadmaps of the glorious future awaiting them. Anton Shilov's Seagate's Roadmap: The Path to 120 TB Hard Drives covers the latest version:
Seagate recently published its long-term technology roadmap revealing plans to produce ~50 TB hard drives by 2026 and 120+ TB HDDs after 2030. In the coming years, Seagate is set to leverage usage of heat-assisted magnetic recording (HAMR), adopt bit patterned media (BPM) in the long term, and to expand usage of multi-actuator technology (MAT) for high-capacity drives. This is all within the 3.5-inch form factor.
In the recent years HDD capacity has been increasing rather slowly as perpendicular magnetic recording (PMR), even boosted with two-dimensional magnetic recording (TDMR), is reaching its limits. Seagate's current top-of-the-range HDD features a 20 TB capacity and is based on HAMR, which not only promises to enable 3.5-inch hard drives with a ~90 TB capacity in the long term, but also to allow Seagate to increase capacities of its products faster.

In particular, Seagate expects 30+ TB HDDs to arrive in calendar 2023, then 40+ TB drives in 2024 ~ 2025, and then 50+ TB HDDs sometimes in 2026. This was revealed at its recent Virtual Analyst Event. In 2030, the manufacturer intends to release a 100 TB HDD, with 120 TB units following on later next decade. To hit these higher capacities, Seagate is looking to adopt new types of media.
Today's 20 TB HAMR HDD uses nine 2.22-TB platters featuring an areal density of around 1.3 Tb/inch2. To build a nine-platter 40 TB hard drive, the company needs HAMR media featuring an areal density of approximately 2.6 Tb/inch2. Back in 2018~2019 the company already achieved a 2.381 Tb/inch2 areal density in spinstand testing in its lab and recently it actually managed to hit 2.6 Tb/inch2 in the lab, so the company knows how to build media for 40 TB HDDs. However to build a complete product, it will still need to develop the suitable head, drive controller, and other electronics for its 40 TB drive, which will take several years.
In general, Seagate projects HAMR technology to continue scaling for years to come without cardinal changes. The company expects HAMR and nanogranular media based on glass substrates and featuring iron platinum alloy (FePt) magnetic films to scale to 4 ~ 6 Tb/inch2 in areal density. This should enable hard drives of up to 90 TB in capacity.

In a bid to hit something like 105 TB, Seagate expects to use ordered-granular media with 5 ~ 7 Tb/inch2 areal density. To go further, the world's largest HDD manufacturer plans to use 'fully' bit patterned media (BPM) with an 8 Tb/inch2 areal density or higher. All new types of media will still require some sort of assisted magnetic recording, so HAMR will stay with us in one form or another for years to come.
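The capacities quoted above are easy to sanity-check with a back-of-the-envelope calculation (my arithmetic, not Seagate's figures):

```python
# Nine 2.22-TB platters should give roughly today's 20 TB drive, and a
# nine-platter 40 TB drive needs roughly double the areal density.
platters = 9
platter_tb = 2.22
today_density = 1.3    # Tb/inch^2, current HAMR media
target_density = 2.6   # Tb/inch^2, needed for a 40 TB drive

drive_tb = platters * platter_tb
density_scale = target_density / today_density

print(round(drive_tb, 2))   # just under 20 TB
print(density_scale)        # a 2x jump in areal density
```

The numbers are internally consistent, which is more than can be said for some earlier roadmaps.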
After they finally bite the investment bullet for mass adoption of HAMR, they're going to postpone the next big investment, getting to BPM, as long as they possibly can, just as they did for a decade with HAMR.

As usual, I suggest viewing the roadmap skeptically. One area where I do agree with Seagate is:
Speaking of TCO, Seagate is confident that hard drives will remain cost-effective storage devices for many years to come. Seagate believes that 3D NAND will not beat HDDs in terms of per-GB cost any time soon and TCO of hard drives will remain at competitive levels. Right now, 90% of data stored by cloud datacenters is stored on HDDs and Seagate expects this to continue.

How to reduce latency with Samsung Galaxy Buds on macOS / Jonathan Brinley

I use Samsung Galaxy Buds as my primary headphones. They’re great. Sometimes, though, I notice that audio coming from my Mac is delayed by ~1 second compared to the video.

After a bit of digging, I’ve discovered that macOS defaults to using the SBC codec, which does some buffering that can lead to that delay. Switching to the AAC codec (or fiddling with SBC settings) can eliminate that delay.

The magic incantation:

sudo defaults write bluetoothaudiod "Enable AAC codec" -bool true

Reconnect your headset, and the delay is unnoticeable.

Equitable but Not Diverse: Universal Design for Learning is Not Enough / In the Library, With the Lead Pipe

By Amanda Roth, Gayatri Singh (posthumous), and Dominique Turnbow

In Brief

Information literacy instruction is increasingly being delivered online, particularly through the use of learning objects. The development practice for creating learning objects often uses the Universal Design for Learning (UDL) framework to meet needs for inclusivity. However, missing from this framework is the lens of diversity. This article calls out the need for practices in learning object development that go beyond UDL, so that learning objects are inclusive from the lens of equity, diversity, and inclusion. Looking at transferable techniques used in in-person instruction, we suggest guidelines to fill the inclusivity gap in learning object creation.


In response to the disruption of the in-person learning environment, many teaching librarians are moving information literacy instruction to the asynchronous learning environment. In place of the traditional classroom, learners frequently engage with information literacy concepts through online learning objects. Online learning objects are defined as “any digital resource that can be reused to support learning.”1 In this mode of instruction, Universal Design for Learning (UDL) is employed to address inclusivity via equitable access. While UDL enhances inclusivity in terms of access, by providing multiple modes of interacting with content that is technically accessible, it stops there. Inclusive design is “design that considers the full range of human diversity with respect to ability, language, culture, gender, age and other forms of human difference.”2 The question as to how we create learning objects that support all aspects of equity, diversity, and inclusion (EDI) has yet to be explored.

The Gap in Our Practice

One can find best practices for creating learning objects that focus on learning outcomes, user engagement, usability and design, and evaluation in the literature.3 When examining diversity in online learning, there remains a heavy focus on diverse learner populations and methods to incorporate different learning styles as a means to support learning for a wide range of individuals.4 In recognizing that it benefits all learners to have content presented in multiple formats, UDL’s framework relies heavily on providing multiple means of engagement, representation, and action or expression, to create a universal learning experience. This focus is evident in the UDL guideline principle of representation. For example, the UDL Representation guideline checklist by the West Virginia Department of Education (n.d.) considers the following:

  • Perception: Content should not be dependent on a single sense like sight, hearing, movement, or touch.
  • Language and Symbols: Create a shared language by clarifying vocabulary or symbols and provide non-text based ways of communicating like illustrations or graphics.
  • Comprehension: Provide background knowledge or bridge new concepts by organizing content effectively, highlight key elements, or provide prompts for cognitive processing. 

In this case, representation is not linked to the representation of different ethnicities, genders, ages, social or economic classes, ability, etc., but rather how content is represented. As an offshoot of Universal Design (UD), originally an architectural movement to address needs relating to ability, UDL in practice has followed UD’s example to address a need to interact with instructional materials from an ability perspective. UDL has normalized the learning experience in learning environments by supporting the idea that designing lesson materials that offer access through multiple modalities and expression creates an inclusive experience for all.5 The UDL checklist of multiple means of representation, expression, and engagement is particularly apt at normalizing learning into considerations of content delivery. It overlooks marginalized experiences and disenfranchised voices – including giving a voice to disability, an aspect of EDI.6 The UDL principle of engagement does include guidelines that support providing culturally relevant content as a means to increase student-centered relevancy and value for an individual learner. It is included in the context of optimizing learner interest. However, inclusivity and diversity within this guideline are at best a passive byproduct instead of the primary goal. Learning object design guidelines that encompass inclusionary design principles and nurture diversity in a culturally responsive way7 have yet to be explored for the learning object format.

Additionally, an examination of the Library Instruction for Diverse Populations Bibliography by the Instruction Section of the Association of College and Research Libraries (n.d.) includes literature on various populations in classroom settings, but few if any discuss learning objects. When fostering EDI in online spaces, accessibility is the framework by which learning object developers work. Clossen and Proces’ examination of library tutorials focuses on captions, screen audio coordination, link context, length, headings, and alternative text.8 The body of knowledge for learning object development focuses on the importance of and technical aspects of building accessible learning objects when speaking about inclusion. These efforts meet the needs of equitable access. Still, they do not touch upon inclusivity by creating an online environment where learners from minority or marginalized backgrounds can see themselves reflected in the learning experience.

The importance of creating inclusive instructional environments for learners is evident in the vast amount of literature devoted to the topic. Inclusivity shapes the value of self, promotes participation and access, and influences the desire to contribute.9 Pendell and Schroeder discuss bringing culturally responsive teaching into the classroom. This method “incorporates a multiplicity of students’ cultures and lived experiences into their education, improving their classroom engagement, content relevancy, and fostering diverse perspectives.”10 Theories and practices in instructional design follow suit. Instructional design processes like ADDIE (Analysis, Design, Development, Implementation, and Evaluation) and SAM (Successive Approximation Model) incorporate learner assessment into the early stages of the design process by identifying learner characteristics and taking account of the learners’ previous knowledge and experience.11 Both Keller’s ARCS Model of Motivation12 and Gagné’s Nine Events of Instruction13 speak to the necessity of gaining your learner’s attention to ensure reception and engagement with the material. Design Thinking14 offers methods that designers can use to empathize with learners. Despite recognized practices of considering learner populations’ characteristics, prior knowledge, and experiences, many learning objects ignore EDI in their design and development. Given the importance of EDI in teaching and learning and the shift to the online environment, the conversation needs to pivot to include asynchronous online learning environments and learning objects. We must acknowledge this overlooked area in our instruction practices.

Our Guidelines

There are resources in the literature that provide practical techniques to aid teaching librarians in creating inclusive in-person instruction. Some of these techniques are transferable to the design and development of learning objects. These include:

  • Provide culturally relevant examples (images, topics, authors, etc.) to all learners – not just the majority.15
  • Create a space where diverse experiences and knowledge are valued.16
  • Provide a choice as to how learners will interact with content.17

With these in mind, we drew on our collective experience as teaching librarians, designers, and technologists to create guidelines for inclusive learning object development.

Relevant Topics and Examples

As with in-person instruction, the examples used in online instruction can create an inclusive learning environment. Even if an object’s learning outcome is not something one might classify as “diversity” content, the example or topic choices within the learning object can subtly express the value of different viewpoints. For example, if the learning object goal is to teach learners how to create a citation in a specific style, the citation examples used in the object can be selected based on authors of color. In practice, you could accomplish this by providing an example of a general citation format followed by an example centering the voice of Black, Indigenous, and people of color (BIPOC) as seen in the table below.

APA Book Chapter Format

  • General guidelines: Author, A. A., & Author, B. B. (Year of publication). Title of chapter. In E. E. Editor & F. F. Editor (Eds.), Title of work: Capital letter also for subtitle (pp. pages of chapter). Publisher. DOI (if available)
  • Example citation: Amorao, A. S. (2018). Writing against patriarchal Philippine nationalism: Angela Manalang Gloria’s “Revolt from Hymen”. In G. Chin & K. Mohd Daud (Eds.), The southeast Asian woman writes back (Asia in Transition, vol. 6). Springer, Singapore.

Represent Diverse Experiences with Scenario-based Learning

Scenario-based learning uses realistic case studies to create learner-centered instruction.18 It is a widespread technique in e-learning and learning object development. Scenarios may aid storytelling and decision making, or check a learner’s knowledge of the material. If incorporated into a learning object, scenarios should reflect relatable learner experiences, and special attention to how scenarios incorporate minority experiences is essential. For example, a self-knowledge check featuring a first-year student’s experience might look like the following:

Ramona is writing a research paper about Mexican-American health disparities and is unsure where to start when looking at the library’s website. As a first-generation college student who lives at home and commutes to campus, there is no one to ask at home. Her assignment requires her to locate peer-reviewed journal articles. Based on the information provided earlier in the tutorial about the difference between a catalog and a database, select the resource from the multiple-choice options that would best help Ramona with her research.

Universal Design

Use the principles of universal design to provide multiple ways of engaging with learning object content, helping to sustain motivation and giving learners choices in how they interact with relevant content. Include audio, visual, and kinesthetic modes of representation, where applicable, as means for interaction and active learning. Most learning objects, by default, have text elements. In applying Universal Design for Learning, look for ways in which other modes of communication might be effective. In practice, we created a video to tell the “Story of Research” for a chemistry lab class. The story explains how a chemical experiment in a lab becomes an everyday product used by millions. The video used storytelling as a mode of communication to place learners within the shared community of chemistry researchers. It also provided a break from the otherwise text-based content.

Teacher/Student Representation 

When using characters to represent teachers or learners, provide multiple representations. Include details related to diverse races and ethnicities, and consider characteristics related to ability, body size, gender, and positions of authority. Pay attention to the order in which characters of color are presented, and be careful to avoid tokenism. If adding names and audio tracks for different characters, consider name choice, the sound of the voice, and character vocabulary. For example, we often use a TA rather than faculty as narrator to shrink the power gap between the traditional teacher and student roles.


Inclusive Language

The language used in object scripts and displayed text is a subtle but effective means of creating an environment of inclusivity. Multimedia principles suggest that a conversational style of speaking and the use of first- and second-person pronouns generate a feeling of personalization.19 Person-centered language and inclusive pronouns should be used, including words such as “I,” “we,” “you,” and “yours.” For example, a script using first- and second-person language might include, “In this tutorial, you will explore the definition of plagiarism and recognize plagiarism when you see it.” Person-centered language and inclusive pronouns in a script could look like, “Review the scenario in which Renita, a medical student at the top of her class, tries to save time on a paper by copying and pasting information from an article. Once you review the details of the scenario, decide whether or not you think they plagiarized by answering yes or no.”


Culturally Responsive Feedback

Provide authentic feedback that acknowledges and supports diverse learners. This means rethinking the default feedback template provided by software, which offers correct/incorrect feedback based on if-then logic. This dualistic mode of communication reinforces pro-Western views that emphasize individualized mastery and achievement.20 Feedback should strive to acknowledge the way learners create meaning within their existing cultural schemas; it can recognize alternative ways of thinking and use existing schemas to create new understanding. Feedback that meets this guideline for an “incorrect” answer within a plagiarism tutorial could be:

Answer: The use of paraphrase and citation is a strategy to prevent plagiarism.

In some cultures, repeating the ideas of scholars in a paper is a sign of respect, and it is an acceptable practice not to cite those ideas because the value of their words is culturally understood. In the United States and at UC San Diego, the practice is to use paraphrases or quotations with a cited reference to distinguish the ideas of scholars from your own. This practice will help you prevent violations of the UCSD Policy on Integrity of Scholarship and identify for your reader both the words of the scholars you reference and your own valued ideas.
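The if-then feedback logic described above can be sketched in code. This is a minimal, hypothetical Python example, not the tutorial's actual implementation: the function name and message text are illustrative, showing how a binary correct/incorrect template might be replaced with culturally responsive messages keyed to each answer.

```python
# Hypothetical sketch: swapping a bare correct/incorrect template
# for culturally responsive feedback keyed to each answer choice.
# The question, choices, and message text are illustrative only.

FEEDBACK = {
    "yes": (
        "Correct: paraphrase plus a cited reference distinguishes "
        "scholars' ideas from your own."
    ),
    "no": (
        "In some cultures, repeating scholars' ideas without citation "
        "is a sign of respect. In U.S. academic practice, paraphrase "
        "and a cited reference credit the scholar while highlighting "
        "your own contribution."
    ),
}

def feedback_for(answer: str) -> str:
    """Return a responsive message instead of a bare 'Incorrect'."""
    return FEEDBACK.get(answer.strip().lower(),
                        "Please answer yes or no.")
```

The point of the design is that the "incorrect" branch teaches rather than penalizes: it names the learner's existing schema before introducing the local convention.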


Accessibility

Web Content Accessibility Guidelines (WCAG) aim to create a “shared standard for web content” that removes barriers for individuals with disabilities.21 Learning objects should be built to meet WCAG 2.0 guidelines at the AA level. Inclusivity relating to accessibility goes beyond technical requirements; it means addressing the user experience of those who use assistive technology. For example, relying on visual markers alone to indicate a multiple-choice question ignores the experience of visually impaired learners. As a screen reader reads through the text on a screen, the learner might not realize that the content is an activity in multiple-choice format. An introductory statement, such as “you will be asked to answer three multiple-choice questions in the following section,” orients all learners to the activity.

By incorporating these guidelines into our design and development practices, we hope our learning objects will reflect our profession’s EDI principles.

In Practice: a Tutorial Example

To test our guidelines with learners, we designed and evaluated a tutorial that incorporated them.

Institutional Background

The University of California, San Diego (UCSD) Library serves a large and ethnically diverse student body. According to the UC San Diego Office of Institutional Research (2020), the student characteristics of our 39,576 students are:

  • African American 3.0%
  • American Indian 0.4%
  • Asian/Asian American 37.1%
  • Chicano/Latino 20.8%
  • International Citizen 17%
  • Native Hawaiian/Pacific Islander 0.2%
  • White 19%
  • Undeclared/Missing 2.5%

Due to large student enrollment at the University and limited librarian resources, it is necessary to provide some information literacy instruction through online learning objects.

Preventing Plagiarism Tutorial

The plagiarism tutorial is built using Storyline 360 and consists of three modules (Define, Prevent, Cite), each of which learners can test out of before proceeding. A teaching assistant, Maya, serves as the narrator who walks learners through the test-out procedures and welcomes them to the tutorial content if they cannot test out. Maya, depicted in Image 1, is not representative of the librarian who created the object. Created with Vyond, an animation tool, Maya is of average build with brown skin, brown eyes, and dark brown hair. Her features look traditionally female, and her name was chosen as a nod to Latin and Asian cultures. She uses first- and second-person pronouns to address the learner throughout the object. She is the first character introduced to the learner, and she describes herself as a teaching assistant to be less of an authority figure.

Maya is a female cartoon caricature of average height and weight. She has brown skin, brown shoulder length hair and brown eyes. Maya is wearing a purple sleeveless top, orange skirt, black flats.

(Image 1: Maya)

Other aspects of the learning object that feature characters include the scenario-based learning exercises. Student characters describe a writing scenario, and the learner determines whether the student in the scenario plagiarized. Of the characters depicted in Image 2, one is a thin Asian male in a preppy, masculine dress style; one is white with androgynous dress that leans toward a female presentation; and the other is a white female who is larger in size. Character attributes were limited to what was available within the third-party software.

Three cartoon caricatures of students. The first could be considered female of average height and weight. The character has black chin-length hair with purple streaks, wide-set blue eyes, and light skin. They are wearing a white sleeveless top under a black vest with red pants and black shoes. The second caricature could be considered male of average build with brown hair, smaller-set black eyes, and light skin. They are wearing a white long-sleeve shirt, rolled at the sleeves, under a light grey vest, a bowtie, brown pants, and black shoes. The third caricature could be considered female of larger build with reddish hair, brown eyes, and light skin. They are wearing red glasses, a pendant necklace, a green top, jeans, and white shoes.

(Image 2: Scenario Student Characters)

Each student character has a unique voice, narrated by library staff volunteers; in this tutorial, for example, an Asian staff member voiced the Asian character. If possible, ask BIPOC colleagues to voice BIPOC representations: the inclusion of BIPOC voices in the development of a learning object enhances the quality of personalization in the learning experience. Characters were referred to as “the student” instead of with binary pronouns. In hindsight, we should have used a mixture of pronouns to acknowledge various gender identities.

The examples used to explain plagiarism concepts were carefully considered. For example, the tutorial includes a common knowledge activity that asks the learner to identify whether a statement is common knowledge. The initial example stated, “In your paper, you reference the attack on Pearl Harbor as being the trigger for the United States to officially join World War II.”

In reviewing the tutorial, there was concern that this common knowledge example is United States-centric, especially given the number of international students enrolled at the university. Although the evaluative data showed that most learners knew the answer to this question, the statement was changed to, “In your paper, you write that penicillin is commonly used to fight infections.” This change creates a more inclusive learning experience by removing a historical reference point that does not consider the emotional weight it may carry in other world views.

The accessibility design of the learning object used a two-step process. First, the object was developed to meet WCAG 2.0 guidelines from a technical perspective. Once those guidelines were in place, a student worker from the campus’ Office of Students with Disabilities was consulted to work through the tutorial from a user perspective. This second step that included elements of participatory design in the object development resulted in the understanding that although a learning object is technically accessible, it could still result in a poor learning experience for those using assistive devices. Adjustments to content presentation and script after this meeting resulted in universal improvements for all. For example, a text introduction replaced a video introduction of the narrator. 


Evaluation

To determine how well the guidelines attended to our learners’ inclusive experience, we created an anonymous survey that captured 6,918 responses over eighteen months (January 2019-June 2020). Learners were asked to answer the following questions at the end of the tutorial and did not receive credit for completing the plagiarism tutorial until they had done so. Their responses were not linked to identifiable data.

  1. I noticed the diversity of characters in this tutorial (Y/N)
  2. In this module, you were introduced to the following characters. Which character’s appearance did you most identify with? (multiple choice)
  3. What features in character appearance would you like to see? (fill-in)
  4. Do you like having characters with diverse appearances? (Y/N)
  5. The module included a question about common knowledge. Was the example, “With the attack on Pearl Harbor, the US officially joined WWII” a common knowledge fact that you already knew? (Y/N)
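Summarizing the yes/no questions above is simple tallying. The following is a hypothetical Python helper, not our actual survey tooling; the function name and sample data are illustrative. Because responses are anonymous, only aggregate counts are ever computed.

```python
# Hypothetical sketch: tallying anonymous Y/N survey responses.
# Only aggregates are computed; no identifiable data is involved.
from collections import Counter

def percent_yes(responses):
    """Percent of 'Y' answers among Y/N responses, rounded."""
    counts = Counter(r.strip().upper() for r in responses)
    total = counts["Y"] + counts["N"]
    return round(100 * counts["Y"] / total) if total else 0

# e.g. percent_yes(["Y", "Y", "N", "Y"]) -> 75
```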

Learners noticed the diversity of characters.

We weren’t surprised to learn that most learners (87%) noticed the diversity of characters in the tutorial, which suggests they would also notice a lack of diversity. It confirms that the way characters look matters to learners.

Many learners did not identify with any of the characters.

Perhaps learners noticed the diversity of characters because a large share of them (42%) did not identify with any character. While this was disappointing, it is useful feedback as we think about making learning objects more diverse and inclusive. Of those who identified with a character, 17% identified with the narrator, 9% with the white, androgynously dressed female, 25% with the male Asian character, and 6% with the white female character.

Some learners wanted to see many characteristics represented.

Respondents were given a text box to type any characteristics they wanted to see represented in the Library’s learning objects. We asked, “What features in the character’s appearance would you like to see?” Of the 3,062 respondents, 19% stated no preference or that they didn’t care about character features, and another 10% answered “no” or “none.” The word cloud in Image 3 below illustrates the most popular terms learners used to describe the characteristics they would like to see.

Asian, hair, color, dark, Black, skin, female, eyes, disability, brown, hijab, variety, white, Latin, glasses, clothes, guy, blonde, Hispanic, people of color, girl.

(Image 3: Word Cloud Response)
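A word cloud like Image 3 is built from term frequencies in the free-text responses. This is a hypothetical Python sketch of that counting step, not the tool we actually used; the stopword list and sample responses are illustrative.

```python
# Hypothetical sketch: counting terms in free-text survey responses,
# as one might do before rendering a word cloud. The stopword list
# and sample responses are illustrative only.
import re
from collections import Counter

STOPWORDS = {"i", "a", "the", "to", "see", "of", "and",
             "more", "would", "like"}

def term_counts(responses):
    """Tally non-stopword terms across all responses."""
    counts = Counter()
    for text in responses:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts

# term_counts(["More dark skin tones", "dark curly hair"])["dark"] -> 2
```

In practice, a tally like this also surfaces the blank and “don’t care” responses that need separate handling before interpreting the remaining terms.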

Some standout terms include skin, hair, black, dark, curly, diversity, Africa, gender, and Hispanic. We observed that in some cases the terms didn’t reflect the statistical breakdown of campus demographics. For example, “Black” and “Africa” were used more than terms like “Hispanic” and “Latino,” even though UCSD has more Chicano/Latino students (20.8%) than African American students (3.0%). This suggests an opportunity for more research into whether learners simply want to see themselves represented or want multiple groups/ethnicities/cultures represented. An alternative interpretation is that when learners think about diversity, they think specifically of Black/African representations rather than a more holistic viewpoint. Additionally, the large number of “don’t care” and blank responses could indicate that while learners like seeing diverse characters, they don’t necessarily care how they are represented; it could also suggest that they were not invested in providing feedback and didn’t want to answer the question.

Most learners liked seeing a diversity of characters represented.

Learners overwhelmingly (94%) liked seeing a diversity of characters represented in the tutorials.

Most learners knew the U.S. History common knowledge reference.

UCSD has a growing international student population, and we were curious whether typical common knowledge references (e.g., the event that marked the United States entering WWII) would resonate with learners who may not have studied in the U.S. before college. The majority of learners (86%) reported knowing this fact. One limitation is that this question relied on self-reporting; there is no way to confirm whether learners really knew the answer or were reluctant to admit, even in an anonymous survey, that they didn’t. The survey also did not ask learners how they identified themselves, so we do not know how many respondents came from the university’s international student population.

The results clearly illustrate that learners value diverse characters in learning objects. They notice which groups are represented. However, we need to learn more about how learners identify and include that representation in learning objects to make them more inclusive.

Our Next Steps

In our practice, we plan to continue gathering feedback from learners to improve the EDI of the library’s learning objects. We will add a question that captures respondent demographics so that we can compare specific learner demographics with what they want to see more of (e.g., more characters that look like themselves or more that look different). We will undertake a systematic review of the library’s learning objects to identify ways to incorporate more diverse examples and apply our best practices, especially in the areas of topic and example choice. Staying up to date with the university’s growing student population will also become a priority, as student growth may bring a change in student demographics. In addition, we will continue to refine the EDI requirements in our evaluation checklist for instructional software and image packages representing student characterizations, so that we financially support companies that invest in EDI. We will also use software feature requests to add our voice to the growing call for more diverse student characterizations in e-learning authoring tools.

The Preventing Plagiarism Tutorial provided us with an opportunity to use our guidelines and learn more about the mechanics of developing more inclusive learning objects. In the process, we learned that small changes have a significant impact. We also learned that there is much more work to be done. While inclusive design is our immediate goal, there is also work relating to design justice for consideration. We plan to actively include marginalized stakeholder voices (e.g. students, librarians, faculty) in the planning process. We will adopt a small moves/big moves framework22 where we attend to immediate changes that promote inclusive design and keep an eye toward big moves that strive to incorporate design justice into our workflows. 

Moving Beyond Universal Design in Learning

Putting EDI principles into practice begins in the design phase of the learning object. Collier advocates for the practice of inclusive design and design justice in higher education.23 “Inclusive design emphasizes the contribution that understanding user diversity makes to informing these decisions, and thus to including as many people as possible.”24 While UDL focuses on equal opportunity and technical accessibility, inclusive design “celebrates difference and focuses on designs that allow for diversity to thrive.”25 Within higher education (and certainly when designing and developing information literacy learning objects), this can be accomplished in the design phase by including diverse stakeholders on the team that makes decisions about content, instructional examples, and the feedback provided to learners. In addition to being inclusive, our design decisions must be just. Design justice centers people with marginalized experiences in the design process to address the design choices that affect them.26 It urges educators to consider who is exploited and marginalized by design decisions, and questions who even gets to make the design decisions in the first place, and why.

At UC San Diego, we have practiced inclusive design in the following ways: 

  • Incorporating Design Thinking27 into our design practices. The methods offered by this framework facilitate our understanding of our learners’ experience with using the library’s services and collections. This is where we see the biggest opportunities to incorporate the goals of Design Justice. 
  • Including diversity questions in our learning object evaluation forms to help us understand what resonates with our learners and how we could improve connecting with them through our design decisions about character appearance, examples used, etc. 
  • Proactively purchasing instructional software or image packages that include teachers’ and learners’ representations. We have added a diversity and inclusion checklist as part of the software review before purchasing. 

While we have made some progress incorporating inclusive design in our practices there is more work to do. We recognize the constraints of the learning object format. While best practices in creating culturally responsive online learning offer ways in which to best utilize the online classroom environment to create inclusive learning communities, those techniques aren’t as readily available within the learning object mode of delivery (e.g. discussion boards or collaborative group work).28 Thus, we plan to explore incorporating more learner-led participatory design into the workflow. In doing so we hope our learners can help us build upon our existing inclusive learning practices with an eye toward incorporating the goals of design justice. 


Conclusion

Our goal is to start a conversation about how diversity and inclusion practices can extend to creating learning objects. In doing so, we hope to begin to fill the gap in the literature so that designers of learning objects have a reference point for their future work. As teaching librarians move information literacy instruction into the online learning environment via learning objects, the importance of extending EDI principles cannot be overstated. The messages that we send through our design choices affect our learners in a variety of ways and become even more important when the teacher and learner are unseen participants in the learning process. Each design choice is an opportunity to send a message of inclusion to learners.

  • Choosing culturally relevant topics that speak to a diverse student body sends a message that student experience and interests are important.
  • Representing BIPOC voices in learning object examples conveys to learners that diverse voices are welcomed and heard.
  • Going beyond multiple modes of delivery to ensure that tutorials consider ability beyond technical requirements as part of the EDI experience creates inclusivity for an often forgotten group.
  • Using inclusive pronouns and person-centered language lets learners know they are respected and seen as the holistic people they are.
  • Providing culturally responsive feedback acknowledges and respects existing knowledge.

By adding an EDI lens to the Universal Design for Learning framework for learning object creation, we begin to take steps that create inclusive online learning. This reflection ultimately improves learner motivation and investment in the learning process. 


Acknowledgements

We would like to posthumously recognize Gayatri Singh for participating in our work to create more diverse and inclusive learning objects. Her work in equity, diversity, and inclusion is a guiding light that influences us as we strive to create inclusive learning environments. We would also like to thank our colleagues in the Library, Academic Integrity Office, and Office of Students with Disabilities at UC San Diego for their input and feedback throughout the design and development of the plagiarism tutorial. Finally, we wish to thank the publishing editor of the submitted article, Ian Beilin, and peer reviewers Nicole Cooke and Sylvia Page for their thoughtful feedback.


References

ACRL IS Instruction for Diverse Populations Committee. (n.d.). Library instruction for diverse populations bibliography.

Allen, M. W., & Sites, R. A. (2012). Leaving ADDIE for SAM: An agile model for developing the best learning experiences. American Society for Training & Development.

Blummer, B., & Kritskaya, O. (2009). Best practices for creating an online tutorial: A literature review. Journal of Web Librarianship, 3(3), 199-216. doi:10.1080/19322900903050799

Branch, R. M. (2009). Instructional design: The ADDIE approach. Springer.

Chávez, A. F., Longerbeam, S. D., & White, J. L. (2016). Teaching across cultural strengths: A guide to balancing integrated and individuated cultural frameworks in college teaching. Stylus Publishing LLC. 

Clark, R. C., & Mayer, R. E. (2011). E-learning and the science of instruction: Proven guidelines for consumers and designers of multimedia learning. Pfeiffer.

Clark, R. C., & Mayer, R. E. (2013). Scenario-based e-learning: Evidence-based guidelines for online workforce learning. John Wiley & Sons.

Clossen, A., & Proces, P. (2017). Rating the accessibility of library tutorials from leading research universities. Portal: Libraries and the Academy, 17(4), 803-825. doi: 10.1353/pla.2017.0047

Collier, A. (2020, October 26). Inclusive Design and Design Justice: Strategies to Shape Our Classes and Communities. EDUCAUSE Review.

Dam, R., & Siang, T. (2019). What is design thinking and why is it so popular? Interaction Design Foundation.

Design Justice Network. (n.d.). 

Dolmage, J. T. (2017). Academic ableism: Disability and higher education. University of Michigan Press.

Gagné, R. M. (1985). The conditions of learning and theory of instruction (4th ed.). Holt, Rinehart & Winston.

Grassian, E., & Kaplowitz, J. (2009). Information literacy instruction: Theory and practice. Neal-Schuman Publishers.

Hamraie, A. (2017). Building access: Universal Design and the politics of disability. University of Minnesota Press.

Henry, S. L. (2018). Web content accessibility guidelines (WCAG) overview. W3C.

Holmes, K. (2018). Mismatch: How Inclusion Shapes Design. The MIT Press.

Inclusive Design Research Center (n.d.).What Is Inclusive Design?

Keller, J. M. (2010). Motivational design for learning and performance. The ARCS model approach. Springer.

Mestre, L. (2009). Culturally responsive instruction for teacher-librarians. Teacher Librarian, 36(3), 8-12.

Pendell, K., & Schroeder, R. (2017). Librarians as campus partners: Supporting culturally responsive and inclusive curriculum. College & Research Libraries News, 78(8), 414-414. doi:10.5860/crln.78.8.414

Richards, H., Brown, A. F., & Forde, T. B. (2007). Addressing diversity in schools: Culturally responsive pedagogy. Teaching Exceptional Children, 36(3), 64.

Smith, D. R., & Ayers, D. F. (2006). Culturally responsive pedagogy and online learning: Implications for the globalized community college. Community College Journal of Research and Practice, 30(5-6), 401-415. doi:10.1080/10668920500442125

University of California, San Diego Institutional Research. (2020). UC San Diego undergraduate enrollment by diversity.

Webb, K., & Hoover, J. (2015). Universal design for learning (UDL) in the academic library: A methodology for mapping multiple means of representation in library tutorials. College & Research Libraries, 76(4), 537-553. doi: 10.5860/crl.76.4.537

West Virginia Department of Education. (n.d.). UDL guidelines checklist.

“What Is Inclusive Design,” Inclusive Design Research Centre (website), n.d., accessed April 10, 2020. 

Wiley, D. A. (2000). Connecting learning objects to instructional design theory: A definition, a metaphor, and a taxonomy. In The instructional use of learning objects: Online version.

Woodley, X., Hernandez, C. M., Parra, J. L. & Negash, V. (2017). Celebrating difference: Best practices in culturally responsive teaching online. Tech Trends 61(2), doi:10.1007/s11528-017-0207-z

Yale Poorvu Center for Teaching and Learning. (2017). Learning styles as a myth.


Image Text Descriptions

Image 1: Maya Text Description

Maya is a female cartoon caricature of average height and weight. She has brown skin, brown shoulder length hair and brown eyes. Maya is wearing a purple sleeveless top, orange skirt, black flats.

Image 2: Scenario Student Characters

Three cartoon caricatures of students. The first could be considered female of average height and weight. The character has black chin length hair with purple streaks, wide set blue eyes, and light skin. They are wearing a white sleeveless top under a black vest with red pants and black shoes. 

The second caricature could be considered male of average build with brown hair, smaller set black eyes, and light skin. They are wearing a white long sleeve shirt, rolled at the sleeves under a light grey vest, bowtie, brown pants and black shoes.

The third caricature could be considered female of larger build with reddish hair, brown eyes, and light skin. They are wearing red glasses, a pendant necklace, a green top, jeans, and white shoes.

Image 3: Word Cloud Response Text

Asian, hair, color, dark, Black, skin, female, eyes, disability, brown, hijab, variety, white, Latin, glasses, clothes, guy, blonde, Hispanic, people of color, girl.

  1. Wiley, 2000, p. 7
  2. Inclusive Design Research Center, n.d., para. 1
  3. Blummer & Kritskaya, 2009
  4. Webb & Hoover, 2015
  5. Dolmage, 2017
  6. Dolmage, 2017; Hamraie, 2017
  7. Richards, Brown, Forde, 2007
  8. Clossen & Proces, 2017
  9. Holmes, 2018
  10. Pendell & Schroeder, 2017, p. 414
  11. Branch, 2009; Allen & Sites,  2012
  12. Keller, 2010
  13. Gagné, 1985
  14. Dam & Siang, 2019
  15. Mestre, 2009
  16. Chávez, Longerbeam, & White, 2016
  17. Grassian & Kaplowitz, 2009
  18. Clark & Mayer, 2013
  19. Clark and Mayer, 2011
  20. Smith & Ayers, 2006
  21. Henry, 2018
  22. Collier, 2020
  23. Collier, 2020
  24. “What is inclusive design,” n.d.
  25. Collier, 2020
  26. Design Justice Network (n.d)
  27. Dam & Siang, 2019
  28. Woodley, Hernandez, Parra & Negash, 2017