Planet Code4Lib

Operational Images / Ed Summers

Oliver Gaycken invited me to come talk about Documenting the Now with his Visualizing Knowledge: From Data to Images class. The course focuses on the history of visualization:

Visualizations do not show us things that are evident—visualizations make things evident. Visualizations, in other words, reveal something about the world that would not have been obvious without the work they do.

Instead of trotting out some of the tools we’ve developed on Documenting the Now I thought it might be more appropriate to talk about a specific case study that I was involved with partly as a result of my work on Documenting the Now.

A few years ago I got to work with Damien Pfister and some others in the Communications Department at UMD on a project to analyze the rhetoric of computational propaganda that occurred on Facebook during the 2016 election. These were Facebook posts that were released to Congress as part of the Mueller investigation into the operations of the Internet Research Agency.

The plan is to just share some background and the impetus for the work, how the released PDFs were processed as data, and then used as a corpus for annotation with a set of “codes” for qualitative analysis.

In the process of putting together the slides, and thinking about the project from a media studies angle, it occurred to me that this project connected somewhat with my interest in the work of Trevor Paglen and Harun Farocki’s idea of Operational Images. For Farocki operational images are images that are made by machines for machines, or as Paglen says:

Harun Farocki was one of the first to notice that image-making machines and algorithms were poised to inaugurate a new visual regime. Instead of simply representing things in the world, the machines and their images were starting to “do” things in the world. In fields from marketing to warfare, human eyes were becoming anachronistic. It was, as Farocki would famously call it, the advent of “operational images.” (Paglen, 2014)

Of course machines don’t independently make images for other machines–they do so because people tell them to. Operational images are often part of a control apparatus, designed by people for people (Deleuze, 1992). While these IRA ads do not appear to be made autonomously by machines, they were composed, targeted and delivered with machines, with very specific aims in mind.

Post metadata and image embedded in PDF

Here I’m also reminded of Amelia Acker’s work on data craft, where “manipulators … create disinformation with falsified metadata, specifically platform activity signals” (Acker, 2018). The use of metadata is clearly apparent in the way these ads were targeted, and their content was honed to foment division and polarization.

I guess this should have been an obvious connection from the start when I was working on the IRAds project…but having to get up and talk about your work is always clarifying in surprising ways. It seems like this idea of operational images might need some updating or clarification in light of data craft?


Acker, A. (2018). Data craft: The manipulation of social media metadata. Data & Society. Retrieved from

Deleuze, G. (1992). Postscript on the societies of control. October, 59(Winter), 3–7. Retrieved from

Paglen, T. (2014). Operational Images. E-Flux, (59). Retrieved from

Request for Participation in the Web Archiving Survey Working Group / Digital Library Federation

Re-launched in 2021, the Web Archiving Survey Working Group plans to conduct a survey of organizations in the United States and beyond that are actively involved in, or interested in starting, programs to archive content from the Web. This survey, to be published in 2022, will build on previous iterations of the ‘Web Archiving in the United States’ surveys, published in 2017, 2016, 2013, and 2011. 

Before work begins in earnest, the Web Archiving Survey Working Group co-chairs, Zakiya Collier and Samantha Abrams, are seeking 3-4 additional Working Group volunteers to review previous surveys and design a new survey, publish the survey and collect responses, review the responses and write the report, and present results and work at upcoming conferences. Volunteers should represent a range of institutions, types, and locations, and will explicitly include one student (or a recent graduate) working towards their Master’s degree in Library and Information Studies in order to engage them with the NDSA and its members.

It’s estimated that joining the Web Archiving Survey Working Group will be a 9 month commitment (4-5 hours of work per month), with work beginning in December 2021.

Those interested in serving should complete this form by Friday, November 12. Co-chairs will review the responses and reach out with next steps soon thereafter.

Questions? Please email Samantha Abrams and Zakiya Collier at

The post Request for Participation in the Web Archiving Survey Working Group appeared first on DLF.

Meet the 2021 DLF Forum Community Journalists / Digital Library Federation

2021 DLF Forum Community Journalists

The 2021 DLF Forum, like last year’s event, will take place online again November 1-3. Because we aren’t convening in person and registration is low-cost, we decided to reprise last year’s extremely successful Community Journalist program.

The guiding purpose of this year’s DLF Forum is once again community-based; this year, we are planning our events to be a sustaining experience for our community. Our Community Journalist program will again help us highlight new voices from the field.

We will again be providing $250 stipends to a cohort of 10 DLF Forum attendees from a variety of backgrounds and feature their voices and experiences on the DLF blog after our events this fall. Read what last year’s Community Journalists had to say about their experiences at the 2020 DLF Forum.

Head to the DLF Forum website to meet this year’s outstanding cohort!

The post Meet the 2021 DLF Forum Community Journalists appeared first on DLF.

We Are So Screwed / David Rosenthal

Last month I wrote The Looming Fossil Fuel Crash to refine my thoughts for a discussion with my financial advisers. The TL;DR was that the short-term focus and slow, corrupted decision-making process of large companies and institutions mean that their response to the need to transition to low-carbon energy will be too slow and too late. The result will be a sudden crash in the value of fossil fuel and related stocks, enough to tank the whole market.

In case you think I'm panicking, the New York Times catches up with me in U.S. Warns Climate Poses ‘Emerging Threat’ to Financial System by Alan Rappeport and Christopher Flavelle:
Climate change is an “emerging threat” to the stability of the U.S. financial system, top federal regulators warned in a report on Thursday, setting the stage for the Biden administration to take more aggressive regulatory action to prevent climate change from upending global markets and the economy.
Higher temperatures are leading to more natural disasters, such as hurricanes, wildfires and floods. These, in turn, are resulting in damaged property, lost income and disruptions to business activity that threaten to alter how assets, such as real estate, are valued.

At the same time, the move away from fossil fuels could cause a sudden drop in the price of stocks and other assets tied to oil, gas, coal and other energy companies, or sectors that rely on them such as carmakers and heavy manufacturing. Such a shift could hurt the stock market, retirement savings and other parts of the financial sector.
Below the fold, an even more depressing update.

The report from the Financial Stability Oversight Council wasn't the only one released this week. The Director of National Intelligence released the National Intelligence Estimate on Climate Change. The report's key takeaways are:
We assess that climate change will increasingly exacerbate risks to US national security interests as the physical impacts increase and geopolitical tensions mount about how to respond to the challenge. Global momentum is growing for more ambitious greenhouse gas emissions reductions, but current policies and pledges are insufficient to meet the Paris Agreement goals. Countries are arguing about who should act sooner and competing to control the growing clean energy transition. Intensifying physical effects will exacerbate geopolitical flashpoints, particularly after 2030, and key countries and regions will face increasing risks of instability and need for humanitarian assistance.
  • As a baseline, the IC uses the US Federal Scientific community’s high confidence in global projections of temperature increase and moderate confidence in regional projections of the intensity of extreme weather and other effects during the next two decades. Global temperatures have increased 1.1°C since pre-industrial times and most likely will add 0.4°C to reach 1.5°C around 2030.
  • The IC has moderate confidence in the pace of decarbonization and low to moderate confidence in how physical climate impacts will affect US national security interests and the nature of geopolitical conflict, given the complex dimensions of human and state decisionmaking.
Damian Carrington reported on another one in Planned fossil fuel output ‘vastly exceeds’ climate limits, says UN:
Fossil fuel production planned by the world’s governments “vastly exceeds” the limit needed to keep the rise in global heating to 1.5C and avoid the worst impacts of the climate crisis, a UN report has found.

Despite increasing pledges of action from many nations, governments have not yet made plans to wind down fossil fuel production, the report said. The gap between planned extraction of coal, oil and gas and safe limits remains as large as in 2019, when the UN first reported on the issue. The UN secretary general, António Guterres, called the disparity “stark”.

The report, produced by the UN Environment Programme (Unep) and other researchers, found global production of oil and gas is on track to rise over the next two decades, with coal production projected to fall only slightly. This results in double the fossil fuel production in 2030 that is consistent with a 1.5C rise.
That is a very scary graph.

Mark Sumner writes:
In particular, the sharply dropping cost of electricity from solar and wind—along with the increasing popularity of electric cars—places fossil fuels of all types in a unique bind. Prices may be high at the moment on speculation over near-future demands from China, but all those fuels—coal, oil, and natural gas—could lose value almost overnight. To a large extent, this has already happened with coal, with 11 of the top U.S. mining companies going through bankruptcy after the industry dropped sharply from 2008 peaks.

The whole fossil fuel sector could see much of its value erased as demand for those fuels crashes and investors take flight. Considering the size and value currently assigned to some of these companies, such a shift could not just spell doom for the fossil fuel corporations, but leave state governments, retirement funds, and individual investors holding the (suddenly empty) bag. Homes in areas dedicated to coal mining or oil and gas drilling could become worthless. So could massive refineries, giant port facilities, and thousands of miles of pipeline.

How those things are managed when the companies that once profited from them are no longer around is unclear. So is what to do about thousands of people stranded in areas that have lost their economic engine.
Patricia Espinosa, executive secretary of the UN Framework Convention on Climate Change, said:
“We’re really talking about preserving the stability of countries, preserving the institutions that we have built over so many years, preserving the best goals that our countries have put together. The catastrophic scenario would indicate that we would have massive flows of displaced people.”

The impact would cascade, she said, adding: “It would mean less food, so probably a crisis in food security. It would leave a lot more people vulnerable to terrible situations, terrorist groups and violent groups. It would mean a lot of sources of instability.”
As if all that wasn't bad enough, Oil System Collapsing so Fast It May Derail Renewables Warn French Government Scientists by Nafeez Ahmed is even worse. It is based upon Peak oil and the low-carbon energy transition: A net-energy perspective by Louis Delannoy et al, and starts from their observation that the energy required to produce a barrel of oil or a cubic foot of natural gas has been increasing, as the easiest deposits are exploited first.

The weighted average Energy Return On Investment (EROI) has declined by a factor of around 5 in the past 70 years. On average, only about 85% of the energy coming out of the ground makes it into the pipelines. But the 15% also represents carbon emissions. The 85% is projected to be around 65% by 2050. Thus even if the plans in the UN report's graph were to come to fruition, and the temperature thus to rise nearly 3°C, the amount of usable energy from fossil fuels would decrease.
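
To make the arithmetic concrete, here is a minimal Python sketch. It assumes the simplest possible model, in which net energy is gross energy times the fraction that makes it into the pipelines. The function names and the figure of 100 units of gross energy are illustrative assumptions, not taken from Delannoy et al.

```python
# Illustrative sketch only: net energy modeled as gross energy times the
# fraction that survives extraction. The 0.85 and 0.65 fractions come from
# the text above; everything else (names, 100 units of gross energy) is a
# made-up example, not data from Delannoy et al.

def net_energy(gross: float, net_fraction: float) -> float:
    """Energy left over after paying the energy cost of extraction."""
    return gross * net_fraction


def gross_growth_to_hold_net(fraction_now: float, fraction_later: float) -> float:
    """Fractional growth in gross output needed to keep net energy constant."""
    return fraction_now / fraction_later - 1.0


if __name__ == "__main__":
    today = net_energy(100.0, 0.85)
    flat_2050 = net_energy(100.0, 0.65)
    needed = gross_growth_to_hold_net(0.85, 0.65)
    print(f"net today: {today:.1f}")
    print(f"net in 2050 at the same gross output: {flat_2050:.1f}")
    print(f"gross growth needed to hold net flat: {needed:.1%}")
```

Under these assumptions, gross output would have to grow by roughly 31% by 2050 just to keep net energy where it is today.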

Delannoy et al Fig 1
Delannoy et al's Figure 1 shows the gross and net oil energy history and projects it to 2050. The gross energy, and thus the carbon emission, peaks around 2035, but the net energy peaks in about 5 years. They conclude:
Our findings question the feasibility of a global and fast energy transition, not in terms of stocks of energy resources, but in terms of flows. They imply that either the global energy transition takes place quickly enough, or we risk a worsening of climate change, a historical and long-term recession due to energy deficits (at least for some regions of the globe), or a combination of several of these problems. In other terms, we are facing a three-way conundrum: an energy transition that seems more improbable every passing year, increasing environmental threats and the risk of unprecedented energy shortages and associated economic depression in less than two decades.
The problem is that replacing fossil fuels with renewables requires energy input before they output energy. Ahmed writes:
So if we delay the clean energy transformation for too long, there might not be enough energy to sustain the transition in the first place — leading to a ‘worst of all worlds’ scenario: the collapse of both the fossil fuel system and the ability to create a viable alternative.
There is a small ray of hope. Empirically grounded technology forecasts and the energy transition by Way et al shows that:
if solar photovoltaics, wind, batteries and hydrogen electrolyzers continue to follow their current exponentially increasing deployment trends for another decade, we achieve a near-net-zero emissions energy system within twenty-five years. In contrast, a slower transition (which involves deployment growth trends that are lower than current rates) is more expensive and a nuclear driven transition is far more expensive.
Delannoy et al may have underestimated both the rate and the energy demands of deploying renewables.

So how are governments' "slow AIs" responding to this barrage of warnings? Here are Justin Rowlatt & Tom Gerken reporting for the BBC in COP26: Document leak reveals nations lobbying to change key climate report:
The leak reveals Saudi Arabia, Japan and Australia are among countries asking the UN to play down the need to move rapidly away from fossil fuels.

It also shows some wealthy nations are questioning paying more to poorer states to move to greener technologies.

This "lobbying" raises questions for the COP26 climate summit in November.

The leak reveals countries pushing back on UN recommendations for action and comes just days before they will be asked at the summit to make significant commitments to slow down climate change and keep global warming to 1.5 degrees.
And the UN report:
found that countries have directed more than $300bn (£217bn) of new public finance to fossil fuel activities since the beginning of the Covid-19 pandemic, more than that provided for clean energy.
A major beneficiary of these taxpayer subsidies is coal millionaire Joe Manchin, currently preventing the US government taking action to reduce carbon emissions:
Joe Manchin, the powerful West Virginia Democrat who chairs the Senate energy panel and earned half a million dollars last year from coal production, is preparing to remake President Biden’s climate legislation in a way that tosses a lifeline to the fossil fuel industry — despite urgent calls from scientists that countries need to quickly pivot away from coal, gas and oil to avoid a climate catastrophe.
So what about corporate "slow AIs"? Of all of them, the insurance industry's should be the most worried, since they'll be on the hook for a lot of the costs. But The Oil Merchant in the Gray Flannel Suit by Alexander Sammon explains:
One would think, then, that the insurance industry would be among the most forceful advocates for large-scale intervention on climate change, based on, if nothing else, that good old market homily that is self-interest. The losses keep mounting in the absence of aggressive measures. Either the federal government, in tandem with private investment, pays for a major decarbonization program, or the insurance companies pay for the cleanup. That shouldn’t be a tough decision for a bunch of fund managers to make.

Furthermore, because of their clout and institutional power, one would also assume that if insurers put their mind to it, they could be unusually effective advocates for a green transition. Just about the only time anything gets done in Washington is in those rare moments when a corporation or industry decides they want it to happen. That could be the legacy of insurance companies and climate change.

And yet insurers have not been terribly vocal about the climate crisis. In fact, they’ve been highly resistant to even small-bore climate solutions, instead opting for tepid statements about environmental sustainability.
The "slow AIs" are OK making "tepid statements", but what they are doing is much worse:
But by loading up on stocks of oil and gas companies and energy utilities, purchasing corporate debt of coal and other fossil fuel firms, and underwriting the development of new infrastructure like pipelines and plants, much of which is being done at record rates, the insurance industry is currently propping up the industry that is expediting its own demise. Insurance companies are financially vulnerable to the ravages of climate change, but they also happen to be profiting off of its acceleration.
Some insurers' statements are a bit less tepid. Here is Peter Giger, Group Chief Risk Officer, Zurich Insurance Group writing for the World Economic Forum:
We need to act now on our climate. Act like these tipping points are imminent. And stop thinking of climate change as a slow-moving, long-term threat that enables us to kick the problem down the road and let future generations deal with it. We must take immediate action to reduce global warming and fulfil our commitments to the Paris Agreement, and build resilience with these tipping points in mind.

We need to plan now to mitigate greenhouse gas emissions, but we also need to plan for the impacts, such as the ability to feed everyone on the planet, develop plans to manage flood risk, as well as manage the social and geopolitical impacts of human migrations that will be a consequence of fight or flight decisions.
But notice that the call to urgent action is addressed to "we". There's nothing in his post about actions that Zurich Insurance Group is going to take.

And of course the fossil fuel "slow AIs" are hard at work preventing action, as Hiroko Tabuchi reports in In Your Facebook Feed: Oil Industry Pushback Against Biden Climate Plans:
The ads appear on Facebook millions of times a week. They take aim at vulnerable Democrats in Congress by name, warning that the $3.5 trillion budget bill — one of the Biden administration’s biggest efforts to pass meaningful climate policy — will wreck the United States economy.
The paid posts are part of a broad attack by the oil and gas industry against the budget bill, whose fate now hangs in the balance. Among the climate provisions that are likely to be left out of the plan is an effort to dismantle billions of dollars in fossil-fuel tax breaks — provisions that experts say incentivize the burning of fossil fuels responsible for catastrophic climate change.

On Thursday, details emerged of an agreement between Senator Chuck Schumer of New York, the majority leader, and Senator Joe Manchin III of West Virginia, a Democrat with huge sway in the divided Senate who has said he doesn’t support such an expansive bill. According to a memo outlining the agreement, first obtained by Politico, Mr. Manchin said that if the legislation were to include extensions of smaller tax credits for wind and solar power, it shouldn’t undo tax breaks for fossil fuel producers.
The Economist has a long article examining the prospects for the upcoming COP26 conference, hosted by Britain's buffoon of a Prime Minister. It is entitled Broken promises, energy shortages and covid-19 will hamper COP26 and reaches this conclusion:
Any progress made at COP26 will probably be incremental, not a “big leap” of the sort John Kerry, America’s climate envoy, has promised. That will enrage grassroots activists. And it hardly matches the scale of the challenge. Two years from now a “Global Stocktake” scheduled under the Paris agreement will examine how well governments are implementing their climate plans. If their most recent climate promises are any indication, the stocktake could reveal a rather bare cupboard.
How do you think you'll manage in a world 2.5-3°C hotter? You'd better start figuring it out, because that's where we are heading.

Humane Ingenuity 41: Zen and the Art of Winemaking / Dan Cohen

Here are sixteen “sketches of a 3D printer by Leonardo da Vinci,” as envisioned by AI using those words as a prompt:


By Rivers Have Wings and John David Pressman, using a CLIP-guided diffusion.

Multimedia essays from the Plant Humanities Lab were recently posted and are worth a look. These pieces use Juncture, a new open-source tool developed by JSTOR Labs that provides a scholarly version of the scrollytelling that has become common in more mainstream media since the New York Times published “Snow Fall: The Avalanche at Tunnel Creek.” Interactive maps, archival images, and text are blended into a compelling narrative about specific plants, and behind the scenes the PHL connects to repositories of data, art, books, and articles about those plants, including WikiData, JSTOR, Artstor, and the Biodiversity Heritage Library.


All of this tech is in the service of rich narratives about the origins and cultural roles of plant life. Despite the variety, there are common themes: the loss of biodiversity; how commerce and cultural exchange have been global for millennia, not just decades; how imperialism made those interactions extractive and deeply troubling; and yet also how indigenous cultures and profound local knowledge have managed to shape all of our lives, food, art, and beliefs.

A good place to start is with the strange history of the banana, as illustrated by Ashley Buchanan in one of her Juncture-powered PHL pieces.


The alphabet of banana varieties

Wine is one of the oldest plant derivatives. A year ago, at the lowest point of the pandemic, our family decided to try our hand at making wine. We had several hundred pounds of grapes shipped to us from the West Coast. What followed was great fun; I heartily recommend the entire experience, from pressing the grapes to bottling.


There were a lot of steps in between, but there is really not that much to making wine: it is, after all, just the fermentation of grape juice. However, small chemical traits and decisions can change the profile of a wine dramatically. Little additions the size of a thimble in a large barrel might radically alter the ultimate composition.

In addition to the reds we made from Zinfandel grapes from Sonoma, we made three batches of Chardonnay from grapes grown near the Columbia Gorge in Washington, and we altered each batch with minute variations: one using a common American yeast, one using a rare yeast isolated from the Rhône Valley in France, and a third in which we added a secondary process, called malolactic fermentation, using bacteria from Oregon. Remarkably, even though 99% of the liquid is the same in each batch, they taste subtly but noticeably distinct. Much of this comes from the differing balance of three acids — tartaric, malic, and lactic. Malolactic fermentation, as the name suggests, transforms malic acid into lactic acid, and the taste from a sharper profile into a rounder, milkier one. (Yes, lactic as in milk.)

As with all of our contemporary pursuits, apps and computational methods are available to guide you. You can fill digital notebooks with measurements and analyses of the acids, sugars, and other chemical properties of the juice. If you really want to nerd out, you can download gigantic spreadsheets that allow you to tweak those properties through extensive calculations.


Pixar, the legendary animation studio, says that their process of rendering computer graphics is about “turning math into emotions”; similarly, winemaking, among some practitioners, is the process of turning math into taste. 

But winemaking is instructive in our technological age because it pushes back against these algorithms. It may be chemical, but it also feels alchemical. The long time horizons of winemaking (it takes a year or more to assess how your handiwork came out) delay and cloud your knowledge, and render the math uncertain. Winemaking requires decades of trial and error to become truly proficient, and one must accept that not everything can be completely controlled.

Relax and turn off that computer. In vino, mysterium.


Another pathway to zen: place a raindrop anywhere in the United States and watch where it ends up, and how it gets there through streams and rivers.


Ted Underwood on neural models of language:

The immediate value of these models is often not to mimic individual language understanding, but to represent specific cultural practices (like styles or expository templates) so they can be studied and creatively remixed. This may be disappointing for disciplines that aspire to model general intelligence. But for historians and artists, cultural specificity is not disappointing. Intelligence only starts to interest us after it mixes with time to become a biased, limited pattern of collective life. Models of culture are exactly what we need.

This is exactly right, and well put. In a prior issue of this newsletter, I’ve called this experimenting with “stereotypical narrative genres,” and as a historian, it’s useful. The computer, by providing limitless examples of a “style,” can help us see the contours of that style, its peculiarities of language use and in turn its cultural context and connections.

[“Mapping the Latent Spaces of Culture: To Understand Why Neural Language Models Are Dangerous (And Fascinating), We Need to Approach Them as Models of Culture,” by Ted Underwood.]

Alvina Lai on the video game Spiritfarer, which has an unusual character…

…the Collector, a well-dressed, finicky walrus who goes by the name of Susan. When the player first meets Susan, she describes her distaste for the collection of “junk.” Nonetheless, Susan is the in-game collections achievement tracker, and will reward the player for finding objects throughout the game. While Susan is a minor character in-game, “collection management” is a real-world profession integral to museums, libraries and other curatorial institutions of all themes and sizes. This post will discuss Susan’s role and her views on collecting, and compare Susan and her ideas to real-world institutional collection management practices. In comparing real-world collection practices with those in Spiritfarer, I hope to show how Spiritfarer’s “reconstructive” storytelling and its collection mechanics can help shed light on the memory function of collection management policies and practices of libraries and museums.



Access 2021: Dave Binkley Memorial Lecture: Open Media: Laura Tribe / Cynthia Ng

Notes from the closing keynote by Laura Tribe from Open Media.

Open Media: advocacy group for keeping internet low cost, surveillance free (etc.), especially conversations and diversity of voices in democratic process; help educate public, build tools to help participate in discussions

Goal: how you can get involved

Recap: Big picture: we’re tired of pandemic …

Fedora FY2020-2021 Annual Report / DuraSpace News

Fedora is excited to share with you this year’s Annual Report. This report serves to showcase and highlight all of our successes from FY 2020-2021. Read all about our achievements, developments and plans for the future and help us celebrate our vibrant and engaged community of members.

You can download your copy of the Fedora Annual Report here.

Thank you to all of our members, institutions and community partners for their dedication to Fedora. Without the continued support of the community and its members we would not be able to share these successes with you today. If you are not already one, we encourage you to consider becoming a member of the Fedora community. Your contribution will ensure the long-term survival of this crucial digital preservation software and all of the valuable items contained within repositories around the world.

Read more about it here.

The post Fedora FY2020-2021 Annual Report appeared first on

A Quarter-Century Of Preservation / David Rosenthal

The Internet Archive turned 25 yesterday! Congratulations to Brewster and the hordes of miniature people who have built this amazing institution.

For the Archive's home-town newspaper, Chase DiFeliciantoni provided a nice appreciation in He founded the Internet Archive with a utopian vision. That hasn't changed, but the internet has:
Kahle’s quest to build what he calls “A Library of Alexandria for the internet” started in the 1990s when he began sending out programs called crawlers to take digital snapshots of every page on the web, hundreds of billions of which are available to anyone through the archive’s Wayback Machine.

That vision of free and open access to information is deeply entwined with the early ideals of Silicon Valley and the origins of the internet itself.

“The reason for the internet and specifically the World Wide Web was to make it so that everyone’s a publisher and everybody can go and have a voice,” Kahle said. To him, the need for a new type of library for that new publishing system, the internet, was obvious.

We (virtually) attended the celebration. You can watch the archived stream here, and please donate to help with the $3M match they announced.

Remote Transfers and Site Visits for Born-Digital Collections: Reflections from the Field / Digital Library Federation

This post was authored by the following members of DLF’s Born Digital Access Working Group’s (BDAWG) Remote Transfers and Site Visits subgroup:

Annalise Berdini
Eddy Colloton
Steven Gentry
Elizabeth-Anne Johnson
Margo Padilla
Shira Peltzman
Darcy Pumphrey
Dana Reijerkerk
Sara Rogers
Laura Schroffel


The COVID-19 pandemic has tremendously impacted all aspects of archival work. In response to the changes caused by this global catastrophe, members of the Digital Library Federation’s Born-Digital Access Working Group (BDAWG) Remote Site Visits and Transfers subgroup have spent the past year learning more about how archival workers have been using or exploring remote acquisition and site visit workflows.

After designing an initial set of discussion questions and recruiting participants via a survey, subgroup members conducted three moderated discussion sessions via Zoom between March 19 and March 25, 2021. A follow-up “Ask a Question” session was also held in early June 2021 for those who wanted to either continue conversations that had started in previous sessions or address novel topics. Ultimately, 28(1) archival workers from the United States and Canada joined these four sessions and discussed questions that focused on remote appraisal and acquisition workflows that participants had considered, implemented, and avoided, as well as participants’ expectations of their donors’ role in appraising and transferring archival material.

Following the conclusion of the “Ask a Question” event, members of the Remote Site Visits and Transfers subgroup reviewed each session’s transcript and identified several themes to report out to the broader community. This post discusses those themes in greater detail, which include:

  • The continued necessity of communication with all stakeholders (e.g., colleagues and donors).
  • Being flexible when working with donors to transfer material.
  • De-prioritizing strict adherence to best practices and instead focusing on donor-friendly efforts.
  • The issue of the scale of born-digital acquisitions.
  • Participants’ mixed feelings regarding pre-accessioning appraisal work.

Communication and Relationship Building for Transfers

Unsurprisingly, communication with donors and offices of transfer remained essential to the process of acquisition during the pandemic, and arguably became even more important due to restrictions around in-person work. Participants discussed transfers and reviewed records over Zoom, over the phone, and over email, often using a combination of all three to facilitate acquisitions. One theme that emerged across all of the guided discussion sessions was the need to walk donors through any donation and transfer documentation. Online forms, even those with good directions, were sometimes confusing for donors or difficult to use on mobile devices. However, for one participant, the shift to online forms “didn’t have the hurdles…[they] expected”, although they mentioned that forms with pre-filled buttons and dropdowns seemed more successful than those requiring written answers.

One participant identified the PAIMAS standard as helpful in determining how to engage with donors on email exports: normally, they would sit with a donor in person and bring a USB drive, but the pandemic gave them an opportunity to test out new workflows where they would instruct the donor over Zoom and phone on the technical steps required to transfer content. Because the pandemic provided an opportunity to establish future remote transfer workflows, they now have better information on how to facilitate good communication during a remote transfer. The participant was pleased that they could now confidently facilitate international transfers, for example.

An important lesson about communication that arose during the sessions involved the templates we sometimes create to guide donor discussions. One participant created an initial template based on a test transfer early in the pandemic, but the donor they tested it with was versed enough in technology to understand questions about file formats and operating systems. Going into another donor discussion with that same template, the participant found that they had to make changes on the fly—to move on from specific questions about file formats because the donor was struggling and ask about general software instead. This experience led them to adjust their discussion template workflow by saving multiple versions for different donor situations. Knowing how to adjust communication styles with different donors has led to better success for many of the practitioners we spoke to—especially without having the fallback of being able to look through content before transfer.

Across all of our discussions, the concepts of “the path of least resistance” and “getting through the day” were common. Despite wanting to acquire and transfer born-digital records according to the highest standards, for our participants, the reality was that this was usually not possible or even not preferable. For many of them, donor experience and donor relationships were more important than spending time trying to get a donor to download, install, and use a tool like Bagger in order to preserve the material’s technical metadata. As one participant noted, “born-digital material requires so much extra work on behalf of the donor anyway…making the transfer step as simple as possible seems like a good way to go.” Many participants have been especially successful with remote donations coming from existing cloud storage accounts, feeling that transfers through cloud-based services like Dropbox or Google Drive were good enough and that they did not have the time, resources, or a suitably tech-savvy donor base to try a transfer method beyond tools already known by most donors.

Relationship building to facilitate transfers was important not only with regard to donors. For some of our participants, it was as important, if not more so, to build strong relationships with institutional information technology (IT) departments. IT partners can be key to creating new workflows, including those for remote transfers. One participant mentioned that due to institutional firewalls, they have to go through their IT department to acquire materials through Google Drive. Another participant ran into issues establishing a Secure File Transfer Protocol (SFTP) connection due to a firewall and was unable to get it to work even with IT assistance. Having communication channels in place and existing relationships with IT colleagues makes having these crucial conversations easier.

Donor Experience

We found that many participants in our sessions were concerned about the variations in experience with born-digital material among donors, acquiring archivists, digital archivists, and other archival staff. Planning the acquisition of born-digital material that meets the needs of these parties as well as the policy needs of one’s institution is complex and requires differing approaches from donor to donor. Actions like filling out forms or signing a deed of gift may need to happen electronically or on paper depending on the donor’s level of expertise and comfort with digital material and tools.

During our sessions, archives workers mentioned a variety of tools they used with donors, depending on the donors’ expertise. Google Drive and Dropbox seemed to be the most common transfer solutions, with external drives often utilized for larger transfers. Some other methods used by participants were using Zoom’s remote desktop capability to explore a donor’s computer and ready files for transfer, getting donors to use SFTP to transfer material, or simply sending a donor an empty hard drive and asking them to transfer their material onto it.

We also discussed the amount of preparatory work different archives ask donors to do with their digital material. As mentioned above, having conversations about the material before the acquisition process even begins can be helpful in getting a sense of the material’s needs and the donor’s comfort level. Developing a standard series of questions about file formats, operating systems, and physical media (among other topics) that are routinely asked of donors may be useful, though as discussed above, even this level of discussion may be beyond some donors. Some institutions ask donors to zip their donation or package their files in the BagIt standard using a tool like Exactly. Some assist donors with creating a cogent file structure and do some arranging on the donor’s computer before a transfer takes place, or ask donors to do some of this work. The practicability of all these options depends on the donor’s time, willingness, and capacity to assist in the transfer of their material in this way, and on whether someone in the archives has the ability to provide technical support to the donor when needed.
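For readers unfamiliar with what packaging tools record, the following is a minimal sketch of the checksum listing at the heart of this kind of transfer. It is not a complete BagIt implementation and is not what Exactly or Bagger do internally; the function name `sha256_manifest` and the output layout (checksum, two spaces, relative path) are illustrative assumptions loosely modeled on a manifest file.

```python
import hashlib
from pathlib import Path


def sha256_manifest(root):
    """Return one 'checksum  relative/path' line per file under root.

    Illustrative only: a real BagIt bag also needs a data/ directory,
    a bagit.txt declaration, and optional tag files.
    """
    root = Path(root)
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Reads each file fully into memory, which is fine for a sketch
        # but not for very large audiovisual files.
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return lines
```

A donor or archivist could run something like this before and after a transfer and compare the two listings; any line that differs points to a file that changed or went missing in transit.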

Doing this work also requires a level of expertise with digital tools from both the acquiring archivist who is in principal contact with the donor as well as from the digital archivist (in the case that these are different people). In some institutions, the acquiring archivists far outnumber the digital archivists, making knowledge transfer about born-digital archival material a challenge. Our respondents recommended that communication begin between acquiring archivists and digital preservation staff as early in the acquisition process as possible so that the maximum amount of knowledge transfer and education between the two parties can happen. As more and more donations come to the archives via electronic means and the process becomes more routine, these conversations will change, but it is important to share information about the tools used for electronic records transfer and the workflows involved in the transfers themselves.

Finally, we found that acquiring born-digital material by any means necessary is better than not preserving this material due to rigid expectations about the donor’s expertise. As discussed above, the “path of least resistance” and similar concepts were mentioned frequently in the sessions we held. When planning and executing the acquisition of born-digital material, archivists need to be flexible in the information and preparatory work they ask for, and play to the capabilities of their donors. Often this flexibility means that reinventing the acquisition wheel is unnecessary. Accepting born-digital content on physical media remains a safe, familiar, and reliable practice for many of the discussants. One participant indicated a policy threshold where an acquisition would be considered too large to be completed as a digital transfer. Indeed, most practitioners felt physical media was better suited for larger scale acquisitions, due to the security of the medium and the ability to assert control over the content.

Issues with Scale

Many participants expressed concerns with implementing standard and/or consistent workflows when the quantity of incoming materials ranged from a few assets to hundreds and more. Currently, many institutions still purchase or loan out external hard drives to donors for transfers. Digital transfers were desired by many, but few institutions actually acquired materials in this way. Many said that they wanted a way to streamline the transfer process for large batches of files, ideally with cloud-based transferring or through SFTP. One participant stated that they download file copying software on the donor’s computer, while another installed similar software on the external hard drive they lent out to donors.

Place of Appraisal

Participants’ responses to our questions about appraisal centered on several prominent themes. With regard to how they performed appraisal, some participants talked about using Zoom’s screen sharing and remote desktop tools to complete this work. One participant—who works heavily with audiovisual material typically received via hard drive—emphasized that “screen sharing allows for better appraisal of those hard drives before we actually get them,” highlighting the value of this tactic. Some participants commented on the need for strong relationships to facilitate appraisal, particularly with their curation colleagues, although challenges (e.g., a lack of time) were also noted. These examples point to some participants’ belief in the value of appraisal prior to accessioning material, even during these challenging times.

However, appraisal was also described by some participants in a more neutral or even strained tone. For example, one participant noted that appraising material alongside donors could be challenging because their donors typically transferred removable media to the archives once said media could no longer be easily read. Others expressed frustration or fatigue with appraisal in general, with one participant emphasizing that “electronic appraisal is…just absolutely overwhelming” and that “it’s easier for us right now to just take it all in…and reject things after the fact…[rather than]…build a proper robust archival quality appraisal on the front end.” This sentiment, which was echoed by other participants (e.g., one participant’s comment that “it’s almost easier…to take the files…and then do appraisal”) suggests that additional work can be done to make appraisal workflows more palatable, especially in light of scale-related issues (noted above).


Many themes and discussions referenced in this post will be familiar to readers. It is not exactly a new conversation that archivists are struggling to find ways to work with limited resources, limited staff, limited time, and donors unfamiliar with best practice technologies, and of course these issues affect the way we transfer and review materials while working remotely. We were hoping to start a conversation about how practitioners were meeting these challenges, and we are gratified to have helped spark some of those conversations. As many attendees noted, it was helpful simply to hold the space to have a dedicated discussion about remote transfers—to hear about common problems and learn alternative solutions from their peers. It was particularly inspiring to see attendees across all of the sessions share their resources with each other to try to address these common problems. Also important to many was simply knowing that others were going through the same challenges and that most of us are approaching them the same ways. There was a broad expectation that the COVID-19 pandemic would force practitioners into a fully digital future, but in reality it brought to light a scale of capacity where digital transfer became one tool among many.

We would like to see more research into this area and hope to hold a wider, more formal survey on the topic of transfers in the future.

(1) This number does not include DLF BDAWG subgroup members who actively participated in these sessions.

The post Remote Transfers and Site Visits for Born-Digital Collections: Reflections from the Field appeared first on DLF.

Excess Deaths (Updated) / David Rosenthal

It is difficult to comprehend how abject a failure the pandemic response in countries such as the US and the UK has been. Fortunately, The Economist has developed a model estimating excess deaths since the start of the pandemic. Unfortunately, it appears to be behind their paywall. So I have taken the liberty of screen-grabbing a few example graphs.

This graph compares the US and Australia. Had the US handled the pandemic as well as Australia (-17 vs. 250 per 100K), about 885,000 more Americans would be alive today. With a GDP per capita of about $63.5K/year, this loses the economy about $56B/year.

This graph compares the UK and New Zealand. Had Boris Johnson handled the pandemic as well as Jacinda Ardern (-49 vs. 170), about 149,000 more Britons would be alive today. With a GDP per capita of about $42K/year, this loses the economy about $6.3B/year.
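The arithmetic behind these figures can be sketched as follows. The per-100K rates and GDP per capita values are the ones quoted above; the population figures (roughly 331.4 million for the US and 68 million for the UK) are assumptions, as are the function names.

```python
US_POPULATION = 331_400_000  # assumption: approximate 2020 US population
UK_POPULATION = 68_000_000   # assumption: approximate UK population


def excess_lives(rate_per_100k, benchmark_per_100k, population):
    """Deaths avoided had the country matched the benchmark's excess death rate."""
    return (rate_per_100k - benchmark_per_100k) * population / 100_000


def gdp_cost(lives, gdp_per_capita):
    """Crude annual output lost, valuing each life at GDP per capita."""
    return lives * gdp_per_capita


us_vs_australia = excess_lives(250, -17, US_POPULATION)    # roughly 885,000
uk_vs_new_zealand = excess_lives(170, -49, UK_POPULATION)  # roughly 149,000

us_cost = gdp_cost(us_vs_australia, 63_500)    # roughly $56B/year
uk_cost = gdp_cost(uk_vs_new_zealand, 42_000)  # roughly $6.3B/year
```

Valuing each death at GDP per capita is a deliberately blunt measure; it ignores age structure and any longer-term effects.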

A graph is worth a thousand words. Below the fold, a little commentary.

The Economist argues that the true scale of the pandemic can only be determined from excess deaths:
Many people who die while infected with SARS-CoV-2 are never tested for it, and do not enter the official totals. Conversely, some people whose deaths have been attributed to covid-19 had other ailments that might have ended their lives on a similar timeframe anyway. And what about people who died of preventable causes during the pandemic, because hospitals full of covid-19 patients could not treat them? If such cases count, they must be offset by deaths that did not occur but would have in normal times, such as those caused by flu or air pollution.
Their machine-learning model:
estimates excess deaths for every country on every day since the pandemic began. It is based both on official excess-mortality data and on more than 100 other statistical indicators. Our final tallies use governments’ official excess-death numbers whenever and wherever they are available, and the model’s estimates in all other cases.
The model estimates that:
Although the official number of deaths caused by covid-19 is now 4.6m, our single best estimate is that the actual toll is 15.3m people. We find that there is a 95% chance that the true value lies between 9.4m and 18.2m additional deaths.
Excess deaths in the US and the UK are far from the worst, but my point is that countries at a similar level of development have done far better, and so have much less well-resourced countries. Had the US done as well as the model's estimate for China (38 vs. 250) about 702,000 more Americans would be alive today.

Update 9th Oct 2021:

Paul Campos crunches some of the numbers for US excess deaths and asks this fascinating question: Why did excess mortality in the USA in 2020 rise more among young adults than among the cohorts that accounted for almost all COVID deaths?
as age increased, excess mortality risk caused by COVID increased as well, to the point where among people 75 and older almost all excess mortality risk was attributable directly to COVID. By contrast, among young adults, excess mortality attributable to COVID decreased as cohort age decreased, to the point where among the youngest adults very little excess mortality could be attributed to it. Yet excess mortality rates increased more among young adults than they did among older adults. (As a percentage increase against baseline mortality risk of course. Overall mortality rates are naturally much higher among elderly cohorts).
Here is Campos' data in table form, with "Deaths" numbers per 100,000.
Age | 2019 Deaths | 2020 Deaths | COVID Deaths | COVID % of excess | Other % of excess
Something other than COVID-19 accounts for 4 out of 5 excess deaths in the 25-34 cohort. This proportion falls with age until at 85+ COVID-19 accounts for all the excess deaths. Campos writes:
The big question is: why did mortality rates rise so sharply among young adults in the USA in 2020, given that COVID, or at least COVID directly, seems to account for so little of that rise? (The indirect effects of the pandemic are a different story of course).

Answering that will require digging more into the precise causes of excess mortality among young people in the USA last year. Back of the enveloping suggests that the rise in the homicide rate accounts for maybe 15% to 20% of the increase in mortality risk among people under 45. Drug overdose rates also shot up among young adults, so that accounts for part of the rise as well. (Whether either of these causes of death is related in some way to the pandemic is yet another tangled epidemiological puzzle to solve).

In any case the social effects of the pandemic would seem to go far beyond its direct effects on population mortality, and most especially among the non-elderly.

Cryptocurrency's Carbon Footprint Underestimated / David Rosenthal

Back in April I wrote Cryptocurrency's Carbon Footprint about the catastrophic carbon emissions of Proof-of-Work cryptocurrencies such as Bitcoin. It now turns out that I didn't know the half of it; the numbers I and everyone else have been using are greatly underestimated. Below the fold, based on my no doubt somewhat inadequate methodology, the real story.

The leading source of data on which to base Bitcoin's carbon footprint is the Cambridge Bitcoin Energy Consumption Index. As I write, after the Bitcoin price took a 10% hit on news from the People's Bank of China, they estimate 12.94GW or 0.45% of total global electricity consumption, with theoretical bounds of [4.65,31.17]GW.

Their methodology is explained here. Greatly oversimplified, it is this:
  • The hash rate of the Bitcoin blockchain is known (see graph).
  • The energy consumed per hash by the various mining ASICs is known.
  • The population of each type of mining ASIC is not known, and changes all the time, so:
    The lower bound estimate corresponds to the theoretical minimum total electricity expenditure based on the best-case assumption that all miners always use the most energy-efficient equipment available on the market. The upper bound estimate specifies the theoretical maximum total electricity expenditure based on the worst-case assumption that all miners always use the least energy-efficient hardware available on the market, as long as running the equipment is still profitable in electricity terms. The best-guess estimate is based on the more realistic assumption that miners use a basket of profitable hardware rather than a single model.
  • Mining ASICs are profitable if the mining rewards they generate more than cover the electricity cost.
  • The electricity cost varies among miners, but:
    Electricity prices available to miners vary significantly from one region to another for a variety of reasons. We assume that on average, miners face a constant electricity price of 5 USD cents per kilowatt-hour (0.05 USD/kWh). This default value is based on in-depth conversations with miners worldwide and is consistent with estimates used in previous research
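The steps above can be sketched in a few lines. This is a simplified illustration, not CBECI's actual code: the ASIC efficiencies, the hash rate, and the revenue per TH/s per day in the usage note are hypothetical, and taking the best guess as an equally weighted average of profitable models is an assumption. Only the 0.05 USD/kWh electricity price comes from the quoted methodology.

```python
from dataclasses import dataclass
from statistics import mean

ELECTRICITY_USD_PER_KWH = 0.05  # from the CBECI methodology quoted above
JOULES_PER_KWH = 3.6e6
SECONDS_PER_DAY = 86_400


@dataclass(frozen=True)
class Asic:
    name: str                   # hypothetical model name
    joules_per_terahash: float  # energy efficiency in J/TH


def daily_power_cost(asic):
    """Electricity cost in USD of running 1 TH/s for one day."""
    kwh_per_day = asic.joules_per_terahash * SECONDS_PER_DAY / JOULES_PER_KWH
    return kwh_per_day * ELECTRICITY_USD_PER_KWH


def power_bounds_gw(hash_rate_ths, asics, revenue_usd_per_ths_day):
    """Return (lower, best_guess, upper) network power draw in GW."""
    profitable = [a for a in asics if daily_power_cost(a) < revenue_usd_per_ths_day]
    if not profitable:
        raise ValueError("no hardware is profitable at this revenue level")
    # Lower bound: everyone runs the most efficient hardware on the market.
    best = min(a.joules_per_terahash for a in asics)
    # Upper bound: everyone runs the least efficient hardware that still pays.
    worst = max(a.joules_per_terahash for a in profitable)
    # Best guess: an equally weighted basket of profitable hardware (assumption).
    basket = mean(a.joules_per_terahash for a in profitable)
    # TH/s times J/TH gives watts; divide by 1e9 for GW.
    return tuple(hash_rate_ths * j / 1e9 for j in (best, basket, worst))
```

For example, with a hypothetical 150 EH/s network (150e6 TH/s), four hypothetical models rated at 30, 50, 100 and 300 J/TH, and revenue of $0.30 per TH/s per day, the 300 J/TH model costs $0.36 per day to run and drops out, leaving bounds of 4.5 GW and 15 GW around a best guess of 9 GW.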
The energy consumption estimates can be converted to carbon footprints using the known carbon intensity of electricity production in the regions where Bitcoin is mined. CBECI also produces estimates of the proportion of hash rate in different regions although, unlike their energy consumption estimates, these lag considerably. As you can see from their most recent map, they do not yet show the exodus of miners from China due to the government's continuing cryptocurrency crackdown.

Using the average carbon intensity of electricity in each region is, of course, a simplification. There is a tidal wave of Bitcoin greenwashing, such as:
  • Suggesting that because Bitcoin mining allows an obsolete, uncompetitive coal-burning plant near St. Louis to continue burning coal it is somehow good for the environment.
  • Committing to 100% renewable energy use by signing the Crypto Climate Accord, then implementing it:
    By buying 15 megawatts of coal-fired power from the Navajo Nation! They're paying less than a tenth what other Navajo pay for their power — and 14,000 Navajo don't have any access to electricity. The local Navajo are not happy.
But the desperate greenwashing suggests it is likely that in most regions Bitcoin mining uses electricity more carbon-rich than the average.

There were already many reasons why the CBECI energy estimates are too low. They exclude, for example, the energy consumption of the computers driving the ASICs, the cooling needed to stop them overheating, and the networking to tie them together. These are relatively small corrections, but there is another that is rather large.

A full accounting of the carbon footprint of IT equipment such as Bitcoin mining rigs requires estimating not just the emissions from its use, but also the embedded emissions produced during its manufacture and disposal. Fortunately, a paper from a year ago entitled Chasing Carbon: The Elusive Environmental Footprint of Computing by Udit Gupta et al from Harvard, Facebook and Arizona State provides these estimates for both mobile devices and data center equipment, which is a reasonable proxy for Bitcoin mining rigs:
we split emissions into Scope 1 (blue), Scope 2 (green), and Scope 3 (red). Recall that Scope 1 (opex) emissions come from facility use of refrigerants, natural gas, and diesel; Scope 2 (opex) emissions come from purchased electricity; and Scope 3 (capex) emissions come from the supply chain, including employee travel, construction, and hardware manufacturing
Figure 11
They plot the history of these emissions in Figure 11, which:
illustrates the carbon footprint of Google and Facebook over six years. Although the figure divides these emissions into Scope 1, Scope 2, and Scope 3, Scope 2 comprises two types: location based and market based. Location-based emissions assume the local electricity grid produces the energy — often through a mix of brown (i.e., coal and gas) and green sources. Market-based emissions reflect energy that companies have purposefully chosen or contracted — typically solar, hydroelectric, wind, and other renewable sources.
Thus the green line on the graph is "opex" (i.e. electricity) corresponding to the CBECI estimates, and the red line is "capex", i.e. embedded emissions. Note two things:
  • Until Facebook and Google switched to renewable electricity, opex and capex emissions were approximately equal. The fraction of renewable energy in Bitcoin mining is unlikely to be large.
  • In 2017 Facebook and Google changed their hardware footprint disclosure practice, resulting in an increase of 7x for Google and 12x for Facebook. It is safe to assume that neither would have done this had they believed the new practice greatly over-estimated the footprint.


In 2018 Christian Stoll et al estimated Bitcoin's carbon emissions:
We determine the annual electricity consumption of Bitcoin, as of November 2018, to be 48.2 TWh, and estimate that annual carbon emissions range from 21.5 to 53.6 MtCO2.
Adjusting this to the current CBECI estimate gives a range of about 43 to 108 MtCO2/yr for Bitcoin's opex emissions. Using data from Wikipedia, this is between Cameroon (~42) and Uzbekistan (~108).

A Factor Of 2?

If we assume that Bitcoin mining is represented by Facebook and Google before they switched to renewables, then the embedded Scope 3 emissions are around the same as those based on the CBECI numbers, and the carbon footprint of Bitcoin is double previous estimates. A carbon footprint of 86 to 216 MtCO2/yr would place Bitcoin mining between Tanzania (~76) and Pakistan (~215).

Note that this would mean that even were Bitcoin using 100% renewable energy, it would still have a carbon footprint between 43 and 108 MtCO2/yr.

A Factor of 10?

If we assume that the newer hardware footprint disclosures increase the embedded emissions by between 7x and 12x, say 9x, including the embedded emissions in mining rigs increases the emissions computed from the CBECI estimates by a factor of 10 (1 for electricity and 9 for embedded emissions). Thus applying the 9x factor for capex emissions gives a range for the total carbon footprint of 430 to 1080 MtCO2/yr. Or, about the same as South Africa (~440) to about the same as Russia (~1050).

A Factor of 19?

That is bad enough, but it isn't the end of the problem. Data center equipment is run in carefully controlled environments at carefully controlled duty cycles. Because of the phenomena outlined by Dean and Barroso in The Tail at Scale, even batch processing cloud infrastructure aims for a duty cycle around 80%, whereas interactive service infrastructure aims between 30% and 50%. Bitcoin mining rigs, on the other hand, are run at 100% duty cycle in sub-optimal environments.

Both factors impair their useful life, but a bigger factor is that the highly competitive market for mining ASICs means that they have only a short economic life. This month's Bitcoin's growing e-waste problem by Alex de Vries and Christian Stoll estimates that:
The average time to become unprofitable sums up to less than 1.29 years. While this concerns an unweighted average, we can refer to the case study on Bitmain's Antminer S9 ... to show that weighting the average lifetime by sales volume does not significantly change the results.
de Vries and Stoll estimate that:
  • Bitcoin's annual e-waste generation adds up to 30.7 metric kilotons as of May 2021.
  • This level is comparable to the small IT equipment waste produced by a country such as the Netherlands.
  • On average Bitcoin generates 272 g of e-waste per transaction processed on the blockchain.
  • Bitcoin could produce up to 64.4 metric kilotons of e-waste at peak Bitcoin price levels seen in early 2021.
  • The soaring demand for mining hardware may disrupt global semiconductor supply chains.
David Gerard points out that this means:
That's half an iPad of e-waste average per transaction.
Because a mining rig's life is perhaps only half that of data center equipment, it means that the embedded emissions in mining rigs are amortized over only half as long, and are thus in effect twice as great. Thus it is plausible that the multiplier for the embedded emissions is not 9 but 18, leading to Bitcoin having a carbon footprint between 817 and 2052 MtCO2/yr. This compares to between Brazil (~812) and India (~2400).

Note that this would mean that even were Bitcoin using 100% renewable energy, and we ignore the updated Scope 3 disclosure, it would still have a carbon footprint between 86 and 216 MtCO2/yr.

It Gets Worse

  • The numbers above are for Bitcoin alone, but it is far from the only Proof-of-Work cryptocurrency. In 2018, Tim Swanson estimated that the next four Proof-of-Work cryptocurrencies added an extra 35% to Bitcoin's electricity consumption. This could mean that in the worst case the top 5 cryptocurrencies had a carbon footprint of between 1100 and 2770 MtCO2/yr, or between Japan (~1074) and the EU (~2637).
  • Gupta et al add a caveat:
    Note that because industry disclosure practices are evolving, publicly reported Scope 3 carbon output should be interpreted as a lower bound.
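The scenario arithmetic in the sections above can be reproduced with a short sketch. It takes the 43 to 108 MtCO2/yr opex range as given and applies the embedded-emissions multipliers discussed in the text (1, 9 and 18) plus the 35% uplift for the next four cryptocurrencies; the function and variable names are illustrative.

```python
OPEX_RANGE = (43, 108)  # MtCO2/yr, the adjusted opex range from the text


def footprint(embedded_multiplier, renewable=False, uplift=1.0):
    """Total footprint range in MtCO2/yr for a given embedded-emissions multiplier.

    renewable=True zeroes the electricity (opex) term and leaves only
    the embedded (capex) emissions.
    """
    opex_factor = 0 if renewable else 1
    return tuple(
        round(x * (opex_factor + embedded_multiplier) * uplift) for x in OPEX_RANGE
    )


factor_of_2 = footprint(1)    # (86, 216)
factor_of_10 = footprint(9)   # (430, 1080)
factor_of_19 = footprint(18)  # (817, 2052)

# Top five Proof-of-Work coins in the worst case: Bitcoin plus 35%.
top_five = footprint(18, uplift=1.35)  # (1103, 2770)

# 100% renewable electricity, pre-2017 disclosure, halved hardware lifetime.
renewable_floor = footprint(2, renewable=True)  # (86, 216)
```

Laying the scenarios out this way makes it clear that everything downstream scales linearly with the opex range, so any error in that starting estimate propagates through every factor.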


These later estimates seem excessive. Please critique my methodology in the comments.

To be conservative, let's assume the value for Scope 3 emissions before the disclosure change, i.e. that the Scope 3 and 2 emissions were about the same, and simply adjust the Scope 3 emissions for the short working life of mining rigs. Thus the Scope 3 emissions are double the Scope 2 emissions, and the factor by which the emissions exceed the previous estimates is 3. This leads to a carbon footprint of between 65 and 161 MtCO2/yr for Bitcoin alone, or between Morocco (~65) and Colombia (~162). The top 5 cryptocurrencies would be between Kuwait (~89) and Kazakhstan (~217).

The Looming Fossil Fuel Crash / David Rosenthal

In 2018's It Isn't About The Technology I wrote about Charlie Stross' concept that corporations are "Slow AIs":
Stross uses the Paperclip Maximizer thought experiment to discuss how the goal of these "slow AIs", which is to maximize profit growth, makes them a threat to humanity. The myth is that these genius tech billionaire CEOs are "in charge", decision makers. But in reality, their decisions are tightly constrained by the logic embedded in their profit growth maximizing "slow AIs".
Below the fold, I apply this insight to the impact of climate change on "the market".

In It Isn't About The Technology I illustrated the problem from personal experience:
In the late 80s I foresaw a bleak future for Sun Microsystems. Its profits were based on two key pieces of intellectual property, the SPARC architecture and the Solaris operating system. In each case they had a competitor (Intel and Microsoft) whose strategy was to make owning that kind of IP too expensive for Sun to compete. I came up with a strategy for Sun to undergo a radical transformation into something analogous to a combination of Canonical and an App Store. I spent years promoting and prototyping this idea within Sun.

One of the reasons I have great respect for Scott McNealy is that he gave me, an engineer talking about business, a very fair hearing before rejecting the idea, saying "Its too risky to do with a Fortune 100 company". Another way of saying this is "too big to pivot to a new, more “sustainable” business model". In the terms set by Sun's "slow AI" Scott was right and I was wrong. Sun was taken over by Oracle in 2009; their "slow AI" had no answer for the problems I identified two decades earlier. But in those two decades Sun made its shareholders unbelievable amounts of money.
David Shukman reports for the BBC that:
At the moment, projections suggest that even with recent pledges to cut emissions of greenhouse gases, the world is on course to heat up by up to 3C.
Doug Johnson's To limit warming to 1.5°C, huge amounts of fossil fuels need to go unused explains the problem facing the "Slow AIs" of fossil fuel companies:
According to the new research, nearly 60 percent of existing oil and fossil methane gas and 90 percent of global coal reserves need to go unused through at least 2050—and this action would only yield a 50 percent chance of limiting global warming to 1.5ºC. These reductions mean that many fossil fuel projects around the world, both planned and existing, would need to be halted. Further, oil and gas production needs to decline by 3 percent every year until 2050. This also means that most regions in the world need to reach their peak production now or within the next decade.
The article is based upon Unextractable fossil fuels in a 1.5 °C world by Dan Welsby et al from University College London. In other words, the fossil fuel companies need to take two decisions in the near future:
  • To ramp down their production at 3%/yr.
  • To abandon the development plans for the majority of their reserves.
The problem their "Slow AIs" face is that their stock market valuations are based upon their profits, which would be reduced by the -3% ramp, and their assets, the reserves that should not be developed.
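To get a feel for what the 3%/yr ramp means, the decline can be compounded out to 2050. This is only a back-of-the-envelope sketch: the start year of 2021 is an assumption (the year this post was written), and the function and constant names are illustrative rather than anything taken from the paper.

```python
# Illustrative only: compound a 3%/yr production decline out to 2050.
START_YEAR = 2021       # assumption: the year the post was written
END_YEAR = 2050         # horizon used in the article
ANNUAL_DECLINE = 0.03   # "decline by 3 percent every year"


def remaining_fraction(start_year, end_year, annual_decline):
    """Fraction of starting production left after compounding the decline."""
    years = end_year - start_year
    return (1.0 - annual_decline) ** years


fraction_2050 = remaining_fraction(START_YEAR, END_YEAR, ANNUAL_DECLINE)
print(f"Production in {END_YEAR}: {fraction_2050:.0%} of {START_YEAR} levels")
```

Under those assumptions production in 2050 would be a little over 40 percent of today's, which gives a sense of how much revenue the "Slow AIs" would have to write off.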

For their long-term survival these companies need to "pivot" to more sustainable businesses, for example renewable energy, accepting the short-term hit to their stock price. But this is precisely the kind of decision that corporate "Slow AIs" cannot take. Two examples of their reaction are:
  • Geoff Dembicki's CEOs Who Called for Climate Action Now Scrambling to Block Climate Action recounts Pat Gelsinger's "road to Damascus" moment:
    Sheltering at home in California with his family, Gelsinger watched a nearby wildfire spew smoke and ash and turn the sky orange. Never before had society experienced crises at this scale, he realized, a “global triple threat” of climate chaos, racial inequality, and an out-of-control pandemic. Gelsinger, who is now CEO of Intel, felt a moral duty to get the climate emergency under control while bridging social divisions.
    But Intel's "Slow AI" disagreed with the new CEO:
    “Make no mistake, these policies are a step backward for the U.S. economy that will harm all Americans,” reads a statement earlier this month from the Business Roundtable, a lobby group that Gelsinger belongs to along with top executives at corporations like Apple, Microsoft, BlackRock, and Disney. The Roundtable is reportedly waging “a significant, multifaceted campaign” costing potentially millions of dollars to defeat the corporate tax hikes which would help fund and make possible Biden’s Build Back Better plan — even as its individual members say there is nothing more important than stabilizing greenhouse gas emissions.
    Intel's "Slow AI" explained why it over-ruled the CEO:
    A spokesperson for Intel said in a statement to Rolling Stone that “we believe climate change is a serious environmental, economic, and social challenge.” The statement explained that while groups like the Roundtable might not align with the company “100 percent on every topic,” Intel believes “the overall benefits of our membership in these organizations outweighs our differences on some issues.”
  • Coral Davenport's This Powerful Democrat Linked to Fossil Fuels Will Craft the U.S. Climate Plan describes the effect of this lobbying:
    Joe Manchin, the powerful West Virginia Democrat who chairs the Senate energy panel and earned half a million dollars last year from coal production, is preparing to remake President Biden’s climate legislation in a way that tosses a lifeline to the fossil fuel industry — despite urgent calls from scientists that countries need to quickly pivot away from coal, gas and oil to avoid a climate catastrophe.
If the fossil fuel "Slow AIs" had taken the long view and accepted a gradual but substantial decrease in their profits and stock price then the effects on the broader market would have been manageable. A gradual accumulation of events such as Climate Change: Update on Harvard Action:
For some time now, Harvard Management Company (HMC) has been reducing its exposure to fossil fuels. As we reported last February, HMC has no direct investments in companies that explore for or develop further reserves of fossil fuels. Moreover, HMC does not intend to make such investments in the future. Given the need to decarbonize the economy and our responsibility as fiduciaries to make long-term investment decisions that support our teaching and research mission, we do not believe such investments are prudent.
and other financial institutions following suit would depress fossil fuel stock prices, but not enough to avoid others seeing this as an investment opportunity. But, with statements like The Saudi Prince of Oil Prices Vows to Drill ‘Every Last Molecule’, that isn't what the "Slow AIs" are going to do.

So, one of two things is going to happen. Either the world is going to head for 3°C and society collapses, in which case the "Slow AIs'" profits, stock prices and reserve assets are irrelevant. Or there will be a discontinuous drop in their stock prices and bond ratings as they are forced to cut production and restate the value of their reserves.

Although it is 6 years since fossil fuel companies were counted in the top ten most valuable companies, as a group they are still very large. For example, as I write, BP, Chevron, ConocoPhillips, Exxon Mobil, PetroChina, Royal Dutch Shell and TotalEnergies are together worth $1.1T, or a bit more than Facebook.

Assuming society doesn't collapse, here are three reasons why the stock price and bond ratings of the fossil fuel industry will crash:
  • The world cannot indefinitely absorb the externalities of their operation. Peter Coy's New York Times op-ed ‘The Most Important Number You’ve Never Heard Of’ reports on The Social Cost of Carbon: Advances in Long-Term Probabilistic Projections of Population, GDP, Emissions, and Discount Rates by Rennert et al, which describes a sophisticated model for the "social cost of carbon", i.e. the externalities of the fossil fuel industry:
    But with certain plausible assumptions, the model spits out a social cost of carbon of $56 a ton on average at a 3 percent discount rate, and $171 a ton on average at a 2 percent discount rate. The 2 percent figure is more in line with the relevant current interest rates
    It’s terrible news for the planet and humanity if greenhouse gas emissions create $171 in damages per ton. (Keep in mind that burning 113 gallons of gasoline is enough to generate a ton of carbon dioxide or the equivalent in other greenhouse gases, according to the Environmental Protection Agency, so that would be a cost to the planet of more than $1 per gallon consumed.) The higher figure implies that even very costly measures to reduce emissions should be implemented immediately.
  • The competitors to fossil fuels are already cheaper and their advantage is increasing. Empirically grounded technology forecasts and the energy transition by Way et al shows that:
    The prices of fossil fuels such as coal, oil and gas are volatile, but after adjusting for inflation, prices now are very similar to what they were 140 years ago, and there is no obvious long range trend. In contrast, for several decades the costs of solar photovoltaics (PV), wind, and batteries have dropped (roughly) exponentially at a rate near 10% per year. The cost of solar PV has decreased by more than three orders of magnitude since its first commercial use in 1958.
    and that:
    if solar photovoltaics, wind, batteries and hydrogen electrolyzers continue to follow their current exponentially increasing deployment trends for another decade, we achieve a near-net-zero emissions energy system within twenty-five years. In contrast, a slower transition (which involves deployment growth trends that are lower than current rates) is more expensive and a nuclear driven transition is far more expensive.
  • Fossil fuels are massively subsidized at taxpayer expense:
    Conservative estimates put U.S. direct subsidies to the fossil fuel industry at roughly $20 billion per year; with 20 percent currently allocated to coal and 80 percent to natural gas and crude oil. European Union subsidies are estimated to total 55 billion euros annually.
    Between the US and the EU that's $84B/yr, and they're not the only ones.
The fossil fuel industry can't stave off all of these indefinitely, but they are clearly going to try to postpone the inevitable. The result will be to magnify the eventual crash. The crash will thus be big enough to crash the stocks of related industries. How much have the banks lent to the fossil fuel industries and their suppliers, for example?
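The arithmetic behind two of the figures in the list above can be checked in a few lines. This is a sketch using only the numbers quoted in the post; the variable names are mine, and the dollar/euro exchange rate is not stated in the post, so the code backs it out from the $84B total rather than assuming one.

```python
# Figures quoted in the post; names are illustrative.
GALLONS_PER_TON_CO2 = 113                      # EPA figure quoted by Coy
SOCIAL_COST_PER_TON = {"3% discount": 56,      # USD per ton of CO2
                       "2% discount": 171}


def cost_per_gallon(cost_per_ton, gallons_per_ton=GALLONS_PER_TON_CO2):
    """Social cost attributable to burning one gallon of gasoline, in USD."""
    return cost_per_ton / gallons_per_ton


per_gallon = {label: cost_per_gallon(cost)
              for label, cost in SOCIAL_COST_PER_TON.items()}

US_SUBSIDY_USD_BN = 20    # direct US subsidies, $B/yr
EU_SUBSIDY_EUR_BN = 55    # EU subsidies, EUR B/yr
TOTAL_USD_BN = 84         # combined total stated in the post, $B/yr

# Exchange rate implied by the post's total (not stated in the post itself).
implied_usd_per_eur = (TOTAL_USD_BN - US_SUBSIDY_USD_BN) / EU_SUBSIDY_EUR_BN

for label, value in per_gallon.items():
    print(f"{label}: ${value:.2f} per gallon")
print(f"Implied exchange rate: {implied_usd_per_eur:.3f} USD per EUR")
```

At the 2 percent discount rate this comes to roughly $1.50 per gallon, consistent with Coy's "more than $1 per gallon".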

Update 20th October 2021

Damian Carrington's Planned fossil fuel output ‘vastly exceeds’ climate limits, says UN supports the argument of this post:
Fossil fuel production planned by the world’s governments “vastly exceeds” the limit needed to keep the rise in global heating to 1.5C and avoid the worst impacts of the climate crisis, a UN report has found.

Despite increasing pledges of action from many nations, governments have not yet made plans to wind down fossil fuel production, the report said. The gap between planned extraction of coal, oil and gas and safe limits remains as large as in 2019, when the UN first reported on the issue. The UN secretary general, António Guterres, called the disparity “stark”.

The report, produced by the UN Environment Programme (Unep) and other researchers, found global production of oil and gas is on track to rise over the next two decades, with coal production projected to fall only slightly. This results in double the fossil fuel production in 2030 that is consistent with a 1.5C rise.
The report also found that countries have directed more than $300bn (£217bn) of new public finance to fossil fuel activities since the beginning of the Covid-19 pandemic, more than that provided for clean energy.

Gif / Ed Summers

I’ve found myself creating animated GIFs from videos enough recently to make a little shell script to use from the command line, so I don’t need to keep looking up the ffmpeg options–of which there are so many. Put this in your PATH and smoke it:
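For readers who want something to start from, here is a rough stand-in for that kind of wrapper. It is not the author's script: it is a sketch in Python rather than shell, the function name and the default frame rate and width are my own assumptions, and it uses ffmpeg's commonly documented palettegen/paletteuse filter chain.

```python
#!/usr/bin/env python3
"""gif: turn a video into an animated GIF with ffmpeg (illustrative sketch)."""
import subprocess
import sys


def build_ffmpeg_command(src, dest, fps=10, width=480):
    """Return the ffmpeg argument list for a palette-based GIF conversion.

    fps and width are assumed defaults; tune them to taste.
    """
    # fps/scale shrink the output (height -1 keeps the aspect ratio);
    # palettegen + paletteuse build and apply a custom palette for the clip.
    filters = (
        f"fps={fps},scale={width}:-1:flags=lanczos,"
        "split[a][b];[a]palettegen[p];[b][p]paletteuse"
    )
    # -y overwrites an existing output file; -loop 0 makes the GIF loop forever.
    return ["ffmpeg", "-y", "-i", src, "-vf", filters, "-loop", "0", dest]


# Only runs ffmpeg when invoked as: gif INPUT_VIDEO OUTPUT.gif
if __name__ == "__main__" and len(sys.argv) == 3:
    subprocess.run(build_ffmpeg_command(sys.argv[1], sys.argv[2]), check=True)
```

Saved as an executable file named gif somewhere on your PATH, it would be called as `gif clip.mp4 clip.gif`.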

What an insanely great tool ffmpeg is. I’d actually love to read a history of the project–has that been done already?

Thank you for Joining the Frictionless Data Hackathon / Open Knowledge Foundation

Last week people from around the world joined the Frictionless Data team for the world’s first Frictionless Data Hackathon. Find out what happened, and make sure you join the Frictionless Data Community to find out about upcoming events.

Watch video here

What’s this about?

The team at Open Knowledge Foundation have lots of experience running and attending Hackathons. We know how powerful they can be to create new functioning software and useful innovations in a short space of time.

This is why the team at Frictionless Data were so excited to launch the first Frictionless Data Hackathon on 7 – 8th October 2021.

Over 20 people from around the world signed up for the event. During two full days, the participants worked on four projects, all with very different outcomes. For example:

  • Covid Tracker was aimed at testing Livemark – the latest Frictionless tool – with real live data to provide an example of all its functionalities. Check out the project Github repository to learn more.
  • the Frictionless Tutorial project created new tutorials using the Python Frictionless Framework (see tutorial here)
  • Frictionless Community Insight focused on building a new Livemark website to tell the story of the Frictionless Community – who we are, where we are from, what we do and what we care about (see draft website here)
  • DPCKAN was a project proposed by a team working on the data portal of the state of Minas Gerais in Brazil to develop a tool that would allow publishing and updating datasets described with Frictionless Standards in a CKAN instance. Check out the Github Repository here.

The prize for the best project, voted by the participants, went to the DPCKAN team. Well done André, Andrés, Carolina, Daniel, Francisco and Gabriel!

    “I feel pretty happy after this frictionless hackathon experience. We’ve grown in 2 days more than it could have been possible in one month. The knowledge and experience exchange was remarkable”, said the winning team.

Find out more
You can learn more about the Frictionless Data Hackathon here and watch the project presentations here.

Learn more about Frictionless Data on our website. Ask us a question, or join the Frictionless Data community here.

2021 End-of-Year Community Updates / Open Library

Hi Open Library Community! This is going to be a less formal post detailing some of our recent community meetings and exciting Q3 (quarter 3) opportunities to learn, celebrate, and participate with the Open Library project.

Earlier this Month

Upcoming Events

  1. 📙 Library Leader’s Forum 2021-10-13 & 2021-10-20
  2. 🎉 Open Library Community Celebration (RSVP) 2021-10-26
  3. 📅 2022 Roadmap Community Planning (join) 2021-11-02 @ 10am PT

Open Library Community Celebration 2021

Last year we started the tradition of doing an Open Library Community Celebration to honor the contributions & impact of those in our community. On October 26, 2021 @ 10am Pacific we will be hosting our 2nd annual community celebration. We hope you can join us!

During this online event, you’ll hear from members of the community as we:

  • Announce our latest developments and their impacts
  • Raise awareness about opportunities to participate
  • Show a sneak-peek into our future: 2022

Free, open invitation: RSVP here

5-Year Vision

At the end of September, on 2021-09-28 @ 10am PT, the Open Library community came together to brainstorm Open Library’s possible long-term directions. Anyone in the community is welcome to comment and add their notes and thoughts:

2021 Year-End Review

On 2021-10-12 @ 10am PT the community met to review what we had accomplished (see review doc) on our 2021 roadmap.

2022 Community Planning

In the first week of November, on 2021-11-02 @ 10am PT, the community will meet to brainstorm goals for Open Library’s 2022 roadmap. This community planning call will be open to the public here.

Migration: Fedora 3 to OCFL / Brown University Library Digital Technologies Projects

A previous post described the current storage setup of the Brown Digital Repository. However, until recently BDR storage was handled by Fedora 3. This post will describe how we migrated over one million objects from Fedora 3 to OCFL, without taking down the repository.

The first step was to isolate the storage backend behind our BDR APIs and a Django storage service (this idea wasn’t new to the migration – we’ve been working on our API layer for years, long before the migration started). So, users and client applications did not hit Fedora directly – they went through the storage service or the APIs for reads and writes. This let us contain the storage changes to just the APIs and storage service, without needing to update the other various applications that interacted with the BDR.

For our migration, we decided to set up the new OCFL system while Fedora 3 was still running, and run them both in parallel during the migration. This would minimize the downtime, and the BDR would not be unavailable or read-only for days or weeks while the migration script migrated our ~1 million Fedora 3 objects. We set up our OCFL HTTP service layer, and updated our APIs to be able to post new objects to OCFL and update objects either in Fedora or OCFL. We also updated our storage service to check for an object in OCFL, and if the object hadn’t been migrated to OCFL yet, the storage service would fall back to reading from Fedora. Once these changes were enabled, new objects were posted to OCFL and updated there, and old objects in Fedora were updated in Fedora. At this point, object files could change in Fedora, but we had a static set of Fedora objects to migrate.
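The read path described above (try OCFL first, fall back to Fedora for objects not yet migrated) can be sketched as follows. This is a minimal illustration, not our actual storage service code: `DictStore`, `ObjectNotFound`, and `read_object` are hypothetical names, and the in-memory stores stand in for the real OCFL and Fedora backends.

```python
class ObjectNotFound(Exception):
    """Raised by a backend when it does not hold the requested object."""


class DictStore:
    """Hypothetical in-memory stand-in for a storage backend."""

    def __init__(self, objects):
        self._objects = dict(objects)

    def read(self, pid):
        try:
            return self._objects[pid]
        except KeyError:
            raise ObjectNotFound(pid) from None


def read_object(pid, ocfl, fedora):
    """Read from OCFL if the object is there, otherwise fall back to Fedora."""
    try:
        return ocfl.read(pid)
    except ObjectNotFound:
        # Not migrated yet: the Fedora copy is still the authoritative one.
        return fedora.read(pid)


# Example: one object already migrated, one still only in Fedora.
ocfl = DictStore({"test:1": b"migrated content"})
fedora = DictStore({"test:1": b"old content", "test:2": b"fedora-only content"})
print(read_object("test:1", ocfl, fedora))
print(read_object("test:2", ocfl, fedora))
```

The useful property of this arrangement is that callers never need to know which system holds an object, so each object can move to OCFL at any time without coordinating with client applications.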

The general plan for migrating was to divide the objects into batches, and migrate each batch individually. We mounted our Fedora storage a second time on the server as read-only, so the migration process would not be able to write to the Fedora data. We used a small custom script to walk the whole Fedora filesystem, listing all the object pids in 12 batch files. For each batch, we used our fork of the Fedora community’s migration-utils program to migrate the Fedora data over to OCFL. We migrated to plain OCFL, however, instead of creating Fedora 6 objects. We also chose to migrate the whole Fedora 3 FOXML file, and not store the object and datastream properties in small RDF files. If the object was marked as ‘D’eleted in Fedora, we marked it as deleted in OCFL by deleting all the files in the final version of the object. After the batch was migrated, we checked for errors.
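The batching step can be sketched roughly as below. This is not our actual script: it assumes object files are named with the URL-encoded form of `info:fedora/<pid>`, it deals pids out round-robin (the post does not say how the real script divided them), and all function and file names are illustrative.

```python
import os
from urllib.parse import unquote

NUM_BATCHES = 12  # number of batch files mentioned in the post
PREFIX = "info:fedora/"


def pid_from_filename(name):
    """Map a stored file name to a pid (assumed URL-encoded naming scheme)."""
    decoded = unquote(name)
    return decoded[len(PREFIX):] if decoded.startswith(PREFIX) else None


def list_pids(object_store_root):
    """Walk the (read-only) object store and yield every pid found."""
    for _dirpath, _dirnames, filenames in os.walk(object_store_root):
        for name in filenames:
            pid = pid_from_filename(name)
            if pid:
                yield pid


def split_into_batches(pids, num_batches=NUM_BATCHES):
    """Deal pids round-robin into num_batches lists of near-equal size."""
    batches = [[] for _ in range(num_batches)]
    for i, pid in enumerate(sorted(pids)):
        batches[i % num_batches].append(pid)
    return batches


def write_batch_files(batches, out_dir):
    """Write one pid per line to batch_01.txt, batch_02.txt, ..."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for n, batch in enumerate(batches, start=1):
        path = os.path.join(out_dir, f"batch_{n:02d}.txt")
        with open(path, "w", encoding="utf-8") as fh:
            fh.writelines(pid + "\n" for pid in batch)
        paths.append(path)
    return paths
```

A run would look like `write_batch_files(split_into_batches(list_pids("/path/to/objectStore")), "batches")`, where the object store path is a placeholder.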

One issue we ran into was slowness – one batch of 100,000 objects could take days to finish. This was much slower than a dev server migration, where we migrated over 30,000 objects in ~1.25 hours. We could have sped up the process by turning off fixity checking, but we wanted to make sure the data was being migrated correctly. We added memory to our server, but that didn’t help much. Eventually, we used four temporary servers to run multiple migration batches in parallel, which helped us finish the process faster.

We had to deal with another kind of issue where objects were updated in Fedora during the migration batch run (because we kept Fedora read-write during the migration). In one batch, we had 112 of the batch objects updated in Fedora. The migration of these objects failed completely, so we just needed to add the object PIDs to a cleanup batch, and then they were successfully migrated.

The migration script failed to migrate some objects because the Fedora data was corrupt – i.e. file versions listed in the FOXML were not on disk, or file versions were listed out-of-order in the FOXML. We used a custom migration script for these objects, so we could still migrate the existing Fedora filesystem files over to OCFL.

Besides the fixity checking that the migration script performed as it ran, we also ran some verification checks after the migration. From our API logs, we verified that the final object update in Fedora was on 2021-05-20. On 2021-06-22, we kicked off a script that took all the objects in the Fedora storage and verified that each object’s Fedora FOXML file was identical to the FOXML file in OCFL (except for some objects that didn’t need to be migrated). Verifying all the FOXML files shows that the migration process was working correctly, that every object had been migrated, and that there were no missed updates to the Fedora objects – because any Fedora object update would change the FOXML. Starting on 2021-06-30, we took the lists of objects that had an error during the migration and used a custom script to verify that each of the files for that object on the Fedora filesystem was also in OCFL, and the checksums matched.
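The file-by-file comparison at the heart of these checks can be sketched as below. This is an illustration only: the choice of SHA-256, the function names, and the idea of feeding it (pid, Fedora path, OCFL path) tuples are assumptions, not a description of our actual verification script.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Checksum a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(pairs):
    """Return the pids whose Fedora and OCFL copies differ.

    pairs is an iterable of (pid, fedora_path, ocfl_path) tuples, for example
    each object's FOXML file in Fedora and its counterpart in OCFL.
    """
    mismatched = []
    for pid, fedora_path, ocfl_path in pairs:
        if sha256_of(fedora_path) != sha256_of(ocfl_path):
            mismatched.append(pid)
    return mismatched
```

An empty result from `find_mismatches` over every object is what lets you conclude that nothing was missed, since any late update in Fedora would change the FOXML and therefore its checksum.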

Once all the objects were migrated to OCFL, we could start shutting down the Fedora system and removing code for handling both systems. We updated the APIs and the storage service to remove Fedora-related code, and we were able to update our indexer, storage service, and Cantaloupe IIIF server to read all objects directly from the OCFL storage. We shut down Fedora 3, and did some cleanup on the Fedora files. We also saved the migration artifacts (notes, data, scripts, and logs) in the BDR to be preserved.

Training Specialist (Job Posting) / Evergreen ILS

NC Cardinal is looking for a Training Specialist to join our team to help us build and deliver training for our member libraries. We are a growing organization, serving more than 50 percent of public libraries throughout North Carolina. We are a team of five people who work remotely the majority of the time and are based out of the State Library in Raleigh, North Carolina. We’re two hours from the beach and three hours from the mountains and in close proximity to a booming technology sector.

For more information on the open position, please see the posting:

What is slow librarianship? / Meredith Farkas

Last week, there was a lot of chatter about slow librarianship on social media. People were looking for writing on the subject and I realized that my work is scattered all around in such a disembodied way across presentations, slides, and blog posts. With this post, I hope to make a bit clearer my own vision of slow librarianship, with gratitude to those who started the conversation before I took it up, especially Julia Glassman with her 2017 article “The innovation fetish and slow librarianship: What librarians can learn from the Juicero.” And I’d love to hear your thoughts in the comments or on social media!

Here is my (evolving) definition: Slow librarianship is an antiracist, responsive, and values-driven practice that stands in opposition to neoliberal values. Workers in slow libraries are focused on relationship-building, deeply understanding and meeting patron needs, and providing equitable services to their communities. Internally, slow library culture is focused on learning and reflection, collaboration and solidarity, valuing all kinds of contributions, and supporting staff as whole people. Slow librarianship is a process, not a destination; it is an orientation towards our work, ourselves, and others that creates positive change. It is an organizational philosophy that supports workers and builds stronger relationships with our communities.

My slow librarianship takes inspiration from the slow food movement, which was a response to the impact of globalization on food. While it started as a protest against building a McDonalds at the Spanish Steps in Rome, it became a global movement focused on local food culture, sustainability, the ethical-sourcing of food, social justice and pleasure. My slow librarianship also takes inspiration from a variety of additional sources including the Great Lakes Feminist Geography Collective and their brilliant article “For Slow Scholarship: A Feminist Politics of Resistance through Collective Action in the Neoliberal University;” adrienne maree brown’s visions of Emergent Strategy and Pleasure Activism; Leah Lakshmi Piepzna-Samarasinha’s models for collective care and others described in her book Care Work: Dreaming Disability Justice; Tara Brach’s RAIN model for mindful healing and self-compassion; Michael Sandel’s indictment of our society’s phony meritocracy and abandonment of the common good and dignity of work in The Tyranny of Merit; Richard Wolff’s vision of worker-directed workplaces in Democracy at Work: A Cure for Capitalism; David Graeber’s and Dean Spade’s inspiring work in support of mutual aid; Prentis Hemphill’s work on embodiment and healing justice; Jenny Odell’s vision for taking control of our attention in How to do Nothing; as well as thinkers in our own profession like Fobazi Ettarh, Julia Glassman, Kaetrena Davis Kendrick, Karen Nicholson, Jane Schmidt, Maura Seale, Amanda Leftwich, and others.

The Slow Food movement’s manifesto broke their philosophy down into three areas: Good, Clean, and Fair. I similarly tried to break my vision of the characteristics of slow librarianship down into three areas: Good, Human(e), and Thoughtful.

Good – Being good begins by recognizing that libraries have not always been good for everyone. This requires bringing in critical practice, where we identify, question, and ultimately dismantle structures, practices, policies, and assumptions that oppress, exploit, exclude, or otherwise cause harm to our patrons or library workers. To become an antiracist library, library workers must look within the organizational structures of the library to see how white supremacy culture operates and then find new ways of communicating, organizing ourselves, and practicing librarianship that center BIPOC. Slow libraries are driven by their values over a desire to innovate or produce visible wins, and priorities will be determined based upon a deep understanding of the needs of patrons and how in-line they are with library values. They also center the needs of those with the greatest need in their communities and judge themselves by how they serve those most marginalized.

Human(e) – In humane organizations, library workers are supported as whole people with bodies and responsibilities and limitations beyond the workplace. Managers recognize the humanity of their employees and workers are viewed as more than just what they produce in a given week. Humane managers care about the well-being of their employees and foster environments where all staff feel a sense of psychological safety and feel supported in setting boundaries that nurture their well-being. Workers feel like they can be their real, human selves at work and can take time when they are struggling with their physical or mental health or are caring for someone else. A slow library rejects productivity culture and recognizes that creativity and valuable gains often come from fallow time and time spent building relationships within the workplace and in the community. Building relationships in the community that help us better understand and support our patrons is particularly valued, and managers recognize that relationship and partnership-building takes time. Slow workplaces also encourage collaboration and collective care through their structures and reward systems.

Thoughtful – A slow organization is a contemplative organization that encourages employees to slow down. In the absence of a sense of urgency, workers are less afraid of failure and are able to value process over product, especially the collaborative learning that comes from projects when people slow down. The organization is a learning culture where workers want to know more about their patrons’ needs and how they use the library, are given time and funding to learn and grow, and come together as an organization to reflect and learn. A thoughtful organization embraces a culture of appreciation and gratitude where the focus is on finding and highlighting the good things workers do.

Slow librarianship clearly requires a lot of personal work to help us develop a mindset that can both critically evaluate current structures in libraries and envision radical new futures. I’ll be addressing that further below.

In my talks and on social media, I’ve encountered a few misunderstandings of slow librarianship that I’d like to address below. I may add to these as conversations around this topic continue.

Slow librarianship means doing less, not caring, and/or embracing mediocrity.

Slow librarianship is against neoliberalism, achievement culture, and the cult of productivity, but I see its opposite as being driven by our own values and authentic desires, not necessarily being mediocre. We are so programmed in western societies to see being busy as being important, to chase external validation, and to try to make our lives look like external norms of success. Last year, I was listening to the wonderful podcast Everything Happens (and I wish I could remember the specific episode, but I know it was an early one in Season 1) when the host, Kate Bowler, a professor at Duke Divinity School living with stage 4 cancer asked “am I built from the outside in or am I built from the inside out?” How much is your vision of what success looks like based on external norms or a desire for external validation? How often do you compare yourself to others? Do you ever feel like you’ve done or achieved enough? Until a few years ago, I never considered what it would mean if I was enough right now. Right in this moment. And asking that question changed me. What if you are good enough just as you are today? What if you didn’t need to keep proving yourself? How might that sense of enoughness change your own priorities? Seven years after leaving a particularly traumatic tenure-track job, I’m still untangling which are my authentic desires, which are focused on pleasing the people who hold power over my future, and which have been programmed into me as someone who never felt they were enough.

For some people, slow librarianship may indeed look like doing less, especially if they have, in the past, prioritized work over their own well-being. For others, it might mean doing more that is deeply tied to our values. But I think for most people, it might mean producing less, but actually doing more meaningful work. In my own world, meaningful relationship-building with faculty in my subject liaison areas takes time. It often means going to meetings and joining committees that don’t look directly related to some library goal. But I’ve found that such activity often leads to the most meaningful instructional collaborations with faculty. If we are laser-focused on creating short-term wins to look productive to our manager or to get a good annual review, we will never feel like we can take that time, and thus, we will miss out on meaningful collaborations that will be better for students in the long-run.

Slow librarianship requires changes in how we operate within our cultures, but in order to do that, we have to be able to see the structures and assumptions that determine the choices we have/make and how we see ourselves and our work. That requires a level of mindfulness and reflective practice that so many of us don’t cultivate in our busy lives. In her book How to do Nothing, Jenny Odell talks about her dissatisfaction “with untrained attention, which flickers from one new thing to the next, not only because it is a shallow experience, or because it is an expression of habit rather than will, but because it gives me less access to my own human experience” (119). It truly had not occurred to me until I read that just how much I was letting my own anxiety and the attention economy drive what I paid attention to. We live so much of our lives on autopilot, not noticing so much of what is around us. Mindfulness allows us to take control of our attention and to use it to find our own authentic desires as well as develop a better vision for the future of our library and our work.

We’re also not going to be able to build antiracist libraries if we don’t deeply interrogate how we uphold white supremacy. It took time for me to recognize that a lot of the aspects of achievement culture and work addiction I used to embody were absolutely characteristics of white supremacy culture. It takes deep attention to really interrogate the assumptions and structures in our workplaces and to be able to engage in this work. It also takes time to engage our BIPOC colleagues in envisioning a future that centers them and their concerns. That may also be time that doesn’t look productive to our managers, but it is CRITICAL.

But also, in thinking about productivity, people need fallow time to reflect, to learn, and to be creative. When I think about times in my career when I was most overloaded with work, I could tell that I was not capable of the deep-thinking I can do when I don’t feel like an overloaded, constantly buffering computer. Being overloaded makes it hard to prioritize and to see the big picture. How can I become a better teacher if I don’t even have time after a class to reflect on what went well and what didn’t? If we want to do really meaningful work, we have to recognize the time it takes. And I think we also need to value collaboration (which takes more time), because our best work comes from collaboration. I spent a lot of time in my career doing projects on my own, and when I compare those to the work I’ve done with others, the latter were not only better products, but far more personally-satisfying processes.

So maybe to some, slow librarianship will look like doing less, but I see it as slowing down in order to ask why we’re doing what we’re doing so that we can do our best and most meaningful work.

Slow librarianship is for the privileged. You can’t adopt slow practices if you’re working in precarity.

This is a very real concern, as with anything that involves some amount of self-work, but I address this in my presentations on slow by putting the focus on relationship-building, collective care, and solidarity. Our ability to slow down, to resist, or to take control of our attention is very much determined by the conditions under which we live and work. Anthony Giddens (as quoted in Craig and Parkins) writes about how “access to means of self-actualization becomes itself one of the dominant focuses of class division and the distribution of inequalities more generally” (13). In her book How to do Nothing: Resisting the Attention Economy, Jenny Odell talks about how the ability to refuse and to take time for contemplation is not accessible to everyone and brings up “the frightening potential of something like gated communities of attention: privileged spaces where some (but not others) can enjoy the fruits of contemplation and the diversification of attention” (199). If slow is only seen in terms of liberating the self, there is certainly a huge risk that it could become just another tool that is only accessible to those with the most privilege. That can be seen in the slow food movement where some people with means embrace slow food only in terms of buying and enjoying local food. However, I think one of the most important pieces of the slow movement is the focus on solidarity and collective care and a move away from the individualism that so defines the American character. If you’re only focused on your own liberation and your own well-being, you’re doing it wrong. In Emergent Strategy, adrienne maree brown writes about what this has meant for her:

It has meant learning to work collaboratively, which goes against my inner “specialness.” I am socialized to seek achievement alone, to try to have the best idea and forward it through the masses. But that leads to loneliness and, I suspect, extinction. If we are all trying to win, no one really ever wins. (42)

That takes a lot of unlearning for those of us who grew up in highly individualistic cultures, especially in America where the myth of the meritocracy has taken on an almost religious quality. And our places of work as well as our professional recognition and reward systems encourage us to see ourselves as individuals in competition with our colleagues. When there are limited raises, limited promotions, limited accolades, caste systems, precarity, or even just a general sense of scarcity in the workplace, people will see themselves as being in competition with their colleagues and their focus will be on finding ways to make themselves shine. I wrote about this in my previous blog post:

I’ve been thinking a lot about how individualism is at the root of so many of our problems and how things like solidarity, mutual aid, and collective action are the answer. Capitalism does everything it can to keep us anxious and in competition with each other. It gave us the myth of meritocracy – the idea that we can achieve anything if we work hard enough, that our achievements are fully our own (and not also a product of the privileges we were born to and the people who have taught us, nurtured us, and helped us along the way), and that we deserve what we have (and conversely that others who have less deserve their lot in life). It gave us petty hierarchies in the workplace – professional vs. paraprofessional, faculty vs. staff, full-time vs. part-time, white-collar vs. blue-collar – that make us jealously guard the minuscule privilege our role gives us instead of seeing ourselves in solidarity with all labor. It’s created countless individual awards and recognitions that incentivize us not to collaborate and to find ways to make ourselves shine. It’s created conditions of scarcity in the workplace where people view their colleagues as threats or competitors instead of rightly turning their attention toward the people in power who are responsible for the culture. This is how the system was made to work; to keep us isolated and anxious, grinding away as hard as we can so we don’t have time or space to view ourselves as exploited workers. It is only through relationships and collaboration, through caring about our fellow workers, through coming together to fight for change, that things will improve. But that requires us to focus less on ourselves and our desire to shine, rise, or receive external recognition, and to focus more on community care and efforts to see everyone in our community rise. 
It goes against everything capitalism has taught us, but we’ll never create meaningful change unless we replace individualism with solidarity and care more about the well-being of the whole than the petty advantages we can win alone.

In her article “Why Office Workers Didn’t Unionize,” Anne Helen Petersen wrote about how white-collar workers largely did not unionize because they 1) wanted to see themselves as having a status above blue-collar workers and 2) were socialized by their places of work to jealously guard the minimal privileges they had over their colleagues (think Dwight Schrute in The Office being Assistant to the Regional Manager, a functionally meaningless title given to him to keep him loyal to his boss).

Over time, even Dwight, the ambitious careerist solely focused on getting ahead and being better than everyone, began to see the value of relationships with his colleagues and began to realize that getting ahead in his work perhaps wasn’t the be-all-end-all. He stopped seeing relationship-building as a waste of time. The show ended with him as the regional manager, but he also had a full life with friends, family, love, and the respect of his colleagues. While his character was certainly a caricature, imagine a world in which everyone took a few steps away from seeing themselves as individuals who had to jealously guard their advantages and towards seeing themselves as being in solidarity with their fellow workers. Petersen writes:

How would your office culture shift if you actually thought of yourself in solidarity with your coworkers — and together, advocating for greater resources — instead of competition with them for the few resources allocated to you? How would your conception of yourself shift if you felt empowered not by your hopes for eventual advancement, but by identification with others?

Slow librarianship requires us to look beyond ourselves to try to help create the conditions that allow everyone to slow down. That means that those of us with more job security and autonomy need to fight structures that create precarity, scarcity, and competition. I never felt like I could slow down when I was on the tenure track in my previous job. I felt like I had to be laser focused on achieving in all the ways that were externally valued so my tenure file would be bullet-proof. And even years after I left that job, I was still running on autopilot, doing things that were more motivated by a need for gold stars than by my most strongly-held values. People have to feel a sense of safety in the workplace to be able to do the work of slow librarianship rather than focusing on achievement culture. They also need some measure of autonomy. If those of us who have privilege are not focused on supporting our colleagues who don’t, we are not practicing slow librarianship — we are only practicing self-liberation.

While community care is at the heart of slow librarianship as I see it, that cannot happen when people are not taking care of themselves. Self-care doesn’t have to be reduced to bubble baths, spa days, and buying things for ourselves. Self-care is about setting boundaries that maximize our well-being and provide us with capacity to focus on community care. It can be about resting when we need it rather than continuing to grind when we’re far from being at 100%. It can feel selfish, but when people feel stressed and depleted, they tend to get a tunnel vision that makes taking care of others much more difficult. We can’t be truly compassionate towards others if we don’t show compassion toward ourselves. As adrienne maree brown says “the work of cultivating personal resilience, healing from trauma, self-development and transformation is actually a crucial way to expand what any collective body can be. We heal ourselves, and we heal in relationship, and from that place, simultaneously, we create more space for healed communities, healed movements, healed worlds” (Emergent Strategy 144).

Of course all of this is just one person’s vision of slow librarianship based on my own experiences and readings. I’ve very much appreciated the conversations and critiques (even the mean ones) I’ve heard from others as they have helped me to develop this vision. Since collaboration serves to improve ideas, I would love to hear your thoughts, questions, critiques, and more! Thank you for reading all this!



Brach, Tara. Radical Compassion: Learning to Love Yourself and Your World with the Practice of RAIN. Viking, 2019. 

Bowler, Kate. Everything Happens Podcast.

brown, adrienne maree. Emergent Strategy: Shaping Change, Changing Worlds. AK Press, 2017.

brown, adrienne maree. Pleasure Activism: The Politics of Feeling Good. AK Press, 2019.

Craig, Geoffrey, and Wendy Parkins. Slow Living. Bloomsbury Publishing, 2006. 

Glassman, Julia. “The innovation fetish and slow librarianship: What librarians can learn from the Juicero.” In the Library with the Lead Pipe, 18 Oct. 2017.

Graeber, David & Andrej Grubacic. “Introduction to Mutual Aid: An Illuminated Factor of Evolution.” Retrieved from The Anarchist Library (though it was written as an introduction to the new edition of Peter Kropotkin’s book Mutual Aid: An Illuminated Factor of Evolution).

Hemphill, Prentis. “Prentis Hemphill on Choosing Belonging.” In Young, Ayana. For the Wild podcast, 28 July 2021. (This is just one of many places where you can learn about Prentis’ work. They also have their own fantastic podcast.)

Mountz, Alison, et al. “For slow scholarship: A feminist politics of resistance through collective action in the neoliberal university.” ACME: An International Journal for Critical Geographies 14.4 (2015): 1235-1259.

Odell, Jenny. How to do nothing: Resisting the attention economy. Melville House Publishing, 2020.

Piepzna-Samarasinha, Leah Lakshmi. Care Work : Dreaming Disability Justice. Arsenal Pulp Press, 2018.

Petersen, Anne Helen. “Why Office Workers Didn’t Unionize.” Culture Study, 18 Oct. 2020.

Sandel, Michael J. The Tyranny of Merit: What’s Become of the Common Good? Penguin Books, 2021.

Spade, Dean. Mutual aid: Building solidarity during this crisis (and the next). Verso Books, 2020.

Wolff, Richard D. Democracy at Work: A Cure for Capitalism. Haymarket Books, 2012.


Open Data Day 2022 Update: Focus on the Ocean / Open Knowledge Foundation

Today we are pleased to announce a new Open Data Day partnership with Friends of Ocean Action that aims to support UN Sustainable Development Goal 14 – to ‘conserve and sustainably use our ocean, seas and marine resources for sustainable development’.

What’s this about?

Open Data Day is an annual, global celebration of open data. Each year, 300+ groups from around the world create local events to:

  • show the benefits of open data in their local community; and
  • encourage the adoption of open data policies in government, business and civil society.

All outputs are open for everyone to use and re-use.

For several years we have worked with our partners to deliver hundreds of $300 mini-grants to help people organise Open Data Day events in their communities. These mini-grants have been distributed under four vertical themes:

  • Data for Equal Development
  • Environmental Data
  • Open Mapping; and
  • Tracking Public Funds.

See last year’s events here.

Today, we are pleased to announce a fifth vertical theme for Open Data Day 2022:

  • Ocean Data for a Thriving Planet

Who is involved?

Open Data Day is a community event. Everyone is encouraged to participate.

Last year 327 events were registered on the Open Data Day website, with 56 groups from 36 countries receiving financial support to run their event.

The Ocean Data for a Thriving Planet mini-grant scheme is supported by our partner Friends of Ocean Action – which is convened by the World Economic Forum, in collaboration with the World Resources Institute.

Friends of Ocean Action is a coalition of over 70 ocean leaders who are fast-tracking solutions to the most pressing challenges facing the ocean. Their work falls into five impact pillars – one of which is Creating a Digital Ocean. Learn more here.

The Ocean Data for a Thriving Planet mini-grant scheme has received funding from Schmidt Ocean Initiative. We are extremely grateful for their support.

What’s next?

Over the coming months we will share more information with you about this new initiative.

In the meantime, why not check out the list of ocean data resources available on the Open Data Day website, and start planning your ocean themed Open Data Day event!

– –
Photo of ocean by Kellie Churchman from Pexels

A Writer I Admire / David Rosenthal

Wouldn't it be great to write like Maciej Cegłowski? I've riffed off many of his riveting talks, including What Happens Next Will Amaze You, Haunted By Data, The Website Obesity Crisis and Anatomy of a Moral Panic. Now, in a must-read tweetstorm, Cegłowski takes on "Web3", the emerging name for the mania surrounding blockchains and cryptocurrencies. He starts from this tweet:
The replies it garnered are hilarious. Below the fold, some extracts from Cegłowski to persuade you to read his whole thread (Unroll here).

Cegłowski starts:
Then he lands this haymaker:
There are three non-fraud foundational problems with "web3":
  1. No way to reference anything in the real world (oracle problem)
  2. Immutable code makes any smart contract its own bug bounty.
  3. Everything breaks (more) unless expensive distributed systems are run in perpetuity.
He credits Trammell Hudson (@qrs) for #2, which brilliantly captures the type of problem which I discussed here:
The most amazing thing about the Compound fiasco is this:
There are a few proposals to fix the bug, but Compound’s governance model is such that any changes to the protocol require a multiday voting window, and Gupta said it takes another week for the successful proposal to be executed.
They were so confident in their programming skills that they never even considered that an exploit was possible. They built a system where, if an exploit was ever discovered, the bad guys would have ~10 days to work with before it could be fixed.

Engineering is all about asking "what could possibly go wrong?" but these cowboys are so dazzled by the $$$$$ that they never ask it.
He understands that "decentralization" is the Holy Grail that drives the technologists, despite (or perhaps because of) its being unattainable in practice. He writes:
There's a poorly articulated sense of "decentralization = freedom" that drives this culture, as well as the familiar Year Zero mentality of silicon valley that enjoys reinventing human relationships from first principles and moving them into code. And there's oceans of real money
The "oceans of real money" are the root of the evil. For example:
Note that A16Z just raised a $2.2B fund dedicated to pouring money into similar schemes. This is enough to fund 650 Chia-sized ventures! (David Gerard aptly calls Andreessen Horowitz "the SoftBank of crypto")
The "oceans of real money" are chasing the real big bucks, which are from consumers (Consumer spending is currently 69% of US GDP). There are many companies trying to be the channel between consumers and cryptocurrencies:
The real villains to focus on right now are companies like Coinbase and Stripe that are trying to make this connection happen, with one leg in the regulated financial system and one leg in the cesspit of blockchain. They should be regulated into a fine pink mist.
Here's where I disagree with Cegłowski. He writes:
If there is value in the decade plus of experimentation with blockchains (and I'm bending over backwards here to try to see it) then it will find a way to break through without this Niagara of real money investment. We'll see at least one application that is not self-referential
The "Niagara of real money" is preventing any actual value emerging, for three reasons:
  • Almost the entire discourse about blockchains and their applications is corrupt. The extraordinary Gini coefficients of cryptocurrencies give the whales the means, motive and opportunity to hype their HODL-ings so that number go up "to the moon", and to practice social media "DDoS" against skeptics with armies of cultists.
  • Anyone wanting to develop a blockchain-based application needs to buy into the myths of "decentralization" and "immutability" and "security". They're already in the cult.
  • More fundamentally, the only defense permissionless blockchains have against Sybil attacks is to make participating in consensus be expensive, so that the cost of mounting an attack is much greater than the rewards it could obtain. Thus miners need to be reimbursed for their expensive participation. They can't be reimbursed by some central agency, that wouldn't be "decentralized"; they have to be reimbursed via the blockchain's cryptocurrency.
Thus you can't have "decentralized" without the corrupting cryptocurrency. Any possible blockchain application with actual value will be corrupted by the essential nature of the infrastructure upon which it is built.

How to use text mining to address research questions / Eric Lease Morgan

This tiny blog posting outlines a sort of recipe for the use of text mining to address research questions. Through the use of this process the student, researcher, or scholar can easily supplement the traditional reading process.

  1. Articulate a research question – This is one of the more difficult parts of the process, and the questions can range from the mundane to the sublime. Examples might include: 1) how big is this corpus, 2) what words are used in this corpus, 3) how have given ideas ebbed & flowed over time, or 4) what is St. Augustine’s definition of love and how does it compare with Rousseau’s?
  2. Identify one or more textual items that might contain the necessary answers – These items may range from sets of social media posts to sets of journal articles, sets of reports, sets of books, etc. Point to the collection of documents.
  3. Get items – Even in the age of the Internet, when we are all suffering from information overload, you would be surprised how difficult it is to accomplish this step. One might search a bibliographic index and download articles. One might exploit some sort of application programmer interface to download tweets. One might do a whole lot of copying & pasting. Whatever the process, I suggest one save each and every file in a single directory with some sort of meaningful name.
  4. Add metadata – Put another way, this means create a list, where each item on the list is described with attributes which are directly related to the research question. Dates are an obvious attribute. If your research question compares and contrasts authorship, then you will need author names. You might need to denote language. If your research question revolves around types of authors, then you will need to associate each item with a type. If you want to compare & contrast ideas between different types of documents, then you will need to associate each document with a type. To make your corpus more meaningful, you will probably want to associate each item with a title value. Adding metadata is tedious. Be forewarned.
  5. Convert items to plain text – Text mining is not possible without plain text; you MUST have plain text to do the work. This means PDF files, Word documents, spreadsheets, etc need to have their underlying texts extracted. Tika is a very good tool for doing and automating this process. Save each item in your corpus as a corresponding plain text file.
  6. Extract features – In this case, the word “features” is text mining parlance for enumerating characteristics of a text, and the list of such things is quite long. It includes: size of documents measured in number of words, counts & tabulations (frequencies) of ngrams, readability scores, frequencies of parts-of-speech, frequencies of named entities, frequencies of given grammars such as noun phrases, etc. There are many different tools for doing this work.
  7. Analyze – Given a set of features, once all the prep work is done, one can actually begin to address the research question, and there are a number of tools and subprocesses that can be applied here. Concordancing is one of the quickest and easiest. From the features, identify a word of interest. Load the plain text into a concordance. Search for the word, and examine the surrounding words to see how the word was used. This is like ^F on steroids. Topic modeling is a useful process for denoting themes. Load the texts into a topic modeler, denote the number of desired topics. Run the modeler. Evaluate the results. Repeat. Associate each document with a metadata value, such as date. Run the modeler. Pivot the results on the date value. Plot the results as a line chart to see how topics ebbed & flowed over time. If the corpus is big enough (at least a million words long), then word embedding is a useful way to learn what words are used in conjunction with other words. Those words can then be fed back into a concordance. Full text indexing is also a useful analysis tool. Index the corpus complete with metadata. Identify words or phrases of interest. Search the index to learn what documents are most relevant. Use a concordance to read just those documents. Listing grammars is also useful. Identify a thing (noun) of interest. Identify an action (verb) of interest. Apply a language model to a given text and output a list of all sentences with the given thing and action to learn how the two are used together. An example is “Ahab has”, and the result will be lists of matching sentences including “Ahab has…”, “Ahab had…”, or “Ahab will have…”
  8. Evaluate – Ask yourself, “To what degree did I address the research question?” If the degree is high, or if you are tired, then stop. Otherwise, go to Step #1.
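
As a rough illustration of Steps #6 and #7, the following Python sketch counts words, tabulates bigrams, and builds a tiny concordance using only the standard library. It is not how the Distant Reader or any particular tool works; the function names (tokenize, ngram_counts, concordance) and the sample sentence are hypothetical, invented here for illustration, and the tokenizer is deliberately naive.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (naive on purpose)."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n=2):
    """Tabulate the frequency of each n-gram (a tuple of n consecutive tokens)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)


def concordance(tokens, word, width=3):
    """Return keyword-in-context lines with up to `width` tokens on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Hypothetical stand-in for one plain text file produced in Step #5.
sample = "Ahab has a ship. Ahab has a leg. The whale has no name."
tokens = tokenize(sample)

print(len(tokens))                           # document size in words
print(ngram_counts(tokens).most_common(2))   # two most frequent bigrams
for line in concordance(tokens, "ahab"):     # how is "ahab" used?
    print(line)
```

In practice one would loop over every file in the corpus directory and save the features alongside the metadata from Step #4, so the results can later be pivoted on values such as date or author.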

Such is an outline of using text mining to address research questions, and there are a few things one ought to take away from the process. First, this is an iterative process. In reality, it is never done. Similarly, do not attempt to completely finish one step before you go on to the next. If you do, then you will never get past step #3. Moreover, computers do not mind if processes are done over and over again. Thus, you can repeat many of the subprocesses many times.

Second, this process is best done as a team of at least two. One person plays the role of domain expert armed with the research question. A person who knows how to manipulate different types of data structures (different types of lists) with a computer is the other part of the team.

Third, a thing called the Distant Reader automates Step #5 and #6 with ease. To some degree, the Reader can do #3, #4, and #7. The balance of the steps are up to people.

Finally, the use of text mining to address research questions is only a supplement to the traditional reading process. It is not a replacement. Text mining scales very well, but it does poorly when it comes to nuance. Similarly, text mining is better at addressing quantitative-esque questions; text mining will not be able to answer why questions. Moreover, text mining is a repeatable process enabling others to verify results.

Remember, use the proper process for the proper problem.

The Great Mining Migration / David Rosenthal

China's Cryptocurrency Crackdown has been dramatically effective. The total hashrate dropped by over a half from its peak before recovering. As I write it is still down about 15% from the peak.

The latest figures from the Cambridge Bitcoin Energy Consumption Index provide more detail on what happened. In May China was producing 70.9 Exahash/sec and 44% of the total, as against 75% in 2019. In July, it produced none, triggering the collapse in the hash rate.

The gradual recovery happened as the containers of mining rigs reached their destinations, which by August were mostly in the US (42.7 Exahash/sec), Kazakhstan (21.9 Exahash/sec) and Canada (11.5 Exahash/sec).

If the migration continues to favor the US and Canada, which as of August accounted for about 45% of the total, it would bring closer the ability of Western nations to turn off Bitcoin, as outlined in Unstoppable Code?.
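
The figures above can be cross-checked with a little arithmetic. The Python sketch below uses only the numbers quoted in this post; the totals it derives are estimates implied by the rounded percentages ("44%", "about 45%"), not values taken from the Cambridge index itself, and the variable names are my own.

```python
# Hashrates quoted above, in Exahash/sec.
china_may = 70.9
china_share_may = 0.44          # China's quoted share of the total in May
august = {"US": 42.7, "Kazakhstan": 21.9, "Canada": 11.5}
us_canada_share = 0.45          # "about 45%", so the result is approximate

# Total network hashrate implied by China's May figures.
total_may = china_may / china_share_may

# Combined US + Canada hashrate, and the August total it implies.
us_canada = august["US"] + august["Canada"]
total_august = us_canada / us_canada_share

print(f"Implied May total:    {total_may:.1f} EH/s")
print(f"US + Canada (August): {us_canada:.1f} EH/s")
print(f"Implied August total: {total_august:.1f} EH/s")
print(f"Kazakhstan's implied August share: {august['Kazakhstan'] / total_august:.0%}")
```

Because the shares are rounded, these implied totals should be read as ballpark figures only.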

Source Evaluation: Supporting Undergraduate Student Research Development / In the Library, With the Lead Pipe

By Iris Jastram, Claudia Peterson, and Emily Scharf

In Brief 

Each year since 2008, librarians at Carleton College read samples of sophomore writing as part of the Information Literacy in Student Writing project. The data captured through this project combined with our experiences in consultations and instruction sessions give us a richer understanding of undergraduate information literacy habits. We highlight two challenges for novices: evaluating and selecting sources, and understanding the purpose and methods of integrating sources into written work. We discuss the evidence that leads us to these conclusions and the methods we use to promote student development in these priority areas.


Carleton College’s Reference & Instruction librarians have engaged in the Information Literacy in Student Writing project (ILSW) since 2008.1 Each year, our observations while reading hundreds of papers, the data captured through this project, and our experiences in consultations and instruction sessions give us a richer understanding of undergraduate information literacy habits. As this project has evolved over ten years, students’ research behaviors have changed, as have the methods by which students retrieve their sources and the pre-college experiences our students have had with research. In previous articles, we have examined how this project has helped our relationship with faculty members and how it has impacted our information literacy instruction.2 In July 2018, 10 years after our initial reading, faculty members, librarians, and academic staff came together to read papers written by a representative sample of our sophomores submitted as part of the campus-wide Sophomore Writing Portfolio.3 The conversations we had during our reading sessions, the statistical analyses done by our data consultant, and our extensive experiences with research instruction and individual consultations, highlight two priorities for information literacy development for novice researchers: evaluating sources, and incorporating evidence into written work. In this article we present the evidence that leads us to these conclusions, and we discuss our work with students to help them learn to evaluate sources and use evidence.

Our Information Literacy Setting

Carleton College is a highly-selective private, four-year, residential, liberal arts institution located in Northfield, Minnesota. The Gould Library at Carleton College serves 2,000 students and 210 full time faculty members.4 From 2015-2021, the student population has had the following demographic characteristics: 12% first generation, 10% international, 29% students of color (who identify as at least one ethnicity other than white), and 58% receiving financial aid. Reference and Instruction librarians offer reference service 49 hours per week and also teach information literacy concepts and research skills in library instruction and student research consultations. We average nearly 800 consultations per year, 160 instruction sessions (primarily one-shot instruction), and 2,000 reference questions per year during non-pandemic years.

While we have a strong Writing Across the Curriculum program,5 there is no required composition course where students learn source use consistently. Instead, the library and several other academic support units on campus share collaborative responsibility for course-integrated support for student development of some key information literacy skills, such as citing sources, academic integrity, creating annotated bibliographies, reading critically, and the like. Key collaborators for us are the Writing Across the Curriculum Program, the Quantitative Resource Center, the Writing Center, Academic Technologies, and more. Together we work to reveal the constructed and variable disciplinary conventions that students must be able to recognize and interpret and, in many cases, reproduce through their research products. We also help students learn to work around and push against these conventions as necessary. Studying the products of this distributed instructional model by reading the writing produced in classes spanning the curriculum gives us a holistic look at student capacities, strengths, and struggles that we could not find by assessing library interactions specifically. Studying cross-disciplinary student writing in this way also helps us develop shared goals and priorities, both within the library and with other staff and faculty. 

In the summer of 2018, Carleton College generously funded an expansion of our regular ILSW reading to include faculty and academic staff as readers, and to hire a data consultant to help us with our analysis. We were able to bring together stakeholders from the various areas of campus that share leadership in teaching information literacy skills to students, including the director of the Writing Across the Curriculum Program, academic technologists, librarians, the director of the Learning and Teaching Center, and faculty from the arts, humanities, social sciences, and STEM fields. In all, five librarians, six faculty, and one academic technologist gathered for our norming session and to read an average of sixteen papers each for a total of 150 papers6 from a stratified random sample of sophomore students.7 Together, readers identified clear, statistically significant trends in student information literacy practices. These trends also dovetail with research done in the scholarship of Communication Theory.

The Difficulties of Using Evidence: Literature Review

It is probably not news to any librarian that students sometimes struggle to deploy information effectively in their writing. The entangled cultural maneuvers involved in effective source-based writing are more and more present in our professional thinking, in our literature,8 and in the Framework for Information Literacy in Higher Education.9 Scholarship founded in Genre Theory and Communication Theory also sheds light on the complications of learning to participate in scholarly communication. Academic genres of writing are embodiments of complex purposes and contexts. Any given utterance is “the product of the interaction between speakers and … the whole complex social situation in which the utterance emerges,”10 and “typified” utterances become genres packed with culturally encoded context and expectation.11 The written genres that students are asked to produce, such as lab reports, position papers, research papers, and reaction papers are therefore encoded with whole constellations of socially constructed meaning. Unfortunately for novices and outsiders, however, these culturally encoded genres shape everything from subtle signals about where this writing sits in the scholarly conversation to how readers should interpret the claims presented, or even what counts as good evidence.12

Meanwhile, the most typical genre of academic writing that many students have read prior to college is the textbook, and the social context baked into that genre places the student into the role of “information receiver and learner,” or as psychologist Barbara K Hofer says, “passive receptors” of knowledge.13 What is contestable or not, what counts as evidence, and how authority works are all different in the context of information reception than in the context of information creation. Students may therefore think that their primary goal is to communicate facts rather than build new insight, or they may not understand how to draw upon community expectations of authority, evidence, and argument to further their rhetorical goals. In mimicking the formats of academic writing without understanding the culturally encoded motivations and affordances of those academic genres, students struggle to use evidence to communicate effectively in their writing.

This is not to denigrate replication and mimicry. One primary way that students begin to understand academic writing and disciplinary rhetoric is by mimicking what they read. From the field of education comes the term Threshold Concepts. “A threshold concept can be considered as akin to a portal, opening up a new and previously inaccessible way of thinking about something. It represents a transformed way of understanding, or interpreting, or viewing something without which the learner cannot progress.”14 Students are understood to function in a liminal state of mimicry until they’ve crossed the threshold into a new state of understanding.15 With the advent of the Association of College and Research Libraries’ Framework for Information Literacy for Higher Education,16 librarianship and the library literature have increasingly engaged with threshold concepts. Many of us (the authors very much included) remember well our college days when it felt far more manageable to recreate what critical thinking looked like in writing than it was to actually think critically about our materials. Many of us have great empathy for the feelings of inadequacy and outright fear that can come with assignments to create novel contributions to fields of study.

Evaluating and Selecting Sources

Librarians know that source evaluation is a difficult task, especially for novices in a field. There are check-list tools like the CRAAP test17 to help students learn basic source evaluation, and we know that students apply broad heuristics to the challenge of sifting through the millions of sources available to them.18 But since “authority is constructed and contextual,”19 it is no surprise that there is literature criticizing these simplified evaluation strategies.20 It is also no surprise that when we use our ILSW rubric to evaluate student writing, “Evaluation of Sources” is the area in which students struggle the most.

In the 2018 ILSW study, sophomore writers struggled to select high quality sources that matched their rhetorical goals. On our rubric’s four-point scale from 1 (Poor) to 4 (Strong), 12% of scores indicated “Poor” Evaluation of Sources, and only 8% of scores indicated “Strong” skills in this area (see Figure 1). This rubric category also received the highest percentage of 2s (Weaknesses Interfere) and the lowest percentage of 3s (Weaknesses Do Not Interfere) compared to the other rubric categories. In addition to assigning rubric scores, readers were able to indicate key patterns that they noted in the papers they read. Fully 24% of the papers were given the designation “Sources lack breadth or depth consistent with genre,” and 15% went so far as to note a pattern of “Inappropriate sources/evidence used to support claim” (see Figure 2). In the optional free-text comments submitted by readers, more than a third of the comments addressed some aspect of source evaluation. For example, one comment read, “Cited a Daily Kos article for info on the history of Dance in the US (and this [Daily Kos] article even pointed to a scholarly book on the topic that the student didn’t look at).” Weaker papers missed obvious avenues of source exploration or relied on secondary citations such as citing a New York Times article that mentions a research study rather than seeking out the original study, even when the original sources were readily available through more specialized search tools such as our library discovery system or disciplinary research databases. This points to a common misunderstanding that novice writers hold about the underlying goals of source selection, not always realizing the culturally constructed authority structures that they could use (or productively flout) to more effectively borrow authority from their sources. 

We investigated statistical differences in our scores between native English and non-native English writers, as well as between different races, ethnicities, and genders. However, our ILSW sample did not reveal any statistically significant differences between these groups. We do not know whether this is because there were no differences or because our sample size was too small to accurately assess all demographic groups. Our 2015 Research Practices Survey that measures pre-college experience revealed greater differences between first generation students, international students, and students of color as compared with white students,21 but we did not observe such differences in the ILSW results.

These findings mirror our experiences in research consultations, where students express confusion about why particular research tasks are being asked of them, whether the sources they find fit their assignment requirements, and what kinds of sources are suited to different research topics or goals. The students’ work may be further complicated by a mismatch between their chosen topics and the source types that may be required by their assignments, or by misidentifying source types to begin with. For example, we often see students in research consultations who think they have found articles when they have in fact found book reviews or encyclopedia entries. This could be because databases don’t make this distinction clearly enough in their metadata or because the students don’t know what an article looks like compared with other similar genres. Even more fundamentally, students frequently assume that the primary goal for finding a source is to confirm that what the student plans to say is not new, that it is backed up by (or at least thought by) other people in the world. These assumed goals often do not match the professor’s goals22 of having students engage with literature in order to generate novel interpretations rather than simply report on what is already known.

The Difficulties of Evaluating and Selecting Sources

While these findings and experiences are sobering, they are not surprising. Sophomore students are only halfway through their education, source evaluation is a nuanced and situation-dependent process, and the amount of information available to sort through is increasingly vast and entangled. At the same time, it becomes more difficult to distinguish between the various types of sources, especially online sources. Every type of online source looks like a “website” or a PDF even when it may actually be anything from a blog post to a book review to a peer reviewed article to a full monograph. This phenomenon has been described as “container collapse.”23 Coupled with our students’ reduced high school experience with research and with libraries,24 container collapse leaves students increasingly confused by source evaluation.

The problem does not just lie in the fact that students do not have much experience with using physical resources, or the fact that many sources now do not have a physical counterpart. Multiple studies have shown that online sources are difficult to classify in general. In fact, two separate studies in 2016 found that student level, age, and experience made no difference when it came to identifying online source types.25 Instruction may improve performance,26 but the major finding is that online publications are difficult for people to classify correctly, even into broad categories like “academic journal” or “book review.”

It may seem like a relic of a past era to think about publication types as an important aspect of source evaluation,27 but distinguishing between source genres is fundamental to the evaluation process. Source genres “identify a repertoire of actions that may be taken in a set of circumstances,” and they “identify the possible intentions” of their authors.28 Novice writers and researchers are therefore doubly hampered, first by not knowing which source genres are appropriate for various rhetorical tasks, and then by not being able to identify which genre of source they see in their browser windows.

Of course, evaluation doesn’t stop once appropriate source types are in hand. Students then have to navigate disciplinary conventions, subtle “credibility cues,”29 and webs of constructed authority, all of which is in addition to the basics of finding sources that speak about their topics in ways that seem informative, relevant, and understandable.30 For such a daunting set of tasks, all within tight term or semester time constraints, it’s no wonder that some students falter.

Supporting Student Development in Selecting Sources

For reference and instruction librarians supporting undergraduates, a foundational part of our work has always involved helping students develop the knowledge and skills needed for good source evaluation. Our experiences and ILSW findings emphasize that this core work of librarianship is vitally important. While librarians may not be as knowledgeable about specialized disciplinary discourse, we are uniquely positioned to help students recognize and navigate disciplinary conventions,31 and there is also evidence that library instruction can improve students’ ability to recognize online source types.32

Librarians help novice researchers develop their understanding of how to recognize source types through curated lists such as bibliographies, handbooks, research guides, and other resources that are created by experts rather than by algorithms. In an instruction session or research consultation, it takes very little time to show students that, in general, bibliography entries that list a place and/or publisher are books or chapters in books while the other entries are in some other kind of publication (journal, website, etc.). Librarians can then point out that bibliographies are more than just alphabetical lists of relatively random works cited in a text: they are instead maps of scholarly conversations, gathering together (ideally) the most relevant and important sources related to the text at hand. Students can then be encouraged to take notes on the keywords in bibliography entry titles, the journals that publish works related to the topic, prominent authors, key publishers, publication date distributions, and the like to develop a more nuanced sense of the kinds of sources that could be related to their topics. Each entry in a bibliography is a potential source in itself, but it also points to pockets of related sources for students to explore.
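For readers who like to see a rule of thumb written out, the place-and/or-publisher heuristic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes Chicago-style entries in which books and book chapters end with "Place: Publisher, Year," and the names `BOOK_IMPRINT`, `looks_like_book`, and `sort_entries` are hypothetical, not part of the ILSW project or of any library tool.

```python
import re

# Assumption: Chicago-style entries, where books and chapters end with an
# imprint such as "Oxford: Chandos Publishing, 2010."
# The publisher part may not contain a colon or quotation marks, so that an
# article title with a subtitle is not mistaken for an imprint.
BOOK_IMPRINT = re.compile(r'[A-Z][\w.,& ]*: [^:"“”]+, \d{4}\.?\s*$')


def looks_like_book(entry: str) -> bool:
    """Return True when the entry ends with a place-and-publisher imprint."""
    return BOOK_IMPRINT.search(entry) is not None


def sort_entries(entries):
    """Split entries into (books_or_chapters, other_publications)."""
    books = [e for e in entries if looks_like_book(e)]
    others = [e for e in entries if not looks_like_book(e)]
    return books, others


# Four entries taken from this article's own bibliography.
SAMPLE = [
    "Lloyd, Annemaree. Information Literacy Landscapes: Information Literacy "
    "in Education, Workplace, and Everyday Contexts. Oxford: Chandos "
    "Publishing, 2010.",
    "Miller, Carolyn R. “Genre as Social Action.” Quarterly Journal of Speech "
    "70, no. 2 (May 1, 1984): 151–67.",
    "Bazerman, Charles. “Systems of Genres and the Enactment of Social "
    "Intentions.” In Genre and the New Rhetoric, edited by Aviva Freedman and "
    "Peter Medway, 69–85. London: Taylor & Francis, 2005.",
    "Jastram, Iris, Danya Leebaw, and Heather Tompkins. “CSI(L) Carleton: "
    "Forensic Librarians and Reflective Practices.” In the Library with the "
    "Lead Pipe, 2011.",
]

if __name__ == "__main__":
    books, others = sort_entries(SAMPLE)
    print(f"{len(books)} book-like entries, {len(others)} other entries")
```

Like the advice given to students, the sketch only holds "in general": conference proceedings and reports that list a place and publisher will also be sorted with the books, which is a useful reminder that the heuristic is a starting point for reading a bibliography rather than a classifier.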

On our campus, librarians find that they can provide some very practical but crucial advice for undergraduates by introducing and explaining less-understood source types and also by helping students develop research strategies that use the various source genres to their full advantage. For example, one of the first things some liaison librarians talk about in research appointments is the importance of using scholarly reference sources and even Wikipedia to build context and gain a foothold in a new research area.33 It can seem inefficient to spend time reading a source that won’t be acceptable in a bibliography, since scholarly convention often discourages citing reference sources in academic papers. Because of this, we often see students skip this step entirely and dive right into an argument without much knowledge about the subject they are trying to discuss. Whether licensed or freely available, reference sources provide important factual contexts, define core vocabulary, and point to major voices in the conversation around the topic at hand. Reference sources can also signal what kinds of other sources count as good evidence in this conversation, and where to find them.

Crucially, going through the step of seeking out background sources, as Joseph Bizup terms them,34 will result in a better understanding of the topic at hand, which allows students to ask increasingly complex questions of their topic and make better use of analytical sources. Reading, rather than being a process that happens separately and after finding and accessing information, is an integral part of both “rhetorical invention”35 and also information literacy. Searching and even browsing may feel more active and efficient to the novice researcher, but good source discovery, evaluation, and selection all require active reading. And active reading in turn requires knowing how to spot and interpret the moves that authors make when positioning themselves against the backdrop of prior information, the language of the field (which will be useful for future searches), and the credibility cues that authors use when introducing outside sources into their writing. Building this context is one of the most crucial early steps in the research process.

Organizing information sources is also critical to source selection. For example, at Carleton we often introduce students to bibliographic managers such as Zotero or EndNote not only as citation generators, but as tools that help researchers think critically about their sources. We emphasize the practical aspects of these systems, but we also use them to teach students about the importance of citation tracking, tagging, and sorting, and how these practices allow researchers to see how sources are related to each other. We talk about the importance of organizing your own research and using a citation management system to identify prominent scholars, figure out which authors or experts are missing or left out, and even select source types if that is a requirement of a particular assignment.

Using Sources in Research and Writing

In our ILSW study, the Use of Evidence category on the rubric measures how well students synthesize, contextualize, and incorporate evidence into their writing. In 2018, this category of information literacy skill gave sophomore students almost as much trouble as the Evaluation of Sources category, with only 8% of papers given a “Strong” score of 4, and 12% given a “Poor” score of 1 (see Figure 3). As with source evaluation, we expect Carleton sophomores to find these skills challenging, and our study’s findings reinforced these expectations. In addition, 29% of papers received the designation “Sources not integrated or synthesized,” usually indicating that students ceded control of their arguments to excessive quotation, summary, or reporting rather than calling on sources as rhetorical tools that advance the paper’s goals. Readers noted in the free-text comments such patterns as “Appears to cherry pick from those sources, most of which probably would have been great sources if used better,” or “There are a lot of opinions without much substantiation.” These weaker papers revealed confusion about the reason for drawing on evidence in the first place, with students not seeing the importance of interplay between source material and their own thoughts.36 Stronger papers, on the other hand, integrated evidence in service of the students’ rhetorical goals, and the students framed and contextualized this evidence such that it helped the reader understand and trust the paper’s claims.

Students were often successful when attributing evidence in their written work, generally providing information that helped their readers understand the origins of the evidence and ideas they incorporated. Only 14% of papers exhibited “Egregious errors in bibliography, in-text citations, or notes,” and this rubric category received the second highest number of 3s and 4s after Strategic Inquiry. On the other hand, “Under-cited/supported claims” was the most common pattern noted among papers, appearing on the scoring sheets 51% of the time, and nearly 48% of the optional free text comments submitted by readers pointed out misunderstandings about attributive practices. This suggests that students often attribute uncritically, not realizing that attribution is a set of rhetorical practices within academic communities rather than simply a set of mechanics that stave off plagiarism charges. In consultations and classes, we see similar confusion about when citations are expected and how they function, with many students thinking they should appear only after direct quotations or close paraphrases rather than understanding that citations also act as authority cues and as portals into further reading for future researchers.

Supporting Student Development in Using Sources

As with the Use of Evidence category above, the weaker papers in this category signaled confusion about the underlying purpose for bringing evidence and outside sources into papers. This points to a need for librarians, professors, and writing center professionals to explicitly discuss with students the reasons behind citation (its function within rhetoric) rather than simply the mechanics of quoting, paraphrasing, and creating proper citations.

Novices in academic writing often benefit from explicit instruction in the ways that academic writing draws on communication conventions that they already know but may not have recognized in the unfamiliar genres they’re reading and writing about. Especially with our first and second year students, we give them examples of the types of conversations they might have with a friend and point out that it would be awkward if one conversation partner simply repeated everything the other person was saying. Instead, in conversations each person builds on what the other person has said to generate new meaning or knowledge. For our upper level students, we teach them that the point of research is not to create a collection of statements as proof that they have read broadly, but rather to focus on the work of finding connections, selecting key sources, and remaining flexible about their thinking so that they can remain responsive to what they’re learning. Once a student gains a somewhat clear understanding of their topic, we then encourage them to identify any themes that emerge in the reading that would cause them to ask new questions, and we teach them to look for any small clues in their readings that indicate points where experts approach the topic differently, build on each other, push against each other, and in doing so make space for their contribution to human knowledge. This in turn helps students see that they can create space for themselves in the scholarly conversation — that they can join the conversation themselves by engaging with their sources rather than simply reporting on them.

Sometimes, a student’s ability to concentrate on finding good sources is complicated by the restrictions of their class assignments. We often see assignments where students are required to find an exact number of different source types, such as two peer-reviewed journal articles, a book, and a news source. This gives librarians an opportunity to teach students about these types of sources, where to find them, how to recognize them, and how they function in scholarly communication. However, it does not always help a student to fully develop their own argument or ask more complex questions about their topic because they are consumed with making sure they are checking all the boxes of the assignment requirements. Sometimes these requirements and constraints can help steer students toward topics or approaches that have well-matched sources available, but other times the writing prompt and its source requirements can be at odds with each other. In either case, students can learn to navigate the challenges of their assignments if they understand that not every topic can be fully explored only through peer reviewed academic journals. Part of what they’re learning to do is scope their topics more appropriately, whether more broadly or narrowly, and to work within (or push productively against) disciplinary conventions about appropriate source types.

As students grapple with the difficulties of entering a community of academic practice, another challenge they face has to do with attribution practices within the various disciplines. Carleton students report worrying about accidentally plagiarizing, but they lack nuanced understandings of citation norms within each discipline and sub-discipline.37 This, combined with not knowing that there is flexibility within citation styles to make citing decisions based on overall best practices, is a major stumbling block. The act of citing is often seen by students as something boring and mechanical, a check-box to mark. In fact, as Robert Connors has noted, “citations have an essentially rhetorical nature” that contain a “universe of meanings” and that are the “products and reflections of social and rhetorical realities.”38 Citations function within scholarly conversation to help readers evaluate the claims at hand, help authors position themselves within the field, and point readers to related conversations in the literature.39 Through our conversations with students in research consultations, we have found that students often mistake the various citation formats for arbitrary sets of rules, not understanding that each style matches a discourse community’s communication priorities and strategies.

In our experience, shifting the conversation toward these underlying goals of attribution and away from punitive and mechanistic tutorials helps students both make better choices in their citation practices and participate more fully in their scholarly community. Each year we conduct training sessions with peer tutors in the Writing Center during which we discuss these concepts, and each year those peer tutors report that this was one of the most useful and eye-opening topics discussed during our training sessions. Similarly, two quick questions have helped students in research consultations decide whether something they’ve written is “common knowledge” and therefore doesn’t need a citation: “Might my reader not automatically agree with this? Might my reader be curious to know more about this?” If the answer to either question is “yes” then a citation is useful. Countless instruction sessions and research consultations over the years have dealt with similar themes, and students report similar feelings of empowerment (and sometimes even excitement) about attribution practices once they understand the many ways that citations can function in rhetoric. 


Our Information Literacy in Student Writing project and the scholarship in librarianship, information literacy, and rhetoric have shown us that there are a lot of opportunities to continue working with faculty, academic support staff, and students to assist with source selection and use. We also know from our research and first-hand experience that these practices are quite difficult and require just that: practice. Library instruction can help over time, and one of the advantages of having faculty score papers for our ILSW project with us is that they see concrete evidence that evaluation and source selection are challenges for students. Our ILSW findings have opened up further opportunities with a number of faculty to provide more information literacy instruction and consultation to their students.40 We think it is also important to acknowledge the reality that the context in which our students find the majority of their research, the internet (including databases, online catalogs, Google Scholar, etc.), is only going to make evaluation more complicated. Information that is born-digital does not fit into the neat containers we could hold in our hands and more easily identify, nor does all information correspond to a physical format anymore. For these reasons, librarians, faculty, and other academic support staff should discuss source evaluation and selection with students and equip them with strategies.


This paper was only possible because of a whole community of people. We can’t possibly name everyone who has shaped our work and the ILSW project, but we would like to particularly acknowledge the contributions of:

  • All the students who generously made their Sophomore Portfolios available for research
  • Members of Gould Library’s Reference & Instruction Department, for conceiving of this project and carrying it through from year to year, especially Matt Bailey, Sarah Calhoun, Audrey Gunn, Susan Hoang, Sean Leahy, Danya Leebaw, Kristin Partlo, Charlie Priore, Carolyn Sanford, Heather Tompkins, and Ann Zawistoski
  • Carleton’s Gould Library, especially College Librarian Brad Schaffner, for supporting this project
  • Carleton’s Dean of the College office, particularly Bev Nagel, George Shuffelton, and Danette DeMann for approving, supporting, and funding the 2018 ILSW study
  • Carol Trossett, data consultant extraordinaire
  • Carleton’s Perlman Center for Teaching and Learning, especially Melissa Eblen-Zayas, for invaluable support and advice
  • Carleton’s Writing Across the Curriculum program, especially the director, George Cusack, and Mary Drew, for access to the Sophomore Writing Portfolio papers and for so many other radical acts of collaboration
  • Carleton’s Office of Institutional Research and Assessment, especially Jody Friedow and Bill Altermatt, for access to sophomore student demographic reports
  • All the faculty and staff who participated in the ILSW project’s reading days
  • The reviewers who read and provided feedback on drafts of this paper. Thank you for your time and insights, Ian Beilin and Amy Mars.


ACRL. “Framework for Information Literacy for Higher Education.” Chicago: Association of College and Research Libraries, 2016.

Bakhtin, M. M. Speech Genres and Other Late Essays. Edited by Michael Holquist and Caryl Emerson. Translated by Vern W. McGee. 1st edition. University of Texas Press Slavic Series 8. Austin: University of Texas Press, 1986.

Bazerman, Charles. “Systems of Genres and the Enactment of Social Intentions.” In Genre and the New Rhetoric, edited by Aviva Freedman and Peter Medway, 69–85. London: Taylor & Francis, 2005.

Bizup, Joseph. “BEAM: A Rhetorical Vocabulary for Teaching Research-Based Writing.” Rhetoric Review 27, no. 1 (January 4, 2008): 72–86.

Breakstone, Joel, Sarah McGrew, Mark Smith, Teresa Ortega, and Sam Wineburg. “Why We Need a New Approach to Teaching Digital Literacy.” Phi Delta Kappan 99, no. 6 (2018): 27–32.

Brent, Doug. Reading as Rhetorical Invention: Knowledge, Persuasion, and the Teaching of Research-Based Writing. Urbana, Ill.: National Council of Teachers of English, 1992.

Buhler, Amy, and Tara Cataldo. “Identifying E-Resources: An Exploratory Study of University Students.” Library Resources & Technical Services 60, no. 1 (January 7, 2016): 23–37.

Buhler, Amy G, Ixchel M Faniel, Brittany Brannon, Christopher Cyr, Tara Tobin, Lynn Silipigni Connaway, Joyce Kasman Valenza, et al. “Container Collapse and the Information Remix: Students’ Evaluations of Scientific Research Recast in Scholarly vs. Popular Sources.” In ACRL Proceedings, 14. Cleveland, Ohio, 2019.

Bull, Alaina C., and Alison Head. “Dismantling the Evaluation Framework.” In the Library with the Lead Pipe, July 21, 2021.

Calhoun, Cate. “Using Wikipedia in Information Literacy Instruction: Tips for Developing Research Skills.” College & Research Libraries News 75, no. 1 (2014): 32–33.

Connaway, Lynn Silipigni. “What Is ‘Container Collapse’ and Why Should Librarians and Teachers Care?” OCLC Next (blog), June 20, 2018.

Connors, Robert J. “The Rhetoric of Citation Systems Part I The Development of Annotation Structures from the Renaissance to 1900.” Rhetoric Review 17, no. 1 (1998): 6–48.

Cusack, George. “Writing Across the Curriculum.” Carleton College, 2018.

Daniels, Erin. “Using a Targeted Rubric to Deepen Direct Assessment of College Students’ Abilities to Evaluate the Credibility of Sources.” College & Undergraduate Libraries 17, no. 1 (2010): 31.

Gullifer, Judith, and Graham A. Tyson. “Exploring University Students’ Perceptions of Plagiarism: A Focus Group Study.” Studies in Higher Education 35, no. 4 (June 1, 2010): 463–81.

Hofer, Barbara K. “Personal Epistemology as a Psychological and Educational Construct: An Introduction.” In Personal Epistemology: The Psychology of Beliefs about Knowledge and Knowing, 3–14. London: Routledge, 2004.

Hyland, Ken. “Academic Attribution: Citation and the Construction of Disciplinary Knowledge.” Applied Linguistics 20, no. 3 (1999): 341–67.

Jastram, Iris, Danya Leebaw, and Heather Tompkins. “CSI(L) Carleton: Forensic Librarians and Reflective Practices.” In the Library with the Lead Pipe, 2011.

———. “Situating Information Literacy Within the Curriculum: Using a Rubric to Shape a Program.” Portal: Libraries and the Academy 14, no. 2 (2014): 165–86.

Leebaw, Danya, Kristin Partlo, and Heather Tompkins. “‘How Is This Different from Critical Thinking?’: The Risks and Rewards of Deepening Faculty Involvement in an Information Literacy Rubric,” 270–80. Indianapolis: ACRL 2013, 2013.

Leeder, Chris. “Student Misidentification of Online Genres.” Library & Information Science Research 38, no. 2 (April 2016): 125–32.

Lloyd, Annemaree. Information Literacy Landscapes: Information Literacy in Education, Workplace, and Everyday Contexts. Oxford: Chandos Publishing, 2010.

McGeough, Ryan, and C. Kyle Rudick. “‘It Was at the Library; Therefore It Must Be Credible’: Mapping Patterns of Undergraduate Heuristic Decision-Making.” Communication Education 67, no. 2 (April 3, 2018): 165–84.

Meriam Library. “Is This Source or Information Good?,” 2010.

Miller, Carolyn R. “Genre as Social Action.” Quarterly Journal of Speech 70, no. 2 (May 1, 1984): 151–67.

Gould Library Reference and Instruction Department. “Research Practices Survey 2015-16.” Northfield, MN: Gould Library, Carleton College, 2017.

Russo, Alyssa, Amy Jankowski, Stephanie Beene, and Lori Townsend. “Strategic Source Evaluation: Addressing the Container Conundrum.” Reference Services Review 47, no. 3 (August 1, 2019): 294–313.

Simmons, Michelle Holschuh. “Librarians as Disciplinary Discourse Mediators: Using Genre Theory to Move Toward Critical Information Literacy.” Portal: Libraries and the Academy 5, no. 3 (2005): 297–311.

Soules, Aline. “E-Books and User Assumptions.” Serials: The Journal for the Serials Community 22, no. 3 (January 1, 2009): S1–5.

White, Beth A., Taimi Olsen, and David Schumann. “A Threshold Concept Framework for Use across Disciplines.” In Threshold Concepts in Practice, edited by Ray Land, Jan H. F. Meyer, and Michael T. Flanagan, 53–63. Educational Futures. Rotterdam: SensePublishers, 2016.

Appendix: ILSW Rubric


  1. The ILSW project uses a scoring rubric (see appendix) that is designed for use across disciplines, and it is intended to be flexible across many paper genres. It does not reveal specifics about student research strategies, but it does allow us to identify characteristics of information literacy habits of mind as they appear in completed student writing. The rubric calls attention to the clues students give their readers about how the students conceive of their research strategies and how they marshal and deploy evidence in service of their rhetorical goals. Our full rubric, scoring sheet, and coder’s manual are available at
  2. Iris Jastram, Danya Leebaw, and Heather Tompkins, “Situating Information Literacy Within the Curriculum: Using a Rubric to Shape a Program,” Portal: Libraries and the Academy 14, no. 2 (2014): 165–86; Danya Leebaw, Kristin Partlo, and Heather Tompkins, “‘How Is This Different from Critical Thinking?’: The Risks and Rewards of Deepening Faculty Involvement in an Information Literacy Rubric” (Indianapolis: ACRL 2013, 2013), 270–80.
  3. Information about the Sophomore Writing Portfolio can be found at
  4. More about Carleton College can be found in this profile: 
  5. George Cusack, “Writing Across the Curriculum,” Carleton College, 2018,
  6. 25% of the papers were read twice to provide inter-rater reliability scores. Among the group of readers, librarians had relatively high levels of inter-rater reliability (73% agreement). Faculty disagreed on scores more frequently (54% agreement), possibly because they had less experience with the project, or possibly because they are used to grading papers according to how well the papers meet the requirements of their assignments rather than evaluating according to our rubric, which is not related to a particular assignment. However, even with these differences, clear trends emerged from the comments and from the statistically significant differences in the data.
  7. We did not include any measurement of whether students in this sample had had any library instruction or experience.
  8. see Annemaree Lloyd, Information Literacy Landscapes: Information Literacy in Education, Workplace, and Everyday Contexts (Oxford: Chandos Publishing, 2010).
  9. ACRL, “Framework for Information Literacy for Higher Education” (Chicago: Association of College and Research Libraries, 2016),
  10. M. M. Bakhtin, Speech Genres and Other Late Essays, ed. Michael Holquist and Caryl Emerson, trans. Vern W. McGee, 1st edition., University of Texas Press Slavic Series 8 (Austin: University of Texas Press, 1986), 41.
  11. Carolyn R. Miller, “Genre as Social Action,” Quarterly Journal of Speech 70, no. 2 (May 1, 1984): 163,
  12. see Bakhtin, Speech Genres and Other Late Essays; Miller, “Genre as Social Action.”
  13. Barbara K Hofer, “Personal Epistemology as a Psychological and Educational Construct: An Introduction,” in Personal Epistemology: The Psychology of Beliefs about Knowledge and Knowing (London: Routledge, 2004), 3.
  14. Jan Meyer & Ray Land, “Threshold Concepts and Troublesome Knowledge: Linkages To Ways of Thinking and Practicing Within the Disciplines.” (Enhancing Teaching-Learning Environments in Undergraduate Courses, Occasional Report 4; Universities of Edinburgh, Coventry and Durham; May 2003),
  15. Beth A. White, Taimi Olsen, and David Schumann, “A Threshold Concept Framework for Use across Disciplines,” in Threshold Concepts in Practice, ed. Ray Land, Jan H. F. Meyer, and Michael T. Flanagan, Educational Futures (Rotterdam: SensePublishers, 2016), 53,
  16. “Framework for Information Literacy for Higher Education.”
  17. Meriam Library, “Is This Source or Information Good?,” 2010,
  18. Ryan McGeough and C. Kyle Rudick, “‘It Was at the Library; Therefore It Must Be Credible’: Mapping Patterns of Undergraduate Heuristic Decision-Making,” Communication Education 67, no. 2 (April 3, 2018): 165–84,
  19. ACRL, “Framework for Information Literacy for Higher Education.”
  20. Joel Breakstone et al., “Why We Need a New Approach to Teaching Digital Literacy,” Phi Delta Kappan 99, no. 6 (2018): 27–32,; Alaina C. Bull and Alison Head, “Dismantling the Evaluation Framework – In the Library with the Lead Pipe,” July 21, 2021,
  21. Private internal report of Research Practices Survey administered by HEDS (the Higher Education Data Sharing Consortium) results prepared by Carol Trosset in September 2015.
  22. Gleaned from conversations with our liaison faculty and from conversations during internal professional development workshops, often led by the Learning and Teaching Center.
  23. Amy G Buhler et al., “Container Collapse and the Information Remix: Students’ Evaluations of Scientific Research Recast in Scholarly vs. Popular Sources,” in ACRL Proceedings (Association of College and Research Libraries, Cleveland, Ohio, 2019), 14; Lynn Silipigni Connaway, “What Is ‘Container Collapse’ and Why Should Librarians and Teachers Care? – OCLC Next,” Next (blog), June 20, 2018,
  24. We know this from the Research Practices Survey, conducted at Carleton in 2006 and again in 2015. This survey contains many measures of these changes in pre-college experience. While we are a highly selective institution, our students’ pre-college experiences with research have been diminishing over time. You can see some of the RPS results at “Research Practices Survey 2015-16” (Northfield MN: Gould Library, Carleton College, 2017),
  25. Amy Buhler and Tara Cataldo, “Identifying E-Resources: An Exploratory Study of University Students,” Library Resources & Technical Services 60, no. 1 (January 7, 2016): 33,; Chris Leeder, “Student Misidentification of Online Genres,” Library & Information Science Research 38, no. 2 (April 2016): 129,
  26. Leeder, “Student Misidentification of Online Genres,” 129.
  27. Aline Soules, “E-Books and User Assumptions,” Serials: The Journal for the Serials Community 22, no. 3 (January 1, 2009): S4,
  28. Charles Bazerman, “Systems of Genres and the Enactment of Social Intentions,” in Genre and the New Rhetoric, ed. Aviva Freedman and Peter Medway (London: Taylor & Francis, 2005), 69.
  29. Erin Daniels, “Using a Targeted Rubric to Deepen Direct Assessment of College Students’ Abilities to Evaluate the Credibility of Sources,” College & Undergraduate Libraries 17, no. 1 (2010): 35,
  30. Alyssa Russo et al., “Strategic Source Evaluation: Addressing the Container Conundrum,” Reference Services Review 47, no. 3 (August 1, 2019): 294–313,
  31. Michelle Holschuh Simmons, “Librarians as Disciplinary Discourse Mediators: Using Genre Theory to Move Toward Critical Information Literacy,” Portal: Libraries and the Academy 5, no. 3 (2005): 297–311,
  32. Leeder, “Student Misidentification of Online Genres,” 129.
  33. Cate Calhoun, “Using Wikipedia in Information Literacy Instruction: Tips for Developing Research Skills,” College & Research Libraries News 75, no. 1 (2014): 32–33,
  34. Joseph Bizup, “BEAM: A Rhetorical Vocabulary for Teaching Research-Based Writing,” Rhetoric Review 27, no. 1 (January 4, 2008): 75,
  35. Doug Brent, Reading as Rhetorical Invention: Knowledge, Persuasion, and the Teaching of Research-Based Writing (Urbana, Ill.: National Council of Teachers of English, 1992).
  36. Anecdotally, when the reference and instruction librarians met with Writing Center workers in fall 2019 to discuss how we can better work together, one point of common ground was the desire to support students who struggle to get their own voice into their writing rather than relying on sources.
  37. Judith Gullifer and Graham A. Tyson, “Exploring University Students’ Perceptions of Plagiarism: A Focus Group Study,” Studies in Higher Education 35, no. 4 (June 1, 2010): 463–81,
  38. Robert J Connors, “The Rhetoric of Citation Systems Part I The Development of Annotation Structures from the Renaissance to 1900,” Rhetoric Review 17, no. 1 (1998): 6–7,
  39. See Ken Hyland, “Academic Attribution: Citation and the Construction of Disciplinary Knowledge,” Applied Linguistics 20, no. 3 (1999): 342–44.
  40. Iris Jastram, Danya Leebaw, and Heather Tompkins, “CSI(L) Carleton: Forensic Librarians and Reflective Practices,” In the Library with the Lead Pipe, 2011,

Announcing Incoming NDSA Coordinating Committee Members for 2022-2024 / Digital Library Federation

Please join me in welcoming the three newly elected Coordinating Committee members Stacey Erdman, Jen Mitcham, and Hannah Wang. Their terms begin January 1, 2022 and run through December 31, 2024. 

Stacey Erdman is the Digital Preservation & Curation Officer at Arizona State University. In this position, she has responsibility for designing and leading the digital preservation and curation program for ASU Library. She is also currently serving as the Acting Digital Repository Manager at ASU, where she has been working with the repository team on migrating repository platforms to Islandora. She is the former Digital Archivist at Beloit College; and Digital Collections Curator at Northern Illinois University. She has been a part of the Digital POWRR Project since its inception in 2012, and is serving as Principal Investigator for the recently funded IMLS initiative, the Digital POWRR Peer Assessment Program. Stacey currently serves on the 2021 NDSA Program Committee, and is also a member of the Membership Task Force. She has been excited to see the steps that the NDSA has taken recently to diversify the member base, and looks forward to working as a part of the CC to help make this work mission-critical. Stacey feels passionately about making the digital preservation field more equitable and inclusive, and would be a strong advocate for expanding NDSA’s outreach, advocacy, and education efforts.

Jen Mitcham is Head of Good Practice and Standards at the Digital Preservation Coalition (DPC), an international membership organization with charitable status based in the UK. In her role at the DPC, Jenny is responsible for promoting and maintaining the DPC’s maturity model for digital preservation, the Rapid Assessment Model (DPC RAM), and leads a digital preservation project with the UK’s Nuclear Decommissioning Authority. She has recently led the DPC’s taskforce on EDRMS preservation which has resulted in the publication of an online resource. She is involved in the organization of events and commissioning publications on digital preservation issues and provides support to DPC Members in a variety of different areas. Jenny was previously a digital archivist at the Archaeology Data Service and the University of York and has been working in the field of digital preservation since 2003. She has been involved in several initiatives with the NDSA over the last few years, including the revision of the NDSA Levels of Preservation and the 2021 Fixity Survey.

Hannah Wang works at Educopia Institute, where she is the Community Facilitator for the MetaArchive Cooperative and the Project Manager for the BitCuratorEdu project. Her work and research focus on digital archives pedagogy and on amplifying and coordinating the work of digital preservation practitioners through communities of practice. She currently serves on the NDSA Staffing Survey Working Group. Hannah was previously the Electronic Records & Digital Preservation Archivist at the Wisconsin Historical Society, and has taught graduate-level archives classes as an Associate Lecturer at the University of Wisconsin-Madison iSchool. She received her MSIS from University of North Carolina-Chapel Hill and lives in Sheboygan, Wisconsin.

We are also grateful to the very talented, qualified individuals who participated in this election.

We are indebted to our outgoing Coordinating Committee members, Stephen Abrams and Salwa Ismail, for their service and many contributions. To sustain a vibrant, robust community of practice, we rely on and deeply value the contributions of all members, including those who took part in voting.

The post Announcing Incoming NDSA Coordinating Committee Members for 2022-2024 appeared first on DLF.

October Open Meeting / Islandora

October Open Meeting jmlynaryk Wed, 10/13/2021 - 05:17

Join us for our October Open Meeting, taking place Tuesday, Oct 26th from 10:00-2:00 Eastern! 

The morning theme of our October Open Meeting is ISLE. Stop by for a lesson on the basics of using Docker, as well as an interactive installation demo - we will send out the installation instructions used during the demo in case anyone wants to try or follow along! 

In the afternoon there will be a community discussion focusing on Carving A Path Forward based on the Jamboard exercise from last month’s open meeting. Together, we will create goals and action items. 

See below for the full schedule:


10:00 -10:15

Topic: Welcome and Introductions



Topic: Docker Basics

Topic: ISLE Install

Presenters: Noah Smith and others (TBA)



Open Q&A


12:00 - 12:30

Lunch Break


12:30 - 1:30

Topic: Jamboard Results and Next Steps in Community Health

Presenters: Islandora Open Event Planning Committee


1:30 - 2:00

Open Q&A


All are welcome and encouraged to attend, and no registration is required! Simply contact us at to receive a calendar invite or Zoom password.

Evergreen 3.8-beta available / Evergreen ILS

The Evergreen Community is pleased to announce the availability of the beta release for Evergreen 3.8. This release contains various new features and enhancements, including:

  • Angular rewrites of several staff interfaces:
    • Acquisitions administration
    • Holdings maintenance and item attributes editor
    • Patron triggered events log
    • Item triggered events log
  • A new option to make headings browsing case-insensitive
  • A new interface for editing notes that are attached to bibliographic records
  • Improvements to the staff interface for browsing bib records that are attached to a heading
  • Patron notes, messages, alert messages, and standing penalties have been folded into a consolidated notes interface.
  • New settings to control how the item price and acquisition cost are used to determine the item’s value for replacement
  • Improvements to the dialogs used to override events in the checkout, items out, and renew items interfaces
  • The patron photo URL can now be edited from the patron registration interface
  • New settings for hold stalling based on the pickup library
  • New settings for tuning the default pickup location that is applied when a hold request is placed by a staff member
  • Stripe credit payments in the public catalog now use a newer API recommended by Stripe
  • Cover images are now displayed in the My Account items checked out, check out history, holds, and holds history pages
  • New reporting views, including item statistics and Dewey call number blocks and ranges

Evergreen admins installing the beta or upgrading a test system to the beta should be aware of the following:

  • The minimum version of PostgreSQL required to run Evergreen 3.8 is PostgreSQL 9.6.
  • The minimum version of OpenSRF is 3.2.
  • Debian Bullseye is now supported, while Debian Jessie is no longer supported
  • The upgrade does not require a catalog reingest. However, if you are experimenting with the case-insensitive browse feature, reingests will be required for any index that you change.
  • The beta release should not be used for production.

Additional information, including a full list of new features, can be found in the release notes.

A new CEO for Open Knowledge Foundation – Renata Ávila / Open Knowledge Foundation

Beyond Open Data, our new CEO will start a conversation about the future of our global knowledge commons.

Today we are delighted to announce that the Board of Directors of Open Knowledge Foundation has appointed Renata Avila to be the new CEO of Open Knowledge Foundation – effective from October 4th 2021.

Board Chair, Vanessa Barnett, said that Renata was selected after a long and extremely competitive process, over many months.

    ‘Renata is an outstanding choice for CEO of Open Knowledge Foundation, bringing a wealth of experience that will be invaluable to achieve our mission’, Vanessa said. ‘This appointment marks a new chapter for Open Knowledge Foundation and the open movement. We are delighted to have her on board’.

Renata Avila (1981, Guatemala) is an international Human Rights lawyer and digital rights advocate. Throughout her career, Renata has successfully built a global network of networks advancing a decolonial, peoples-centric approach to open technologies and knowledge, as tools to advance rights and create stronger communities. 

She comes to the Open Knowledge Foundation to challenge the prevalent narrative and invite our network and extended community to advance a positive vision to bring back open to the most pressing challenges of our times.

We welcome Renata Avila as the CEO of Open Knowledge Foundation and look forward to working with her to achieve our mission.

Commenting on her appointment, Renata said 

     ‘I am honoured to be appointed as the new CEO of Open Knowledge Foundation, which plays such an important role in the international open knowledge movement. Without openness, global actions against climate change cannot scale. Without removing the barriers to accessing knowledge, no real solution against misinformation will be ever found. Without including everyone, and equipping people with skills to transform data into actionable knowledge, open data is just an enabler of the powerful.  Never before was our mission more urgent than today.’ 

She went on to say:

    ‘My goal in the upcoming months is to work together with our global network in designing the open knowledge ecosystem of tomorrow, with tools, strategies, governance structures and communities that are both shielded from abuses, exclusion and data extractivism, and enabled to create, connect and advance our positive agenda.  Our vision for an open future.’ 

Please do join us in welcoming Renata to the Open Knowledge Foundation team.

More about Renata here

Towards a (technical architecture for a) HASS Research Data Commons for language and text analysis / Peter Sefton

This is a presentation by Peter (Petie) Sefton and Moises Sacal, delivered at the online eResearch Australasia Conference on October 12th 2021.

The presentation was by recorded video - this is a written version. Moises and I are both employed by the University of Queensland School of Languages and Culture.


Here is the abstract as submitted:

The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC).

The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities.

The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations.

In this presentation we will present the proposed architecture of the system, the principles that informed it and demonstrate the first version. Features of the solution include the use of the Arkisto Platform (presented at eResearch 2020), which leverages the Oxford Common File Layout. This enables storing complete version-controlled digital objects described using linked data with rich context via the Research Object Crate (RO-Crate) format. The solution features a distributed authorization model where the agency archiving data may be separate from that authorising access.

Project team (alphabetical order): Michael D’Silva, Marco Fahmi, Leah Gustafson, Michael Haugh, Cale Johnstone, Kathrin Kaiser, Sara King, Marco La Rosa, Mel Mistica, Simon Musgrave, Joel Nothman, Moises Sacal, Martin Schweinberger, PT Sefton.

This cluster of projects is led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions.

I work on Gundungurra and Darug land in the Blue Mountains; Moises is on the land of the Gadigal peoples of the Eora Nation. We would like to acknowledge the traditional custodians of the lands on which we live and work, and the importance of Indigenous knowledge, culture and language to these projects.

The Language Data Commons of Australia (LDaCA) and Australian Text Analytics Platform (ATAP) projects received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS).

This work is supported by the Australian Research Data Commons.

⛰️ 🔏

We are going to talk about the emerging architecture and focus in on one very important part of it: Access control. 🔏

But first, some background.⛰️

The platform will:

  • Be sustainable, with a focus on data preservation as an overriding concern - data will not be ‘trapped’ in a particular platform and all data and code developed on the platform will be in a “migration free” layout ready for reuse
  • Preserve interoperable and re-usable data via the use of common standards for describing and structuring data with useful detailed context and provenance
  • Make data from ATAP and LDaCA and collections discoverable - with the caveat that harvesting harmonised metadata from existing corpora may be difficult
  • Provide workbench services for computational research - starting with code-notebooks but with the aim of building towards no-code environments and automatically re-runnable workflows
  • Include clear licensing on all data and code on how data may be reused, informed by a legally sound policy framework, with an access-control framework to allow automated data access where possible (there are some external dependencies here)
  • Be distributed - with data held by a number of different organizations under a variety of governance models and technologies (potentially including copies for redundancy or to put data close to compute and analytical services)
  • Enable best-practice in research, with research products such as code and derived data available as “fully documented research objects” that are as re-runnable and rigorously described as possible
  • Provide and be able to show value in enabling and measuring the impact of research

The architecture for the Data Commons project is informed by a set of goals and principles, starting with ensuring that important data assets have the best chance of persisting into the future.

[Diagram: Repositories (institutional, domain or both) with find / access services and policy-based data management, alongside Workspaces (working storage, domain specific tools, domain specific services) that are considered ephemeral and have active cleanup processes. A central cycle runs collect, describe, analyse, deposit early and often, and reuse of Findable, Accessible, Interoperable, Reusable data objects. V1.1 © Marco La Rosa, Peter Sefton 2021]

The diagram which we developed with Marco La Rosa makes a distinction between managed repository storage and the places where work is done - “workspaces”. Workspaces are where researchers collect, analyse and describe data. Examples include the most basic of research IT services, file storage as well as analytical tools such as Jupyter notebooks (the backbone of ATAP - the text analytics platform). Other examples of workspaces include code repositories such as GitHub or GitLab (a slightly different sense of the word repository), survey tools, electronic (lab) notebooks and bespoke code written for particular research programmes - these workspaces are essential research systems but usually are not set up for long term management of data. The cycle in the centre of this diagram shows an idealised research practice where data are collected and described and deposited into a repository frequently. Data are made findable and accessible as soon as possible and can be “re-collected” for use and re-use.

For data to be re-usable by humans and machines (such as ATAP notebook code that consumes datasets in a predictable way) it must be well described. The ATAP and LDaCA approach to this is to use the Research Object Crate (RO-Crate) specification. RO-Crate is essentially a guide to using a number of standards and standard approaches to describe both data and re-runnable software such as workflows or notebooks.
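
To make "well described" a little more concrete, here is a minimal sketch of the kind of `ro-crate-metadata.json` file that RO-Crate 1.1 expects at the top of a data package. The structure (a metadata descriptor plus a root dataset in a JSON-LD `@graph`) follows the RO-Crate specification, but the helper function, the dataset name, the file paths and the licence URL are all made up for illustration and are not taken from ATAP or LDaCA.

```python
import json


def make_ro_crate_metadata(name, description, date_published, license_id, files):
    """Build a minimal RO-Crate 1.1 metadata document as a plain dict.

    `files` is a list of (path, human_readable_name) tuples.
    This helper is hypothetical; it is not part of any RO-Crate library.
    """
    graph = [
        # The metadata descriptor: says which entity is the root dataset.
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        # The root dataset: describes the package as a whole.
        {
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_id},
            "hasPart": [{"@id": path} for path, _ in files],
        },
    ]
    # One File entity per payload file.
    for path, file_name in files:
        graph.append({"@id": path, "@type": "File", "name": file_name})
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


# Illustrative values only.
crate = make_ro_crate_metadata(
    name="Example interview collection",
    description="Two transcribed interviews, packaged for reuse in a notebook.",
    date_published="2021-10-12",
    license_id="https://example.org/licences/licence-A",
    files=[
        ("interview-01.txt", "Interview 1 transcript"),
        ("interview-02.txt", "Interview 2 transcript"),
    ],
)
print(json.dumps(crate, indent=2))
```

Because the licence is recorded as a linked identifier on the dataset itself, notebook code that consumes the package can discover which access conditions apply without knowing anything about the repository it came from.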


In the context of the previous high-level map distinguishing workspaces and repository services, we are using the Arkisto Platform (introduced at eResearch 2020).

Arkisto is an approach to eResearch service that places the emphasis on ensuring the long term preservation of data independently of code and services - recognizing the ephemeral nature of software.

An example of a corpus is the PARADISEC collection - Pacific and Regional Archive for Digital Sources in Endangered Cultures

PARADISEC has viewers for various content types: video and audio with time aligned transcriptions, image set viewers and document viewers (xml, pdf and microsoft formats). We are working on making these viewers available across Arkisto sites by having a standard set of hooks for adding viewer plugins to a site as needed.

[Diagram: overall architecture. Workspaces (working storage, domain specific tools, domain specific services; considered ephemeral, with active cleanup processes) include compute (HPC, cloud, desktop), ATAP notebooks, apps, code and workflows, a workbench (notebooks, data import by URL, export of fully described packages; stretch goals: code generation / simple interfaces, e.g. Discursis), bring-your-own data and storage (including Cloudstor). An analytics portal offers code discovery, launch / rerun and data discovery through an authenticated API. Language portal(s) offer corpus discovery, item discovery, an authenticated API and creation of virtual corpora. Supporting services: data curation and description, a reuse licence server, identity management (AAF / social media accounts), data cleaning, OCR / transcription and format migration. Archive and preservation repositories (institutional, domain or both) hold PARADISEC, the AU National Corpus, AusLan (sign), Sydney Speaks, the ATAP corpus (reference, training and BYO) and harvested external collections.]

This slide captures the overall high-level architecture - there will be an analytical workbench (left of the diagram) which is the basis of the Australian Text Analytics (ATAP) project - this will focus on notebook-style programming using one of the emerging Jupyter notebook platforms in that space. The exact platform is not 100% decided yet, but that has not stopped the team from starting to collect and develop notebooks that open up text analytics to new coders from the linguistics community. Our engagement lead, Dr Simon Musgrave sees the ATAP work as primarily an educational enterprise - which will be underpinned by services built on the Arkisto standards that allow for rigorous, re-runnable research.

[Diagram: the same architecture diagram, annotated "Our demo today looks at this part …"]

Today we will look in detail at one important part of this architecture - access control. How can we make sure that in a distributed system, with multiple data repositories and registries residing with different data custodians, the right people have access to the right data?

I didn’t spell this out in the recorded conference presentation, but for data that resides in the repositories at the right of the diagram we want to encourage research processes that clearly separate data from code. Notebooks and other code workflows that use data will fetch a version-controlled reference copy from a repository (using an access key if needed), process the data and produce results that are then deposited into an appropriate repository alongside the code itself. Given that a lot of the data in the language world is NOT available under open licenses such as Creative Commons, it is important to establish this practice - each user of the data must negotiate or be granted access individually. Research can still be reproducible using this model, but without fostering a culture in which datasets are shared with no regard for the rights of those who were involved in the creation of the data.


Regarding rights, our project is informed by the CARE principles for Indigenous data.

The current movement toward open data and open science does not fully engage with Indigenous Peoples rights and interests. Existing principles within the open data movement (e.g. FAIR: findable, accessible, interoperable, reusable) primarily focus on characteristics of data that will facilitate increased data sharing among entities while ignoring power differentials and historical contexts. The emphasis on greater data sharing alone creates a tension for Indigenous Peoples who are also asserting greater control over the application and use of Indigenous data and Indigenous Knowledge for collective benefit


We are designing the system so that it can work with diverse ways of expressing access rights, for example licensing like the Traditional Knowledge (TK) labels. The idea is to separate safe storage of data, with a license on each item (which may reference the TK labels), from a system that is administered by the data custodians, who can make decisions about who is allowed to access data.

Case Study - Sydney Speaks

We are working on a case-study with the Sydney Speaks project via steering committee member Professor Catherine Travis.

This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney. The title “Sydney Speaks” captures a key defining feature of the project: the data come from recorded conversations between Sydney siders, as they tell stories about their lives and experiences, their opinions and attitudes. This allows us to measure how their lived experiences impact their speech patterns. Working within the framework of variationist sociolinguistics, we examine variation in phonetics, grammar and discourse, in an effort to answer questions of fundamental interest both to Australian English, and language variation and change more broadly, including:

  • How has Australian English as spoken in Sydney changed over the past 100 years?
  • Has the change in the ethnic diversity over that time period (and in particular, over the past 40 years) had any impact on the way Australian English is spoken?
  • What affects the way variation and change spread through society
    • Who are the initiators and who are the leaders in change?
    • How do social networks function in a modern metropolis?
    • What social factors are relevant to Sydney speech today, and over time (gender? class? region? ethnic identity?)

A better understanding of what kind of variation exists in Australian English, and of how and why Australian English has changed over time can help society be more accepting of speech variation and even help address prejudices based on ways of speaking. Source:

The collection contains both contemporary and historic recordings of people speaking.

Because this involved human participants there are restrictions on the distribution of data - a situation we see with lots of studies involving people in a huge range of disciplines.

Sydney Speaks Licenses

There are four tiers of data access we need to enforce and observe for this data based on the participant agreements and ethics arrangements under which the data were collected.

Concerns about rights and interests are important for any data involving people, and a large amount of the data we are using, both Indigenous and non-Indigenous, will require access control that ensures that data sharing is appropriate.


In this demo we have uploaded various collections and are authorising with GitHub organisations.

In our production release we will use AAF to authorise different groups.

Let's find a dataset: The Sydney Speaks Corpus

As you can see we cannot see any data

Let's log in… We authorise GitHub…

Now you can see we have access to sub-corpus data, and I am just opening a couple of items.

Now in Github we can see the group management example.

I have given access to all the licences to myself, as you can see here and given access to licence A to others.


This diagram is a sketch of the interaction that took place in the demo - it shows how a repository can delegate authorization to an external system - in this case Github rather than CILogon. But we are working with the ARDC to set up a trial with the Australian Access Federation to allow CILogon access for the HASS Research Data Commons so we can pilot group-based access control.

NOTE: This diagram has been updated slightly from the version presented at the conference to make it clear that the lookup to find the licence for the data set is internal to the repository - the id is a DOI but it is not being resolved over the web.


In this presentation, about work which is still very much under construction, we have:

  • Shown an overview of a complete Data Commons Architecture
  • Previewed a distributed access-control mechanism which separates out the job of storing and delivering data from that of authorising access
  • We'll be back next year with more about how analytics and data repositories connect using structure and linked data.

Setting and handling solr timeouts in Blacklight / Jonathan Rochkind

When using the Blacklight gem for a Solr search front-end (most used by institutions in the library/cultural heritage sector), you may wish to set a timeout on how long to wait for Solr connection/response.

It turns out, if you are using Rsolr 2.x, you can do this by setting a read_timeout key in your blacklight.yml file. (This under-documented key is a general timeout, despite the name; I have not investigated with Rsolr 1.x).

But the way it turns into an exception and the way that exception is handled is probably not what you would consider useful for your app. You can then change this by over-riding the handle_request_error method in your CatalogController.

I am planning on submitting some PR’s to RSolr and Blacklight to improve some of these things.

Read on for details.

Why set a timeout?

It’s generally considered important to always set a timeout value on an external network request. If you don’t do this, your application may wait indefinitely for the remote server to respond, if the remote server is being slow or hung; or it may depend on underlying library default timeouts that may not be what you want.

What can happen to a Blacklight app that does not set a Solr timeout? We could have a Solr server that takes a really long time (or is entirely hung) returning a response for one request, or many, or all of them.

Your web workers (eg puma or passenger) will be waiting a while for Solr. Either indefinitely, or maybe there’s a default timeout in the HTTP client (I’m actually not sure, but maybe 60s for net-http?). During this time, the web workers are busy, and unable to handle other requests. This will reduce the traffic capacity of your app, and with a very slow or repeatedly misbehaving Solr it can be catastrophic, leading to an app that appears unresponsive.

There may be some other part of the stack that will timeout waiting for the web worker to return a response (while the web worker is waiting for Solr). For instance, heroku is willing to wait a maximum of 30 seconds, and I think Passenger also has timeouts (although may default to as long as 10 minutes??). But this may be much longer than you really want your app to wait on Solr for reasons above, and when it does get triggered you’ll get a generic “Timed out waiting for app response” in your logs/monitoring, it won’t be clear the web worker was waiting on solr, making operational debugging harder.

How to set a Blacklight Solr timeout

A network connection to Solr in the Blacklight stack first goes through RSolr, which then (in Rsolr 2.x) uses the Faraday ruby gem, which can use multiple http drivers but by default uses net-http from the stdlib.

For historical reasons, how to handle timeouts has been pretty under-documented (and sometimes changing) at all these levels! They’re not making it easy to figure out how to effectively set timeouts! It took the ruby community a bit of time to really internalize the importance of timeouts on HTTP calls.

So I did some research, in code and in manual tests.

Faraday timeouts?

If we start in the middle at Faraday, it’s not clearly documented… and may be http-adapter-specific? Faraday really doesn’t make this easy for us!

But from googling, it looks like Faraday generally means to support keys open_timeout (waiting for a network connection to open), and timeout (often waiting for a response to be returned, but really… everything else, and sometimes includes open_timeout too).

If you want some details….

For instance, if we look at the faraday adapter for http-rb, we can see that the faraday timeout option is passed to http-rb for each of connect, read, and write.

  • (Which really means if you set it to 5 seconds… it could wait 5 seconds for connect then another 5 seconds for write and another 5 seconds for read 😦. http-rb actually provided a general/global timeout at one point, but faraday doesn’t take advantage of it. 😦😦).

And then the http-rb adapter uses the open_timeout value just for connect and write. That is, setting both faraday options timeout and open_timeout to the same value would be redundant for the http-rb adapter at present. The http-rb adapter doesn’t seem to do anything with any other faraday timeout options.

If we look at the default net-http adapter… It’s really confusing! We have to look at this method in faraday generic too. But I confirmed by manual testing that net-http actually supports faraday read_timeout, write_timeout, and open_timeout (different values than http-rb), but will also use timeout as a default for any of them. (Again your actual end-to-end timeout can be sum of open/read/write. 😦).
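
To make the "sum of open/read/write" point concrete, here is a minimal sketch using only Ruby's stdlib Net::HTTP, without Faraday in the picture. The host, port and timeout values are placeholders I made up. Nothing connects to the network, since Net::HTTP.new only builds the client object.

```ruby
require "net/http"

# Build a client without opening a connection. Host and port are
# placeholders for wherever your Solr actually lives.
http = Net::HTTP.new("solr.example.org", 8983)
http.open_timeout  = 2 # seconds to wait for the connection to open
http.write_timeout = 2 # seconds to wait for each write of the request
http.read_timeout  = 5 # seconds to wait for each read of the response

# Each limit covers a separate phase, so one request can take roughly
# the sum of the three, not just the largest single value.
worst_case = http.open_timeout + http.write_timeout + http.read_timeout
puts "worst-case wait: about #{worst_case}s"
```

The read and write limits apply per read or write operation, so even this sum is only a rough guide and not a hard end-to-end cap.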

It looks like different Faraday adapters can use different timeout values, but Faraday tries to make the basic timeout value at least do something useful/general for each adapter?

Most blacklight users are probably using the default net-http adapter (Curious to hear about anyone who isn’t?)

What will Blacklight actually pass to Faraday?

This gets confusing too!

Blacklight seems to take whatever keys you have set in your blacklight.yml for the given environment, and pass them to RSolr.connect. With one exception, you have to say http_adapter in blacklight.config to translate to adapter passed to Rsolr.

  • (I didn’t find the code that makes that blacklight_config invocation be your environment-specific hash from blacklight.yml, but I confirmed that’s what it is!)

What does Rsolr 2.x do? It does not pass everything on to Faraday, only certain allow-listed items, after translating. Confusingly, it’s only willing to pass on open_timeout, and also translate a read_timeout value from blacklight.yml to Faraday timeout.

Phew! So Blacklight/Rsolr only supports two timeout values to be passed to faraday:

  • open_timeout to Faraday open_timeout
  • read_timeout to Faraday timeout.
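
The mapping above can be sketched in plain Ruby. This is not RSolr's actual code: the constant, the method name and the sample config hash are all hypothetical, and only the two key translations come from what I observed.

```ruby
# Hypothetical helper that mimics the allow-list-and-translate behaviour
# described above. It is NOT RSolr's implementation.
TIMEOUT_KEY_MAP = {
  open_timeout: :open_timeout, # passed through unchanged
  read_timeout: :timeout       # renamed on the way to Faraday
}.freeze

def faraday_timeout_options(config)
  TIMEOUT_KEY_MAP.each_with_object({}) do |(from, to), options|
    options[to] = config[from] if config.key?(from)
  end
end

# A made-up hash standing in for one environment of blacklight.yml.
config = {
  url: "http://localhost:8983/solr/blacklight-core",
  open_timeout: 2,
  read_timeout: 5,
  write_timeout: 3 # not on the allow-list, so it is dropped
}

p faraday_timeout_options(config)
```

Note how write_timeout silently disappears: anything not on the allow-list never reaches Faraday.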

PR to Rsolr on timeout arguments?

I think ideally RSolr would pass on any of the values Faraday seems to recognize, at least with some adapters, for timeouts: read_timeout, open_timeout, write_timeout, as well as just timeout.

But to get from what it does now to there in a backwards compatible way… kind of impossible because of how it’s currently translating read_timeout to timeout. :(

I think I may PR one that just recognizes timeout too, while leaving read_timeout as a synonym with a deprecation warning telling you to use timeout? Still thinking this through.

What happens when a Timeout is triggered?

Here we have another complexity. Just as the timeout configuration values are translated on the way down the stack, the exceptions raised when a timeout happens are translated again on the way up, HTTP library => Faraday => RSolr => Blacklight.

Faraday basically has two exception classes it tries to normalize all underlying HTTP library timeouts to: Faraday::ConnectionFailed < Faraday::Error (for timeouts opening the connection) and Faraday::TimeoutError < Faraday::ServerError < Faraday::Error for other timeouts, such as read timeouts.

What happens with a connection timeout?

  1. Faraday raises a Faraday::ConnectionFailed error. (For instance from the default Net::HTTP Adapter)
  2. RSolr rescues it, and re-raises as an RSolr::Error::ConnectionRefused, which sub-classes the ruby stdlib Errno::ECONNREFUSED
  3. Blacklight rescues that Errno::ECONNREFUSED, and translates it to a Blacklight::Exceptions::ECONNREFUSED, (which is still a sub-class of stdlib Errno::ECONNREFUSED)

That just rises up to your application, to give the end-user probably a generic error message, be logged, be caught by any error-monitoring services you have, etc. Or you can configure your application to handle these Blacklight::Exceptions::ECONNREFUSED errors in some custom way using standard Rails rescue_from functionality, etc.

This is all great, just what we expect from exception handling.

The one weirdness is that the exception suggests connection refused, when really it was a timeout, which is somewhat different… but Faraday itself doesn’t distinguish between those two situations, which some people have wanted to improve for a while now, but there isn’t much a client of Faraday can do about it in the meantime.

What happens with other timeouts?

Say, the network connection opened fine, but Solr is just being really slow returning a response (it totally happens) and exceeding a Faraday timeout value set.

The picture here is a bit less good.

  1. Faraday will raise a Faraday::TimeoutError (eg from the net-http adapter).
  2. RSolr does not treat this specially, but just rescues and re-raises it just like any other Faraday::Error as a generic RSolr::Error::Http
  3. Blacklight will take it, just as any other RSolr::Error::Http, and rescues and re-raise as a generic Blacklight::Exceptions::InvalidRequest
  4. Blacklight does not allow this to just rise up through the app, but instead uses Rails rescue_from to register its own handler for it, a handle_request_error method.
  5. The handle_request_error method will log the error, and then just display the current Blacklight controller “index” page (ie search form), with a message “Sorry, I don’t understand your search.”

This is… not great.

  • From a UX point of view, this is not good, we’re telling the user “sorry I don’t understand your search” when the problem was a Solr timeout… it makes it seem like there’s something the user did wrong or could do differently, but that’s not what’s going on.
    • In fact that’s true for a lot of errors Blacklight catches this way. Solr is down? Solr collection doesn’t exist? Solr configuration has a mismatch with Blacklight configuration? All of these will result in this behavior, none of them are something the end-user can do anything about.
  • If you have an error monitoring service like Honeybadger, it won’t record this error, since the app handled it instead of letting it rise unhandled. So you may not even know this is going on.
  • If you have an uptime monitoring service, it might not catch this either, since the app is returning a 200. You could have an app pretty much entirely down and erroring for any attempt to do a search… but returning all HTTP 200 responses.
  • While Blacklight does log the error, it does it in a DIFFERENT way than Rails ordinarily does… you aren’t going to get a stack trace, or any other contextual information, it’s not really going to be clear what’s going on at all, if you notice it at all.

Not great. One option is to override the handle_request_error method in your own CatalogController to: 1) Disable this functionality entirely, don’t swallow the error with a “Sorry, I don’t understand your search” message, just re-raise it; and 2) unwrap the underlying Faraday::TimeoutError before re-raising, so that gets specifically reported instead of a generic “Blacklight::Exceptions::InvalidRequest”, so we can distinguish this specific situation more easily in our logs and error monitoring.

Here’s an implementation that does both, to be put in your catalog_controller.rb:

  # OVERRIDE of Blacklight method. Blacklight by default takes ANY kind
  # of Solr error (could be Solr misconfiguration or down etc), and just swallows
  # it, redirecting to Blacklight search page with a message "Sorry, I don't understand your search."
  # This is wrong.  It's misleading feedback for user for something that is usually not
  # something they can do something about, and it suppresses our error monitoring
  # and potentially misleads our uptime checking.
  # We just want to actually raise the error!
  # Additionally, Blacklight/Rsolr wraps some errors that we don't want wrapped, mainly
  # the Faraday timeout error -- we want to be able to distinguish it, so will unwrap it.
  private def handle_request_error(exception)
    # Collect every wrapped exception by walking the #cause chain.
    exception_causes = []
    e = exception
    until e.cause.nil?
      e = e.cause
      exception_causes << e
    end

    # Raise the more specific original Faraday::TimeoutError instead of
    # the generic wrapping Blacklight::Exceptions::InvalidRequest!
    if faraday_timeout = exception_causes.find { |cause| cause.kind_of?(Faraday::TimeoutError) }
      raise faraday_timeout
    end

    raise exception
  end
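
If the cause-walking is unfamiliar, here is a small self-contained sketch of the same idea. It uses made-up stand-in exception classes so it runs without Faraday, RSolr or Blacklight. Ruby sets #cause automatically whenever you raise inside a rescue block, which is what makes the unwrapping possible.

```ruby
# Stand-ins for Faraday::TimeoutError, RSolr::Error::Http and
# Blacklight::Exceptions::InvalidRequest. All three names are hypothetical.
class FakeTimeoutError < StandardError; end
class FakeHttpError < StandardError; end
class FakeInvalidRequest < StandardError; end

# Return every wrapped exception, outermost cause first.
def exception_causes(exception)
  causes = []
  current = exception
  while (current = current.cause)
    causes << current
  end
  causes
end

# Simulate the wrap-and-re-raise that happens on the way up the stack.
def wrapped_timeout
  begin
    raise FakeTimeoutError, "read timed out"
  rescue FakeTimeoutError
    raise FakeHttpError, "generic HTTP error" # cause is set automatically
  end
rescue FakeHttpError
  raise FakeInvalidRequest, "invalid request"
end

outer = begin
  wrapped_timeout
rescue FakeInvalidRequest => e
  e
end

original = exception_causes(outer).find { |c| c.is_a?(FakeTimeoutError) }
puts original.message
```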

PRs to RSolr and Blacklight for more specific exception?

RSolr and Blacklight both have a special error class for the connection failed/timed out condition. But they just lump Faraday::TimeoutError in with any other kind of error.

I think this logic is probably many years old, and pre-dates Faraday’s current timeout handling.

I think they should both have a new exception class which can be treated differently. Say RSolr::Error::Timeout and Blacklight::Exceptions::RepositoryTimeout?

I plan to make these PRs.

PR to Blacklight to disable that custom handle_request_error behavior

I think the original idea here was that something in the user’s query entry would trigger an exception. That’s what makes rescuing it and re-displaying it with the message “Sorry, I don’t understand your search” make some sense.

At the moment, I have no idea how to reproduce that, figure out a user-entered query that actually results in a Blacklight::Exceptions::InvalidRequest. Maybe it used to be possible to do in an older version of Solr but isn’t anymore? Or maybe it still is, but I just don’t know how?

But I can reproduce ALL SORTS of errors that were not about the user’s entry and which the end-user can do nothing about, but which still result in this misleading error message, and the error getting swallowed by Blacklight and avoiding your error- and uptime-monitoring services. Solr down entirely; Solr collection/core not present or typo’d; a mismatch between Solr configuration and Blacklight configuration, like Blacklight mentioning a Solr field that doesn’t actually exist.

All of these result in Blacklight swallowing the exception, and returning an HTTP 200 response with the message “Sorry, I don’t understand your search”. This is not right!

I think this behavior should be removed in a future Blacklight version.

I would like to PR such a thing, but I’m not sure if I can get it reviewed/merged?

Talk At "Blockchain for Business" Conference / David Rosenthal

I was invited to be on a panel at the University of Arkansas' "Blockchain for Business" conference together with John Ryan and Dan Geer. Below the fold are my introductory remarks.

I'd like to thank Dan Conway for inviting me to talk about the security of blockchains. You don't need to take notes; the text of my remarks with links to the sources is at

"Blockchain" is unfortunately a term used to describe two completely different technologies, which have in common only that they both use a data structure called a Merkle Tree, commonly in the form patented by Stuart Haber and Scott Stornetta in 1991. This is a linear chain of blocks each including the hash of the previous block. Even more unfortunately, the more secure way to implement trustworthy public databases using Merkle Trees isn't called a blockchain, so doesn't benefit from the tsunami of hype that surrounds the term.

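As a rough illustration of "a linear chain of blocks each including the hash of the previous block", here is a toy sketch in Ruby. The Block struct, the separator and the all-zero genesis value are my own assumptions for illustration; no real blockchain formats its blocks this way.

```ruby
require "digest"

# A toy block: some data plus the hash of the block before it.
Block = Struct.new(:data, :prev_hash) do
  def digest
    Digest::SHA256.hexdigest("#{prev_hash}|#{data}")
  end
end

# Link each record to its predecessor, starting from a made-up genesis value.
def build_chain(records)
  prev = "0" * 64
  records.map do |data|
    block = Block.new(data, prev)
    prev = block.digest
    block
  end
end

# The chain is intact if every block still points at its predecessor's hash.
def chain_valid?(chain)
  chain.each_cons(2).all? { |earlier, later| later.prev_hash == earlier.digest }
end

chain = build_chain(%w[alpha beta gamma])
puts chain_valid?(chain) # the untouched chain verifies

chain[1].data = "tampered"
puts chain_valid?(chain) # the next block's prev_hash no longer matches
```

Changing any block other than the last breaks the link stored in its successor, which is the tamper-evidence property both kinds of "blockchain" rely on.
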
Permissioned blockchains have a central authority controlling which network nodes can add blocks to the chain, whereas permissionless blockchains such as Bitcoin's do not; this difference is fundamental:
  • Permissioned blockchains can use well-established and relatively efficient techniques such as Byzantine Fault Tolerance to ensure that each node in the network has performed the same computation on the same data to arrive at the same state for the next block in the chain. This is a consensus mechanism.
  • In principle each node in a permissionless blockchain's network can perform a different computation on different data to arrive at a different state for the next block in the chain. Which of these blocks ends up in the chain is determined by a randomized, biased election mechanism. For example, in Proof-of-Work blockchains such as Bitcoin's a node wins election by being the first to solve a puzzle. The length of time it takes to solve the puzzle is random, but the probability of being first is biased, it is proportional to the compute power the node uses.
This fundamental difference means that the problems of securing the two blockchains are quite different:
  • A permissioned blockchain is a way to implement a distributed database. Securing it is a conventional problem. You need to ensure the central authority doesn't admit bad actors. You need to ensure each node is under separate administration with no shared credentials, to guard against compromise. Ideally, each node should run different software to guard against supply chain attacks, and so on.
  • Securing a permissionless blockchain is an unconventional problem. Because anyone, even bad guys, can take part its security depends primarily on ensuring that the cost of a successful attack greatly exceeds the rewards to be obtained from it.
To succeed, the attacker of a permissionless blockchain needs a high probability of being elected, which typically means they need to control a majority of the electorate. Making this control more expensive than the potential reward for an attack requires that being a voter be expensive. This has a number of consequences:
  • There is no central authority to collect funds to pay the voters, so they need to be reimbursed by the system itself via either inflation of a cryptocurrency, or transaction fees, or both. Currently, Bitcoin miners' income is around 90% rewards (i.e. inflation). Research shows a fee-only system, as Bitcoin is intended to become, is insecure.
  • Imposing costs with Proof-of-Work, as most cryptocurrencies do, leads to catastrophic carbon footprints. Alternatives to Proof-of-Work are vastly more complex, and extremely difficult to get right. Ethereum has been trying to implement Proof-of-Stake for seven years.
  • Information technologies have strong economies of scale. The more resource a voter has, the better its margins. Thus successful permissionless blockchains are centralized. 3-4 mining pools have controlled the majority of Bitcoin mining power for at least 7 years.
The advantage of permissionless over permissioned blockchains is claimed to be decentralization. But in practice this is an illusion, the enormous costs of attempting to avoid centralization are wasted:
a Byzantine quorum system of size 20 could achieve better decentralization than proof-of-work mining at a much lower resource cost.
Ethereum has been even more centralized than Bitcoin, and because it is a programming environment its attack surface is exponentially greater. In particular, as we see in the recent $600M attack on Poly Network, it is much more vulnerable to supply chain attacks of the kind that Munin defends against in other environments.

Actually, centralization is often a good thing. Mistakes are inevitable, as we see with the recent $90M oopsie at Compound and the subsequent $67M oopsie, or the $23M fee Bitfinex paid for a $100K transaction. Centralization of Ethereum allowed Poly Network to convince miners to make most transfers of the $600M loot very difficult, and persuade the thief to return most of it. Immutability sounds like a great idea until you're the victim of a theft.

The less vulnerable way to implement a trustworthy decentralized database is shown by the Certificate Transparency system described in RFC6962. It is a trust-but-verify system, typically a much more appropriate model for business than immutability. It allows for real-time verification that the certificates that secure HTTPS were issued by the appropriate Certificate Authority (CA) and are current. In essence it is a network with three types of node:
  • Logs, to which CAs report their current certificates, and from which they obtain attestations, called Signed Certificate Timestamps (SCTs) that owners can attach to their certificates. Clients can verify the signature on the SCT, then verify that the hash it contains matches the certificate. If it does, the certificate was the one that the CA reported to the log, and the owner validated. Each log maintains a Merkle tree data structure of the certificates for which it has issued SCTs.
  • Monitors, which periodically download all newly added entries from the logs that they monitor, verify that they have in fact been added to the log, and perform a series of validity checks on them. They also thus act as backups for the logs they monitor.
  • Auditors, which use the Merkle tree of the logs they audit to verify that certificates have been correctly appended to the log, and that no retroactive insertions, deletions or modifications of the certificates in the log have taken place. Clients can use auditors to determine whether a certificate appears in a log. If it doesn't, they can use the SCT to prove that the log misbehaved.
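
To show what the logs' Merkle tree commits to, here is a minimal sketch of the Merkle Tree Hash from RFC 6962 section 2.1, using SHA-256. The entry strings are made up. Real logs hash structured leaf data, and real auditors use inclusion and consistency proofs, not a recomputation of the whole tree as this sketch does.

```ruby
require "digest"

# Merkle Tree Hash per RFC 6962 section 2.1:
#   empty list -> SHA-256 of the empty string
#   one entry  -> SHA-256(0x00 || entry)
#   n entries  -> SHA-256(0x01 || MTH(first k) || MTH(rest)),
#                 where k is the largest power of two smaller than n
def merkle_tree_hash(entries)
  return Digest::SHA256.digest("") if entries.empty?
  return Digest::SHA256.digest("\x00".b + entries.first.b) if entries.size == 1

  k = 1
  k *= 2 while k * 2 < entries.size
  left = merkle_tree_hash(entries[0...k])
  right = merkle_tree_hash(entries[k..-1])
  Digest::SHA256.digest("\x01".b + left + right)
end

entries = %w[cert-a cert-b cert-c cert-d cert-e]
root = merkle_tree_hash(entries)

# Retroactively modifying any entry changes the root hash.
altered = entries.dup
altered[2] = "cert-x"
puts root == merkle_tree_hash(entries) # same entries, same root
puts root == merkle_tree_hash(altered) # altered entries, different root
```
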
As with permissioned blockchains, a few tens of nodes provides adequate decentralization. A key point is that clients verify certificates against a random subset of the tens of nodes they trust, which for each node is a different subset of the whole set of nodes. Thus an attacker has to compromise the vast majority of the nodes to avoid detection. This aids efficiency, by optimizing for the common case when no attack is taking place, while still providing a very high probability of unambiguous detection while an attack is underway.

Note that unlike a blockchain, this is not a consensus or an election mechanism. It is a mechanism for ensuring that none of the actors in the network can escape responsibility for their actions, which in many cases is what is needed. For example, Hof and Carle show how the same mechanism can be applied to securing the software supply chain.

BDR Storage Architecture / Brown University Library Digital Technologies Projects

We recently migrated the Brown Digital Repository (BDR) storage from Fedora 3 to OCFL. In this blog post, I’ll describe our current setup, and then a future post will discuss the process we used to migrate.

Our BDR data is currently stored in an OCFL repository[1]. We like having a standardized, specified layout for the files in our repository – we can use any software written for OCFL, or we can write it ourselves. Using the OCFL standard should also help us minimize data migrations in the future, as we won’t need to switch from one application’s custom file layout to a new application’s custom layout. OCFL repositories can be understood just from the files on disk, and databases or indexes can be rebuilt from those files. Backing up the repository only requires backing up the filesystem – there’s no metadata stored in a separate database. OCFL also has versioning and checksums built in for every file in the repository. OCFL gives us an algorithm to find any object on disk (and all the object files are contained in that one object directory), which is much nicer than our previous Fedora storage where objects and files were hard to find because they were spread out in various directories based on the date they were added to the BDR.

In the BDR, we’re storing the data on shared enterprise storage, accessed over NFS. We use an OCFL storage layout extension that splits the objects into directory trees, and encapsulates the object files in a directory with a name based on the object ID. We wrote an HTTP service for the OCFL-java client used by Fedora 6. We use this HTTP service for writing new objects and updates to the repository – this service is the only process that needs read-write access to the data.
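
As a rough sketch of how a layout like this lets you compute an object's location from its ID alone, here is one possible scheme in Ruby. Everything specific is an assumption for illustration and not necessarily the BDR's actual configuration: the SHA-256 digest, the three 3-character tuples, the percent-encoding, the storage root path and the sample object ID.

```ruby
require "digest"

# Hypothetical layout: hash the object ID, use the first few characters of
# the digest as nested directories, then put the object in a directory
# named after the (percent-encoded) ID itself.
def object_path(storage_root, object_id, tuple_size: 3, tuple_count: 3)
  digest = Digest::SHA256.hexdigest(object_id)
  tuples = Array.new(tuple_count) { |i| digest[i * tuple_size, tuple_size] }
  encoded_id = object_id.gsub(/[^A-Za-z0-9_-]/) do |char|
    char.bytes.map { |byte| format("%%%02x", byte) }.join
  end
  File.join(storage_root, *tuples, encoded_id)
end

puts object_path("/data/ocfl-root", "bdr:1234")
```

Because the path is a pure function of the ID, any read-only process can find an object without consulting a database or index.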

We use processes with read-only access (either run by a different user, or on a different server with a read-only mount) to provide additional functionality. Our fixity checking script walks part of the BDR each night and verifies the checksums listed in the OCFL inventory.json file. Our indexer process reads the files in an object, extracts data, and posts the index data to Solr. Our Django-based storage service reads files from the repository to serve the content to users. Each of these services uses our bdrocfl package, which is not a general OCFL client – it contains code for reading our specific repository, with our storage layout and reading the information we need from our files. We also run the Cantaloupe IIIF image server, and we added a custom jruby delegate with some code that knows how to find an object and then a file within the object.
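
A fixity check of this kind can be sketched in a few lines of Ruby. This is not our actual script: the method name and the sample object are made up, and the sketch relies only on the digestAlgorithm and manifest fields of an OCFL inventory.json, where the manifest maps each digest to a list of content paths relative to the object root.

```ruby
require "digest"
require "fileutils"
require "json"
require "tmpdir"

DIGESTERS = { "sha512" => Digest::SHA512, "sha256" => Digest::SHA256 }.freeze

# Return the content paths whose on-disk digest differs from the manifest.
def fixity_failures(object_root)
  inventory = JSON.parse(File.read(File.join(object_root, "inventory.json")))
  digester = DIGESTERS.fetch(inventory["digestAlgorithm"])
  inventory["manifest"].flat_map do |expected, paths|
    paths.reject do |path|
      actual = digester.file(File.join(object_root, path)).hexdigest
      actual.casecmp?(expected)
    end
  end
end

# Build a tiny throwaway object, check it, corrupt it, check it again.
clean = corrupted = nil
Dir.mktmpdir do |root|
  FileUtils.mkdir_p(File.join(root, "v1", "content"))
  file = File.join(root, "v1", "content", "file.txt")
  File.write(file, "hello")
  inventory = {
    "digestAlgorithm" => "sha512",
    "manifest" => { Digest::SHA512.hexdigest("hello") => ["v1/content/file.txt"] }
  }
  File.write(File.join(root, "inventory.json"), JSON.generate(inventory))

  clean = fixity_failures(root)
  File.write(file, "hello, corrupted")
  corrupted = fixity_failures(root)
end

p clean
p corrupted
```

The check only needs read access, which is why it can run as a separate read-only process.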

We could add other read-only processes to the BDR in the future. For example, we could add a backup process that crawls the repository, saving each object version to a different storage location. OCFL versions are immutable, and that would simplify the backup process because we would only have to back up new version directories for each object.

1. Collection information like name, description, … is actually stored in a separate database, but hopefully we will migrate that to the OCFL repository soon.

How one volunteer is sharing a better reading experience with all of us / Open Library

For nearly 15 years Open Library has been giving patrons free access to information about books in its catalog, direct to their computers. But for millions of readers across the globe who rely on their phones for access, this hasn’t always presented the ideal mobile reading experience.

This year, a volunteer within the Open Library community named Mark developed an independent mobile app, an unofficial companion to the website called the Open Library Reader. This lite app, which is available for free from the Apple store and Play store, emphasizes the mobile reading experience and showcases the books within a patron’s Open Library reading log. It’s a great way to take your personal library with you on the go.

While Open Library Reader is an unofficial app which is not maintained or supported by the staff at Internet Archive, we’re ecstatic that talented volunteers within our community are stepping up to design new experiences they wish existed for themselves and others. We applaud Mark, not only for the time he invested and for showing what’s possible with our APIs, but, true to the spirit of Open Library, for sharing his app for free with patrons, in a way that seems to respect patron privacy.

We sat down with Mark for an interview to learn why he created the Open Library Reader and which of its features may be appreciated by book lovers who are on the go.

A picture of a patron’s personal library when logged in to the Open Library Reader app

Open Librarian: “Why did you find the need to build an Open Library Reader?”

Mark: I read a lot of books on my iPad, especially old, hard-to-find mystery novels. Open Library has a lot of great reads, but I was getting frustrated trying to manage my Reading Log and read books in the tablet browser. There was a lot of scrolling and clicking around, a tap in the wrong place could send me off somewhere else, and the book I was reading was always surrounded by browser and bookreader controls. I just wanted to sit down and read, and not have to be reminded of the fact that I was looking at a website through a browser.

Open Librarian: What were some of the approaches Open Library Reader used to solve these problems?

Mark: I thought about some of the good tablet-based reading experiences I’ve had, and imagined what it could look like if the interface were centered around the individual reader and the small set of tools they need to find, manage, and read books. So the Reading Log shelves and the reading interface are the core of the app, and everything else kind of happens at the edges. Everything you need is just one tap away. The reading interface is still the familiar Internet Archive BookReader, but I’ve overlaid some additional functionality. You can hide all the controls with a single tap, and the book expands to completely fill the screen. I also added a swipe gesture, so it’s easy to turn pages if you’re holding your device with one hand on the couch.

Open Librarian: What does it feel like to use? Can we have a tour?


Open Librarian: What is your favorite part of the app? I like how it shows the return time

Mark: That is cool — that’s another example of centering the needs of the reader. It’s hard to pick a favorite part. Every feature is the result of me reading in the app every day for months before I released it. Periodically, I’d think “that’s kind of annoying” or “I wish I could…” and I’d go code for a while until I was happy with the experience. But the full-screen reading mode is probably my favorite. With the high-resolution page scans expanded to fill the screen, it’s almost like reading a physical book.

Open Librarian: What was your experience like developing the Reader?

Mark: I’m a retired web developer, so interface design, user experience, APIs and that sort of thing are nothing new, but I’ve never built a native app. After some reading, I picked Google’s Flutter tool, which allows easy cross-platform app development. I was amazed at how fast it was to assemble a simple app with just a few lines of code, and then it was just a matter of layering on the functionality I wanted. I spent a lot of time exploring the Open Library and Internet Archive APIs to figure out the best way to get at the data I needed, and even submitted a few updates to the Open Library codebase to support features I wanted to build. The Open Library team was extremely welcoming and supportive, and really made this app possible.

How can you support Mark’s work?

First, try downloading the Open Library Reader app from the Apple store or Play store. If you have a suggestion, question, or feedback for Mark, send him an email. If you appreciate his work, consider rating the app on the app stores and leaving a review so others may discover and enjoy it too. To learn more about Mark and the Open Library Reader, look out for his upcoming interview on the Open Library Community Podcast.

Want to contribute to Open Library too?

See all the ways you can volunteer within the Open Library community!

See the schedule and register for Samvera Connect 2021 Online / Samvera

The schedule is now available for Samvera Connect 2021 Online!

Samvera Connect 2021 Online is October 18th – 22nd, from 11:00 AM EDT – 2:30 PM EDT.  Two workshops will be offered on Friday, October 15th, and you can register for them when you register for Samvera Connect.

Connect offers a great way to learn what’s happening and what’s coming next for the Samvera Community and technologies. Registration is free and easy!

A few reasons you’ll want to be sure to attend this year:

  • A Keynote presentation on Friday, October 22nd from Aymar Jean Christian, an associate professor of communication studies at Northwestern University and a Fellow at the Peabody Media Center. Professor Christian will present “Digital Archiving-in-Process as Reparative Practice”
  • Two workshops on October 15th: Introduction to Samvera Community, Technology, and Values; and Introduction to Valkyrie. You can register for these workshops when you register for the Conference. Seats are limited!
  • A great schedule of panels, presentations, and lightning talks from our Community, across all Samvera technologies and with interesting topics for developers, administrators, and those interested in learning more about Samvera.
  • Easy registration: You can register quickly via Eventbrite and you’ll be invited to join the Connect 2021 Sched, where you’ll be able to browse sessions and create your own schedule as soon as it is posted. You’ll also be able to access webinar links directly from Sched during the conference.

You can still submit lightning talks and online posters for Connect through Thursday, October 7th.

We look forward to seeing you in two weeks!

The post See the schedule and register for Samvera Connect 2021 Online appeared first on Samvera.

Hear a developer’s view on algorithmic decision-making! Justice Programme community meet-up on 14th October / Open Knowledge Foundation

Last month, the Open Knowledge Justice Programme launched a series of free, monthly community meetups to talk about Public Impact Algorithms.

We believe that by working together and making connections between activists, lawyers, campaigners, academics and developers, we can better achieve our mission of ensuring algorithms do no harm.

For the second meet-up, we’re delighted to be joined for an informal talk by Patricio del Boca, who is a senior developer at Open Knowledge Foundation. He is an Information Systems Engineer and enthusiast of open data and civic technologies. He likes to build and collaborate with different communities to disseminate technical knowledge and participate as a speaker in events to spread the importance of civic technologies.

Patricio will share a developer’s perspective on AI and algorithms in decision-making, the potential harms they can cause and the ethical aspects of a developer’s work. We will then open up the discussion for all.

Whether you’re new to tech or a seasoned pro, join us on 14th October 2021 between 13:00 and 14:00 GMT to share your experiences, ask questions, or just listen.

= = = = =
Register your interest here
= = = = =


DLF Digest: October 2021 / Digital Library Federation

DLF Digest

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation.

This month’s news:

This month’s DLF group events:

Joint Research Libraries UK/DLF Data & Digital Scholarship working group topics of interest and networking meeting

Tuesday, October 12, 11 am ET/8 am PT; register at 

This highly interactive session will follow two successful joint meetings between members of the DLF Data and Digital Scholarship working group (DDS) and RLUK’s Digital Scholarship Network (DSN). The meeting will continue to explore the areas of common and shared interest between the US and UK research libraries in relation to digital scholarship and skills development. 

Breakout topics for this session include: collections as data, working with institutional partners, and text and data analysis.

It will also include opportunities to meet with fellow professionals, share skills and knowledge, hear from skills experts, and receive updates regarding the continued collaboration between the DDS and DSN.

You do not need to have attended a previous joint meeting to attend this session, and the meeting is open to all members of the DDS and DSN. Join the DDS Google Group here.

This event will involve lots of delegate participation. Come energized to share your experiences, specialisms, and skill needs in a dynamic, transatlantic skills exchange.


DLF Digital Library Pedagogy group and Digital Accessibility Advocacy and Education subgroup

Tuesday, October 12, 3 pm ET/12 pm PT; register at

The #DLFteach and DLF Digital Accessibility working groups are excited to co-host a discussion group meeting on ways of improving accessibility for teaching with VR in libraries and higher education classrooms. We will start by discussing the article by Jasmine Clark and Zack Lischer-Katz, “Barriers to Supporting Accessible VR in Academic Libraries,” Journal of Interactive Technology & Pedagogy, no. 17, 20 May 2020, and then turn to a broader discussion about concerns around immersive technologies for distance learning and collaboration.

If you have any questions, please contact Heidi Winkler, Adele Fitzgerald, or Alex Wermer-Colan.

After registering, you will receive a confirmation email containing information about joining the meeting.

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Can’t find meeting call-in information? Email us at


DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member institution. Learn more about our working groups and how to get involved on the DLF website. Interested in starting a new working group or reviving an older one? Need to schedule an upcoming working group call? Check out the DLF Organizer’s Toolkit to learn more about how Team DLF supports our working groups, and send us a message to let us know how we can help.

The post DLF Digest: October 2021 appeared first on DLF.

Help us make Open Data Day more impactful – strategic thinker needed. / Open Knowledge Foundation

Open Data Day is a community event where everyone is invited to contribute. Each year over 300 groups organise activities around the world to show the benefits of open data and to encourage the adoption of open data policies in government, business and civil society.

As the stewards of Open Data Day, Open Knowledge Foundation is committed to ensuring that Open Data Day has the support it needs to thrive in a rapidly changing world.

This is why we recently asked you about your experience of Open Data Day. We wanted to learn what worked and what could be improved. You gave us lots of fantastic ideas – which we published here.

This is also why we decided to publish our ‘Open Data Day 2021 Report’. This report describes (from our perspective) our stewardship of Open Data Day – what we did, who we worked with, and what happened. Please do download a copy of this report and share it with your contacts.

    In the next stage of our stewardship plan, Open Knowledge Foundation plans to hire an expert to help us engage with the whole Open Data Day community, to provide insights into the long-term impacts of Open Data Day, and to help us identify new ways to achieve them.

    The world is changing rapidly, and like any organisation, we need to make plans to ensure our work remains relevant into the future and that the work of the Open Data Day community continues to achieve impact.

    If you are interested in learning more about this opportunity – please email

Learn more about Open Data Day here, and join the discussion here.

Getting to know users of archival aggregation sites / HangingTogether

The following post is part of a series highlighting OCLC Research’s efforts for the Building a National Finding Aid Network (NAFAN) project.

A person is typing on a laptop while seated at a table. Photo by Cytonn Photography. Free use image from Unsplash.

One major challenge for the NAFAN project is the lack of information about the users of archival aggregation sites. Past research is dominated by one-time studies. We’ve discussed this in greater depth in previous blog posts in this series.

The lack of data about users and user behavior is perhaps not surprising given the amount of work it takes just to keep the lights on. Sustaining archival aggregation requires constant attention to the contributor network, normalization of finding aid data, and often maintenance of aging infrastructure. With these challenges in mind, OCLC Research designed a study that takes a comprehensive approach to understanding the breadth of users and the depth of their search behavior across different aggregators. The project is more than just a retrospective study. The information we collect through this project is a key component in the design of a new national portal for users to access aggregated archival information. For this study we are gathering information using a variety of methods, including point-of-service pop-up surveys, virtual focus group interviews with cultural heritage professionals who work with archival materials and create finding aids, and in-depth individual interviews with users of archival aggregation sites. This post focuses on the information gleaned from the pop-up survey. Subsequent posts will highlight findings from data that were gathered using different qualitative methods.


To gather information about users and user behavior across sites, we asked all 13 archival aggregators participating in the grant to host the pop-up survey on their sites. 

There were a few challenges we had to work through in designing our data collection strategy. One of the most important was determining how to get a sample of the general population that visits and conducts research on the sites. For 8 of the participating aggregators with lower traffic, the pop-up survey appeared for every user, and individuals were free to opt in or decline to take the survey. For 5 sites with a higher frequency of use, the pop-up survey appeared for every other user of the site. We chose this strategy to provide an opportunity for a broad cross-section of users to respond. We had considered a targeted probability-based sampling approach at the outset. But without prior studies detailing the population of users of archival aggregation sites, there was no way to develop the probabilistic mechanisms for sample selection; therefore, we used convenience sampling. While the inability to make statistical inferences from the data presents a challenge, we knew that there still were many important uses of the data from a convenience sample. For example, because this is an early study that looks across multiple aggregators, the data would be helpful in developing new research hypotheses and defining general tendencies and ranges of responses. It also was a unique opportunity to look at user responses across multiple aggregators.
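
The display rule described above (every visitor on lower-traffic sites, every other visitor on higher-traffic ones) can be sketched in a few lines of Python. This is only an illustration: the post does not say how each aggregator implemented the pop-up, and `should_show_survey`, `visitor_number`, and `interval` are hypothetical names.

```python
def should_show_survey(visitor_number: int, interval: int = 1) -> bool:
    """Return True if the pop-up should appear for this visitor.

    visitor_number is a hypothetical 1-based running count of visitors.
    interval=1 shows the survey to everyone; interval=2 to every other visitor.
    """
    if visitor_number < 1 or interval < 1:
        raise ValueError("visitor_number and interval must be positive")
    return (visitor_number - 1) % interval == 0


# Lower-traffic site: all ten visitors see the pop-up.
low_traffic = [n for n in range(1, 11) if should_show_survey(n, interval=1)]

# Higher-traffic site: every other visitor sees it.
high_traffic = [n for n in range(1, 11) if should_show_survey(n, interval=2)]

print(low_traffic)
print(high_traffic)
```

In either case visitors still decide for themselves whether to respond, so a rule like this thins out the number of prompts without turning the result into a probability sample.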

We set an initial national target of 1,000 total responses. Given the wide variability in use rates across aggregators, which ranged from 20 to 141,000 per month, we knew it would be impossible to get comparable response numbers across all sites using the same collection period and concluded that a strategy of blanketing respondents across most sites would provide the greatest flexibility for subsequent research. The total survey response far exceeded expectations, with 3,300 complete responses across all sites.

Data collection started on March 19, 2021. Some aggregators implemented the pop-up survey link at a slightly later date. The entire data collection period spanned two months with each aggregator holding their survey open for at least six weeks. The pop-up survey was posted on aggregator portals on the homepage, on search results pages, and on the landing page for each finding aid published within the aggregator system.  

Who are the users of archival aggregation sites? 

Below are findings from the survey. These are key takeaways that help to describe the pop-up survey respondents, and should not be used to generalize to all archival users. 

Chart 1 below shows all reported professions, ranked from most to least frequently reported. The highest portion of survey respondents (20.8%) reported that they had retired from full-time employment. Chart 2 below shows that 56.8% of respondents are over 55 years old, which is consistent with the fact that a large number of respondents reported that they are retired.

Chart 1. NAFAN Pop-up Survey of Users by Profession, ranked

Interestingly, the next highest ranked profession is information professionals; librarians and archivists make up 13.5% of respondents. We know from contextual information that archivists and librarians visit aggregation sites to fulfill daily work-related responsibilities such as reference work for users or collection development. Roughly one-third of respondents work in various professions where research in archives is common such as faculty and academic researchers, graduate students, genealogists and undergraduate students.

The professions representing less than 5% of respondents include journalists, writers, artists, filmmakers, museum professionals, K-12 educators, historians, and independent researchers.

Given the reported age distribution (Chart 2), it also appears that the largest group of respondents is also the oldest (65+ years old). Below the age of 55, the next largest group of respondents is aged 45-54 at 14.5%, followed by 10.6% of respondents in the 35-44 age range. Roughly the same percentage of respondents indicated they were undergraduate students (5.9%) as reported being 19-25 years of age (6.2%).

Chart 2. NAFAN Pop-up Survey of Users by Age, ranked
Note: 2.1% of respondents did not report their age.

When asking respondents to report their purpose for using an archival aggregation website, we allowed them to select more than one topical or thematic area. As Chart 3 below shows, the reasons reported for visiting the aggregation site include a mix of personal and professional uses. Some of these uses are short-term in nature, such as school assignments, newspaper articles, and theses. Others are longer-term and may require several visits. These uses appear more frequently (each more than 19%) and include book projects, family history and local history research. When compared to the professions in Chart 1, it also is easy to associate some of these long-term projects with the work of academics, professionals, archivists and genealogists.

Chart 3. NAFAN Pop-up Survey User’s Research Purpose, ranked.
Totals do not sum to 100%. Long-term projects include books, documentaries or other projects that take months or years. Short-term projects include news articles, television projects, or other projects that take days or weeks.

Nearly half of respondents (42.7%) indicated that they preferred online materials but were willing to use in-person materials. Roughly a quarter of the respondents (23.6%) indicated they had no preference between online or in-person materials. Roughly the same number of respondents stated a strong preference for online only (14.4%) as those that prefer in-person only (14.7%).

Chart 4. NAFAN Pop-up Survey User’s Preference for Online or In-person Materials

Look for more posts in this series on other parts of the Building a National Finding Aid Network project from OCLC Research.

Acknowledgements: I want to thank my project team colleagues, Chela Scott Weber, Lynn Silipigni Connaway, Chris Cyr, Brittany Brannon, and Merrilee Proffitt for their assistance with the survey data collection and analysis; OCLC Research colleagues that reviewed the draft pop-up survey questionnaire; and, Chela, Lynn, and Merrilee for their review of this blog post. We also want to thank the respondents for taking the time to support our research efforts and complete the survey.

The post Getting to know users of archival aggregation sites appeared first on Hanging Together.

The interoperability imperative in research support / HangingTogether

Photo by Mila Tovar on Unsplash

The “dictionary definition” of interoperability is the “ability of a system … to work with or use the parts or equipment of another system.” Boiling this definition down to its very essence, we can say that interoperability involves one thing “working with” another to create value that neither could achieve independently. A growing body of work produced by OCLC Research in the area of research support touches on the theme of interoperability – between systems, between people, and between institutions. And the ubiquity of this theme across our findings, as well as its demonstrated importance in delivering robust, sustainable research support services, has led us to label it “the interoperability imperative.”

Technical interoperability

What does interoperability mean to you? For many, technical interoperability comes to mind: linkages between multiple systems, creating integrated, seamless processes. Research support systems are certainly no strangers to interoperability of this form: think of data exchange and synching between repositories or harmonizing metadata between an institutional repository and a research information management (RIM) system. A forthcoming OCLC Research report Research Information Management in the United States, featuring case studies of RIM systems at five research universities, asserts “In today’s universities, we need our systems to be technically interoperable” and encourages “institutional stakeholders [to] adopt an enterprise view of RIM practices—examining silos, redundancies, duplication of effort, and providing insights into opportunities for improved interoperability, decision support capabilities, and informed institutional investment.”

Social interoperability

We have recently introduced another way to think about interoperability in the context of research support services: social interoperability, which we define as “the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding.” Our recent report Social Interoperability in Research Support: Cross-campus Partnerships and the University Research Enterprise collects insights from university stakeholders on the importance of working across campus units to provide research support services. In the report, we highlight the “significant challenges of trying to coordinate highly independent individuals with different goals and interests, spread across a large, decentralized organization” like a university, and conclude that “[s]ocial interoperability is a means of cutting through these complexities and obstacles, promoting mutual understanding, highlighting coincidence of interest, and cultivating buy-in and consensus.”

Institution-scale interoperability

The two projects described above focus on interoperability between systems and between people, with those systems and people often located on the same campus. But we can also think of interoperability between institutions. We are currently working on another project, Library Collaboration in Research Data Management (RDM), where we are exploring institution-scale interoperability in the form of cross-institutional collaborative arrangements to support research data management (RDM) needs. Our motivation for the project draws on Lorcan Dempsey’s observation that “library collaboration is very important, so important that it needs to be a more deliberate strategic focus for libraries.” Or as we state in the project description:

“Libraries have a rich history of working together to meet mutual needs. Current advances in digital and network technologies have amplified the benefits and lowered the costs of cross-institutional collaboration, making it an inviting choice for academic libraries seeking to acquire new services, expertise, and infrastructure … As interest in library collaboration grows, it becomes more important for academic libraries to be purposeful and strategic in their use of this sourcing option.”

So three different projects – one focused on systems, one on people, and one on institutions – and the common thread uniting them, apart from a shared foundation in research support services, is that they all feature some flavor of interoperability. One of the interview participants we spoke to for our Social Interoperability report remarked: “Well up front, I would say I can’t get anything done without partnerships. I mean it’s just absolutely essential to partner, whether it’s with centers, institutes, department chairs, academic deans, research deans, all the above.” While this person was speaking about social interoperability, the substance of their observation – “it’s just absolutely essential to partner” – applies equally well to interoperability in all its forms.

In a sense, the need to “work with” other systems, people, and institutions is almost a universal property of well-designed, sustainable research support services.

This is what we call the interoperability imperative – a term coined by my colleague Rebecca Bryant.

The interoperability imperative

The interoperability imperative means that in developing research support services, increased attention needs to be dedicated to what happens at the boundaries between the key agents that bring those services to life: systems, people, and institutions. How do those agents interact, and what infrastructure – technical, social, or collaborative – do we need to catalyze and facilitate those interactions?

For example, for technical interoperability, we need infrastructure like APIs, data exchange standards, and global persistent identifiers (PIDs) that allow different systems to “talk to” each other and exchange information. For social interoperability, we need both formal and informal social infrastructure, like standing committees, cross-unit working groups, or even regular coffee meet-ups (or the virtual equivalent) that bring people from different campus units together to cultivate and leverage effective working relationships. And for inter-institutional interoperability, we need the collaborative apparatus to convene institutional partners around shared research support needs, apportion costs and responsibilities, and find consensus in priorities and direction.

A focus on bridging the boundaries between systems, people, and institutions – including the infrastructure or contextual opportunities needed to close gaps and make connections – helps weave the research support eco-system into a navigable, coherent whole. This is especially important in a service environment in which research support functionalities are split across many systems; research support expertise and administrative responsibility are distributed across many people in many different campus units; and vital research support capacities are often most efficiently provided through the pooled contributions of multiple institutions operating shared services at scale.

One thing that we have noticed in our work looking at the interoperability imperative from the technical, social, and collaborative perspectives is that there is a fractal-like quality to our findings: the general patterns or features of effective interoperability seem to replicate themselves whether the focus is systems, people, or institutions.

For example, technical interoperability is facilitated through the adoption of shared data exchange formats; in the same way, we found that social interoperability is strengthened by speaking a “common language” that avoids potential misunderstandings over jargon, terminology, and concepts from different professional backgrounds. Or another example: we found that social interoperability is furthered by “knowing your partners” – their responsibilities, priorities, and pain points. In the same way, a sustainable and effective collaborative network of institutions requires a level of trust that commitments to the collective effort will be kept – a trust often cultivated through the familiarity of long-standing associations such as consortia. In short, although our work on the interoperability imperative splits its focus across systems, people, and institutions, there is lots of cross-pollination of ideas and findings.

As we continue our work in the area of research support, we will have much more to say about the interoperability imperative. It’s a crucial part of orchestrating valuable connections across a decentralized and diffuse research support service space.

Thanks to my colleagues Rebecca Bryant and Annette Dortmund for helpful suggestions that improved this post!

The post The interoperability imperative in research support appeared first on Hanging Together.

Make a difference! Our invitation to join the Advisory Council of The Justice Programme / Open Knowledge Foundation

What’s this about?

= = = = =
Do you have professional expertise in emerging data driven technologies such as artificial intelligence and their relationship with the law – especially the legal systems of the UK, Republic of Ireland and the EU?

Are you up to date with the debates around surveillance technologies and their potential negative impacts on human rights?

Do you understand innovation in big data and machine learning, and the opportunities these technologies present? Are you also working to ensure these technologies can be deployed in a way that ensures that everyone benefits?

Would you like to use these skills to influence the future of artificial intelligence and algorithmic decision making in the UK, and around the world?
= = = = =

If this is you – we invite you to be part of the Advisory Council of The Justice Programme.

Email to find out more.

Also – if you know someone who would be a great fit for the role – please either share this with them, or email us their details and we will get in touch with them directly.

September Open Meeting / Islandora

jmlynaryk, Mon, 09/27/2021 - 12:51

Our September Open Meeting takes place on Tuesday, Sept 28th from 10:00-2:00 Eastern.

The theme of our September Open Meeting is ‘Islandora Roadmap’, and the meeting will be hosted by the Islandora Event Planning Subgroup. Join us in a conversation about community health, and sprint planning and management in service of an inclusive, sustainable Islandora codebase and community going forward. Activities include open discussions, sprint planning, and an issue/use-case cleanup mini-sprint!


Here is the full schedule:


Welcome and Introductions

10:00 - 10:15

Presenters: Islandora Event Planning Subgroup


Community Health


Have an introductory open discussion about community health.

Goal: Determine the appropriate group(s) or committee(s) to undertake any next steps we decide upon.

Presenters: Islandora Event Planning Subgroup


Sprint planning and management


Present the emerging strategy for sprint management & issue cleanup

Goal: Refine this approach & prepare for issue/use-case cleanup in the afternoon.

Presenters: Rosie LeFaive, Kirsta Stapelfeldt, and Natkeeran Ledchumykanthan



12:00 - 12:30


Issue/use-case cleanup

12:30 - 2:00

Facilitators: Rosie LeFaive, Kirsta Stapelfeldt, and Natkeeran Ledchumykanthan


All are welcome and encouraged to attend, and no registration is required! Simply contact us to receive a calendar invite or Zoom password.

Can't wait to see you there!

Introducing the Distant Reader Toolbox / Eric Lease Morgan

The Distant Reader Toolbox is a command-line tool for interacting with data sets created by the Distant Reader, data sets affectionately called “study carrels”.


The Distant Reader takes an almost arbitrary amount of unstructured data (text) as input, creates a corpus, performs a number of text mining and natural language processing functions against the corpus, saves the results in the form of delimited files as well as an SQLite database, summarizes the results, and compresses the whole into a zip file. The resulting zip file is a data set intended to be used by people as well as computers. These data sets are called “study carrels”. There exists a collection of more than 3,000 pre-created study carrels, and anybody is authorized to create their own.

Study carrels

The contents of study carrels are overwhelmingly plain text in nature. Moreover, the files making up study carrels are consistently named and consistently located. This makes study carrels easy to compute against.

The narrative nature of study carrel content lends itself to quite a number of different text mining and natural language processing functions, including but not limited to:

  • bibliometrics
  • full-text indexing and search
  • grammar analysis
  • keyword-in-context searching (concordancing)
  • named-entity extraction
  • ngram extraction
  • parts-of-speech analysis
  • semantic indexing (also known as “word embedding”)
  • topic modeling

Given something like a set of scholarly articles, or all the chapters of all the Jane Austen novels, study carrels lend themselves to a supplemental type of reading, where reading is defined as the use and understanding of narrative text.
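
Of the functions listed above, keyword-in-context searching (concordancing) is perhaps the simplest to illustrate. The following is a minimal Python sketch of the general technique, not the Toolbox's own implementation; the `concordance` function, its `width` parameter, and the sample sentence are all assumptions made for the example.

```python
import re


def concordance(text: str, query: str, width: int = 20) -> list[str]:
    """Return each occurrence of query with `width` characters of context on both sides."""
    lines = []
    for match in re.finditer(re.escape(query), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        # Collapse line breaks and runs of spaces so each hit prints on one line.
        lines.append(" ".join(text[start:end].split()))
    return lines


sample = (
    "Sing, O goddess, the anger of Achilles son of Peleus, "
    "that brought countless ills upon the Achaeans."
)
hits = concordance(sample, "achilles", width=12)
for hit in hits:
    print(hit)
```

Seeing a word with a little text on either side is often enough to tell how it is being used without reading the whole document.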


The Toolbox exploits the structured nature of study carrels, and makes it easy to address questions from the mundane to the sublime. Examples include but are not limited to:

  • How big is this corpus, and how big is this corpus compared to others?
  • Sans stop words, what are the most frequent one-word, two-word, etc-word phrases in this corpus?
  • To what degree does a given word appear in a corpus? Zero times? Many times, and if many, then in what context?
  • What words can be deemed as keywords for a given text, and what other texts have been classified similarly?
  • What things are mentioned in a corpus? (Think nouns.)
  • What do the things do? (Think verbs.)
  • How are those things described? (Think adjectives.)
  • What types of entities are mentioned in a corpus? The full names of people? Organizations? Places? Locations? Money amounts? Dates? Times? Works of art? Diseases? Chemicals? Organisms? And given these entities, how are they related to each other?
  • What are all the noun phrases in a text, and how often do they occur?
  • What did people say?
  • What are all the sentence fragments matching the grammar subject-verb-object, and which ones of those fragments match a given regular expression?
  • Assuming that a word is known by the company it keeps, what words are in the same semantic space (word embedding), or what latent themes may exist in a corpus beyond keywords (topic modeling)?
  • How did a given idea ebb and flow over time? Who articulated the idea, and how? Where did a given idea manifest itself in the world?
  • If a given book is denoted as “great”, then what are its salient characteristics, and what other books can be characterized similarly?
  • What is justice and if murder is morally wrong, then how can war be justified?
  • What is love, and how do Augustine’s and Rousseau’s definitions of love compare and contrast?

Given a study carrel with relevant content, the Toolbox can be used to address all of the questions outlined above.
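
To make one of these questions concrete, here is a rough Python sketch of counting the most frequent one-word and two-word phrases once stop words are set aside. It illustrates the general idea only and makes no claim about how the Toolbox computes its results; `top_ngrams`, the tiny `STOP_WORDS` set, and the sample sentence are invented for the example.

```python
import re
from collections import Counter

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"a", "and", "in", "is", "of", "the", "to"}


def top_ngrams(text: str, size: int = 2, count: int = 3) -> list[tuple[str, int]]:
    """Return the `count` most frequent `size`-word phrases, ignoring stop words."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    # Zipping the word list against shifted copies of itself yields the ngrams.
    ngrams = zip(*(words[i:] for i in range(size)))
    return Counter(" ".join(gram) for gram in ngrams).most_common(count)


sample = "Love of wisdom is the love of truth, and the love of truth is rare."
unigrams = top_ngrams(sample, size=1, count=2)
bigrams = top_ngrams(sample, size=2, count=1)
print(unigrams)
print(bigrams)
```

Note one design choice in this sketch: stop words are removed before phrases are formed, so a two-word phrase may join words that were separated by a stop word in the original text. Filtering after forming the phrases is an equally reasonable alternative.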


The Toolbox requires Python 3, and it can be installed from the terminal with the following command:

pip install reader-toolbox

Once installed, you can invoke it from the terminal like this:

  rdr


The result ought to be a help text looking much like this:

  Usage: rdr [OPTIONS] COMMAND [ARGS]...

    --help  Show this message and exit.

    browse       Peruse <carrel> as a file system Study carrels are sets of...
    catalog      List study carrels Use this command to enumerate the study...
    cluster      Apply dimension reduction to <carrel> and visualize the...
    concordance  A poor man's search engine Given a query, this subcommand...
    download     Cache <carrel> from the public library of study carrels A...
    edit         Modify <carrel>'s stop word list When using subcommands such...
    get          Echo the values denoted by the set subcommand This is useful...
    grammars     Extract sentence fragments from <carrel> where fragments are...
    ngrams       Output and list words or phrases found in <carrel> This is...
    play         Play the word game called hangman.
    read         Open <carrel> in your Web browser Use this subcommand to...
    search       Perform a full text query against <carrel> Given words,...
    semantics    Apply semantic indexing queries against <carrel> Sometimes...
    set          Configure the location of your study carrels and a subsystem...
    sql          Use SQL queries against the database of <carrel> Study...
    tm           Apply topic modeling against <carrel> Topic modeling is the...

Once you get this far, you can run quite a number of different commands:

  # browse the remote library of study carrels
  rdr catalog -l remote -h
  # read a carrel from the remote library
  rdr read -l remote homer
  # browse a carrel from the remote library
  rdr browse -l remote homer
  # list all the words in a remote carrel
  rdr ngrams -l remote homer
  # initialize a local library; accept the default
  rdr set
  # cache a carrel from the remote library
  rdr download homer
  # list all two-word phrases containing the word love
  rdr ngrams -s 2 -q love homer
  # see how the word love is used in context
  rdr concordance -q love homer
  # list all the subject-verb-object sentence fragments containing love; please be patient
  rdr grammars -q love homer
  # much the same, but for the word war; will return much faster
  rdr grammars -q '\bwar\b' -s homer | more
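
To get a feel for what the ngrams and concordance subcommands report, here is a minimal Python sketch of the two underlying ideas: counting two-word phrases that contain a given word, and showing a word in its surrounding context. It is not the Toolbox's own code. The function names, the sample sentence, and the context width are all assumptions made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngrams(tokens, size=2):
    """Return every run of `size` consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def concordance(text, query, width=20):
    """Return each whole-word match of `query` with up to `width`
    characters of context on either side (a keyword-in-context view)."""
    pattern = r"\b" + re.escape(query) + r"\b"
    snippets = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        snippets.append(text[start:end])
    return snippets


# Hypothetical sample text; a real study carrel would supply whole books.
sample = "Sing of love and war. The love of home outlasts the love of war."

tokens = tokenize(sample)
# Count only the two-word phrases that contain the word "love".
love_bigrams = Counter(gram for gram in ngrams(tokens, 2) if "love" in gram)

for gram, count in love_bigrams.most_common():
    print(count, " ".join(gram))

for snippet in concordance(sample, "love"):
    print(snippet)
```

The Toolbox does far more than this (stop words, caching, a database behind each carrel), but the sketch shows why a phrase list and a concordance complement each other: the first tells you what is frequent, and the second tells you how it is used.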


The Distant Reader creates data sets called “study carrels”, and study carrels lend themselves to analysis by people as well as computers. The Toolbox is a companion command-line application written in Python. It simplifies the process of answering questions — from the mundane to the sublime — against study carrels.

Extending a Warm Welcome to New Patrons / Open Library

By Sabreen Parveen with Ray Berger & Mek

A Foreword from the Mentors

For book lovers who use Open Library every day, it may be easy to forget what it felt like to visit the website for the first time. Features that some patrons learned the hard way, through trial and error, may not be as easy or intuitive for others to understand. We feel like we’ve failed each time a patron leaves the library frustrated, before even having the chance to understand the value it may provide to them.

At Open Library, we strive to design a service which is accessible and easy for anyone to use and understand. We understand that everyone has different experiences and usability needs. Our mission is to make books as accessible and useful to the public as possible, and we’re unable to do this if patrons aren’t given the opportunity and resources to learn how our services work.

After polling dozens of patrons on video calls and through surveys, we started to get a good idea about which aspects of the website are most confusing to new patrons. The most common question was, “What is Open Library and what does it let you do?” We tried to search for a clear explanation on our homepage, but there wasn’t one, just rows of books we assumed patrons would click on and somehow understand how it all worked. We also received useful questions about which books on Open Library are readable or borrowable, and what it means when a book shows as unavailable or not in the library. We also received questions about how the Reading Log works. We decided to address some of these frequently asked questions at the earliest possible entry point: on our home page with a new Onboarding Carousel. Leading this project was 2021 Open Library Fellow, Sabreen, with the mentorship of Ray & Mek. We’re so excited and proud to showcase Sabreen’s hard work to you!

Designing a Simple-to-use Onboarding Experience

By Sabreen Parveen

This summer I got the amazing opportunity to work with the Internet Archive as an Open Library Fellow, where I contributed to the Onboarding Project.

My Journey with Open Library

I decided to join the Open Library community in 2020 because I was interested in contributing to an open source project and improving my abilities as a programmer and designer. Several things about Open Library stuck out to me while I was browsing projects on GitHub. Firstly, I knew the languages and frameworks it used. Secondly, the documentation was very clear and easy to understand. Thirdly, the issue tracker contained many exciting ways for me to help. Most importantly, the project had an active community and hosted calls every week where I could work with others and ask questions. Once I had familiarized myself with the project, I joined Open Library’s public Gitter chatroom and asked questions about getting started. Shortly after, I attended my first community call, received a Slack invite, and later that week submitted my first contribution! I have joined almost all the community calls since. Gradually I started solving more and more issues, many of them related to web accessibility and SEO. I also started creating graphics for Open Library’s “monthly reads” pages. The community must have been excited about my contributions, because this year I was invited to be a 2021 Open Library Fellow and to team up with a mentor to lead a flexible, high-impact project to completion.

Selecting a Project: Onboarding Flow

The project I chose for my 2021 Open Library Fellowship was to add a new user onboarding experience to the homepage to help new patrons get an overview of the website and how to use its features.

The problem

First-time visitors to Open Library often report getting confused because they don’t know how to use the service. We had several indicators this was the case:

  • From my own experience, I had been confused when I first started using the website. I didn’t know what the “Want to Read” button did, and I only learned about the list feature while solving an issue.
  • Bounce Rate: Open Library has a fairly high bounce rate, which is the percentage of people who visit a website and leave without continuing to other pages. We wondered if this was because patrons were confused about how to use the website, and so we wanted to test this.
  • Feedback: We received this feedback from patrons emailing us about their experience.

By adding an onboarding flow, we hoped many more visitors would gain insight into what the website actually does.
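
For readers who have not worked with the metric, bounce rate is simple arithmetic: the share of visits that viewed only one page. The Python sketch below is only an illustration of that definition. The function name and the session data are made up, and analytics services may count a bounce somewhat differently.

```python
def bounce_rate(pages_per_session):
    """Return the percentage of sessions that viewed exactly one page.

    `pages_per_session` is a list with one page count per visit.
    An empty list yields 0.0 rather than dividing by zero.
    """
    if not pages_per_session:
        return 0.0
    bounces = sum(1 for pages in pages_per_session if pages == 1)
    return 100.0 * bounces / len(pages_per_session)


# Hypothetical page counts for eight visits; not real Open Library data.
sessions = [1, 3, 1, 2, 5, 1, 1, 4]
rate = bounce_rate(sessions)
print(f"Bounce rate: {rate:.1f}%")
```

In this made-up sample, four of the eight visits stopped at a single page, so half of the visitors "bounced". An onboarding feature aims to turn some of those single-page visits into longer ones.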


While designing user onboarding, we wanted to create a system that was interactive, contextual, and easy to use and understand. As a result, we decided to start by adding an onboarding carousel to the homepage, the most common place patrons land when visiting the website for the first time. We designed the carousel to feature five cards: Read Free Library Books Online, Keep Track of Your Favourite Books, Try the virtual library explorer, Be an Open Librarian, and a feedback form for collecting feedback from visitors.

We decided on a carousel as the format because carousels are:

  • non-interruptive.
  • persistent, unlike other onboarding design patterns that only show up upon signup and are never seen again.
  • easy to explore.

When clicked, each card redirects patrons to a FAQs page. In an upcoming version, the “Keep Track of Your Favourite Books” card will instead trigger an onboarding modal with a step-by-step tutorial: several slides explaining how to add a book to your reading log, create a new list, and view your reading log. Each feature is explained using a short, descriptive GIF. You can close the modal at any step. Creating the modal was a long process of discussion and feedback, but we finally arrived at a simple and attractive design.

During implementation, we kept the following things in mind:

  • The icons for the home page cards should match their text.
  • Captions should be eye-catching and easy to understand.
  • Each card needs a link to redirect people to (currently the FAQs page).
  • GIFs should be contextual.
  • The modal design should keep the main focus on the GIF rather than on the modal itself, and navigation between the slides should be easy.

Design Process

To make this project successful, we had weekly meetings and discussions in the community channel to get everyone’s opinion. Designs were mocked up using Figma. I also had the chance to present my ideas before the Internet Archive’s product team. We used feedback from these meetings to review our previous decisions, our progress, and inform next steps. 


  • Alexa: The bounce rate is now reduced to 38.2%.
  • Google Analytics: More than 5000 engagements with these cards.
  • Infrastructure that we can continue building on and re-use in other situations.

Next Steps

  • Doodles to bring more character to the homepage cards
  • Include pop-up tutorials for more of the cards (other than just Reading Log + Lists)
  • Ability to hide / show the carousel (for patrons who have already received the information) 

My experience

I had a pretty good time working with my experienced mentors, Mek and Raymond Berger. They were very supportive during the entire program, and sometimes we spent our meeting time working out solutions to problems together. I learned more about project management and about clarifying a plan by breaking issues into manageable steps. I got to spend time learning new industry tools like Figma, which we used for presenting designs, and Google Analytics, which we used for tracking key metrics. I also gained a deeper understanding of user experience. I learned to design by thinking as a patron of Open Library: what would she or he want? Will it be useful and easy to understand? I appreciated the flexibility of the Open Library Fellowship program. There was no pressure on me, so I could also focus on my studies. We tried to have clear next steps and homework at the end of each of our calls. The calls helped clarify what we were hoping to accomplish and provided direction and feedback. Finally, having the community available for regular feedback was really useful for tuning our designs.

About the Open Library Fellowship Program

The Internet Archive’s Open Library Fellowship is a flexible, self-designed independent study which pairs volunteers with mentors to lead development of a high-impact feature for Open Library. Most fellowship programs last one to two months and are flexible, according to the availability of contributors. We typically choose fellows based on their exemplary and active participation, conduct, and performance working within the Open Library community. The Open Library staff typically accepts only 1 or 2 fellows at a time to ensure participants receive plenty of support and mentor time. If you’re interested in volunteering as an Open Library Fellow and receiving mentorship, you can apply using this form or email us for more information.