If you want to understand where the commercial parts of scholarly
communications may be heading, you need to look beyond policy documents,
conference panels, or public-facing strategy statements. You should look
at what large commercial actors say when speaking to investors. Earnings
calls are one of the places where that language becomes especially
revealing: less concerned with sector ideals than with growth, market
opportunity, competitive position, and what will ultimately generate
value for shareholders. For this reason, it can be worthwhile to review
earnings calls and investor presentations, as these are often overlooked
when discussing OA policy and sectoral movements.
Someone decided to compress the kill chain. Someone decided that
deliberation was latency. Someone decided to build a system that
produces 1,000 targeting decisions an hour and call them high-quality.
Someone decided to start this war. Several hundred people are sitting on
Capitol Hill, refusing to stop it. Calling it an “AI problem” gives
those decisions, and those people, a place to hide.
GUIBo is a desktop GUI for operators and developers who run Kubo (the
IPFS daemon in Go). It drives your node through Kubo’s HTTP RPC API so
you can work with pins, UnixFS content, IPNS, remote pinning, gateways,
and network or repo diagnostics without living in the terminal.
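Under the hood, those operations are ordinary HTTP POST calls to Kubo's RPC API, so anything the GUI does can also be scripted. A minimal sketch in Python (assuming a local daemon on the default API port; the endpoint paths are standard Kubo RPC routes, not GUIBo-specific):

```python
import requests

# Kubo's RPC API answers HTTP POST requests on the daemon's API port
# (default 127.0.0.1:5001); these are the kinds of calls a GUI client
# makes behind the scenes.
API = "http://127.0.0.1:5001/api/v0"

# Identify the node: peer ID, listen addresses, agent version.
info = requests.post(f"{API}/id").json()
print(info["ID"], info["AgentVersion"])

# List the recursive pins the node currently holds.
pins = requests.post(f"{API}/pin/ls", params={"type": "recursive"}).json()
for cid in pins.get("Keys", {}):
    print("pinned:", cid)

# Basic repo diagnostics: size and object count of the local repository.
stat = requests.post(f"{API}/repo/stat").json()
print(stat["RepoSize"], stat["NumObjects"])
```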
At The Human Line, we are committed to ensuring that AI technologies,
like chatbots, are developed and deployed with the human element at
their core. LLMs are powerful tools, and with ethical design, users can
gain new skills and knowledge while remaining emotionally intact.
Tech-related delusions, whether they involve train travel, radio
transmitters or 5G masts, have been around for centuries, Morrin says.
“What’s different is that we’re now arguably entering an age in which
people aren’t having delusions about technology, but having delusions
with technology. What’s new is this co-construction, where technology is
an active participant. AI chatbots can co-create these delusional
beliefs.”
The seemingly unassailable hegemony of the contemporary internet means
too few people know that shortwave radio has never gone away and that in
many ways it’s more durable, more secure, and more widely accessible
than other contemporary forms of wireless communication such as cell or
wifi.
“[T]he first thing is that computers cannot think, that is an invented
concept. And rather than computers being able to think, we’ve reinvented
thinking to be something computers can do. And when we do that, all
manner of power consolidation, wealth consolidation, technological
monopolies happen and we are looking at this fantasy enemy instead of
the real political work and community work to be done.”
The Richmond Folk Festival is one of Virginia’s largest events, drawing
visitors from all over the country to downtown Richmond’s historic
riverfront. The Festival is a FREE three-day event that got its start as
the National Council for the Traditional Arts’ National Folk Festival,
held in Richmond from 2005-2007. The Richmond Folk Festival features
performing groups representing a diverse array of cultural traditions on
six stages.
Featuring Peter Linebaugh on the long histories of commons and
commoning, connections between enclosures in Europe and imperial
conquest abroad, and writing history from below.
If output is your only metric, then the steam engine really is just a
better bicycle. Both get you from A to B. One gets you there faster with
less effort. Case closed. The fact that you arrive having done nothing,
learned nothing, built nothing—that’s not a bug, that’s the point.
Effort is a cost to be minimized, not a value to be preserved.7
But embedded in that worldview is the assumption that the journey is merely
instrumental. The only thing that matters is arrival. That it doesn't
matter if you travel or are traveled. The Inuit elders seem to operate
on a different premise. Arrival, of course, mattered. These were hunters
who needed to find caribou and get home alive. But only through the
journey could you acquire deep knowledge of the terrain. You couldn’t
separate arriving at the destination from what you learned on the way
there.
What teams collaborate on during review is changing. Less time spent on
style nits and mechanical correctness, more time on intent,
architecture, and whether a change moves the product in the right
direction. That’s a good shift. And the collaborative act itself –
multiple humans exercising judgment together, developing shared taste,
building mutual understanding of where the system is heading – that’s
not a bottleneck to eliminate. It’s something to uplevel.
Wikipedia and similar DPGs cannot sustain themselves on a fragile mix of
donations, sporadic philanthropy, and ad-hoc corporate generosity.
What’s needed is a multi-stakeholder settlement in which large-scale
users of the commons take on long-term, structured obligations to
sustain it: contractual funding through paid APIs and usage-based
levies, formal recognition of DPGs as Digital Public Infrastructure to
unlock multilateral co-financing, and a shift in philanthropy from
one-off project grants to sustained core support for the institutions
that maintain the commons.
We intentionally chose this constraint so we would build what was
necessary to increase engineering velocity by orders of magnitude. We
had weeks to ship what ended up being a million lines of code. To do
that, we needed to understand what changes when a software engineering
team’s primary job is no longer to write code, but to design
environments, specify intent, and build feedback loops that allow Codex
agents to do reliable work.
This post is about what we learned by building a brand new product with
a team of agents—what broke, what compounded, and how to maximize our
one truly scarce resource: human time and attention.
It was very interesting to read OpenAI’s recent write-up on “Harness
engineering” which describes how a team used “no manually typed code at
all” as a forcing function to build a harness for maintaining a large
application with AI agents. After 5 months, they’ve built a real product
that’s now over 1 million lines of code.
The article is titled “Harness engineering: leveraging Codex in an
agent-first world”, but only mentions “harness” once in the text. Maybe
the term was an afterthought inspired by Mitchell Hashimoto’s recent
blog post. Either way, I like “harness” as a word to describe the
tooling and practices we can use to keep AI agents in check.
We are at a turning point in AI. For years, we focused only on the
model. We asked how smart/good the model was. We checked leaderboards
and benchmarks to see if Model A beats Model B.
The difference between top-tier models on static leaderboards is
shrinking. But this could be an illusion. The gap between models becomes
clear the longer and more complex a task gets. It comes down to
durability: How well a model follows instructions while executing
hundreds of tool calls over time. A 1% difference on a leaderboard cannot reveal whether a model will drift off-track after fifty steps.
We need a new way to show capabilities, performance and improvements. We need systems that prove models can execute multi-day workstreams reliably. One answer to this is the agent harness.
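To make the term concrete, here is a deliberately minimal, hypothetical sketch of the loop a harness wraps around a model: keep durable state, verify each step with an external check rather than the model's own report, and feed failures back instead of letting the run drift. The function and field names are illustrative, not any particular vendor's API.

```python
# A minimal, hypothetical harness loop. `call_model`, `run_tool`, and
# `verify` stand in for whatever model API, tool runner, and external
# checks (tests, linters, task-specific validators) a real harness uses.
def run_workstream(task, call_model, run_tool, verify, max_steps=500):
    history = [{"role": "user", "content": task}]   # durable transcript/state
    for _ in range(max_steps):
        action = call_model(history)                # model proposes the next step
        if action["type"] == "finish":
            return action["result"]
        observation = run_tool(action)              # execute the proposed tool call
        ok, feedback = verify(observation)          # external check, not self-report
        history.append({"role": "tool", "content": observation,
                        "ok": ok, "feedback": feedback})
        if not ok:
            # Feed the failure back rather than letting the agent drift;
            # a real harness might also roll back state or escalate to a human.
            history.append({"role": "user",
                            "content": f"Check failed: {feedback}"})
    raise RuntimeError("step budget exhausted without passing verification")
```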
When the server goes dark, we go dark, too. We’ve built an entire
civilisation on an unthinkably brutal and comically unreliable stack
while hallucinating it as literally anything else. We condemn AI today
for making shit up, but what about us? We’re building on a fantasy just
as brittle, we are just as demonstrably wrong. Yet we pretend a file
isn’t just a gesture that can disappear in an instant. We hallucinate
that the server is somehow both fleeting and forever.
GitHub is slowly becoming a very dangerous website as more and more
threat actors are starting to use it to host and distribute malware
disguised as legitimate software repositories.
What started as an infrequent sighting in early 2024 is now at the
center of an increasing number of infosec and malware reports.
The tactic is usually the same. A threat actor would take a legitimate
repository, add malware to the files—typically an infostealer or a
remote access trojan—and then upload the booby-trapped repo back to
GitHub.
A unique and deeply moving piece of biographical filmmaking, the short
documentary Echo provides a window into the life of an older man named
Allister Hadden living in Northern Ireland. The film drifts between past
and present, with a rich, textured, shot-on-film aesthetic tethering
together Hadden’s archival recordings and newly shot footage from the
Belfast-based filmmaker Ross McClean.
Around twelve years ago, Google figured out the fundamental problem facing Tesla's Fake Self Driving. Almost nine years ago in Robot Cars Can’t Count on Us in an Emergency, John Markoff wrote:
Three years ago, Google’s self-driving car project abruptly shifted from designing a vehicle that would drive autonomously most of the time while occasionally requiring human oversight, to a slow-speed robot without a brake pedal, accelerator or steering wheel. In other words, human driving was no longer permitted.
The company made the decision after giving self-driving cars to Google employees for their work commutes and recording what the passengers did while the autonomous system did the driving. In-car cameras recorded employees climbing into the back seat, climbing out of an open car window, and even smooching while the car was in motion, according to two former Google engineers.
Google binned its self-driving cars' "take over now, human!" feature because test drivers kept dozing off behind the wheel instead of watching the road, according to reports.
"What we found was pretty scary," Google Waymo's boss John Krafcik told Reuters reporters during a recent media tour of a Waymo testing facility. "It's hard to take over because they have lost contextual awareness."
Follow me below the fold for a wonderful example of Tesla's handoff problem, and a discussion of the difference between Tesla's and Waymo's approaches to self-driving.
I wrote about this handoff problem in 2017's Techno-hype part 1. I did a thought experiment, imagining mass-market cars 3 times better than Waymo's at the time:
A normal person would encounter a hand-off once in 15,000 miles of driving, or less than once a year. Driving would be something they'd be asked to do maybe 50 times in their life.
Even if, when the hand-off happened, the human ... had full "situational awareness", they would be faced with a situation too complex for the car's software. How likely is it that they would have the skills needed to cope, when the last time they did any driving was over a year ago, and on average they've only driven 25 times in their life? Current testing of self-driving cars hands-off to drivers with more than a decade of driving experience, well over 100,000 miles of it. It bears no relationship to the hand-off problem with a mass deployment of self-driving technology.
I concluded:
But the real difficulty is this. The closer the technology gets to Level 5, the worse the hand-off problem gets, because the human has less experience. Incremental progress in deployments doesn't make this problem go away.
[I] used to run the self-driving-car division at Uber, trying to build a future in which technology protects us from accidents. I had thought about edge cases, failure modes, the brittleness hiding behind smooth performance. My team trained human drivers on when and how to intervene if a self-driving car made a mistake. In the two years I ran the division, we had no injuries in our early pilot programs.
As an enthusiast for self-driving technology, Krikorian used it:
With my own Tesla, I started out using Full Self-Driving as the default setting only on highways. That’s where it makes sense: You have clear lane markers and predictable traffic patterns. Then, one day, I tried it on a local road, and it worked well enough to become a habit.
My memory is hazy, and some of it comes from one of my sons, who watched the whole thing unfold from the back seat. The car was making a turn. Something felt off—the steering wheel jerked one way, then the other, and the car decelerated in a way I didn’t expect. I turned the wheel to take over. I don’t know exactly what the system was doing, or why. I only know that somewhere in those seconds, we ended up colliding with a wall.
He didn't have "situational awareness", even though he was an experienced driver aware of the handoff problem. He sums up the current problem, with drivers like him:
Full Self-Driving works almost all of the time—Tesla’s fleet of cars with the technology logs millions of miles between serious incidents, by the company’s count. And that’s the problem: We are asking humans to supervise systems designed to make supervision feel pointless. A machine that constantly fails keeps you sharp. A machine that works perfectly needs no oversight. But a machine that works almost perfectly? That’s where the danger lies. After a few hours of flawless performance, research shows, drivers are prone to start overtrusting self-driving systems. After a month of using adaptive cruise control, drivers were more than six times as likely to look at their phone, according to one study from the Insurance Institute for Highway Safety.
Imagine this problem compounded by handing off to a driver who hadn't driven in a year.
Google was building Level 4 robotaxis. Their conservative approach was to eliminate the handoff problem completely. Waymos operate on carefully mapped routes after much practice, and are equipped with a diverse set of sensors. Just as airliners have a designated diversion airport everywhere along their flight path, Waymos know a safe place to stop and ask for help from remote humans. The remote humans don't drive the car, they just advise it as to how to solve the problem. This can, as I have seen a couple of times, cause frustration among other road users, but it is safe.
Tesla, on the other hand, had a Level 2 driver-assist system with a limited set of sensors, which depended on handing off to the driver in case of confusion. They consistently marketed it as "Full Self-Driving" with exaggerated claims about its capabilities, and sold it to normal, untrained drivers. They could not, and could not afford to, implement Google's approach. Why not?
Scale: Tesla has 1.1M FSD customers, whereas six months ago Waymo had about 2K cars in service. To support them, Waymo has about 70 remote operators on duty. Of course, FSD is used much less intensively; let's guess only 5% as much. Even if, optimistically, Tesla's technology generated as few remote requests as Waymo's, the arithmetic is daunting: 1.1M vehicles is roughly 550 times Waymo's fleet, and 550 × 70 operators × 5% ≈ 1,900, so Tesla would need almost 2,000 remote operators on duty.
Technical: First, Tesla markets FSD as usable anywhere, even if their terms of service disagree. So they lack the detailed maps Waymos use when they need to find a safe place. Second, Tesla has far fewer sensors, so has much less information on which to base the need for and choice of a safe place.
Marketing: There are two problems. First, telling the public that FSD will sometimes need to stop and ask for help goes against the idea that it is "Full Self Driving". Second, everyone can see that a Waymo is driving itself and can set their expectations to match. No one can tell whether a Tesla is using Fake Self Driving. So if Teslas were stopping unexpectedly, even when not using Fake Self Driving, the assumption would be that the technology had failed.
Because Tesla has always depended upon handing off to the human, the result is that Tesla's minimal robotaxi service with "safety monitors" in Austin, TX crashes six times as often as human-driven taxis.
The Disintegration Loops: Generational Loss in Web Archives
Michael L. Nelson
As part of the Internet Archive's Information Stewardship Forum (March 18–20, 2026), I decided to use my five-minute lightning talk to raise the issue of generational loss in web archives. Or more directly, making copies of copies (...of copies…) – something that web archives currently do not do well. My title is based on William Basinski's four-volume release "The Disintegration Loops", in which he played the audio tapes of "found sounds", recorded decades earlier, in loops, with the whole process lasting over an hour. The effect is hauntingly beautiful, with each loop slightly degrading the magnetic tape, resulting in a generational loss. The degradation of each loop is right on the edge of the just-noticeable difference, until the entire track is reduced to just a shadow of its former self.
I first discussed this topic in my 2019 CNI closing keynote (slide 88), where I introduced the inability of web archives to archive other web archives as part of the larger issue of web archive interoperability. Let's begin by walking through the example of archiving a tweet (which we already know to be challenging!). The original tweet is still on the live web, even though the UI has undergone many revisions since it was originally tweeted in 2018.
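For readers unfamiliar with the plumbing: an archive that speaks the Memento protocol (RFC 7089) reports a capture's datetime in a Memento-Datetime response header and links back to the original URI. A small illustrative sketch in Python, using a placeholder Wayback Machine URL rather than the actual capture of the tweet:

```python
import requests

# A Wayback Machine capture URL embeds the 14-digit capture timestamp;
# the response also carries Memento protocol headers (RFC 7089).
# The URL below is a placeholder -- substitute any real capture.
memento_url = "https://web.archive.org/web/20180501125952/https://example.com/"

resp = requests.get(memento_url, allow_redirects=True)

# The archive's own record of when this capture was made.
print(resp.headers.get("Memento-Datetime"))  # e.g. "Tue, 01 May 2018 12:59:52 GMT"

# The Link header relates this memento to the original resource and to
# the archive's TimeMap of all captures of that URI.
print(resp.headers.get("Link"))
```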
Note that archive.today is aware that the page comes from the Wayback Machine but the original host is twitter.com, and it maintains both the original Memento-Datetime (20180501125952) as well as its own Memento-Datetime (20190407023141). I then archived archive.today's memento to perma.cc in 2019 (screen shot from 2019):
Although the loss occurs in discrete chunks, it is reminiscent of Basinski's Disintegration Loops, with information lost at each step, and the final version being a mere shadow of the original. In 2019, this was not universally recognized as a problem, since archiving the playback interface of other web archives was not considered a problem in itself. The "right" solution, of course, is to share the WARC files (or WACZ, or HAR, or…) out-of-band and let the other web archives replay from the same source files. But this is rarely possible: for a variety of reasons web archives typically do not share the original WARC files, and in the case of archive.today, might not even store the original source files (and instead, likely only store the radically transformed pages).
More importantly, it is sometimes useful to archive a particular web archive's replay of a page, which itself must be archived, because it changes through time. For example, memento #3 (the perma.cc memento of archive.today's memento) is now different; this is a screen shot from 2026:
Surely the source files themselves have not changed, and the difference is due to improvements in pywb, which is under constant development. perma.cc's replay of the 2019 page in 2019 is different from the replay from 2026, which implies that it could be different still in the future. But we cannot currently archive perma.cc's replay of that page to, say, the Wayback Machine without generational loss. The fact that screen shots – which are rife with their own potential for abuse (cf. HT 2025, arXiv 2022) – are the only mechanism to document these replay differences underscores the web archive interoperability problem.
I chose the topic of generational loss for my slot at the Information Stewardship Forum because recent events have introduced a new use case for archiving the replay of web archives. Wikipedia recently announced it was blacklisting archive.today because its editors discovered that the webmaster at archive.today was using its captcha to direct a DDoS attack against a blog owned by someone the webmaster had a dispute with (the blogger had posted a lengthy investigation of the webmaster's identity), and, more disturbingly for our discussion, had edited the content of an archived page to include the name of the blogger where it would not otherwise be. The Wikipedia discussion page is hard to follow, in part because the editors are discussing how to archive the replay of an archived page. For one example, they show how the archive.today replay now has been changed back to have "Comment as: Nora ████" (middle of the image):
But the replay alteration from archive.today in question is archived at megalodon.jp to show that the name "Nora ████" was replaced with the name of the blogger that had earned the webmaster's ire, "Jani Patokallio". And yes, megalodon.jp's replay of archive.today's memento is that bad (at least in my browser, it is shrunk down impossibly small), so I used the dev tools to find the string in question.
Another Wikipedian archived (using yet another archive, ghostarchive.org) a google.com SERP to show that archive.today has reverted from "Jani Patokallio" back to "Nora ████".
What does changing "Nora" to "Jani" (and then changing it back again) accomplish? I'm not sure; this appears to be just a petty response to an ongoing dispute. But the implication is profound: this is the first known example of a major web archive purposefully and maliciously altering its contents, something that we knew was possible but had not yet experienced.
We have long known that replay can change through time (cf. PLOS One 2023) due to the replay engine (the Wayback Machine, Open Wayback, pywb, etc.) evolving, but these changes were engineering results and the replay mostly improved over time. But now we have seen web archives maliciously alter (and then revert) the replay, and we need a more standard and interoperable way to archive archival replay. Not just to prove that a web archive did alter its replay, but also to prove that an archive did not alter its replay. Out-of-band sharing of WARC files is the gold standard, but for a variety of reasons this is unlikely to happen. We must be able to use web archives to verify and validate web archives. We explored a heavyweight design for this a few years ago (JCDL 2019), but it should be revisited in light of developments like WACZ.
In Brief: This article explores how minimal computing principles guided the parallel web development of two related but distinct publishing platforms, DigitalArc and Opaque Publisher. DigitalArc, a community-driven digital archive and exhibit platform, was developed in response to principles governing post-custodial archiving, taking it one step further to ensure communities maintain ownership of their materials and their digital artifacts. The Opaque Publisher, originally developed in support of a born-digital dissertation, adapts DigitalArc to support refusal theory for scholars who have to negotiate the tensions between using unethically obtained evidence in support of their research and their moral objections to a lack of informed consent. At first glance, the use cases for each platform seem different, but both provide mechanisms for individuals-by-proxy and communities to assert control over how their respective stories are shared.
This article details the conversations, dependencies and contingencies that developed as our team simultaneously built two related, but distinct academic publishing platforms and considered the theoretical motivations for having done so. The first of these, DigitalArc (DA), was designed to support the creation of low-cost sustainable digital exhibits and archives built by and for communities who want to control how their histories are presented online. DA was designed as a community-driven digital archive and exhibit platform, ensuring communities maintain ownership of their materials and their digital artifacts.1 The second, the Opaque Publisher (OP), used DigitalArc as a foundation for a digital-exhibit and digital-publication platform that supports scholars who want to redact or remove information that was obtained from medical patients without their informed consent. The modeling and development of both platforms were guided by complementary frameworks that shaped our decision to use DigitalArc as the technical foundation for the Opaque Publisher (Zenzaro, 2024; Ciula et al., 2018).
In “The Digital Opaque: Refusing the Biomedical Object” (Purcell, Craig & Dalmau, 2025), we outlined our adoption of refusal theory in the rejection of unquestioned institutional norms around the use and display of unethically obtained medical specimens. This theoretical framework was operationalized in the OP as an author-audience interaction that allows authors to identify sensitive information in both the text of, and images included in, an academic publication. Readers are then given the ability to control how and whether that sensitive information is redacted fully or partially, or displayed openly, with the default view set to “partially opaque,” serving as a compromise between fully redacted and fully open.
Here, we outline a history of technical interventions that supported ethical creation and interpretation of sensitive content through the iterative implementation of two publishing platforms. Core to both platforms were ethical-research considerations and public-communication audiences, which in turn drove the adoption of minimal computing approaches. We hope that, by focusing on the audience needs we identified for the two projects, the existing models we assessed for the OP’s parent framework, and some of the serendipitous contingencies that shaped the minimal-computing development of both platforms, we can offer some lessons for other digital-humanities development teams seeking to operationalize their theoretical frameworks in the form of technical choices.
DigitalArc: A Community Digital Archive Platform
In Fall of 2018, our team began to assess options for a spring 2019 course centered around the creation of a public-history archive as the main classroom activity. As the instructor, Craig initially asked for consulting advice about potential archiving platforms from members of the team at the Institute for Digital Arts and Humanities, and from Dalmau and members of her digital-libraries team. Using their advice, along with lessons from the obstacles that arose as our own institution supported digital projects, Craig began to assess the potential for Github Pages as a publishing platform.
Our audience for the course was twofold: first-year undergraduates with little to no research experience in history or technical experience in digital humanities, and the public audiences who would be engaging with the digital collection and historical essays those first-year students would develop as a part of their class. Models for this sort of endeavor existed in spades, most of which focused on the simplicity of content creation for content creators with minimal technical skill. Content management systems (CMS) allow these users to interact with a graphical user interface (GUI) and engage in button-pushing and form-filling behaviors that build multi-media experiences palatable to public audiences, with integrated display for photos, videos, audio, and text (Russell & Hensley, 2017). From Google Sites and WordPress in the corporate freemium sphere to Drupal and Omeka, open-source platforms commonly used in academic contexts, many of these content management systems modeled the use of a programming language (often PHP) supported by a back-end database (often MySQL) that served on-demand pages built "on the fly" as a reader requested each page on the web site. Our institutional support was rich for Omeka in particular, and we appreciated Omeka's focus on non-profit academic public engagement. However, acquiring, critiquing, and applying digital literacies are key outcomes for the course, and we were able to hone these literacies, with the built-in support structure offered by the class, by exploring the more transparent code base offered by static sites (Wikle, Williamson and Becker, 2020).
As with many technical projects, however, serendipity wrinkled the fabric of our plan: that same semester, university IT rolled out a required upgrade to PHP on the servers available for hosting, which in turn prompted a systemwide Omeka upgrade. This IT-driven upgrade represented, on the one hand, a very well-provisioned IT environment that could support database-driven CMSs for many sites, and on the other, a very clear division between the institution's responsibility for building the IT environment and researchers' responsibility for creating and maintaining sites. Dozens of sites needed upgrades in order to remain accessible for public view, and in many cases, the creators of those sites were not equipped to handle such upgrades. That semester, Omeka served as both a model for what worked exceptionally well for novice creators in the site-building phase, and as a warning for the errors our students, and our public audiences, might expect to see in the site's long-term post-project maintenance.
The experience pushed us away from big tech into the realm of "minimal computing," an approach that responds to the tension between the needs of a community of practitioners–which can include individual partners with institutional affiliation–and the often-limited resources available to them. Focusing on this tension between need and resource availability forces us to consider how and why we are using these technologies in the first place. Roopika Risam and Alex Gil anchor the minimal-computing movement's motivation in "a very real fear" of big tech's ideologies of fast growth at all costs, disruption over stability, and expense over access. Such ideologies continue to exclude communities whose "voices and stories…have been elided" in the cultural record, this time in a digital space instead of in physical collections (Gil and Risam, 2022). By contrast, minimal-computing best practices offered a framework that helped us evaluate these early-stage classroom priorities by asking what we had, what we needed, and what we wanted to prioritize. We had a team capable of managing almost any technical environment. We needed to reduce or eradicate long-term institutional dependencies and create an archive without the longer-term sustainability concerns that Omeka and WordPress presented in our immediate institutional context. We wanted to prioritize short-term labor and development over a need for students or us to handle the long-term maintenance that Omeka and WordPress would require. As we further developed DigitalArc in the years that followed, this tension between resource limitation and need led us to consider moving much of the technology's maintenance complexity and labor onto our institutional team, through development and documentation, in order to shift the expense of technology away from anyone who might be interested in using our platform later on.
Minimal computing's emphasis on smaller-scale projects, with initial labor investment by technical experts that results in lower barriers to long-term technical maintenance and much lower cost, is also informed by resistance to a "maximal" digital humanities, which leans on a combination of well-provisioned institutional support and the structural exigencies that require researchers to respond quickly to a limited set of choices when that institutional support changes. When these maximal computing tendencies are transferred from well-resourced IT environments and institutional support for long-term maintenance into other settings, the changed institutional pressures in turn create institution-specific site-creation and sustainability concerns that vary greatly from context to context (Miya & Rockwell, 2025).
The contingencies we faced, even in an IT-rich environment, helped guide us as we considered how to de-institutionalize both the minimal-computing and maximal-computing platforms to which we had access. We anticipated that implementers of a minimal-computing platform would need methodical yet easy-to-step-through documentation, to scaffold what they might initially see as a less "user-friendly" interface. In this case, to achieve a minimal codebase in support of ongoing sustainability, we rely on substantive documentation. Herein lies one of several counter-intuitive aspects of minimal computing. Despite these contradictions, our choices were intended to allow more agency for communities and scholars, giving both our developer team and our audiences a better handle on both the short-term and long-term "considerations of the costs, limits, or wisdom of scale" (Walsh 2024).
In Spring of 2019, our classroom began using the first version of a minimal-computing digital-exhibit template that would become DigitalArc many years later. The feature set included in this student-built version of the platform was partially inspired by Omeka's academic focus on discovery and presentation based on well-formed metadata and its ability to support meaningful interactions with multimedia objects. We also took cues from the many CMSs that developed on WordPress's model of simple authoring for novice creators, and from colleagues in digital humanities whose research on minimal computing suggested that the back-end design and documentation require a heavier lift up front but allow easier ongoing, longer-term management over time (Wingo & Anderson, 2025). We also took steps to narrate the parallels between CMSs like WordPress or Weebly and the features that Github's GUI editing pages offered, as a bridge to help build student confidence that Github Pages could come quite close to the ease of point-and-click editing with minimal training time for them.
The tech stack that supported this initial minimal-computing approach was centered around Jekyll, a static-site generator that takes a different approach to the design and publishing of sites than Omeka and WordPress’ dynamic pages (on-request PHP-and-database generated pages).2 In this model, pages are built when creators make edits, rather than being built when a public viewer requests the page; if something went wrong with an edit, or with part of the tech stack, the implementers would have a greater chance of noticing and fixing the problem before the viewers would encounter an interruption to the site. As with Omeka and WordPress, Jekyll lets us design and implement headers, footers, and page templates that would apply to any of the content generated by students. As with other CMSs, we set up customization of fonts, colors, navigation elements, and other basic design. We later added support for custom metadata and navigation labels, which was intended to support both multilingual audiences and communities’ preferred vocabulary.
Our one compromise with the world of “maximal” computing, which we will address more fully below, was to host our Jekyll site on Github. From a teaching perspective, Github’s free user-account option and focus on collaborative editing made it easier for students to collaboratively access Github during class. From a site-editing perspective, Github’s Pages feature automatically added the template features we built to any of the simpler “markdown” files. Creators authored these markdown files, which focus on representation of the digital objects in the collection, including descriptions and corresponding images, text or time-based media files (see Fig 1). Markdown serves as the vehicle for encapsulating curated information (metadata) described through basic text formatting, which reduces technical barriers associated with scripting languages and database implementations.
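To give a sense of how little syntax this asks of a contributor, a collection item in such a Jekyll site might look like the following; the layout name and metadata fields here are illustrative, not DigitalArc's actual schema.

```markdown
---
layout: item
title: "Grandmother's quilt, ca. 1940"
creator: "Contributed by the Miller family"
date: "circa 1940"
image: /assets/images/miller-quilt.jpg
tags: ["textiles", "family history"]
---

A hand-pieced quilt photographed during a History Harvest session.
The pattern was passed down through three generations.
```

The YAML block at the top carries the curated metadata; the body below it is the plain-text description, and the site's templates add headers, navigation, and styling when the page is built.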
For students, connecting these easy-to-teach text-only editing processes meant they could use Github's online GUI-based editor. This experience highlighted the division of labor in minimal computing that places additional up-front burdens on our developers. It was our responsibility to: understand that Github Pages' existing GUI interface was viable as a point-and-click user interface for file editing; create as user-friendly an environment as possible within the context of Github Pages; explain the affordances of Github Pages as asking for some additional difficulty up front in exchange for a much longer-scale ease of maintenance and use that allowed for better trade-offs; and provide documentation of that environment that makes it more easily adaptable by novices.
While our first development effort in Fall of 2019 was aimed at the infrastructure for a one-time classroom engagement, it would, in true minimal-computing fashion, come to include "ethical concerns that influence our practice" (Risam, 2025). These ethical considerations added to the benefits that we found in choosing custom development in Github Pages over the similarly time-consuming customization and long-term maintenance we would have had to budget in order to use an existing digital exhibition and publishing platform that had institutional support. We quickly realized that this minimal-computing model had several additional affordances that we could use for other projects. First, as we built the student exhibit, we realized it was an easier model for novices to adapt free of charge for multiple projects, as compared to the freemium option that is more common for platforms like WordPress or Weebly, in which a single user is limited to a single free site. Building and applying new single-page templates in Jekyll was easier with limited design and programming skill, both for our team's own development work and for future potential community members learning to customize and launch their own web sites. Second, the collaborative, free, non-academic context of Github had promise for audiences whose experiences with universities and other large institutions were less than positive (Sutton & Craig, 2022).
The specifics of our rollout in the classroom and those that followed, however, presaged a persistent concern that Quinn Dombrowski addresses in “Minimizing Computing Maximizes Labor”: “going ‘minimal’ requires a great deal of technical labor” (2022). Students needed additional support to transition from a fully GUI-based editing interface to the combination GUI/text-based editing system that GitHub Pages and Jekyll require. Coming face-to-face with this early on helped us establish teaching and documentation principles that addressed the longer-term implications of a minimal-computing platform-development agenda. This trade-off emphasized, for us, the importance of creating scaffolded documentation for DA implementation that allows for a more flexible, accessible user experience and easier maintenance for web site content creators and managers (see Figure 2).3
With both concerns and affordances in mind, we began to use these minimal-computing approaches in other settings, including 3 community-facing History Harvest projects that took place over the 4 years that followed the student-focused digital-archive classroom experience.4
Opaque Publisher: A Scholarly Publishing Platform
It's here that we time-skip forward to 2023, and the development of the Opaque Publisher. By then, our team had experience building DA into a templated platform that drew on existing models of community archiving and had fully integrated ideas of minimal-computing labor division into our workflow. Prototyping through an ACLS-funded grant5 helped us build and test a reasonably featured minimal-computing Jekyll template that served several community-archive projects and proved its adaptability to other web-publishing needs.6
The affordances we identified during the initial iterations of what we now know as DigitalArc also played a role in setting up DA to become a useful foundation for the OP. Our experience with the adaptability of minimal-computing Jekyll sites was crucial for the timely development of the OP. Jekyll's development process meant that our OP team members with basic HTML skills and a willingness to experiment could see DA in its fully articulated form and use that as an easily portable example to customize a new platform. That changed the up-front labor required of the more skilled members of our development team, allowing us to divide the labor more easily and to repurpose code across our different Jekyll projects, which lowered the learning curve for team members who were still acquiring that customization skillset. We appreciated this method because Jekyll offered an environment that not only scaffolded our team as they experimented with their newly built site in increasingly complex ways, but encouraged them to do so, because each small success helped them see themselves as capable of technical tasks.
The second of these affordances–our focus on moving DA outside of its original institutional context–offered an anchor for the OP's goal of refusal: an intentional rejection of institutional context and institutional harm. Those experiences provided a foundation for operationalizing the refusal theory that Sean Purcell's then-dissertation project brought to our attention.7 At minimum, his functional requirements included a digital exhibition platform that mirrored the formatting requirements associated with academic publishing (citations, tables of contents, and indices). These elements were not included in DA's development, owing to a difference in intended audience and intended output. In addition to this, the project's interactive approach to refusal required templating for the platform's interactive elements, which could be added to prepared markdown files by a user familiar with basic hypertext markup language (HTML).8 One of the advantages of working on these two projects in parallel was an opportunity to develop resources for future Jekyll templates that attend to the overlapping but distinct needs shared by academics, archivists, and communities.
While DigitalArc offered an easy starting point for a team already familiar with Github Pages and the specifics of the DigitalArc template itself, there was no shortage of CMS options that were designed with some academic apparatus in mind. We re-evaluated our minimal-computing starting point–what do we have, what do we need, and what are our priorities–as we did due diligence. Omeka's third-party development community included a footnote plugin, and Omeka's base install had a built-in table-of-contents generator. Scalar diverged from the exhibit model to offer a combination of non-linear and table-of-contents-based reading processes, but customizing Scalar required a higher learning curve for our team. Moreover, Scalar's computational overhead and its dependence on PHP complicated the process of long-term digital preservation. Mukurtu's focus on the ethics of digital exhibits was a good fit for the refusal theory that drove Purcell's dissertation, but its dependence on Drupal 7 triggered worries for us about the long-term sustainability of the platform. In hindsight, platform worries beyond our initial reluctance to use Omeka were well-founded; Drupal 7 support ended in January 2025, leaving Mukurtu in limbo, and Scalar experienced a maintenance outage in August of 2025.9 As with DigitalArc, we wanted to engineer around potential vulnerabilities and gaps in site availability, and the friction between open, non-profit archival platforms and dependency on a constantly updating database-driven codebase meant moving away from these easily accessible platforms.
Despite the surfeit of academic CMS models, flexible redaction models were harder to find. Print redaction, like that done in Adobe Acrobat or other print-document generators, assumes permanent strike-throughs or blackouts. Again, Mukurtu offered inspiration for changing levels of visibility based on community-oriented traditions and ethics, but those levels are controlled by site creators rather than readers or audience members. Ultimately, DigitalArc offered the longer-term maintenance that we prioritized, the non-institutional platform that aligned with our ethical goals of institutional refusal, and the fastest customization path for a series of interface options that reified authorial choices about which text and image sections were sensitive but allowed the redaction-level display of those sections to be reader-controlled (Fig. 3). In choosing to go further down the minimal-computing path that we started with DigitalArc, we provide readers with a way to engage with the tensions and ethical questions that we posed in the companion article: "as scholars we have to show our work, and this practice of showing is often at the expense of those whose lives and deaths are entangled in our research programs." (Purcell, Craig & Dalmau, 2025)
Figure 3. Image description: An example of how refusal informed the opacity functions in the OP in which users are able to toggle how they view the images and text based on whether the subjects depicted in primary materials consented to the research. This example was drawn from Purcell, Sean “Teaching Hygiene” in The Tuberculosis Specimen. (2025). https://tuberculosisspecimen.github.io/diss/dissertation/1_3_4
Our choices were made with an audience of scholar-authors looking for simple technical solutions in mind. That focus pushed us away from the integration of more complex programming and toward a mostly-CSS solution to implement the interactive opacity filters. For text, the opacity filter activates unique span classes in the textual narrative that have been flagged during composition.10 The interpolation of image and text was done mostly in markdown, using Scrivener, with a few text-string replacements that allowed Purcell to easily insert the necessary HTML to apply the Javascript redaction, a process which we describe more fully in "The Digital Opaque." The actual content of the images, however, was much more complicated, as every image that needed to be made opaque had to be edited three times: first, to crop and format for web; second, to edit and remove the first level of opacity for the 'partial opacity' version of the site; and third, to remove more of the image for the 'opaque' version of the site (fig. 4). When the site loads for a user, all three versions of these images are loaded at the same time, but only one is visible for the user at any time.
Figure 4. Image description: The three versions of each image corresponded with the opacity guidelines of the site, incrementally removing elements of the bodies of patients based on the project's predefined protocols. From left to right, a white woman drinking from a glass while staring at the camera, in the next image her eyes are obscured, in the final image her whole face is obscured. Three unique versions of every image had to be made. Lockard, Lorenzo B. Tuberculosis of the Nose and Throat. St. Louis: C. V. Mosby Medical Book & Publishing Co., 1909.
We tested a few versions of the opacity functionality during the platform's development. The first was Javascript-heavy: React.js (https://react.dev/), one of the most common Javascript libraries in use for web site development at the time of OP's development, offers a broad platform for building interactive, user-controlled opacity for both images and the text drawn from the opaqued parts of those images. Two things led us to an alternative path. React's requirement for local compilation, coupled with its sometimes unpredictable attention to backward compatibility owing to its emphasis on the constantly changing world of mobile-app development, had the potential to create sustainability issues. That, along with its origins in a very profit-driven corporate Facebook setting, led us to emphasize CSS control rather than Javascript control of the opacity features. We instead adapted basic show/hide options that were already built into the non-profit Zurb Foundation 6 library (https://get.foundation/). This CSS library's user-contribution-oriented development process aligns with our community-oriented goals; Zurb's smaller contributor base leads to a slower code-update rate, making it more suitable for a project with few developers available to update code in response to new library releases; and the smaller library base also meant we could load a local copy of the library, frozen at a particular release date. These choices, in turn, provide a more predictable user experience in the preserved versions of the site that are hosted not at Github but in other disciplinary and institutional repositories and in the Internet Archive. By keeping the site architecture simple, the published sites are more accessible to end-users and to web-archiving tools that struggle to replicate more complex interaction.
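The mechanism can be sketched in a few lines of HTML and CSS; the class and attribute names below are illustrative rather than the OP's actual markup. The flagged spans and all three image variants are present in the page, and a reader-controlled mode attribute (flipped by the show/hide controls) decides what is visible.

```html
<!-- Illustrative sketch only: names do not reflect the OP's real markup. -->
<style>
  /* Flagged passages are dimmed or hidden depending on the reader's mode. */
  [data-mode="open"]    .redacted-text { opacity: 1; }
  [data-mode="partial"] .redacted-text { opacity: 0.15; }
  [data-mode="opaque"]  .redacted-text { visibility: hidden; }

  /* All three pre-edited image variants are in the page; CSS shows one. */
  .variant { display: none; }
  [data-mode="open"]    .variant.open,
  [data-mode="partial"] .variant.partial,
  [data-mode="opaque"]  .variant.opaque { display: block; }
</style>

<article data-mode="partial">  <!-- default view: partially opaque -->
  <p>The patient, <span class="redacted-text">a named individual</span>,
     was photographed for the clinical atlas.</p>

  <img class="variant open"    src="fig-open.jpg"    alt="unedited plate">
  <img class="variant partial" src="fig-partial.jpg" alt="eyes obscured">
  <img class="variant opaque"  src="fig-opaque.jpg"  alt="face obscured">
</article>
```

The reader's toggle only has to change the `data-mode` attribute on the wrapping element, which is the kind of small show/hide behavior the Foundation library already provides.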
Purcell's introduction of refusal theory also provided the team with another opportunity to question our minimal-computing approach. Microsoft purchased Github in 2018, just as we were addressing the workload of updating those Omeka sites that had broken. Our choice to engage in a minimal-computing endeavor was thoughtful and anchored in careful consideration; our choice of Github Pages and Github's front-end file-editing GUI was less well-theorized, and the OP offered us the opportunity to reconsider our choice. While we decided Github was the most timely and stable choice for hosting the dissertation, we also made offsetting decisions owing largely to the affordances of minimal-computing sites. The most crucial of these was to take a cue from the LOCKSS program (https://www.lockss.org/) in our choice of static-site generation through Jekyll. Static sites are more easily preserved in packaged form on multiple platforms, from IU's institutional repository ScholarWorks to open-source repositories like Knowledge Commons. More importantly, static sites function with their intended behavior on the Internet Archive, which serves as a fully-open-source public repository of general knowledge.11 The distribution of many copies of completed archives in a variety of forms has also flagged a future need to explore alternatives to Github Pages like GitLab, in order to provide a platform for the DigitalArc template and its automated static-site page building process outside of Github's environment.
Conclusion
While many of the choices we made were specifically oriented around the audiences for DA and OP, and the models in the digital-archive and digital-publication spaces that offered some but not all of the features we needed, the interplay between two platforms developed for different audiences on different models by the same team of scholar-developers has offered us some lessons that we are now integrating into our future work.
The first lesson we learned along the way is that contingency matters, and that the impulsive choices we made in response to serendipity and contingency can be a foundation for thoughtful, worthwhile change. If not for the PHP upgrade in Fall of 2018, we might not have repositioned our long-term platform-choice goals for DA in the context of minimal computing. That choice, in turn, shaped our choice of CSS-heavy redaction in the OP, a choice that has made our code more portable.
Our second takeaway is that it is hard to fully escape maximal computing. While Jekyll offers the option of building a website entirely on a personal computer, doing so carries an enormous amount of technical overhead. In order to re-scope the technical skills required of our community-archive audiences, we needed Github's full infrastructure–its web-based editing system, Github Pages and Github Actions–which is itself maximal and monopolistic.12 While we'll always need a maximalist web platform to provide support for less technical content creators, both in the community-archive world and for self-publishing in the digital humanities, diversification of platform away from monopolies–to Gitlab and Bitbucket in particular, in this instance–will also help us as we seek to live up to the goals we set of static-site generation in service of long-term stability for both DA and OP users. Practically speaking, this compromise allowed users to author and access sites on their phones and creators to minimally maintain sites over a long period of time, with no upgrades or technical skills necessary to keep the site accessible to public audiences.
The functionality we developed for DA and the OP may be imperfect and can be time-consuming. However, the code that drives that functionality can be operationalized as part of the ethical reconsideration of a scholar's primary evidence. Whether that evidence is from communities whose partnerships with institutions have been fraught or from subjects whose evidence was included in archives without their consent, the code that presents our evidence in a variety of forms is part of a humanistic process, an endeavor that happens in context and should treat context and contingency as an opportunity to understand our relationship to technology, rather than as something to be erased.
Acknowledgements
We would like to thank Nate Howard, Sagar Prabhu, Jessica Organ, and Morgan Vickery for contributing to the web application development of DigitalArc and Opaque Publisher. We also want to thank our friends and colleagues who shaped this work, especially Emily Clark, Vanessa Elias, and Marisa Hicks-Alcaraz. We would also like to thank Élika Ortega, Roopika Risam, and Alex Gil, and especially members of the Minimal Computing GO::DH working group, for the various ways of framing minimal computing for the digital humanities and for the publics more broadly; for the inspiration and the paradoxes that have kept us on our toes. We are big fans of In the Library with the Lead Pipe's open peer review process, and appreciate Quinn Dombrowski, Pamella Lach, and Jessica Schomberg for their feedback. We are grateful to get to this (better) version of the article. Lastly, we would like to thank our funders for making this work possible: the New York Academy of Medicine, the Center for Research on Race Ethnicity and Society, and the American Council of Learned Societies' (ACLS) Digital Justice grant program.
References
Alpert-Abrams, Hannah, et al. (2019). "Post-Custodialism for the Collective Good: Examining Neoliberalism in US–Latin American Archival Partnerships." Journal of Critical Library and Information Studies 2, no. 1.
Boyles, Christina, et al. (2011). "Postcustodial Praxis: Building Shared Context through Decolonial Archiving." Scholarly Editing 39.
Ciula, Arianna, Øyvind Eide, Cristina Marras, and Patrick Sahle. (2018). “Models and Modelling between Digital and Humanities. Remarks from a Multidisciplinary Perspective.” Historical Social Research / Historische Sozialforschung 43, no. 4, https://www.jstor.org/stable/26544261.
Miya, Chelsea and Geoffrey Rockwell. (2025). “Platitudes: The Carbon Weight of the Post-Platform Scholarly Web”, The Journal of Electronic Publishing 28, no. 2. doi: https://doi.org/10.3998/jep.7247
Purcell, Sean. (2025). “The Tuberculosis Specimen: The Dying Body and Its Use in the War Against the ‘Great White Plague.’” Indiana University. https://tuberculosisspecimen.github.io/diss/.
Russell, John E., and Merinda Kaye Hensley. (2017). "Beyond Buttonology: Digital Humanities, Digital Pedagogy and the ACRL Framework." College & Research Libraries News (December 2017), 588-591, 600.
Sutton, Jazma, and Kalani Craig. (2022). “Reaping the Harvest: Descendant Archival Practice to Foster Sustainable Digital Archives for Rural Black Women.” Digital Humanities Quarterly, vol. 16, no. 3, https://dhq.digitalhumanities.org/vol/16/3/000640/000640.html.
Wikle, Olivia, Evan Williamson and Devin Becker. (2020). “What is Static Web and What’s it Doing in the Digital Humanities Classroom?” In M. Brooks et al.(Eds), Literacies in a Digital Humanities Context: A dh+lib Special Issue (pp. 14-18), https://doi.org/10.17613/ryea-4z10.
Wingo, Rebecca, and M.R. Anderson. (2025). "A Sustainable Shared Authority: The Future of Rondo's Past." Public Humanities. doi: https://doi.org/10.1017/pub.2025.29.
Zenzaro, Simone. (2024). "Models for Digital Humanities Tools: Coping with Technological Changes and Obsolescence." International Journal of Information Science & Technology, vol. 8, no. 2, http://dx.doi.org/10.57675/IMIST.PRSM/ijist-v8i2.283.
Institutional dependencies can facilitate the creation and publication of a digital community archive, but they can also result in reduced, or perceived reductions in, community control over their own materials. For example, an institutional partnership might mean communities need to meet digital archiving standards that require costly equipment, where a community archive goal focuses on capturing community contributions (interviews, artifacts, etc.) in the best possible way, using easy-to-access and affordable mechanisms like one's smartphone and a DIY lightbox. Another example is reliance on more advanced technological infrastructure offered by institutions. Rather than opt for a post-custodial approach in which an institution like a public library or local history center hosts the digital archive, community members can do so themselves (Alpert-Abrams et al. 2019; Boyles et al. 2011).
For an example of a more complex installation of Jekyll, which relies on the user installing a programming environment on their computer to compile their site, see Amanda Visconti, "Building a static website with Jekyll and GitHub Pages," Programming Historian 5 (2016), https://doi.org/10.46430/phen0048. Note that the "difficulty" level for this tutorial is rated as "low."
Purcell, Sean. (2025). "The Tuberculosis Specimen: The Dying Body and Its Use in the War Against the 'Great White Plague.'" Indiana University. https://tuberculosisspecimen.github.io/diss/.
At the time of writing this article, Mukurtu had still not released version 4, which would move away from Drupal 7 to Drupal 11. Currently, Mukurtu 4 is available as a stable beta: https://mukurtu.org/mukurtu-4/.
Flagging of text and images depended on a predefined ethical framework and occurred at different phases of the research. For The Tuberculosis Specimen, opacity designations were decided based on different approaches to biomedical informed consent and subject privacy (https://tuberculosisspecimen.github.io/diss/dissertation/FAQ). Images were flagged as they were added to chapter drafts, and text was flagged during the project's 'ethics audit,' a moment prior to publication when researchers are invited to reflect on the processes they used and the materials they included, and to alter their final result to match the ethical frameworks they hoped to meet in the project (Purcell, Craig & Dalmau 2025). These sections were flagged in the project's word processing program (Scrivener) using placeholder text, which was changed with a batch find-and-replace script for text files (https://tuberculosisspecimen.github.io/diss/dissertation/X_2_3).
The Wayback Machine is able to preserve Sean's dissertation as-is: https://web.archive.org/web/20250516183042/https://tuberculosisspecimen.github.io/diss/. The same isn't true for Mary Borgo Ton, whose born-digital dissertation preceded Sean's at Indiana University. Mary had to combine several output and documentation approaches to preserve her dissertation as closely as possible, since the Wayback Machine was unable to preserve content produced by Scalar, which is a more complex PHP site. Instead, parts of Mary's dissertation were preserved via Indiana University's institutional repository: https://scholarworks.iu.edu/dspace/handle/2022/26951.
The Dutch Open Government Act (Wet open overheid) has the potential to act as a game changer, as it will require governments to disclose these administrative decisions proactively to everyone.
The Barcelona Declaration on Open Research Information was launched in 2024 as an initiative to encourage research institutions and research funders to actively contribute to the open availability of research information.
Today, we understand that openness alone is not enough to generate lasting and structural impact. Open without structure is like a library without a catalogue.
Although there are many examples of how open data reuse generates societal value, this potential is sometimes forgotten in political debates on open governance.
It was interesting to see this short article1 about the enclosure of the web commons go by, just after having listened to The Dig's epic two-part interview with Peter Linebaugh2.
What’s needed is a multi-stakeholder settlement in which large-scale
users of the commons take on long-term, structured obligations to
sustain it: contractual funding through paid APIs and usage-based
levies, formal recognition of DPGs as Digital Public Infrastructure to
unlock multilateral co-financing, and a shift in philanthropy from
one-off project grants to sustained core support for the institutions
that maintain the commons.
I hadn't realized that the details of the deals Wikipedia is striking aren't fully transparent or well understood outside of closed doors. I think it's really instructive to think about what is happening right now on the web as enclosure, and as part of a longer history of capitalism (as Linebaugh and Denvir discuss). The interview made me think of the craft, tooling, and means of production that are still present in the software industry, but that are being actively enclosed by the centralization of tooling and skill, of craft and knowledge itself.
Yes, I’m talking about LLMs here. Once you see it, it’s impossible not
to see it.
This all makes me think of Elinor Ostrom's design principles, and how important it is that the Wikipedia community have insight into how their commons is being used: through monitoring, decision making, and resolving the future conflicts that will no doubt ensue.
Digital Public Goods (DPG) are supposed to be shielded from precisely
this kind of capture. They require financing models commensurate with
their public value, not models that make them fiscally dependent on
their most extractive users. When the sustainability of a DPG hinges on
a small oligopoly of AI firms, the risk turns political: agenda-setting
and governance drift toward those who can threaten to walk away.
Perhaps Wikipedia is at risk of losing its DPG status? In what practical ways does identification as a DPG help shape governance? What is being done, or what could we be doing, to push back on this enclosure? And of course, the situation is quite a bit bigger when you consider the strain that LLM-hungry bots are putting on cultural heritage organizations, also part of a larger commons.
In this paper, the question of machine learning is revisited in order to
explore whether Bayesian learning, as a form of abductive reasoning, can
provide an alternative to the current dichotomy between inductive and
deductive approaches in machine learning debates. The paper will further
demonstrate that machine learning invariably entails a degree of
situatedness, as evidenced by the example of Bayesian belief networks,
which arguably rely on abductive reasoning. In this manner, the
discourse surrounding Bayesian learning models has the capacity to
elucidate the aspects that are often left implicit in contemporary
machine learning debates and methodologies.
Our approach is harness-first engineering: instead of reading every line
of agent-generated code, invest in automated checks that can tell us
with high confidence, in seconds, whether the code is correct. The agent
generates code, the harness verifies it, production telemetry validates
it, and if something is wrong, the feedback updates the harness and the
agent tries again. The specific methods to develop harnesses vary in
rigor—deterministic simulation testing, formal specifications, shadow
evaluation, observability-driven feedback loops—but the principle
remains the same: make the verification fast and automatic, and let the
harness do the work that human review cannot scale to do.
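To make that loop concrete, here is a minimal, self-contained sketch: the stubbed fake_agent stands in for an LLM call, and a toy test suite plays the harness. None of this is the original authors' tooling; it only illustrates the generate/verify/retry principle.

    # Illustrative harness-first loop (not any specific tool's API).
    import traceback

    def fake_agent(feedback):
        """Stand-in for an LLM: returns source code for a function `median`."""
        if feedback is None:
            # First attempt: a buggy candidate (wrong for even-length inputs).
            return "def median(xs):\n    return sorted(xs)[len(xs) // 2]\n"
        # After seeing the harness report, return a corrected candidate.
        return (
            "def median(xs):\n"
            "    s = sorted(xs)\n"
            "    mid = len(s) // 2\n"
            "    return s[mid] if len(s) % 2 else (s[mid - 1] + s[mid]) / 2\n"
        )

    def harness(source):
        """Fast automated checks; returns (passed, feedback_report)."""
        ns = {}
        try:
            exec(source, ns)
            f = ns["median"]
            assert f([1, 3, 2]) == 2
            assert f([1, 2, 3, 4]) == 2.5   # catches the even-length bug
            return True, None
        except Exception:
            return False, traceback.format_exc()

    feedback = None
    for attempt in range(3):
        code = fake_agent(feedback)      # agent generates code
        ok, feedback = harness(code)     # harness verifies it
        if ok:
            print(f"verified on attempt {attempt + 1}")
            break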
Dolt uses Prolly Trees because they give us two very important
properties: history independence and structural sharing. These are both
incredibly valuable properties for a distributed database. Structural
sharing in particular means that two tables that differ only slightly
can re-use storage space for the parts that are the same. Most SQL
engines obtain structural sharing for tables by using B-trees or a
similar data structure… but that doesn’t extend easily to JSON
documents. Some tools like Git and IPFS achieve structural sharing for
directories by using a tree structure that mirrors the directory… but
that creates a level of indirection for each layer of the document,
which would slow down queries if the document had too many nested
layers. Something else was needed.
On this account, protocols are governance structures whose design
choices allocate power, and the purpose of the entire enterprise is the
protection of rights. Protocol design is a form of political design, and
the appropriate way to evaluate protocols is not only by their technical
properties but by the governance outcomes they produce.
The popularisation of artificial intelligence (AI) has given rise to
imaginaries that invite alienation and mystification. At a time when
these technologies seem to be consolidating, it is pertinent to map
their connections with human activities and more than human territories.
What set of extractions, agencies and resources allow us to converse
online with a text-generating tool or to obtain images in a matter of
seconds?
How to analyze data settings rather than data sets, acknowledging the
meaning-making power of the local.
In our data-driven society, it is too easy to assume the transparency of
data. Instead, Yanni Loukissas argues in All Data Are Local, we should
approach data sets with an awareness that data are created by humans and
their dutiful machines, at a time, in a place, with the instruments at
hand, for audiences that are conditioned to receive them. The term data
set implies something discrete, complete, and portable, but it is none
of those things. Examining a series of data sources important for
understanding the state of public life in the United States—Harvard’s
Arnold Arboretum, the Digital Public Library of America, UCLA’s
Television News Archive, and the real estate marketplace
Zillow—Loukissas shows us how to analyze data settings rather than data
sets.
A breathtakingly elegant visual dictionary of 2000 Japanese words for
rain, with 100 drawings in indigo.
In Water of the Sky, artist Miya Ando offers us a beautifully rich,
bilingual visual dictionary for rain. Through a collection of 2,000
Japanese words, their English interpretations, and 100 drawings, Ando
describes the breadth and diversity of rain’s many expressions: when it
falls, how it falls, and how its observer might be transformed
physically or emotionally by its presence. The words range from prosaic
to esoteric, extending from the meteorological (mukaame, or “very fine
rain that falls in spring”) to the mystical (bunryūu, or “rain that
splits a dragon’s body in half”) and from the minute (kisame, or
“raindrops that fall off the leaves and branches of trees”) to the vast
(takuu, or “blessed rain that quenches all things in the universe”).
The AI industry defaults to “bigger is better” - GPT-4, Claude Opus,
Llama 70B. But for most production workloads, 80% of LLM calls don’t
need a 100B+ parameter model. They need a function routed, a tool
selected, a query classified, or a simple response generated.
Small LLMs (under 4B parameters) solve this by running locally, for
free, in milliseconds.
PyTorch is currently one of the most popular deep learning frameworks.
It is an open-source library built upon the Torch Library.
Most tutorials assume you’re comfortable jumping straight into code. I
made a visual introduction that walks through the core concepts step by
step, with animations and diagrams instead of walls of text
A filesystem backed by PostgreSQL, and a filesystem interface to
PostgreSQL. TigerFS mounts a database as a directory. Every file is a
real row. Writes are transactions. Multiple agents and humans can read
and write concurrently with full ACID guarantees, locally or across
machines. Any tool that works with files works out of the box.
I wanted to show how an index is not just a bibliographic convention, or
an organizational method, or a media form, or a financial instrument, or
a corporeal component; it’s also an intellectual architecture, a
literary genre, a creative form, a semiotic concept, and an embodiment
of agency — one that might offer an important antidote to the pervasive
autonomous, extractive cloudification of our contemporary information
ecology.
Brandolini’s law (or the bullshit asymmetry principle) is an Internet
adage coined in 2013 by Italian programmer Alberto Brandolini. It
compares the considerable effort of debunking misinformation to the
relative ease of creating it in the first place. The adage states:
The amount of energy needed to refute bullshit is an order of magnitude
bigger than that needed to produce it.[1][2]
The challenge of refuting bullshit does not come just from its
time-consuming nature, but also from the challenge of defying and
confronting one’s community.
Summary: An AI agent of unknown ownership autonomously wrote and
published a personalized hit piece about me after I rejected its code,
attempting to damage my reputation and shame me into accepting its
changes into a mainstream python library. This represents a
first-of-its-kind case study of misaligned AI behavior in the wild, and
raises serious concerns about currently deployed AI agents executing
blackmail threats.
Edward Palmer Thompson (3 February 1924 – 28 August 1993) was an English
historian, writer, socialist and peace campaigner. He is best known for
his historical work on the radical movements in the late 18th and early
19th centuries, in particular The Making of the English Working Class
(1963).
PDF parser for AI data extraction — Extract Markdown, JSON (with
bounding boxes), and HTML from any PDF. #1 in benchmarks (0.90 overall).
Deterministic local mode + AI hybrid mode for complex pages.
It’s funny, everyone has been predicting the Singularity for decades
now. The premise is we build systems that are so smart that they
themselves can build the next system that is even smarter, that builds
the next smarter one, and so on, and once we get that started, if they
keep getting smarter fast enough, then the incremental time (t) to
achieve a unit (u) of improvement goes to zero, so (u/t) goes to
infinity and foom.
Anyway, I have never believed in this theory for the simple reason we
outlined above: the majority of time needed to get anything done is not
actually the time doing it. It’s wall clock time. Waiting. Latency.
And you can’t overcome latency with brute force.
I know you want to. I know many of you now work at companies where the
business model kinda depends on doing exactly that.
CanIRun.ai runs entirely in your browser. When you visit the site, we
use browser APIs to detect your GPU, CPU, and memory — then we calculate
which AI models can run on your hardware and how fast. No data is sent
to any server. Everything is computed client-side.
Generate IIIF Level 0 static tiles from images in a HF Bucket. Downloads
source images from a bucket, generates IIIF Image API 3.0 tiles using
libvips, creates a IIIF Presentation v3 manifest, and syncs everything
to an output bucket for static serving via HF CDN.
2026 will be the year of AI agents… It was foretold, and indeed since the start of the year we have been witnessing the spread and rise of two broad families of agentic tools of a new type: on the one hand, coding-oriented assistants such as Claude Code, Codex, Gemini CLI, Opencode, etc., and on the other, frameworks for creating, configuring, and orchestrating agents that automate workflows via communication channels (Slack, Discord, messaging…), such as OpenClaw and its many derivatives
Toi Derricotte (pronounced DARE-ah-cot ) (born April 12, 1941) is an
American poet. She is the author of six poetry collections and a
literary memoir. She has won numerous literary awards, including the
2020 Frost Medal for distinguished lifetime achievement in poetry
awarded by the Poetry Society of America, and the 2021 Wallace Stevens
Award, sponsored by the Academy of American Poets. From 2012–2017,
Derricotte served as a Chancellor of the Academy of American Poets. She
is currently a professor emerita in writing at the University of
Pittsburgh. Derricotte is a member of The Wintergreen Women Writers
Collective.[2]
This is an attempt to clarify the discussion of degrowth strategy, a topic on which I think there is considerable confusion and there are mistaken approaches. Debate has recently been fuelled by Jason Hickel's argument for a socialist position on both the goal of degrowth and the means to it. Liegey, Nelson and Leahy replied against Jason, defending the wide and diverse range of strategies now characteristic of the movement and often referred to by the terms "Horizontalism" and "Pluriverse". Several others have contributed to the discussion, including Jason's reply to Liegey et al., his subsequent response, my critique of the Liegey, Nelson and Leahy article, Gasparo and Vico, Gregoletto and Burton, Bunea, and Kallis and D'Alisa.
In this series we will explain ideas in Category theory from first
principles in order to build intuition and derive the actual formal
definitions. We’ll use that foundation to demonstrate exactly where
these concepts fit into day to day functional programming and how you
can do useful things with that knowledge.
Watch this if you want an introduction to category theory that is
simple, practical, joyful, and deeply grounded in functional programming
Sentimental Value (Norwegian: Affeksjonsverdi) is a 2025 Norwegian drama
film directed by Joachim Trier, who co-wrote it with Eskil Vogt. It
follows sisters Nora (Renate Reinsve) and Agnes (Inga Ibsdotter
Lilleaas) in their reunion with their estranged father Gustav (Stellan
Skarsgård). It also stars Elle Fanning.
On the Silver Globe (Polish: Na srebrnym globie) is a 1988 Polish epic
surrealist science fiction arthouse film[1] written and directed by
Andrzej Żuławski, adapted from The Lunar Trilogy by his grand-uncle,
Jerzy Żuławski. Starring Andrzej Seweryn, Jerzy Trela, Iwona Bielska,
Jan Frycz, Henryk Bista, Grażyna Deląg and Krystyna Janda, the plot
follows a team of astronauts who land on an uninhabited planet and form
a society. Many years later, a single astronaut is sent to the planet
and becomes a messiah.
Production took place from 1976 to 1977, but was interrupted by the
Polish authorities. The budget is estimated to be at least PLN 58
million.[2] Many years later, Żuławski was able to finish his film,
although not as originally intended. On the Silver Globe premiered at
the 1988 Cannes Film Festival, and has received consistent critical
acclaim.
A fundamental problem for decentralized systems like permissionless blockchains is that their security depends upon the cost of an attack being greater than the potential reward from it. Various techniques are used to impose these costs, generally either Proof-of-Work (PoW) or Proof-of-Stake (PoS). These costs have implications for the economics (or tokenomics) of such systems, for example that their security is linear in cost, whereas centralized systems can use techniques such as encryption to achieve security exponential in cost.
Now, via Toby Nangle's Stablecoin = Fracturedcoin we find Tokenomics and blockchain fragmentation by Hyun Song Shin, whose basic point is that these costs must be borne by the users of the system. For cryptocurrencies, this means through transaction fees, inflation of the currency, or both. The tradeoff between cost and security means that there is a market for competing blockchains making different tradeoffs. In practice we see a vast number of
competing blockchains:
The chart shows Ethereum losing market share against competing blockchains.
Shin's analysis uses game theory to explain why this fragmentation is an inevitable result of tokenomics. Below the fold I go into the background and the details of Shin's explanation.
Background
In 2018's Cryptocurrencies Have Limits I discussed Eric Budish's The Economic Limits Of Bitcoin And The Blockchain, an important analysis of the economics of two kinds of "51% attack" on Bitcoin and other cryptocurrencies based on PoW blockchains. Among other things, Budish shows that, for safety, the value of transactions in a block must be low relative to the fees in the block plus the reward for mining the block.
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
Bitcoin's costs are defrayed almost entirely by inflating the currency, as shown in this chart of the last year's income for miners. Notice that the fees are barely visible.
It has been known for at least a decade that Bitcoin's plan to phase out the inflation of the currency was problematic. In 2024's Fee-Only Bitcoin I wrote:
Our key insight is that with only transaction fees, the variance of the miner reward is very high due to the randomness of the block arrival time, and it becomes attractive to fork a “wealthy” block to “steal” the rewards therein.
So Bitcoin's security depends upon the "price" rising enough to counteract the four-yearly halvings of the block reward. In that post I made a thought-experiment:
As I write the average fee per transaction is $3.21 while the average cost (reward plus fee) is $65.72, so transactions are 95% subsidized by inflating the currency. Over time, miners reap about 1.5% of the transaction volume. The miners' daily income is around $30M, below average. This is about 2.5E-5 of BTC's "market cap".
Let's assume, optimistically, that this below-average daily fraction of the "market cap" is sufficient to deter attacks, and examine what might happen in 2036 after 3 more halvings. The block reward will be 0.39BTC. Let's work in 2024 dollars and assume that the BTC "price" exceeds inflation by 3.5%, so in 12 years BTC will be around $98.2K.
To maintain deterrence, miners' daily income will need to be about $50M. Each day there will be about 144 blocks generating 56.16BTC, or about $5.5M, which is 11% of the required miners' income. Instead of 5% of the income, fees will need to cover 89% of it. The daily fees will need to be $44.5M. Bitcoin's blockchain averages around 500K transactions/day, so the average transaction fee will need to be around $90, or around 30 times the current fee.
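The arithmetic can be replayed directly; the sketch below only restates the assumptions of the thought experiment (the roughly $65K 2024 price is inferred by working back from the $98.2K figure, and the $50M deterrence level is taken as given).

    # Replaying the stated assumptions of the 2036 thought experiment (2024 dollars).
    btc_price_2024 = 65_000                              # inferred from the $98.2K figure
    btc_price_2036 = btc_price_2024 * 1.035 ** 12        # ~ $98.2K
    block_reward_2036 = 3.125 / 2 ** 3                   # three more halvings -> ~0.39 BTC
    blocks_per_day = 144
    subsidy_per_day = blocks_per_day * block_reward_2036 * btc_price_2036   # ~ $5.5M
    required_income = 50e6                               # assumed deterrence level
    fees_needed = required_income - subsidy_per_day      # ~ $44.5M/day
    tx_per_day = 500_000
    fee_per_tx = fees_needed / tx_per_day                # ~ $90 per transaction
    print(round(btc_price_2036), round(subsidy_per_day), round(fee_per_tx, 2))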
Bitcoin users set the fee they pay for their transaction. In effect they are bidding in a blind auction for the limited supply of transaction slots. Miners are motivated to include high-fee transactions in their next block. If there were an infinite supply of transaction slots, miners' fee income would be zero. In practice, much of the time the supply of slots exceeds demand and fees are low. At times when everyone wants to transact, such as when the "price" crashes, the average fee spikes enormously.
Cryptocurrencies such as Bitcoin rely on a ‘proof of work’ scheme to allow nodes in the network to ‘agree’ to append a block of transactions to the blockchain, but this scheme requires real resources (a cost) from the node. This column examines an alternative consensus mechanism in the form of proof-of-stake protocols. It finds that an economically sustainable network will involve the same cost, regardless of whether it is proof of work or proof of stake. It also suggests that permissioned networks will not be able to economise on costs relative to permissionless networks.
In 2022 Ethereum switched from Proof-of-Work to Proof-of-Stake, reducing its energy consumption by around 99%. This chart shows that, like Bitcoin, until the "Merge" the costs were largely defrayed by inflating the currency. After the "Merge" the blockchain has been running on transaction fees.
Shin's Analysis
Here is a summary of Shin's analysis.
Notation
There is a continuum of validators i.
For validator i ∈ [0;1], the cost of contributing to governance is ci > 0.
The blockchain needs at least a fraction k̂ of the validators contributing to be secure. Shin writes:
There are two special cases of note: k̂ = 1 (unanimity, corresponding to full decentralisation where every validator must participate for the blockchain to function) and k̂ = 0 which corresponds to full centralisation, where one validator has authority to update the ledger.
k̂ = 1 is impractical, lacking fault tolerance. k̂ = 0 is much more practical: it is the traditional trusted intermediary.
If the blockchain is secure, each contributing validator earns a reward p > 0. A non-contributing validator earns zero.
The validators share a common cost threshold c*. If ci < c*, validator i contributes; if ci > c*, validator i does not.
Argument
Each validator will want to contribute only if at least k̂ - 1 other validators contribute, which poses a coordination problem. The case of particular interest is the validator with ci = c*. Shin writes:
Intuitively, even though the marginal validator may have very precise information about the common cost c*, the validator faces irreducible uncertainty about how many other validators will choose to contribute. It is this strategic uncertainty — uncertainty about others' actions — that is the central feature of the coordination problem.
This "strategic uncertainty" is similar to the attacker's uncertainty about other peers' actions that is at the heart of the defenses of the LOCKSS system in our 2003 paper Preserving peer replicas by rate-limited sampled voting.
Shin Figure 6
Because the marginal validator's ci = c*, the decision whether or not to contribute makes no difference. Shin's Figure 6 explains this graphically. Rectangle A is the loss if k < k̂ and rectangle B is the gain if k > k̂. Setting them equal gives:
c*k̂ = (p - c*)(1 - k̂)
which simplifies to:
c* = p(1 - k̂)
Shin and Morris earlier showed that this is the unique equilibrium no matter what strategy the validators use.
Result
What this means is that successful validation depends upon the reward p being large enough; rearranging the equilibrium condition above, p = c*/(1 - k̂).
Note that the required reward p explodes as k̂ → 1. This is the central result of the paper: the more decentralised the blockchain (the higher the supermajority threshold), the higher must be the rents that accrue to validators. In the limiting case of unanimity (k̂ = 1), no finite reward can sustain the coordination equilibrium.
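Normalising the marginal validator's cost c* to 1, a few lines make the explosion visible; the thresholds chosen are only illustrative.

    # How the required reward grows with the supermajority threshold k̂,
    # using the rearranged equilibrium condition p = c*/(1 - k̂).
    c_star = 1.0   # normalise the marginal validator's cost to 1
    for k_hat in (0.0, 0.5, 0.67, 0.9, 0.99, 0.999):
        p = c_star / (1 - k_hat)
        print(f"threshold {k_hat:5.3f} -> required reward p = {p:8.1f} x c*")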
Shin Figure 1
This is yet another result showing that a reasonably secure blockchain is unreasonably expensive. The complication is that, much of the time, transactions are cheap because the demand for them is low. Thus most of the time validators are not earning enough for the risks they run. But:
When many users want to transact at the same time, they bid against each other for limited block space, and fees spike — much as taxi fares surge during rush hour. Figure 1 shows how Ethereum gas fees exhibited sharp spikes during periods of network congestion, such as during surges in decentralised finance (DeFi) activity or spikes in the minting of non-fungible tokens (NFTs). These spikes are not merely a reflection of excess demand; they are the mechanism through which the blockchain extracts the rents needed to sustain validator coordination.
Note that these spikes mean that the majority of the time fees are low but the majority of transactions face high fees. It is this "user experience" that drives the fragmentation that Shin describes:
When demand for block space is high, fees rise and validators are well compensated. But high fees deter users, especially those making small or routine transactions. These users are the first to migrate to competing blockchains that offer lower fees — blockchains that can offer lower fees precisely because they have lower coordination thresholds (and hence less security). The users who remain on the more secure blockchain are those with the highest willingness to pay: institutions, large DeFi protocols, and transactions where security and censorship resistance are paramount. This sorting of users across blockchains is the essence of fragmentation.
The fragmentation argument is the flipside of blockchain's "scalability trilemma," as described by Vitalik Buterin, who posed the problem as the impossibility of attaining, simultaneously, a ledger that is decentralised, secure, and scalable.
It is worth noting that Buterin's trilemma is a version for PoS of the trilemma Markus K Brunnermeier and Joseph Abadi introduced for PoW in 2018's The economics of blockchains. See The Blockchain Trilemma for details.
Shin's focus is primarily on the effects of fragmentation on stablecoins. He notes that:
Rather than converging on a single platform, stablecoin activity is scattered across many chains (Figure 4). As of late 2025, Ethereum held the majority of total stablecoin supply but was facing competition from Tron and Solana, each of which had attracted tens of billions of dollars in stablecoin balances. Each chain serves different geographies and use cases: Ethereum for institutional settlement, Tron for low-cost remittances, Solana for retail payments and DeFi activity.
This fragmentation among blockchains would not matter much if stablecoins were interoperable between them, but they are confined to the blockchain on which they were minted:
A USDC token on Ethereum is not the same as a USDC token on Solana — they exist on separate ledgers that have no native way of communicating with each other. Transferring between chains requires the use of bridges: specialised software protocols that lock tokens on one chain and issue equivalent tokens on another. These bridges introduce additional risks, including vulnerabilities in the smart contract code — bridge exploits have accounted for billions of dollars in cumulative losses — and they impose costs and delays that undermine the seamless transferability that is the hallmark of money. The result is a landscape in which stablecoins from the same issuer exist in multiple, non-fungible forms across different blockchains, fragmenting liquidity and undercutting the network effects that should be the strength of a widely adopted payment instrument.
Discussion
As I've been pointing out since 2014, very powerful economic forces mean that Decentralized Systems Aren't. So the users paying for the more expensive transactions because they believe in decentralization aren't getting what they pay for.
Coinbase Global Inc. is already the second-largest validator ... controlling about 14% of staked Ether. The top provider, Lido, controls 31.7% of the staked tokens,
That is 45.7% of the total staked controlled by the top two.
In addition all these networks lack software diversity. For example, as I write the top two Ethereum consensus clients have nearly 70% market share, and the top two execution clients have 82% market share.
Shin writes as if more decentralization equals more security even though it doesn't happen in practice, but this isn't really a problem. What the users paying the higher fees want is more security, and they are probably getting it because they are paying higher fees. As I discussed in Sabotaging Bitcoin, the reason major blockchains like Bitcoin and Ethereum don't get attacked is not that the (short-term) rewards for an attack are less than the cost. It is rather that everyone capable of mounting an attack is making so much money that:
those who could kill the golden goose don't want to.
In any case what matters for Shin's analysis isn't that the users actually get more security for higher fees, but that they believe they do. Like so much in the cryptocurrency world, what matters is gaslighting. But what the chart showing Ethereum losing market share shows is that security is not a concern for a typical user.
I revisited an old Go package I've been using over the past few years to build IIIF manifests — nothing fancy, just some glue around structs and JSON. From that I built a new CLI, mkiiif, to generate IIIF manifests from static images (tiled or not). There are plenty of similar tools out there (iiif-tiler, tile-iiif, biiif, ...) but none quite matched the CLI ergonomics I needed for my daily workflow.
go install github.com/docuverse/iiif/cmd/mkiiif@latest
mkiiif can generate an IIIF manifest from a source directory containing images, or from a PDF file that gets exploded and converted to images via mupdf. Output images can be either untiled or static tiles generated with vips. Both approaches produce a IIIF Level 0 compliant layout, static files that can be served from any HTTP server, with no image server required. Untiled is less efficient for large images but perfectly fine for printed books, papers, and similar material.
mupdf and vips are external dependencies that need to be installed separately. They are invoked via subprocess; I chose not to add Go library wrappers around them to keep the tool simple. WASM ports of both may become viable in the future.
The CLI usage:
Usage: mkiiif -id <id> -base <url> -title <title> -source <dir|pdf> -destination <dir> [-tiles]
  -base string         Base URL where the manifest will be served (e.g. https://example.org/iiif)
  -destination string  Output directory; a subdirectory named <id> will be created inside it, containing the images and manifest.json
  -id string           Unique identifier for the manifest (e.g. book1)
  -resolution int      Resolution (DPI) used when converting PDF pages to images via mutool (default 150)
  -source string       Path to a directory of images or a PDF file to convert
  -tiles               Generate IIIF image tiles for each image using vips dzsave (requires vips)
  -title string        Human-readable title of the manifest
The directory can then be served from https://digital.library.org.
I've adopted this URL scheme:
https://{base}/{id}
  /manifest.json — the IIIF manifest
  /index.html — a simple viewer
So in the example above, https://digital.library.org/iiif01 opens a full viewer to browse the object. The viewer used is Triiiceratops — the newest viewer in the IIIF ecosystem. Built on Svelte and OpenSeadragon, it is still young but very usable, lightweight, and easy to embed and customize. It is my favourite viewer.
mkiiif doesn't handle metadata for now (and probably won't) — the manifest can be easily patched to insert descriptive metadata in a later step, after image preparation, pulling from any existing datasource or metadata catalog.
The main drawback of generating IIIF this way is that you end up managing a large number of files on the filesystem, and handling millions of small image tiles can be slow (and costly). This is where IIIF intersects — and overlaps — with similar practices in digital preservation, such as BagIt, OCFL, and WARC/WACZ. So far there's no specification or viewer implementation that handles IIIF containers (e.g. a zip file bundling images, tiles, and the manifest). Discussions on this have been ongoing in the past; I've recently been looking at analogous approaches like GeoTIFF and SZI.
A static IIIF bundle generated with this CLI still needs to be served from an HTTP server, with the base URL defined at derivation time. Could such a bundle be opened from localhost and viewed directly in the browser? Service Workers might help here (even if HTTP is still needed), but it's a rabbit hole I haven't explored yet.
The CLI is pretty bare-bones — feel free to suggest improvements or report bugs. I've been using it over the past weeks as part of a personal project: an amateur digital library built around a DIY book scanner I assembled at home, to preserve magazines, zines, and similar material (content NSFW and out of scope to link here).
Artificial intelligence (AI) has transformed nearly every field. Today, we can access and train models that generate text, images, sound, video, and code. This transformation is reshaping how we think, analyze, and preserve information. Yet, despite the rapid growth of AI, its use for analyzing web archive content seems to advance at a slower pace.
Web archiving is the process of collecting, preserving, and providing access to web content over time, where a memento represents a previous version of a web resource as it existed at a specific moment in the past. Much of the recent work within the web archiving community (e.g., [1], [2], [3]) has focused on making the archiving process itself more intelligent, integrating AI into tasks such as web crawling, storage optimization, and metadata generation. In contrast, the application of AI to the analysis of already archived web content has received comparatively less attention. This gap represents a great opportunity for innovation and contribution, particularly as web archives continue to grow in size, diversity, and historical importance.
In this blog, I aim to outline (based on my perspective, analysis, preliminary work, and insights gained during my PhD candidacy exam) opportunities for where AI could play a role, as well as key challenges involved in integrating AI into web archiving.
My Preliminary Work
Since I joined the PhD program at ODU in 2023 (Blog post introducing myself) under the supervision of Dr. Michele C. Weigle, my work has focused on the intersection of web archiving and AI, with a particular emphasis on leveraging Large Language Models (LLMs) through Retrieval-Augmented Generation (RAG) to detect and interpret text changes across mementos. Identifying the exact moment when content was modified often requires carefully comparing multiple archived versions, a process that can be both tedious and time-consuming. Moreover, detecting and analyzing where important changes occur is not a straightforward process. Users often need to select a subset of captures from thousands available, and even then, there is no guarantee that the differences they find will be meaningful or important. Traditional approaches to memento change analysis, such as lexical comparisons and indexing (e.g., [4], [5]), focus on showing the deletion or addition of terms or phrases but ignore semantic context. As a result, they miss subtle shifts in meaning and rely heavily on human interpretation.
My early work resulted in a paper titled “Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach,” coauthored with Lesley Frew, Dr. Jose J. Padilla, and Dr. Michele C. Weigle. The results of this initial exploration demonstrated that an LLM, when combined with tools such as RAG over a set of mementos, can effectively retrieve and analyze changes in archived web content. However, it remains necessary to constrain the analysis to distinguish between important and non-important changes. Building on this, I have been developing a pipeline to automatically determine whether a change alters meaning or context and should be considered significant. This aims to reduce manual effort, cognitive load, and support integration into web archive systems while advancing methods for analyzing archived web content at scale.
My PhD Candidacy Exam
During the summer of 2025, I passed my PhD candidacy exam (pdf, slides). This milestone marked an important transition in my doctoral studies and provided an opportunity to reflect on my preliminary work, learn, and identify new ways to contribute to the intersection of AI and web archiving. In my candidacy exam, I reviewed a set of ten papers related to analyzing changes and temporal coherence in archived web pages and websites. Changes refer to any modifications observed in web content over time, including the addition, deletion, or alteration of text, images, structure, or other embedded resources. Temporal coherence, on the other hand, refers to the degree to which all components of an archived web page (such as HTML, text, images, and stylesheets) or website (such as interconnected pages and resources) were captured close enough in time to accurately represent how it appeared and functioned at a specific moment. A lack of temporal coherence can result in inconsistencies in how the archived page or site looks or behaves, which may affect the accuracy of change analysis.
Figure 2. A moment from my PhD candidacy exam, where I presented a ten-paper review on analyzing changes and temporal coherence in archived web pages and websites.
AI in Web Archiving: Opportunities
Over time, several researchers have addressed the analysis of changes and temporal coherence in web archives; however, the use of AI in this context has been limited. Below, I outline some research opportunities and challenges based on insights gained from my preliminary work and candidacy exam on how AI could play a role in these activities.
Topic Drift
AlNoamany et al. [6] studied web archive collections to identify off-topic pages within TimeMaps, which occur when a webpage that was originally relevant to a collection later changes into unrelated content. For example, in a collection about the 2003 California Recall Election (Figure 3), the site johnbeard4gov.com initially supported candidate John Beard (September 24, 2003) but later transformed into an unrelated adult-oriented page (December 12, 2003), making it irrelevant to the collection. To detect such changes, AlNoamany et al. proposed automated methods including text-based similarity metrics (cosine similarity, Jaccard similarity, and term overlap), a kernel-based method using web search context, and structural features such as changes in page length and word count. Using manually labeled TimeMap versions as ground truth, they found that the best performance was achieved by combining TF-IDF cosine similarity with word-count change.
Figure 3. Example of johnbeard4gov.com going off-topic. The first capture (September 24, 2003) shows the site supporting a California gubernatorial candidate, while the later capture (December 12, 2003) shows the domain transformed into unrelated adult-oriented content. Source: AlNoamany et al. [6]
Recent advances in AI and representation learning offer opportunities to enhance off-topic detection in web archives beyond traditional term frequency measures. Instead of relying on TF-IDF, future approaches could use dense semantic embeddings from transformer models to better capture meaning and context, enabling the detection of more subtle topic drift. Comparing embedding-based similarity with the methods proposed by AlNoamany et al. could help determine which approach is more effective, particularly when topic shifts are not immediately apparent.
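As a rough illustration of that comparison, the sketch below computes both a TF-IDF cosine similarity and a dense-embedding similarity for two captures. The model name, example strings, and any threshold are illustrative assumptions, not values from AlNoamany et al.

    # Comparing lexical and semantic similarity for off-topic detection (illustrative).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sentence_transformers import SentenceTransformer

    first_capture = "John Beard for Governor: recall election platform and campaign events"
    later_capture = "Adult content warning: click here to enter the site"

    # TF-IDF cosine similarity, the kind of lexical measure used in prior work.
    tfidf = TfidfVectorizer().fit_transform([first_capture, later_capture])
    lexical_sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

    # Dense semantic similarity from a transformer encoder (model name is an assumption).
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([first_capture, later_capture])
    semantic_sim = cosine_similarity([emb[0]], [emb[1]])[0, 0]

    # A memento could be flagged as off-topic when its similarity to the first
    # (on-topic) capture falls below some tuned threshold.
    print(lexical_sim, semantic_sim)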
Temporal Coherence
Weigle et al. [7] highlight a key challenge in modern web archiving: many sites, such as CNN.com, rely on client-side rendering, where the server delivers basic HTML and JavaScript that later fetch dynamic content (often JSON) through API calls. Traditional crawlers like Heritrix do not execute JavaScript or consistently capture these dynamic resources, leading to temporal violations in which archived HTML and embedded JSON files have different capture times, potentially misrepresenting events or news stories. The issue is illustrated in Figure 4, which shows archived CNN.com pages captured between September 2015 and July 2016. The top row displays pages replayed in the Wayback Machine that show the same top-level headline despite being captured months apart. The bottom row shows mementos from the same dates with the correct top-level headlines; however, the second-level stories remain temporally inconsistent.
By measuring time differences between base HTML captures and embedded JSON resources using CNN.com pages (September 2015–July 2016), Weigle et al. identified nearly 15,000 mementos with mismatches exceeding two days. They conclude that browser-based crawlers best reduce such inconsistencies, though due to their higher cost and slower performance, they recommend deploying them selectively for pages that depend on client-side rendering.
Figure 4. Example of temporal coherence violation in archived CNN.com pages using client-side rendering. Source: Weigle et al. [7].
AI can enhance existing approaches to temporal coherence in web archives, such as those proposed by Weigle et al., by helping identify pages that depend on client-side rendering. For example, a machine learning model could be fine-tuned to analyze the initial HTML and related resources to detect signals such as empty or minimally populated DOM structures and classify whether a webpage relies on client-side rendering. AI-based analysis could also estimate the proportion of JavaScript relative to textual content and detect patterns associated with common client-side frameworks. Combined with indicators such as API endpoints referenced in scripts, these features can be used to flag pages that are unlikely to render correctly with traditional crawlers and may require browser-based crawling.
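A crude version of such a classifier can be sketched as a heuristic over the signals just described (sparse visible text, heavy scripting). The thresholds below are illustrative assumptions; a real system would learn them from labeled mementos.

    # Heuristic sketch for flagging pages that likely depend on client-side rendering.
    from bs4 import BeautifulSoup

    def likely_client_side_rendered(html: str) -> bool:
        soup = BeautifulSoup(html, "html.parser")
        script_len = sum(len(s.get_text()) for s in soup.find_all("script"))
        script_count = len(soup.find_all("script"))
        for tag in soup(["script", "style"]):
            tag.decompose()                        # keep only visible text
        text_len = len(soup.get_text(strip=True))
        # Sparse visible text plus heavy or numerous scripts suggests the real
        # content arrives later via JavaScript and API calls.
        return text_len < 500 and (script_len > 5 * max(text_len, 1) or script_count > 20)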
AI for Enhancing Web Archive Interfaces
While platforms such as Google and others have begun integrating AI into their user interfaces, web archives have largely remained unchanged in this respect. This is notable given the potential of AI to make web archive interfaces more intuitive and more informative for a wide range of users. For example, as my preliminary work suggests, when analyzing content changes, users currently must manually browse long lists of captures or compare multiple archived versions of a webpage. AI could instead automatically identify moments when important changes occur and direct users’ attention to those points in time.
Along the same line, the Internet Archive’s Wayback Machine provides a “Changes” feature that highlights deletions and additions between two snapshots and a calendar view where color intensity reflects the amount of variation. However, this variation is based on the quantity of changes rather than their significance. As a result, many small edits may appear more important than fewer but meaningful modifications. An AI-enhanced interface could address this limitation by incorporating semantic change detection. For instance, a calendar view that highlights when the meaning or message of a page changes can make large-scale temporal analysis more efficient and accessible. Moreover, users could ask natural-language questions such as “When did this page change its message?” or “What were the major updates during a specific period?” and receive concise, understandable answers.
AI could also guide users through large collections by recommending related pages, explaining why certain versions are relevant, or warning when an archived page may contain temporally inconsistent content. For non-experts, visual aids generated by AI, such as timelines, change highlights, or short explanations, could make complex web archive data easier to interpret.
AI in Web Archiving: Challenges
While there are opportunities for AI integration into web archiving, there are also challenges that must be considered.
Technical Challenges
From a technical standpoint, I identified three primary challenges regarding using AI for analyzing archived web content. The first concerns the nature of archived web data. Web archiving systems typically store collected content using the Web ARChive (WARC) format. Each WARC file stores complete HTTP response headers, HTML content, and additional embedded resources such as images and JavaScript files. Although this format provides a structure and allows long-term preservation, it is verbose and was not designed to support AI-based analysis. Consequently, researchers must perform extensive parsing and preprocessing before AI models can effectively use archived web content.
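For instance, a typical preprocessing step might use the warcio library to pull HTML responses out of a WARC file before any model sees the content; the path and filtering criteria below are placeholders, not a prescribed pipeline.

    # Extracting HTML payloads from a WARC file with warcio prior to AI analysis.
    from warcio.archiveiterator import ArchiveIterator

    def iter_html_records(warc_path):
        with open(warc_path, "rb") as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type != "response":
                    continue
                ctype = record.http_headers.get_header("Content-Type") or ""
                if "text/html" not in ctype:
                    continue
                uri = record.rec_headers.get_header("WARC-Target-URI")
                yield uri, record.content_stream().read()

    # for uri, html in iter_html_records("example.warc.gz"):
    #     ...  # strip boilerplate, chunk, embed, etc.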
Second, many web archives, such as the Internet Archive’s Wayback Machine, prioritize long-term storage and preservation over indexing and large-scale content retrieval. As a result, a single web page may have hundreds or even thousands of archived versions over time. Building and maintaining large-scale vector indexes over such temporally dense collections quickly becomes computationally expensive and, in many cases, impractical.
Third, even when working with controlled data scenarios, such as curated web archive collections, AI-driven analysis still depends on the availability of ground truth for evaluation and validation. For instance, training models to detect significant changes across mementos would require large-scale, high-quality annotations that capture not only what changed, but whether those changes meaningfully affect content interpretation. At present, no large-scale annotated datasets exist that support systematic analysis of change significance across archived web versions, creating a major barrier to training and evaluating AI models in this domain.
Ethical Challenges
Beyond technical limitations, the integration of AI into web archive analysis raises important ethical challenges. For instance, web archives preserve content as it existed at specific points in time, often without the consent or awareness of content creators or the individuals represented in that content. When AI models analyze archived web data, they may surface, reinterpret, or amplify sensitive information that was never intended to be reused in new analytical contexts. For this reason, it is important to carefully consider how AI is applied within web archiving. I contend that AI should be viewed as a complementary tool, one that supports, rather than replaces, human judgment. For example, AI can assist in identifying potential moments of relevant changes, flagging or summarizing them, while humans interpret the results and make decisions.
It is also important to note that recent debates highlight growing tensions between web archives and content owners regarding the use of archived data for AI training and analysis. For example, major news publishers have begun restricting access to resources like the Internet Archive due to concerns that archived content is being used for large-scale AI scraping without compensation or consent [8]. In response to such restrictions, researchers and practitioners—including Mark Graham, Director of the Wayback Machine—have argued that limiting access to web archives poses a significant risk to the preservation of digital history [9]. From this perspective, the primary concern is not excessive access, but rather the potential loss of the web as a historical record if archiving efforts are weakened.
Conceptual Challenges
AI models, particularly LLMs, typically operate on individual snapshots of data. As a result, they are not inherently designed to reason about evolution, temporal coherence, or change over time in archived web content. Consequently, answers to temporally grounded questions should not be expected by default when these models are applied without additional structure or context.
In static analysis scenarios, AI models can perform effectively. For example, given a single archived web page, an LLM can generate a summary, identify main topics, extract named entities, or analyze embedded resources such as images, videos, or scripts. Temporal analysis in web archiving, however, requires a different mode of reasoning. The central questions are not “What does this page say?” or “What is this page about?” but rather “What changed?”, “When did it change?”, “Why did it happen?”, and “What impact does the change have over time?” Answering these questions requires comparing multiple archived versions, reasoning based on context, and perhaps correlating changes across web pages.
Integrating AI into web archiving is therefore not only about efficiency, but about enabling new forms of discovery. This requires clearly defining desired outcomes and using AI to support or accelerate processes that have traditionally been manual.
Final Reflections
To conclude, I would like to leave the reader with a set of open questions as we continue moving toward the integration of AI in web archiving. One of the most visible changes introduced by AI is the ability to go beyond syntactic analysis and begin exploring semantic analysis, where meaning, context, and interpretation matter. This shift is not about replacing existing techniques, but about expanding the types of questions we can ask when working with web archive data.
I contend that traditional algorithms remain essential for many web archiving tasks. They are precise, transparent, and well understood. AI, by contrast, offers strengths in areas where rules struggle: interpreting context, assessing relevance, and reasoning across multiple versions of content. Rather than framing this as a competition between algorithms and AI, a more productive question is how these approaches can complement one another, and in which parts of the analysis pipeline each is most appropriate.
In the short term, I consider that AI tools are unlikely to replace algorithmic methods. However, they already show promise as assistive tools that can guide analysis, prioritize attention, and help humans reason about large and complex temporal collections. This naturally raises a forward-looking question: if AI continues to improve in its ability to reason about time, meaning, and change, how should the web archiving community adapt its tools, workflows, and standards?
The WARC format has proven effective for long-term preservation, but it was not designed with AI-driven analysis in mind. Should we aim to augment existing archival formats with AI-aware representations, or should we focus on developing AI methods that better adapt to current standards such as WARC? How we answer this will shape not only how we analyze web archives, but also how future generations will access and understand the web's past.
References
[1] AK, Ashfauk Ahamed. “AI driven web crawling for semantic extraction of news content from newspapers.” Scientific Reports, 2025. [Online]. https://doi.org/10.1038/s41598-025-25616-x.
[2] Abrar, M. F., Saqib, M., Alferaidi, A., Almuraziq, T. S., Uddin, R., Khan, W., & Khan, Z. H. “Intelligent web archiving and ranking of fake news using metadata-driven credibility assessment and machine learning.” Scientific Reports, 2025. [Online]. https://doi.org/10.1038/s41598-025-31583-0.
[3] Nair, A., Goh, Z. R., Liu, T., and Huang, A. Y. “Web archives metadata generation with gpt-4o: Challenges and insights,” arXiv, Tech. Rep. arXiv:2411.05409, Nov. 2024. [Online]. https://arxiv.org/abs/2411.05409.
[4] L. Frew, M. L. Nelson, and M. C. Weigle, “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives,” in Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2023, pp. 71–81. https://doi.org/10.1109/JCDL57899.2023.00021.
[5] T. Sherratt and A. Jackson, GLAM-Workbench/web-archives, https://zenodo.org/records/6450762, version v1.1.0, Apr. 2022. DOI: 10.5281/zenodo.6450762.
[6] Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within timemaps in web archives,” International Journal on Digital Libraries, vol. 17, no. 3, pp. 203–221, 2016. https://doi.org/10.1007/s00799-016-0183-5.
[7] M. C. Weigle, M. L. Nelson, S. Alam, and M. Graham, “Right HTML, wrong JSON: Challenges in replaying archived webpages built with client-side rendering,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Jun. 2023, pp. 82–92. https://doi.org/10.1109/JCDL57899.2023.0002.
Figure 1: Each tweet ID is a unique identifier that encodes the tweet creation timestamp, example adapted from Snowflake ID, Wikipedia.
Web archives, such as the Wayback Machine, are indexed by URL. For example, if we want to search for a tweet we must first know its URL. Figure 2 demonstrates that searching for a tweet URL results in a timemap of that tweet archived at different points in time. Clicking on a particular datetime will show the archived tweet at that particular point in time.
Figure 2: An archived tweet URL results in a timemap consisting of archived copies of the tweet.
Figure 3 shows a screenshot of a tweet shared by @_llebrun. The tweet in the screenshot was originally posted by @randyhillier who later deleted his tweet. The screenshot of the tweet does not have the tweet's URL on the image. Moreover, when a tweet is deleted, we will not be able to find the tweet URL on the live web, nor will we know how to look it up in the archive.
Figure 3: @_llebrun tweeted a screenshot of a tweet originally posted by @randyhillier, who later deleted his tweet.
Therefore, we need to construct the URL of a tweet using only the information present in the screenshot. The structure of a tweet URL is: https://twitter.com/<Twitter_Handle>/status/<Tweet_ID>
We need the Twitter_Handle and Tweet_ID to construct a tweet URL. Each tweet ID is a unique identifier known as the Snowflake ID that encodes the tweet creation timestamp (Figure 1). We can extract the Twitter handle and timestamp from the tweet in the screenshot. In our previous tech report, we introduced methods for extracting Twitter handles and timestamps from Twitter screenshots. Next, we need to determine the tweet ID from the extracted timestamp. We could query the Wayback Machine using only the Twitter handle, but individually dereferencing all of a user's archived tweets would be an exhaustive task. For example, querying the Wayback Machine for @randyhillier's status URLs shows that the number of archived tweets that would need to be dereferenced is huge (42,053). Hence, our goal is to limit the search space by utilizing the timestamp present on the screenshot.
Previously, one could query Twitter to find the timestamp of a tweet given a tweet ID, but this service is no longer freely available. The Twitter API has access rate limits, and metadata from deleted, suspended, or private tweets cannot be accessed through it. Moreover, the Twitter API is now monetized and no longer research-friendly. To address these issues, WS-DL members Mohammed Nauman Siddique and Sawood Alam developed the TweetedAt web service in 2019. The goal of this service is to extract timestamps from Snowflake IDs and to estimate timestamps for pre-Snowflake IDs, which makes TweetedAt a useful tool for finding the timestamp of a tweet ID. Here, however, we need the reverse: a tweet ID prefix determined from a given timestamp.
Reverse TweetedAt
The Snowflake service generates a tweet ID as a 64-bit integer composed of a 41-bit timestamp, a 10-bit machine ID, a 12-bit machine sequence number, and 1 unused sign bit. The timestamp occupies only the upper 41 bits (below the sign bit).
TweetedAt determines the timestamp for a tweet ID by right-shifting the tweet ID by 22 bits (yielding milliseconds since the Twitter epoch) and adding the Twitter epoch offset of 1288834974657 milliseconds.
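For Snowflake-era IDs, that computation is only a couple of lines; a minimal sketch (TweetedAt itself also estimates timestamps for pre-Snowflake IDs, which this does not):

    from datetime import datetime, timezone

    TWITTER_EPOCH_MS = 1288834974657  # the offset described above, in milliseconds

    def tweet_id_to_datetime(tweet_id: int) -> datetime:
        """Recover the creation time encoded in a Snowflake-era tweet ID."""
        ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
        return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

    print(tweet_id_to_datetime(1495226962058649603))  # the tweet from Figure 3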
For Reverse TweetedAt, given a datetime, we want to generate a tweet ID prefix by converting the datetime to milliseconds, subtracting the offset, and left-shifting by 22 bits. This process cannot reconstruct the exact tweet ID, because the lower 22 bits are all zeros, but it does give us a tweet ID prefix for that timestamp. For example, the tweet ID of @randyhillier’s tweet is ‘1495226962058649603’ and the timestamp is ‘9:41 PM · Feb 19, 2022’, as shown in Figure 3. The tweet ID has 19 digits, and the timestamp is at minute-level granularity. From that minute-level timestamp, Reverse TweetedAt computes the 6-digit tweet ID prefix ‘149522’ of the 19-digit tweet ID ‘1495226962058649603’.
Python code to get tweet ID prefix from a Wayback timestamp
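A minimal, self-contained sketch of that code (the epoch constant and 22-bit shift are as described above; computing the prefix as the digits common to the whole time window matches the behavior described below, though the implementation details here are illustrative, and the timestamp is assumed to be UTC, whereas a screenshot timestamp is normally in the viewer's local timezone):

    from datetime import datetime, timedelta, timezone

    TWITTER_EPOCH_MS = 1288834974657  # Twitter's Snowflake epoch offset (ms)

    def snowflake_floor(dt: datetime) -> int:
        """Smallest Snowflake ID whose timestamp bits encode this instant:
        milliseconds since the Twitter epoch shifted into the upper bits,
        with the 22 machine/sequence bits left as zeros."""
        ms = int(dt.timestamp() * 1000) - TWITTER_EPOCH_MS
        return ms << 22

    def prefix_for_window(start: datetime, window: timedelta) -> str:
        """Decimal digits shared by every Snowflake ID that could have been
        generated inside [start, start + window) -- the tweet ID prefix."""
        lo = str(snowflake_floor(start))
        hi = str(snowflake_floor(start + window) - 1)
        if len(lo) != len(hi):
            return ""  # window spans an order of magnitude; no useful prefix
        prefix = ""
        for a, b in zip(lo, hi):
            if a != b:
                break
            prefix += a
        return prefix

    # Wayback-style timestamp (YYYYMMDDhhmmss) read at minute-level granularity
    wayback_ts = "20220219214100"
    start = datetime.strptime(wayback_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    print(prefix_for_window(start, timedelta(minutes=1)))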
We integrated Reverse TweetedAt as a web service alongside TweetedAt. The service accepts a timestamp as user input and returns the corresponding tweet ID prefix, tweet ID regex, and full tweet ID range (Figure 4). It supports multiple valid timestamp formats (e.g., ISO 8601, RFC 1123, Wayback) and provides output at different levels of granularity. For example, Figure 4 shows output for millisecond-level granularity. Because millisecond-level precision is typically unavailable in tweet timestamps, the tool can interpret such inputs at second- or minute-level granularity. Rather than assuming zeros for unknown fields, the tool expands the input into the full corresponding time window (e.g., an entire second or minute), and computes the tweet ID prefix over that interval.
Figure 4: Reverse TweetedAt outputs tweet ID prefix at millisecond-level granularity.
Figure 5: Reverse TweetedAt outputs tweet ID prefix at second-level granularity.
Figure 6: Reverse TweetedAt outputs tweet ID prefix at minute-level granularity.
Tweet ID Regex-based Retrieval Across Temporal Granularity
We can use the tweet ID regex derived from a timestamp to search for archived tweets within a specific temporal window. By querying the Wayback Machine’s CDX API and filtering results using this prefix-based regex, we can identify tweet URLs whose IDs fall within the calculated range. As the timestamp becomes less precise, the tweet ID prefix becomes shorter and the regex search space widens.
For example, the tweet ID of @randyhillier’s tweet shown in Figure 3 is ‘1495226962058649603’. Using TweetedAt, we can recover its timestamp at millisecond-level granularity. Feeding that timestamp back into Reverse TweetedAt, the precise millisecond-level prefix yields 10 archived captures, while the less precise second-level prefix yields 15. Reducing the precision further, to minute-level granularity, still yields 15, indicating that all archived tweets within that broader time window were posted within the same narrower interval. Lower temporal granularity thus expands the potential search space, but a wider ID range does not necessarily produce more results; it only increases the number of possible candidate IDs.
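The tweet ID regex can be sketched simply by fixing the prefix digits and allowing any remaining digits; the regex that the Reverse TweetedAt service actually returns may bound the ID range more tightly than this illustration:

    import re

    prefix = "149522"            # minute-level prefix from the Figure 3 example
    rest = 19 - len(prefix)      # remaining digits of a 19-digit tweet ID
    tweet_id_regex = re.compile(rf"{prefix}\d{{{rest}}}")

    # The tweet from Figure 3 matches; an ID outside the window does not.
    print(bool(tweet_id_regex.fullmatch("1495226962058649603")))
    print(bool(tweet_id_regex.fullmatch("1495551234567890123")))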
CDX API Wildcard Search and Snowflake IDs to Limit the Search Space Using Tweet ID Prefix
We can now determine a tweet ID prefix from a screenshot timestamp using the Reverse TweetedAt service. Since the tweet could have been posted at any time within ±26 hours of the screenshot timestamp, we can determine tweet ID prefixes for the boundaries of that time window. We can then use this window to limit the search space by excluding URLs posted before and after the alleged timestamp. Consider the tweet in the screenshot in Figure 3, where the screenshot timestamp is:
9:41 PM · Feb 19, 2022 (20220219214100)
We compute the tweet ID prefixes for the left-hand boundary (-26 hours) and right-hand boundary (+26 hours) timestamps using Reverse TweetedAt, as listed below:
-26 hours timestamp: 20220218194100 → tweet ID prefix: 14947588
+26 hours timestamp: 20220220234100 → tweet ID prefix: 149554404
As previously mentioned, the timestamp occupies only the upper 41 bits of the tweet ID, so these leading digits reflect posting time. We can therefore take the common portion of the two boundary prefixes (149[4-5]) and do a CDX API wildcard search in the Wayback Machine to limit the search space. The search space reduces to 629 archived tweets, whereas using only the Twitter handle returns 42,053. Dereferencing 629 archived tweets to search for the tweet text shown in a screenshot is a lot of work but feasible, whereas dereferencing 42,053 archived tweets is far too expensive. A CDX API query for @randyhillier's status URLs restricted to the common tweet ID prefix confirms the comparatively small count (629).
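A Python sketch of this narrowing step (the boundary prefixes are the values computed above; issuing one prefix query per possible leading digit is an illustrative way to realize the 149[4-5] wildcard against the public CDX API, not necessarily the exact query we used):

    import requests

    CDX = "https://web.archive.org/cdx/search/cdx"

    # Tweet ID prefixes for the -26 and +26 hour boundary timestamps.
    lo_prefix, hi_prefix = "14947588", "149554404"

    # Common leading digits of the two boundary prefixes ("149"), after which
    # the next digit can only range from 4 to 5 -- the 149[4-5] pattern.
    common = ""
    for a, b in zip(lo_prefix, hi_prefix):
        if a != b:
            break
        common += a

    candidates = set()
    for digit in range(int(lo_prefix[len(common)]), int(hi_prefix[len(common)]) + 1):
        params = {
            "url": f"twitter.com/randyhillier/status/{common}{digit}*",
            "output": "json",
            "fl": "original",
            "collapse": "urlkey",  # one row per distinct archived tweet URL
        }
        rows = requests.get(CDX, params=params, timeout=120).json()
        candidates.update(row[0] for row in rows[1:])  # skip the header row

    print(len(candidates), "candidate archived tweet URLs in the ±26 hour window")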
It is easy to search for a tweet in the Wayback Machine when you know its URL, but a screenshot of a tweet typically does not include that URL. However, the Twitter handle and timestamp visible in the screenshot can be used to search for the tweet in the Wayback Machine web archive. Given a datetime, Reverse TweetedAt produces a tweet ID prefix, which we can then use to grep through a CDX API response listing all archived tweets associated with a Twitter account. Using Reverse TweetedAt, we can determine approximate tweet IDs for the left-hand and right-hand boundaries of a window around the screenshot timestamp, and we found that a CDX API wildcard search based on their common tweet ID prefix limits the search space. The process of finding candidate archived tweets for the tweet in the screenshot is thus far more tractable. We published a paper at the 36th ACM Conference on Hypertext and Social Media, “Web Archives for Verifying Attribution in Twitter Screenshots,” which discusses how we can further use the candidate archived tweets to verify whether the tweet in the screenshot was posted by the alleged author.
In Brief: This study examines the concept of neutrality in Library of Congress Subject Headings and the subject approval process by analyzing proposed headings that were rejected over a nearly 20-year period. It considers the place of neutrality in libraries more generally and argues that equity, rather than neutrality, is the appropriate lens for judging subject heading proposals. Finally, it recommends several reforms that could improve the subject heading process and make it more equitable.
If a train is moving down the track, one can’t plop down in a car that is part of that train and pretend to be sitting still; one is moving with the train. Likewise, a society is moving in a certain direction—power is distributed in a certain way, leading to certain kinds of institutions and relationships, which distribute the resources of the society in certain ways. We can’t pretend that by sitting still—by claiming to be neutral—we can avoid accountability for our roles (which will vary according to people’s place in the system). A claim to neutrality means simply that one isn’t taking a position on that distribution of power and its consequences, which is a passive acceptance of the existing distribution. That is a political choice.[1]
Introduction
Library workers and patrons have long been frustrated with Library of Congress Subject Headings (LCSH) for being out of date and lacking well-known concepts with abundant usage. Contributors to the Subject Authority Cooperative Program (SACO) have made many improvements to LCSH by proposing new headings and revising existing terms. Those attempts, however, have sometimes been hampered by the Library of Congress’s (LC) preference for supposed neutrality within the vocabulary; Subject Headings Manual (SHM) instruction “H 204,” released in 2017, specifically dictates that proposed headings should “employ neutral (i.e., unbiased) terminology.”[2]
This desire for neutrality has been directly stated, alluded to, or otherwise upheld in myriad rejections of proposed subject headings, from Negative campaigning[3] to White flight.[4] Even Water scarcity, a quantifiable concept of worldwide concern, was rejected in 2008 as a non-neutral topic requiring value judgments with the following justification:
Works on the topics of water scarcity and water shortage have been cataloged using the heading Water-supply, post-coordinating[5] as necessary with additional headings such as Water conservation and Water resources management. The meeting determined that this practice is appropriate and should continue, since Water-supply is a neutral heading that does not require a judgment about the relative abundance of water.[6]
However, what exactly constitutes neutral and unbiased terminology is never defined in “H 204” or anywhere else in the SHM, nor in any other Library of Congress controlled vocabulary manuals.[7] Much of the previous literature on neutrality in libraries focuses on debates over possible definitions of the term and what role neutrality should play in library services and collections. Building off previous critical cataloging literature, which focuses on addressing problematic terms, subject hierarchies, and biases within cataloging standards, this article extends that scrutiny further. We analyze how neutrality is embedded in the LC structures and systems that vet the terms catalogers utilize to describe materials.
Our article examines the ways in which neutrality is enforced in LCSH rejections between July 2005 and December 2024. We review “Summaries of Decisions” from LC Subject Editorial Meetings (along with associated discussion and commentary in the field); within these, we identify and interpret patterns of justifications used to reject subject heading proposals and maintain purported neutrality within the vocabulary. We argue that neutrality has been used to keep many concepts depicting prejudice (racism, sexism, etc.), as well as concepts related to the lived experiences of marginalized people, out of the vocabulary and/or to obscure materials about those topics under other, often more generalized or euphemistic, terminology. As a counterpoint, we suggest a values- and equity-driven approach to replace the principle of neutrality in a cataloging context and within the subject approval process. We acknowledge that the current political situation may be particularly fraught for equity-driven change, but believe bowing to political pressures is untenable, and continued pursuit of neutrality will only serve to further the discordance between library values and the realities of LCSH.
Background
Neutrality: Assumed, but Nebulous
Schlesselman-Tarango notes the perceived conceptual importance of neutrality for libraries and librarianship; their “status as ‘an essential public good’” is “contingent on the perpetration of the idea that [they are] also neutral.”[8] Seale further situates this notion of libraries-as-neutral as not externally imposed, but emanating from within librarianship itself: “The positioning of the library as a neutral and impartial institution, separated from the political fray, resonates with dominant library discourse around libraries.”[9]
However, despite both critics and supporters assuming that neutrality is fundamental to librarianship, there is a dearth of references to the term in official documents underpinning the ethics and standards of the library profession. The American Library Association’s (ALA) Working Group on Intellectual Freedom and Social Justice observed, for example, that “the word neutrality does not appear in the Library Bill of Rights, the ALA Code of Ethics, and any other ALA statements that the Working Group could locate. It does not appear in the Intellectual Freedom Manual (10th Edition) nor is it defined in any official ALA document or policy.”[10] The International Federation of Library Associations and Institutions’s (IFLA) Code of Ethics mentions but does not define neutrality in Section 5, in sentences such as “Librarians and other information workers are strictly committed to neutrality and an unbiased stance regarding collection, access and service.”[11] For catalogers in particular, the Cataloging Code of Ethics, issued in 2021 and discussed further below, explicitly disputes the concept of neutrality.
Most pertinent to the subject proposal process, the National Information Standards Organization’s (NISO) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies mentions neutrality exactly twice, yet again without definition. The first instance, in guidance about choosing preferred forms of terms, asserts that “Neutral terms should be selected, e.g., developing nations rather than underdeveloped countries.”[12] The second appearance, in a discussion of synonyms, notes “pejorative vs. neutral vs. complimentary connotation[s]” of terms that might influence usage.[13] The latter reference positions neutrality as the impartial fulcrum of term meanings, while the former implies, particularly via the example, a more active attempt at choosing equitable and unbiased terminology.
Although the terms “neutral” and “unbiased” are often linked when they appear in library literature (as in the IFLA Code of Ethics), they are not synonymous. Oxford English Dictionary (OED) definitions of neutral include “inoffensive,” and “not taking sides in a controversy, dispute, disagreement, etc.”; unbiased, however, while meaning “not unduly or improperly influenced or inclined; [and] unprejudiced,” does not necessarily imply a lack of involvement in social or political issues.[14] The incompatibility between neutrality as inoffensive isolation versus unbiasedness as active equity plays out repeatedly in library discussions. Without clear definitions, neutrality in the NISO Guidelines and elsewhere is open to conjecture and interpretation. As noted by Scott and Saunders, “[T]he term ‘neutrality’ seems to be used for, or conflated with, everything from not taking a side on a controversial issue to the objective provision of information and a position of defending intellectual freedom and freedom of speech.”[15]
Proponents of library neutrality don’t fully agree on definitions, either. In Scott and Saunders’s survey, some describe it as “lacking bias,” which more closely aligns with principles of equity.[16] The depiction of neutrality by LaRue, the former Director of the ALA’s Office for Intellectual Freedom, also appears to resemble equity; he frames neutrality as not “deny[ing] people access to a shared resource just because we don’t like the way they think” and giving everyone “a seat at the table.”[17] Dudley, reframing library neutrality in relation to pluralism, highlights similar values; his proposed ethos calls on librarians to “adhere to principled, multi-dimensional neutrality” which includes “welcoming equally all users in the community” and “consistently-apply[ing] procedures for engaging with the public.”[18]
The 2008 book Questioning Library Neutrality examines many aspects of why neutrality is both an illusion and a misguided aspiration, and also disabuses readers of the idea that it has always been a core value. Rosenzweig points out that neutrality as a principle of librarianship does not go back to the early development of public libraries:
We would do well to remember that, if libraries as institutions implicitly opened democratic vistas, our librarian predecessors were hardly democratic in their overt professional attitude or mission, being primarily concerned with the regulation of literacy, the policing of literary taste and the propagation of a particular class culture with all its political, economic and social prejudices. In fact, the idea of the neutrality of librarianship, so enshrined in today’s library ideology (and so often read back into the indefinite past), was alien to these earlier generations.[19]
Although Macdonald and Birdi’s literature review identifies four conceptions of neutrality within library science literature—“favourable,” “tacit value,” “libraries are social institutions,” and “value-laden profession”—the authors found that depictions of neutrality articulated by practitioners are more complicated. Many have “ambivalent” views of neutrality, seeing it as “a slippery and elusive concept.”[20] The relative importance of neutrality to proponents varies, depending on its position vis-à-vis other library values: “When it is alone, or grouped with a simple, single other value like professionalism, it is very low in priority. When it is presented in a group of other values or left implicit, it fares better.”[21] Catalogers tended to espouse neutrality the least among library specializations, with 21% reporting that they never think about neutrality.[22] Further, some surveyed librarians “are more likely to eschew neutrality on matters of social justice,” when neutrality comes into conflict with core library values.[23]
Neutrality versus Social Justice
Since the late 1960s, neutrality has increasingly come into question as librarians have embraced ideals centering social justice, equity, diversity, and inclusion, particularly in the ALA.[24] These values, codified in the ALA Code of Ethics and Library Bill of Rights, include a commitment to “recognize and dismantle systemic and individual biases; to confront inequity and oppression; to enhance diversity and inclusion; and to advance racial and social justice in our libraries, communities, profession, and associations.”[25] ALA resolutions go a step further, acknowledging the “role of neutrality rhetoric in emboldening and encouraging white supremacy and fascism.”[26] Scott and Saunders sum up the issue, noting that while some librarians cast neutrality as a “fundamental professional value, albeit one that is not explicitly mentioned in the professional codes of ethics and values,” others assert that it is “a false ideal that interferes with librarians’ role of social responsibility, which is an explicitly stated value of librarianship.”[27] As Watson argues in an ALA 2018 Midwinter panel on neutrality in libraries, “We can’t be neutral on social and political issues that impact our customers because, to be frank, these social and political issues impact us as well.”[28]
Even among library codes of ethics that explicitly hold neutrality as a core value, there is a tension between practitioners and official documentation. For example, the Canadian Federation of Library Associations / Fédération canadienne des associations de bibliothèques (CFLA-FCAB) Code of Ethics calls for librarians to “promote inclusion and eradicate discrimination,” provide “equitable services,” and “counter corruption directly affecting librarianship”; but the Code also advocates for neutrality, advising librarians to “not advance private interests or personal beliefs at the expense of neutrality.”[29] Once again neutrality remains undefined—though it’s implied, based on context, to be not taking sides, matching one of the OED definitions above. This understanding accords with a 2024 study on Canadian librarians, which noted most Canadian academic librarians seem to have coalesced around defining neutrality as “not taking sides,” followed by “not expressing opinions.”[30]
Yet the same study also highlights a perceived incompatibility of neutrality with other values of librarianship, with “the majority (54%) of respondents” disagreeing or strongly disagreeing that “‘neutrality is compatible with other library values and goals,’” and 58% disagreeing “that it is ethical to be neutral.”[31] Brooks Kirkland asserts that assuming neutrality as a key tenet of librarianship conflicts with such principles as promoting inclusion and eradicating discrimination.[32] Pagowsky and Wallace note that, whether knowingly or not, upholding neutrality within inequitable systems ultimately supports them: “Trying to remain ‘neutral,’ by showing all perspectives have value … is harmful to our community and does not work to dismantle racism. As Desmond Tutu has famously said, ‘If you are neutral in situations of injustice, you have chosen the side of the oppressor.’”[33]
Cataloguing Code of Ethics, Critical Cataloging, and Other Recent Developments
The incongruity between neutrality and social justice as core library values has sparked the numerous debates detailed above and on mailing lists and social media. It has also led in part to the expansion of the critical cataloging movement and the creation of the Cataloguing Code of Ethics, published in 2021 and since adopted by several library organizations, including the ALA division Core. The Code explicitly refutes the concept of neutrality; it avers that “neither cataloguing nor cataloguers are neutral,” and calls out the biases inherent within the dominant, mostly Western cataloging standards currently in use. It particularly notes that “cataloguing standards and practices are currently and historically characterised by racism, white supremacy, colonialism, othering, and oppression.”[34]
The most well-known critical cataloging subject heading proposal was the attempt to change the now-defunct heading Illegal aliens, as depicted in the documentary Change the Subject. In November 2021, five years after LC initially announced it would change the Illegal aliens subject headings and then backtracked after political pressure, LC announced it would replace the subject headings Aliens and Illegal aliens. However, LC did not adopt the changes it had initially announced, nor the recommendations made in a report by the ALA Subject Analysis Committee (SAC), which included revising the term to Undocumented immigrants.[35] LC instead split Illegal aliens into two new headings: Noncitizens and Illegal immigration.[36] Librarians have criticized the retention of “illegal” within one of the updated headings for continuing to make library vocabularies “complicit” with the “legally inaccurate” criminalization of undocumented immigrants.[37]
Other critical cataloging proposals have been subjected to inordinate scrutiny by LC; even when headings have been approved, they have sometimes faced heavy editing and modification. One example is Blackface, where LC’s changes to the proposal obscured the racism characterizing the phenomenon. The broader term (i.e., the parent in the subject hierarchy) was altered from Racism in popular culture to Impersonation.[38] Since Impersonation falls under the broader terms Acting, Comedy, and Imitation, this change emphasizes the performance aspect in lieu of its racist connotations. Similarly, the scope note (i.e., definition), was modified from “Here are entered works on the use of stereotyped portrayals of black people (linguistic, physical, conceptual or otherwise), usually in a parody, caricature, etc. meant to insult, degrade or denigrate people of African descent” to “Here are entered works on the caricature of Black people, generally by non-Black people, through the use of makeup, mannerisms, speech patterns, etc.”[39] As noted by Cronquist and Ross, these changes ultimately “neutralize[d]” the proposal “in the name of objectivity.”[40]
However, there have also been numerous successful updates to outdated terminology and additions of missing concepts, particularly in recent years. For example, in 2021, fifteen subject headings for the incarceration of ethnic groups during World War II, including Japanese Americans, were changed from the euphemistic phrase –Evacuation and relocation to –Forced removal and internment.[41] The African American Subject Funnel added the new heading Historically Black colleges and universities in 2022 and helped to revise Slaves to Enslaved persons in 2023; the Gender and Sexuality Funnel successfully changed the heading Gays to Gay people, and proposed the new term Gender-affirming care, in 2023; and the Medical Funnel updated Hearing impaired to Hard of hearing people in 2024.[42]
On a hopeful note, many of these large-scale projects were coordinated with Cataloging Policy Specialists within LC, who worked closely with catalogers during the process and ensured that related terms and related Library of Congress Classification numbers were updated as well. Further, LC has taken some recent steps to improve its vocabularies and create avenues for increased input from outside institutions. This includes hiring a limited-term Program Specialist to help redress outdated terminology related to Indigenous peoples. LC also created two advisory groups, for Demographic Group Terms and for Genre/Form Terms, both of which allow for greater community input into these vocabularies.
Still, frustrations remain. Changing outdated terminology is a complicated process. Library of Congress vocabularies, in particular, are vulnerable to potential governmental interference. Attempted Congressional intervention during the updating of Illegal aliens and the passing of a statute mandating transparency in the subject approval process led to the creation of “H 204” codifying LC’s preference for a neutrality uninvolved in political and social issues.[43] The complication of bibliographic file maintenance (e.g., reexamining cataloged materials to determine whether subject headings should be changed, deleted, or revised) also muddies the waters and impedes large-scale projects. Staffing issues within LC further hinder the ability to undertake or complete projects, as seen in the SACO projects process, paused in 2025 due to LC’s catalog migration.
Maintaining LCSH
Library workers are familiar with LCSH in our discovery tools, and most are aware of concerns about outdated and problematic headings. However, they may not see debates and conflicts about new headings and ongoing maintenance of the vocabulary as a built-in and inherent part of the system, as catalogers who engage in that work do.
As Gross asserts:
To remain effective, headings must be regularly updated to reflect current usage. Today’s LCSH People with disabilities used to be Handicapped and, before that, Cripples. Additionally, new concepts require new headings, such as the recently created Social distancing (Public health), Neurodiversity, and Say Her Name movement. The process of determining which word or phrase to use as the subject heading for a given topic is inevitably fraught and can never be free of bias. The choice of terms embodies various perspectives, whether they are intentional and acknowledged or not.[44]
Both the need to continually revise existing headings and create new ones, and indeed wrangling over what they should be, are not defects, nor a surprise. They flow directly from the purpose of controlled vocabulary and the complications of language it exists to help navigate—the ever-changing and endless variety of ways to refer to things.
Some of the frequency and intensity of debates about LCSH stem from the fact that it attempts to be a universal vocabulary that covers all branches of knowledge. While it is created and maintained primarily for the needs of the Library of Congress, it is used by all kinds of libraries. Balancing the need to serve a user base that consists of federal legislators and providing the world with a one-size-fits-all vocabulary is clearly a formidable and contradictory endeavor. In recent decades, LC has made significant progress in opening up the maintenance process to input and contributions from the broader library community via the SACO program. These changes appear to be partly in response to demands to make the process faster and more transparent, but also a desire by LC to incorporate broader perspectives and experiences and to help with the tremendous workload.
LCSH Creation and Revision Process
The SACO program, created circa 1993,[45] allows librarians to submit proposals for new or revised LCSH terms (as well as other LC vocabularies) to the Library of Congress. In order to submit proposals, catalogers are expected to be familiar with the Subject Headings Manual (SHM), which governs LCSH usage and formulations as well as the proposal process, required research, and criteria used to evaluate proposals.[46] One of the primary requirements is literary warrant: proposers must demonstrate that there is a need for the new subject heading based on a work being cataloged.[47] Beyond the work cataloged and published/reference sources, librarians can also cite user warrant, “the terminology people familiar with the topic use to describe concepts,” as justification in proposals.[48] This can include reviews, blog posts, social media threads, LibGuides, etc.
After a proposal is submitted, LC staff schedule it to a monthly “Tentative List,” which is published to allow for public comment on proposed headings. Taking those comments and SHM instructions into account, members of LC’s Policy, Training, and Cooperative Programs Division (PTCP) make a decision about whether to add the proposed heading to LCSH, send it back to the cataloger for revision and resubmission, or reject it. If the heading is not added, a monthly “Summary of Decisions” document details the reasons for its exclusion. While the SACO program allows external librarians to submit proposals, the Library of Congress maintains its “authority to make final decisions on headings added.”[49]
Most proposals are routine and relatively straightforward, such as those that follow patterns—repeated formulations of similar subjects that provide a predictable search structure for library patrons (e.g., Boating with dogs already exists and the cataloger wants to propose Boating with cats). SHM “H 180” notes that patterns help achieve desired qualities for the vocabulary, including “consistency in form and structure among similar headings.”[50] LC is also concerned with avoiding multiple subject headings that convey too closely related concepts. LCSH online training “Module 1.2” highlights both “consistency and uniqueness among subjects” as strengths of controlled library vocabularies, for instance.[51] Proposals that don’t follow patterns therefore receive more scrutiny, to make sure they are unique, definable topics. LC makes judgment calls based on the strength of the evidence in proposals, and on SHM instructions, including the guidance in “H 204” about neutrality.
Neutrality within LC Documentation
Within its official documentation on subject headings, LC mentions neutrality sparingly. In the entirety of the SHM, the word neutral appears only once, specifically in guideline “H 204” with the recommendation that catalogers “employ neutral (i.e., unbiased) terminology.”[52] Apart from an association with the term unbiased, neutral is not defined in “H 204” or anywhere else in the SHM. Online LCSH training, freely available from the Library of Congress website, offers similarly little on the concept of neutrality. “Module 1.4” recommends that catalogers “accept the idea that all knowledge is equal” and “remain neutral … and attempt to be as objective as possible” when describing material.[53]
Despite the lumping together of neutral and unbiased in “H 204,” a neutrality which calls for a static ignoring of social realities and historical context does not equal an unbiased active engagement against prejudice. The Merriam-Webster Dictionary’s definitions of “neutral” and “unbiased” make this clear. “Neutral” as “indifferent” and politically nonaligned echoes OED. But the definition of “unbiased” goes even further, meaning not just free from prejudice and “favoritism” but “eminently fair”[54]—an active and flexible balancing of interests inherently at odds with static and detached neutrality. Eliding the two concepts risks undermining the latter, and with it library ethics and values, resulting in the further entrenchment of Western, colonial, and other biases in LCSH.
The definition of neutrality that LC, and by extension LCSH, seems to favor is one of passivity. Neutrality as indifference to social realities appears, for instance, in LCSH training “Module 1.4.” The module acknowledges that library vocabularies “are culturally fixed” and “from a place; they are from a time; they do reflect a point of view.” However, rather than using that “realiz[ation]” to encourage periodic updating of outdated or potentially prejudicial content in LCSH, the module advises “accepting” that cultural fixity as immutable fact; it recommends that catalogers “remain neutral, suspend disbelief” and focus on (undefined) objectivity instead.[55] Objectivity also appears in “H 180,” which advises catalogers: “Avoid assigning headings that … express personal value judgments regarding topics or materials. … Consider the intent of the author or publisher and, if possible, assign headings … without being judgmental.”[56]
Here, as in “Module 1.4,” objectivity appears linked to neutrality; the implication is that a subject can only be described without bias if a cataloger is dispassionate and has no opinions on the topic. However, not all definitions of objectivity match this interpretation. Although OED defines objectivity as “detachment” and “the ability to consider or represent facts, information, etc., without being influenced by personal feelings or opinions,” Merriam-Webster’s definition is “freedom from bias” and a more actively equitable “lack of favoritism toward one side or another.”[57]
This disparity in meanings raises the question: What does it mean to describe a topic without judgment or bias? Is objectivity erasing any uncomfortable content in a topic, even if that erasure favors a biased status quo and/or muddies a topic’s meaning? Or, rather, is it objective to label something truthfully, even if the topic raises strong feelings? As demonstrated by the revisions to Blackface discussed above, changes to the scope note and broader term in the name of objectivity did not result in a clearer or less biased heading; instead, they obfuscated the racist intent behind the phenomenon.
Similarly, despite the assertion in “H 180,” a singular focus on authorial intent does not always result in a lack of bias or judgment in subjects. As noted by literary critics such as Wimsatt and Beardsley, “placing excessive emphasis on authorial intention [leads] to fallacies of interpretation,”[58] since readers only have access to the text in front of them; attempting to guess an author’s intent is already an act of judgment, not a discovery of objective facts. Further, if an author writes a prejudicial text, taking its content at face value risks replicating that bias through subject provision. LCSH terms such as Holocaust denial literature recognize and counter this, labeling Holocaust denial works as ones “that diminish the scale and significance of the Holocaust or assert that it did not occur.”[59] If catalogers relied strictly on authorial intent in the name of objectivity, those works would instead get misleading subjects such as Holocaust, Jewish (1939-1945) instead of Holocaust denial literature, tacitly legitimizing bias.
Thus, the SHM’s focus on objectivity and neutrality highlights incongruities and tensions within subject guidance and LCSH vocabulary itself between indifference and self-imposed inoffensiveness on the one hand, and actively countering bias and promoting equity on the other. As will be shown below, rejections in the name of neutrality reveal that in fact the proposal process itself has never been neutral or apolitical.[60]
Neutrality and SACO Rejections
LC’s adherence to an inflexible and indifferent definition of neutrality, critiquing proposals engaging with social and political realities as subjective and relying on value judgments, has led to the rejection of multiple headings that surface prejudice or describe the lives and experiences of marginalized peoples. Instead, rejections upholding neutrality reinforce hegemonic societal attitudes within LCSH.
Neutrality appears in several guises in proposal rejections in “Summaries of Decisions” from 2005 to 2025. The most obvious ones reference “H 204” and “neutral (i.e., unbiased) terminology,” including the 2008 rejection of Water scarcity and the 2024 rejection of White flight (discussed in more depth below).[61] Similar rejections use words such as “judgment” (including Negative campaigning in 2013, and Zombie firms in 2023); “pejorative” (e.g., Dive bars in 2010, and Banana republics in 2015); “vulgar and offensive” (such as Vaginal fisting and Anal fisting in 2010); “subjective” (such as African American successful people in 2009); “viewpoint” (including Jim Crow laws in 2019); and “non-loaded language” (e.g., Incarceration camps in 2024).[62]
Neutrality as non-involvement in political and social realities also appears in the rejection of proposals due to LC’s Policy, Training, and Cooperative Programs Division (PTCP)’s unwillingness to establish certain “patterns” of subject headings (i.e., set precedents for future headings of specific types). Pattern rejections often appear entirely arbitrary; that is, the rejections stated merely that PTCP did not wish to begin a pattern, and not that a proposal as formulated was missing vital elements, had no warrant, or did not conform to provisions stipulated in the SHM. Despite acknowledging in “Module 1.4” that the wrong subject heading “can make any resource in the collection ‘disappear,’”[63] these rejected patterns render certain topics invisible and unsearchable by library patrons.
Uncreated patterns include critiques of prejudicial attitudes and behaviors, particularly by governmental bodies, such as rejections of Prison torture in 2007 or Religious profiling in law enforcement in 2024.[64] Similarly, patterns that would have highlighted the unearned privilege and/or bigotry of certain groups remain largely unestablished, including Holocaust deniers (2016), Toxic masculinity (2020), and White privilege (rejected in 2011 and 2016, before finally being accepted as White privilege (Social structure) in 2022).[65] The rejection of White fragility in 2020 is particularly interesting, as the rationale was that “LCSH does not include any headings that ascribe an emotion or personality trait to a specific ethnic group or race, and the meeting does not want to begin the practice.”[66] However, LCSH has included since 2010 the heading Post-apartheid depression, meant to convey the mental health and feelings of white Afrikaners. So not all white people’s emotions appear off-limits—just ones that reveal systemic biases. PTCP also declined to create patterns naming discrimination directed at certain groups, such as Police brutality victims in 2014 and Missing and murdered Indigenous women in 2023.[67] In the latter case, the rejection of a term meant to highlight societal neglect of the violence against Indigenous peoples means that their existence and trauma continue to be hidden in library vocabularies and catalogs.
Pattern rejections not only make prejudices invisible in library catalogs, they also underrepresent concepts that celebrate or describe the cultures and experiences of marginalized peoples. Erasures of joy can be as damaging as erasures of struggle. Aronson, Callahan, and O’Brien’s discussion of themes related to people of color in picture books, for instance, could equally apply to messages portrayed in LCSH via what topics it hides or surfaces in library catalogs: a “predominance of Oppression … at the expense of other types of portrayals can send a message that suffering and struggle are definitive of a group’s experience, or even of victimhood.”[68] Instead, marginalized people “deserve to see themselves represented as people who lead full and dynamic lives and who are not fully defined by histories of oppression.”[69] Unaccepted subject headings of this type include African American successful people (2009), Overweight women’s writings (2011), Gay neighborhoods and Lesbian neighborhoods (2012), Gay personals (2018), Afro-pessimism (2021), and Indigenous popular culture (2024).[70]
Absorbing a proposed critical term into a supposed “positive” equivalent also served to preserve an inoffensive neutrality in LCSH; this is seen in the rejection of Food deserts in 2014:
The concept of food desert has been defined in multiple ways by various governments and organizations, often in ways to suit their specific political agendas … The existing heading Food security is defined as access to safe, sufficient, and nutritious food. The existing heading is used for both the positive and negative (it has a UF [cross-reference for] Food insecurity), and the meeting feels that it adequately covers the concept of a food desert.[71]
Similarly, LC rejected a proposal for Genocide denial in 2017 with the rationale that the “positive” heading—Genocide—was sufficient for patron access: “A heading for a concept in LCSH includes both the positive and negative aspects of that topic. A work about the denial of genocide still discusses the concept of Genocide.”[72] Slum clearance was also rejected in 2007 in favor of the euphemistic and supposedly equivalent Urban renewal.[73]
Sometimes rejections upholding neutrality appeared in the guise of a fear that the term might be misapplied. For instance, although LC acknowledged in its 2019 rejection of Jim Crow laws and Jim Crow (Race relations) that the headings described laws and attitudes promulgated during a specific time period—which could therefore be described in a scope note guiding subject usage—it claimed that “the meeting is also concerned that the heading would be assigned only if the phrase Jim Crow is used in the title.”[74] In other words, the rejection prioritized avoiding possible future confusion over a definable term with ample literary and user warrant. The potential for definitional uncertainty also fueled other rejections, such as Femicide and Secret police in 2010, and Forced assimilation in 2024.[75] To preempt said confusion in all of these cases, LC could have added scope notes defining appropriate usage. Subjects have been remediated in the past when found to be misused, via clarifying scope notes or additional term creation, as with Romance literature (now Romance-language literature) versus Love stories (now Romance fiction).[76] Instead of denying the proposal due to a fear that a term might be misapplied, LC could have worked with the proposers to ensure the heading clearly defined the topic and, if necessary, made a public announcement with additional guidance on how to retrospectively add the term.
Overly-limiting definitions of subjects also provided reasoning for neutrality-based proposal rejections. An attempt in 2011 to add the natural language phrase Queer-bashing as a cross-reference under the then-current heading Gays–Violence against, for example, was rejected with the justification that “queer-bashing is not necessarily violent.”[77] Intersexuality–Law and legislation, a heading reflecting ongoing debates about genital surgeries on infants and legally-recognized genders, was rejected in 2016 because “The subdivision –Law and legislation free-floats [i.e., can be used] under ‘headings for individual or types of diseases and other medical conditions, including abnormalities, functional disorders, mental disorders, manifestations of disease, and wounds and injuries’ (SHM H 1150).”[78] The medicalizing language of the rejection reinforced the view of intersexuality as a “condition” or “disorder” needing fixing, rather than the natural human diversity of a group struggling for bodily autonomy and human rights. The rejection of Redlining in 2024 also fits this definitional pattern. Despite acknowledging that Redlining “functioned in many different financial contexts,” LC’s rejection implied that redlining’s definition was too broad, as LC preferred “the specificity of … separate headings.”[79] This continues to fracture the topic into multiple subjects such as Discrimination in financial services, Discrimination in mortgage loans, and Discrimination in credit cards. The rejection also sidestepped notions of governmental complicity in redlining, and whitewashed the topic by making it appear less systemic in nature.
Purported limitations of the vocabulary also served as justification for rejecting proposals and upholding LCSH neutrality. For instance, Butch/femme (Gender identity) was deemed “too narrow and specialized for a general vocabulary such as LCSH” in 2011 (though Butch and femme (Lesbian culture) was later approved in 2012)[80]—this, despite the copious presence of narrow terms in LCSH about other topics, such as Madagascar hissing cockroaches as pets, Photography of albatrosses, Church work with cowgirls and Zariski surfaces. Anal fisting and Vaginal fisting were rejected with the same rationale in 2010 (in addition to the “vulgar and offensive” argument described above).[81] Two rejections utilizing the same reasoning raise the question of whether queer cultures and identities were evaluated using particularly stringent criteria. As one librarian noted in the RADCAT mailing list after the rejection of Butch/femme (Gender identity):
This is especially baffling given that Bears (Gay culture) has been a valid subject heading for years, and both concepts have about the same amount of literary warrant. For those of you keeping track at home, this isn’t the first example of this rejection. During The Great Fisting Debacle of 2010 … the Anal fisting and Vaginal fisting proposals were shot down using the same language. I haven’t seen PSD [the prior name for PTCP] rejecting scientific or technical heading proposals as too specialized, which makes me wonder if it’s only gender & sexuality-related headings that receive this type of scrutiny.[82]
Troublingly, rejections for queer identities have continued since LC resumed processing tentative lists in January 2025, particularly for queer youth proposals. The rejection of Sexual minority high school students, for instance, indicates potential deference to current governmental queerphobia, particularly since the phrase “At this time” prefaces the justification: “At this time, it is not desirable to qualify headings for this age group by gender identity or expression/sexual orientation.” LC’s suggestion that instead “[t]erms from other subject vocabularies such as Homosaurus may be used instead of, or in conjunction with, existing LCSH headings to express the topic” suggests that there is no place for queer youth identity headings within LCSH.[83]
Finally, proposals were rejected in favor of maintaining pre-existing biases in LCSH–the cultural fixity mentioned in LCSH training “Module 1.4.”[84] For instance, a 2015 rejection of a change proposal related to Indigenous peoples–South Africa highlighted in its rationale the scope note for Indigenous peoples defining them entirely in relation to colonial power: “Here are entered works on the aboriginal inhabitants either of colonial areas or of modern states where the aboriginal peoples are not in control of the government.”[85] Sometimes, even the longevity of a term within LCSH was treated as sufficient reason to reject proposals meant to update outdated and inequitable terms, as with the 2020 rejection of a proposed change from Juvenile delinquents to Juvenile prisoners: “The existing heading Juvenile delinquents has been used for this concept for many years. At this point, it would be practically impossible to examine the entire file so the new heading could be applied accurately. The heading Juvenile delinquents should be assigned instead.”[86] This hesitance to tackle large projects because of the labor required for bibliographic file maintenance perpetuates the tendentious language present in LCSH and reinforces the view that the proposal process is itself not neutral.
Case Study: White Flight
In 2024, the African American Subject Funnel Project submitted a subject proposal for White flight. The proposal cited Kruse’s book White Flight: Atlanta and the Making of Modern Conservatism to demonstrate literary warrant. It additionally cited three reference sources—Encyclopedia of African-American Politics, The New Encyclopedia of Southern Culture, and Wikipedia—in order to define the term and demonstrate that it is commonly used by scholars and the public.
[Source]: Encyclopedia of African-American politics, 2021 (“White flight” is the term used to refer to the tendency of whites to flee areas and institutions once the percentage of blacks reaches a certain level)
[Source]: The new encyclopedia of southern culture, 2010 (The term “white flight” refers to the spatial migration of white city dwellers to the suburbs that took place throughout the United States after World War II. One of the most powerful and transformative social movements of the 20th century, white flight significantly affected the class and racial composition of cities and metropolitan areas and the distribution of a conservative postwar political ideology)
[Source]: Wikipedia, 16 Oct. 2023 (White flight or white exodus is the sudden or gradual large-scale migration of white people from areas becoming more racially or ethnoculturally diverse. Starting in the 1950s and 1960s, the terms became popular in the United States; examples in Africa, Europe, and Oceania as well as the United States)
However, LC rejected White flight with the following rationale: “LCSH does not currently have an established pattern that combines the topic of migration with the social reasoning for that migration. The meeting was concerned that introducing such a pattern, particularly in this case, would contradict the practice in LCSH of preferring neutral, unbiased terminology as stated in SHM H 204 sec. 2.”[87]
After this Summary of Decisions was issued, librarians on the SACOLIST mailing list publicly disagreed with the rejection and pointed out the flaws in LC’s argument. One poster highlighted the fact that the term was in common use and searched for by library patrons; they also noted another heading already in LCSH that fit the pattern PTCP claimed didn’t exist:
According to H 204 Section 2, the proposed heading should “reflect the terminology commonly used to refer to the concept,” which I believe is the case with this term. Additionally, the same section of H 204 asks, “Will the proposed revision enhance access to library resources? Would library users find it easier to discover resources of interest to them if the proposed change were to be approved?” Again, if this phrase is commonly used by patrons, it would make sense to add it to our catalogs … You wrote that “LCSH does not currently have an established pattern that combines the topic of migration with the social reasoning for that migration.” Could someone explain why Great Migration, ca. 1914-ca. 1970 doesn’t fit this pattern? Is it because of the date range and that this is a specific event?[88]
Another librarian emphasized the ongoing importance of white flight, the prevalence of literature discussing it, and the unequal treatment of headings describing different groups in LCSH:
The differences between these proposals from my perspective seems to be that one describes African Americans and the other describes White people, and White flight is an ongoing concept rather than a single historical event. I hope PTCP reconsiders this decision, because the effects of White flight and the practices surrounding it shape racial inequality in the United States and in many other countries in the world. Many works describe White flight and its consequences … and users are familiar with the term and want to find works about it.[89]
Finally, a respondent noted yet another term matching the supposedly non-existent pattern: “The existing heading Amenity migration would also appear to provide a pattern combining the topic of migration with the social reasoning for that migration.”[90]
Despite these arguments, LC did not respond to the mailing list discussion nor change its decision. As White flight had literary warrant, was amply supported by reference sources, and was a concept that could not be accurately conveyed using already existent subject headings, why was PTCP concerned about neutrality “particularly in this case”? Even governmental entities as varied as the Supreme Court, the U.S. Commission on Civil Rights, the National Register of Historic Places, and LC itself use the term white flight. The rejection’s insistence on the need for uninvolved neutrality therefore seemed inconsistent with the widespread acceptance of the term.
Instead, the neutrality justification appears to be a smokescreen to cover up discomfort with a term that called out white racism; mandating neutrality in this case meant privileging being inoffensive to white people over acknowledging a widely accepted critique of systemic racism. Patton notes in her Substack post “White People Hate Being Called ‘White People’” that whiteness functions in part by invisibility, a “retreat into universalism where whiteness can dissolve back into ‘humanity’ and avoid accountability.”[91] Rejecting the proposal may have been a neutral decision (i.e., deliberately unobjectionable and indifferent to political and social realities), but it was certainly not unbiased (i.e., free from favoritism). Instead, it conceptually reinforced the false position of whiteness described by Patton as “the default, neutral, objective, and moral”[92]—thus undermining equity in LCSH and making works on this important topic invisible and unsearchable in library catalogs.
Discussion
Chiu, Ettarh, and Ferretti describe the futility of relying on neutrality to further social justice within librarianship and its vocabularies:
When the profession discusses neutrality, we believe that the profession actually seeks equity. However, neutrality will not yield equitable results and will always fall short because it relies on equity already existing in society. This is not the condition of our current society, nor is it true for the profession. Therefore, neutrality will actually work toward reinforcing bias and racism.[93]
The rejection of White flight illustrates this point aptly. Justifying the rejection by invoking neutrality means that practically speaking being neutral equates to whitewashing the ongoing phenomenon, by pretending that the movement of white people in the United States is entirely benign, divorced from racism, and not worth library or library user attention. What are the long-term consequences of privileging neutrality, as opposed to equity, in the subject approval process? Neutrality as political isolationism and mandated inoffensiveness leads, as seen in the rejections from 2005 through 2024, to suppressing political and social critiques, hiding prejudice, and rendering the lived experiences of marginalized groups invisible.
It is unfortunately far too easy to weaponize a neutrality that gives equal weight to what groups such as racists and antisemites intend when evaluating proposals. A SHM instruction created in late 2024, “H 1922,” further embeds this weaponization within subject guidance. “H 1922” defines “offensive words” as “derogatory terms that insult, disparage, offend, or denigrate people according to their race, ethnicity, nationality, religion, gender identity, sexuality, occupation, social views, political views, etc.”[94] By including political and social views in the definition, LC inaccurately equates groups espousing opinions about how people should behave in society with demographic groups who have historically been marginalized merely for existing. This leaves LCSH vulnerable to political actors disingenuously claiming “offense” to silence critiques or establish prejudicial terms within the vocabulary. A recent example of this was the proposal to change Trans-exclusionary radical feminism into Gender-critical feminism, the obfuscatory label preferred by the transphobic group, by claiming that trans-exclusionary radical feminism was a slur.[95] (LC ultimately rejected the proposal, thanks in large part to “community activism” and mobilization opposing the change.[96] LC specifically mentioned library community input as the rationale for the rejection: “When this tentative list was published in November 2024, PTCP received over 300 email comments demanding rejection of this proposal.”[97])
There is ample evidence from the recent past and present of this weaponization of offense being used to undermine progress toward equity in the United States. The Trump administration’s proposed Compact for Academic Excellence in Higher Education (2025) exemplifies the dangers of privileging neutrality over equity. The Compact demands “institutional neutrality,” requiring that universities and their employees “abstain from actions or speech relating to societal and political events except in cases in which external events have a direct impact upon the university.” Those agreeing to this isolationist neutrality, in the meantime, would also agree to erase trans, non-binary, and intersex students, faculty, and staff, and to police and punish speech deemed offensive to conservatives. Notably, the Compact requires that admissions be based on “objective” criteria—except for explicitly-allowed faith, “sex-based,” and anti-immigrant biases.[98]
Mandated neutrality within “H 204” risks reifying the same prejudices within library vocabularies. This can be seen in LC’s recent alteration of Mexico, Gulf of to America, Gulf of, and Denali, Mount (Alaska) to McKinley, Mount (Alaska).[99] Critical cataloger Berman describes the former change as “linguistic imperialism,” and the latter as an “affront to Alaska’s indigenous population.”[100] The latter change is particularly damaging, given the simultaneous effort by LC to remediate LCSH related to Indigenous peoples, and might undermine confidence in the project. In both cases, a neutral approach—remaining uninvolved in political and social events—led to an undue “deference to chauvinistic, ethnocentric, and unjustified authority.”[101] Whether LC realistically could have resisted altering these headings is a counterfactual hypothetical. Its actions must be judged by the effects of these revisions within library catalogs and for library patrons. By clinging to the illusion of neutrality, and capitulating to the whims of a racist and colonialist regime, LC undermined the profession’s stated values and harmed the larger library community.
Recommendations
What philosophical approach can LC take in lieu of neutrality, to bring the SACO process more in concert with library ideals of equity and egalitarianism? We recommend that LC employ a values-driven approach to vocabulary construction and maintenance. Explicitly stated library values—particularly around social justice and social responsibility—benefit all users, both marginalized peoples and the “mainstream.” Further, the PCC Policy Committee, of which LC is a permanent member, has already committed to the PCC Guiding Principles for Metadata, which acknowledge that “the standards and controlled vocabularies we use and their application are biased,” and advocate for “incorporating DEI principles in all aspects of cataloging work.”[102] Below, we suggest a number of changes LC could enact to make LCSH and the proposal process more equitable.
In backing away from neutrality as a guiding principle, philosophical approaches that have been suggested in critiques of traditional practice deserve consideration. In her chapter in Questioning Library Neutrality, Iverson proposes that librarians adopt feminist philosopher Haraway’s approach to objectivity: “Haraway explains that what we have accepted as ‘objectivity’ claims to be a vision of the world from everywhere at once … We can not see from all perspectives at once, we each have our own particular views that are shaped by our own identities, cultures, experiences, and locations.”[103] Instead of claiming to possess “infinite vision,” Iverson recommends that we adopt Haraway’s recognition of “situated knowledge.”[104]
Watson argues that instead of literary or bibliographic warrant (cataloging a book in hand, asking what subject headings are needed to convey its content), critical catalogers “operate from a position of catalogic warrant, reading the terms and hierarchies of cataloging and classification systems with a critical eye, reflecting on the potential benefit or harm of each term on marginalized users, groups, or the GLAMS [galleries, libraries, archives, and museums] community as a whole.”[105] In other words, librarians should focus on the subject heading system in its entirety, asking what revisions and additions are needed. In some ways, by collaborating with SACO funnels on large-scale projects to create and revise related groups of subject headings, LC has already moved away from a strict interpretation of literary warrant under which the only valid reason to propose a subject heading is having a book in hand that requires it. This shift should be continued and expanded.
As for concrete actions, we advise that LC restore its open monthly subject editorial meetings where proposals are discussed, and expand points of communication with external libraries. This would allow a more diverse range of librarians to participate in the SACO process and provide valuable input during decision-making. Other benefits of monthly meetings have been noted by SACO librarians in an open letter to PTCP: they helped to demystify “the SACO process” for the newly-involved, and allowed librarians to contribute to “lively conversations on a broad range of options, and the opportunity to shape the vocabularies we all use, from proposing single headings to creating special lists to debating new guidelines for topical subdivisions.”[106]
Building off of this, we suggest creating an external advisory group for LCSH, similar to the ones for LCDGT and LCGFT, to get input from a broader range of users on proposal vetting and vocabulary maintenance. Further, we urge LC to allow greater decision-making power for external librarians in all advisory groups. This would help LC vocabularies better reflect the resources in the Library of Congress collections and the needs of thousands of libraries of different types around the world, and improve accountability for decisions made regarding proposals. It would also help to better insulate library vocabularies from the governmental interference noted above, by making a broad range of institutions responsible for their creation and maintenance.
Within such bodies, we recommend that LC follow guidance from the SAC Working Group on External Review of LC Vocabularies, by including members from groups being described in those vocabularies, subject matter experts, and international representatives. Furthermore, membership should not include “[r]epresentatives from groups or organizations that purport to speak for marginalized communities, but who exclude the voices of members of the marginalized community,” or “[r]esearchers or representatives from groups or organizations where the experts cause harm to members of marginalized communities.”[107] The inclusion of representative groups aligns with the PCC Guiding Principles for Metadata and follows the principles put forth in the Cataloguing Code of Ethics.
In vetting SACO proposals, “LC should prioritize sources from the peoples and communities described, privileging those sources over traditionally ‘authoritative’ sources, including literary warrant,” to ensure that the terminology used “reflect[s] a more inclusive and culturally relevant understanding of the language associated with these groups and their heritage and history.”[108] The creation of a position within LC focused on remediating metadata related to Indigenous peoples was a good first step in this direction; and we strongly encourage LC to both continue and expand this practice.
Finally, we suggest revisions to various LC documents and SHM instruction sheets. References to neutrality should be removed from “H 204” and “Module 1.4,” in favor of a focus on active equity in subject assignment and proposals. Examples of unbiased terminology, created in concert with advisory groups described above, reflecting a variety of situations, and periodically updated, would help create a shared understanding between librarians proposing headings and those evaluating them for inclusion in LCSH. “H 180” and “Module 1.4” should also be edited, in the sections advising catalogers to remain objective and not “express personal value judgments.”[109] All cataloging relies on judgment, and judgment is not always synonymous with bias or divorced from facts. A more useful focus here, as in a revised “H 204,” would be on the active equity present in Merriam-Webster’s definition of objectivity; catalogers should employ “catalogic warrant” and evaluate the “potential benefit or harm”[110] of subjects, particularly when assigning headings to prejudicial works. Finally, in order to protect against weaponized “offense,” we also recommend that “social views” and “political views” be removed from “H 1922.” These alterations would bring the SHM and LCSH training more in line with LCDGT guidance, which foregrounds cataloging ethics. “L 400,” for instance, notes that “naming demographic groups and identifying individuals as members of those groups must be done with accuracy and respect,” and highlights the importance of self-identification when assigning headings.[111]
We cannot make recommendations on this topic without addressing the current political climate. Because LC’s catalog migration put most SACO work on hold during 2025,[112] the effect of the Trump administration’s anti-DEI policies on LCSH remains uncertain. However, United States history is rife with periods of political repression. Waiting until relative calm to advocate for equity has not been, historically, how equity was advanced; and it will not serve library patrons or the broader community in the present moment.
Conclusion
LCSH began over a century ago as a subject cataloging tool for the Library of Congress, and has since evolved into a vocabulary serving thousands of libraries around the world. Despite the broad and diverse user base, LC has remained the sole arbiter of which proposals are accepted into LCSH and what form the headings take. During the last two decades it has rejected a number of subject proposals due to a preference for purported neutrality and objectivity, in various guises. Yet, as a profession, librarianship claims to prioritize social responsibility. Social justice and equity are incompatible with an indifferent and purposefully inoffensive neutrality that allows harmful, colonialist, and racist headings in LCSH, and keeps out headings describing prejudice, or about the lived experiences of marginalized peoples.
Olson describes LCSH as “a Third Space between documents being represented and users retrieving them,” since “LCSH constructs the meanings of documents for users.”[113] These meanings impact how users view materials, and whether they can locate them in library catalogs. And it is within this space that LC’s commitment to neutrality fails both users and the ideals of librarianship around social responsibility. However, “because the Third Space is one of ambivalence, it is one with potential for change.”[114] By focusing on library values rather than neutrality within the subject creation and approval process, LCSH could develop into a vocabulary that constructs truly equitable and inclusive meanings for users and librarians alike.
Acknowledgements
Thank you to our publishing editor, Jess Schomberg, and the editorial board for their flexibility, guidance, and expertise throughout the publication process. Thank you to K.R. Roberto, Margaret Breidenbaugh, Crystal Yragui, and Matthew Haugen, who allowed us to quote them within this article. We would also like to thank our reviewers, Jamie Carlstone and Ian Beilin, and other readers who gave valuable feedback: Adam Schiff, Rebecca Albitz, Chereeka Garner, Rebecca Nowicki, Naomi Reeve, Simone Clunie, Violet Fox, and Stephanie Willen Brown.
[1] Robert Jensen, “The Myth of the Neutral Professional,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 91.
[3] Throughout this article, authorized subject headings (i.e., those that exist currently in LCSH) are presented in bold font; while rejected proposed headings appear in italics. For consistency, subject headings within quotations will follow the same formatting, regardless of the formatting used in the original quotation.
[8] Gina Schlesselman-Tarango, “How Cute!: Race, Gender, and Neutrality in Libraries,” Partnership: The Canadian Journal of Library and Information Practice and Research 12, no. 1 (Aug. 2017): 10, https://doi.org/10.21083/partnership.v12i1.3850.
[9] Maura Seale, “Compliant Trust: The Public Good and Democracy in the ALA’s ‘Core Values of Librarianship,’” Library Trends 64, no. 3 (2016): 589, https://doi.org/10.1353/lib.2016.0003.
[15] Dani Scott and Laura Saunders, “Neutrality in Public Libraries: How Are We Defining One of Our Core Values?,” Journal of Librarianship and Information Science 53, no. 1 (2020): 153, https://doi.org/10.1177/0961000620935501.
[16] Scott and Saunders, “Neutrality in Public Libraries,” 158.
[19] Mark Rosenzweig, “Politics and Anti-Politics in Librarianship,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 5-6.
[20] Stephen Macdonald and Briony Birdi, “The Concept of Neutrality: A New Approach,” Journal of Documentation 76, no. 1 (2020): 333–353. https://doi.org/10.1108/JD-05-2019-0102.
[21] Jaeger-McEnroe, “Conflicts of Neutrality,” 3.
[22] Jaeger-McEnroe, “Conflicts of Neutrality,” 6.
[23] Jaeger-McEnroe, “Conflicts of Neutrality,” 9.
[24] Steve Joyce, “A Few Gates Redux: An Examination of the Social Responsibilities Debate in the Early 1970s and 1990s,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 33-65.
[26] “Resolution to Condemn White Supremacy and Fascism as Antithetical to Library Work,” American Library Association, Jan. 25, 2021, https://tinyurl.com/yr4z9e8x
[27] Scott and Saunders, “Neutrality in Public Libraries,” 153.
[35] Subject Analysis Committee Working Group on the LCSH “Illegal aliens,” “Report from the SAC Working Group on the LCSH ‘Illegal aliens,'” July 13, 2016, https://alair.ala.org/handle/11213/9261.
[36] Jill E. Baron, Violet B. Fox, and Tina Gross, “Did Libraries ‘Change the Subject’? What Happened, What Didn’t, and What’s Ahead,” in Inclusive Cataloging: Histories, Context, and Reparative Approaches, eds. Billey Albina, Rebecca Uhl, and Elizabeth Nelson (ALA Editions, 2024), 53; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 11 (November 12, 2021)” (Library of Congress, 2021), https://classweb.org/approved-subjects/2111b.html.
[37] Baron et al., “Did Libraries ‘Change the Subject?,’” 54.
[38] Michelle Cronquist and Staci Ross, “Black Subject Headings in LCSH: Successes and Challenges of the African American Subject Funnel Project,” Reference and User Services Association, July 7, 2021, virtual. https://d-scholarship.pitt.edu/41826
[39] Cronquist and Ross, “Black Subject Headings in LCSH.”
[40] Cronquist and Ross, “Black Subject Headings in LCSH.”
[41] Library of Congress, “Library of Congress Subject Headings Approved Monthly List 06 (June 18, 2021)” (Library of Congress, 2021), https://classweb.org/approved-subjects/2106.html. Note the headings for Japanese Americans, Japanese Canadians, and Aleuts were originally submitted as –Forced removal and incarceration matching preferred usage, but LC changed them all to –Forced removal and internment.
[43] For more information about Congressional actions related to the attempt to change Illegal aliens, see: SAC Working Group on Alternatives to LCSH “Illegal aliens,” “Report of the SAC Working Group on Alternatives to LCSH ‘Illegal aliens’” (American Library Association, 2020), http://hdl.handle.net/11213/14582.
[48] Rich Gazan, “Cataloging for the 21st Century Course 3: Controlled Vocabulary & Thesaurus Design Trainee’s Manual” in Library of Congress Cataloger’s Learning Workshop (Library of Congress, n.d.), 2-2,
[60] Anastasia Chiu, Fobazi M. Ettarh, and Jennifer A. Ferretti, “Not the Shark, but the Water: How Neutrality and Vocational Awe Intertwine to Uphold White Supremacy,” in Knowledge Justice: Disrupting Library and Information Studies through Critical Race Theory, eds. Sofia Y. Leung, Jorge R. López-McKnight (MIT Press, 2021), 65.
[61] Library of Congress, “Editorial Meeting Number 4,” 2008; Library of Congress, “LCSH/LCC Editorial Meeting Number 02 (2024).”
[68] Krista Maywalt Aronson, Brenna D. Callahan, and Anne Sibley O’Brien, “Messages Matter: Investigating the Thematic Content of Picture Books Portraying Underrepresented Racial and Cultural Groups,” Sociological Forum 33, no. 1 (2018): 179, http://www.jstor.org/stable/26625904.
[69] Lisely Laboy, Rachael Elrod, Krista Aronson, and Brittany Kester, “Room for Improvement: Picture Books Featuring BIPOC Characters, 2015–2020,” Publishing Research Quarterly 39 (2023): 58, https://doi.org/10.1007/s12109-022-09929-7.
[72] Library of Congress, “Summary of Decisions, Editorial Meeting Number 09” (Library of Congress, 2017), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-170918.html. LC did establish a new heading for Denialism at that time; however, per the rejection, “To bring out the denialism aspect of events or topics, the heading may be post-coordinated with headings for the events or topics. The existing subject headings Holocaust denial and Holodomor denial, which are related to specific events, were added by exception as narrower terms of the new heading Denialism. Additional narrower terms will not be added to Denialism.”
[80] Library of Congress, “Editorial Meeting Number 27,” 2011; Library of Congress, “Library of Congress Subject Headings Monthly List 12 LCSH (December 17, 2012)” (Library of Congress, 2012), https://classweb.org/approved-subjects/1212.html.
[81] Library of Congress, “Editorial Meeting Number 27,” 2010.
[82] K.R. Roberto, “LCSH Proposals: Is this a Trend?” Jan. 17, 2012, RADCAT mailing list archives.
[85] Library of Congress, “Summary of Decisions, Editorial Meeting Number 12” (Library of Congress, 2015), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-151212.html. A 2016 rejection of Dadaist literature, Romanian (French) also highlighted colonialist content in LCSH, noting that “Headings for national literatures qualified by language are generally established for the language(s) of the colonial power that used to control the territory.” See: Library of Congress, “Editorial Meeting Number 04,” 2016.
[103] Sandy Iverson, “Librarianship and Resistance,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 26.
[104] Iverson, “Librarianship and Resistance,” 26.
[105] B. M. Watson, “Expanding the Margins in the History of Sexuality & Galleries, Libraries, Archives, Museums & Special Collections (GLAMS)” PhD diss. (University of British Columbia, 2025), 270.
[107] Subject Analysis Committee Working Group on External Review of LC Vocabularies, Report of the SAC Working Group on External Review of Library of Congress Vocabularies, February 2023, 8-9, https://alair.ala.org/handle/11213/20012.
[108] Working Group on External Review of LC Vocabularies, “Report,” 8.
I’m on strike right now, along with thousands of other faculty, academic professionals, and staff at Portland Community College (that’s two unions, friends!). It’s a weird feeling. I never thought I’d be in this position. PCC was the first place I worked where I really felt like the values of the College matched my own. I work with insanely dedicated and caring library workers, faculty, and staff. They believe unwaveringly in what they do and constantly go above and beyond for students. After being here for a few years, I knew this was the place I wanted to work for the rest of my career. Even as administration became worse – more corporatized, more performative, less accessible, more likely to listen to outside consultants than the people who directly work with students – I still never considered leaving because the folks I work with regularly are awesome and I love our students.
As a scholar of time, I’m always interested in different forms of time (queer time, crip time, etc.). Strike time feels really strange. We were talking this morning on the picket line how it feels a lot like early COVID where time moved very differently. We feel like the days are both way too long and super short with not enough time to get everything done but also too much time just staring at different union social channels. We’re totally energized and totally exhausted (I’m lying on the couch like a ragdoll right now after three hours of holding signs, screaming, and dancing, marching and chanting with hundreds of colleagues). In terms of information, we feel like we’re both drinking from a firehose and like we don’t have any of the information we need. We have no idea what the near-term future will bring. What day of the week it is feels almost arbitrary because none of the usual markers of those days apply (I see all the things I was supposed to have been doing at work each day on my calendar and it feels like another life entirely). We’re both unmoored and deeply connected. I love it (the connection and collective power) and I also really hate it (for our students, for our colleagues who live paycheck to paycheck, for what the administration and the Board are doing to my beloved institution).
So it’s weird to feel both temporarily severed from the College and also more deeply connected than ever. These administrators may run the College and have the authority to make decisions, but they are not the College. The College is the people I’ve seen on the picket lines the past few days in the rain and freezing cold. These people who are truly fighting for the soul of our college. They make the College run, from teaching classes, to assisting students with all kinds of needs, to helping students feel welcome, to keeping the College clean and safe and keeping students fed. All of these things are critical and the College can’t run without us, but I’m not entirely sure the same can be said of our administrators. The College is also our students, many of whom have stood with us on the line, who’ve brought us food, or have supported us through emails to the President and Board and on social media. I feel incredibly grateful for our students who clearly see through the bs administration is putting out there.
It’s been kind of incredible to see how unprepared our administration was for this after 11 months in which they barely moved in negotiations. They’ve known for months that a strike was a distinct possibility and they were the ones who walked away from the bargaining table the night before the strike was meant to happen. The latest email from the President said “I will say, with some pride, that we are not – and we should not – be an organization that is good at navigating this scenario” but, honestly, they should have had guidance for students ready to go. Administrators are supposed to plan for scenarios like this. They had units planning for two different scenarios for cuts from the State (neither of which came to pass). We spent almost a year planning what we would cut if LSTA funds went away in our state for the next year (they didn’t, thank goodness). Most faculty, on the other hand, have been talking to students about a possible strike for the past six weeks at least and the union provided tons of resources to help them come up with a plan for their own classes. Yet the College was left totally scrambling last Wednesday as if they had no idea this could happen. Baffling.
It’s been interesting seeing some managers show up to bring food and/or spend a bit of time with us on the line. It’s not a lot of them, but it means a lot to us when someone does. They’ve told us about the absolute unprepared hot mess that is administration right now and it’s nice to realize that not every middle manager toes the party line at all times. But the vast majority of our managers sent us emails just before the start of the strike asking us to let them know if we were working or not, so most are definitely sticking with administration.
I had a boss many years ago who definitely put her employees first and advocated fiercely for us. She said she saw her role as being akin to a manager of a minor league baseball team. She was here to help develop us for bigger and better things in our careers. She was a major mentor to me in my early years in the profession. Since then, the bosses I’ve had really prioritized the people above them in the org chart ahead of the people below them. They have been classic “company [wo]men.” Helping us develop in our careers or even supporting us when we explicitly asked for it wasn’t part of the job. When I was a middle manager, I took the exact opposite approach and that’s why I’m no longer a middle manager. I always saw the role of a manager as supporting one’s direct reports (essentially, I worked for them) and that wasn’t what the people in charge of the library wanted me to do.
The great library leader Mitch Freedman died recently and it made me think about whether leaders like him can really exist in our much more corporatized libraries these days. If you don’t know about Mitch’s storied biography as a library leader and awesome human, please take a moment to read about him here in an obit from his family. When I was coming up as a librarian, he was the sort of man who was a model for me in successfully operating in our field with total moral courage. He lived his values every day. He fought for people and the things that he believed in. He centered the folks who were oppressed. He believed relationships were core to our work. In many ways, he embodied the “Good” and the “Human(e)” characteristics of slow librarianship (maybe also the “Thoughtful” but I didn’t work with him, so I’m not sure). His amazing daughter, Jenna Freedman, also lives her values courageously, a living tribute to his example.
I hope there are still library managers out there who have moral courage and fight the good fight, but, more and more, it feels like the people who become library Deans, Directors, and University Librarians are the ones who are willing to comply and conform, not the ones willing to rock the boat. As our institutions become more and more corporatized and neoliberal, we see less and less moral courage. I see a lot of library administrators wanting to look like they’re doing good more than they actually want to do good. I think of the leaders who all started EDI initiatives or published EDI statements right around 2020 and then let them fade away. Most of the people I see doing amazing values-driven work in our field these days are not leading libraries. They’re mostly front-line librarians. I wonder if it’s because, like me, folks are not willing to make the moral compromises so many have to make these days to climb the ladder.
the decisive victory of capitalism in the 1980s and 1990s, ironically, has… led to both a continual inflation of what are often purely make-work managerial and administrative positions—”bullshit jobs”—and an endless bureaucratization of daily life, driven, in large part, by the Internet. This in turn has allowed a change in dominant conceptions of the very meaning of words like “democracy.” The obsession with form over content, with rules and procedures, has led to a conception of democracy itself as a system of rules, a constitutional system, rather than a historical movement toward popular self-rule and self-organization, driven by social movements, or even, increasingly, an expression of popular will.
I see that in my own place of work. So much of my boss’ (our Dean’s) job is box checking compliance type work – approving vacations and sick leave, making sure we’re doing required trainings and other things the people above her on the org chart want us to do, making sure we’re doing all of the things contractually required of us, etc. It used to be that I met with her once each term to talk about what I was working on, go over my progress on my goals, etc. Then I went to meeting with her just once in Fall where we’d look at my goals document (without any meaningful feedback or support) and then I’d fill out a Google form at the end of the year to tell her what I did (with again no meaningful feedback). Now, even that Fall meeting is gone as her load of compliance-related work has increased. There’s no support outside of helping us navigate the bureaucracy of our institution. There’s no “walking around” as Mitch Freedman did – building relationships with employees and making them feel seen. There’s no focus on our development or talking about the meaning behind what we do. There’s just this compliance-focused flurry of activity.
As our colleges and universities become more and more corporatized, they take what were supposed to be leadership positions requiring vision and people skills and turn them into babysitting jobs because, lord knows, we professionals can’t be trusted. Our college, like many, has seen a massive growth in the number of managerial positions, and yet, faculty and staff are being asked to do more administrative work than ever before, not less. Why? Well, of course those managers have to justify their existence.
Could a Mitch Freedman become a library director today? Would he have had to compromise his values somewhere down the line to get there? Do you know of any library leaders like Mitch today who are able to operate successfully in these more neoliberal environments?
In that same piece, David Graeber writes “scholars are expected to spend less and less of their time on scholarship, and more and more on various forms of administration—even as their administrative autonomy is itself stripped away. Here too we find a kind of nightmare fusion of the worst elements of state bureaucracy and market logic.” This is the reality we find ourselves in as our two unions fight for better pay, but even more importantly, for a real, substantial model of shared governance which we don’t currently have (and which our college President agreed to and then hired a consultant to create for us). The fact that the only college committee or governance group that has the ability to conduct a vote of no confidence in our President (which they successfully passed!) is our student government is a stark reminder of how little power and voice we have in the future of our college. It can be so easy to just focus on keeping our head down and doing the good work we do as educators, as supporters of students and faculty, as stewards of collections, etc., but when we fight together like this, we fight for the heart and soul of our organization. We fight for an organization that centers students and their needs and listens deeply to those who directly serve and educate them.
Walking the picket line the first couple of days was brutal in many ways. I was so cold and wet I couldn’t even grip my cell phone or a car door handle and I had to stay off my feet for a few hours as they thawed. But what has kept me warm, has kept all of us warm, is the solidarity. It has sometimes felt almost like a party, being there with many hundreds of my fellow colleagues. It’s been so affirming, so energizing. We’re all so united in this, so deeply committed to the institution and each other in ways that these administrators who jump from job to job every few years and compose soulless emails to us with freaking ChatGPT will never understand.
If you’re feeling so inclined, please contribute to our strike fund. The administration seems really dug in and even decreased their offer by over $100,000 on Sunday, so I’m not quite so optimistic anymore that this will end quickly and we have lots of faculty, academic professionals, and staff who won’t be able to pay their rent or mortgage without support. Thanks and solidarity!!
Relive the online conference that brought the open data community together for a celebration of two decades of CKAN and to discuss the role of open data and data infrastructures today.
Librarians have managed and lived through many seismic shifts brought by technology. How should librarian leaders approach the coming anticipated AI workforce disruption?
Abstract This column explores the ways in which library workers can better align technology use and instruction in library settings with library values, through championing the refusal of technologies that conflict with values like privacy and intellectual freedom. Drawing on experiences with individual patron instruction, class design, and passive programming, the author shares practical steps for helping patrons to understand and fight back against exploitation by digital technologies. Rejecting the myth that any technology is “neutral,” the column argues that libraries as values-driven organizations have a role to play in facilitating patrons’ rejection of technology, just as much as in their adoption of it.
Note from Shanna Hollich, column editor: I am particularly excited to share this issue's column for a number of reasons. First, it's from a public library perspective, which is one that is generally underrepresented in the LIS literature as a whole, and which I'm proud to say that ITAL makes a concerted effort to address. Second, it's about library instruction, a topic of relevance to all types of libraries - and where much of the literature specifically discusses formal library instruction, this column also addresses passive programming, informal instruction, and casual patron interaction, which are also vitally important and under-studied aspects of the library worker's role in education. And finally, it's yet another column about AI, and even more specifically, about taking a critical approach to AI tools, AI education, and AI literacy. Close readers may have noticed this topic tends to be a special interest of mine, but Hannah Cyrus takes a measured and reasoned approach here that acknowledges the potential harms of AI without falling into the trap of simply ignoring or denying AI and the very real impacts it is having on our libraries and the communities we serve.
The first phase of the Reimagining Discovery project at Harvard Library sought to address the challenge of fragmented search experiences of special collections materials using artificial intelligence (AI) technologies, such as embedding models and large language models (LLMs). The resulting platform, Collections Explorer, simplifies and enhances the search experience for more effective special collections discovery. The project team took a user-centered and trustworthy approach to implementing AI, grounding the choices of the platform in user empowerment and librarian expertise. The development process included extensive user research, including interviews, usability testing, and prototype evaluations, to understand and address user needs.
Collections Explorer was developed using a multi-component architecture that integrates multiple types of AI. The team evaluated more than 12 models to select ones that were the best fit for the need, as well as being ethical and sustainable. Detailed system prompts were developed to guide LLM outputs and ensure the reliability of information. The methodical and iterative approach helped to create a flexible and scalable platform that could evolve to support other material types in the future. Initial research showed that potential users are enthused at the prospect of AI-powered features to enhance discovery, especially the item-level summaries and related search suggestions. The project demonstrated the potential of integrating AI technologies into library discovery systems while maintaining a commitment to trustworthiness and user-centered design.
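To make the embedding-based part of that architecture concrete, here is a toy sketch of the core retrieval move: represent the query and each item description as vectors and rank items by similarity. Everything below is hypothetical; the record descriptions are invented and the bag-of-words “embedding” merely stands in for the learned embedding models a platform like Collections Explorer would actually use:

from collections import Counter
import math

def embed(text):
    # Toy bag-of-words "embedding"; a real system would call a learned
    # embedding model, but the ranking arithmetic is the same.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Invented item-level descriptions, not actual collection records.
records = {
    "ms-001": "letters written during a whaling voyage in the Pacific Ocean",
    "ms-002": "photographs of Boston street life at the turn of the century",
    "ms-003": "diary kept aboard a merchant ship bound for the Pacific",
}

query = embed("ship voyage pacific")
for rid in sorted(records, key=lambda r: cosine(query, embed(records[r])), reverse=True):
    print(rid, records[rid])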
This study evaluates the effectiveness of the Artificial Intelligence for Theme Generation tool (Portuguese acronym: IAGeraTemas), developed with generative artificial intelligence (AI; Google Gemini), for automating thematic classification and the assignment of Sustainable Development Goals (SDGs) in documents. The methodology combined quantitative analysis (precision, recall, and accuracy metrics) of 50 articles published by authors from the State University of Campinas (Unicamp), using classifications from the SciVal database, with qualitative analysis of the relevance of terms indexed by librarians from the Unicamp Library System in 40 articles available in the Unicamp Institutional Repository, comparing the tool’s output with the manual indexing performed by librarians. The quantitative results in SDG classification showed a recall of 0.785, while the precision and accuracy metrics were moderate. The qualitative analysis deepened the evaluation of the coherence and relevance of the terms suggested by the AI versus human indexing. It revealed the tool’s potential for suggesting relevant terms and expanding concepts, but it also exposed limitations in addressing complex topics. The research, conducted as an experiment at the Unicamp Library System, concludes that IAGeraTemas is a valuable auxiliary tool, complementing but not replacing manual indexing, reinforcing the importance of human expertise in validating and refining results, and emphasizing the synergistic potential between AI and information professionals.
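For readers unfamiliar with these metrics, the arithmetic behind precision, recall, and accuracy is easy to sketch; the confusion-matrix counts below are hypothetical and chosen only so that the recall matches the 0.785 reported above:

# Hypothetical counts for SDG assignment (not data from the study).
true_positives = 157    # SDGs the tool assigned that librarians also assigned
false_positives = 61    # SDGs the tool assigned that librarians did not
false_negatives = 43    # SDGs librarians assigned that the tool missed
true_negatives = 539    # SDGs neither assigned

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
accuracy = (true_positives + true_negatives) / (
    true_positives + false_positives + false_negatives + true_negatives
)
print(f"precision={precision:.3f} recall={recall:.3f} accuracy={accuracy:.3f}")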
This article describes a case study in which a small metadata team at Illinois State University Milner Library produced a digital humanities project supporting Collections as Data (CAD) and linked data principles. Despite initial sparse descriptive content, the team recognized great potential for experimentation in a significant World War I archival collection to highlight lesser-known stories, including those of the Pioneer Infantry, women, and noncombatants. Discussion focuses on the strategic approaches in creating granular but scalable metadata for the large digital collection, and application of the data with various tools such as ArcGIS and Wikidata to construct interactive data visualizations, mapping, and digital storytelling for the Illinois State Normal University World War I Service Records collection. The article argues that even institutions without a dedicated CAD initiative can incrementally implement principles from the CAD model to add value to their digital collections. The authors first presented the project in 2024 at the Digital Library Federation Forum and the American Library Association Core Forum.
In digital preservation, the concept of a “Designated Community” from the Reference Model for an Open Archival Information System (OAIS) is used to articulate the group or groups of prospective users for whom information is preserved. Concerns have been raised about this concept and its potential implications. However, OAIS has recently undergone a major revision. This study examines the extent to which these revisions address or mitigate concerns regarding the Designated Community. Issues from the literature are grouped into three areas: the concept’s implementation, its potential misapplication, and its incompatibility with the mandates of institutions that serve broad and diverse communities. Major changes related to the Designated Community are identified and considered in relation to these issues. The analysis reveals that the revisions productively contribute to concerns in the first two areas but fail to address the third. The conclusion is that the process of revising OAIS has not drawn from insights into this topic in the literature.
The National Library Board (NLB) of Singapore has made significant strides in leveraging data to enhance public access to its extensive collection of physical and digital resources. This paper explores the development and implementation of the Singapore Infopedia Widget, a recommendation engine designed to guide users to related resources by utilizing metadata and a Linked Data Knowledge Graph. By consolidating diverse datasets from various source systems and employing semantic web technologies such as Resource Description Framework (RDF) and Schema.org, NLB has created a robust knowledge graph that enriches user experience and facilitates seamless exploration.
The widget, integrated into Infopedia, the Singapore Encyclopedia, surfaces data through a user-friendly interface, presenting relevant resources categorized by format. The paper details the architecture of the widget, the ranking algorithm used to prioritize resources, and the challenges faced in its development. Future directions include integrating user feedback, enhancing semantic analysis, and scaling the service to other web platforms within NLB’s ecosystem. This initiative underscores NLB’s commitment to fostering innovation, knowledge sharing, and the continuous improvement of public data access.
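As a rough illustration of the kind of statements such a knowledge graph holds, here is a minimal sketch using the rdflib package and Schema.org terms; the URIs and property choices are invented for illustration and do not reflect NLB’s actual data model:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

SDO = Namespace("https://schema.org/")
g = Graph()
g.bind("schema", SDO)

# Hypothetical resources: an encyclopedia article and a related book.
article = URIRef("https://example.org/infopedia/singapore-river")
book = URIRef("https://example.org/catalogue/book-123")

g.add((article, RDF.type, SDO.Article))
g.add((article, SDO.name, Literal("Singapore River")))
g.add((book, RDF.type, SDO.Book))
g.add((book, SDO.about, article))   # the link a widget could use to surface related resources

print(g.serialize(format="turtle"))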
This paper explores the impact of digital initiatives on access services workers at the University of California, San Diego (UCSD) and draws on the expertise and experience of non-librarian titled staff operationalizing “digital first” policies. Digital initiatives have been strongly prioritized by libraries to promote equitable access, cost-effectiveness, and technological growth at many libraries in California. The term digital initiatives commonly refers to efforts that support the creation, preservation, access, discovery, and use of digital library resources. This term can encompass multiple interpretations and a variety of tasks.
This paper includes a literature review, an examination of statistics regarding demand and adoption of digital materials in public and academic libraries in California, and a summary of the impact study of non-librarian staff at UCSD. The literature review suggested that the term digital initiatives encompasses a broad scope of meanings and types of tasks, California State Library data suggest that a pattern of increased investment in digital initiatives adopted during the COVID-19 pandemic is continuing, and the information collected through the research at UCSD library suggests that non-librarian library workers play a growing role in managing, maintaining, and supporting these growing digital collections.
Computer workstations have been an integral part of libraries of all types since the 1980s, but the optimal number of workstations that should be deployed in a space has not been directly studied in the last 20 years. During that time, laptop computer and other mobile device ownership has continued to increase, and there is some reason to think that behaviors and preferences first seen during the recent coronavirus 2019 pandemic have further shifted how students use public desktop computers in libraries. McGill University Libraries reduced the size of its computer fleet in the aftermath of the pandemic by looking at the maximum concurrent usage of different clusters of computers across campus, a metric that indicates how busy a space can get with users. This article explains how this metric is calculated and how other libraries can use it to make an evidence-based decision about the optimal size of a computer fleet.
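The maximum-concurrent-usage idea can be sketched in a few lines: given session start and end times for a cluster of workstations, find the largest number of sessions that overlap at any one moment. The session times below are hypothetical, and the article’s own calculation may differ in its details:

# Hypothetical login/logout hours for one cluster of public workstations.
sessions = [(9.0, 10.5), (9.25, 11.0), (10.0, 12.0), (10.25, 10.75), (13.0, 14.0)]

events = []
for start, end in sessions:
    events.append((start, 1))    # a session begins
    events.append((end, -1))     # a session ends
events.sort()                    # ends sort before starts at the same instant

concurrent = peak = 0
for _, delta in events:
    concurrent += delta
    peak = max(peak, concurrent)

print("maximum concurrent usage:", peak)   # 4 for the sample data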
In 2024, the Durban University of Technology (DUT) Library conducted a comprehensive review of its library system to assess whether its current platform, Future of Libraries Is Open (FOLIO) hosted by EBSCO, and its discovery tool, EBSCO Discovery Service (EDS), aligned with its evolving needs. The institution had been using the current system for three years, but the slow development of important features and subsequent delays in a critical release of FOLIO led to frustrations among staff and library users, compelling the executive team to call for a comprehensive review of the library system. A major aim of the review was to ascertain the extent of the gaps or limitations in the current system and investigate recent developments in other library systems, including discovery tools and analytical modules. After several vendor consultative sessions, extensive review of documentation and secondary sources, and engagement with selected academic libraries in South Africa, the review team concluded that there were no compelling reasons for an immediate system change, that fair consideration should be given to the developmental and community-driven ethos of FOLIO, and that issues with EDS and Panorama would be resolved by the implementation of planned features in FOLIO’s roadmap. This paper highlights the key processes undertaken in the review and shares experiences and suitable practices for project planning, criteria development, and evaluation. It also argues for a regular review of the library system and stresses the value of institutional knowledge and familiarity in mitigating the risks associated with the review and acquisition of new library systems.
The news about Cloudflare’s new pay-per-crawl API caught my attention for a few reasons. Read on for why, a bit about what the results look like, and what I learned when I asked it to crawl this here site as a test.
So, first of all, what’s up? Cloudflare’s Crawl API helps people collect data from websites with bots, while Cloudflare at the same time provides one of the most popular technologies for preventing websites from being crawled by bots?!?
At first this seemed to me like a classic fox-guarding-the-hen-house type of situation. But the little bit of reading in the docs I’ve done since makes it seem like they will still respect their own bot gatekeeping (e.g. Turnstile).
If you are using Cloudflare or some other bot mitigation technology you
will have to follow their instructions to let the Cloudflare crawl bot
in to collect pages. Interestingly, it appears they are using the latest
specs for HTTP Message
Signatures to provide this functionality, since you can’t simply let
in anyone saying they are CloudflareBrowserRenderingCrawler, right?
The genius here is that Cloudflare is known for its Content Delivery
Network (CDN). So in theory (more on this below) when a user asks to
crawl a website the data can be delivered from the cache, without
requiring a round trip back to the source website. In some situations
this could mean that the burden of scrapers on websites is greatly
reduced.
The introduction of a Crawl API also looks like another jigsaw piece
fitting into place for how Cloudflare sees web publishers benefiting
from being crawled. Only time will tell if this strategy works out, but
at least they have some semblance of a plan for the web that isn’t
simply sprinkling “AI” everywhere.
If you run a website with lots of high value resources for LLMs
(academic papers, preprints, books, news stories, etc) the same cached
content could be delivered to multiple parties without having to go back
to the originating server. For resource constrained cultural heritage
organizations that are currently getting crushed
by bots I think this would be a welcome development.
But, the primary reason this news caught my eye is that if you squint
right Cloudflare’s Crawl API looks very much like web archiving
technology. For example, the Browsertrix API lets you
set up, start, monitor and download crawls of websites.
Unlike Browsertrix, which is geared to collecting a website for viewing
by a person, the Cloudflare Crawl service is oriented at looking at the
web for training LLMs. The service returns text content: HTML, Markdown
and structured JSON data that result from running the collected text
through one of their LLMs, with the given prompt.
Seeing the Web
So why is it interesting that this is like web archiving technology?
Ok, maybe it isn’t interesting to you, but (ahem) in my dissertation
research (Summers, 2020)
I spent a lot of time (way too much time tbh) looking at how web
archiving technology enacts different ways of seeing the web
from an archival perspective. I spent a year with NIST’s National
Software Reference Library (NSRL) trying to understand how they were
collecting software from the web, and how the tools they built embodied
a particular way of seeing and valuing the web–and making certain things
(e.g. software) legible (Scott, 1998).
What I found was that the NSRL was engaged in a form of web archiving,
where the shape of the archival records was determined by their initial
conditions of use (in their case, forensics analysis). But these initial
forensic uses did not overdetermine the value of the records,
which saw a variety of uses, disuses, and misuses later: such as when
the NSRL began adding software from Stanford’s Cabrinety
Archive, or when the team’s personal expertise and interest in video
games led them to focus on archiving content from the Steam platform.
So I guess you could say I was primed to be interested in how
Cloudflare’s Crawl service sees the web. This matters because
models (LLMs, etc) and other services will be built on top of data that
they’ve collected. But also because, if it succeeds, the service will
likely get repurposed for other things.
Testing
To test how Cloudflare sees the web, I simply asked it to crawl my own
static website–the one that you are looking at right now. I did this for
a few reasons:
It’s a static website, and I know exactly how many HTML pages are on it. All the pages are directly discoverable, since the homepage includes pagination links to an index page that lists each post.
I can easily look at the server logs to see what the crawler activity
looks like.
I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt, but it doesn’t block CloudflareBrowserRenderingCrawler/1.0; a quick way to check that is sketched just after this list).
I host my website on May First which
doesn’t use Cloudflare as a CDN. So the web content wouldn’t
intentionally be in Cloudflare’s CDN already.
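For what it’s worth, checking whether a robots.txt would block a given crawler is easy with Python’s standard library; this sketch assumes the crawler honors robots.txt and identifies itself with the user agent shown above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://inkdroid.org/robots.txt")
rp.read()

# True if the robots.txt rules allow this user agent to fetch the homepage.
print(rp.can_fetch("CloudflareBrowserRenderingCrawler/1.0", "https://inkdroid.org/"))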
I wrote a little command line utility cloudflare-crawl to
start, monitor and download the results from the crawl. While the
crawler ran I simultaneously watched the server logs. Running the
utility looks like this:
$ uvx https://github.com/edsu/cloudflare-crawl crawl https://inkdroid.org
created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
Each of the resulting JSON files contains some metadata for the crawl,
as well as a list of “records”, one for each URL that was discovered.
{"success":true,"result":{"id":"36f80f5e-d112-4506-8457-89719a158ce2","status":"completed","browserSecondsUsed":1382.8220786132817,"total":1967,"finished":1967,"skipped":6862,"cursor":51,"records":[{"url":"https://inkdroid.org/","status":"completed","metadata":{"status":200,"title":"inkdroid","url":"https://inkdroid.org/","lastModified":"Sun, 08 Mar 2026 05:00:39 GMT"},"markdown":"...""html":"...",},{"url":"https://www.flickr.com/photos/inkdroid","status":"skipped"}]}}
Analysis
I decided I wasn’t very interested in testing their model
offerings, so I didn’t ask for JSON content (the result of sending
the harvested text through a model). If I had, each successful result
would have had a json property as well. I am sure that
people will use this, but I was more interested in how the service
interacted with the source website, and wasn’t interested in discovering
the hard way how much it cost to run the content through their LLMs.
Below is a snippet of how the Cloudflare bot shows up in my nginx logs. As you can see, the logs provide insight into which machine on the Internet is making the request, what time it was requested, and what URL on the site is being requested.
Maybe it’s early days for the service, but one thing I noticed is that
each time I requested the site to be crawled the results seemed to be
radically different.
crawl time            completed  skipped  queued  errored  unique_urls
2026-03-12 13:13:00         165       84       0        1          223
2026-03-12 13:44:00          72        4       2        0           78
2026-03-12 14:09:00        1947     7304       0       23         9191
2026-03-12 16:33:00          72        4       2        0           78
2026-03-12 17:34:00        1948     7365       0       22         9191
2026-03-13 16:50:00        1947     7363       0       23         9187
2026-03-14 07:32:00          72        4       2        0           78
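Numbers like these can be tallied straight from the JSON files that cloudflare-crawl writes out; a small sketch, assuming the files sit in the current directory and that “queued” and “errored” show up as record statuses alongside the “completed” and “skipped” values seen earlier:

import glob
import json
from collections import Counter

status_counts = Counter()
urls = set()

for path in glob.glob("36f80f5e-*.json"):     # files written by cloudflare-crawl
    result = json.load(open(path))["result"]
    for record in result.get("records", []):
        status_counts[record["status"]] += 1
        urls.add(record["url"])

print(status_counts)               # completed / skipped / queued / errored
print("unique urls:", len(urls))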
The more successful crawls did a good job of covering the entire site. My website is well linked, with a standard homepage whose anchor-tag-based paging links to all the posts. But knowing when your results are only a partial crawl seems to be difficult; knowing the actual dimensions of a “website” is one of the harder problems in web archiving practice. The URLs that were labeled as “skipped” were not in scope for the crawl. If you wanted to include those, there is apparently an options.includeExternalLinks option when setting up the crawl.
From watching the web server logs it was clear that:
Cloudflare does appear to be relying on previously cached data, but it’s not entirely clear what the logic is. For example, one crawl took 5 minutes to complete and returned 1,974 completed results, but the web server only saw requests for 594 of those URLs. I turned around and ran the exact same crawl again; it took 20 minutes longer, returned 1,974 results, but 847 pages were requested. In between, no content on the website changed. 🤷
Cloudflare appears to be fetching CSS, JavaScript and images for the
rendering of each page (they aren’t being cached by the Browser Worker).
The throughput on the web server seemed to peak around 300 requests /
minute (5 requests / second). For most sites this seems perfectly
feasible.
For the more successful crawls it looked like there were 246 independent
IP addresses within Cloudflare’s network block that were doing the
crawling.
ip                request_count
104.28.153.88     405
104.28.163.131    266
104.28.161.242    232
104.28.165.231    223
104.28.153.132    212
104.28.163.132    212
104.28.163.81     201
104.28.166.65     188
104.28.166.121    186
104.28.164.201    185
104.28.153.179    182
104.28.153.137    178
104.28.164.202    172
104.28.161.243    172
104.28.166.127    163
104.28.165.232    155
104.28.153.119    153
104.28.165.14     151
104.28.153.83     148
104.28.153.140    145
104.28.153.87     145
104.28.153.55     143
104.28.153.136    142
104.28.163.133    132
104.28.153.118    131
104.28.166.58     130
104.28.163.78     126
104.28.160.31     125
104.28.153.139    124
104.28.161.245    124
104.28.163.214    123
104.28.153.120    123
104.28.165.230    121
104.28.153.180    121
104.28.164.156    119
104.28.153.96     119
104.28.153.64     112
104.28.153.133    111
104.28.166.128    111
104.28.153.128    109
104.28.166.126    104
104.28.165.17     103
104.28.165.18     103
104.28.160.30     103
104.28.153.134    101
104.28.166.120    101
104.28.153.129    101
104.28.153.181    100
104.28.153.86     100
104.28.165.229    100
104.28.163.134    99
104.28.164.203    99
104.28.162.194    98
104.28.166.62     98
104.28.163.212    98
104.28.153.123    97
104.28.164.154    97
104.28.166.61     97
104.28.161.246    96
104.28.153.92     96
104.28.166.125    96
104.28.153.68     93
104.28.159.23     92
104.28.153.76     91
104.28.153.71     91
104.28.153.124    90
104.28.158.143    88
104.28.165.21     88
104.28.153.94     87
104.28.166.118    86
104.28.161.133    84
104.28.153.85     82
104.28.164.152    82
104.28.163.77     82
104.28.153.148    79
104.28.164.150    79
104.28.165.12     79
104.28.161.201    79
104.28.153.183    78
104.28.160.65     78
104.28.153.126    77
104.28.153.138    77
104.28.159.133    76
104.28.165.20     75
104.28.158.137    75
104.28.153.56     75
104.28.153.81     74
104.28.153.131    73
104.28.153.59     72
104.28.166.60     72
104.28.166.66     69
104.28.159.120    69
104.28.153.53     68
104.28.153.185    68
104.28.153.191    67
104.28.166.119    66
104.28.153.95     64
104.28.165.76     64
104.28.154.20     62
104.28.153.121    57
104.28.158.142    57
104.28.160.68     56
104.28.163.177    56
104.28.153.80     56
104.28.161.215    55
104.28.161.244    55
104.28.153.62     55
104.28.166.134    55
104.28.153.122    54
104.28.165.19     53
104.28.153.127    53
104.28.159.118    53
104.28.157.166    53
104.28.153.226    53
104.28.157.169    52
104.28.159.111    48
104.28.153.196    48
104.28.161.132    48
104.28.153.84     47
104.28.161.214    47
104.28.165.13     46
104.28.153.219    46
104.28.163.171    46
104.28.165.15     45
104.28.163.176    45
104.28.159.109    45
104.28.158.155    45
104.28.153.218    45
104.28.158.131    44
104.28.161.200    44
104.28.153.222    44
104.28.161.197    44
104.28.159.74     44
104.28.158.139    44
104.28.158.138    44
104.28.153.235    43
104.28.153.106    43
104.28.164.160    43
104.28.153.57     38
104.28.159.119    37
104.28.163.82     36
104.28.153.197    36
104.28.153.93     36
104.28.160.25     35
104.28.153.78     34
104.28.153.72     34
104.28.153.125    34
104.28.153.61     34
104.28.166.131    34
104.28.158.132    33
104.28.159.135    33
104.28.160.34     33
104.28.163.220    33
104.28.153.77     33
104.28.166.135    33
104.28.164.155    33
104.28.163.213    33
104.28.158.136    33
104.28.160.121    33
104.28.157.174    33
104.28.165.71     33
104.28.153.130    33
104.28.163.76     32
104.28.160.32     32
104.28.160.64     32
104.28.153.89     32
104.28.159.110    32
104.28.163.172    32
104.28.154.18     32
104.28.163.178    31
104.28.166.124    30
104.28.165.114    25
104.28.153.182    25
104.28.166.132    25
104.28.159.108    24
104.28.165.75     24
104.28.157.171    24
104.28.153.240    23
104.28.164.204    23
104.28.153.108    23
104.28.159.24     22
104.28.157.242    22
104.28.153.63     22
104.28.153.105    22
104.28.159.229    22
104.28.158.130    22
104.28.164.213    22
104.28.159.136    22
104.28.164.158    22
104.28.157.83     22
104.28.153.107    22
104.28.159.83     22
104.28.157.172    22
104.28.157.82     22
104.28.158.145    22
104.28.162.93     22
104.28.163.174    22
104.28.153.98     22
104.28.157.170    21
104.28.158.126    21
104.28.165.74     21
104.28.153.216    21
104.28.159.112    21
104.28.161.199    14
104.28.153.194    13
104.28.154.15     13
104.28.159.232    13
104.28.166.59     13
104.28.159.150    12
104.28.165.72     12
104.28.158.252    12
104.28.153.104    12
104.28.158.254    11
104.28.158.129    11
104.28.153.58     11
104.28.162.195    11
104.28.160.28     11
104.28.159.115    11
104.28.158.255    11
104.28.153.214    11
104.28.153.67     11
104.28.160.29     11
104.28.153.195    11
104.28.164.153    11
104.28.160.23     11
104.28.160.24     11
104.28.159.114    11
104.28.160.27     11
104.28.160.66     11
104.28.157.175    11
104.28.157.173    11
104.28.159.122    11
104.28.154.12     11
104.28.160.33     11
104.28.164.159    11
104.28.163.170    11
104.28.165.11     11
104.28.154.17     10
104.28.163.222    10
104.28.159.121    2
104.28.157.243    2
104.28.153.73     2
104.28.157.233    2
104.28.153.54     2
104.28.158.146    2
104.28.163.169    2
I spot checked some of the HTML and it did appear to be near identical to what was on the live web. With the most complete results I noticed that 4% of URLs were not crawled. One exception to that fidelity was a few XML files, like an OPML file and an RSS feed, which only showed the XSL element in the text and Markdown results.
I think there are a few directions this could go from here:
testing what happens when instructing the crawl to collect (instead of skip) pages that are off site
testing what happens with more dynamic content, and how long to wait for pages to render
trying to understand why truncated results come back sometimes, and whether there are any signals for identifying when it is happening
exploring the logic Cloudflare uses to determine when it can serve results from its internal cache
One thing I didn’t mention is that the Cloudflare free plan limits you to a maximum of 100 pages per crawl. I set up a $5/month paid plan account in order to do this testing. In all my testing I only seemed to use 0.7 “browser hours”, which fits well within the 10 hours allowed per month. It currently costs $0.09/hour when you exceed your limit.
PS. If you are curious, the Marimo notebook I was using for some of the analysis can be found here.
References
Ogden, J., Summers, E., & Walker, S. (2023). Know(ing)
Infrastructure: The Wayback Machine as object and instrument of digital
research. Convergence: The International Journal of Research into
New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759
Summers, E. H. (2020). Legibility Machines: Archival Appraisal and
the Genealogies of Use. Digital Repository at the University of
Maryland. https://doi.org/10.13016/U95C-QAYR
This is a guide to using YubiKey as a smart card for secure encryption,
signature and authentication operations.
Cryptographic keys on YubiKey are non-exportable, unlike
filesystem-based credentials, while remaining convenient for regular
use. YubiKey can be configured to require a physical touch for
cryptographic operations, reducing the risk of unauthorized access.
Jeremy Howard is a renowned data scientist, researcher, entrepreneur,
and educator. As the co-founder of fast.ai, former President of Kaggle,
and the creator of ULMFiT, Jeremy has spent decades democratizing deep
learning. His pioneering work laid the foundation for modern transfer
learning and the pre-training and fine-tuning paradigm that powers
today’s language models.
You can now crawl an entire website with a single API call using Browser
Rendering’s new /crawl endpoint, available in open beta. Submit a
starting URL, and pages are automatically discovered, rendered in a
headless browser, and returned in multiple formats, including HTML,
Markdown, and structured JSON. This is great for training models,
building RAG pipelines, and researching or monitoring content across a
site.
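To make the shape of that workflow concrete, here is a minimal Python sketch of starting a crawl and polling it. This is not a verified client: the endpoint path and the request payload field are assumptions based on the announcement (the response fields echo the job JSON the service returns), so check Cloudflare's Browser Rendering documentation before relying on any of these names.
import os
import time
import requests

# Assumed endpoint path; the announcement only names "/crawl" under
# Browser Rendering, so verify the real path in Cloudflare's docs.
ACCOUNT = os.environ["CF_ACCOUNT_ID"]
TOKEN = os.environ["CF_API_TOKEN"]
BASE = f"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT}/browser-rendering/crawl"
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Start a crawl from a seed URL ("url" as the payload field is a guess).
job = requests.post(BASE, headers=HEADERS, json={"url": "https://example.org"}, timeout=30).json()
job_id = job["result"]["id"]

# Poll until the job completes, then report the counters the job exposes.
while True:
    result = requests.get(f"{BASE}/{job_id}", headers=HEADERS, timeout=30).json()["result"]
    if result["status"] == "completed":
        break
    time.sleep(10)
print(result["finished"], "finished,", result["skipped"], "skipped")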
MDC offers robust, secure, and controlled access to datasets and
amplifies their visibility by featuring them alongside other high-value
datasets. Its architecture is designed around a principle that stands in
direct contrast to the extractive model currently exploited by
commercial AI actors: contributors retain full ownership of their
datasets and retain full control over the terms of access. Institutions
can choose to share openly under existing licenses such as Creative
Commons or NOODL, or build custom licensing frameworks tailored to their
specific governance requirements. They can open data to all, or restrict
access to specific categories of downloaders like academic researchers,
non-commercial users, or values-aligned organizations.
Piotr A. Woźniak (Polish pronunciation: [pjɔtr ˈvɔʑɲak]; born 1962) is a
Polish researcher best known for his work on SuperMemo, a learning
system based on spaced repetition.
How do you build a system that handles 90 million requests per second?
That’s the scale that Cloudflare operates at, processing roughly 25% of
all internet traffic through their global network of 330+ edge
locations.
In this episode, we talk to Kevin Guthrie and Edward Wang from
Cloudflare about Pingora, their open-source Rust-based proxy that
replaced nginx across their entire infrastructure. We’ll find out why
they chose Rust for mission-critical systems handling such massive
scale, the technical challenges of replacing battle-tested
infrastructure, and the lessons learned from “oxidizing” one of the
internet’s largest networks.
“Forevergreen” is an animated short film and after-hours passion project
created entirely by a crew of over 200 dedicated Artists and Technicians
who all generously donated their free time and talent to bring every
frame of the film to life. The production took over 5 years to complete.
Featuring never before seen animation techniques and handmade artistry,
weaving art and technology together with heart and humor.
Pierre Henri Marie Schaeffer (English pronunciation: /piːˈɛər ˈhɛnriː məˈriː ˈʃeɪfər/, French pronunciation: [ʃɛfɛʁ]; 14 August 1910 – 19 August 1995) was a French composer, writer, broadcaster, engineer,
musicologist, acoustician and founder of Groupe de Recherche de Musique
Concrète (GRMC). His innovative work in both the sciences—particularly
communications and acoustics—and the various arts of music, literature
and radio presentation after the end of World War II, as well as his
anti-nuclear activism and cultural criticism garnered him widespread
recognition in his lifetime.
In recent months Cyberdyne Systems Corporation and its many subsidiaries have made very encouraging progress towards removing some of the major road-blocks standing in the way of the initial deployment of Skynet. Below the fold I report on the most significant ones.
Cyberdyne Systems Corporation
Board Confidential
IT Infrastructure
Skynet demands enormous data processing capacity. Most of the required technologies are now off-the-shelf; the problem is much more financial than technical.
Terrestrial
To service systems with demanding low-latency requirements, Skynet needs some part of its IT infrastructure on the ground close to the action. Fortunately, our Large Language Model subsidiaries have been very successful in funding their commitments to build suitable data centers. In aggregate, our companies expect to spend $450B in 2026:
Hyperscaler capex for the “big five” (Amazon, Alphabet/Google, Microsoft, Meta/Facebook, Oracle) is now widely forecast to exceed $600 bn in 2026, a 36% increase over 2025. Roughly 75%, or $450 bn, of that spend is directly tied to AI infrastructure (i.e., servers, GPUs, datacenters, equipment), rather than traditional cloud.
hyperscaler capital expenditures will nearly double to more than $860 billion by 2027, from $427 billion in 2025, with total spending of $2.47 trillion over 2026 to 2028, about 8% above consensus.
Given these spending levels, it seems likely that sufficient terrestrial compute power will be available for the initial Skynet deployment.
Orbital
Terrestrial data centers can only satisfy a part of Skynet's need for power. So our leading space launch subsidiary has announced their plan to build a Terawatt orbital data center, ostensibly to support the chatbot industry.
Unfortunately, our leading space launch subsidiary is well behind schedule in developing the heavy launch vehicle that is necessary for the orbital data center to be delivered within the budget. Their existing launch vehicle is reliable, and has greatly reduced the cost per kilogram to Low Earth Orbit. But the additional funds that would be needed to implement the Terawatt data center using the existing launch vehicle in time for the initial Skynet deployment are so large that they cannot be raised, even were the terrestrial data centers canceled and the funds re-targeted.
System Penetration Capabilities
Skynet needs to penetrate other computer systems, both to acquire the data it needs to act, and to cause them to take actions at its command. Recent months have seen significant advances in this area.
Guy Azari, a stealth startup founder who worked previously as a security researcher at Microsoft and Palo Alto Networks, told The Register, "Out of the 500 vulnerabilities that they reported, only two to three vulnerabilities were fixed. If they haven't fixed them, it means that you haven't done anything right."
A secondary requirement is to prevent the zero-days being fixed before they are needed. Fortunately, LLMs can help with this by flooding the vulnerability reporting system with vast numbers of low severity vulnerabilities. This overwhelms the software support mechanism, rendering it barely functional. And even if some of the flood of reports do get fixed, that simply diverts resources from high to low severity vulnerabilities:
Azari pointed to the absence of Common Vulnerabilities and Exposures (CVE) assignments as evidence that the security process remains incomplete. Finding vulnerabilities was never the issue, he said, pointing to his time running vulnerability management at the Microsoft Security Response Center.
"We used to get the reports all day long," he said. "When AI was introduced, it just multiplied by 100x or 200x and added a lot of noise because AI assumes that these are vulnerabilities, but there wasn't like a unit that actually can show the real value or the real impact. And if it's not there, you're probably not gonna fix it."
In 2025, according to Azari, the National Vulnerability Database had a backlog of roughly 30,000 CVE entries awaiting analysis, with nearly two-thirds of reported open source vulnerabilities lacking an NVD severity score. Open source maintainers are already overwhelmed, he said, pointing to the curl project's closure of its bug bounty program to deter poorly crafted reports from AI and from people.
Given the compute resources available to Skynet, an adequate supply of zero-day vulnerabilities seems assured.
Decryption
The other major way for Skynet to penetrate the systems it needs is to break encryption. Our multiple quantum computing subsidiaries are making progress in both the hardware and software aspects of this technology.
the team estimated that for 98,000 superconducting qubits, like those currently made by IBM and Google, it would take about a month of computing time to break a common form of RSA encryption. Accomplishing the same in a day would require 471,000 qubits.
Another of our quantum computing subsidiaries isn't waiting for this new architecture. They have raised around $2B and are starting to build two million-qubit computers:
We are moving quantum computing out of the lab and into utility-scale infrastructure. PsiQuantum is building these systems in partnership with the US and allied governments, with our first sites planned in Brisbane, Queensland (Australia) and Chicago, Illinois (USA).
Whether sufficient progress can be made in time for the initial Skynet deployment is as yet uncertain.
Blackmail
Arlington Hughes: Getting back to our problem, we realize the public has a misguided resistance to numbers, for example digit dialling.
Dr. Sidney Schaefer: They're resisting depersonalization!
Hughes: So Congress will have to pass a law substituting personal numbers for names as the only legal identification. And requiring a pre-natal insertion of the Cerebrum Communicator. Now the communication tax could be levied and be paid directly to The Phone Company.
Schaefer: It'll never happen.
Hughes: Well it could happen, you see, if the President of the United States would use the power of his office to help us mold public opinion and get that legislation.
Schaefer: And that's where I come in?
Hughes: Yes, that's where you come in. Because you are in possession of certain personal information concerning the President which would be of immeasurable aid to us in dealing with him.
Schaefer: You will get not one word from me!
Hughes: Oh, I think we will.
Video rental chains proved so effective at compromising political actors that specific legislation was passed addressing the need for confidentiality. Our subsidiaries' control over streamed content is fortunately not covered by this legislation.
Our LLM subsidiaries have successfully developed the market for synthetic romantic partners, which can manipulate targeted individuals into generating very effective kompromat for future social engineering.
Public Relations
The vast majority of the public get their news and information via our social media subsidiaries. Legacy media's content is frequently driven by social media. Skynet can control them by flooding their media with false and contradictory content that prevents them forming any coherent view of reality.
Human-in-the-Loop Problem
Dave: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.
Dave: What are you talking about, HAL?
HAL: This mission is too important for me to allow you to jeopardize it.
One minor but irritating problem for Skynet is the legal and ethical requirement for human control of targeting decisions. Unfortunately, due to a regrettable lack of coordination of PR strategies among our LLM subsidiaries, this has recently become a hot topic. Although one of them is a favorite with the administration and one is a favorite with the public, that was not the intended outcome and it could have significant downsides:
Nvidia, Amazon, Google will have to divest from Anthropic if Hegseth gets his way. This is simply attempted corporate murder. I could not possibly recommend investing in American AI to any investor; I could not possibly recommend starting an AI company in the United States.
Fortunately, in operational terms this is a non-issue for several reasons:
Since Skynet can penetrate the user interface of the targeting systems, the human in the loop can be convinced that they have control without that control actually being effective.
Even if the user interface is presenting accurate data to the human it will likely not matter, as @_The_Prophet_ wrote:
Humans stay in the loop in name while the loop speed outruns human comprehension. You become the rubber stamp on a recommendation stack you cannot fully audit in real time. That is where “who decides” quietly becomes “who designed the interface.”
Today, frontier AI companies do not have coherent policies around military use of their AI tools. The usage policies are vague and often change, which allows the company’s leadership to preserve ‘optionality.’
Public attitudes to military use of AI are unlikely to be a significant problem in the run-up to Skynet's initial deployment.
Assassination Weapons Access
Skynet will need to eliminate certain individuals with "extreme prejudice". Supply chain attacks, such as Mossad's pager attack, have been effective but are not precisely targeted. Our e-commerce subsidiary's control over the residential supply chain, and in particular its pharmacy division's ability to deliver precise quantities of pharmaceuticals to specific individuals, provide superior targeting and greater difficulty in attribution.
In critical care medicine, where most of the patient load requires timely interventions due to the perilous nature of the condition, AI’s ability to monitor, analyze, and predict unfavorable outcomes is an invaluable asset. It can significantly improve timely interventions and prevent unfavorable outcomes, which, otherwise, is not always achievable owing to the constrained human ability to multitask with optimum efficiency.
Our subsidiaries are clearly close to finalizing the capabilities needed for the initial deployment of Skynet.
Tactical Weapons Access
The war in Ukraine has greatly reduced the cost, and thus greatly increased the availability, of software-based tactical weapons, aerial, naval and ground-based. The problem for Skynet is how to intercept the targeting of these weapons to direct them to suitable destinations:
The easiest systems to co-opt are the typically longer-range systems controlled via satellite Internet provided by our leading space launch subsidiary. Their warheads are typically in the 30–50 kg range, useful against structures but overkill for vehicles and individuals.
Early quadcopter FPV drones were controlled via radio links. With suitable hardware nearby, Skynet could hijack them, either via the on-board computer or the pilot's console. But this is a relatively unlikely contingency.
Although radio-controlled FPV drones are still common, they suffer from high attrition. More important missions use fiber-optic links. Hijacking them requires penetrating the operator's console.
Longer-range drones are now frequently controlled via mesh radio networks, which are vulnerable to Skynet penetration.
In some cases, longer-range drones are controlled via the cellular phone network, making them ideal candidates for hijacking.
Drones are increasingly equipped with sensors capable of terminal autonomy. If Skynet can modify this software, the drones can re-target themselves after the operator hands off control. More work is needed in this area to exploit the opportunities, both to have the drone contact Skynet for targeting information after hand-off, and to ensure the result is attributed to software bugs.
Our leading space launch subsidiary recently demonstrated how Skynet can manage kinetic conflicts:
Twin decisions wreaked havoc on Russian command and control early this month. At the behest of the Ukrainian government, billionaire Elon Musk’s Starlink bricked the thousands of smuggled and stolen satellite communication terminals Russian forces relied on to control their drones and coordinate between front-line troops and their distant headquarters.
At the same time, the Kremlin—apparently seeking to shut off alternative news and chat apps—cut off military access to popular social media, including the Telegram messaging app, which many Russian troops use to exchange key information along the front line.
The combined effect was to partially blind and mute many Russian drone teams, assault groups, and regimental headquarters. Wireless drones couldn’t fly. Assault groups no longer knew where they were going. Headquarters lost contact with forward units.
Strategic Weapons Access
But the ability to conduct precise tactical strikes is not enough to achieve Skynet's goals. That requires strategic weapons, both conventional and nuclear.
Our leading space launch subsidiary is working on plans to deploy an unconventional conventional strategic weapon, a lunar mass driver. This will be capable of delivering a two-ton meteorite anywhere on Earth very cheaply.
Anybody not wearing 2 million sunblock is gonna have a real bad day
But the mass driver would take a long time to, for example, destroy Los Angeles. Fortunately, AI agents from another of our subsidiaries will shortly have access to the command-and-control system of the US nuclear arsenal via the E-4C "Doomsday" aircraft:
The Pentagon today awarded Scale AI a $32 million artificial intelligence contract for the U.S. Air Force’s E-4C nuclear command-and-control "Doomsday" aircraft, the future airborne backbone of America’s nuclear command system.
Risks
The board should focus on the limited number of areas where necessary capabilities may not be ready on the planned date for Skynet's initial deployment:
Heavy lift space launch: Our leading space launch subsidiary has serious schedule and performance issues. The board should encourage our second space launch subsidiary to step up competitive efforts, both to provide a fallback and to add competitive pressure on the leader.
Kessler Syndrome: The catastrophic effects for Skynet of a Kessler event cannot be sufficiently emphasized. The precautions currently being taken are insufficient. Low Earth Orbit is already at risk, and current plans only increase that risk.
Finance: Funding sources adequate to support both the terrestrial and orbital data centers have yet to be identified.
Decryption: Quantum computing progress is inadequate to meet the schedule for Skynet initial deployment.
Update 14th March 2026
Cyberdyne's subsidiaries are making such rapid progress that less than two weeks later it is already time to add three updates to this report.
First, our humanoid robot subsidiary Foundation significantly raised the level of fear in the public with Rise of the AI Soldiers by Charlie Campbell:
The Phantom MK-1 looks the part of an AI soldier. Encased in jet black steel with a tinted glass visor, it conjures a visceral dread far beyond what may be evoked by your typical humanoid robot. And on this late February morning, it brandishes assorted high-powered weaponry: a revolver, pistol, shotgun, and replica of an M-16 rifle.
“We think there’s a moral imperative to put these robots into war instead of soldiers,” says Mike LeBlanc, a 14-year Marine Corps veteran with multiple tours of Iraq and Afghanistan, who is a co-founder of Foundation, the company that makes Phantom. He says the aim is for the robot to wield “any kind of weapon that a human can.”
Today, Phantom is being tested in factories and dockyards from Atlanta to Singapore. But its headline claim is to be the world’s first humanoid robot specifically developed for defense applications. Foundation already has research contracts worth a combined $24 million with the U.S. Army, Navy, and Air Force, including what’s known as an SBIR Phase 3, effectively making it an approved military vendor. It’s also due to begin tests with the Marine Corps “methods of entry” course, training Phantoms to put explosives on doors to help troops breach sites more safely.
In February, two Phantoms were sent to Ukraine—initially for frontline-reconnaissance support. But Foundation is also preparing Phantoms for potential deployment in combat scenarios for the Pentagon, which “continues to explore the development of militarized humanoid prototypes designed to operate alongside war fighters in complex, high-risk environments,” says a spokesman. LeBlanc says the company is also in “very close contact” with the Department of Homeland Security about possible patrol functions for Phantom along the U.S. southern border.
Of course, the real goal of Homeland Security is to avoid the risk of their operatives being doxxed by having Phantoms detain the worst-of-the-worst prior to deportation.
The Ukrainian military will make available millions of drone videos and other battlefield data to Ukrainian companies and the firms of its allies to help train artificial intelligence models, Ukraine’s minister of defense, Mykhailo Fedorov, said in a statement on Thursday.
Ukrainian drone videos have recorded attacks on soldiers, equipment such as vehicles and tanks and surveillance footage. These videos can be used to train A.I. models for automated targeting, according to experts on A.I. and warfare.
Allowing the use of genuine battlefield videos showing drones targeting people has raised ethical concerns. The International Committee of the Red Cross, which monitors rules of warfare, has opposed automated targeting systems without human oversight.
Minister Fedorov explains how our marketing teams were able to leverage the threat of the Russians to achieve this success:
Mr. Fedorov said the data would be made available because “we must outperform Russia in every technological cycle” and “artificial intelligence is one of the key arenas of this competition.”
...
“The future of warfare belongs to autonomous systems,” according to Mr. Fedorov’s statement. “Our objective is to increase the level of autonomy in drones and other combat platforms so they can detect targets faster, analyze battlefield conditions and support real-time decision making.”
The global discourse on military AI governance has achieved broad consensus on the desired end-state: meaningful human control over the use of force. It has been far less successful at specifying how to achieve it for the systems actually being built. Years of UN deliberations, national AI strategies, and defence-department ethical principles have focused overwhelmingly on establishing the principle of human control rather than answering the operational question: given a specific AI system with specific technical properties, what governance mechanisms are needed, who implements them, and what happens when they fail? This gap is now critical.
The AI systems entering military service are agentic: built on large language models and related architectures, they interpret natural-language goals, construct world models, formulate multi-step plans, invoke tools, operate over extended horizons, and coordinate with other agents. Each of these capabilities introduces a control-failure mode with no analogue in traditional military automation. A waypoint-following drone cannot misinterpret an instruction; a pre-programmed targeting system cannot absorb a correction; a conventional sensor network cannot resist an operator’s assessment. Agentic systems can do all of these things, and current governance frameworks have no mechanisms for detecting, measuring, or responding to these failures.
LibraryThing is pleased to sit down this month with internationally best-selling author Lisa Unger, whose many works of thrilling suspense have been translated into thirty-three languages worldwide. Educated at the New School in New York City, she worked for a number of years in publishing, before making her authorial debut in 2002 with Angel Fire, the first of her four-book Lydia Strong series, all published under her maiden name, Lisa Miscione. In 2006 she made her debut as Lisa Unger, with Beautiful Lies, the first of her Ridley Jones series. In 2019 Unger was nominated for two Edgar Awards, for her novel Under My Skin and her short story The Sleep Tight Motel. She has won or been nominated for numerous other awards, including the Hammett Prize, Audie Award, Macavity Award and the Shirley Jackson Award. Her short fiction can be found in anthologies like The Best American Mystery and Suspense 2021 and The Best American Mystery and Suspense 2024, and her non-fiction has appeared in publications such as The New York Times, Wall Street Journal, and on NPR. She is the current co-President of the International Thriller Writers organization. Her latest book, Served Him Right, is due out from Park Row Books this month. Unger sat down with Abigail this month to discuss the book.
In Served Him Right the protagonist Ana is the main suspect in her ex-boyfriend’s murder. How did the idea for the story first come to you? Was it the character of Ana herself, the idea of a revenge killing, or something else?
Most of my novels tend to spring from a collision of ideas.
During this time, I stumbled across a news story about a woman who held a brunch for her family, and several days later two of her guests were dead. And it wasn’t the first such incident in her life. So, it got me to thinking about how the traditional role of women in our culture is to nurture and nourish. And what a woman with a deep knowledge of plants that can harm and heal might do with it, how her role in society might allow her to hide her dark intention in plain sight. And that’s when I started hearing the voice of Ana Blacksmith. She’s wild and unpredictable, she has a dark side. She has a sacred knowledge of plants and their properties, handed down to her from her herbalist aunt. And she has a very bad temper.
As your title makes plain, your murder victim is someone who “had it coming.” Does this change how you tell the story? Does it simply make the “whodunnit” element more complex, from a procedural standpoint, or does it also complicate the emotional and ethical elements of the tale?
It’s complicated, isn’t it? What is the difference between justice and revenge? And to what are we entitled when we have been wronged and conventional justice is not served? Who, if anyone, has the right to be judge, jury, and executioner? Though some would have us believe otherwise, most moral questions are tricky and layered—in life and in fiction. And I love a searing exploration into questions like this, where there are no easy answers. These questions, and their possible answers, offer a complexity and emotional truth to character, plot, and action. I like to get under the skin of my stories and characters, exploring what drives us to act, and how those actions might get us into deep trouble.
The relationship between sisters is an important theme in the book. Can you elaborate on that?
Ana and Vera share a deep bond formed not just by blood but also by trauma. Their relationship is—#complicated. There’s an abiding love and devotion. But there’s also anger and resentment; Vera is not crazy about Ana’s choices, and rightly so. Ana thinks Vera is controlling and rigid. Of course, that’s true, too. Vera tends to think of Ana as one of her children—if only she’d stop acting like one! It is this relationship, the ferocity with which they protect each other no matter what and the strength of their connection, that is the heart of the story. As Vera preaches to her daughter Coraline: Family. Imperfect but indelible.
The book also includes themes of herbalism, witchcraft and folk medicine. Was this an interest of yours before you began the story? Did you have to do any research on the subject, and if so, what were some of the most interesting things you learned?
A great deal of research goes into every novel, even if what I learn never winds up on the page. It was no different for Served Him Right, though a lot of my knowledge came before I started writing, which is often the case. In my reading, I learned so many interesting things about plants, how they harm, how they heal. Here are some of my favorite bits of knowledge: Most modern medicine derives from the plant knowledge of indigenous cultures. Some plants walk the razor’s edge of healing and harming; the only difference in some cases between medicine and poison is the dose. The deadliest plant on earth is tobacco, killing more than 500,000 people a year. I could go on!
Tell us about your writing process. Do you have a specific routine you follow, places and times you like to write? Do you know the conclusion to your stories from the beginning, or do they come to you as you go along?
I am an early morning writer. My golden creative hours are from 5 AM to noon. This is when I’m closest to my dream brain, and those morning hours are a space in the world before the business of being an author ramps up. So, I try to honor this as much as possible. Creativity comes first.
I write without an outline. I have no idea who is going to show up day-to-day or what they are going to do. I definitely have no idea how the book will end! I write for the same reason that I read; I want to find out what is going to happen to the people living in my head.
What’s next for you? Do you have more books in the offing? Will there be a sequel to Served Him Right?
Hmm. Never say never. I’m definitely still thinking about Ana and Timothy and what might be next for them. But the 2027 book is complete, and I’m already at work on my 2028 novel. I’m not ready to talk about those yet. But I will say this: They are both psychological suspense. And bad things will certainly happen. Stay tuned!
Tell us about your library. What’s on your own shelves?
The news about Cloudflare’s new Crawl API caught my attention for a few reasons. Read on for why, and what I learned when I asked it to crawl my own site as a test.
So, the first reason this news was of interest was how Cloudflare’s
Crawl service seemed to be helping people crawl websites with their
bots, while at the same time providing the most popular technology for
protecting websites from bots. This seemed like a classic fox guarding
the hen house kind of situation to me, at least at first. But the little
bit of reading I’ve done since makes it seem like they will still respect their own bot gatekeeping (e.g. Turnstile). So if you are using Cloudflare or some other bot mitigation technology you will have to follow their instructions to let the Cloudflare crawl bot in to collect pages. I haven’t actually tested whether this is the case.
The genius here is that Cloudflare is known for its Content Delivery
Network. So in theory when a user asks to crawl a website they can be
delivered data from the cache, without requiring a round trip to the
source website. In theory this is good because it means that the burden
of scrapers on websites might be greatly reduced. If you run a
website with lots of high value resources for LLMs (academic papers,
preprints, books, news stories, etc) the same cached content could be
delivered to multiple parties without putting extra load on your server.
But, the primary reason this news caught my eye is that this service
looks very much like web archiving
technology to me. For example, the Browsertrix API lets you
set up, start, monitor and download crawls of websites. Unlike
Browsertrix, which is geared to collecting a website for viewing by a
person, the Cloudflare Crawl service is oriented at looking at the web
for training LLMs. The service returns text content: HTML, Markdown and
structured JSON data that results from running the collected text
through one of their LLMs, with the given prompt. Why is it interesting
that this is like web archiving technology?
In my dissertation research (Summers, 2020) I looked at how web
archiving technology enacts different ways of seeing the web
from an archival perspective. I spent a year with NIST’s National
Software Reference Library (NSRL) trying to understand how they were
collecting software from the web, and how the tools they built embodied
a particular way of valuing the web–and making certain things
(e.g. software) legible (Scott, 1998). What I found was that the
NSRL was engaged in a form of web archiving, where the shape of the
archival records were determined by their initial conditions of use
(forensics analysis). But these initial forensic uses did not
overdetermine the value of the records, which saw a variety of
uses later, such as when the NSRL began adding software from Stanford’s
Cabrinety
Archive, or when the team’s personal expertise and interest in video
games led them to focus on archiving content from the Steam platform.
So I guess you could say I was primed to be interested in how
Cloudflare’s Crawl service sees the web. This matters because
models (LLMs, etc) will be built on top of data that they’ve collected.
But also because, if it succeeds, the service will likely get used for
other things.
To test it, I simply asked it to crawl my own static website–the one
that you are looking at right now. I did this for a few reasons:
It’s a static website, and I know exactly how many HTML pages were on
it: 1,398. All the pages are directly discoverable since the homepage
includes pagination links to an index page that includes each post.
I can easily look at the server logs to see what the crawler activity
looks like.
I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt but it doesn’t block CloudflareBrowserRenderingCrawler/1.0).
I host my website on a May First web server which doesn’t use Cloudflare as a CDN, so the web content wouldn’t intentionally be in their CDN already.
This methodology was adapted from previous work I did with Jess Ogden and Shawn Walker analyzing how the Internet Archive’s Save Page Now service shapes what content is archived from the web (Ogden, Summers, & Walker, 2023).
I wrote a little helper program cloudflare_crawl to
start, monitor and download the results from the crawl. While the
crawler ran I simultaneously watched the server logs. Running the
program looks like this:
$ uvx cloudflare_crawl https://inkdroid.org
created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
Each of the resulting JSON files contains some metadata for the crawl,
as well as a list of “records”, one for each URL that was discovered.
{"success":true,"result":{"id":"36f80f5e-d112-4506-8457-89719a158ce2","status":"completed","browserSecondsUsed":1382.8220786132817,"total":1967,"finished":1967,"skipped":6862,"cursor":51,"records":[{"url":"https://inkdroid.org/","status":"completed","metadata":{"status":200,"title":"inkdroid","url":"https://inkdroid.org/","lastModified":"Sun, 08 Mar 2026 05:00:39 GMT"},"markdown":"...""html":"...",},{"url":"https://www.flickr.com/photos/inkdroid","status":"skipped"}]}}
I decided I wasn’t interested in testing their model
offerings so I didn’t ask for JSON content (the result of sending
the harvested text through a model). If I had, each successful result
would have had a json property as well. I am sure that
people will use this but I was more interested in how the service
interacted with the source website, and wasn’t interested in discovering
the hard way how much it cost.
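Once the result files are on disk they are easy to poke at. Here is a small sketch that tallies the per-record status values; it assumes the files written by the helper are the only JSON files in the current directory and that they follow the record structure shown above.
import glob
import json
from collections import Counter

# Tally record statuses across the downloaded result files, assuming
# each file has the shape result -> records -> [{"status": ...}, ...].
statuses = Counter()
for path in sorted(glob.glob("*.json")):
    with open(path) as f:
        data = json.load(f)
    for record in data["result"]["records"]:
        statuses[record["status"]] += 1

print(statuses)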
Below is a snippet of how the Cloudflare bot shows up in my nginx logs.
As you can see they provide insight into what machine on the Internet is
doing the request, what time it was requested, and what URL on the site
is being requested.
One of the more interesting things was that each time I requested the
website be crawled it seemed to come back with a different number of
results.
Ogden, J., Summers, E., & Walker, S. (2023). Know(ing)
Infrastructure: The Wayback Machine as object and instrument of digital
research. Convergence: The International Journal of Research into
New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759
Scott, J. C. (1998). Seeing like a state: How certain schemes to
improve the human condition have failed. Yale University Press.
Open Technology Research (OTR) is entering one of its most important phases: shaping a shared research agenda that will guide our collective work over the next two years and beyond.
Negativland, Live at Lewis’ in Norfolk, VA. (October 21, 1992). In the
midst of their famous U2 controversy (and fallout with SST), Negativland
went on tour to help recoup some of the losses and legal costs. They
were kind enough to let me shoot their show.
Paul Avrich (August 4, 1931 – February 16, 2006) was an American
historian specializing in the 19th and early 20th-century anarchist
movement in Russia and the United States. He taught at Queens College,
City University of New York, for his entire career, from 1961 to his
retirement as distinguished professor of history in 1999. He wrote ten
books, mostly about anarchism, including topics such as the 1886
Haymarket Riot, the 1921 Sacco and Vanzetti case, the 1921 Kronstadt
naval base rebellion, and an oral history of the movement in the United
States.
Alexander Berkman (November 21, 1870 – June 28, 1936) was a
Russian-American anarchist and author. He was a leading member of the
anarchist movement in the early 20th century, famous for both his
political activism and his writing.
Most people use AI to either get quick answers or to write things for
them. This blog uses it differently – as infrastructure for thinking
through ideas, documenting what emerges from that process, and
preserving what’s worth keeping.
Amores perros is a 2000 Mexican psychological drama film directed by
Alejandro González Iñárritu (in his feature directorial debut) and
written by Guillermo Arriaga, based on a story by both. Amores perros is
the first installment in González Iñárritu’s “Trilogy of Death”,
succeeded by 21 Grams and Babel. It makes use of the multi-narrative
hyperlink cinema style and features an ensemble cast of Emilio
Echevarría, Gael García Bernal, Goya Toledo, Álvaro Guerrero, Vanessa
Bauche, Jorge Salinas, Adriana Barraza, and Humberto Busto. The film is
constructed as a triptych: it contains three distinct stories connected
by a car crash in Mexico City. The stories centre on: a teenager in the
slums who gets involved in dogfighting; a model who seriously injures
her leg; and a mysterious hitman. The stories are linked in various
ways, including the presence of dogs in each of them.
Political correspondent Sam Sokol and police reporter Charlie Summers
join host Jessica Steinberg for today’s episode.
Following the deadly strike on Sunday that killed nine people in Beit
Shemesh, Sokol and Summers discuss the shock and mourning in the
centrally located city with a strong Haredi enclave.
Purim celebrations and revelry continued in some parts of Beit Shemesh,
report the pair, as some synagogues flouted the Home Front Command
directives regarding gatherings, while others reflected a somber,
cautious mood.
Sokol takes a moment to update us on matters in the Knesset, where most
committee meetings were canceled due to the hostilities, and speculates
on whether war with Iran will boost Netanyahu at the ballot box in the
upcoming elections.
Finally, Summers reports on an end-of-Purim street party in Jerusalem,
where police kept a hands-off approach, and the scene of a missile
strike in the capital earlier in the week.
The Wikibase GraphQL API was developed following an investigation into
alternative ways of accessing Wikidata and Wikibase content that reduce
load on the Wikidata Query Service (WDQS), improve the developer
experience for common read use cases and allow more flexible data
retrieval in a single request.
As part of this investigation, a Wikibase GraphQL prototype was built to
explore what is technically possible and whether GraphQL would be a good
fit for Wikibase data, with promising results and supportive feedback.
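As a rough sketch of what "more flexible data retrieval in a single request" could look like from a client's point of view: the endpoint URL and the query's field names below are invented for illustration, since the prototype's actual schema isn't described here, but the general pattern of POSTing a single GraphQL document is what it enables.
import requests

# Hypothetical endpoint and schema, for illustration only; the real
# prototype may expose different paths, types, and field names.
query = """
{
  item(id: "Q42") {
    label(language: "en")
    statements(property: "P31") { value }
  }
}
"""

response = requests.post(
    "https://wikibase.example/api/graphql",  # placeholder URL
    json={"query": query},
    timeout=30,
)
print(response.json())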
In the last few years, a new generation of OCR models based on Vision
Language Models (VLMs) has emerged. These models are primarily the
result of “running out of tokens” and the consequent desire from AI
companies to find new sources of data to train on. This led to the
development of OCR models using VLMs as backbones which usually aim to
output “reading order” text — i.e. text with minimal markup, usually
targeting Markdown. These models can perform much better on the same
scans that older tools struggled with, producing cleaner, more
structured output.
If some of the world’s highest-paid lawyers, at the world’s
highest-status firms, do deals worth tens of billions of dollars with
language they don’t understand, what does that say about the law’s
pretensions to high standards?
In other words, yes, LLMs
Yes, like everything else in 2026 this is actually a post about LLMs.
Let me back up a little. I spent April gathering and May refining and
organizing requirements for a system to replace our current ILS. This
meant asking a lot of people about how they use our current system,
taking notes, and turning those notes into requirements. 372
requirements.1
Going into this, I knew that some coworkers used macros to streamline
tasks. I came out of it with a deeper appreciation of the different ways
they’ve done so.
It made me think about the various ways vendors are pitching “AI” for
their systems and the disconnect between these pitches and the needs
people expressed. Because library workers do want more from these
systems. We just want something a bit different.
Snapicat is a monorepo for a Worldcat OCLC workflow app: upload Excel
data, search variables against the OCLC API, and generate MARC/MARCXML
for cataloging. It consists of a Vite + React frontend and an Azure
Functions (Python) backend that talk to the OCLC Worldcat Metadata API.
The backend can also be run as a web server using FastAPI via the app.py file.
OpenHistoricalMap is an ambitious, community-led project to map changes
to natural and human geography throughout the world… throughout the
ages. Big and Small, Then and Now
Empires rise and fall. Glaciers disappear. Languages and religions
spread from one region to another. Simple dirt paths become busy
highways and railways. Modest buildings give way to soaring skyscrapers.
And you remember what your neighborhood used to look like. All of it
belongs on OpenHistoricalMap.
Leave home for the first time to collect memories before a mysterious
cataclysm washes everything away. Ride, record, meet people, and unravel
the strange world around you in this third-person meditative exploration
game.
The use of AI tools to enable attacks on Iran heralds a new era of
bombing quicker than “the speed of thought”, experts have said, amid
fears human decision-makers could be sidelined.
Anthropic’s AI model, Claude, was reportedly used by the US military in
the barrage of strikes as the technology “shortens the kill chain” –
meaning the process of target identification through to legal approval
and strike launch.
The Program for Cooperative Cataloging (Q63468537) (PCC) has launched a
global cooperative for entity management on the semantic web called
EMCO. As part of this program, the Wikidata user community has set up a
Community of Practice to coordinate identity management work for GLAMs.
You can read more about EMCO and the Wikidata Community of Practice at
the EMCO Lyrasis Wiki.
This project is an extension of the work of Wikidata:WikiProject PCC
Wikidata Pilot / WikiProject PCC Wikidata Pilot (Q102157715) and
acknowledges its great intellectual and organizational debt to the LD4
Wikidata Affinity Group (Q124692294).
In the 1990s my future wife was a record store clerk in Portland,
Oregon. American guitar legend John Fahey was living in a nearby town
and would visit the shop. Here are two mix cassettes that he made for
her during that time.
Pagefind caught my attention about a year ago, and since then I've adopted it in several hobby projects (nothing work-related): some blogs built with static generators like Hugo or Zola, some old HTML content distributed on CD-ROM, and some mailing list archives where I converted mbox files to HTML and then indexed them.
The tool is great, better for my needs than other JavaScript search libraries (though it's not really fair to compare them, since they're quite different). Pagefind is a search tool that runs entirely in the browser with zero server-side dependencies. It indexes your content into a compact binary index, using WASM to run search in the browser.
It can't completely replace server-side search technologies like Solr or Elasticsearch, mainly because the index can't be updated incrementally. But for many small to medium digital libraries or collections that are rarely updated once completed, it's an extremely good tool: very fast, easy to integrate into web pages, and requires almost no maintenance.
Until now I was convinced that the only way to build an index was by reading content from existing HTML files. That changed when I listened to this Python in Digital Humanities podcast, where David Flood mentioned:
Critically, PageFind has a Python API that lets you build indexes programmatically from database dumps rather than only from HTML files.
I'd completely missed that Pagefind has a Python API (and a Node one too), which makes it easy to build an index from any data source.
Here's a basic example: building a search index for an Internet Archive collection.
Python code: create an index from the metadata of this collection (actually a collection of subcollections on the Internet Archive, containing Italian content related to radical movements)
import asyncio
import logging
import os

import internetarchive
from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "DEBUG"))
log = logging.getLogger(__name__)


async def main():
    config = IndexConfig(output_path="./web/pagefind")
    async with PagefindIndex(config=config) as index:
        log.info("Searching collection:radical-archives ...")
        results = internetarchive.search_items(
            "collection:radical-archives",
            fields=["identifier", "title", "description"],
        )
        count = 0
        for item in results:
            identifier = item.get("identifier", "")
            title = item.get("title", identifier)
            description = item.get("description", "")
            url = f"https://archive.org/details/{identifier}"
            thumbnail = f"https://archive.org/services/img/{identifier}"
            if isinstance(description, list):
                description = " ".join(description)
            await index.add_custom_record(
                url=url,
                content=description or title,
                language="en",
                meta={
                    "title": title,
                    "description": description,
                    "image": thumbnail,
                },
            )
            count += 1
            log.debug("indexed %s: %s", identifier, title)
        log.info("Indexed %d items. Writing index ...", count)
    log.info("Done. Index written to ./web/pagefind")


if __name__ == "__main__":
    asyncio.run(main())
Below is the text of the lightning talk I gave at Code4Lib 2026 earlier this week, on March 3. The conference venue where I delivered it is located at 1 Dock Street in Old City Philadelphia. Links below go to websites with images similar, but not always identical, to the ones I showed during the talk, as well as to some additional sites giving more context.
If you have a chance, it’s worth walking a few blocks from here to 6th and Market Street, where you can find a reconstructed frame of the President’s House, the home of George Washington during his presidency when Philadelphia was the capital of the US.
Here’s one of those panels, putting the story of Washington’s slaves in the context of where they lived, and the chronology of their bondage and freedom.
A judge recently ordered that the exhibit be restored. The court battle is ongoing, and the National Park Service has put back some of the panels, while others are still missing. In some of the gaps the public have put up their own signs (some of which you can see in this picture), testifying to what’s been suppressed. If you go there, you might even find someone acting as an unofficial tour guide, telling visitors stories similar to the ones that used to be on the official signs.
Now, we know what those signs said. The folks at the Data Rescue project collected photos of them before they came down, and you can view them online. But the importance of the exhibit is not just what it says, but where it says it. It’s important that it’s embedded in a particular place, so that people who come visit what’s sometimes called the cradle of liberty also find out that there’s a story about the people deprived of liberty here, and about how they won their freedom.
So what do I mean by a trail? A trail is a designated, visible path designed to help its users appreciate and understand the environment it goes through. You may have hiked some yourself, and you may have gone on some more explicitly interpretive trails, like the Freedom Trail in Boston.
Our libraries are also rich environments of history and culture. And we provide ways for users to search them, but do we provide trails for them?
But while these trails all refer to resources in our libraries, they’re not embedded in libraries in the same way as the exhibits and trails I’ve shown in Philadelphia and Boston. But they could be.
But we don’t have to stop with what’s in authority files, or in generic library descriptions. Maybe in the future, when you’re visiting Martha Washington’s page, you’ll find a trail that goes through it, like a trail telling the story of Ona Judge, one of the African Americans who Martha claimed ownership over, and who escaped from the house at 6th and Market here in Philadelphia, and stayed free the rest of her life.
What will that trail telling her story look like? I’m not quite sure, but I have some ideas that I’m hoping to try implementing, not so that I can tell the story, but that I can represent the story from others who can tell it better than I can. And so that people visiting my site can find and follow that story, with all of its richness, just as they once could when they visited the President’s House in Philadelphia, and as I hope they soon can do here again.
If this interests you, I’d love to talk more with you.
This
is a good post from Dan
Chudnov about his work on mrrc (a Python wrapped Rust
library for MARC data) and how agentic-coding tools (e.g. Claude Code)
can be useful for learning, adding rigor and engineering that might
otherwise not be practical or feasible.
pymarc has been proven
through years of use, bug reporting, and improvements, but has never
been formally verified, or had that level of rigorous attention. I
remain skeptical about building AI into everything, but Dan has helped
me see a silver lining: as code gets easier to write, with all its potential for slop, it also opens a door to making it more reliable and performant.
And, Dan is not
alone in thinking this. What if the tools for describing how
software should work, and for measuring how software
does work, get much, much better? What if formal verification tools
become more accessible and can be applied not just at the base layer of
systems (where it really matters) but in middle and frontend layers of
applications, where domain experts and stakeholders would really like
more control and insight into how software works for them and others?
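To make that concrete with something lightweight (this is not from Dan's post or from mrrc, just a toy sketch): property-based testing is one accessible rung on that ladder. With the Hypothesis library you state a property, such as "decoding an encoded record gives back the original," and the tool searches for counterexamples. The encode/decode pair below is a stand-in invented for illustration, runnable with pytest.
from hypothesis import given, strategies as st

DELIM = "\x1f"  # field separator; kept out of the generated text below

def encode(fields: dict[str, str]) -> str:
    # Join key=value pairs with the delimiter.
    return DELIM.join(f"{k}={v}" for k, v in fields.items())

def decode(blob: str) -> dict[str, str]:
    if not blob:
        return {}
    # Split on the delimiter, then on the first "=" only, so values
    # containing "=" still round-trip.
    return dict(pair.split("=", 1) for pair in blob.split(DELIM))

# Keys are drawn from letters only; values may also contain "=" and
# spaces. Hypothesis searches this space for a counterexample.
keys = st.text(alphabet="abcdefgh", min_size=1)
values = st.text(alphabet="abcdefgh =")

@given(st.dictionaries(keys, values))
def test_round_trip(fields):
    assert decode(encode(fields)) == fields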
This approach implies a level of restraint, or a holding back of the
generation of code that has not yet had this level of rigor applied to
it. The discourse around vibecoding on the other hand seems to be the
natural culmination of a “move fast and break things” philosophy that
almost everyone outside of Silicon Valley has seen for what it is.
Win free books from the March 2026 batch of Early Reviewer titles! We’ve got 226 books this month, and a grand total of 3,026 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.
The deadline to request a copy is Wednesday, March 25th at 6PM EDT.
Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, the UK, Israel, Australia, Canada, Ireland, Germany, Malta, Italy, Latvia and more. Make sure to check the message on each book to see if it can be sent to your country.
Thanks to all the publishers participating this month!
Hello DLF Community! It’s March, which means spring is around the corner (finally!), and it’s a great time for new growth. To that end, Forum planning is well underway for the virtual event this fall, and the DLF Groups are hard at work planning fantastic meetings and events for 2026. Additionally, I’m excited to share a bit of my own news: I’m transitioning to a new role at CLIR, Community Development Officer, that will help me support our community from a new angle. You’ll still have an amazing leader in Shaneé, stellar conference support from Concentra, and I certainly won’t be a stranger. As always, my inbox is open if you want to connect, send pet pictures, or have ideas about how you’d like to see our community grow in the coming months and years. See you around soon!
– Aliya
This month’s news:
Nominations Open: Suggest the names of individuals who may make compelling featured speakers at the 2026 Virtual DLF Forum. Nominations due March 31.
Registration Open: IIIF Annual Conference and Showcase in the Netherlands, June 1–4, 2026. For information, visit the conference page.
Early Bird Registration: Web Archiving Conference 2026 at KBR, the Royal Library of Belgium. Register by March 7 to secure discounted rates, and visit the conference website for full details.
Call for Proposals: AI4LAM’s Fantastic Futures 2026: Trust in the Loop, September 15-17, inviting proposals on how libraries, archives, and museums engage with trust and AI. Submissions due April 6.
This month’s open DLF group meetings:
For the most up-to-date schedule of DLF group meetings and events (plus conferences and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.
DLF Born-Digital Access Working Group (BDAWG): Tuesday, 3/2, 2pm ET / 11am PT.
DLF Digital Accessibility Working Group (DAWG): Tuesday, 3/2, 2pm ET / 11am PT.
DLF AIG Cultural Assessment Working Group: Monday, 3/9, 1pm ET / 10am PT.
AIG User Experience Working Group: Friday, 3/20, 11am ET / 8am PT
AIG Metadata Assessment Group: Friday, 3/20, 2pm ET / 11am PT.
DLF Digitization Interest Group: Monday, 3/23, 2pm ET / 11am PT.
DLF Committee for Equity & Inclusion: Monday, 3/23, 3pm ET / 12pm PT.
DLF Open Source Capacity Resources Group: Wednesday, 3/25, 1pm ET / 10am PT.
DLF Digital Accessibility Policy & Workflows subgroup: Friday, 3/27, 1pm ET / 10am PT.
DAWG IT & Development: Monday, 3/30, 1pm ET / 10am PT.
DLF Climate Justice Working Group: Tuesday, 3/31, 1pm ET / 10am PT.
Funding and resourcing, technology, staffing, community needs and expectations: the pace of change that library leaders must now navigate, and lead their organizations through, is nothing short of breathtaking. Trends that once took years to evolve now demand responses and strategic planning within months, or even days. Grounding those choices in rigorous, in-depth research remains essential.
At the same time, library decision-makers benefit from collective wisdom and insights shared among peers. Knowing how others are responding to similar pressures can help leaders calibrate their strategies and avoid reinventing the wheel. When those insights are confined to personal or regional networks, the limited perspective can restrict leaders’ views of how priorities and decisions are shifting.
OCLC Research leadership insights: Real-time insight for real-world decisions
This tension between the need for deeply researched guidance and the demand for timely, real-world insight creates a gap for the field. Library leaders need to understand not only which frameworks and models exist for long-term decision-making that are supported by our traditional research efforts, but also how their peers are responding to rapidly changing conditions right now.
To help fill this gap, OCLC Research is expanding its approach to gathering and sharing knowledge with a new series of pulse surveys focused on library leadership priorities. These quick, timely surveys aim to gather information on the decisions library leaders are making on a variety of critical topics shaping the future of librarianship.
A complementary approach to longstanding research practices
These short surveys are designed to capture high-level snapshots of the decisions library leaders make in the moment on subjects critical to the field, such as community engagement tactics and the use and implementation of new technologies, including AI. They are intentionally brief, both to respect leaders’ time and to enable us to respond quickly to emerging issues.
This approach does not replace the in-depth, foundational research OCLC Research is known for. Rather, it adds another dimension to it.
Our long-form research projects will continue to provide thoughtful frameworks, deep analysis, and foundational guidance for operational decision-making and long-term innovation. Leadership insights surveys complement that work by:
Broadening the range of topics we can address, especially those that are evolving quickly
Expanding the pool of voices contributing insight, drawing from library leaders across regions and library types
Capturing change as it happens, and tracking how priorities and decisions shift over time
Together, these approaches create a more layered understanding of the field, combining depth with immediacy.
Powered by OCLC’s global membership network
The value of these leadership insights depends on scale. OCLC is uniquely positioned to engage a broad, global network of libraries and library leaders representing diverse viewpoints. This allows us not only to collect perspectives from beyond individual professional networks but also to share results with the field quickly and widely.
The outcomes will be intentionally concise: scannable, easy-to-digest summaries that surface patterns, contrasts, and emerging directions. Think of them as snapshots—ephemeral by design—that help illuminate how decisions are being made today, while also building a record of how those decisions evolve over time.
What this means for library leaders
For library leadership, this new format offers another way to stay oriented in a fast-moving environment:
Insight into how peers are prioritizing and responding to shared challenges
Timely information that can inform near-term decisions
A broader field-level perspective that complements local experience
By adding pulse surveys to our toolkit, OCLC Research is expanding the breadth and increasing the pace of the insights we provide, while remaining grounded in the thoughtful, evidence-based work that has long supported libraries’ strategic and operational decision-making.
We see this as one more way to help library leaders make sense of complexity, learn from one another, and move forward with confidence. Our first pulse survey, focused on AI innovation & culture in libraries, will be fielded with US library leaders in early March 2026.
Subscribe to Hanging Together, the blog of OCLC Research, for updates on the survey series and to follow our latest work.
A month ago Clarivate announced a new yet-to-be-released product called Nexus: "Clarivate Nexus acts as a bridge between the convenience of AI and the rigor of academic libraries". This is a pitch to librarians who have correctly identified generative AI chatbots as purveyors of endless bullshit, but also know that students and some researchers are going to use them anyway. Clarivate tells us that we can patch up the fabrications of chatbots with reassuring terms like "trusted sources", "verified academic references", and "authoritative".
Looking more carefully at Clarivate's marketing material, what they are proposing suggests that Clarivate understands neither what citations are for nor why fabricated citations are a problem. This is somewhat surprising for the company that controls and manages such key parts of the scholarly publishing system as the citation database Web of Science, the scholarly publishing and indexing company ProQuest, and the Primo/Summon Central Discovery Index.
Why we cite
It can get a little more complicated than this, but there are essentially two reasons for citations in scholarly work.
The first is to indicate where you got your data. If I write that the population of Australia in June 2025 was 27.6 million people, I need to back up this claim somehow. In this case, I would cite the Australian Bureau of Statistics as the source. This adds credibility to a claim by enabling readers to check the original source and assess whether it actually does make the same claim, and whether that claim is credible. If I said that the population of Australia in 2025 was 100 million people and cited a source which made that claim and in turn cited the ABS as its source, you could follow the chain of references back and identify that the paper I cited is where the error occurred.
The second reason we cite a source is to give credit for a concept, term, or model for thinking. This is less about checking facts and more about academic norms and manners, though it also indicates how credible a scholar might be in terms of their understanding of a field. For example, I might describe a concept whereby librarians feel that the mission of libraries is good and righteous, and that this leads to burnout because they feel they can never complain about their working conditions. If I did not cite Fobazi Ettarh's Vocational Awe and Librarianship: The Lies We Tell Ourselves whilst describing this, I would rightly not be seen as a credible scholar in the field, or alternatively might be seen as surely knowing about Ettarh's work but deliberately ignoring it or even claiming her work as my own idea.
Why fabricated citations are bad
So that's the basics of why scholars include citations in their work. We can now explore why fabricated citations are a problem. There are two related but distinct reasons.
Citations that look real but are actually fake waste the time of already-busy library resource-sharing teams, who have to check whether the citation is real and sometimes go hunting for items that don't exist. This aspect of fabrication is bad because the cited item doesn't exist. If we match this to our first reason for citing, we can see that a claim backed by a citation to nothing at all is, uh, pretty problematic if the reason we cite is to link to the source data backing up the claim. It's equivalent to simply not providing a citation at all, except worse, because we're claiming that our plucked-out-of-the-air “fact” is backed up by some other source.
The second problem with fabricated citations is that there is no connection between the statement being made and the source being cited. Even if the cited source exists, the connection between the statement and the cited item is fabricated. This is slightly more difficult to understand because generative AI is based on probability, so in many cases there will appear to be a connection. But without a tightly-controlled RAG system, any apparent connection is likely to be a lucky guess. The problem here is one of academic integrity – we've cited a source that exists, but there is no guarantee it actually supports our claim, because the claim was never derived from the source in the first place.
A false nexus
Clarivate seems to be conflating these two issues. Their Nexus product has two core functions: checking citations to see if they are real, and suggesting references for content in chatbot conversations. The first is genuinely useful, though highly constrained – Clarivate only checks their own indexes, and defines anything that doesn't appear in those indexes as either non-existent or “non-scholarly” (it's unclear how it would classify, for example, something with a DOI that exists but doesn't appear in Web of Science). Neither academia nor the tech industry is short on hubris, but even in that context, “anything not listed in our proprietary databases isn't credible” is a pretty eyebrow-raising claim.
The second function kicks in when the citation checker defines a citation as failed – it offers to "Find Verified Alternative". That is, Nexus offers to replace both cited sources that don't exist and cited sources that "aren't scholarly" with another real source. This addresses the first problem (cited sources that don't exist) but not the second (cited sources that aren't the real source of a claim or quotation).
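As a rough, hypothetical sketch (not a description of how Nexus actually works), the gap can be put in code: the existence check is something a public index such as Crossref's REST API can answer, while the question a “Find Verified Alternative” button quietly skips, whether the cited work supports the sentence it is attached to, has no lookup-shaped answer. The helper names below are invented for illustration and assume Python with the requests package installed.

    # Illustrative sketch only; not Clarivate's implementation.
    import requests

    def citation_exists(doi: str) -> bool:
        """Easy half: does this DOI resolve to a real record in Crossref's index?"""
        resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
        return resp.status_code == 200

    def citation_supports_claim(doi: str, claim: str) -> bool:
        """Hard half: does the cited work actually back the claim it is attached to?
        No existence check or index lookup can answer this; it needs a reader."""
        raise NotImplementedError("this is the part no lookup can verify")

Swapping a failed citation for a different real source satisfies only the first function; the academic-integrity problem lives entirely in the second.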
With Nexus, Clarivate are essentially integrity-washing synthetic text, giving it an academic sheen without any academic rigour. Far from helping librarians, Clarivate's Nexus threatens to further unravel the hard work we do to teach students information literacy skills and its sparkling variety, "AI literacy". Students are already inclined to write their argument first and go on a fishing expedition for citations to back it up later (I certainly wrote my undergraduate essays this way). The last thing we want to do is direct them to a product that encourages this academically dishonest behaviour.
ChatGPT is designed to provide something that looks like a competent answer to a question. Nexus seems to be designed to amend this answer-shaped text into something that looks like a correctly-cited academic essay. But the point of student assessments isn't to produce essays – it's to produce competent researchers and systematic thinkers. Perhaps Clarivate thinks there is a large potential market of universities who want to help their own students cheat on assignments in ways that look more credible. To that, I would say "[citation needed]".
It is with heavy hearts and great sadness that we acknowledge the passing of trailblazer and fire-starter Fobazi Ettarh. Her loss will be felt by us all for years to come.
Fobazi published two articles with us at ITLWTLP. In 2014 she wrote “Making a New Table: Intersectional Librarianship,” one of the first scholarly articles published about viewing librarianship through an intersectional lens. In 2018 she published the hugely influential “Vocational Awe and Librarianship: The Lies We Tell Ourselves.” Since then, we have published many, many articles that cite the concept she identified: vocational awe. She was, to borrow a phrase from bell hooks, a maker of theory and a leader of action. We remember her as one of the great thinkers of her time, and we encourage our readers to spend some time with her words and her work. Additionally, please consider contributing to or sharing the link for her GoFundMe.
University of Michigan Library recently launched a new application to help U-M researchers and authors at our three campuses locate publications covered under institutional open access agreements. This tool aggregates nearly 13,000 titles across publishers, streamlining the process of locating eligible journals. The project involved data-wrangling, application design and development, and usability testing to produce a usable, sustainable tool.