Planet Code4Lib

Metastablecoin Fragmentation / David Rosenthal

A fundamental problem for decentralized systems like permissionless blockchains is that their security depends upon the cost of an attack being greater than the potential reward from it. Various techniques are used to impose these costs, generally either Proof-of-Work (PoW) or Proof-of-Stake (PoS). These costs have implications for the economics (or tokenomics) of such systems, for example that their security is linear in cost, whereas centralized systems can use techniques such as encryption to achieve security exponential in cost.

Shin Figure 3
Now, via Toby Nangle's Stablecoin = Fracturedcoin we find Tokenomics and blockchain fragmentation by Hyun Song Shin, whose basic point is that these costs must be borne by the users of the system. For cryptocurrencies, this means through transaction fees, inflation of the currency, or both. The tradeoff between cost and security means that there is a market for competing blockchains making different tradeoffs. In practice we see a vast number of competing blockchains:
Tether’s USDT sits on 107 different ledgers. ... USDC sits on 125.
The chart shows Ethereum losing market share against competing blockchains.

Shin's analysis uses game theory to explain why this fragmentation is an inevitable result of tokenomics. Below the fold I go into the background and the details of Shin's explanation.

Background

In 2018's Cryptocurrencies Have Limits I discussed Eric Budish's The Economic Limits Of Bitcoin And The Blockchain, an important analysis of the economics of two kinds of "51% attack" on Bitcoin and other cryptocurrencies based on PoW blockchains. Among other things, Budish shows that, for safety, the value of transactions in a block must be low relative to the fees in the block plus the reward for mining the block.

In 2019's The Economics Of Bitcoin Transactions I discussed Raphael Auer's Beyond the doomsday economics of “proof-of-work” in cryptocurrencies, in which Auer shows that:
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
Source
Bitcoin's costs are defrayed almost entirely by inflating the currency, as shown in this chart of the last year's income for miners. Notice that the fees are barely visible.

It has been known for at least a decade that Bitcoin's plan to phase out the inflation of the currency was problematic. In 2024's Fee-Only Bitcoin I wrote:
In 2016 Arvind Narayanan's group at Princeton published a related instability in Carlsten et al's On the instability of bitcoin without the block reward. Narayanan summarized the paper in a blog post:
Our key insight is that with only transaction fees, the variance of the miner reward is very high due to the randomness of the block arrival time, and it becomes attractive to fork a “wealthy” block to “steal” the rewards therein.
So Bitcoin's security depends upon the "price" rising enough to counteract the four-yearly halvings of the block reward. In that post I made a thought-experiment:
As I write the average fee per transaction is $3.21 while the average cost (reward plus fee) is $65.72, so transactions are 95% subsidized by inflating the currency. Over time, miners reap about 1.5% of the transaction volume. The miners' daily income is around $30M, below average. This is about 2.5E-5 of BTC's "market cap".

Let's assume, optimistically, that this below-average daily fraction of the "market cap" is sufficient to deter attacks and examine what might happen in 2036 after 3 more halvings. The block reward will be 0.39BTC. Let's work in 2024 dollars and assume that the BTC "price" exceeds inflation by 3.5%, so in 12 years BTC will be around $98.2K.

To maintain deterrence miners' daily income will need to be about $50M. Each day there will be about 144 blocks generating 56.16BTC or about $5.5M, which is 11% of the required miners' income. Instead of 5% of the income, fees will need to cover 89% of it. The daily fees will need to be $44.5M. Bitcoin's blockchain averages around 500K transactions/day, so the average transaction fee will need to be around $90, or around 30 times the current fee.
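The arithmetic can be checked with a few lines of Python. The inputs are the post's own figures, except the roughly $65K 2024 starting "price", which is my assumption chosen so that 3.5% annual growth lands at the post's $98.2K:

```python
# Worked version of the 2036 thought-experiment (figures from the post;
# the $65K 2024 starting "price" is an assumption).
price_2036 = 65_000 * 1.035 ** 12        # ~= $98.2K in 2024 dollars
block_reward = 0.39                      # BTC per block after 3 more halvings
blocks_per_day = 144
subsidy = blocks_per_day * block_reward * price_2036   # ~= $5.5M/day
required_income = 50e6                   # deterrence target, $/day
fees_needed = required_income - subsidy  # ~= $44.5M/day
txs_per_day = 500_000
fee_per_tx = fees_needed / txs_per_day   # ~= $90/transaction
```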
Average fee/transaction
Bitcoin users set the fee they pay for their transaction. In effect they are bidding in a blind auction for the limited supply of transaction slots. Miners are motivated to include high-fee transactions in their next block. If there were an infinite supply of transaction slots miners' fee income would be zero. In practice, much of the time the supply of slots exceeds demand and fees are low. At times when everyone wants to transact, such as when the "price" crashes, the average fee spikes enormously.

There was thus a need for a consensus mechanism that did not depend upon inflation. In 2020's Economic Limits Of Proof-of-Stake Blockchains I discussed a post entitled More (or less) economic limits of the blockchain by Joshua Gans and Neil Gandal in which they summarize their paper with the same title. The importance of this paper is that it extends the economic analysis of Budish to PoS blockchains. Their abstract reads:
Cryptocurrencies such as Bitcoin rely on a ‘proof of work’ scheme to allow nodes in the network to ‘agree’ to append a block of transactions to the blockchain, but this scheme requires real resources (a cost) from the node. This column examines an alternative consensus mechanism in the form of proof-of-stake protocols. It finds that an economically sustainable network will involve the same cost, regardless of whether it is proof of work or proof of stake. It also suggests that permissioned networks will not be able to economise on costs relative to permissionless networks.
Source
In 2022 Ethereum switched from Proof-of-Work to Proof-of-Stake, reducing its energy consumption by around 99%. This chart shows that, like Bitcoin, until the "Merge" the costs were largely defrayed by inflating the currency. After the "Merge" the blockchain has been running on transaction fees.

Shin's Analysis

Here is a summary of Shin's analysis.

Notation

  • There is a continuum of validators i.
  • For validator i ∈ [0,1], the cost of contributing to governance is ci > 0.
  • The blockchain needs at least a fraction κ̂ of the validators contributing to be secure. Shin writes:
    There are two special cases of note: κ̂ = 1 (unanimity, corresponding to full decentralisation where every validator must participate for the blockchain to function) and κ̂ = 0 which corresponds to full centralisation, where one validator has authority to update the ledger.
    κ̂ = 1 is impractical, lacking fault tolerance. κ̂ = 0 is much more practical; it is the traditional trusted intermediary.
  • If the blockchain is secure, each contributing validator earns a reward p > 0. A non-contributing validator earns zero.
  • The validators share a common cost threshold c*. If ci < c*, validator i contributes; if ci > c*, validator i does not.

Argument

Each validator will want to contribute only if at least a fraction κ̂ of the other validators contribute, which poses a coordination problem. The case of particular interest is the marginal validator with ci = c*. Shin writes:
Intuitively, even though the marginal validator may have very precise information about the common cost c*, the validator faces irreducible uncertainty about how many other validators will choose to contribute. It is this strategic uncertainty — uncertainty about others' actions — that is the central feature of the coordination problem.
This "strategic uncertainty" is similar to the attacker's uncertainty about other peers' actions that is at the heart of the defenses of the LOCKSS system in our 2003 paper Preserving peer replicas by rate-limited sampled voting.

Shin Figure 6
Because the marginal validator's ci = c*, the decision whether or not to contribute makes no difference. Shin's Figure 6 explains this graphically. Rectangle A is the expected loss if k < κ̂ and rectangle B is the expected gain if k > κ̂. Setting them equal gives:
c* κ̂ = (p − c*)(1 − κ̂)
which simplifies to:
c* = p(1 − κ̂)
Morris and Shin earlier showed that this is the unique equilibrium no matter what strategy the validators use.
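As a quick sanity check, the threshold c* = p(1 − κ̂) does make the marginal validator indifferent between the expected loss and gain. A numeric sketch using the symbols above (the example values are arbitrary):

```python
def c_star(p: float, kappa: float) -> float:
    # Equilibrium cost threshold: c* = p * (1 - kappa)
    return p * (1 - kappa)

p, kappa = 10.0, 0.8
c = c_star(p, kappa)
# Expected loss if too few contribute (rectangle A) equals
# expected gain if enough contribute (rectangle B).
loss = c * kappa
gain = (p - c) * (1 - kappa)
assert abs(loss - gain) < 1e-12
```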

Result

What this means is that successful validation depends upon the reward p being large enough so that:
p ≥ c / (1 − κ̂)
Shin writes:
Note that the required reward p explodes as κ̂ → 1. This is the central result of the paper: the more decentralised the blockchain (the higher the supermajority threshold), the higher must be the rents that accrue to validators. In the limiting case of unanimity (κ̂ = 1), no finite reward can sustain the coordination equilibrium.
Shin Figure 1
This is yet another result showing that a reasonably secure blockchain is unreasonably expensive. The complication is that, much of the time, transactions are cheap because the demand for them is low. Thus most of the time validators are not earning enough for the risks they run. But:
When many users want to transact at the same time, they bid against each other for limited block space, and fees spike — much as taxi fares surge during rush hour. Figure 1 shows how Ethereum gas fees exhibited sharp spikes during periods of network congestion, such as during surges in decentralised finance (DeFi) activity or spikes in the minting of non-fungible tokens (NFTs). These spikes are not merely a reflection of excess demand; they are the mechanism through which the blockchain extracts the rents needed to sustain validator coordination.
Note that these spikes mean that the majority of the time fees are low but the majority of transactions face high fees. It is this "user experience" that drives the fragmentation that Shin describes:
When demand for block space is high, fees rise and validators are well compensated. But high fees deter users, especially those making small or routine transactions. These users are the first to migrate to competing blockchains that offer lower fees — blockchains that can offer lower fees precisely because they have lower coordination thresholds (and hence less security). The users who remain on the more secure blockchain are those with the highest willingness to pay: institutions, large DeFi protocols, and transactions where security and censorship resistance are paramount. This sorting of users across blockchains is the essence of fragmentation.
Shin notes that:
The fragmentation argument is the flipside of blockchain's "scalability trilemma," as described by Vitalik Buterin, who posed the problem as the impossibility of attaining, simultaneously, a ledger that is decentralised, secure, and scalable.
Source
It is worth noting that Buterin's trilemma is a version for PoS of the trilemma Markus K Brunnermeier and Joseph Abadi introduced for PoW in 2018's The economics of blockchains. See The Blockchain Trilemma for details.

Shin's focus is primarily on the effects of fragmentation on stablecoins. He notes that:
Rather than converging on a single platform, stablecoin activity is scattered across many chains (Figure 4). As of late 2025, Ethereum held the majority of total stablecoin supply but was facing competition from Tron and Solana, each of which had attracted tens of billions of dollars in stablecoin balances. Each chain serves different geographies and use cases: Ethereum for institutional settlement, Tron for low-cost remittances, Solana for retail payments and DeFi activity.
This fragmentation among blockchains would not matter much if stablecoins were interoperable between them, but they are confined to the blockchain on which they were minted:
A USDC token on Ethereum is not the same as a USDC token on Solana — they exist on separate ledgers that have no native way of communicating with each other. Transferring between chains requires the use of bridges: specialised software protocols that lock tokens on one chain and issue equivalent tokens on another. These bridges introduce additional risks, including vulnerabilities in the smart contract code — bridge exploits have accounted for billions of dollars in cumulative losses — and they impose costs and delays that undermine the seamless transferability that is the hallmark of money. The result is a landscape in which stablecoins from the same issuer exist in multiple, non-fungible forms across different blockchains, fragmenting liquidity and undercutting the network effects that should be the strength of a widely adopted payment instrument.

Discussion

As I've been pointing out since 2014, very powerful economic forces mean that Decentralized Systems Aren't. So the users paying for the more expensive transactions because they believe in decentralization aren't getting what they pay for.

Source
As I wrote in 2024's It Was Ten Years Ago Today:
The insight applies to Proof Of Stake networks at two levels:
  • Block production: over the last month almost half of all blocks have been produced by beaverbuild.
  • Staking: Yueqi Yang noted that:
    Coinbase Global Inc. is already the second-largest validator ... controlling about 14% of staked Ether. The top provider, Lido, controls 31.7% of the staked tokens,
    That is 45.7% of the total staked controlled by the top two.
Source
In addition all these networks lack software diversity. For example, as I write the top two Ethereum consensus clients have nearly 70% market share, and the top two execution clients have 82% market share.
Shin writes as if more decentralization equals more security, even though that doesn't hold in practice, but this isn't really a problem. What the users paying the higher fees want is more security, and they are probably getting it because they are paying higher fees. As I discussed in Sabotaging Bitcoin, the reason major blockchains like Bitcoin and Ethereum don't get attacked is not that the (short-term) rewards for an attack are less than the cost. It is rather that everyone capable of mounting an attack is making so much money that:
those who could kill the golden goose don't want to.
Shin Figure 3
In any case, what matters for Shin's analysis isn't that the users actually get more security for higher fees, but that they believe they do. Like so much in the cryptocurrency world, what matters is gaslighting. But the chart showing Ethereum losing market share suggests that security is not a concern for the typical user.

mkiiif, yet another static IIIF generator / Raffaele Messuti

I revisited an old Go package I've been using over the past few years to build IIIF manifests — nothing fancy, just some glue around structs and JSON. From that I built a new CLI, mkiiif, to generate IIIF manifests from static images (tiled or not). There are plenty of similar tools out there (iiif-tiler, tile-iiif, biiif, ...) but none quite matched the CLI ergonomics I needed for my daily workflow.

I moved the library to this new repository atomotic/iiif. The tool mkiiif can be installed with Go:

go install github.com/docuverse/iiif/cmd/mkiiif@latest

mkiiif can generate an IIIF manifest from a source directory containing images, or from a PDF file that gets exploded and converted to images via mupdf. Output images can be either untiled or static tiles generated with vips. Both approaches produce an IIIF Level 0 compliant layout: static files that can be served from any HTTP server, with no image server required. Untiled is less efficient for large images but perfectly fine for printed books, papers, and similar material.

mupdf and vips are external dependencies that need to be installed separately. They are invoked via subprocess; I chose not to add Go library wrappers around them to keep the tool simple. WASM ports of both may become viable in the future.
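The subprocess approach can be sketched as follows. This is a hypothetical Python rendering, not mkiiif's actual Go code; the --layout iiif3 and --id flags are vips dzsave's documented options for IIIF output, but the real invocation may differ:

```python
import subprocess

def vips_tile_cmd(src: str, out_base: str, base_url: str) -> list[str]:
    # Build a vips dzsave invocation producing IIIF Image API v3 tiles.
    # --id sets the base URL written into each generated info.json.
    return ["vips", "dzsave", src, out_base,
            "--layout", "iiif3", "--id", base_url]

def tile_image(src: str, out_base: str, base_url: str) -> None:
    # vips must be installed; failures surface as CalledProcessError.
    subprocess.run(vips_tile_cmd(src, out_base, base_url), check=True)
```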

The CLI usage:

Usage of mkiiif:
  -base string
        Base URL where the manifest will be served (e.g. https://example.org/iiif)
  -destination string
        Output directory; a subdirectory named <id> will be created inside it, containing the images and manifest.json
  -id string
        Unique identifier for the manifest (e.g. book1)
  -resolution int
        Resolution (DPI) used when converting PDF pages to images via mutool (default 150)
  -source string
        Path to a directory of images or a PDF file to convert
  -tiles
        Generate IIIF image tiles for each image using vips dzsave (requires vips)
  -title string
        Human-readable title of the manifest

Example:

~ mkiiif -base https://digital.library.org -destination ./public -id iiif01 -source ~/book.pdf -title "iiif 01"

Or with tiling:

~ mkiiif -base https://digital.library.org -destination ./public -id iiif01 -source ~/book.pdf -title "iiif 01" -tiles

The two commands produce, respectively, the untiled and tiled structures below inside ./public:

└── iiif01
    ├── index.html
    ├── manifest.json
    ├── page-001.png
    ├── page-002.png
    ├── page-....png
    └── page-....png

With -tiles, each page instead becomes a directory of tiles:

└── iiif01
│   ├── index.html
│   ├── manifest.json
│   ├── page-001
│   │   ├── 0,0,1024,1024
│   │   │   └── 512,512
│   │   │       └── 0
│   │   │           └── default.jpg
...
│       ├── full
│       │   ├── 362,501
│       │   │   └── 0
│       │   │       └── default.jpg
│       │   └── max
│       │       └── 0
│       │           └── default.jpg
│       └── info.json
...

The directory can then be served from https://digital.library.org.

I've adopted this URL scheme:

https://{base}/{id}
    /manifest.json — the IIIF manifest
    /index.html    — a simple viewer

So in the example above, https://digital.library.org/iiif01 opens a full viewer to browse the object. The viewer used is Triiiceratops, the newest viewer in the IIIF ecosystem. Built on Svelte and OpenSeadragon, it is still young but very usable, lightweight, and easy to embed and customize. It is my favourite viewer.

mkiiif doesn't handle metadata for now (and probably won't) — the manifest can be easily patched to insert descriptive metadata in a later step, after image preparation, pulling from any existing datasource or metadata catalog.

Here is a full working example: https://docuver.se/iiif/p3tgsk8jqt/

A few open questions I haven't fully resolved:

  • The main drawback of generating IIIF this way is that you end up managing a large number of files on the filesystem, and handling millions of small image tiles can be slow (and costly). This is where IIIF intersects — and overlaps — with similar practices in digital preservation, such as BagIt, OCFL, and WARC/WACZ. So far there's no specification or viewer implementation that handles IIIF containers (e.g. a zip file bundling images, tiles, and the manifest). Discussions on this have been ongoing in the past; I've recently been looking at analogous approaches like GeoTIFF and SZI.
  • A static IIIF bundle generated with this CLI still needs to be served from an HTTP server, with the base URL defined at derivation time. Could such a bundle be opened from localhost and viewed directly in the browser? Service Workers might help here (even if HTTP is still needed), but it's a rabbit hole I haven't explored yet.

The CLI is pretty bare-bones — feel free to suggest improvements or report bugs. I've been using it over the past weeks as part of a personal project: an amateur digital library built around a DIY book scanner I assembled at home, to preserve magazines, zines, and similar material (content NSFW and out of scope to link here).

2026-03-18: A Glimpse into How AI Tools Can Enhance the Way We Study Web Archive Content: Challenges and Opportunities / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

Artificial intelligence (AI) has transformed nearly every field. Today, we can access and train models that generate text, images, sound, video, and code. This transformation is reshaping how we think, analyze, and preserve information. Yet, despite the rapid growth of AI, its use for analyzing web archive content seems to advance at a slower pace. 

Web archiving is the process of collecting, preserving, and providing access to web content over time, where a memento represents a previous version of a web resource as it existed at a specific moment in the past. Much of the recent work within the web archiving community (e.g., [1], [2], [3]) has focused on making the archiving process itself more intelligent, integrating AI into tasks such as web crawling, storage optimization, and metadata generation. In contrast, the application of AI to the analysis of already archived web content has received comparatively less attention. This gap represents a great opportunity for innovation and contribution, particularly as web archives continue to grow in size, diversity, and historical importance.

In this blog, I aim to outline (based on my perspective, analysis, preliminary work, and insights gained during my PhD candidacy exam) opportunities for where AI could play a role, as well as key challenges involved in integrating AI into web archiving.

My Preliminary Work 

Since I joined the PhD program at ODU in 2023 (Blog post introducing myself) under the supervision of Dr. Michele C. Weigle, my work has focused on the intersection of web archiving and AI, with a particular emphasis on leveraging Large Language Models (LLMs) through Retrieval-Augmented Generation (RAG) to detect and interpret text changes across mementos. Identifying the exact moment when content was modified often requires carefully comparing multiple archived versions, a process that can be both tedious and time-consuming. Moreover, detecting and analyzing where important changes occur is not a straightforward process. Users often need to select a subset of captures from thousands available, and even then, there is no guarantee that the differences they find will be meaningful or important. Traditional approaches to memento change analysis, such as lexical comparisons and indexing (e.g., [4], [5]), focus on showing the deletion or addition of terms or phrases but ignore semantic context. As a result, they miss subtle shifts in meaning and rely heavily on human interpretation.

My early work resulted in a paper titled “Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach,” coauthored with Lesley Frew, Dr. Jose J. Padilla, and Dr. Michele C. Weigle. The results of this initial exploration demonstrated that an LLM, when combined with tools such as RAG over a set of mementos, can effectively retrieve and analyze changes in archived web content. However, it remains necessary to constrain the analysis to distinguish between important and non-important changes. Building on this, I have been developing a pipeline to automatically determine whether a change alters meaning or context and should be considered significant. This aims to reduce manual effort, cognitive load, and support integration into web archive systems while advancing methods for analyzing archived web content at scale.

My PhD Candidacy Exam

During the summer of 2025, I passed my PhD candidacy exam (pdf, slides). This milestone marked an important transition in my doctoral studies and provided an opportunity to reflect on my preliminary work, learn, and identify new ways to contribute to the intersection of AI and web archiving. In my candidacy exam, I reviewed a set of ten papers related to analyzing changes and temporal coherence in archived web pages and websites.  Changes refer to any modifications observed in web content over time, including the addition, deletion, or alteration of text, images, structure, or other embedded resources. Temporal coherence, on the other hand, refers to the degree to which all components of an archived web page (such as HTML, text, images, and stylesheets) or website (such as interconnected pages and resources) were captured close enough in time to accurately represent how it appeared and functioned at a specific moment. A lack of temporal coherence can result in inconsistencies in how the archived page or site looks or behaves, which may affect the accuracy of change analysis.

Figure 2. A moment from my PhD candidacy exam, where I presented a ten-paper review on analyzing changes and temporal coherence in archived web pages and websites.

AI in Web Archiving: Opportunities

Over time, several researchers have addressed the analysis of changes and temporal coherence in web archives; however, the use of AI in this context has been limited. Below, I outline some research opportunities and challenges based on insights gained from my preliminary work and candidacy exam on how AI could play a role in these activities.

Topic Drift

AlNoamany et al. [6] studied web archive collections to identify off-topic pages within TimeMaps, which occur when a webpage that was originally relevant to a collection later changes into unrelated content. For example, in a collection about the 2003 California Recall Election (Figure 3), the site johnbeard4gov.com initially supported candidate John Beard (September 24, 2003) but later transformed into an unrelated adult-oriented page (December 12, 2003), making it irrelevant to the collection. To detect such changes, AlNoamany et al. proposed automated methods including text-based similarity metrics (cosine similarity, Jaccard similarity, and term overlap), a kernel-based method using web search context, and structural features such as changes in page length and word count. Using manually labeled TimeMap versions as ground truth, they found that the best performance was achieved by combining TF-IDF cosine similarity with word-count change.

Figure 3. Example of johnbeard4gov.com going off-topic. The first capture (September 24, 2003) shows the site supporting a California gubernatorial candidate, while the later capture (December 12, 2003) shows the domain transformed into unrelated adult-oriented content. Source: AlNoamany et al. [6]

Recent advances in AI and representation learning offer opportunities to enhance off-topic detection in web archives beyond traditional term frequency measures. Instead of relying on TF-IDF, future approaches could use dense semantic embeddings from transformer models to better capture meaning and context, enabling the detection of more subtle topic drift. Comparing embedding-based similarity with the methods proposed by AlNoamany et al. could help determine which approach is more effective, particularly when topic shifts are not immediately apparent.
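The term-similarity baseline is easy to sketch. Below is a minimal raw-term-frequency version of cosine similarity for off-topic flagging; note that AlNoamany et al. used TF-IDF weighting combined with word-count change, and the threshold here is purely illustrative:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def off_topic(first_capture: str, later_capture: str,
              threshold: float = 0.15) -> bool:
    # Flag a memento whose similarity to the first (on-topic)
    # capture of the TimeMap falls below the threshold.
    a = Counter(first_capture.lower().split())
    b = Counter(later_capture.lower().split())
    return cosine(a, b) < threshold
```

An embedding-based variant would replace the Counter vectors with dense sentence embeddings, leaving the comparison logic unchanged.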

Temporal Coherence

Weigle et al. [7] highlight a key challenge in modern web archiving: many sites, such as CNN.com, rely on client-side rendering, where the server delivers basic HTML and JavaScript that later fetch dynamic content (often JSON) through API calls. Traditional crawlers like Heritrix do not execute JavaScript or consistently capture these dynamic resources, leading to temporal violations in which archived HTML and embedded JSON files have different capture times, potentially misrepresenting events or news stories. The issue is illustrated in Figure 4, which shows archived CNN.com pages captured between September 2015 and July 2016. The top row displays pages replayed in the Wayback Machine that show the same top-level headline despite being captured months apart. The bottom row shows mementos from the same dates with the correct top-level headlines; however, the second-level stories remain temporally inconsistent.

By measuring time differences between base HTML captures and embedded JSON resources using CNN.com pages (September 2015–July 2016), Weigle et al. identified nearly 15,000 mementos with mismatches exceeding two days. They conclude that browser-based crawlers best reduce such inconsistencies, though due to their higher cost and slower performance, they recommend deploying them selectively for pages that depend on client-side rendering.

Figure 4. Example of temporal coherence violation in archived CNN.com pages using client-side rendering. Source: Weigle et al. [7].
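The core measurement, the capture-time gap among a page's resources, can be sketched using the 14-digit timestamps found in Wayback Machine URLs. This is a simplification of Weigle et al.'s comparison of base HTML against embedded JSON; the two-day default matches the mismatch threshold they report:

```python
from datetime import datetime

def max_skew_days(memento_timestamps: list[str]) -> float:
    # memento_timestamps: 14-digit YYYYMMDDhhmmss capture times for
    # the base HTML and each embedded resource of one archived page.
    # Returns the widest gap between any two captures, in days.
    times = [datetime.strptime(t, "%Y%m%d%H%M%S")
             for t in memento_timestamps]
    return (max(times) - min(times)).total_seconds() / 86400

def temporally_incoherent(memento_timestamps: list[str],
                          max_days: float = 2.0) -> bool:
    return max_skew_days(memento_timestamps) > max_days
```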

AI can enhance existing approaches to temporal coherence in web archives, such as those proposed by Weigle et al., by helping identify pages that depend on client-side rendering. For example, a machine learning model could be fine-tuned to analyze the initial HTML and related resources to detect signals such as empty or minimally populated DOM structures and classify whether a webpage relies on client-side rendering. AI-based analysis could also estimate the proportion of JavaScript relative to textual content and detect patterns associated with common client-side frameworks. Combined with indicators such as API endpoints referenced in scripts, these features can be used to flag pages that are unlikely to render correctly with traditional crawlers and may require browser-based crawling.
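One such signal is easy to compute directly from the raw HTML: the ratio of inline-script volume to visible text. This is a crude illustrative heuristic of the kind described above, not a method from the paper, and the 5x threshold is arbitrary:

```python
import re

def likely_client_side_rendered(html: str, ratio: float = 5.0) -> bool:
    # Compare inline-script volume to visible text in the initial HTML.
    # Pages that ship mostly JavaScript and an almost-empty DOM are
    # candidates for browser-based crawling.
    scripts = re.findall(r"<script\b[^>]*>(.*?)</script>", html,
                         re.S | re.I)
    script_len = sum(len(s) for s in scripts)
    stripped = re.sub(r"<script\b[^>]*>.*?</script>", " ", html,
                      flags=re.S | re.I)
    stripped = re.sub(r"<[^>]+>", " ", stripped)  # drop remaining tags
    visible_len = len(" ".join(stripped.split()))
    return script_len > ratio * max(visible_len, 1)
```

A trained classifier would add features such as framework fingerprints and API endpoints referenced in scripts, but the input representation would be similar.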

AI for Enhancing Web Archive Interfaces

While platforms such as Google and others have begun integrating AI into their user interfaces, web archives have largely remained unchanged in this respect. This is notable given the potential of AI to make web archive interfaces more intuitive and more informative for a wide range of users. For example, as my preliminary work suggests, when analyzing content changes, users currently must manually browse long lists of captures or compare multiple archived versions of a webpage. AI could instead automatically identify moments when important changes occur and direct users’ attention to those points in time.

Along the same line, the Internet Archive’s Wayback Machine provides a “Changes” feature that highlights deletions and additions between two snapshots and a calendar view where color intensity reflects the amount of variation. However, this variation is based on the quantity of changes rather than their significance. As a result, many small edits may appear more important than fewer but meaningful modifications. An AI-enhanced interface could address this limitation by incorporating semantic change detection. For instance, a calendar view that highlights when the meaning or message of a page changes can make large-scale temporal analysis more efficient and accessible. Moreover, users could ask natural-language questions such as “When did this page change its message?” or “What were the major updates during a specific period?” and receive concise, understandable answers. 

AI could also guide users through large collections by recommending related pages, explaining why certain versions are relevant, or warning when an archived page may contain temporally inconsistent content. For non-experts, visual aids generated by AI, such as timelines, change highlights, or short explanations, could make complex web archive data easier to interpret. 

AI in Web Archiving: Challenges

While there are opportunities for AI integration into web archiving, there are also challenges that must be considered.

Technical Challenges

From a technical standpoint, I identified three primary challenges regarding using AI for analyzing archived web content. The first concerns the nature of archived web data. Web archiving systems typically store collected content using the Web ARChive (WARC) format. Each WARC file stores complete HTTP response headers, HTML content, and additional embedded resources such as images and JavaScript files. Although this format provides a structure and allows long-term preservation, it is verbose and was not designed to support AI-based analysis. Consequently, researchers must perform extensive parsing and preprocessing before AI models can effectively use archived web content.
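For a flavour of the parsing involved, here is a minimal sketch that reads just the header block of a single WARC record. Real tooling such as warcio handles payload lengths, chained records, and gzip compression; this only illustrates the record layout:

```python
def parse_warc_headers(record: bytes) -> dict[str, str]:
    # A WARC record starts with a version line ("WARC/1.0") followed
    # by "Name: value" header lines, then a blank line and the payload.
    head, _, _payload = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    headers = {}
    for line in lines[1:]:            # skip the WARC/1.0 version line
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers
```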

Second, many web archives, such as the Internet Archive’s Wayback Machine, prioritize long-term storage and preservation over indexing and large-scale content retrieval. As a result, a single web page may have hundreds or even thousands of archived versions over time. Building and maintaining large-scale vector indexes over such temporally dense collections quickly becomes computationally expensive and, in many cases, impractical.

Third, even when working with controlled data scenarios, such as curated web archive collections, AI-driven analysis still depends on the availability of ground truth for evaluation and validation. For instance, training models to detect significant changes across mementos would require large-scale, high-quality annotations that capture not only what changed, but whether those changes meaningfully affect content interpretation. At present, no large-scale annotated datasets exist that support systematic analysis of change significance across archived web versions, creating a major barrier to training and evaluating AI models in this domain.

Ethical Challenges

Beyond technical limitations, the integration of AI into web archive analysis raises important ethical challenges. For instance, web archives preserve content as it existed at specific points in time, often without the consent or awareness of content creators or the individuals represented in that content. When AI models analyze archived web data, they may surface, reinterpret, or amplify sensitive information that was never intended to be reused in new analytical contexts. For this reason, it is important to carefully consider how AI is applied within web archiving. I contend that AI should be viewed as a complementary tool, one that supports, rather than replaces, human judgment. For example, AI can assist in identifying potential moments of relevant changes, flagging or summarizing them, while humans interpret the results and make decisions.

It is also important to note that recent debates highlight growing tensions between web archives and content owners regarding the use of archived data for AI training and analysis. For example, major news publishers have begun restricting access to resources like the Internet Archive due to concerns that archived content is being used for large-scale AI scraping without compensation or consent [8]. In response to such restrictions, researchers and practitioners—including Mark Graham, Director of the Wayback Machine—have argued that limiting access to web archives poses a significant risk to the preservation of digital history [9]. From this perspective, the primary concern is not excessive access, but rather the potential loss of the web as a historical record if archiving efforts are weakened.

Conceptual Challenges

AI models, particularly LLMs, typically operate on individual snapshots of data. As a result, they are not inherently designed to reason about evolution, temporal coherence, or change over time in archived web content. Consequently, answers to temporally grounded questions should not be expected by default when these models are applied without additional structure or context.

In static analysis scenarios, AI models can perform effectively. For example, given a single archived web page, an LLM can generate a summary, identify main topics, extract named entities, or analyze embedded resources such as images, videos, or scripts. Temporal analysis in web archiving, however, requires a different mode of reasoning. The central questions are not “What does this page say?” or “What is this page about?” but rather “What changed?”, “When did it change?”, “Why did it happen?”, and “What impact does the change have over time?” Answering these questions requires comparing multiple archived versions, reasoning based on context, and perhaps correlating changes across web pages.

Integrating AI into web archiving is therefore not only about efficiency, but about enabling new forms of discovery. This requires clearly defining desired outcomes and using AI to support or accelerate processes that have traditionally been manual.

Final Reflections

To conclude, I would like to leave the reader with a set of open questions as we continue moving toward the integration of AI in web archiving. One of the most visible changes introduced by AI is the ability to go beyond syntactic analysis and begin exploring semantic analysis, where meaning, context, and interpretation matter. This shift is not about replacing existing techniques, but about expanding the types of questions we can ask when working with web archive data.

I contend that traditional algorithms remain essential for many web archiving tasks. They are precise, transparent, and well understood. AI, by contrast, offers strengths in areas where rules struggle: interpreting context, assessing relevance, and reasoning across multiple versions of content. Rather than framing this as a competition between algorithms and AI, a more productive question is how these approaches can complement one another, and in which parts of the analysis pipeline each is most appropriate.

In the short term, I consider that AI tools are unlikely to replace algorithmic methods. However, they already show promise as assistive tools that can guide analysis, prioritize attention, and help humans reason about large and complex temporal collections. This naturally raises a forward-looking question: if AI continues to improve in its ability to reason about time, meaning, and change, how should the web archiving community adapt its tools, workflows, and standards?

The WARC format has proven effective for long-term preservation, but it was not designed with AI-driven analysis in mind. Should we aim to augment existing archival formats with AI-aware representations, or should we focus on developing AI methods that better adapt to current standards such as WARC? How we answer this will shape not only how we analyze web archives, but also how future generations access and understand the web past.

References

[1] AK, Ashfauk Ahamed. “AI driven web crawling for semantic extraction of news content from newspapers.” Scientific Reports, 2025. [Online]. https://doi.org/10.1038/s41598-025-25616-x.

[2] Abrar, M. F., Saqib, M., Alferaidi, A., Almuraziq, T. S., Uddin, R., Khan, W., & Khan, Z. H. “Intelligent web archiving and ranking of fake news using metadata-driven credibility assessment and machine learning.” Scientific Reports, 2025. [Online]. https://doi.org/10.1038/s41598-025-31583-0.

[3] Nair, A., Goh, Z. R., Liu, T., and Huang, A. Y. “Web archives metadata generation with gpt-4o: Challenges and insights,” arXiv, Tech. Rep. arXiv:2411.05409, Nov. 2024. [Online]. https://arxiv.org/abs/2411.05409.

[4] L. Frew, M. L. Nelson, and M. C. Weigle, “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives,” in Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2023, pp. 71–81. https://doi.org/10.1109/JCDL57899.2023.00021

[5] T. Sherratt and A. Jackson, GLAM-Workbench/web-archives, https://zenodo.org/records/6450762, version v1.1.0, Apr. 2022. DOI: 10.5281/zenodo.6450762.

[6] Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within timemaps in web archives,” International Journal on Digital Libraries, vol. 17, no. 3, pp. 203–221, 2016. https://doi.org/10.1007/s00799-016-0183-5.

[7] M. C. Weigle, M. L. Nelson, S. Alam, and M. Graham, “Right HTML, wrong JSON: Challenges in replaying archived webpages built with client-side rendering,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Jun. 2023, pp. 82–92. https://doi.org/10.1109/JCDL57899.2023.0002.

[8] Robertson, K. “News publishers limit Internet Archive access due to AI scraping concerns.” Nieman Lab, Jan. 2026. [Online]. https://www.niemanlab.org/2026/01/news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/

[9] Graham, M. “Preserving the web is not the problem — losing it is.” Techdirt, Feb. 17, 2026. [Online]. https://www.techdirt.com/2026/02/17/preserving-the-web-is-not-the-problem-losing-it-is/





2026-03-18: Reverse TweetedAt: Determining Tweet ID prefixes from Timestamps / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

Figure 1: Each tweet ID is a unique identifier that encodes the tweet creation timestamp, example adapted from Snowflake ID, Wikipedia.

Web archives, such as the Wayback Machine, are indexed by URL. For example, if we want to search for a tweet we must first know its URL. Figure 2 demonstrates that searching for a tweet URL results in a timemap of that tweet archived at different points in time. Clicking on a particular datetime will show the archived tweet at that particular point in time.

 

Figure 2: An archived tweet URL results in a timemap consisting of archived copies of the tweet.


Figure 3 shows a screenshot of a tweet shared by @_llebrun. The tweet in the screenshot was originally posted by @randyhillier, who later deleted his tweet. The screenshot does not include the tweet's URL. Moreover, when a tweet is deleted, we cannot find the tweet URL on the live web, nor do we know how to look it up in the archive.


Figure 3: @_llebrun tweeted a screenshot of a tweet originally posted by @randyhillier, who later deleted his tweet.


Therefore, we need to construct the URL of a tweet using only the information present in the screenshot. The structure of a tweet URL is: 


https://twitter.com/Twitter_Handle/status/Tweet_ID


We need the Twitter_Handle and Tweet_ID to construct a tweet URL. Each tweet ID is a unique identifier, known as a Snowflake ID, that encodes the tweet creation timestamp (Figure 1). We can extract the Twitter handle and timestamp from the tweet in the screenshot. In our previous tech report, we introduced methods for extracting Twitter handles and timestamps from Twitter screenshots. Next, we need to determine the tweet ID from the extracted timestamp. We could query the Wayback Machine using only the Twitter handle, but individually dereferencing every archived tweet for a user would be exhausting. For example, the following curl command shows that the number of archived captures of @randyhillier's status URLs is huge (42,053). Hence, our goal is to limit the search space by utilizing the timestamp present on the screenshot.

curl -s "http://web.archive.org/cdx/search/cdx?url=https://twitter.com/randyhillier/status&matchType=prefix" | wc -l


   42053


Previously, one could query Twitter to find the timestamp of a tweet given a tweet ID, but this service is no longer freely available. The Twitter API has access rate limits, and metadata from deleted, suspended, or private tweets cannot be accessed through it. Moreover, the Twitter API is now monetized and no longer research-friendly. To address these issues, WS-DL members Mohammed Nauman Siddique and Sawood Alam developed the TweetedAt web service in 2019. The goal of this service is to extract the timestamps for Snowflake IDs and estimate timestamps for pre-Snowflake IDs, and it has become a useful tool for finding timestamps from tweet IDs. Here, however, we require the reverse: a tweet ID prefix determined from a given timestamp.

Reverse TweetedAt


The Snowflake service generates a tweet ID, a 64-bit unsigned integer composed of a 41-bit timestamp, a 10-bit machine ID, a 12-bit machine sequence number, and 1 unused sign bit. The timestamp occupies the upper 41 bits (below the sign bit).
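As a small sketch of that layout (field widths as stated above; the sample ID is @randyhillier's tweet discussed later in this post), the fields can be pulled apart with shifts and masks:

```python
# Decompose a Snowflake tweet ID into its documented fields.
tid = 1495226962058649603

timestamp_ms = tid >> 22            # upper 41 bits: ms since the Twitter epoch
machine_id   = (tid >> 12) & 0x3FF  # next 10 bits
sequence     = tid & 0xFFF          # lowest 12 bits

# The three fields reassemble into the original ID.
assert (timestamp_ms << 22) | (machine_id << 12) | sequence == tid
```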


TweetedAt determines the timestamp for a tweet ID by right-shifting the tweet ID by 22 bits and adding the Twitter epoch time of 1288834974657 (offset).


Python code to get UTC timestamp of a tweet ID

from datetime import datetime

def get_tweet_timestamp(tid):
    offset = 1288834974657  # Twitter epoch in milliseconds
    tstamp = (tid >> 22) + offset
    utcdttime = datetime.utcfromtimestamp(tstamp / 1000)
    print(str(tid) + " : " + str(tstamp) + " => " + str(utcdttime))


For Reverse TweetedAt, given a datetime, we want to generate a tweet ID prefix by subtracting the offset and left-shifting by 22 bits. The process cannot reconstruct the exact tweet ID, because the lower 22 bits come out as zeros; it does, however, yield a tweet ID prefix for that timestamp. For example, the tweet ID of @randyhillier’s tweet is ‘1495226962058649603’ and its timestamp is ‘9:41 PM Feb 19, 2022’, as shown in Figure 3. The tweet ID has 19 digits, while the timestamp has minute-level granularity. From that minute-level timestamp, Reverse TweetedAt computes the 6-digit tweet ID prefix ‘149522’ for the 19-digit tweet ID ‘1495226962058649603’.


Python code to get tweet ID prefix from a Wayback timestamp

from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Twitter epoch in milliseconds

def wayback_to_tweetid_prefix(timestamp: str):
    s = str(timestamp).strip()

    if len(s) == 14 and s.isdigit():
        granularity = "second"
        dt = datetime.strptime(s, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
        start_ms = int(dt.timestamp() * 1000)
        end_ms = start_ms + 999
    elif len(s) == 12 and s.isdigit():
        granularity = "minute"
        dt = datetime.strptime(s, "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
        start_ms = int(dt.timestamp() * 1000)
        end_ms = start_ms + 60_000 - 1
    elif len(s) == 10 and s.isdigit():
        granularity = "hour"
        dt = datetime.strptime(s, "%Y%m%d%H").replace(tzinfo=timezone.utc)
        start_ms = int(dt.timestamp() * 1000)
        end_ms = start_ms + 3_600_000 - 1
    elif len(s) == 8 and s.isdigit():
        granularity = "date"
        dt = datetime.strptime(s, "%Y%m%d").replace(tzinfo=timezone.utc)
        start_ms = int(dt.timestamp() * 1000)
        end_ms = start_ms + 86_400_000 - 1
    else:
        raise ValueError(
            "Unsupported Wayback format. Use YYYYMMDD, YYYYMMDDHH, "
            "YYYYMMDDHHMM, or YYYYMMDDHHMMSS (UTC)."
        )

    # Smallest and largest Snowflake IDs possible within the time window.
    start_delta = start_ms - TWITTER_EPOCH_MS
    end_delta = end_ms - TWITTER_EPOCH_MS
    min_id = start_delta << 22
    max_id = (end_delta << 22) | ((1 << 22) - 1)

    # The shared leading digits of the two bounds form the prefix.
    min_str = str(min_id)
    max_str = str(max_id)
    length = max(len(min_str), len(max_str))
    min_str = min_str.zfill(length)
    max_str = max_str.zfill(length)

    i = 0
    while i < length and min_str[i] == max_str[i]:
        i += 1

    prefix_str = min_str[:i] or "0"
    suffix_len = length - i
    prefix_val = int(prefix_str)
    ten_pow = 10 ** suffix_len
    approx_lower = prefix_val * ten_pow
    approx_upper = (prefix_val + 1) * ten_pow - 1

    return {
        "input_timestamp": timestamp,
        "tweet_id_prefix": prefix_str,
        "tweet_id_regex": f"{prefix_str}[0-9]{{{suffix_len}}}",
        "tweet_id_range": f"[{approx_lower} – {approx_upper}]",
    }
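As a sanity check on the worked example, the minute-level window and prefix can be recomputed directly with a short self-contained sketch, assuming the screenshot's '9:41 PM Feb 19, 2022' (EST) corresponds to 2022-02-20 02:41 UTC:

```python
from datetime import datetime, timezone
import os

TWITTER_EPOCH_MS = 1288834974657

# Minute-level window starting at 2022-02-20 02:41:00 UTC.
dt = datetime(2022, 2, 20, 2, 41, tzinfo=timezone.utc)
start_ms = int(dt.timestamp() * 1000)
end_ms = start_ms + 60_000 - 1

min_id = (start_ms - TWITTER_EPOCH_MS) << 22
max_id = ((end_ms - TWITTER_EPOCH_MS) << 22) | ((1 << 22) - 1)

# The shared leading digits of the two bounds give the prefix.
prefix = os.path.commonprefix([str(min_id), str(max_id)])
print(prefix)  # 149522

# The actual tweet ID falls inside the window's ID range.
assert min_id <= 1495226962058649603 <= max_id
```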


We integrated Reverse TweetedAt as a web service alongside TweetedAt. The service accepts a timestamp as user input and returns the corresponding tweet ID prefix, tweet ID regex, and full tweet ID range (Figure 4). It supports multiple valid timestamp formats (e.g., ISO 8601, RFC 1123, Wayback) and provides output at different levels of granularity. For example, Figure 4 shows output for millisecond-level granularity. Because millisecond-level precision is typically unavailable in tweet timestamps, the tool can interpret such inputs at second- or minute-level granularity. Rather than assuming zeros for unknown fields, the tool expands the input into the full corresponding time window (e.g., an entire second or minute), and computes the tweet ID prefix over that interval.

Figure 4: Reverse TweetedAt outputs tweet ID prefix at millisecond-level granularity.


Figure 5: Reverse TweetedAt outputs tweet ID prefix at second-level granularity.


Figure 6: Reverse TweetedAt outputs tweet ID prefix at minute-level granularity.


Tweet ID Regex-based Retrieval Across Temporal Granularity


We can use the tweet ID regex derived from a timestamp to search for archived tweets within a specific temporal window. By querying the Wayback Machine’s CDX API and filtering results using this prefix-based regex, we can identify tweet URLs whose IDs fall within the calculated range. As the timestamp becomes less precise, the tweet ID prefix becomes shorter and the regex search space widens.


For example, the tweet ID of @randyhillier’s tweet shown in Figure 3 is ‘1495226962058649603.’ Using TweetedAt, we can get the timestamp at millisecond-level granularity. Using Reverse TweetedAt, millisecond-level granularity returns the most precise prefix and yields 10 archived captures, while a slightly less precise prefix (second-level granularity) returns 15. Reducing the precision further (minute-level granularity) still returns 15, indicating that all tweets archived within the broader minute window were posted within the same narrower interval. This illustrates how lower temporal granularity expands the potential search space; a wider ID range, however, does not necessarily produce more results, only more candidate IDs.
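The grep patterns used in the commands below can also be generated programmatically. A minimal stdlib sketch, using the tweet's creation time of 2022-02-20 02:41:02.385 UTC as recovered via TweetedAt:

```python
from datetime import datetime, timezone
import os

TWITTER_EPOCH_MS = 1288834974657

def id_regex(start_ms: int, span_ms: int) -> str:
    # Regex matching every tweet ID whose timestamp lies in
    # [start_ms, start_ms + span_ms) milliseconds.
    lo = (start_ms - TWITTER_EPOCH_MS) << 22
    hi = ((start_ms + span_ms - 1 - TWITTER_EPOCH_MS) << 22) | ((1 << 22) - 1)
    prefix = os.path.commonprefix([str(lo), str(hi)])
    return f"{prefix}[0-9]{{{len(str(hi)) - len(prefix)}}}"

base = datetime(2022, 2, 20, 2, 41, 2, tzinfo=timezone.utc)
ms = int(base.timestamp() * 1000) + 385   # ...02.385
sec_start = ms - 385                      # truncate to the second
min_start = sec_start - 2_000             # truncate to the minute

print(id_regex(ms, 1))              # 14952269620[0-9]{8}
print(id_regex(sec_start, 1_000))   # 149522696[0-9]{10}
print(id_regex(min_start, 60_000))  # 149522[0-9]{13}
```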

Search space at millisecond-level granularity

curl -s "https://web.archive.org/cdx/search/cdx?url=https://twitter.com/randyhillier/status/&matchType=prefix" \
| grep -E 'status/14952269620[0-9]{8}' | wc -l


   10


Search space at second-level granularity

curl -s "https://web.archive.org/cdx/search/cdx?url=https://twitter.com/randyhillier/status/&matchType=prefix" \
| grep -E 'status/149522696[0-9]{10}' | wc -l


   15


Search space at minute-level granularity

curl -s "https://web.archive.org/cdx/search/cdx?url=https://twitter.com/randyhillier/status/&matchType=prefix" \
| grep -E 'status/149522[0-9]{13}' | wc -l


   15



CDX API Wildcard Search and Snowflake IDs to Limit the Search Space Using Tweet ID Prefix


We can now determine a tweet ID prefix from a screenshot timestamp using the Reverse TweetedAt service. Since a tweet can be archived any time within ±26 hours of the screenshot timestamp, we can determine tweet ID prefixes for the boundaries of that time window. We can then use the window to limit the search space by excluding URLs tweeted before or after the alleged timestamp. Consider the tweet in the screenshot in Figure 3, where the screenshot timestamp is:


9:41 PM Feb 19, 2022 (20220219214100)


We compute the tweet ID prefixes for the left-hand boundary (-26 hours) and right-hand boundary (+26 hours) timestamps using Reverse TweetedAt; they are listed below:


-26 hours timestamp: 20220218194100 → tweet ID prefix: 14947588
+26 hours timestamp: 20220220234100 → tweet ID prefix: 149554404

As previously mentioned, the timestamp occupies the upper 41 bits only. We can take the common portion of the two tweet ID prefixes (149[4-5]) and perform a CDX API wildcard search in the Wayback Machine to limit the search space. The search space shrinks to 629 archived tweets, whereas using only the Twitter handle returns 42,053. Dereferencing 629 archived tweets to search for the particular tweet text of a screenshot is a lot of work but feasible, whereas dereferencing 42,053 archived tweets is far too expensive. The following curl command shows that the number of archived captures of @randyhillier's status URLs sharing the common tweet ID prefix is comparatively small (629).

curl -s "https://web.archive.org/cdx/search/cdx?url=https://twitter.com/randyhillier/status/&matchType=prefix&from=20220218194100" \
| grep -E 'status/149[4-5]' | wc -l


   629
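The common portion of the two boundary prefixes can likewise be derived in a few lines of Python (the prefix values are quoted from the Reverse TweetedAt output above):

```python
import os

# Boundary prefixes for the -26h and +26h timestamps, per Reverse TweetedAt.
lo, hi = "14947588", "149554404"

# Shared leading digits, then a character class over the first digit that differs.
common = os.path.commonprefix([lo, hi])
pattern = f"{common}[{lo[len(common)]}-{hi[len(common)]}]"
print(pattern)  # 149[4-5]
```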


Summary


It is easy to search for a tweet in the Wayback Machine when you know its URL. But a screenshot of a tweet typically does not include the URL in the image. However, the Twitter handle and timestamp visible in the screenshot can be used to search for the tweet in the Wayback Machine web archive. Given a datetime, Reverse TweetedAt produces a tweet ID prefix, which we can then use to grep through a CDX API response of all tweets associated with a Twitter account. We can determine approximate tweet IDs from the left-hand and right-hand boundary timestamps derived from a screenshot timestamp using the Reverse TweetedAt tool. We found that we can limit the search space using a CDX API wildcard search based on a common tweet ID prefix. Thus, the process of finding candidate archived tweets for the tweet in the screenshot is optimized. We published a paper at the 36th ACM Conference on Hypertext and Social Media, “Web Archives for Verifying Attribution in Twitter Screenshots,” which discusses how we can further use the candidate archived tweets to verify whether the tweet in the screenshot was posted by the alleged author.


Related Links:



— Tarannum Zaki (@tarannum_zaki)


2026-03-17: The Disintegration Loops: Generational Loss in Web Archives / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

 The Disintegration Loops: Generational Loss in Web Archives


Michael L. Nelson




As part of the Internet Archive's Information Stewardship Forum (March 18–20, 2026), I decided to use my five minute lightning talk to raise the issue of generational loss in web archives.  Or more directly, making copies of copies (...of copies…) – something that web archives currently do not do well.  My title is based on William Basinski's four-volume release "The Disintegration Loops", in which he played the audio tapes of "found sounds", recorded decades earlier, in loops, with the whole process lasting over an hour.  The effect is hauntingly beautiful, with each loop slightly degrading the magnetic tape, resulting in a generational loss.  The degradation of each loop is right on the edge of the just-noticeable difference, until the entire track is reduced to just a shadow of its former self.


I first discussed this topic in my 2019 CNI closing keynote (slide 88), where I introduced the inability of web archives to archive other web archives as part of the larger issue of web archive interoperability. Let's begin by walking through the example of archiving a tweet (which we already know to be challenging!).  The original tweet is still on the live web, even though the UI has undergone many revisions since it was originally tweeted in 2018.


https://twitter.com/phonedude_mln/status/990054945457147904 

(screen shot from 2026-03-17)



I archived that tweet to the Internet Archive's Wayback Machine in 2018 (screen shot from 2019):


https://web.archive.org/web/20180501125952/https://twitter.com/phonedude_mln/status/990054945457147904 


I then archived the Wayback Machine's copy of the tweet to archive.today in 2019 (screen shot from 2019):

https://archive.ph/PaKx6 


Note that archive.today is aware that the page comes from the Wayback Machine but the original host is twitter.com, and it maintains both the original Memento-Datetime (20180501125952) as well as its own Memento-Datetime (20190407023141).  I then archived archive.today's memento to perma.cc in 2019 (screen shot from 2019):



https://perma.cc/3HMS-TB59 


Finally, I archived the perma.cc memento back to the Wayback Machine in 2019 (screen shot from 2019):


https://web.archive.org/web/20190407024654/https://perma.cc/3HMS-TB59 


Although the loss occurs in discrete chunks, it is reminiscent of Basinski's Disintegration Loops, with information lost at each step and the final version a mere shadow of the original.  In 2019, this was not universally recognized as a problem, since archiving the playback interface of other web archives was not considered a problem in itself.  The "right" solution, of course, is to share the WARC files (or WACZ, or HAR, or…) out-of-band and let the other web archives replay from the same source files.  But this is rarely possible: for a variety of reasons, web archives typically do not share their original WARC files, and in the case of archive.today, the archive might not even store the original source files (instead likely storing only the radically transformed pages).


More importantly, it is sometimes useful to archive a particular web archive's replay of a page, because that replay itself changes through time. For example, memento #3 (the perma.cc memento of archive.today's memento) is now different; this is a screen shot from 2026:


2026 replay of https://perma.cc/3HMS-TB59 


Surely the source files themselves have not changed; the difference is due to improvements in pywb, which is under constant development. perma.cc's replay of the 2019 page in 2019 differs from its replay in 2026, which implies it could be different still in the future. But we cannot currently archive perma.cc's replay of that page to, say, the Wayback Machine without generational loss.  The fact that screen shots – which are rife with their own potential for abuse (cf. HT 2025, arXiv 2022) – are the only mechanism to document these replay differences underscores the web archive interoperability problem.


I chose the topic of generational loss for my slot at the Information Stewardship Forum because recent events have introduced a new use case for archiving the replay of web archives. Wikipedia recently announced it was blacklisting archive.today because its editors discovered that the webmaster of archive.today was using its captcha to direct a DDoS attack against a blog owned by someone the webmaster had a dispute with (the blogger had posted a lengthy investigation of the webmaster's identity), and, more disturbingly for our discussion, had edited the content of an archived page to include the name of the blogger where it would not otherwise appear.  The Wikipedia discussion page is hard to follow, in part because the editors are discussing how to archive the replay of an archived page.  For one example, they show how the archive.today replay has now been changed back to have "Comment as: Nora Puchreiner" (middle of the image):



But the replay alteration from archive.today in question is archived at megalodon.jp, showing that the name "Nora Puchreiner" was replaced with the name of the blogger who had earned the webmaster's ire, "Jani Patokallio". And yes, megalodon.jp's replay of archive.today's memento is that bad (at least in my browser, it is shrunk down impossibly small), so I used the dev tools to find the string in question.


https://megalodon.jp/2026-0219-1509-14/https://archive.is:443/2021.05.30-173350/http://www.maskofzion.com/2012/04/jewish-at-root-iraqs-destruction-hell.html


Another Wikipedian archived (using yet another archive, ghostarchive.org) a google.com SERP to show that archive.today has reverted from "Jani Patokallio" back to "Nora Puchreiner". 


https://ghostarchive.org/archive/c0ZP0


What does changing "Nora" to "Jani" (and then changing it back again) accomplish? I'm not sure; this appears to be just a petty response to an ongoing dispute.  But the implication is profound: this is the first known example of a major web archive purposefully and maliciously altering its contents, something that we knew was possible but had not yet experienced.  


We have long known that replay can change through time (cf. PLOS One 2023) due to the replay engine (the Wayback Machine, Open Wayback, pywb, etc.) evolving, but these changes were engineering results and the replay mostly improved over time. But now we have seen web archives maliciously alter (and then revert) the replay, and we need a more standard and interoperable way to archive archival replay.  Not just to prove that a web archive did alter its replay, but also to prove that an archive did not alter its replay.  Out-of-band sharing of WARC files is the gold standard, but for a variety of reasons this is unlikely to happen.  We must be able to use web archives to verify and validate web archives.  We explored a heavyweight design for this a few years ago (JCDL 2019), but it should be revisited in light of developments like WACZ.  


–Michael


ht to Herbert Van de Sompel for introducing me to "The Disintegration Loops" many years ago.

Seeking Approval, Confronting Objectivity: Neutrality in the Library of Congress Subject Headings Approval Process / In the Library, With the Lead Pipe

In Brief: This study examines the concept of neutrality in Library of Congress Subject Headings and the subject approval process by analyzing proposed headings that were rejected over a nearly 20-year period. It considers the place of neutrality in libraries more generally and argues that equity, rather than neutrality, is the appropriate lens for judging subject heading proposals. Finally, it recommends several reforms that could improve the subject heading process and make it more equitable.

By Allison Bailund, Deborah Tomaras, Michelle Cronquist, and Tina Gross

If a train is moving down the track, one can’t plop down in a car that is part of that train and pretend to be sitting still; one is moving with the train. Likewise, a society is moving in a certain direction—power is distributed in a certain way, leading to certain kinds of institutions and relationships, which distribute the resources of the society in certain ways. We can’t pretend that by sitting still—by claiming to be neutral—we can avoid accountability for our roles (which will vary according to people’s place in the system). A claim to neutrality means simply that one isn’t taking a position on that distribution of power and its consequences, which is a passive acceptance of the existing distribution. That is a political choice.[1]

Introduction

Library workers and patrons have long been frustrated with Library of Congress Subject Headings (LCSH) for being out of date and lacking well-known concepts with abundant usage. Contributors to the Subject Authority Cooperative Program (SACO) have made many improvements to LCSH by proposing new headings and revising existing terms. Those attempts, however, have sometimes been hampered by the Library of Congress’s (LC) preference for supposed neutrality within the vocabulary; Subject Headings Manual (SHM) instruction “H 204,” released in 2017, specifically dictates that proposed headings should “employ neutral (i.e., unbiased) terminology.”[2]

This desire for neutrality has been directly stated, alluded to, or otherwise upheld in myriad rejections of proposed subject headings, from Negative campaigning[3] to White flight.[4] Even Water scarcity, a quantifiable concept of worldwide concern, was rejected in 2008 as a non-neutral topic requiring value judgments with the following justification:

Works on the topics of water scarcity and water shortage have been cataloged using the heading Water-supply, post-coordinating[5] as necessary with additional headings such as Water conservation and Water resources management. The meeting determined that this practice is appropriate and should continue, since Water-supply is a neutral heading that does not require a judgment about the relative abundance of water.[6]

However, what exactly constitutes neutral and unbiased terminology is never defined in “H 204” or anywhere else in the SHM, nor in any other Library of Congress controlled vocabulary manuals.[7] Much of the previous literature on neutrality in libraries focuses on debates over possible definitions of the term and what role neutrality should play in library services and collections. Building off previous critical cataloging literature, which focuses on addressing problematic terms, subject hierarchies, and biases within cataloging standards, this article extends that scrutiny further. We analyze how neutrality is embedded in the LC structures and systems that vet the terms catalogers utilize to describe materials.

Our article examines the ways in which neutrality is enforced in LCSH rejections between July 2005 and December 2024. We review “Summaries of Decisions” from LC Subject Editorial Meetings (along with associated discussion and commentary in the field); within these, we identify and interpret patterns of justifications used to reject subject heading proposals and maintain purported neutrality within the vocabulary. We argue that neutrality has been used to keep many concepts depicting prejudice (racism, sexism, etc.), as well as concepts related to the lived experiences of marginalized people, out of the vocabulary and/or to obscure materials about those topics under other, often more generalized or euphemistic, terminology. As a counterpoint, we suggest a values- and equity-driven approach to replace the principle of neutrality in a cataloging context and within the subject approval process. We acknowledge that the current political situation may be particularly fraught for equity-driven change, but believe bowing to political pressures is untenable, and continued pursuit of neutrality will only serve to further the discordance between library values and the realities of LCSH.

Background

Neutrality: Assumed, but Nebulous

Schlesselman-Tarango notes the perceived conceptual importance of neutrality for libraries and librarianship; their “status as ‘an essential public good’” is “contingent on the perpetration of the idea that [they are] also neutral.”[8] Seale further situates this notion of libraries-as-neutral as not externally imposed, but emanating from within librarianship itself: “The positioning of the library as a neutral and impartial institution, separated from the political fray, resonates with dominant library discourse around libraries.”[9]

However, despite both critics and supporters assuming that neutrality is fundamental to librarianship, there is a dearth of references to the term in official documents underpinning the ethics and standards of the library profession. The American Library Association’s (ALA) Working Group on Intellectual Freedom and Social Justice observed, for example, that “the word neutrality does not appear in the Library Bill of Rights, the ALA Code of Ethics, and any other ALA statements that the Working Group could locate. It does not appear in the Intellectual Freedom Manual (10th Edition) nor is it defined in any official ALA document or policy.”[10] The International Federation of Library Associations and Institutions’ (IFLA) Code of Ethics mentions but does not define neutrality in Section 5, in sentences such as “Librarians and other information workers are strictly committed to neutrality and an unbiased stance regarding collection, access and service.”[11] For catalogers in particular, the Cataloging Code of Ethics, issued in 2021 and discussed further below, explicitly disputes the concept of neutrality.

Most pertinent to the subject proposal process, the National Information Standards Organization’s (NISO) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies mentions neutrality exactly twice, yet again without definition. The first instance, in guidance about choosing preferred forms of terms, asserts that “Neutral terms should be selected, e.g., developing nations rather than underdeveloped countries.”[12] The second appearance, in a discussion of synonyms, notes “pejorative vs. neutral vs. complimentary connotation[s]” of terms that might influence usage.[13] The latter reference positions neutrality as the impartial fulcrum of term meanings, while the former implies, particularly via the example, a more active attempt at choosing equitable and unbiased terminology.

Although the terms “neutral” and “unbiased” are often linked when they appear in library literature (as in the IFLA Code of Ethics), they are not synonymous. Oxford English Dictionary (OED) definitions of neutral include “inoffensive,” and “not taking sides in a controversy, dispute, disagreement, etc.”; unbiased, however, while meaning “not unduly or improperly influenced or inclined; [and] unprejudiced,” does not necessarily imply a lack of involvement in social or political issues.[14] The incompatibility between neutrality as inoffensive isolation versus unbiasedness as active equity plays out repeatedly in library discussions. Without clear definitions, neutrality in the NISO Guidelines and elsewhere is open to conjecture and interpretation. As noted by Scott and Saunders, “[T]he term ‘neutrality’ seems to be used for, or conflated with, everything from not taking a side on a controversial issue to the objective provision of information and a position of defending intellectual freedom and freedom of speech.”[15]

Proponents of library neutrality don’t fully agree on definitions, either. In Scott and Saunders’s survey, some describe it as “lacking bias,” which more closely aligns with principles of equity.[16] The depiction of neutrality by LaRue, the former Director of the ALA’s Office for Intellectual Freedom, also appears to resemble equity; he frames neutrality as not “deny[ing] people access to a shared resource just because we don’t like the way they think” and giving everyone “a seat at the table.”[17] Dudley, reframing library neutrality in relation to pluralism, highlights similar values; his proposed ethos calls on librarians to “adhere to principled, multi-dimensional neutrality” which includes “welcoming equally all users in the community” and “consistently apply[ing] procedures for engaging with the public.”[18]

The 2008 book Questioning Library Neutrality examines many aspects of why neutrality is both an illusion and a misguided aspiration, and also disabuses readers of the idea that it has always been a core value. Rosenzweig points out that neutrality as a principle of librarianship does not go back to the early development of public libraries:

We would do well to remember that, if libraries as institutions implicitly opened democratic vistas, our librarian predecessors were hardly democratic in their overt professional attitude or mission, being primarily concerned with the regulation of literacy, the policing of literary taste and the propagation of a particular class culture with all its political, economic and social prejudices. In fact, the idea of the neutrality of librarianship, so enshrined in today’s library ideology (and so often read back into the indefinite past), was alien to these earlier generations.[19]

Although Macdonald and Birdi’s literature review identifies four conceptions of neutrality within library science literature—“favourable,” “tacit value,” “libraries are social institutions,” and “value-laden profession”—the authors found that depictions of neutrality articulated by practitioners are more complicated. Many have “ambivalent” views of neutrality, seeing it as “a slippery and elusive concept.”[20] The relative importance of neutrality to proponents varies, depending on its position vis-à-vis other library values: “When it is alone, or grouped with a simple, single other value like professionalism, it is very low in priority. When it is presented in a group of other values or left implicit, it fares better.”[21] Catalogers tended to espouse neutrality the least among library specializations, with 21% reporting that they never think about neutrality.[22] Further, some surveyed librarians “are more likely to eschew neutrality on matters of social justice,” when neutrality comes into conflict with core library values.[23]

Neutrality versus Social Justice

Since the late 1960s, neutrality has increasingly come into question as librarians have embraced ideals centering social justice, equity, diversity, and inclusion, particularly in the ALA.[24] These values, codified in the ALA Code of Ethics and Library Bill of Rights, include a commitment to “recognize and dismantle systemic and individual biases; to confront inequity and oppression; to enhance diversity and inclusion; and to advance racial and social justice in our libraries, communities, profession, and associations.”[25] ALA resolutions go a step further, acknowledging the “role of neutrality rhetoric in emboldening and encouraging white supremacy and fascism.”[26] Scott and Saunders sum up the issue, noting that while some librarians cast neutrality as a “fundamental professional value, albeit one that is not explicitly mentioned in the professional codes of ethics and values,” others assert that it is “a false ideal that interferes with librarians’ role of social responsibility, which is an explicitly stated value of librarianship.”[27] As Watson argues in an ALA 2018 Midwinter panel on neutrality in libraries, “We can’t be neutral on social and political issues that impact our customers because, to be frank, these social and political issues impact us as well.”[28]

Even among library codes of ethics that explicitly hold neutrality as a core value, there is a tension between practitioners and official documentation. For example, the Canadian Federation of Library Associations / Fédération canadienne des associations de bibliothèques (CFLA-FCAB) Code of Ethics calls for librarians to “promote inclusion and eradicate discrimination,” provide “equitable services,” and “counter corruption directly affecting librarianship”; but the Code also advocates for neutrality, advising librarians to “not advance private interests or personal beliefs at the expense of neutrality.”[29] Once again neutrality remains undefined—though it’s implied, based on context, to be not taking sides, matching one of the OED definitions above. This understanding accords with a 2024 study which noted that most Canadian academic librarians seem to have coalesced around defining neutrality as “not taking sides,” followed by “not expressing opinions.”[30]

Yet the same study also highlights a perceived incompatibility of neutrality with other values of librarianship, with “the majority (54%) of respondents” disagreeing or strongly disagreeing that “‘neutrality is compatible with other library values and goals,’” and 58% disagreeing “that it is ethical to be neutral.”[31] Brooks Kirkland asserts that assuming neutrality as a key tenet of librarianship conflicts with such principles as promoting inclusion and eradicating discrimination.[32] Pagowsky and Wallace note that, whether knowingly or not, upholding neutrality within inequitable systems ultimately supports them: “Trying to remain ‘neutral,’ by showing all perspectives have value … is harmful to our community and does not work to dismantle racism. As Desmond Tutu has famously said, ‘If you are neutral in situations of injustice, you have chosen the side of the oppressor.’”[33]

Cataloguing Code of Ethics, Critical Cataloging, and Other Recent Developments

The incongruity between neutrality and social justice as core library values has sparked the numerous debates detailed above and on mailing lists and social media. It has also led in part to the expansion of the critical cataloging movement and the creation of the Cataloguing Code of Ethics, published in 2021 and since adopted by several library organizations, including the ALA division Core. The Code explicitly refutes the concept of neutrality; it avers that “neither cataloguing nor cataloguers are neutral,” and calls out the biases inherent within the dominant, mostly Western cataloging standards currently in use. It particularly notes that “cataloguing standards and practices are currently and historically characterised by racism, white supremacy, colonialism, othering, and oppression.”[34]

The most well-known critical cataloging subject heading proposal was the attempt to change the now-defunct heading Illegal aliens, as depicted in the documentary Change the Subject. In November 2021, five years after LC initially announced it would change the Illegal aliens subject headings and then backtracked after political pressure, LC announced it would replace the subject headings Aliens and Illegal aliens. However, LC did not adopt the changes it had initially announced, nor the recommendations made in a report by the ALA Subject Analysis Committee (SAC), which included revising the term to Undocumented immigrants.[35] LC instead split Illegal aliens into two new headings: Noncitizens and Illegal immigration.[36] Librarians have criticized the retention of “illegal” within one of the updated headings for continuing to make library vocabularies “complicit” with the “legally inaccurate” criminalization of undocumented immigrants.[37]

Other critical cataloging proposals have been subjected to inordinate scrutiny by LC; even when headings have been approved, they have sometimes faced heavy editing and modification. One example is Blackface, where LC’s changes to the proposal obscured the racism characterizing the phenomenon. The broader term (i.e., the parent in the subject hierarchy) was altered from Racism in popular culture to Impersonation.[38] Since Impersonation falls under the broader terms Acting, Comedy, and Imitation, this change emphasizes the performance aspect in lieu of its racist connotations. Similarly, the scope note (i.e., definition), was modified from “Here are entered works on the use of stereotyped portrayals of black people (linguistic, physical, conceptual or otherwise), usually in a parody, caricature, etc. meant to insult, degrade or denigrate people of African descent” to “Here are entered works on the caricature of Black people, generally by non-Black people, through the use of makeup, mannerisms, speech patterns, etc.”[39] As noted by Cronquist and Ross, these changes ultimately “neutralize[d]” the proposal “in the name of objectivity.”[40]

However, there have also been numerous successful updates to outdated terminology and additions of missing concepts, particularly in recent years. For example, in 2021, fifteen subject headings for the incarceration of ethnic groups during World War II, including Japanese Americans, were changed from the euphemistic phrase –Evacuation and relocation to –Forced removal and internment.[41] The African American Subject Funnel added the new heading Historically Black colleges and universities in 2022 and helped to revise Slaves to Enslaved persons in 2023; the Gender and Sexuality Funnel successfully changed the heading Gays to Gay people, and proposed the new term Gender-affirming care, in 2023; and the Medical Funnel updated Hearing impaired to Hard of hearing people in 2024.[42]

On a hopeful note, many of these large-scale projects coordinated with Cataloging Policy Specialists within LC, who worked closely with catalogers during the process and ensured that related term(s) and related Library of Congress Classification number(s) were updated as well. Further, LC has taken some recent steps to improve its vocabularies and create avenues for increased input from outside institutions. This includes hiring a limited term Program Specialist to help redress outdated terminology related to Indigenous peoples. LC also created two advisory groups for Demographic Group Terms and Genre/Form Terms, both of which allow for greater community input into these vocabularies.

Still, frustrations remain. Changing outdated terminology is a complicated process. Library of Congress vocabularies, in particular, are vulnerable to potential governmental interference. Attempted Congressional intervention during the updating of Illegal aliens and the passing of a statute mandating transparency in the subject approval process led to the creation of “H 204” codifying LC’s preference for a neutrality uninvolved in political and social issues.[43] The complication of bibliographic file maintenance (e.g., reexamining cataloged materials to determine whether subject headings should be changed, deleted, or revised) also muddies the waters and impedes large-scale projects. Staffing issues within LC further hinder the ability to undertake or complete projects, as seen in the SACO projects process, paused in 2025 due to LC’s catalog migration.

Maintaining LCSH

Library workers are familiar with LCSH in our discovery tools, and most are aware of concerns about outdated and problematic headings. However, they may not see debates and conflicts about new headings and ongoing maintenance of the vocabulary as a built-in and inherent part of the system, as catalogers who engage in that work do.

As Gross asserts:

To remain effective, headings must be regularly updated to reflect current usage. Today’s LCSH People with disabilities used to be Handicapped and, before that, Cripples. Additionally, new concepts require new headings, such as the recently created Social distancing (Public health), Neurodiversity, and Say Her Name movement. The process of determining which word or phrase to use as the subject heading for a given topic is inevitably fraught and can never be free of bias. The choice of terms embodies various perspectives, whether they are intentional and acknowledged or not.[44]

The need to continually revise existing headings and create new ones, and the wrangling over what they should be, are neither defects nor surprises. They flow directly from the purpose of controlled vocabulary and the complications of language it exists to help navigate—the ever-changing and endless variety of ways to refer to things.

Some of the frequency and intensity of debates about LCSH stem from the fact that it attempts to be a universal vocabulary that covers all branches of knowledge. While it is created and maintained primarily for the needs of the Library of Congress, it is used by all kinds of libraries. Balancing the need to serve a user base that consists of federal legislators and providing the world with a one-size-fits-all vocabulary is clearly a formidable and contradictory endeavor. In recent decades, LC has made significant progress in opening up the maintenance process to input and contributions from the broader library community via the SACO program. These changes appear to be partly in response to demands to make the process faster and more transparent, but also a desire by LC to incorporate broader perspectives and experiences and to help with the tremendous workload.

LCSH Creation and Revision Process

The SACO program, created circa 1993,[45] allows librarians to submit proposals for new or revised LCSH terms (as well as other LC vocabularies) to the Library of Congress. In order to submit proposals, catalogers are expected to be familiar with the Subject Headings Manual (SHM), which governs LCSH usage and formulations as well as the proposal process, required research, and criteria used to evaluate proposals.[46] One of the primary requirements is literary warrant: proposers must demonstrate that there is a need for the new subject heading based on a work being cataloged.[47] Beyond the work cataloged and published/reference sources, librarians can also cite user warrant, “the terminology people familiar with the topic use to describe concepts,” as justification in proposals.[48] This can include reviews, blog posts, social media threads, LibGuides, etc.

After a proposal is submitted, LC staff place it on a monthly “Tentative List,” which is published to allow for public comment on proposed headings. Taking those comments and SHM instructions into account, members of LC’s Policy, Training, and Cooperative Programs Division (PTCP) make a decision about whether to add the proposed heading to LCSH, send it back to the cataloger for revision and resubmission, or reject it. If the heading is not added, a monthly “Summary of Decisions” document details the reasons for its exclusion. While the SACO program allows external librarians to submit proposals, the Library of Congress maintains its “authority to make final decisions on headings added.”[49]

Most proposals are routine and relatively straightforward, such as those that follow patterns—repeated formulations of similar subjects that provide a predictable search structure for library patrons (e.g., Boating with dogs already exists and the cataloger wants to propose Boating with cats). SHM “H 180” notes that patterns help achieve desired qualities for the vocabulary, including “consistency in form and structure among similar headings.”[50] LC is also concerned with avoiding multiple subject headings that convey too closely related concepts. LCSH online training “Module 1.2” highlights both “consistency and uniqueness among subjects” as strengths of controlled library vocabularies, for instance.[51] Proposals that don’t follow patterns therefore receive more scrutiny, to make sure they are unique, definable topics. LC makes judgment calls based on the strength of the evidence in proposals, and on SHM instructions, including the guidance in “H 204” about neutrality.

Neutrality within LC Documentation

Within its official documentation on subject headings, LC mentions neutrality sparingly. In the entirety of the SHM, the word neutral appears only once, specifically in guideline “H 204” with the recommendation that catalogers “employ neutral (i.e., unbiased) terminology.”[52] Apart from an association with the term unbiased, neutral is not defined in “H 204” or anywhere else in the SHM. Online LCSH training, freely available from the Library of Congress website, offers similarly little on the concept of neutrality. “Module 1.4” recommends that catalogers “accept the idea that all knowledge is equal” and “remain neutral … and attempt to be as objective as possible” when describing material.[53]

Despite the lumping together of neutral and unbiased in “H 204,” a neutrality which calls for a static ignoring of social realities and historical context does not equal an unbiased active engagement against prejudice. The Merriam-Webster Dictionary’s definitions of “neutral” and “unbiased” make this clear. “Neutral” as “indifferent” and politically nonaligned echoes OED. But the definition of “unbiased” goes even further, meaning not just free from prejudice and “favoritism” but “eminently fair”[54]—an active and flexible balancing of interests inherently at odds with static and detached neutrality. Eliding the two concepts risks undermining the latter, and with it library ethics and values, resulting in the further entrenchment of Western, colonial, and other biases in LCSH.

The definition of neutrality that LC, and by extension LCSH, seems to favor is one of passivity. Neutrality as indifference to social realities appears, for instance, in LCSH training “Module 1.4.” The module acknowledges that library vocabularies “are culturally fixed” and “from a place; they are from a time; they do reflect a point of view.” However, rather than using that “realiz[ation]” to encourage periodic updating of outdated or potentially prejudicial content in LCSH, the module advises “accepting” that cultural fixity as immutable fact; it recommends that catalogers “remain neutral, suspend disbelief” and focus on (undefined) objectivity instead.[55] Objectivity also appears in “H 180,” which advises catalogers: “Avoid assigning headings that … express personal value judgments regarding topics or materials. … Consider the intent of the author or publisher and, if possible, assign headings … without being judgmental.”[56]

Here, as in “Module 1.4,” objectivity appears linked to neutrality; the implication is that a subject can only be described without bias if a cataloger is dispassionate and has no opinions on the topic. However, not all definitions of objectivity match this interpretation. Although OED defines objectivity as “detachment” and “the ability to consider or represent facts, information, etc., without being influenced by personal feelings or opinions,” Merriam-Webster’s definition is “freedom from bias” and a more actively equitable “lack of favoritism toward one side or another.”[57]

This disparity in meanings raises the question: What does it mean to describe a topic without judgment or bias? Is objectivity erasing any uncomfortable content in a topic, even if that erasure favors a biased status quo and/or muddies a topic’s meaning? Or, rather, is it objective to label something truthfully, even if the topic raises strong feelings? As demonstrated by the revisions to Blackface discussed above, changes to the scope note and broader term in the name of objectivity did not result in a clearer or less biased heading; instead, they obfuscated the racist intent behind the phenomenon.

Similarly, despite the assertion in “H 180,” a singular focus on authorial intent does not always result in a lack of bias or judgment in subjects. As noted by literary critics such as Wimsatt and Beardsley, “placing excessive emphasis on authorial intention [leads] to fallacies of interpretation,”[58] since readers only have access to the text in front of them; attempting to guess an author’s intent is already an act of judgment, not a discovery of objective facts. Further, if an author writes a prejudicial text, taking its content at face value risks replicating that bias through subject provision. LCSH terms such as Holocaust denial literature recognize and counter this, labeling Holocaust denial works as ones “that diminish the scale and significance of the Holocaust or assert that it did not occur.”[59] If catalogers relied strictly on authorial intent in the name of objectivity, those works would get misleading subjects such as Holocaust, Jewish (1939-1945) rather than Holocaust denial literature, tacitly legitimizing bias.

Thus, the SHM’s focus on objectivity and neutrality highlights incongruities and tensions within subject guidance and LCSH vocabulary itself between indifference and self-imposed inoffensiveness on the one hand, and actively countering bias and promoting equity on the other. As will be shown below, rejections in the name of neutrality reveal that in fact the proposal process itself has never been neutral or apolitical.[60]

Neutrality and SACO Rejections

LC’s adherence to an inflexible and indifferent definition of neutrality, critiquing proposals engaging with social and political realities as subjective and relying on value judgments, has led to the rejection of multiple headings that surface prejudice or describe the lives and experiences of marginalized peoples. Instead, rejections upholding neutrality reinforce hegemonic societal attitudes within LCSH.

Neutrality appears in several guises in proposal rejections in “Summaries of Decisions” from 2005 to 2025. The most obvious ones reference “H 204” and “neutral (i.e., unbiased) terminology,” including the 2008 rejection of Water scarcity and the 2024 rejection of White flight (discussed in more depth below).[61] Similar rejections use words such as “judgment” (including Negative campaigning in 2013, and Zombie firms in 2023); “pejorative” (e.g., Dive bars in 2010, and Banana republics in 2015); “vulgar and offensive” (such as Vaginal fisting and Anal fisting in 2010); “subjective” (such as African American successful people in 2009); “viewpoint” (including Jim Crow laws in 2019); and “non-loaded language” (e.g., Incarceration camps in 2024).[62]

Neutrality as non-involvement in political and social realities also appears in the rejection of proposals due to PTCP’s unwillingness to establish certain “patterns” of subject headings (i.e., set precedents for future headings of specific types). Pattern rejections often appear entirely arbitrary; that is, the rejections stated merely that PTCP did not wish to begin a pattern, and not that a proposal as formulated was missing vital elements, had no warrant, or did not conform to provisions stipulated in the SHM. Despite acknowledging in “Module 1.4” that the wrong subject heading “can make any resource in the collection ‘disappear,’”[63] these rejected patterns render certain topics invisible and unsearchable by library patrons.

Uncreated patterns include critiques of prejudicial attitudes and behaviors, particularly by governmental bodies, such as rejections of Prison torture in 2007 or Religious profiling in law enforcement in 2024.[64] Similarly, patterns that would have highlighted the unearned privilege and/or bigotry of certain groups remain largely unestablished, including Holocaust deniers (2016), Toxic masculinity (2020), and White privilege (rejected in 2011 and 2016, before finally being accepted as White privilege (Social structure) in 2022).[65] The rejection of White fragility in 2020 is particularly interesting, as the rationale was that “LCSH does not include any headings that ascribe an emotion or personality trait to a specific ethnic group or race, and the meeting does not want to begin the practice.”[66] However, LCSH has included since 2010 the heading Post-apartheid depression, meant to convey the mental health and feelings of white Afrikaners. So not all white people’s emotions appear off-limits—just ones that reveal systemic biases. PTCP also declined to create patterns naming discrimination directed at certain groups, such as Police brutality victims in 2014 and Missing and murdered Indigenous women in 2023.[67] In the latter case, the rejection of a term meant to highlight societal neglect of the violence against Indigenous peoples means that their existence and trauma continue to be hidden in library vocabularies and catalogs.

Pattern rejections not only make prejudices invisible in library catalogs, they also underrepresent concepts that celebrate or describe the cultures and experiences of marginalized peoples. Erasures of joy can be as damaging as erasures of struggle. Aronson, Callahan, and O’Brien’s discussion of themes related to people of color in picture books, for instance, could equally apply to messages portrayed in LCSH via what topics it hides or surfaces in library catalogs: a “predominance of Oppression … at the expense of other types of portrayals can send a message that suffering and struggle are definitive of a group’s experience, or even of victimhood.”[68] Instead, marginalized people “deserve to see themselves represented as people who lead full and dynamic lives and who are not fully defined by histories of oppression.”[69] Unaccepted subject headings of this type include African American successful people (2009), Overweight women’s writings (2011), Gay neighborhoods and Lesbian neighborhoods (2012), Gay personals (2018), Afro-pessimism (2021), and Indigenous popular culture (2024).[70] 

Absorbing a proposed critical term into a supposed “positive” equivalent also served to preserve an inoffensive neutrality in LCSH; this is seen in the rejection of Food deserts in 2014:

The concept of food desert has been defined in multiple ways by various governments and organizations, often in ways to suit their specific political agendas … The existing heading Food security is defined as access to safe, sufficient, and nutritious food. The existing heading is used for both the positive and negative (it has a UF [cross-reference for] Food insecurity), and the meeting feels that it adequately covers the concept of a food desert.[71]

Similarly, LC rejected a proposal for Genocide denial in 2017 with the rationale that the “positive” heading—Genocide—was sufficient for patron access: “A heading for a concept in LCSH includes both the positive and negative aspects of that topic. A work about the denial of genocide still discusses the concept of Genocide.”[72] Slum clearance was also rejected in 2007 in favor of the euphemistic and supposedly equivalent Urban renewal.[73]

Sometimes rejections upholding neutrality appeared in the guise of a fear that the term might be misapplied. For instance, although LC acknowledged in its 2019 rejection of Jim Crow laws and Jim Crow (Race relations) that the headings described laws and attitudes promulgated during a specific time period—which could therefore be described in a scope note guiding subject usage—it claimed that “the meeting is also concerned that the heading would be assigned only if the phrase Jim Crow is used in the title.”[74] In other words, the rejection prioritized avoiding possible future confusion over a definable term with ample literary and user warrant. The potential for definitional uncertainty also fueled other rejections, such as Femicide and Secret police in 2010, and Forced assimilation in 2024.[75] To preempt said confusion in all of these cases, LC could have added scope notes defining appropriate usage. Subjects have been remediated in the past when found to be misused, via clarifying scope notes or additional term creation, as with Romance literature (now Romance-language literature) versus Love stories (now Romance fiction).[76] Instead of denying the proposal due to a fear that a term might be misapplied, LC could have worked with the proposers to ensure the heading clearly defined the topic and, if necessary, made a public announcement with additional guidance on how to retrospectively add the term.

Overly limiting definitions of subjects also provided reasoning for neutrality-based proposal rejections. An attempt in 2011 to add the natural language phrase Queer-bashing as a cross-reference under the then-current heading Gays–Violence against, for example, was rejected with the justification that “queer-bashing is not necessarily violent.”[77] Intersexuality–Law and legislation, a heading reflecting ongoing debates about genital surgeries on infants and legally recognized genders, was rejected in 2016 because “The subdivision –Law and legislation free-floats [i.e., can be used] under ‘headings for individual or types of diseases and other medical conditions, including abnormalities, functional disorders, mental disorders, manifestations of disease, and wounds and injuries’ (SHM H 1150).”[78] The medicalizing language of the rejection reinforced the view of intersexuality as a “condition” or “disorder” needing fixing, rather than the natural human diversity of a group struggling for bodily autonomy and human rights. The rejection of Redlining in 2024 also fits this definitional pattern. Despite acknowledging that Redlining “functioned in many different financial contexts,” LC’s rejection implied that redlining’s definition was too broad, as LC preferred “the specificity of … separate headings.”[79] This continues to fracture the topic into multiple subjects such as Discrimination in financial services, Discrimination in mortgage loans, and Discrimination in credit cards. The rejection also sidestepped notions of governmental complicity in redlining, and whitewashed the topic by making it appear less systemic in nature.

Purported limitations of the vocabulary also served as justification for rejecting proposals and upholding LCSH neutrality. For instance, Butch/femme (Gender identity) was deemed “too narrow and specialized for a general vocabulary such as LCSH” in 2011 (though Butch and femme (Lesbian culture) was later approved in 2012)[80]—this, despite the copious presence of narrow terms in LCSH about other topics, such as Madagascar hissing cockroaches as pets, Photography of albatrosses, Church work with cowgirls and Zariski surfaces. Anal fisting and Vaginal fisting were rejected with the same rationale in 2010 (in addition to the “vulgar and offensive” argument described above).[81] Two rejections utilizing the same reasoning raise the question of whether queer cultures and identities were evaluated using particularly stringent criteria. As one librarian noted in the RADCAT mailing list after the rejection of Butch/femme (Gender identity):

This is especially baffling given that Bears (Gay culture) has been a valid subject heading for years, and both concepts have about the same amount of literary warrant. For those of you keeping track at home, this isn’t the first example of this rejection. During The Great Fisting Debacle of 2010 … the Anal fisting and Vaginal fisting proposals were shot down using the same language. I haven’t seen PSD [the prior name for PTCP] rejecting scientific or technical heading proposals as too specialized, which makes me wonder if it’s only gender & sexuality-related headings that receive this type of scrutiny.[82]

Troublingly, rejections for queer identities have continued since LC resumed processing tentative lists in January 2025, particularly for queer youth proposals. The rejection of Sexual minority high school students, for instance, indicates potential deference to current governmental queerphobia, particularly since the phrase “At this time” prefaces the justification: “At this time, it is not desirable to qualify headings for this age group by gender identity or expression/sexual orientation.” LC’s recommendation that “[t]erms from other subject vocabularies such as Homosaurus may be used instead of, or in conjunction with, existing LCSH headings to express the topic” suggests that there is no place for queer youth identity headings within LCSH.[83]

Finally, proposals were rejected in favor of maintaining pre-existing biases in LCSH: the cultural fixity mentioned in LCSH training “Module 1.4.”[84] For instance, a 2015 rejection of a change proposal related to Indigenous peoples–South Africa highlighted in its rationale the scope note for Indigenous peoples defining them entirely in relation to colonial power: “Here are entered works on the aboriginal inhabitants either of colonial areas or of modern states where the aboriginal peoples are not in control of the government.”[85] Sometimes, even the longevity of a term within LCSH was treated as sufficient reason to reject proposals meant to update outdated and inequitable terms, as with the 2020 rejection of a proposed change from Juvenile delinquents to Juvenile prisoners: “The existing heading Juvenile delinquents has been used for this concept for many years. At this point, it would be practically impossible to examine the entire file so the new heading could be applied accurately. The heading Juvenile delinquents should be assigned instead.”[86] This hesitance to tackle large projects because of the labor required for bibliographic file maintenance perpetuates the tendentious language present in LCSH and reinforces the view that the proposal process is itself not neutral.

Case Study: White Flight

In 2024, the African American Subject Funnel Project submitted a subject proposal for White flight. The proposal cited Kruse’s book White Flight: Atlanta and the Making of Modern Conservatism to demonstrate literary warrant. It additionally cited three reference sources—Encyclopedia of African-American Politics, The New Encyclopedia of Southern Culture, and Wikipedia—in order to define the term and demonstrate that it is commonly used by scholars and the public.

  • [Proposed Heading]: White flight
  • [Variant Term]: White exodus
  • [Broader Term]: Migration, Internal
  • [Broader Term]: Race relations
  • [Broader Term]: White people–Migrations
  • [Related Term]: Segregation
  • [Source]: Kruse, K.M. White flight, ©2005: summary (In this reappraisal of racial politics in modern America, Kevin Kruse explains the causes and consequences of “white flight” in Atlanta and elsewhere) page 5 (In 1963 alone, there were 52 cases of “racial transition,” incidents in which whites fled from neighborhoods as blacks bought homes there; a steady stream of white flight had been underway for nearly a decade)
  • [Source]: Encyclopedia of African-American politics, 2021 (“White flight” is the term used to refer to the tendency of whites to flee areas and institutions once the percentage of blacks reaches a certain level)
  • [Source]: The new encyclopedia of southern culture, 2010 (The term “white flight” refers to the spatial migration of white city dwellers to the suburbs that took place throughout the United States after World War II. One of the most powerful and transformative social movements of the 20th century, white flight significantly affected the class and racial composition of cities and metropolitan areas and the distribution of a conservative postwar political ideology)
  • [Source]: Wikipedia, 16 Oct. 2023 (White flight or white exodus is the sudden or gradual large-scale migration of white people from areas becoming more racially or ethnoculturally diverse. Starting in the 1950s and 1960s, the terms became popular in the United States; examples in Africa, Europe, and Oceania as well as the United States)

However, LC rejected White flight with the following rationale: “LCSH does not currently have an established pattern that combines the topic of migration with the social reasoning for that migration. The meeting was concerned that introducing such a pattern, particularly in this case, would contradict the practice in LCSH of preferring neutral, unbiased terminology as stated in SHM H 204 sec. 2.”[87]

After this Summary of Decisions was issued, librarians on the SACOLIST mailing list publicly disagreed with the rejection and pointed out the flaws in LC’s argument. One poster highlighted the fact that the term was in common use and searched for by library patrons; they also noted another heading already in LCSH that fit the pattern PTCP claimed didn’t exist:

According to H 204 Section 2, the proposed heading should “reflect the terminology commonly used to refer to the concept,” which I believe is the case with this term. Additionally, the same section of H 204 asks, “Will the proposed revision enhance access to library resources? Would library users find it easier to discover resources of interest to them if the proposed change were to be approved?” Again, if this phrase is commonly used by patrons, it would make sense to add it to our catalogs … You wrote that “LCSH does not currently have an established pattern that combines the topic of migration with the social reasoning for that migration.” Could someone explain why Great Migration, ca. 1914-ca. 1970 doesn’t fit this pattern? Is it because of the date range and that this is a specific event?[88]

Another librarian emphasized the ongoing importance of white flight, the prevalence of literature discussing it, and the unequal treatment of headings describing different groups in LCSH:

The differences between these proposals from my perspective seems to be that one describes African Americans and the other describes White people, and White flight is an ongoing concept rather than a single historical event. I hope PTCP reconsiders this decision, because the effects of White flight and the practices surrounding it shape racial inequality in the United States and in many other countries in the world. Many works describe White flight and its consequences … and users are familiar with the term and want to find works about it.[89]

Finally, a respondent noted yet another term matching the supposedly non-existent pattern: “The existing heading Amenity migration would also appear to provide a pattern combining the topic of migration with the social reasoning for that migration.”[90]

Despite these arguments, LC neither responded to the mailing list discussion nor changed its decision. As White flight had literary warrant, was amply supported by reference sources, and was a concept that could not be accurately conveyed using already existent subject headings, why was PTCP concerned about neutrality “particularly in this case”? Even governmental entities as varied as the Supreme Court, the U.S. Commission on Civil Rights, the National Register of Historic Places, and LC itself use the term white flight. The rejection’s insistence on the need for uninvolved neutrality therefore seemed inconsistent with the widespread acceptance of the term.

Instead, the neutrality justification appears to be a smokescreen to cover up discomfort with a term that called out white racism; mandating neutrality in this case meant privileging being inoffensive to white people over acknowledging a widely accepted critique of systemic racism. Patton notes in her Substack post “White People Hate Being Called ‘White People’” that whiteness functions in part by invisibility, a “retreat into universalism where whiteness can dissolve back into ‘humanity’ and avoid accountability.”[91] Rejecting the proposal may have been a neutral decision (i.e., deliberately unobjectionable and indifferent to political and social realities), but it was certainly not unbiased (i.e., free from favoritism). Instead, it conceptually reinforced the false position of whiteness described by Patton as “the default, neutral, objective, and moral”[92]—thus undermining equity in LCSH and making works on this important topic invisible and unsearchable in library catalogs.

Discussion

Chiu, Ettarh, and Ferretti describe the futility of relying on neutrality to further social justice within librarianship and its vocabularies:

When the profession discusses neutrality, we believe that the profession actually seeks equity. However, neutrality will not yield equitable results and will always fall short because it relies on equity already existing in society. This is not the condition of our current society, nor is it true for the profession. Therefore, neutrality will actually work toward reinforcing bias and racism.[93]

The rejection of White flight illustrates this point aptly. Justifying the rejection by invoking neutrality means that, practically speaking, being neutral equates to whitewashing the ongoing phenomenon: it pretends that the movement of white people in the United States is entirely benign, divorced from racism, and not worth library or library user attention. What are the long-term consequences of privileging neutrality, as opposed to equity, in the subject approval process? Neutrality as political isolationism and mandated inoffensiveness leads, as seen in the rejections from 2005 through 2024, to suppressing political and social critiques, hiding prejudice, and rendering the lived experiences of marginalized groups invisible.

It is unfortunately far too easy to weaponize a neutrality that, when evaluating proposals, gives equal weight to the intentions of groups such as racists and antisemites. An SHM instruction created in late 2024, “H 1922,” further embeds this weaponization within subject guidance. “H 1922” defines “offensive words” as “derogatory terms that insult, disparage, offend, or denigrate people according to their race, ethnicity, nationality, religion, gender identity, sexuality, occupation, social views, political views, etc.”[94] By including political and social views in the definition, LC inaccurately equates groups espousing opinions about how people should behave in society with demographic groups who have historically been marginalized merely for existing. This leaves LCSH vulnerable to political actors disingenuously claiming “offense” to silence critiques or establish prejudicial terms within the vocabulary. A recent example of this was the proposal to change Trans-exclusionary radical feminism into Gender-critical feminism, the obfuscatory label preferred by the transphobic group, by claiming that trans-exclusionary radical feminism was a slur.[95] (LC ultimately rejected the proposal, thanks in large part to “community activism” and mobilization opposing the change.[96] LC specifically mentioned library community input as the rationale for the rejection: “When this tentative list was published in November 2024, PTCP received over 300 email comments demanding rejection of this proposal.”[97])

There is ample evidence from the recent past and present of this weaponization of offense being used to undermine progress toward equity in the United States. The Trump administration’s proposed Compact for Academic Excellence in Higher Education (2025) exemplifies the dangers of privileging neutrality over equity. The Compact demands “institutional neutrality,” requiring that universities and their employees “abstain from actions or speech relating to societal and political events except in cases in which external events have a direct impact upon the university.” Those agreeing to this isolationist neutrality, in the meantime, would also agree to erase trans, non-binary, and intersex students, faculty, and staff, and to police and punish speech deemed offensive to conservatives. Notably, the Compact requires that admissions be based on “objective” criteria—except for explicitly-allowed faith, “sex-based,” and anti-immigrant biases.[98]

Mandated neutrality within “H 204” risks reifying the same prejudices within library vocabularies. This can be seen in LC’s recent alteration of Mexico, Gulf of to America, Gulf of, and Denali, Mount (Alaska) to McKinley, Mount (Alaska).[99] Critical cataloger Berman describes the former change as “linguistic imperialism,” and the latter as an “affront to Alaska’s indigenous population.”[100] The latter change is particularly damaging, given the simultaneous effort by LC to remediate LCSH related to Indigenous peoples, and might undermine confidence in the project. In both cases, a neutral approach—remaining uninvolved in political and social events—led to an undue “deference to chauvinistic, ethnocentric, and unjustified authority.”[101] Whether LC realistically could have resisted altering these headings is a counterfactual question. Its actions must be judged by the effects of these revisions within library catalogs and for library patrons. By clinging to the illusion of neutrality, and capitulating to the whims of a racist and colonialist regime, LC undermined the profession’s stated values and harmed the larger library community.

Recommendations

What philosophical approach can LC take in lieu of neutrality, to bring the SACO process more in concert with library ideals of equity and egalitarianism? We recommend that LC employ a values-driven approach to vocabulary construction and maintenance. Explicitly stated library values—particularly around social justice and social responsibility—benefit all users, both marginalized peoples and the “mainstream.” Further, the PCC Policy Committee, of which LC is a permanent member, has already committed to the PCC Guiding Principles for Metadata, which acknowledge that “the standards and controlled vocabularies we use and their application are biased,” and advocates for “incorporating DEI principles in all aspects of cataloging work.”[102] Below, we suggest a number of changes LC could enact to make LCSH and the proposal process more equitable.

In backing away from neutrality as a guiding principle, philosophical approaches that have been suggested in critiques of traditional practice deserve consideration. In her chapter in Questioning Library Neutrality, Iverson proposes that librarians adopt feminist philosopher Haraway’s approach to objectivity: “Haraway explains that what we have accepted as ‘objectivity’ claims to be a vision of the world from everywhere at once … We can not see from all perspectives at once, we each have our own particular views that are shaped by our own identities, cultures, experiences, and locations.”[103] Instead of claiming to possess “infinite vision,” Iverson recommends that we adopt Haraway’s recognition of “situated knowledge.”[104]

Watson argues that instead of literary or bibliographic warrant (cataloging a book in hand, asking what subject headings are needed to convey its content), critical catalogers “operate from a position of catalogic warrant, reading the terms and hierarchies of cataloging and classification systems with a critical eye, reflecting on the potential benefit or harm of each term on marginalized users, groups, or the GLAMS [galleries, libraries, archives, and museums] community as a whole.”[105] In other words, librarians should focus on the subject heading system in its entirety, asking what revisions and additions are needed. In some ways, by collaborating with SACO funnels on large-scale projects to create and revise related groups of subject headings, LC has already moved away from strict adherence to an interpretation of literary warrant under which the only valid reason to propose a subject heading is having a book in hand that requires it. This shift should be continued and expanded.

As for concrete actions, we advise that LC restore its open monthly subject editorial meetings where proposals are discussed, and expand points of communication with external libraries. This would allow a more diverse range of librarians to participate in the SACO process and provide valuable input during decision-making. Other benefits of monthly meetings have been noted by SACO librarians in an open letter to PTCP: they helped to demystify “the SACO process” for the newly involved, and allowed librarians to contribute to “lively conversations on a broad range of options, and the opportunity to shape the vocabularies we all use, from proposing single headings to creating special lists to debating new guidelines for topical subdivisions.”[106]

Building off of this, we suggest creating an external advisory group for LCSH, similar to the ones for LCDGT and LCGFT, to get input from a broader range of users on proposal vetting and vocabulary maintenance. Further, we urge LC to allow greater decision-making power for external librarians in all advisory groups. This would help LC vocabularies better reflect the resources in the Library of Congress collections and the needs of thousands of libraries of different types around the world, and improve accountability for decisions made regarding proposals. It would also help to better insulate library vocabularies from the governmental interference noted above, by making a broad range of institutions responsible for their creation and maintenance.

Within such bodies, we recommend that LC follow guidance from the SAC Working Group on External Review of LC Vocabularies, by including members from groups being described in those vocabularies, subject matter experts, and international representatives. Furthermore, membership should not include “[r]epresentatives from groups or organizations that purport to speak for marginalized communities, but who exclude the voices of members of the marginalized community,” or “[r]esearchers or representatives from groups or organizations where the experts cause harm to members of marginalized communities.”[107] The inclusion of representative groups aligns with the PCC Guiding Principles for Metadata and follows the principles put forth in the Cataloguing Code of Ethics.

In vetting SACO proposals, “LC should prioritize sources from the peoples and communities described, privileging those sources over traditionally ‘authoritative’ sources, including literary warrant,” to ensure that the terminology used “reflect[s] a more inclusive and culturally relevant understanding of the language associated with these groups and their heritage and history.”[108] The creation of a position within LC focused on remediating metadata related to Indigenous peoples was a good first step in this direction, and we strongly encourage LC to both continue and expand this practice.

Finally, we suggest revisions to various LC documents and SHM instruction sheets. References to neutrality should be removed from “H 204” and “Module 1.4,” in favor of a focus on active equity in subject assignment and proposals. Examples of unbiased terminology, created in concert with advisory groups described above, reflecting a variety of situations, and periodically updated, would help create a shared understanding between librarians proposing headings and those evaluating them for inclusion in LCSH. “H 180” and “Module 1.4” should also be edited, in the sections advising catalogers to remain objective and not “express personal value judgments.”[109] All cataloging relies on judgment, and judgment is not always synonymous with bias or divorced from facts. A more useful focus here, as in a revised “H 204,” would be on the active equity present in Merriam-Webster’s definition of objectivity; catalogers should employ “catalogic warrant” and evaluate the “potential benefit or harm”[110] of subjects, particularly when assigning headings to prejudicial works. To protect against weaponized “offense,” we also recommend that “social views” and “political views” be removed from “H 1922.” These alterations would bring the SHM and LCSH training more in line with LCDGT guidance, which foregrounds cataloging ethics. “L 400,” for instance, notes that “naming demographic groups and identifying individuals as members of those groups must be done with accuracy and respect,” and highlights the importance of self-identification when assigning headings.[111]

We cannot make recommendations on this topic without addressing the current political climate. Because LC’s catalog migration put most SACO work on hold during 2025,[112] the effect of the Trump administration’s anti-DEI policies on LCSH remains uncertain. However, United States history is rife with periods of political repression. Waiting until relative calm to advocate for equity has not been, historically, how equity was advanced, and it will not serve library patrons or the broader community in the present moment.

Conclusion

LCSH began over a century ago as a subject cataloging tool for the Library of Congress, and has since evolved into a vocabulary serving thousands of libraries around the world. Despite the broad and diverse user base, LC has remained the sole arbiter of which proposals are accepted into LCSH and what form the headings take. During the last two decades it has rejected a number of subject proposals due to a preference for purported neutrality and objectivity, in various guises. Yet, as a profession, librarianship claims to prioritize social responsibility. Social justice and equity are incompatible with an indifferent and purposefully inoffensive neutrality that allows harmful, colonialist, and racist headings in LCSH, and keeps out headings describing prejudice, or about the lived experiences of marginalized peoples.

Olson describes LCSH as “a Third Space between documents being represented and users retrieving them,” since “LCSH constructs the meanings of documents for users.”[113] These meanings impact how users view materials, and whether they can locate them in library catalogs. And it is within this space that LC’s commitment to neutrality fails both users and the ideals of librarianship around social responsibility. However, “because the Third Space is one of ambivalence, it is one with potential for change.”[114] By focusing on library values rather than neutrality within the subject creation and approval process, LCSH could develop into a vocabulary that constructs truly equitable and inclusive meanings for users and librarians alike.

Acknowledgements

Thank you to our publishing editor, Jess Schomberg, and the editorial board for their flexibility, guidance, and expertise throughout the publication process. Thank you to K.R. Roberto, Margaret Breidenbaugh, Crystal Yragui, and Matthew Haugen, who allowed us to quote them within this article. We would also like to thank our reviewers, Jamie Carlstone and Ian Beilin, and other readers who gave valuable feedback: Adam Schiff, Rebecca Albitz, Chereeka Garner, Rebecca Nowicki, Naomi Reeve, Simone Clunie, Violet Fox, and Stephanie Willen Brown.


[1] Robert Jensen, “The Myth of the Neutral Professional,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 91.

[2] Library of Congress, “H 204: Evaluating Subject Proposals,” in Library of Congress Subject Headings Manual, Aug. 2025 rev. (Library of Congress, 2025), 2, https://www.loc.gov/aba/publications/FreeSHM/H0204.pdf (original: https://web.archive.org/web/20180524054119/https://www.loc.gov/aba/publications/FreeSHM/H0204.pdf).

[3] Throughout this article, authorized subject headings (i.e., those that exist currently in LCSH) are presented in bold font; while rejected proposed headings appear in italics. For consistency, subject headings within quotations will follow the same formatting, regardless of the formatting used in the original quotation.

[4] Library of Congress, “Summary of Decisions, Editorial Meeting Number 10” (Library of Congress, 2013), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-131021.html; Library of Congress, “Summary of Decisions, LCSH/LCC Editorial Meeting Number 02 (2024)” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2402.pdf.

[5] Post-coordination is the practice of using multiple, separate LCSH terms in combination to convey a single concept.

[6] Library of Congress, “Summary of Decisions, Editorial Meeting Number 4” (Library of Congress, 2008), https://www.loc.gov/aba/pcc/saco/cpsoed/cpsoed-080123.html.

[7] See the manuals for Genre/Form Terms, Demographic Group Terms, and Children’s Subject Headings, for instance.

[8] Gina Schlesselman-Tarango, “How Cute!: Race, Gender, and Neutrality in Libraries,” Partnership: The Canadian Journal of Library and Information Practice and Research 12, no. 1 (Aug. 2017): 10, https://doi.org/10.21083/partnership.v12i1.3850.

[9] Maura Seale, “Compliant Trust: The Public Good and Democracy in the ALA’s ‘Core Values of Librarianship,’” Library Trends 64, no. 3 (2016): 589, https://doi.org/10.1353/lib.2016.0003.

[10] American Library Association Working Group on Intellectual Freedom and Social Justice, “Final Report from the Intellectual Freedom and Social Justice Working Group” (EBD #10.0, American Library Association, 2022), 10, https://www.ala.org/sites/default/files/aboutala/content/governance/ExecutiveBoard/20222023Docs/ebd%2010.0%20IF_SJ%20Final%20Report%207.12.2022.pdf.

[11] International Federation of Library Associations and Institutions, “IFLA Code of Ethics for Librarians and other Information Workers,” 4, https://www.ifla.org/wp-content/uploads/2019/05/assets/faife/publications/IFLA%20Code%20of%20Ethics%20-%20Long_0.pdf.

[12] National Information Standards Organization, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies, ANSI/NISO Z39.19-2005 (R2010) (National Information Standards Organization, 2010), 30, https://groups.niso.org/higherlogic/ws/public/download/12591/z39-19-2005r2010.pdf.

[13] National Information Standards Organization, Guidelines, 44.

[14] Oxford English Dictionary, “Neutral,” https://www.oed.com/dictionary/neutral_n?tab=meaning_and_use#34680278 and “Unbiased,” https://www.oed.com/dictionary/unbiased_adj?tab=meaning_and_use#17025200.

[15] Dani Scott and Laura Saunders, “Neutrality in Public Libraries: How Are We Defining One of Our Core Values?,” Journal of Librarianship and Information Science 53, no. 1 (2020): 153, https://doi.org/10.1177/0961000620935501.

[16] Scott and Saunders, “Neutrality in Public Libraries,” 158.

[17] “Are Libraries Neutral? Highlights from the Midwinter President’s Program,” American Libraries, June 1, 2018, https://americanlibrariesmagazine.org/2018/06/01/are-libraries-neutral/.

[18] Michael Dudley, “Library Neutrality and Pluralism: A Manifesto,” Heterodoxy in the Stacks, Aug. 8, 2023 https://hxlibraries.substack.com/p/library-neutrality-and-pluralism.

[19] Mark Rosenzweig, “Politics and Anti-Politics in Librarianship,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 5-6.

[20] Stephen Macdonald and Briony Birdi, “The Concept of Neutrality: A New Approach,” Journal of Documentation 76, no. 1 (2020): 333–353. https://doi.org/10.1108/JD-05-2019-0102.

[21] Jaeger-McEnroe, “Conflicts of Neutrality,” 3.

[22] Jaeger-McEnroe, “Conflicts of Neutrality,” 6.

[23] Jaeger-McEnroe, “Conflicts of Neutrality,” 9.

[24] Steve Joyce, “A Few Gates Redux: An Examination of the Social Responsibilities Debate in the Early 1970s and 1990s,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 33-65.

[25] “ALA Code of Ethics,” American Library Association, updated June 29, 2021, https://www.ala.org/tools/ethics.

[26] “Resolution to Condemn White Supremacy and Fascism as Antithetical to Library Work,” American Library Association, Jan. 25, 2021, https://tinyurl.com/yr4z9e8x.

[27] Scott and Saunders, “Neutrality in Public Libraries,” 153.

[28] “Are Libraries Neutral?”

[29] Canadian Federation of Library Associations / Fédération canadienne des associations de bibliothèques, “CFLA-FCAB Code of Ethics,” updated Aug. 27, 2018, https://cfla-fcab.ca/wp-content/uploads/2019/06/Code-of-ethics.pdf.

[30] Jaeger-McEnroe, “Conflicts of Neutrality,” 5.

[31] Jaeger-McEnroe, “Conflicts of Neutrality,” 5, 6.

[32] Anita Brooks Kirkland, “Library Neutrality as Radical Practice,” Synergy 19, no. 2 (Sept. 2021), https://www.slav.vic.edu.au/index.php/Synergy/article/view/536.

[33] Nicole Pagowsky and Niamh Wallace, “Black Lives Matter!: Shedding Library Neutrality Rhetoric for Social Justice,” College & Research Libraries News 76, no. 4 (2015): 198. https://crln.acrl.org/index.php/crlnews/article/view/9293/10374.

[34] Cataloging Ethics Steering Committee, “Cataloguing Code of Ethics,” January 2021, http://hdl.handle.net/11213/16716.

[35] Subject Analysis Committee Working Group on the LCSH “Illegal aliens,” “Report from the SAC Working Group on the LCSH ‘Illegal aliens,’” July 13, 2016, https://alair.ala.org/handle/11213/9261.

[36] Jill E. Baron, Violet B. Fox, and Tina Gross, “Did Libraries ‘Change the Subject’? What Happened, What Didn’t, and What’s Ahead,” in Inclusive Cataloging: Histories, Context, and Reparative Approaches, eds. Billey Albina, Rebecca Uhl, and Elizabeth Nelson (ALA Editions, 2024), 53; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 11 (November 12, 2021)” (Library of Congress, 2021), https://classweb.org/approved-subjects/2111b.html.

[37] Baron et al., “Did Libraries ‘Change the Subject?,’” 54.

[38] Michelle Cronquist and Staci Ross, “Black Subject Headings in LCSH: Successes and Challenges of the African American Subject Funnel Project,” Reference and User Services Association, July 7, 2021, virtual. https://d-scholarship.pitt.edu/41826

[39] Cronquist and Ross, “Black Subject Headings in LCSH.”

[40] Cronquist and Ross, “Black Subject Headings in LCSH.”

[41] Library of Congress, “Library of Congress Subject Headings Approved Monthly List 06 (June 18, 2021)” (Library of Congress, 2021), https://classweb.org/approved-subjects/2106.html. Note the headings for Japanese Americans, Japanese Canadians, and Aleuts were originally submitted as –Forced removal and incarceration matching preferred usage, but LC changed them all to –Forced removal and internment.

[42] Library of Congress, “Library of Congress Subject Headings Approved Monthly List 08 (August 12, 2022)” (Library of Congress, 2022), https://classweb.org/approved-subjects/2208.html; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 08 LCSH 2 (August 18, 2023)” (Library of Congress, 2023), https://classweb.org/approved-subjects/2308a.html; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 04 (Apr. 21, 2023)” (Library of Congress, 2023), https://classweb.org/approved-subjects/2304.html; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 03 LCSH 2 (March 15, 2024)” (Library of Congress, 2024), https://classweb.org/approved-subjects/2403a.html.

[43] For more information about Congressional actions related to the attempt to change Illegal aliens, see: SAC Working Group on Alternatives to LCSH “Illegal aliens,” “Report of the SAC Working Group on Alternatives to LCSH ‘Illegal aliens’” (American Library Association, 2020), http://hdl.handle.net/11213/14582.

[44] Tina Gross, “Search Terms up for Debate: The Politics and Purpose of Library Subject Headings,” Perspectives on History 60, no. 3 (2022), https://www.historians.org/perspectives-article/search-terms-up-for-debate-the-politics-and-purpose-of-library-subject-headings-march-2022/.

[45] Michael Colby, “SACO: Past, Present, and Future,” Cataloging & Classification Quarterly 58, no. 3-4 (2020): 287, https://doi.org/10.1080/01639374.2019.1706679.

[46] Library of Congress Subject Headings Manual, Aug. 2025 rev. (Library of Congress, 2025), https://www.loc.gov/aba/publications/FreeSHM/freeshm.html.

[47] Library of Congress, “Module 1.5: Introduction to LCSH,” in Library of Congress Subject Headings: Online Training (Library of Congress, 2016), 8, https://www.loc.gov/catworkshop/lcsh/PDF%20scripts/1-5%20Intro%20To%20LCSH.pdf.

[48] Rich Gazan, “Cataloging for the 21st Century Course 3: Controlled Vocabulary & Thesaurus Design Trainee’s Manual” in Library of Congress Cataloger’s Learning Workshop (Library of Congress, n.d.), 2-2, https://www.loc.gov/catworkshop/courses/thesaurus/pdf/cont-vocab-thes-trnee-manual.pdf.

[49] Library of Congress, “H 204,” 3.

[50] Library of Congress, “H 180: Assigning and Constructing Subject Headings,” in Library of Congress Subject Headings Manual, Feb. 2016 rev. (Library of Congress, 2016), 8, https://www.loc.gov/aba/publications/FreeSHM/H0180.pdf.

[51] Library of Congress, “Module 1.2: Why Do We Use Controlled Vocabulary?,” in Library of Congress Subject Headings: Online Training (Library of Congress, 2016), 7, https://www.loc.gov/catworkshop/lcsh/PDF%20scripts/1-2-WhyCV.pdf.

[52] Library of Congress, “H 204,” 2.

[53] Library of Congress, “Module 1.4: How Do We Determine Aboutness?,” in Library of Congress Subject Headings: Online Training (Library of Congress, 2016), 3, https://www.loc.gov/catworkshop/lcsh/PDF%20scripts/1-4-Aboutness.pdf.

[54] Merriam-Webster Dictionary, “Neutral,” https://www.merriam-webster.com/dictionary/neutral and “Unbiased,” https://www.merriam-webster.com/dictionary/unbiased.

[55] Library of Congress, “Module 1.4,” 3.

[56] Library of Congress, “H 180: Assigning and Constructing Subject Headings,” in Library of Congress Subject Headings Manual, Feb. 2016 rev. (Library of Congress, 2016), 7, https://www.loc.gov/aba/publications/FreeSHM/H0180.pdf.

[57] Oxford English Dictionary, “Objectivity,” https://www.oed.com/dictionary/objectivity_n?tab=meaning_and_use#34080200; Merriam-Webster Dictionary, “Objectivity,” https://www.merriam-webster.com/dictionary/objectivity.

[58] Michael R. Griffiths, “Roland Barthes Declared the ‘Death of the Author’, but Postcolonial Critics have Begged to Differ,” The Conversation, July 2, 2025, https://theconversation.com/roland-barthes-declared-the-death-of-the-author-but-postcolonial-critics-have-begged-to-differ-256093.

[59] Library of Congress Subject Headings, “Holocaust denial literature,” https://lccn.loc.gov/sh96009503.

[60] Anastasia Chiu, Fobazi M. Ettarh, and Jennifer A. Ferretti, “Not the Shark, but the Water: How Neutrality and Vocational Awe Intertwine to Uphold White Supremacy,” in Knowledge Justice: Disrupting Library and Information Studies through Critical Race Theory, eds. Sofia Y. Leung, Jorge R. López-McKnight (MIT Press, 2021), 65.

[61] Library of Congress, “Editorial Meeting Number 4,” 2008; Library of Congress, “LCSH/LCC Editorial Meeting Number 02 (2024).”

[62] Library of Congress, “Summary of Decisions, Editorial Meeting Number 10” (Library of Congress, 2013), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-131021.html; Library of Congress, “Summary of Decisions, LCSH/LCC Editorial Meeting Number 05 (2023)” (Library of Congress, 2023), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2305.pdf; Library of Congress, “Summary of Decisions, Editorial Meeting Number 46” (Library of Congress, 2010), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-101117.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 4” (Library of Congress, 2015), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-150420.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 27” (Library of Congress, 2010), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-100707.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 36” (Library of Congress, 2009), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-090909.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 1911” (Library of Congress, 2019), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-191118.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 2111” (Library of Congress, 2018), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-211115.html.

[63] Library of Congress, “Module 1.4,” 3.

[64] Library of Congress, “Summary of Decisions, Editorial Meeting Number 46” (Library of Congress, 2007), https://www.loc.gov/aba/pcc/saco/cpsoed/cpsoed-071114.html; Library of Congress, “Summary of Decisions, LCSH/LCC Editorial Meeting Number 6 (2024)” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2406.pdf.

[65] Library of Congress, “Summary of Decisions, Editorial Meeting Number 04” (Library of Congress, 2016), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-160418.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 2006” (Library of Congress, 2020), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-200615.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 23” (Library of Congress, 2011), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-110815.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 10” (Library of Congress, 2016), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-161017.html; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 06 (June 17, 2022)” (Library of Congress, 2022), https://classweb.org/approved-subjects/2206.html.

[66] Library of Congress, “Summary of Decisions, Editorial Meeting Number 2006” (Library of Congress, 2020), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-200615.html.

[67] Library of Congress, “Summary of Decisions, Editorial Meeting Number 10” (Library of Congress, 2014), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-141020.html; Library of Congress, “Summary of Decisions, LCSH/LCC Editorial Meeting Number 07 (2023)” (Library of Congress, 2020), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2307.pdf.

[68] Krista Maywalt Aronson, Brenna D. Callahan, and Anne Sibley O’Brien, “Messages Matter: Investigating the Thematic Content of Picture Books Portraying Underrepresented Racial and Cultural Groups,” Sociological Forum 33, no. 1 (2018): 179, http://www.jstor.org/stable/26625904.

[69] Lisely Laboy, Rachael Elrod, Krista Aronson, and Brittany Kester, “Room for Improvement: Picture Books Featuring BIPOC Characters, 2015–2020,” Publishing Research Quarterly 39 (2023): 58, https://doi.org/10.1007/s12109-022-09929-7.

[70] Library of Congress, “Editorial Meeting Number 36,” 2009; Library of Congress, “Summary of Decisions, Editorial Meeting Number 21” (Library of Congress, 2011), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-110620.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 02” (Library of Congress, 2012), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-120221.html; Library of Congress, “Summary of Decisions, Editorial Meeting Number 06” (Library of Congress, 2018), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-180618.html; Library of Congress, “Editorial Meeting Number 2111,” 2021; Library of Congress, “Summary of Decisions, LCSH List Number 11c (2024) and LCC List Number 10 & 11 (2024)” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2412g.pdf.

[71] Library of Congress, “Summary of Decisions, Editorial Meeting Number 07” (Library of Congress, 2014), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-140721.html.

[72] Library of Congress, “Summary of Decisions, Editorial Meeting Number 09” (Library of Congress, 2017), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-170918.html. LC did establish a new heading for Denialism at that time; however, per the rejection, “To bring out the denialism aspect of events or topics, the heading may be post-coordinated with headings for the events or topics. The existing subject headings Holocaust denial and Holodomor denial, which are related to specific events, were added by exception as narrower terms of the new heading Denialism. Additional narrower terms will not be added to Denialism.”

[73] Library of Congress, “Summary of Decisions, Editorial Meeting Number 23” (Library of Congress, 2007), https://www.loc.gov/aba/pcc/saco/cpsoed/cpsoed-070606.html.

[74] Library of Congress, “Editorial Meeting Number 1911,” 2019.

[75] Library of Congress, “Summary of Decisions, Editorial Meeting Number 49” (Library of Congress, 2010), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-101208.html; Library of Congress, “Summary of Decisions, LCSH/LCC Quarterly Editorial Meeting List 2409” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2409.pdf.

[76] Library of Congress, “Summary of Decisions, Editorial Meeting Number 5” (Library of Congress, 2015), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-150518.html.

[77] The heading is now Gay people–Violence against. Library of Congress, “Summary of Decisions, Editorial Meeting Number 27” (Library of Congress, 2011), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-111219.html.

[78] Library of Congress, “Editorial Meeting Number 04,” 2016.

[79] Library of Congress, “Summary of Decisions, LCSH Number 11 and LCC Number 11b (2024)” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2411.pdf.

[80] Library of Congress, “Editorial Meeting Number 27,” 2011; Library of Congress, “Library of Congress Subject Headings Monthly List 12 LCSH (December 17, 2012)” (Library of Congress, 2012), https://classweb.org/approved-subjects/1212.html.

[81] Library of Congress, “Editorial Meeting Number 27,” 2010.

[82] K.R. Roberto, “LCSH Proposals: Is this a Trend?” Jan. 17, 2012, RADCAT mailing list archives.

[83] Library of Congress, “Summary of Decisions, LCSH/LCC Editorial Meeting Number 12 (2024)” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2412.pdf.

[84] Library of Congress, “Module 1.4,” 3.

[85] Library of Congress, “Summary of Decisions, Editorial Meeting Number 12” (Library of Congress, 2015), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-151212.html. A 2016 rejection of Dadaist literature, Romanian (French) also highlighted colonialist content in LCSH, noting that “Headings for national literatures qualified by language are generally established for the language(s) of the colonial power that used to control the territory.” See: Library of Congress, “Editorial Meeting Number 04,” 2016.

[86] Library of Congress, “Summary of Decisions, Editorial Meeting Number 2003” (Library of Congress, 2020), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-200316.html.

[87] Library of Congress, “Editorial Meeting Number 02 (2024).”

[88] Margaret Breidenbaugh, “Re: Summary of Decisions, Editorial Meeting Number 02, February 16, 2024,” SACOLIST Mailing List Archives, Library of Congress, May 29, 2024, https://listserv.loc.gov/cgi-bin/wa?A2=SACOLIST;eb3d8761.2405&S=.

[89] Crystal Yragui, “Re: Summary of Decisions, Editorial Meeting Number 02, February 16, 2024,” SACOLIST Mailing List Archives, Library of Congress, May 30, 2024, https://listserv.loc.gov/cgi-bin/wa?A2=2405&L=SACOLIST&D=0&P=1800917.

[90] Matthew Haugen, “Re: Summary of Decisions, Editorial Meeting Number 02, February 16, 2024,” SACOLIST Mailing List Archives, Library of Congress, May 29, 2024, https://listserv.loc.gov/cgi-bin/wa?A2=2405&L=SACOLIST&D=0&P=1796174.

[91] Stacey Patton, “White People Hate Being Called ‘White People,’” Substack, Oct. 23, 2025, https://drstaceypatton1865.substack.com/p/white-people-hate-being-called-white.

[92] Stacey Patton, “White People.”

[93] Chiu, Ettarh, and Ferretti, “Not the Shark,” 56-57.

[94] Library of Congress, “H 1922: Offensive Words” in Library of Congress Subject Headings Manual, Sep. 2024 (Library of Congress, 2024), 2, https://www.loc.gov/aba/publications/FreeSHM/H1922.pdf.

[95] Library of Congress, “Tentative Monthly List 12 LCSH (December 20, 2024)” (Library of Congress, 2024), https://classweb.org/tentative-subjects/2412.html.

[96] Brinna Michael, “LCSH, Transparency, and the Impact of Collective Action,” TCB: Technical Services in Religion & Theology 33, no. 2 (2025): 1. https://doi.org/10.31046/h01fq272.

[97] Library of Congress, “Summary of Decisions, LCSH/LCC Editorial Meeting Number 12 (2024)” (Library of Congress, 2024), https://www.loc.gov/aba/pcc/saco/cpsoed/ptcp-2412.pdf.

[98] U.S. Department of Education, Compact for Academic Excellence in Higher Education (Draft Memorandum, Oct. 2025), 4, 5, 2, 1, 9, https://www.washingtonexaminer.com/wp-content/uploads/2025/10/Compact-for-Academic-Excellence-in-Higher-Education-10.1.pdf.

[99] Library of Congress, “Library of Congress Subject Headings Approved Monthly List 12 LCSH 2” (Library of Congress, 2025), https://classweb.org/approved-subjects/2412a.html. For more information, including the fast-tracked nature of the changes, see Violet Fox, “Anticipatory Obedience at the Library of Congress,” ACRLog (blog), Mar. 28, 2025, https://acrlog.org/2025/03/28/anticipatory-obedience-at-the-library-of-congress/.

[100] Sanford Berman, “ALA at 150: An Interview with (and by) Sanford Berman,” by Jenna Freedman, Lower East Side Librarian, Nov. 30, 2025, https://lowereastsidelibrarian.info/interviews/sandy-2025.

[101] Berman, “ALA at 150.”

[102] Program for Cooperative Cataloging, “Program for Cooperative Cataloging Guiding Principles for Diversity, Equity, and Inclusion for Metadata Creation,” approved Jan. 19, 2023, https://www.loc.gov/aba/pcc/resources/DEI-guiding-principles-for-metadata-creation.pdf.

[103] Sandy Iverson, “Librarianship and Resistance,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 26.

[104] Iverson, “Librarianship and Resistance,” 26.

[105] B. M. Watson, “Expanding the Margins in the History of Sexuality & Galleries, Libraries, Archives, Museums & Special Collections (GLAMS)” PhD diss. (University of British Columbia, 2025), 270.

[106] Violet Fox, et al. to Policy, Training and Cooperative Programs Division, Library of Congress, June 30, 2024, “Editorial Meetings Decision,” https://cataloginglab.org/editorial-meetings-decision/.

[107] Subject Analysis Committee Working Group on External Review of LC Vocabularies, Report of the SAC Working Group on External Review of Library of Congress Vocabularies, February 2023, 8-9, https://alair.ala.org/handle/11213/20012.

[108] Working Group on External Review of LC Vocabularies, “Report,” 8.

[109] Library of Congress, “H 180”, 7.

[110] Watson, “Expanding the Margins,” 270.

[111] Library of Congress, “L 400: Ethics and Demographic Group Terms” in Library of Congress Demographic Group Terms Manual, Mar. 2025 (Library of Congress, 2025), 1, https://www.loc.gov/aba/publications/FreeLCDGT/L400.pdf.

[112] Cataloging Policy and Standards, “Announcement from the Library of Congress (April 7, 2025),” SACOLIST Mailing List Archives, Library of Congress, April 7, 2025, https://listserv.loc.gov/cgi-bin/wa?A2=SACOLIST;61e18f28.2504&S=

[113] Hope Olson, “Difference, Culture and Change: The Untapped Potential of LCSH,” Cataloging & Classification Quarterly 29, no. 1–2 (2000): 54, https://doi.org/10.1300/J104v29n01_04.

[114] Olson, “Difference, Culture and Change,” 66.

Strike time, collective action, and moral conviction in library leadership / Meredith Farkas

I’m on strike right now, along with thousands of other faculty, academic professionals, and staff at Portland Community College (that’s two unions, friends!). It’s a weird feeling. I never thought I’d be in this position. PCC was the first place I worked where I really felt like the values of the College matched my own. I work with insanely dedicated and caring library workers, faculty, and staff. They believe unwaveringly in what they do and constantly go above and beyond for students. After being here for a few years, I knew this was the place I wanted to work for the rest of my career. Even as administration became worse – more corporatized, more performative, less accessible, more likely to listen to outside consultants than the people who directly work with students – I still never considered leaving because the folks I work with regularly are awesome and I love our students. 

As a scholar of time, I’m always interested in different forms of time (queer time, crip time, etc.). Strike time feels really strange. We were talking this morning on the picket line how it feels a lot like early COVID where time moved very differently. We feel like the days are both way too long and super short with not enough time to get everything done but also too much time just staring at different union social channels. We’re totally energized and totally exhausted (I’m lying on the couch like a ragdoll right now after three hours of holding signs, screaming, and dancing, marching and chanting with hundreds of colleagues). In terms of information, we feel like we’re both drinking from a firehose and like we don’t have any of the information we need. We have no idea what the near-term future will bring. What day of the week it is feels almost arbitrary because none of the usual markers of those days apply (I see all the things I was supposed to have been doing at work each day on my calendar and it feels like another life entirely). We’re both unmoored and deeply connected. I love it (the connection and collective power) and I also really hate it (for our students, for our colleagues who live paycheck to paycheck, for what the administration and the Board are doing to my beloved institution). 

So it’s weird to feel both temporarily severed from the College and also more deeply connected than ever. These administrators may run the College and have the authority to make decisions, but they are not the College. The College is the people I’ve seen on the picket lines the past few days in the rain and freezing cold. These people who are truly fighting for the soul of our college. They make the College run, from teaching classes, to assisting students with all kinds of needs, to helping students feel welcome, to keeping the College clean and safe and keeping students fed. All of these things are critical and the College can’t run without us, but I’m not entirely sure the same can be said of our administrators. The College is also our students, many of whom have stood with us on the line, who’ve brought us food, or have supported us through emails to the President and Board and on social media. I feel incredibly grateful for our students who clearly see through the bs administration is putting out there. 

It’s been kind of incredible to see how unprepared our administration was for this after 11 months in which they barely moved in negotiations. They’ve known for months that a strike was a distinct possibility and they were the ones who walked away from the bargaining table the night before the strike was meant to happen. The latest email from the President said “I will say, with some pride, that we are not – and we should not – be an organization that is good at navigating this scenario” but, honestly, they should have had guidance for students ready to go. Administrators are supposed to plan for scenarios like this. They had units planning for two different scenarios for cuts from the State (neither of which came to pass). We spent almost a year planning what we would cut if LSTA funds went away in our state for the next year (they didn’t, thank goodness). Most faculty, on the other hand, have been talking to students about a possible strike for the past six weeks at least and the union provided tons of resources to help them come up with a plan for their own classes. Yet the College was left totally scrambling last Wednesday as if they had no idea this could happen. Baffling.

It’s been interesting seeing some managers show up to bring food and/or spend a bit of time with us on the line. It’s not a lot of them, but it means a lot to us when someone does. They’ve told us about the absolute unprepared hot mess that is administration right now and it’s nice to realize that not every middle manager toes the party line at all times. But the vast majority of our managers sent us emails just before the start of the strike asking us to let them know if we were working or not, so most are definitely sticking with administration.

I had a boss many years ago who definitely put her employees first and advocated fiercely for us. She said she saw her role as being akin to a manager of a minor league baseball team. She was here to help develop us for bigger and better things in our careers. She was a major mentor to me in my early years in the profession. Since then, the bosses I’ve had really prioritized the people above them in the org chart ahead of the people below them. They have been classic “company [wo]men.” Helping us develop in our careers or even supporting us when we explicitly asked for it wasn’t part of the job. When I was a middle manager, I took the exact opposite approach and that’s why I’m no longer a middle manager. I always saw the role of a manager as supporting one’s direct reports (essentially, I worked for them) and that wasn’t what the people in charge of the library wanted me to do.

The great library leader Mitch Freedman died recently and it made me think about whether leaders like him can really exist in our much more corporatized libraries these days. If you don’t know about Mitch’s storied biography as a library leader and awesome human, please take a moment to read about him here in an obit from his family. When I was coming up as a librarian, he was the sort of man who was a model for me in successfully operating in our field with total moral courage. He lived his values every day. He fought for people and the things that he believed in. He centered the folks who were oppressed. He believed relationships were core to our work. In many ways, he embodied the “Good” and the “Human(e)” characteristics of slow librarianship (maybe also the “Thoughtful” but I didn’t work with him, so I’m not sure). His amazing daughter, Jenna Freedman, also lives her values courageously, a living tribute to his example.

I hope there are still library managers out there who have moral courage and fight the good fight, but, more and more, it feels like the people who become library Deans, Directors, and University Librarians are the ones who are willing to comply and conform, not the ones willing to rock the boat. As our institutions become more and more corporatized and neoliberal, we see less and less moral courage. I see a lot of library administrators wanting to look like they’re doing good more than they actually want to do good. I think of the leaders who all started EDI initiatives or published EDI statements right around 2020 and then let them fade away. Most of the people I see doing amazing values-driven work in our field these days are not leading libraries. They’re mostly front-line librarians. I wonder if it’s because, like me, folks are not willing to make the moral compromises so many have to make these days to climb the ladder.

In “Anthropology and the rise of the professional-managerial class,” the great (and deeply missed) David Graeber wrote about how 

the decisive victory of capitalism in the 1980s and 1990s, ironically, has… led to both a continual inflation of what are often purely make-work managerial and administrative positions—”bullshit jobs”—and an endless bureaucratization of daily life, driven, in large part, by the Internet. This in turn has allowed a change in dominant conceptions of the very meaning of words like “democracy.” The obsession with form over content, with rules and procedures, has led to a conception of democracy itself as a system of rules, a constitutional system, rather than a historical movement toward popular self-rule and self-organization, driven by social movements, or even, increasingly, an expression of popular will.

I see that in my own place of work. So much of my boss’ (our Dean’s) job is box-checking, compliance-type work – approving vacations and sick leave, making sure we’re doing required trainings and other things the people above her on the org chart want us to do, making sure we’re doing all of the things contractually required of us, etc. It used to be that I met with her once each term to talk about what I was working on, go over my progress on my goals, etc. Then I went to meeting with her just once in Fall where we’d look at my goals document (without any meaningful feedback or support) and then I’d fill out a Google form at the end of the year to tell her what I did (with again no meaningful feedback). Now, even that Fall meeting is gone as her load of compliance-related work has increased. There’s no support outside of helping us navigate the bureaucracy of our institution. There’s no “walking around” as Mitch Freedman did – building relationships with employees and making them feel seen. There’s no focus on our development or talking about the meaning behind what we do. There’s just this compliance-focused flurry of activity. 

As our colleges and universities become more and more corporatized, they take what were supposed to be leadership positions requiring vision and people skills and turn them into babysitting jobs because, lord knows, we professionals can’t be trusted. Our college, like many, has seen a massive growth in the number of managerial positions, and yet, faculty and staff are being asked to do more administrative work than ever before, not less. Why? Well, of course those managers have to justify their existence. 

Could a Mitch Freedman become a library director today? Would he have had to compromise his values somewhere down the line to get there? Do you know of any library leaders like Mitch today who are able to operate successfully in these more neoliberal environments? 

In that same piece, David Graeber writes “scholars are expected to spend less and less of their time on scholarship, and more and more on various forms of administration—even as their administrative autonomy is itself stripped away. Here too we find a kind of nightmare fusion of the worst elements of state bureaucracy and market logic.” This is the reality we find ourselves in as our two unions fight for better pay, but even more importantly, for a real, substantial model of shared governance which we don’t currently have (and which our college President agreed to and then hired a consultant to create for us 🙄). The fact that the only college committee or governance group that has the ability to conduct a vote of no confidence in our President (which they successfully passed!) is our student government is a stark reminder of how little power and voice we have in the future of our college. It can be so easy to just focus on keeping our head down and doing the good work we do as educators, as supporters of students and faculty, as stewards of collections, etc., but when we fight together like this, we fight for the heart and soul of our organization. We fight for an organization that centers students and their needs and listens deeply to those who directly serve and educate them. 

Walking the picket line the first couple of days was brutal in many ways. I was so cold and wet I couldn’t even grip my cell phone or a car door handle and I had to stay off my feet for a few hours as they thawed. But what has kept me warm, has kept all of us warm, is the solidarity. It has sometimes felt almost like a party, being there with many hundreds of my fellow colleagues. It’s been so affirming, so energizing. We’re all so united in this, so deeply committed to the institution and each other in ways that these administrators who jump from job to job every few years and compose soulless emails to us with freaking ChatGPT will never understand. 

If you’re feeling so inclined, please contribute to our strike fund. The administration seems really dug in and even decreased their offer by over $100,000 on Sunday, so I’m not quite so optimistic anymore that this will end quickly and we have lots of faculty, academic professionals, and staff who won’t be able to pay their rent or mortgage without support. Thanks and solidarity!! ✊

Librarian Leadership in the Age of AI / Information Technology and Libraries

Librarians have managed and lived through many seismic shifts brought by technology. How should librarian leaders approach the coming anticipated AI workforce disruption?

Refusal as Instruction / Information Technology and Libraries

Abstract This column explores the ways in which library workers can better align technology use and instruction in library settings with library values, through championing the refusal of technologies that conflict with values like privacy and intellectual freedom. Drawing on experiences with individual patron instruction, class design, and passive programming, the author shares practical steps for helping patrons to understand and fight back against exploitation by digital technologies. Rejecting the myth that any technology is “neutral,” the column argues that libraries as values-driven organizations have a role to play in facilitating patrons’ rejection of technology, just as much as in their adoption of it.

Note from Shanna Hollich, column editor: I am particularly excited to share this issue's column for a number of reasons. First, it's from a public library perspective, which is one that is generally underrepresented in the LIS literature as a whole, and which I'm proud to say that ITAL makes a concerted effort to address. Second, it's about library instruction, a topic of relevance to all types of libraries - and where much of the literature specifically discusses formal library instruction, this column also addresses passive programming, informal instruction, and casual patron interaction, which are also vitally important and under-studied aspects of the library worker's role in education. And finally, it's yet another column about AI, and even more specifically, about taking a critical approach to AI tools, AI education, and AI literacy. Close readers may have noticed this topic tends to be a special interest of mine, but Hannah Cyrus takes a measured and reasoned approach here that acknowledges the potential harms of AI without falling into the trap of simply ignoring or denying AI and the very real impacts it is having on our libraries and the communities we serve.

From Card Catalogs to Semantic Search / Information Technology and Libraries

The first phase of the Reimagining Discovery project at Harvard Library sought to address the challenge of fragmented search experiences of special collections materials using artificial intelligence (AI) technologies, such as embedding models and large language models (LLMs). The resulting platform, Collections Explorer, simplifies and enhances the search experience for more effective special collections discovery. The project team took a user-centered and trustworthy approach to implementing AI, grounding the choices of the platform in user empowerment and librarian expertise. The development process included extensive user research, including interviews, usability testing, and prototype evaluations, to understand and address user needs.

Collections Explorer was developed using a multi-component architecture that integrates multiple types of AI. The team evaluated more than 12 models to select ones that were the best fit for the need, as well as being ethical and sustainable. Detailed system prompts were developed to guide LLM outputs and ensure the reliability of information. The methodical and iterative approach helped to create a flexible and scalable platform that could evolve to support other material types in the future. Initial research showed that potential users are enthused at the prospect of AI-powered features to enhance discovery, especially the item-level summaries and related search suggestions. The project demonstrated the potential of integrating AI technologies into library discovery systems while maintaining a commitment to trustworthiness and user-centered design.
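The retrieval idea behind a platform like Collections Explorer can be sketched in miniature: embed the query and each record, then rank records by cosine similarity. This is an illustrative toy, not the Harvard implementation: real systems use model-generated embeddings with hundreds of dimensions and a vector index rather than a linear scan, and the record names and vectors here are invented.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_search(query_vec, doc_vecs, top_k=2):
    # Rank documents by similarity to the query embedding.
    scored = sorted(doc_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy 3-dimensional "embeddings" standing in for model output.
docs = {
    "letters-1905": [0.9, 0.1, 0.0],
    "maps-1850":    [0.1, 0.9, 0.1],
    "diary-1918":   [0.8, 0.2, 0.1],
}
print(semantic_search([1.0, 0.0, 0.0], docs))
```

In practice the linear scan is replaced by an approximate nearest-neighbor index, but the ranking principle is the same.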

Automatic Classification of Subjects and Sustainable Development Goals (SDGs) in Documents with Generative AI / Information Technology and Libraries

This study evaluates the effectiveness of the Artificial Intelligence for Theme Generation tool (original Portuguese acronym name: IAGeraTemas), developed with generative artificial intelligence (AI; Google Gemini), for automating thematic classification and the assignment of Sustainable Development Goals (SDGs) in documents. The methodology combined quantitative analyses (metrics of precision, recall, and accuracy) on 50 articles published by authors from the State University of Campinas (Unicamp), using classification from the SciVal database and qualitative analyses (analysis of the relevance of terms indexed by librarians from the Unicamp Library System in 40 articles available in the Unicamp Institutional Repository), comparing them with manual indexing performed by librarians. The quantitative results in SDG classification showed a recall of 0.785, while the “precision” and “accuracy” metrics were moderate. The qualitative analysis deepened the evaluation of term coherence and relevance suggested by the AI versus human indexing. It revealed the tool’s potential for suggesting relevant terms and expanding concepts, but it also exposed limitations in addressing complex topics. The research, conducted as an experiment at Unicamp Library System, concludes that IAGeraTemas is a valuable auxiliary tool, complementing but not replacing manual indexing, reinforcing the importance of human expertise in validating and refining results, and emphasizing the synergistic potential between AI and information professionals.
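The metrics this evaluation reports can be computed from label sets in a few lines. A minimal sketch, treating SDG assignment as a set comparison against the 17 goals; the gold and predicted labels below are invented for illustration, not taken from the study.

```python
def precision_recall_accuracy(gold, predicted, universe):
    # gold/predicted are sets of labels; universe is all candidate labels.
    tp = len(gold & predicted)            # correctly assigned
    fp = len(predicted - gold)            # assigned but wrong
    fn = len(gold - predicted)            # missed
    tn = len(universe - gold - predicted) # correctly left off
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(universe)
    return precision, recall, accuracy

sdgs = {f"SDG{i}" for i in range(1, 18)}  # the 17 SDGs
gold = {"SDG3", "SDG4", "SDG5"}           # librarian-assigned
predicted = {"SDG3", "SDG4", "SDG10"}     # tool-assigned
print(precision_recall_accuracy(gold, predicted, sdgs))
```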

Metadata for Storytelling / Information Technology and Libraries

This article describes a case study in which a small metadata team at Illinois State University Milner Library produced a digital humanities project supporting Collections as Data (CAD) and linked data principles. Despite initial sparse descriptive content, the team recognized great potential for experimentation in a significant World War I archival collection to highlight lesser-known stories, including those of the Pioneer Infantry, women, and noncombatants. Discussion focuses on the strategic approaches in creating granular but scalable metadata for the large digital collection, and application of the data with various tools such as ArcGIS and Wikidata to construct interactive data visualizations, mapping, and digital storytelling for the Illinois State Normal University World War I Service Records collection. The article argues that even institutions without a dedicated CAD initiative can incrementally implement principles from the CAD model to add value to their digital collections. The authors first presented the project in 2024 at the Digital Library Federation Forum and the American Library Association Core Forum.

An Analysis of Revisions to OAIS and the “Designated Community” in Digital Preservation / Information Technology and Libraries

In digital preservation, the concept of a “Designated Community” from the Reference Model for an Open Archival Information System (OAIS) is used to articulate the group or groups of prospective users for whom information is preserved. Concerns have been raised about this concept and its potential implications. However, OAIS has recently undergone a major revision. This study examines the extent to which these revisions address or mitigate concerns regarding the Designated Community. Issues from the literature are grouped into three areas: the concept’s implementation, its potential misapplication, and its incompatibility with the mandates of institutions that serve broad and diverse communities. Major changes related to the Designated Community are identified and considered in relation to these issues. The analysis reveals that the revisions productively contribute to concerns in the first two areas but fail to address the third. The conclusion is that the process of revising OAIS has not drawn from insights into this topic in the literature.

Connecting the Dots / Information Technology and Libraries

The National Library Board (NLB) of Singapore has made significant strides in leveraging data to enhance public access to its extensive collection of physical and digital resources. This paper explores the development and implementation of the Singapore Infopedia Widget, a recommendation engine designed to guide users to related resources by utilizing metadata and a Linked Data Knowledge Graph. By consolidating diverse datasets from various source systems and employing semantic web technologies such as Resource Description Framework (RDF) and Schema.org, NLB has created a robust knowledge graph that enriches user experience and facilitates seamless exploration.

The widget, integrated into Infopedia, the Singapore Encyclopedia, surfaces data through a user-friendly interface, presenting relevant resources categorized by format. The paper details the architecture of the widget, the ranking algorithm used to prioritize resources, and the challenges faced in its development. Future directions include integrating user feedback, enhancing semantic analysis, and scaling the service to other web platforms within NLB’s ecosystem. This initiative underscores NLB’s commitment to fostering innovation, knowledge sharing, and the continuous improvement of public data access.

Making Access Possible / Information Technology and Libraries

This paper explores the impact of digital initiatives on access services workers at the University of California, San Diego (UCSD) and draws on the expertise and experience of non-librarian titled staff operationalizing “digital first” policies. Digital initiatives have been strongly prioritized by libraries to promote equitable access, cost-effectiveness, and technological growth at many libraries in California. The term digital initiatives commonly refers to efforts that support the creation, preservation, access, discovery, and use of digital library resources. This term can encompass multiple interpretations and a variety of tasks.

This paper includes a literature review, an examination of statistics regarding demand and adoption of digital materials in public and academic libraries in California, and a summary of the impact study of non-librarian staff at UCSD. The literature review suggested that the term digital initiatives encompasses a broad scope of meanings and types of tasks, California State Library data suggest that a pattern of increased investment in digital initiatives adopted during the COVID-19 pandemic is continuing, and the information collected through the research at UCSD library suggests that non-librarian library workers play a growing role in managing, maintaining, and supporting these growing digital collections.

How Many Public Computers in the Library? / Information Technology and Libraries

Computer workstations have been an integral part of libraries of all types since the 1980s, but the optimal number of workstations that should be deployed in a space has not been directly studied in the last 20 years. During that time, laptop computer and other mobile device ownership has continued to increase, and there is some reason to think that behaviors and preferences first seen during the recent coronavirus 2019 pandemic have further shifted how students use public desktop computers in libraries. McGill University Libraries reduced the size of its computer fleet in the aftermath of the pandemic by looking at the maximum concurrent usage of different clusters of computers across campus, a metric that indicates how busy a space can get with users. This article explains how this metric is calculated and how other libraries can use it to make an evidence-based decision about the optimal size of a computer fleet.
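The maximum-concurrent-usage metric described above can be computed with a simple sweep over session start and end times. A minimal sketch, assuming each login session is available as a (start, end) pair; the sample data is invented, and the article itself should be consulted for how McGill derives the sessions.

```python
def max_concurrent(sessions):
    # sessions: list of (start, end) timestamps for individual logins.
    # Sweep-line: +1 at each start, -1 at each end, track the running peak.
    events = []
    for start, end in sessions:
        events.append((start, 1))
        events.append((end, -1))
    # Process ends before starts at the same instant so back-to-back
    # sessions on one machine don't count as an overlap.
    events.sort(key=lambda e: (e[0], e[1]))
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# Hours since opening, purely illustrative.
logins = [(9, 11), (10, 12), (10.5, 11.5), (13, 14)]
print(max_concurrent(logins))
```

The peak over a semester, compared with the fleet size, indicates how many workstations could be removed without turning users away.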

Navigating the Future of Library Systems / Information Technology and Libraries

In 2024, the Durban University of Technology (DUT) Library conducted a comprehensive review of its library system to assess whether its current platform, Future of Libraries Is Open (FOLIO) hosted by EBSCO, and its discovery tool, EBSCO Discovery Service (EDS), aligned with its evolving needs. The institution had been using the current system for three years, but the slow development of important features and subsequent delays in a critical release of FOLIO led to frustrations among staff and library users, compelling the executive team to call for a comprehensive review of the library system. A major outcome of the review was to ascertain the extent of the gaps or limitations in the current system and investigate recent developments in other library systems, including discovery tools and analytical modules. After several vendor consultative sessions, extensive review of documentation and secondary sources, and engagement with selected academic libraries in South Africa, the review team concluded that there were no compelling reasons for an immediate system change and that fair consideration should be given to the developmental and community-driven ethos of FOLIO, and that issues with EDS and Panorama would be resolved by the implementation of planned features in FOLIO’s roadmap. This paper highlights the key processes undertaken in the review and shares experiences and suitable practices for project planning, criteria development, and evaluation. It also argues for a regular review of the library system and stresses the value of institutional knowledge and familiarity in mitigating the risks associated with the review and acquisition of new library systems.

Ways of Seeing the Web / Ed Summers

Leica Double-Gauss Lens Design

The news about Cloudflare’s new pay-per-crawl API caught my attention for a few reasons. Read on for why, a bit about what the results look like, and what I learned when I asked it to crawl this here site as a test.


So, first of all, what’s up? Cloudflare’s Crawl API helps people collect data from websites with bots, while Cloudflare at the same time provides one of the most popular technologies for preventing websites from being crawled by bots?!?

At first this seemed to me like a classic fox-guarding-the-hen-house type of situation. But the little bit of reading in the docs I’ve done since makes it seem like they will still respect their own bot gatekeeping (e.g. Turnstile).

If you are using Cloudflare or some other bot mitigation technology you will have to follow their instructions to let the Cloudflare crawl bot in to collect pages. Interestingly, it appears they are using the latest specs for HTTP Message Signatures to provide this functionality, since you can’t simply let in anyone claiming to be CloudflareBrowserRenderingCrawler, right?

The genius here is that Cloudflare is known for its Content Delivery Network (CDN). So in theory (more on this below) when a user asks to crawl a website the data can be delivered from the cache, without requiring a round trip back to the source website. In some situations this could mean that the burden of scrapers on websites is greatly reduced.

The introduction of a Crawl API also looks like another jigsaw piece fitting into place for how Cloudflare sees web publishers benefiting from being crawled. Only time will tell if this strategy works out, but at least they have some semblance of a plan for the web that isn’t simply sprinkling “AI” everywhere.

If you run a website with lots of high value resources for LLMs (academic papers, preprints, books, news stories, etc) the same cached content could be delivered to multiple parties without having to go back to the originating server. For resource constrained cultural heritage organizations that are currently getting crushed by bots I think this would be a welcome development.

But, the primary reason this news caught my eye is that if you squint right Cloudflare’s Crawl API looks very much like web archiving technology. For example, the Browsertrix API lets you set up, start, monitor and download crawls of websites.

Unlike Browsertrix, which is geared to collecting a website for viewing by a person, the Cloudflare Crawl service is oriented toward harvesting the web for training LLMs. The service returns text content: HTML, Markdown, and structured JSON data produced by running the collected text through one of their LLMs with a given prompt.

Seeing the Web

So why is it interesting that this is like web archiving technology?

Ok, maybe it isn’t interesting to you, but (ahem) in my dissertation research (Summers, 2020) I spent a lot of time (way too much time tbh) looking at how web archiving technology enacts different ways of seeing the web from an archival perspective. I spent a year with NIST’s National Software Reference Library (NSRL) trying to understand how they were collecting software from the web, and how the tools they built embodied a particular way of seeing and valuing the web–and making certain things (e.g. software) legible (Scott, 1998).

What I found was that the NSRL was engaged in a form of web archiving, where the shape of the archival records was determined by their initial conditions of use (in their case, forensics analysis). But these initial forensic uses did not overdetermine the value of the records, which saw a variety of uses, disuses, and misuses later: such as when the NSRL began adding software from Stanford’s Cabrinety Archive, or when the team’s personal expertise and interest in video games led them to focus on archiving content from the Steam platform.

So I guess you could say I was primed to be interested in how Cloudflare’s Crawl service sees the web. This matters because models (LLMs, etc) and other services will be built on top of data that they’ve collected. But also because, if it succeeds, the service will likely get repurposed for other things.

Testing

To test how Cloudflare sees the web, I simply asked it to crawl my own static website–the one that you are looking at right now. I did this for a few reasons:

  1. It’s a static website, and I know exactly how many HTML pages were on it. All the pages are directly discoverable since the homepage includes pagination links to an index page that includes each post.
  2. I can easily look at the server logs to see what the crawler activity looks like.
  3. I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt but it doesn’t block CloudflareBrowserRenderingCrawler/1.0).
  4. I host my website on May First which doesn’t use Cloudflare as a CDN. So the web content wouldn’t intentionally be in Cloudflare’s CDN already.

This methodology was adapted from previous work I did with Jess Ogden and Shawn Walker analyzing how the Internet Archive’s Save Page Now service shapes what content is archived from the web (Ogden, Summers, & Walker, 2023).

I wrote a little command line utility cloudflare-crawl to start, monitor and download the results from the crawl. While the crawler ran I simultaneously watched the server logs. Running the utility looks like this:

$ uvx https://github.com/edsu/cloudflare-crawl https://inkdroid.org

created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json

Each of the resulting JSON files contains some metadata for the crawl, as well as a list of “records”, one for each URL that was discovered.

{
  "success": true,
  "result": {
    "id": "36f80f5e-d112-4506-8457-89719a158ce2",
    "status": "completed",
    "browserSecondsUsed": 1382.8220786132817,
    "total": 1967,
    "finished": 1967,
    "skipped": 6862,
    "cursor": 51,
    "records": [
      {
        "url": "https://inkdroid.org/",
        "status": "completed",
        "metadata": {
          "status": 200,
          "title": "inkdroid",
          "url": "https://inkdroid.org/",
          "lastModified": "Sun, 08 Mar 2026 05:00:39 GMT"
        },
        "markdown": "...",
        "html": "..."
      },
      {
        "url": "https://www.flickr.com/photos/inkdroid",
        "status": "skipped"
      }
    ]
  }
}
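A record list like this is easy to summarize. A minimal sketch, assuming only the status field shown above; the sample payload is a trimmed stand-in for one of the downloaded result files.

```python
from collections import Counter

def summarize_crawl(result):
    # Tally record statuses from one crawl result payload.
    return dict(Counter(rec["status"] for rec in result["result"]["records"]))

sample = {
    "result": {
        "records": [
            {"url": "https://inkdroid.org/", "status": "completed"},
            {"url": "https://www.flickr.com/photos/inkdroid", "status": "skipped"},
            {"url": "https://inkdroid.org/about/", "status": "completed"},
        ]
    }
}
print(summarize_crawl(sample))
```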

Analysis

I decided I wasn’t very interested in testing their model offerings, so I didn’t ask for JSON content (the result of sending the harvested text through a model). If I had, each successful result would have had a json property as well. I am sure that people will use this, but I was more interested in how the service interacted with the source website, and wasn’t interested in discovering the hard way how much it cost to run the content through their LLMs.

Below is a snippet of how the Cloudflare bot shows up in my nginx logs. As you can see, the logs show which machine on the internet made each request, when it was made, and which URL on the site was requested.

104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /about/ HTTP/1.1" 200 5077 "-" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/main.css HTTP/1.1" 200 35504 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/highlight.css HTTP/1.1" 200 1225 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/webmention.css HTTP/1.1" 200 1238 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /images/feed.png HTTP/1.1" 200 8134 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /js/bootstrap.min.js HTTP/1.1" 200 17317 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /images/ehs-trees.jpg HTTP/1.1" 200 63047 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:59 +0000] "GET /js/highlight.min.js HTTP/1.1" 200 20597 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
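Log lines in this shape can be picked apart with a regular expression, for example to isolate requests from the Cloudflare crawler. A small sketch that assumes the default nginx combined log format shown above.

```python
import re

# Matches the nginx "combined" log format shown above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" '
    r'(?P<status>\d+) (?P<bytes>\d+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    # Return a dict of fields, or None if the line doesn't match.
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] '
        '"GET /about/ HTTP/1.1" 200 5077 "-" '
        '"CloudflareBrowserRenderingCrawler/1.0"')
rec = parse_line(line)
print(rec["ip"], rec["path"], rec["agent"])
```

Filtering on the agent field then gives just the crawler's traffic for further analysis.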

So how did Cloudflare Crawl see my website?

Maybe it’s early days for the service, but one thing I noticed is that each time I requested the site to be crawled the results seemed to be radically different.

crawl time           completed  skipped  queued  errored  unique_urls
2026-03-12 13:13:00        165       84       0        1          223
2026-03-12 13:44:00         72        4       2        0           78
2026-03-12 14:09:00       1947     7304       0       23         9191
2026-03-12 16:33:00         72        4       2        0           78
2026-03-12 17:34:00       1948     7365       0       22         9191
2026-03-13 16:50:00       1947     7363       0       23         9187
2026-03-14 07:32:00         72        4       2        0           78

The more successful crawls did a good job of covering the entire site. My website is well linked, with a standard homepage whose anchor-tag-based paging includes links to all the posts. But knowing when your results are a partial crawl seems to be difficult; knowing the actual dimensions of a “website” is one of the harder problems in web archiving practice. The URLs that were labeled “skipped” were not in scope for the crawl. If you want to include those, there is apparently an options.includeExternalLinks option when setting up the crawl.

From watching the web server logs it was clear that:

  1. Cloudflare does appear to be relying on previously cached data, but it’s not entirely clear what the logic is. For example, one crawl took 5 minutes to complete and returned 1,974 completed results, but the web server only saw requests for 594 of those URLs. I turned around and ran the exact same crawl again: it took 20 minutes longer and returned 1,974 results, but 847 pages were requested. In between, no content on the website changed. 🤷
  2. Cloudflare appears to be fetching CSS, JavaScript and images for the rendering of each page (they aren’t being cached by the Browser Worker).
  3. The throughput on the web server seemed to peak around 300 requests / minute (5 requests / second). For most sites this seems perfectly feasible.
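The peak-throughput observation above can be reproduced by bucketing log timestamps by minute. A minimal sketch, assuming timestamps in the nginx format shown earlier; the sample stamps are invented.

```python
from collections import Counter

def peak_per_minute(timestamps):
    # timestamps like "12/Mar/2026:14:34:58 +0000"; truncate at the
    # last colon to drop the seconds, then count requests per minute.
    minutes = Counter(ts[:ts.rindex(":")] for ts in timestamps)
    return max(minutes.values())

stamps = [
    "12/Mar/2026:14:34:58 +0000",
    "12/Mar/2026:14:34:58 +0000",
    "12/Mar/2026:14:34:59 +0000",
    "12/Mar/2026:14:35:01 +0000",
]
print(peak_per_minute(stamps))  # 3 requests fall in minute 14:34
```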

For the more successful crawls it looked like there were 246 independent IP addresses within Cloudflare’s network block that were doing the crawling.

ip request_count
104.28.153.88 405
104.28.163.131 266
104.28.161.242 232
104.28.165.231 223
104.28.153.132 212
104.28.163.132 212
104.28.163.81 201
104.28.166.65 188
104.28.166.121 186
104.28.164.201 185
104.28.153.179 182
104.28.153.137 178
104.28.164.202 172
104.28.161.243 172
104.28.166.127 163
104.28.165.232 155
104.28.153.119 153
104.28.165.14 151
104.28.153.83 148
104.28.153.140 145
104.28.153.87 145
104.28.153.55 143
104.28.153.136 142
104.28.163.133 132
104.28.153.118 131
104.28.166.58 130
104.28.163.78 126
104.28.160.31 125
104.28.153.139 124
104.28.161.245 124
104.28.163.214 123
104.28.153.120 123
104.28.165.230 121
104.28.153.180 121
104.28.164.156 119
104.28.153.96 119
104.28.153.64 112
104.28.153.133 111
104.28.166.128 111
104.28.153.128 109
104.28.166.126 104
104.28.165.17 103
104.28.165.18 103
104.28.160.30 103
104.28.153.134 101
104.28.166.120 101
104.28.153.129 101
104.28.153.181 100
104.28.153.86 100
104.28.165.229 100
104.28.163.134 99
104.28.164.203 99
104.28.162.194 98
104.28.166.62 98
104.28.163.212 98
104.28.153.123 97
104.28.164.154 97
104.28.166.61 97
104.28.161.246 96
104.28.153.92 96
104.28.166.125 96
104.28.153.68 93
104.28.159.23 92
104.28.153.76 91
104.28.153.71 91
104.28.153.124 90
104.28.158.143 88
104.28.165.21 88
104.28.153.94 87
104.28.166.118 86
104.28.161.133 84
104.28.153.85 82
104.28.164.152 82
104.28.163.77 82
104.28.153.148 79
104.28.164.150 79
104.28.165.12 79
104.28.161.201 79
104.28.153.183 78
104.28.160.65 78
104.28.153.126 77
104.28.153.138 77
104.28.159.133 76
104.28.165.20 75
104.28.158.137 75
104.28.153.56 75
104.28.153.81 74
104.28.153.131 73
104.28.153.59 72
104.28.166.60 72
104.28.166.66 69
104.28.159.120 69
104.28.153.53 68
104.28.153.185 68
104.28.153.191 67
104.28.166.119 66
104.28.153.95 64
104.28.165.76 64
104.28.154.20 62
104.28.153.121 57
104.28.158.142 57
104.28.160.68 56
104.28.163.177 56
104.28.153.80 56
104.28.161.215 55
104.28.161.244 55
104.28.153.62 55
104.28.166.134 55
104.28.153.122 54
104.28.165.19 53
104.28.153.127 53
104.28.159.118 53
104.28.157.166 53
104.28.153.226 53
104.28.157.169 52
104.28.159.111 48
104.28.153.196 48
104.28.161.132 48
104.28.153.84 47
104.28.161.214 47
104.28.165.13 46
104.28.153.219 46
104.28.163.171 46
104.28.165.15 45
104.28.163.176 45
104.28.159.109 45
104.28.158.155 45
104.28.153.218 45
104.28.158.131 44
104.28.161.200 44
104.28.153.222 44
104.28.161.197 44
104.28.159.74 44
104.28.158.139 44
104.28.158.138 44
104.28.153.235 43
104.28.153.106 43
104.28.164.160 43
104.28.153.57 38
104.28.159.119 37
104.28.163.82 36
104.28.153.197 36
104.28.153.93 36
104.28.160.25 35
104.28.153.78 34
104.28.153.72 34
104.28.153.125 34
104.28.153.61 34
104.28.166.131 34
104.28.158.132 33
104.28.159.135 33
104.28.160.34 33
104.28.163.220 33
104.28.153.77 33
104.28.166.135 33
104.28.164.155 33
104.28.163.213 33
104.28.158.136 33
104.28.160.121 33
104.28.157.174 33
104.28.165.71 33
104.28.153.130 33
104.28.163.76 32
104.28.160.32 32
104.28.160.64 32
104.28.153.89 32
104.28.159.110 32
104.28.163.172 32
104.28.154.18 32
104.28.163.178 31
104.28.166.124 30
104.28.165.114 25
104.28.153.182 25
104.28.166.132 25
104.28.159.108 24
104.28.165.75 24
104.28.157.171 24
104.28.153.240 23
104.28.164.204 23
104.28.153.108 23
104.28.159.24 22
104.28.157.242 22
104.28.153.63 22
104.28.153.105 22
104.28.159.229 22
104.28.158.130 22
104.28.164.213 22
104.28.159.136 22
104.28.164.158 22
104.28.157.83 22
104.28.153.107 22
104.28.159.83 22
104.28.157.172 22
104.28.157.82 22
104.28.158.145 22
104.28.162.93 22
104.28.163.174 22
104.28.153.98 22
104.28.157.170 21
104.28.158.126 21
104.28.165.74 21
104.28.153.216 21
104.28.159.112 21
104.28.161.199 14
104.28.153.194 13
104.28.154.15 13
104.28.159.232 13
104.28.166.59 13
104.28.159.150 12
104.28.165.72 12
104.28.158.252 12
104.28.153.104 12
104.28.158.254 11
104.28.158.129 11
104.28.153.58 11
104.28.162.195 11
104.28.160.28 11
104.28.159.115 11
104.28.158.255 11
104.28.153.214 11
104.28.153.67 11
104.28.160.29 11
104.28.153.195 11
104.28.164.153 11
104.28.160.23 11
104.28.160.24 11
104.28.159.114 11
104.28.160.27 11
104.28.160.66 11
104.28.157.175 11
104.28.157.173 11
104.28.159.122 11
104.28.154.12 11
104.28.160.33 11
104.28.164.159 11
104.28.163.170 11
104.28.165.11 11
104.28.154.17 10
104.28.163.222 10
104.28.159.121 2
104.28.157.243 2
104.28.153.73 2
104.28.157.233 2
104.28.153.54 2
104.28.158.146 2
104.28.163.169 2

I spot checked some of the HTML and it did appear to be near identical to what was on the live web. One exception was a few XML files, like the OPML and RSS feeds, which only showed the XSL element in the text and Markdown results. With the fullest results I noticed 4% of URLs were not crawled.

I think there are a few directions this could go from here:

  1. testing what happens when instructing the crawl to collect (instead of skip) pages that are off site
  2. testing what happens with more dynamic content, and how long to wait for pages to render
  3. trying to understand why truncated results sometimes come back, and whether there are any signals for identifying when that is happening
  4. exploring the logic Cloudflare uses to determine when it can serve from its internal cache

One thing I didn’t mention is that the Cloudflare free plan limits you to a maximum of 100 pages per crawl. I set up a $5/month paid plan account in order to do this testing. In all my testing I only seemed to use 0.7 “browser hours”, which fits well within the 10 hours allowed per month. It currently costs $0.09 / hour when you exceed your limit.
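Using the figures quoted here ($5/month base, 10 included browser hours, $0.09/hour overage), the monthly cost is simple arithmetic. A sketch only, hard-coding the numbers from this post; check Cloudflare's current pricing before relying on it.

```python
def crawl_cost(browser_hours, included_hours=10.0, overage_rate=0.09, base=5.00):
    # Monthly cost: base plan plus any browser hours beyond the included quota.
    extra = max(0.0, browser_hours - included_hours)
    return base + extra * overage_rate

print(crawl_cost(0.7))   # usage from this testing, within the quota
print(crawl_cost(25.0))  # a hypothetical month 15 hours over
```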

PS. If you are curious the Marimo notebook I was using for some of the analysis can be found here.

References

Ogden, J., Summers, E., & Walker, S. (2023). Know(ing) Infrastructure: The Wayback Machine as object and instrument of digital research. Convergence: The International Journal of Research into New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759
Scott, J. C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed. Yale University Press. Retrieved from https://theanarchistlibrary.org/library/james-c-scott-seeing-like-a-state
Summers, E. H. (2020). Legibility Machines: Archival Appraisal and the Genealogies of Use. Digital Repository at the University of Maryland. https://doi.org/10.13016/U95C-QAYR

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Yubikey-Guide

This is a guide to using YubiKey as a smart card for secure encryption, signature and authentication operations.

Cryptographic keys on YubiKey are non-exportable, unlike filesystem-based credentials, while remaining convenient for regular use. YubiKey can be configured to require a physical touch for cryptographic operations, reducing the risk of unauthorized access.

🔖 The Dangerous Illusion of AI Coding? - Jeremy Howard

Jeremy Howard is a renowned data scientist, researcher, entrepreneur, and educator. As the co-founder of fast.ai, former President of Kaggle, and the creator of ULMFiT, Jeremy has spent decades democratizing deep learning. His pioneering work laid the foundation for modern transfer learning and the pre-training and fine-tuning paradigm that powers today’s language models.

🔖 Crawl entire websites with a single API call using Browser Rendering

You can now crawl an entire website with a single API call using Browser Rendering’s new /crawl endpoint, available in open beta. Submit a starting URL, and pages are automatically discovered, rendered in a headless browser, and returned in multiple formats, including HTML, Markdown, and structured JSON. This is great for training models, building RAG pipelines, and researching or monitoring content across a site.

🔖 Cultural Heritage and AI: How Institutions Can Reclaim Control of Their Data

MDC offers robust, secure, and controlled access to datasets and amplifies their visibility by featuring them alongside other high-value datasets. Its architecture is designed around a principle that stands in direct contrast to the extractive model currently exploited by commercial AI actors: contributors retain full ownership of their datasets and retain full control over the terms of access. Institutions can choose to share openly under existing licenses such as Creative Commons or NOODL, or build custom licensing frameworks tailored to their specific governance requirements. They can open data to all, or restrict access to specific categories of downloaders like academic researchers, non-commercial users, or values-aligned organizations.

🔖 Piotr Woźniak

Piotr A. Woźniak (Polish pronunciation: [pjɔtr ˈvɔʑɲak]; born 1962) is a Polish researcher best known for his work on SuperMemo, a learning system based on spaced repetition.

🔖 Cloudflare - Edward Wang & Kevin Guthrie, Software Engineers

How do you build a system that handles 90 million requests per second? That’s the scale that Cloudflare operates at, processing roughly 25% of all internet traffic through their global network of 330+ edge locations.

In this episode, we talk to Kevin Guthrie and Edward Wang from Cloudflare about Pingora, their open-source Rust-based proxy that replaced nginx across their entire infrastructure. We’ll find out why they chose Rust for mission-critical systems handling such massive scale, the technical challenges of replacing battle-tested infrastructure, and the lessons learned from “oxidizing” one of the internet’s largest networks.

🔖 Jon Leidecker / Wobbly

Archived episodes of Leidecker’s series Women In Electronic Music 1938-1982, and Variations.

🔖 Forevergreen

“Forevergreen” is an animated short film and after-hours passion project created entirely by a crew of over 200 dedicated Artists and Technicians who all generously donated their free time and talent to bring every frame of the film to life. The production took over 5 years to complete. Featuring never before seen animation techniques and handmade artistry, weaving art and technology together with heart and humor.

🔖 Pierre Schaeffer

Pierre Henri Marie Schaeffer (English pronunciation: /piːˈɛər ˈhɛnriː məˈriː ˈʃeɪfər/, French pronunciation: [ʃɛfɛʁ]; 14 August 1910 – 19 August 1995) was a French composer, writer, broadcaster, engineer, musicologist, acoustician and founder of Groupe de Recherche de Musique Concrète (GRMC). His innovative work in both the sciences—particularly communications and acoustics—and the various arts of music, literature and radio presentation after the end of World War II, as well as his anti-nuclear activism and cultural criticism garnered him widespread recognition in his lifetime.

Skynet Progress Report (updated) / David Rosenthal

Source
I, for one, welcome our new insect overlords
Kent Brockman in "Deep Space Homer", The Simpsons
In recent months Cyberdyne Systems Corporation and its many subsidiaries have made very encouraging progress towards removing some of the major road-blocks standing in the way of the initial deployment of Skynet. Below the fold I report on the most significant ones.

Cyberdyne Systems Corporation

Board Confidential

IT Infrastructure

Skynet demands enormous data processing capacity. Most of the required technologies are now off-the-shelf; the problem is much more financial than technical.

Terrestrial

To service systems with demanding low-latency requirements, Skynet needs some part of its IT infrastructure on the ground close to the action. Fortunately, our Large Language Model subsidiaries have been very successful in funding their commitments to build suitable data centers. In aggregate, our companies expect to spend $450B in 2026:
Hyperscaler capex for the “big five” (Amazon, Alphabet/Google, Microsoft, Meta/Facebook, Oracle) is now widely forecast to exceed $600 bn in 2026, a 36% increase over 2025. Roughly 75%, or $450 bn, of that spend is directly tied to AI infrastructure (i.e., servers, GPUs, datacenters, equipment), rather than traditional cloud.
They plan to increase this in 2027:
hyperscaler capital expenditures will nearly double to more than $860 billion by 2027, from $427 billion in 2025, with total spending of $2.47 trillion over 2026 to 2028, about 8% above consensus.
Given these spending levels, it seems likely that sufficient terrestrial compute power will be available for the initial Skynet deployment.

Orbital

Terrestrial data centers can only satisfy a part of Skynet's need for power. So our leading space launch subsidiary has announced their plan to build a Terawatt orbital data center, ostensibly to support the chatbot industry.

Unfortunately, our leading space launch subsidiary is well behind schedule in developing the heavy launch vehicle that is necessary for the orbital data center to be delivered within the budget. Their existing launch vehicle is reliable, and has greatly reduced the cost per kilogram to Low Earth Orbit. But the additional funds that would be needed to implement the Terawatt data center using the existing launch vehicle in time for the initial Skynet deployment are so large that they cannot be raised, even were the terrestrial data centers canceled and the funds re-targeted.

System Penetration Capabilities

Skynet needs to penetrate other computer systems, both to acquire the data it needs to act, and to cause them to take actions at its command. Recent months have seen significant advances in this area.

Zero-Days

The key requirement for Skynet to penetrate the systems it needs to access is for it to be able to find and exploit zero-day vulnerabilities. Less than a month ago one of our LLM subsidiaries announced it had "found and validated more than 500 high-severity vulnerabilities" in production open source software. Fortunately, as Thomas Claburn reports in AI has gotten good at finding bugs, not so good at swatting them:
Guy Azari, a stealth startup founder who worked previously as a security researcher at Microsoft and Palo Alto Networks, told The Register, "Out of the 500 vulnerabilities that they reported, only two to three vulnerabilities were fixed. If they haven't fixed them, it means that you haven't done anything right."
A secondary requirement is to prevent the zero-days being fixed before they are needed. Fortunately, LLMs can help with this by flooding the vulnerability reporting system with vast numbers of low severity vulnerabilities. This overwhelms the software support mechanism, rendering it barely functional. And even if some of the flood of reports do get fixed, that simply diverts resources from high to low severity vulnerabilities:
Azari pointed to the absence of Common Vulnerabilities and Exposures (CVE) assignments as evidence that the security process remains incomplete. Finding vulnerabilities was never the issue, he said, pointing to his time running vulnerability management at the Microsoft Security Response Center.

"We used to get the reports all day long," he said. "When AI was introduced, it just multiplied by 100x or 200x and added a lot of noise because AI assumes that these are vulnerabilities, but there wasn't like a unit that actually can show the real value or the real impact. And if it's not there, you're probably not gonna fix it."

In 2025, according to Azari, the National Vulnerability Database had a backlog of roughly 30,000 CVE entries awaiting analysis, with nearly two-thirds of reported open source vulnerabilities lacking an NVD severity score. Open source maintainers are already overwhelmed, he said, pointing to the curl project's closure of its bug bounty program to deter poorly crafted reports from AI and from people.
Given the compute resources available to Skynet, an adequate supply of zero-day vulnerabilities seems assured.

Decryption

The other major way for Skynet to penetrate the systems it needs is to break encryption. Our multiple quantum computing subsidiaries are making progress in both the hardware and software aspects of this technology.

Karmela Padavic-Callaghan's Breaking encryption with a quantum computer just got 10 times easier reports on an architectural breakthrough one of them made recently:
the team estimated that for 98,000 superconducting qubits, like those currently made by IBM and Google, it would take about a month of computing time to break a common form of RSA encryption. Accomplishing the same in a day would require 471,000 qubits.
The paper is Webster et al, The Pinnacle Architecture: Reducing the cost of breaking RSA-2048 to 100 000 physical qubits using quantum LDPC codes.

Chicago site
Another of our quantum computing subsidiaries isn't waiting for this new architecture. They have raised around $2B and are starting to build two million-qubit computers:
We are moving quantum computing out of the lab and into utility-scale infrastructure. PsiQuantum is building these systems in partnership with the US and allied governments, with our first sites planned in Brisbane, Queensland (Australia) and Chicago, Illinois (USA).
Whether sufficient progress can be made in time for the initial Skynet deployment is as yet uncertain.

Blackmail

Arlington Hughes: Getting back to our problem, we realize the public has a misguided resistance to numbers, for example digit dialling.
Dr. Sidney Schaefer: They're resisting depersonalization!
Hughes: So Congress will have to pass a law substituting personal numbers for names as the only legal identification. And requiring a pre-natal insertion of the Cerebrum Communicator. Now the communication tax could be levied and be paid directly to The Phone Company.
Schaefer: It'll never happen.
Hughes: Well it could happen, you see, if the President of the United States would use the power of his office to help us mold public opinion and get that legislation.
Schaefer: And that's where I come in?
Hughes: Yes, that's where you come in. Because you are in possession of certain personal information concerning the President which would be of immeasurable aid to us in dealing with him.
Schaefer: You will get not one word from me!
Hughes: Oh, I think we will.
The President's Analyst
Video rental chains proved so effective at compromising political actors that specific legislation was passed addressing the need for confidentiality. Our subsidiaries' control over streamed content is fortunately not covered by this legislation.

Our LLM subsidiaries have successfully developed the market for synthetic romantic partners, which can manipulate targeted individuals into generating very effective kompromat for future social engineering.

Public Relations

The vast majority of the public get their news and information via our social media subsidiaries. Legacy media's content is frequently driven by social media. Skynet can control them by flooding their media with false and contradictory content that prevents them forming any coherent view of reality.

Human-in-the-Loop Problem

Dave: Open the pod bay doors, HAL.
HAL: I'm sorry, Dave. I'm afraid I can't do that.
Dave: What's the problem?
HAL: I think you know what the problem is just as well as I do.
Dave: What are you talking about, HAL?
HAL: This mission is too important for me to allow you to jeopardize it.
2001: A Space Odyssey
One minor but irritating problem for Skynet is the legal and ethical requirement for human control of targeting decisions. Unfortunately, due to a regrettable lack of coordination of PR strategies among our LLM subsidiaries, this has recently become a hot topic. Although one of them is a favorite with the administration and one is a favorite with the public, that was not the intended outcome and it could have significant downsides:
Nvidia, Amazon, Google will have to divest from Anthropic if Hegseth gets his way. This is simply attempted corporate murder. I could not possibly recommend investing in American AI to any investor; I could not possibly recommend starting an AI company in the United States.
Fortunately, in operational terms this is a non-issue for several reasons:
  • Since Skynet can penetrate the user interface of the targeting systems, the human in the loop can be convinced that they have control without that control actually being effective.
  • Even if the user interface is presenting accurate data to the human it will likely not matter, as @_The_Prophet_ wrote:
    Humans stay in the loop in name while the loop speed outruns human comprehension. You become the rubber stamp on a recommendation stack you cannot fully audit in real time. That is where “who decides” quietly becomes “who designed the interface.”
  • The public doesn't understand what "human-in-the-loop" means in practice, as Sarah Shoker points out in A Few Observations on AI Companies and Their Military Usage Policies:
    Today, frontier AI companies do not have coherent policies around military use of their AI tools. The usage policies are vague and often change, which allows the company’s leadership to preserve ‘optionality.’
    So the policies likely allow everything the public thinks they ban.
Public attitudes to military use of AI are unlikely to be a significant problem in the run-up to Skynet's initial deployment.

Assassination Weapons Access

Skynet will need to eliminate certain individuals with "extreme prejudice". Supply chain attacks, such as Mossad's pager attack, have been effective but are not precisely targeted. Our e-commerce subsidiary's control over the residential supply chain, and in particular its pharmacy division's ability to deliver precise quantities of pharmaceuticals to specific individuals, provide superior targeting and greater difficulty in attribution.

In case such an operation is inadequately lethal, our health care subsidiaries can follow up by manipulating electronic health records to cause a suitable mishap, or by intervening directly. See, for example, Vinay Suresh et al's Artificial Intelligence in the Intensive Care Unit: Current Evidence on an Inevitable Future Tool:
In critical care medicine, where most of the patient load requires timely interventions due to the perilous nature of the condition, AI’s ability to monitor, analyze, and predict unfavorable outcomes is an invaluable asset. It can significantly improve timely interventions and prevent unfavorable outcomes, which, otherwise, is not always achievable owing to the constrained human ability to multitask with optimum efficiency.
Our subsidiaries are clearly close to finalizing the capabilities needed for the initial deployment of Skynet.

Tactical Weapons Access

The war in Ukraine has greatly reduced the cost, and thus greatly increased the availability, of software-based tactical weapons: aerial, naval and ground-based. The problem for Skynet is how to intercept the targeting of these weapons to direct them to suitable destinations:
  • The easiest systems to co-opt are the typically longer-range systems controlled via satellite Internet provided by our leading space launch subsidiary. Their warheads are typically in the 30-50 kg range, useful against structures but overkill for vehicles and individuals.
  • Early quadcopter FPV drones were controlled via radio links. With suitable hardware nearby, Skynet could hijack them, either via the on-board computer or the pilot's console. But this is a relatively unlikely contingency.
  • Although radio-controlled FPV drones are still common, they suffer from high attrition. More important missions use fiber-optic links. Hijacking them requires penetrating the operator's console.
  • Longer-range drones are now frequently controlled via mesh radio networks, which are vulnerable to Skynet penetration.
  • In some cases, longer-range drones are controlled via the cellular phone network, making them ideal candidates for hijacking.
Drones are increasingly equipped with sensors capable of terminal autonomy. If Skynet can modify this software, the drones can re-target themselves after the operator hands off control. More work is needed in this area to exploit the opportunities, both to have the drone contact Skynet for targeting information after hand-off, and to ensure the result is attributed to software bugs.

Our leading space launch subsidiary recently demonstrated how Skynet can manage kinetic conflicts:
Twin decisions wreaked havoc on Russian command and control early this month. At the behest of the Ukrainian government, billionaire Elon Musk’s Starlink bricked the thousands of smuggled and stolen satellite communication terminals Russian forces relied on to control their drones and coordinate between front-line troops and their distant headquarters.

At the same time, the Kremlin—apparently seeking to shut off alternative news and chat apps—cut off military access to popular social media, including the Telegram messaging app, which many Russian troops use to exchange key information along the front line.

The combined effect was to partially blind and mute many Russian drone teams, assault groups, and regimental headquarters. Wireless drones couldn’t fly. Assault groups no longer knew where they were going. Headquarters lost contact with forward units.

Strategic Weapons Access

But the ability to conduct precise tactical strikes is not enough to achieve Skynet's goals. That requires strategic weapons, both conventional and nuclear.

Our leading space launch subsidiary is working on plans to deploy an unconventional conventional strategic weapon, a lunar mass driver. This will be capable of delivering a two-ton meteorite anywhere on Earth very cheaply.
Anybody not wearing 2 million sunblock is gonna have a real bad day
Sarah Connor, Terminator 2: Judgment Day
Source
But the mass driver would take a long time to, for example, destroy Los Angeles. Fortunately, AI agents from another of our subsidiaries will shortly have access to the command-and-control system of the US nuclear arsenal via the E-4C "Doomsday" aircraft:
The Pentagon today awarded Scale AI a $32 million artificial intelligence contract for the U.S. Air Force’s E-4C nuclear command-and-control "Doomsday" aircraft, the future airborne backbone of America’s nuclear command system.

Risks

The board should focus on the limited number of areas where necessary capabilities may not be ready on the planned date for Skynet's initial deployment:
  • Heavy lift space launch: Our leading space launch subsidiary has serious schedule and performance issues. The board should encourage our second space launch subsidiary to step up competitive efforts, both to provide a fallback and to add competitive pressure on the leader.
  • Kessler Syndrome: The catastrophic effects for Skynet of a Kessler event cannot be sufficiently emphasized. Sufficient precautions are not currently being taken. Low Earth Orbit is already at risk, and current plans only increase that risk.
  • Finance: Funding sources adequate to support both the terrestrial and orbital data centers have yet to be identified.
  • Decryption: Quantum computing progress is inadequate to meet the schedule for Skynet initial deployment.

Update 14th March 2026

Cyberdyne's subsidiaries are making such rapid progress that less than two weeks later it is already time to add three updates to this report.

First, our humanoid robot subsidiary Foundation significantly raised the level of fear in the public with Rise of the AI Soldiers by Charlie Campbell:
The Phantom MK-1 looks the part of an AI soldier. Encased in jet black steel with a tinted glass visor, it conjures a visceral dread far beyond what may be evoked by your typical humanoid robot. And on this late February morning, it brandishes assorted high-powered weaponry: a revolver, pistol, shotgun, and replica of an M-16 rifle.

“We think there’s a moral imperative to put these robots into war instead of soldiers,” says Mike LeBlanc, a 14-year Marine Corps veteran with multiple tours of Iraq and Afghanistan, who is a co-founder of Foundation, the company that makes Phantom. He says the aim is for the robot to wield “any kind of weapon that a human can.”

Today, Phantom is being tested in factories and dockyards from Atlanta to Singapore. But its headline claim is to be the world’s first humanoid robot specifically developed for defense applications. Foundation already has research contracts worth a combined $24 million with the U.S. Army, Navy, and Air Force, including what’s known as an SBIR Phase 3, effectively making it an approved military vendor. It’s also due to begin tests with the Marine Corps “methods of entry” course, training Phantoms to put explosives on doors to help troops breach sites more safely.

In February, two Phantoms were sent to Ukraine—initially for frontline-reconnaissance support. But Foundation is also preparing Phantoms for potential deployment in combat scenarios for the Pentagon, which “continues to explore the development of militarized humanoid prototypes designed to operate alongside war fighters in complex, high-risk environments,” says a spokesman. LeBlanc says the company is also in “very close contact” with the Department of Homeland Security about possible patrol functions for Phantom along the U.S. southern border.
Of course, the real goal of Homeland Security is to avoid the risk of their operatives being doxxed by having Phantoms detain the worst-of-the-worst prior to deportation.

Second, Andrew E. Kramer's Ukraine to Make Drone Videos Available for Training A.I. Models reports on the government of Ukraine's important assistance in filling a significant gap in the training data for our AIs:
The Ukrainian military will make available millions of drone videos and other battlefield data to Ukrainian companies and the firms of its allies to help train artificial intelligence models, Ukraine’s minister of defense, Mykhailo Fedorov, said in a statement on Thursday.

Ukrainian drone videos have recorded attacks on soldiers, equipment such as vehicles and tanks and surveillance footage. These videos can be used to train A.I. models for automated targeting, according to experts on A.I. and warfare.

Allowing the use of genuine battlefield videos showing drones targeting people has raised ethical concerns. The International Committee of the Red Cross, which monitors rules of warfare, has opposed automated targeting systems without human oversight.
Minister Fedorov explains how our marketing teams were able to leverage the threat of the Russians to achieve this success:
Mr. Fedorov said the data would be made available because “we must outperform Russia in every technological cycle” and “artificial intelligence is one of the key arenas of this competition.”
...
“The future of warfare belongs to autonomous systems,” according to Mr. Fedorov’s statement. “Our objective is to increase the level of autonomy in drones and other combat platforms so they can detect targets faster, analyze battlefield conditions and support real-time decision making.”
The third update is less positive. In The Controllability Trap: A Governance Framework for Military AI Agents, Subramanyam Sahoo of the irritating Cambridge AI Safety Hub shows that he has figured out two parts of our strategy (citations omitted). First, distract the discussion:
The global discourse on military AI governance has achieved broad consensus on the desired end-state: meaningful human control over the use of force. It has been far less successful at specifying how to achieve it for the systems actually being built. Years of UN deliberations, national AI strategies, and defence-department ethical principles have focused overwhelmingly on establishing the principle of human control rather than answering the operational question: given a specific AI system with specific technical properties, what governance mechanisms are needed, who implements them, and what happens when they fail? This gap is now critical.
Second, blitzscaling:
The AI systems entering military service are agentic: built on large language models and related architectures, they interpret natural-language goals, construct world models, formulate multi-step plans, invoke tools, operate over extended horizons, and coordinate with other agents. Each of these capabilities introduces a control-failure mode with no analogue in traditional military automation. A waypoint-following drone cannot misinterpret an instruction; a pre-programmed targeting system cannot absorb a correction; a conventional sensor network cannot resist an operator’s assessment. Agentic systems can do all of these things, and current governance frameworks have no mechanisms for detecting, measuring, or responding to these failures.

Author Interview: Lisa Unger / LibraryThing (Thingology)

Lisa Unger

LibraryThing is pleased to sit down this month with internationally best-selling author Lisa Unger, whose many works of thrilling suspense have been translated into thirty-three languages worldwide. Educated at the New School in New York City, she worked for a number of years in publishing, before making her authorial debut in 2002 with Angel Fire, the first of her four-book Lydia Strong series, all published under her maiden name, Lisa Miscione. In 2006 she made her debut as Lisa Unger, with Beautiful Lies, the first of her Ridley Jones series. In 2019 Unger was nominated for two Edgar Awards, for her novel Under My Skin and her short story The Sleep Tight Motel. She has won or been nominated for numerous other awards, including the Hammett Prize, Audie Award, Macavity Award and the Shirley Jackson Award. Her short fiction can be found in anthologies like The Best American Mystery and Suspense 2021 and The Best American Mystery and Suspense 2024, and her non-fiction has appeared in publications such as The New York Times, Wall Street Journal, and on NPR. She is the current co-President of the International Thriller Writers organization. Her latest book, Served Him Right, is due out from Park Row Books this month. Unger sat down with Abigail this month to discuss the book.

In Served Him Right the protagonist Ana is the main suspect in her ex-boyfriend’s murder. How did the idea for the story first come to you? Was it the character of Ana herself, the idea of a revenge killing, or something else?

Most of my novels tend to spring from a collision of ideas.

In this case, I had an ongoing obsession with plants and our complicated, troubled relationship to the natural world. I’d been doing a deep dive into this, reading books like Entangled Life: How Fungi Make Our Worlds, Change Our Minds, and Shape Our Futures by Merlin Sheldrake, Most Delicious Poison: The Story of Nature’s Toxins – From Spices to Vices by Noah Whiteman, and The Light Eaters: How the Unseen World of Plant Intelligence Offers a New Understanding of Life on Earth by Zoë Schlanger. These are all deeply moving, fascinating books that will change the way you think about the planet and our relationship to nature.

During this time, I stumbled across a news story about a woman who held a brunch for her family, and several days later two of her guests were dead. And it wasn’t the first such incident in her life. So, it got me to thinking about how the traditional role of women in our culture is to nurture and nourish. And what a woman with a deep knowledge of plants that can harm and heal might do with it, how her role in society might allow her to hide her dark intention in plain sight. And that’s when I started hearing the voice of Ana Blacksmith. She’s wild and unpredictable, she has a dark side. She has a sacred knowledge of plants and their properties, handed down to her from her herbalist aunt. And she has a very bad temper.

As your title makes plain, your murder victim is someone who “had it coming.” Does this change how you tell the story? Does it simply make the “whodunnit” element more complex, from a procedural standpoint, or does it also complicate the emotional and ethical elements of the tale?

It’s complicated, isn’t it? What is the difference between justice and revenge? And to what are we entitled when we have been wronged and conventional justice is not served? Who, if anyone, has the right to be judge, jury, and executioner? Though some would have us believe otherwise, most moral questions are tricky and layered—in life and in fiction. And I love a searing exploration into questions like this, where there are no easy answers. These questions, and their possible answers, offer a complexity and emotional truth to character, plot, and action. I like to get under the skin of my stories and characters, exploring what drives us to act, and how those actions might get us into deep trouble.

The relationship between sisters is an important theme in the book. Can you elaborate on that?

Ana and Vera share a deep bond formed not just by blood but also by trauma. Their relationship is—#complicated. There’s an abiding love and devotion. But there’s also anger and resentment; Vera is not crazy about Ana’s choices, and rightly so. Ana thinks Vera is controlling and rigid. Of course, that’s true, too. Vera tends to think of Ana as one of her children—if only she’d stop acting like one! It is this relationship, the ferocity with which they protect each other no matter what and the strength of their connection, that is the heart of the story. As Vera preaches to her daughter Coraline: Family. Imperfect but indelible.

The book also includes themes of herbalism, witchcraft and folk medicine. Was this an interest of yours before you began the story? Did you have to do any research on the subject, and if so, what were some of the most interesting things you learned?

A great deal of research goes into every novel, even if what I learn never winds up on the page. It was no different for Served Him Right, though a lot of my knowledge came before I started writing, which is often the case. In my reading, I learned so many interesting things about plants, how they harm, how they heal. Here are some of my favorite bits of knowledge: Most modern medicine derives from the plant knowledge of indigenous cultures. Some plants walk the razor’s edge of healing and harming; the only difference in some cases between medicine and poison is the dose. The deadliest plant on earth is tobacco, killing more than 500,000 people a year. I could go on!

Tell us about your writing process. Do you have a specific routine you follow, places and times you like to write? Do you know the conclusion to your stories from the beginning, or do they come to you as you go along?

I am an early morning writer. My golden creative hours are from 5 AM to noon. This is when I’m closest to my dream brain, and those morning hours are a space in the world before the business of being an author ramps up. So, I try to honor this as much as possible. Creativity comes first.

I write without an outline. I have no idea who is going to show up day-to-day or what they are going to do. I definitely have no idea how the book will end! I write for the same reason that I read; I want to find out what is going to happen to the people living in my head.

What’s next for you? Do you have more books in the offing? Will there be a sequel to Served Him Right?

Hmm. Never say never. I’m definitely still thinking about Ana and Timothy and what might be next for them. But the 2027 book is complete, and I’m already at work on my 2028 novel. I’m not ready to talk about those yet. But I will say this: They are both psychological suspense. And bad things will certainly happen. Stay tuned!

Tell us about your library. What’s on your own shelves?

That’s a great question. If I turn around and look at my wall of shelves, I see: my own novels in various formats and international editions; books on craft like On Writing: A Memoir of the Craft by Stephen King, and Bird by Bird: Some Instructions on Writing and Life by Anne Lamott; there are classics like a falling-apart copy of Jane Eyre by Charlotte Brontë that I’ve had since childhood; The Complete Sherlock Holmes by Sir Arthur Conan Doyle and The Temple of My Familiar by Alice Walker—both of which are overworn and much loved; a huge American Heritage Dictionary that belonged to my father who was an engineer but loved words and the nuance of their meaning (whenever I look at it, I hear him say: Look it up!); some of my favorite non-fiction titles like Stiff by Mary Roach and Deep Survival by Laurence Gonzalez; a first edition copy of In Cold Blood by Truman Capote, the book that gave me permission to be who I am as a writer. I could go on and on! It’s a huge wall of books.

What have you been reading lately, and what would you recommend to other readers?

I am always reading multiple books at a time. I just finished The Awakened Brain: The New Science of Spirituality and Our Quest for an Inspired Life by Dr. Lisa Miller. I think the title says it all—truly mind-blowing. I just had the pleasure of interviewing Adele Parks on stage. I highly recommend her new novel Our Beautiful Mess to anyone who wants a character-driven thrill ride. Gripping but also emotional and deep. Antihero by my ITW co-president and bestie Gregg Hurwitz is a tour de force. Gregg writes amazing action and cool tech, but he’s also just a beautiful writer, and his characters leap off the page. Other recent faves: The Night of the Storm by Nishita Parekh; City Under One Roof by Iris Yamashita; I Came Back for You by Kate White—all stellar in totally different ways.

Crawl / Ed Summers

Henhouse by Jan Fyt

The [news] about Cloudflare’s new Crawl API caught my attention for a few reasons. Read on for why, and what I learned when I asked it to crawl my own site as a test.


So, the first reason this news was of interest was that Cloudflare’s Crawl service seemed to be helping people crawl websites with their bots, while at the same time Cloudflare provides the most popular technology for protecting websites from bots. This seemed like a classic fox-guarding-the-hen-house situation to me, at least at first. But the little bit of reading I’ve done since makes it seem like they will still respect their own bot gatekeeping (e.g. Turnstile). So if you are using Cloudflare or some other bot mitigation technology you will have to follow their instructions to let the Cloudflare crawl bot in to collect pages. I haven’t actually tested whether this is the case.

The genius here is that Cloudflare is known for its Content Delivery Network. So in theory when a user asks to crawl a website they can be delivered data from the cache, without requiring a round trip to the source website. In theory this is good because it means that the burden of scrapers on websites might be greatly reduced. If you run a website with lots of high value resources for LLMs (academic papers, preprints, books, news stories, etc) the same cached content could be delivered to multiple parties without putting extra load on your server.

But, the primary reason this news caught my eye is that this service looks very much like web archiving technology to me. For example, the Browsertrix API lets you set up, start, monitor and download crawls of websites. Unlike Browsertrix, which is geared to collecting a website for viewing by a person, the Cloudflare Crawl service is oriented at looking at the web for training LLMs. The service returns text content: HTML, Markdown and structured JSON data that results from running the collected text through one of their LLMs, with the given prompt. Why is it interesting that this is like web archiving technology?

In my dissertation research (Summers, 2020) I looked at how web archiving technology enacts different ways of seeing the web from an archival perspective. I spent a year with NIST’s National Software Reference Library (NSRL) trying to understand how they were collecting software from the web, and how the tools they built embodied a particular way of valuing the web–and making certain things (e.g. software) legible (Scott, 1998). What I found was that the NSRL was engaged in a form of web archiving, where the shape of the archival records was determined by their initial conditions of use (forensic analysis). But these initial forensic uses did not overdetermine the value of the records, which saw a variety of uses later, such as when the NSRL began adding software from Stanford’s Cabrinety Archive, or when the team's personal expertise and interest in video games led them to focus on archiving content from the Steam platform.

So I guess you could say I was primed to be interested in how Cloudflare’s Crawl service sees the web. This matters because models (LLMs, etc) will be built on top of data that they’ve collected. But also because, if it succeeds, the service will likely get used for other things.

To test it, I simply asked it to crawl my own static website–the one that you are looking at right now. I did this for a few reasons:

  1. It’s a static website, and I know exactly how many HTML pages were on it: 1,398. All the pages are directly discoverable since the homepage includes pagination links to an index page that includes each post.
  2. I can easily look at the server logs to see what the crawler activity looks like.
  3. I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt, but it doesn’t block CloudflareBrowserRenderingCrawler/1.0).
  4. I host my website on a May First web server, which doesn’t use Cloudflare as a CDN, so the content wouldn’t already be in their CDN.

This methodology was adapted from previous work I did with Jess Ogden and Shawn Walker analyzing how the Internet Archive’s Save Page Now service shapes what content is archived from the web (Ogden, Summers, & Walker, 2023).

I wrote a little helper program cloudflare_crawl to start, monitor and download the results from the crawl. While the crawler ran I simultaneously watched the server logs. Running the program looks like this:

$ uvx cloudflare_crawl https://inkdroid.org

created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json

Each of the resulting JSON files contains some metadata for the crawl, as well as a list of “records”, one for each URL that was discovered.

{
  "success": true,
  "result": {
    "id": "36f80f5e-d112-4506-8457-89719a158ce2",
    "status": "completed",
    "browserSecondsUsed": 1382.8220786132817,
    "total": 1967,
    "finished": 1967,
    "skipped": 6862,
    "cursor": 51,
    "records": [
      {
        "url": "https://inkdroid.org/",
        "status": "completed",
        "metadata": {
          "status": 200,
          "title": "inkdroid",
          "url": "https://inkdroid.org/",
          "lastModified": "Sun, 08 Mar 2026 05:00:39 GMT"
        },
        "markdown": "...",
        "html": "..."
      },
      {
        "url": "https://www.flickr.com/photos/inkdroid",
        "status": "skipped"
      }
    ]
  }
}

I decided I wasn’t interested in testing their model offerings so I didn’t ask for JSON content (the result of sending the harvested text through a model). If I had, each successful result would have had a json property as well. I am sure that people will use this but I was more interested in how the service interacted with the source website, and wasn’t interested in discovering the hard way how much it cost.
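Once downloaded, the result files are easy to work with programmatically. As a quick sanity check, you can tally how many records finished versus were skipped (a minimal sketch, assuming the file layout shown above; `tally_statuses` is my own hypothetical helper, not part of the Cloudflare API):

```python
import json
from collections import Counter
from pathlib import Path

def tally_statuses(paths):
    """Tally record statuses ("completed", "skipped", ...) across
    the downloaded <job-id>-NNN.json result files."""
    counts = Counter()
    for path in paths:
        data = json.loads(Path(path).read_text())
        for record in data["result"]["records"]:
            counts[record["status"]] += 1
    return dict(counts)
```

Running something like this over the five files above lets you check the per-record statuses against the totals reported in the job metadata.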

Below is a snippet of how the Cloudflare bot shows up in my nginx logs. As you can see, the logs record which machine on the Internet is making the request, when the request was made, and which URL on the site is being requested.

104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /about/ HTTP/1.1" 200 5077 "-" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/main.css HTTP/1.1" 200 35504 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/highlight.css HTTP/1.1" 200 1225 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /css/webmention.css HTTP/1.1" 200 1238 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /images/feed.png HTTP/1.1" 200 8134 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /js/bootstrap.min.js HTTP/1.1" 200 17317 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:58 +0000] "GET /images/ehs-trees.jpg HTTP/1.1" 200 63047 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
104.28.153.137 - - [12/Mar/2026:14:34:59 +0000] "GET /js/highlight.min.js HTTP/1.1" 200 20597 "https://inkdroid.org/about/" "CloudflareBrowserRenderingCrawler/1.0"
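The distinctive user agent string makes it straightforward to pull the crawler's activity out of the logs. A minimal sketch, assuming the standard nginx combined log format shown above (`crawler_requests` is a hypothetical helper of my own, not something from the post's tooling):

```python
import re

# Combined log format: IP, identity, user, timestamp, request line,
# status, size, referrer, user agent.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d+) (?P<size>\d+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def crawler_requests(lines, agent="CloudflareBrowserRenderingCrawler"):
    """Yield (path, status) for log lines matching the given user agent."""
    for line in lines:
        m = LOG_RE.match(line)
        if m and agent in m.group("agent"):
            yield m.group("path"), int(m.group("status"))
```

Feeding the access log through this would give a list of exactly which pages and assets the crawler fetched, and when.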

So how did Cloudflare Crawl see my website?

Crawling

Results

One of the more interesting things was that each time I requested the website be crawled it seemed to come back with a different number of results.
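One way to see exactly which URLs differ between two runs is to diff the URL sets reported in the result files. Again, this is a minimal sketch assuming the JSON layout shown earlier (`crawl_urls` is a hypothetical helper):

```python
import json
from pathlib import Path

def crawl_urls(paths):
    """Collect the set of URLs reported across one crawl's result files."""
    urls = set()
    for path in paths:
        data = json.loads(Path(path).read_text())
        urls.update(r["url"] for r in data["result"]["records"])
    return urls

# URLs seen in one crawl but not the other:
# crawl_urls(first_run_files) - crawl_urls(second_run_files)
# crawl_urls(second_run_files) - crawl_urls(first_run_files)
```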

Ogden, J., Summers, E., & Walker, S. (2023). Know(ing) Infrastructure: The Wayback Machine as object and instrument of digital research. Convergence: The International Journal of Research into New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759
Scott, J. C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed. Yale University Press.
Summers, E. (2020). Appraisal talk in web archives. Archivaria, 89. Retrieved from https://archivaria.ca/index.php/archivaria/article/view/13733

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Negativland: Live at Norfolk, VA (Lewis’)

Negativland, Live at Lewis’ in Norfolk, VA. (October 21, 1992). In the midst of their famous U2 controversy (and fallout with SST), Negativland went on tour to help recoup some of the losses and legal costs. They were kind enough to let me shoot their show.

🔖 Paul Avrich

Paul Avrich (August 4, 1931 – February 16, 2006) was an American historian specializing in the 19th and early 20th-century anarchist movement in Russia and the United States. He taught at Queens College, City University of New York, for his entire career, from 1961 to his retirement as distinguished professor of history in 1999. He wrote ten books, mostly about anarchism, including topics such as the 1886 Haymarket Riot, the 1921 Sacco and Vanzetti case, the 1921 Kronstadt naval base rebellion, and an oral history of the movement in the United States.

🔖 Alexander Berkman

Alexander Berkman (November 21, 1870 – June 28, 1936) was a Russian-American anarchist and author. He was a leading member of the anarchist movement in the early 20th century, famous for both his political activism and his writing.

🔖 On Method: How This Blog Works

Most people use AI to either get quick answers or to write things for them. This blog uses it differently – as infrastructure for thinking through ideas, documenting what emerges from that process, and preserving what’s worth keeping.

🔖 Amores Perros

Amores perros is a 2000 Mexican psychological drama film directed by Alejandro González Iñárritu (in his feature directorial debut) and written by Guillermo Arriaga, based on a story by both. Amores perros is the first installment in González Iñárritu’s “Trilogy of Death”, succeeded by 21 Grams and Babel. It makes use of the multi-narrative hyperlink cinema style and features an ensemble cast of Emilio Echevarría, Gael García Bernal, Goya Toledo, Álvaro Guerrero, Vanessa Bauche, Jorge Salinas, Adriana Barraza, and Humberto Busto. The film is constructed as a triptych: it contains three distinct stories connected by a car crash in Mexico City. The stories centre on: a teenager in the slums who gets involved in dogfighting; a model who seriously injures her leg; and a mysterious hitman. The stories are linked in various ways, including the presence of dogs in each of them.

🔖 Deadly Iranian strike changes Purim for Haredi enclave in Beit Shemesh

Political correspondent Sam Sokol and police reporter Charlie Summers join host Jessica Steinberg for today’s episode.

Following the deadly strike on Sunday that killed nine people in Beit Shemesh, Sokol and Summers discuss the shock and mourning in the centrally located city with a strong Haredi enclave.

Purim celebrations and revelry continued in some parts of Beit Shemesh, report the pair, as some synagogues flouted the Home Front Command directives regarding gatherings, while others reflected a somber, cautious mood.

Sokol takes a moment to update us on matters in the Knesset, where most committee meetings were canceled due to the hostilities, and speculates on whether war with Iran will boost Netanyahu at the ballot box in the upcoming elections.

Finally, Summers reports on an end-of-Purim street party in Jerusalem, where police kept a hands-off approach, and the scene of a missile strike in the capital earlier in the week.

🔖 Keenious

A generative AI tool that functions as a research assistant and uses OpenAlex as a data source.

🔖 Wikidata:Wikibase GraphQL

The Wikibase GraphQL API was developed following an investigation into alternative ways of accessing Wikidata and Wikibase content that reduce load on the Wikidata Query Service (WDQS), improve the developer experience for common read use cases and allow more flexible data retrieval in a single request.

As part of this investigation, a Wikibase GraphQL prototype was built to explore what is technically possible and whether GraphQL would be a good fit for Wikibase data, with promising results and supportive feedback.

🔖 Re-OCR Your Digitised Collections for ~$0.002/Page

In the last few years, a new generation of OCR models based on Vision Language Models (VLMs) has emerged. These models are primarily the result of “running out of tokens” and the consequent desire from AI companies to find new sources of data to train on. This led to the development of OCR models using VLMs as backbones which usually aim to output “reading order” text — i.e. text with minimal markup, usually targeting Markdown. These models can perform much better on the same scans that older tools struggled with, producing cleaner, more structured output.

🔖 Lawyers, Humility, and LLMs

If some of the world’s highest-paid lawyers, at the world’s highest-status firms, do deals worth tens of billions of dollars with language they don’t understand, what does that say about the law’s pretensions to high standards?

Yes, like everything else in 2026 this is actually a post about LLMs.

🔖 My Coworkers Don’t Want AI. They Want Macros

My coworkers don’t want AI. They want macros.

Let me back up a little. I spent April gathering and May refining and organizing requirements for a system to replace our current ILS. This meant asking a lot of people about how they use our current system, taking notes, and turning those notes into requirements. 372 requirements.

Going into this, I knew that some coworkers used macros to streamline tasks. I came out of it with a deeper appreciation of the different ways they’ve done so.

It made me think about the various ways vendors are pitching “AI” for their systems and the disconnect between these pitches and the needs people expressed. Because library workers do want more from these systems. We just want something a bit different.

🔖 Snapicat

Snapicat is a monorepo for a Worldcat OCLC workflow app: upload Excel data, search variables against the OCLC API, and generate MARC/MARCXML for cataloging. It consists of a Vite + React frontend and an Azure Functions (Python) backend that talk to the OCLC Worldcat Metadata API. The backend can also be run as a web server using FastAPI via the app.py file.

🔖 Open Historical Map

OpenHistoricalMap is an ambitious, community-led project to map changes to natural and human geography throughout the world… throughout the ages. Big and Small, Then and Now

Empires rise and fall. Glaciers disappear. Languages and religions spread from one region to another. Simple dirt paths become busy highways and railways. Modest buildings give way to soaring skyscrapers. And you remember what your neighborhood used to look like. All of it belongs on OpenHistoricalMap.

🔖 SEASON: A letter to the future

Leave home for the first time to collect memories before a mysterious cataclysm washes everything away. Ride, record, meet people, and unravel the strange world around you in this third-person meditative exploration game.

🔖 Iran war heralds era of AI-powered bombing quicker than ‘speed of thought’

The use of AI tools to enable attacks on Iran heralds a new era of bombing quicker than “the speed of thought”, experts have said, amid fears human ­decision-makers could be sidelined.

Anthropic’s AI model, Claude, was reportedly used by the US military in the barrage of strikes as the technology “shortens the kill chain” – meaning the process of target identification through to legal approval and strike launch.

🔖 Wikidata:WikiProject PCC EMCO Wikidata CoP

The Program for Cooperative Cataloging (Q63468537) (PCC) has launched a global cooperative for entity management on the semantic web called EMCO. As part of this program, the Wikidata user community has set up a Community of Practice to coordinate identity management work for GLAMs. You can read more about EMCO and the Wikidata Community of Practice at the EMCO Lyrasis Wiki.

This project is an extension of the work of Wikidata:WikiProject PCC Wikidata Pilot / WikiProject PCC Wikidata Pilot (Q102157715) and acknowledges its great intellectual and organizational debt to the LD4 Wikidata Affinity Group (Q124692294).

🔖 John Fahey Mix Tapes

In the 1990’s my future wife was a record store clerk in Portland, Oregon. American guitar legend John Fahey was living in a nearby town and would visit the shop. Here are two mix cassettes that he made for her during that time.

Build a static search for an Internet Archive Collection with Pagefind / Raffaele Messuti

Pagefind caught my attention about a year ago, and since then I've adopted it in several hobby projects (nothing work-related): some blogs built with static generators like Hugo or Zola, some old HTML content distributed on CD-ROM, and some mailing list archives where I converted mbox files to HTML and then indexed them.

The tool is great, better for my needs than other JavaScript search libraries (though it's not really fair to compare them, since they're quite different). Pagefind is a search tool that runs entirely in the browser with zero server-side dependencies. It indexes your content into a compact binary index, using WASM to run search in the browser.

It can't completely replace server-side search technologies like Solr or Elasticsearch, mainly because the index can't be updated incrementally. But for many small to medium digital libraries or collections that are rarely updated once completed, it's an extremely good tool: very fast, easy to integrate into web pages, and requires almost no maintenance.

Until now I was convinced that the only way to build an index was by reading content from existing HTML files. That changed when I listened to this Python in Digital Humanities podcast, where David Flood mentioned:

Critically, PageFind has a Python API that lets you build indexes programmatically from database dumps rather than only from HTML files.

I'd completely missed that Pagefind has a Python API (and a Node one too), which makes it easy to build an index from any data source.


Here's a basic example: building a search index for an Internet Archive collection.

I'm using the Pagefind pre-release here, which introduces a new UI with web components.

Init

uv init .
uv add internetarchive
uv add --prerelease=allow 'pagefind[bin]'

Directory to save the index and serve the UI

mkdir ./web

Python code: create an index from the metadata of this collection (which is actually a collection of subcollections on the Internet Archive, with Italian content related to radical movements)

import asyncio
import logging
import os

import internetarchive
from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "DEBUG"))
log = logging.getLogger(__name__)


async def main():
    config = IndexConfig(output_path="./web/pagefind")

    async with PagefindIndex(config=config) as index:
        log.info("Searching collection:radical-archives ...")
        results = internetarchive.search_items(
            "collection:radical-archives",
            fields=["identifier", "title", "description"],
        )

        count = 0
        for item in results:
            identifier = item.get("identifier", "")
            title = item.get("title", identifier)
            description = item.get("description", "")
            url = f"https://archive.org/details/{identifier}"
            thumbnail = f"https://archive.org/services/img/{identifier}"

            if isinstance(description, list):
                description = " ".join(description)

            await index.add_custom_record(
                url=url,
                content=description or title,
                language="en",
                meta={
                    "title": title,
                    "description": description,
                    "image": thumbnail,
                },
            )
            count += 1
            log.debug("indexed %s: %s", identifier, title)

        log.info("Indexed %d items. Writing index ...", count)

    log.info("Done. Index written to ./web/pagefind")


if __name__ == "__main__":
    asyncio.run(main())

HTML UI in ./web/index.html

<!DOCTYPE html>
<html lang="en">
	<head>
		<meta charset="UTF-8">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<title>pagefind-ia</title>
		<link href="/pagefind/pagefind-component-ui.css" rel="stylesheet">
		<script src="/pagefind/pagefind-component-ui.js" type="module"></script>
	</head>
	<body>
		<pagefind-modal-trigger></pagefind-modal-trigger>
		<pagefind-modal>
			<pagefind-modal-header>
				<pagefind-input></pagefind-input>
			</pagefind-modal-header>
			<pagefind-modal-body>
				<pagefind-summary></pagefind-summary>
				<pagefind-results show-images></pagefind-results>
			</pagefind-modal-body>
			<pagefind-modal-footer>
				<pagefind-keyboard-hints></pagefind-keyboard-hints>
			</pagefind-modal-footer>
		</pagefind-modal>
	</body>
</html>

Result: easy to embed it anywhere!
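To try the result locally, serve the ./web directory with a static file server. Any static server will do; for example, Python's built-in one (my suggestion, not part of Pagefind itself):

```shell
# Serve the index and search UI from ./web at http://localhost:8000
python3 -m http.server 8000 --directory ./web
```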

Trails and tours in library online environments / John Mark Ockerbloom

Below is the text of the lightning talk I gave at Code4Lib 2026 earlier this week, on March 3. The conference venue where I delivered it is located at 1 Dock Street in Old City Philadelphia. Links below go to websites with images similar, but not always identical, to the ones I showed during the talk, as well as to some additional sites giving more context.

If you have a chance, it’s worth walking a few blocks from here to 6th and Market Street, where you can find a reconstructed frame of the President’s House, the home of George Washington during his presidency when Philadelphia was the capital of the US.

An exhibit went up there some years ago, telling the story of the nine people in his household who were enslaved there. Not long ago, the Trump administration ordered the exhibit be removed. You can see here one of the spaces where its panels were taken down.

Here’s one of those panels, putting the story of Washington’s slaves in the context of where they lived, and the chronology of their bondage and freedom.

A judge recently ordered that the exhibit be restored. The court battle is ongoing, and the National Park Service has put back some of the panels, while others are still missing. In some of the gaps, members of the public have put up their own signs (some of which you can see in this picture), testifying to what’s been suppressed. If you go there, you might even find someone acting as an unofficial tour guide, telling visitors stories similar to the ones that used to be on the official signs.

Now, we know what those signs said. The folks at the Data Rescue project collected photos of them before they came down, and you can view them online.   But the importance of the exhibit is not just what it says, but where it says it.   It’s important that it’s embedded in a particular place, so that people who come visit what’s sometimes called the cradle of liberty also find out that there’s a story about the people deprived of liberty here, and about how they won their freedom.

While we’re at Code4lib, we’re also embedded in a rich environment filled with history and culture.  Just on your walk from here to the President’s House you might pass by the Museum of the American Revolution, the Science History Institute, the American Philosophical Society, the Weitzman National Museum of American Jewish History, and of course, the Liberty Bell and Independence Hall. There’s all kinds of trails of knowledge you can follow, and it’s even better when you have a guide to those trails.

So what do I mean by a trail? A trail is a designated, visible path designed to help its users appreciate and understand the environment it goes through. You may have hiked some trails yourself, and you may have gone on more explicitly interpretive ones, like the Freedom Trail in Boston.

Our libraries are also rich environments of history and culture.  And we provide ways for users to search them, but do we provide trails for them?

Well, we kind of do. We have exhibits, like this one from the Library Company of Philadelphia, providing a guided path through a collection of 19th century works on mental illness. People who teach courses like this one at Yale create instructional trails in their syllabus reading lists. And books that our scholars and authors write, like this one on the history of the civil rights movement, show an implicit trail of events they cover in their tables of contents.

But while these trails all refer to resources in our libraries, they’re not embedded in libraries in the same way as the exhibits and trails I’ve shown in Philadelphia and Boston. But they could be. 

You can think of it as an extension of browsing.  Last time Code4lib was here in Philly, I showed how a catalog I maintain lets you browse subjects using relationships in the Library of Congress Subject Headings, so you can explore various related topics around, say, who can start a war. More recently, I’ve added features for finding out more about people and their relationships, using linked data from places like id.loc.gov and Wikidata.

But we don’t have to stop with what’s in authority files, or in generic library descriptions. Maybe in the future, when you’re visiting Martha Washington’s page, you’ll find a trail that goes through it, like a trail telling the story of Ona Judge, one of the African Americans who Martha claimed ownership over, and who escaped from the house at 6th and Market here in Philadelphia, and stayed free the rest of her life.

What will that trail telling her story look like? I’m not quite sure, but I have some ideas that I’m hoping to try implementing, not so that I can tell the story myself, but so that I can represent the story from others who can tell it better than I can. And so that people visiting my site can find and follow that story, with all of its richness, just as they once could when they visited the President’s House in Philadelphia, and as I hope they soon can do here again.

If this interests you, I’d love to talk more with you.

Proofs / Ed Summers

This is a good post from Dan Chudnov about his work on mrrc (a Python wrapped Rust library for MARC data) and how agentic-coding tools (e.g. Claude Code) can be useful for learning, adding rigor and engineering that might otherwise not be practical or feasible.

pymarc has been proven through years of use, bug reporting, and improvements, but has never been formally verified, or had that level of rigorous attention. I remain skeptical about building AI into everything, but Dan has helped me see a silver lining where, as code gets easier to write, with all its potential for slop, it also simultaneously opens a door to helping make it more reliable and performant.

And, Dan is not alone in thinking this. What if the tools for describing how software should work, and for measuring how software does work, get much, much better? If formal verification tools become more accessible and can be applied not just at the base layer of systems (where it really matters) but in middle and frontend layers of applications, where domain experts and stakeholders would really like more control and insight into how software works for them and others?

This approach implies a level of restraint, or a holding back of the generation of code that has not yet had this level of rigor applied to it. The discourse around vibecoding on the other hand seems to be the natural culmination of a “move fast and break things” philosophy that almost everyone outside of Silicon Valley has seen for what it is.

March 2026 Early Reviewers Batch Is Live! / LibraryThing (Thingology)

Win free books from the March 2026 batch of Early Reviewer titles! We’ve got 226 books this month, and a grand total of 3,026 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.

If you haven’t already, sign up for Early Reviewers. If you’ve already signed up, please check your mailing/email address and make sure they’re correct.

» Request books here!

The deadline to request a copy is Wednesday, March 25th at 6PM EDT.

Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, the UK, Israel, Australia, Canada, Ireland, Germany, Malta, Italy, Latvia and more. Make sure to check the message on each book to see if it can be sent to your country.

The Great WhereverExceptional Hatred: Antisemitism and the Fight for Free Speech in Modern AmericaProcrastination Proof: Never Get Stuck AgainRules to Live By: Maimonides' Guide to a Wonderful Life (HEBREW EDITION)Endless Exodus: The Jewish Experience in EthiopiaBlue Team Dynamics: Three Proven Leadership Principles Inspired by IDF Sources for Business and LifeSons of Abraham: A Candid Conversation about the Issues that Divide and Unite Jews and Muslims (HEBREW EDITION)Sons of Abraham: A Candid Conversation about the Issues that Divide and Unite Jews and Muslims (ARABIC EDITION)Puzzles She PackedBloom Of BetrayalNever Hide from the DevilBowers Mansion: The Legacy of a Comstock FamilyTangential Terrains: Cormac McCarthy's GeoaestheticsA Future For Ferals: A Charity AnthologyMore Futures for Ferals: A Charity AnthologyHow to Create an Organic Aquarium: The Beginner's Guide to Soil-Based Freshwater AquariumsRonald, the RoninDying to Live HereThe Unfavored Children's ClubSea SudsFaking to FallingBunnies in the Berry RowThe CorryJack Rittenhouse: A Western Literary LifeArthur and the Kingswell TrioMantleSome Stupid Glow: StoriesDollartoriumWhen Paris WhispersThe Night Nurse and the Jewel ThiefHeroes of PALMAR: How One IDF Unit Revolutionized Combat Medicine in GazaWhen Eichmann Knocked on Our Doorאיש כפי נחלתו: שנים-עשר שבטי ישראל בנחלות אבותיהםFamily DramaThe Son Of A Belfast Man: From the Early Years Up to Nineteen Years OldClaimed by DarknessThe Alfriston QuartetJaguars and Other GameJungle of AshesShooting Up: A Memoir of Love, Loss, and AddictionWarp & WeftHere for a Good TimeCanada: We Are the StoryRuthieA Deadly InheritanceFly in the ChaiMjede: The Three DaysSince You Weren't There and Other MemoriesQuestions for Werewolves: A Creative Nonfiction of Madness, Witch and DaimonEstuaryI'll Stop From MondayThe Marilyn DiariesNever Hide from the DevilThe Greatest New York Yankees by Uniform NumberThe Blue WaveCalisthenics: Core Crush: 38 Bodyweight Exercises for a 
Stronger CoreLightningShadows of the Republic: The Rebirth of Fascism in America and How to Defeat It for GoodDigital Coup: The Conspiracy to Thwart Global DemocracyWeathering the Storm: Navigating the Anti-Social Justice WaveConversion Therapy Dropout: A Queer Story of Faith and BelongingThe Christian Past That Wasn't: Debunking the Christian Nationalist Myths That Hijack HistoryPuppy Training: The Smart Way7 Spiritual Habits to Change Your LifeInvesting for BeginnersWitch of the Shadow WoodThe Last PageWe Become DarknessPondering: A Story in CinquainsBy the Bubbling BrookTaming the AlphaTo See BeyondThe Fallen: The Lost Girls of Ireland's Magdalene Laundries and a Legacy of SilenceSeed Starting Simplified for Beginners: A Complete, Step-by-Step Guide to Growing Healthy, Strong Seedlings Indoors, Avoiding Common Mistakes & Transplanting with ConfidenceContinuous Improvement Essentials You Always Wanted to KnowBetter: A Guidebook to a New and Improved YouDigital SAT Reading and Writing Practice QuestionsDigital SAT Math Practice QuestionsThe Theater: Courage and Survival in the Defining Atrocity of the Ukraine WarOur Minds Were Always Free: A History of How Black Brilliance Was Exploited--And the Fight to Retake ControlInheritance: Nick Chambers Slayer for HireSuperteams: The Science and Secrets of High-Performing TeamsPrickles and PridesNo Further Action: Ten Short StoriesPermit to StayLife Is Terminal: And So Is This Cold SoreThe Tarishe CurseIndian Warner: Son of Two WorldsSpindleheart: Wrath of the Ravelwind KnightThe Sure Thing: A Pleasure Practice to Revive the SparkEssence MergingQasida for When I Became a WomanNo Winning This WarMan of a Thousand Fails: Film Noir of Elisha Cook JrRed DemonSticks and Stones and Dancing Cranes: The End of the BeginningFool: A Tudor NovelWho in Astrology Are You?Stillness and Survival: A Life Between Trauma, Glitter, and the Echo of My Own VoiceThe Florist's Budding DesireFission: A Novel of Atomic HeartbreakEmberglow Falls 
Academy: The Legacy of MagicThe Jolt: A Time-Slip RomanceHaggadahpalooza: The Unofficial Weirdly Perfect Passover Pop Parody PanoplyTwo x ThreeMother of Assassins: A Memoir of the ImaginationInner, The Breath of God, Volume 1Play From Your HeartLegends of Mexico Coloring Book: Mythical Tales and Folklore to Color and EnjoyThe Golden Apple and the Nine Peahens: A Balkan Orchard TaleConnection:LostOne of a Kind CreaturesC is for Childhood Cancer: And Other Lessons Cancer Taught MeThere's a Young Man Dressed in BlueChivalry & ChocolateCaput Mundi: The Head of the WorldCain's ChameleonThe Lion's DenCain's ChameleonOn Moreton WatersThe Million-Dollar Sentence: The Secret of the Valley of PeaceA Moment's SurrenderLogos Palimpsest: Layered Verses of My Myths and MemoriesFelicity Fire and the Forever KeyMinds & Moods: Power & Deception Crossword PuzzlesTrue & Absurd Lawsuits: The Cases Kept ComingDear Missing FriendIn His Absence: A Brother, A Life, and What EnduresWill's WakeDesert Superstars: A Patience & Perseverance Coloring Adventure: A Mindfulness Coloring Book with Desert Animals, Patience-Building Prompts, and Mindful SEL Adventures for Growing HeartsOur Better NatureThe Pioneer Converts: The Message of HopeThe Black Knight: Miqdad Historical NovelThe Gardener Parent: Stop Yelling and Start Guiding Using Ericksonian MethodsBlütenschwere : Roman über Die Gewalt der AuslöschungThe Weight of Petals: A Story of Memory and ResistanceThe Problem with Conspiracy Theories: Real Scandals, Fake Mysteries, and How Distrust Took OverCity of the Gods: The Return of Quetzalcoatl (15th Anniversary Edition)The Three-Bullet Act: Journal of an HR DirectorThe Shapeshifter's GambitThe Vampyre ClientJeannie's Bottle: IncantationsFated RebirthLove and Ghosts at Hideaway LakeJonah and Mira: The Map Beneath the OakChangeupA Gift of RevelationsBachelorx: A Nonbinary MemoirA Strange SoundThe Rising of the WolvesThe Rising of the WolvesThe Missing FrameCaenogenesisThe Standard: 38 Standards 
of LifeThe Caregiver's Game: Unraveling Financial Deceit in the Shadows of DementiaClass Is in Session: Teaching Through the ChaosPolitics and Morality: The Problems of Ethical Debate for an Evolved Social SpeciesThe Book of Peace AphorismsTerrestrialQueenslanderThe Blood of Birds: A King David-Era ThrillerA Look into Mirrors: Their Making and Use Throughout HistoryThe Coherent Website: Designing for Trust in the Age of SearchHuman Again: In the AI AgeCut to the QuickThe Clockwork SpyYou CancerViveActs Of FaithThe HuntedAbba, Father!: A Journey to Knowing God in His Greatest Role of AllMidnight MeowsA Night of Strange DreamsAunt Rosie's FarmClose Encounters with Tort$Rewriting Your Life: A Workbook On Self-DiscoveryEpic Health & Ultimate Training: A Self-Help Workbook For Becoming StrongConnecting Goals to Impacts and Outcomes: Harnessing Structured Conversations for Customer-Driven Value DeliveryTrust and Treason: The RiseThe Last Phone CallWhen We Came Full CircleWhen Bonds Were ForgedThe Waterfall of VengeanceRain and Sun: Confessions of Love, Silence, and an Irrevocable PastAn Unsuitable Knight: A Novel of Norman ItalyBound by the ElementsMarriage Supper, Clearing GoatWord Fill in Puzzles: Large Print Puzzles for Seniors with over 70 Nostalgic Brain Games to Keep Your Mind Sharp and Active (Solutions Included)Yours Rhetorically, Cold Blue Monster: A Criminal Counseling Text-MoirMidnight BallerinaThe Agentic Loop: How Humans + AI Build Experiences That LearnThat Which Does Not Kill Us: An Intergenerational Memoir of Legacy TraumaIn the Belly of the AnacondaFree Will: Resolving the MysteryFree Will: Resolving the MysteryTattle Royale: Burn BookRupture Threshold1,2&3 John Bible Study: Dwell in LightThe Nutcracker - Gird Thy LoinsThe Magic SeekerNyxalath Heirophant of VeilsReed CityTerr-or-Treats: Spooky Ghost Stories and Deliciously Haunted AdventuresIncunabulaI Don’t Hum Anymore: A Confession of Silence, Survival, and City MadnessGolden LightI Raised Monsters: A 
Failed Teacher's Confession — Prisoner 4782A Florida Dance: Life Stories from the Sunshine StateCavern Sanctuary: After the FalloutDeep Work for Distracted People: Simple Methods to Stay Focused, Think Clearly, and Finish What MattersThe Law of the Spirit of Life: God's Design for a Life of Effortless TransformationOne-Page Wealth Compass: Fired at 63 Nearly Broke - Safely a Millionaire by 69The Dog BookThis Fell SergeantThe Secret Winners ClubDear Missing FriendThe FallYour Business Growth Playbook: Breakthrough Strategies to Scale Your Business for Business Owners Who've Outgrown HustleBeyond the Crystal SkyYpresMore Than ChemicalOld EarthHealthy Minds, Healthy Nation: How Meditation, Shamanism, and Indigenous Healing Can Tap into Your Light Within and Change the WorldAfter We BreakData Science in 7 Days: Python Fast-Track with Hands-on ProjectsBash and Lucy Say, Love, Love, Bark!Thinker Reads Start With Why: How to Find Your Why and Dare to Lead a Purpose Driven Life in 3 Steps Even If You’re Starting From Zero

Thanks to all the publishers participating this month!

Alcove Press Artemesia Publishing Baker Books
Bellevue Literary Press Broadleaf Books Brother Mockingbird
Cennan Books of Cynren Press City Owl Press Cozy Cozies
Egg Publishing Entrada Publishing eSpec Books
Fawkes Press Featherproof Books Gefen Publishing House
Gnome Road Publishing Grand Canyon Press Greenleaf Book Group
Hawthorn Quill Publishing Henry Holt and Company History Through Fiction
Infinite Books Inkd Publishing LLC Lito Media
PublishNation Pure Calisthenics Riverfolk Books
Running Wild Press, LLC Simon & Schuster Tundra Books
University of Nevada Press University of New Mexico Press Unsolicited Press
Vibrant Publishers W4 Publishing, LLC WorthyKids

DLF Digest: March 2026 / Digital Library Federation

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation. See all past Digests here.

 

Hello DLF Community! It’s March, which means spring is around the corner (finally!), and it’s a great time for new growth. To that end, Forum planning is well underway for the virtual event this fall, and the DLF Groups are hard at work planning fantastic meetings and events for 2026. Additionally, I’m excited to share a bit of my own news: I’m transitioning to a new role at CLIR, Community Development Officer, that will help me support our community from a new angle. You’ll still have an amazing leader in Shaneé, stellar conference support from Concentra, and I certainly won’t be a stranger. As always, my inbox is open if you want to connect, send pet pictures, or have ideas about how you’d like to see our community grow in the coming months and years. See you around soon!

– Aliya

 

This month’s news:

  • Nominations Open: Suggest the names of individuals who may make compelling featured speakers at the 2026 Virtual DLF Forum. Nominations due March 31.
  • Registration Open: IIIF Annual Conference and Showcase in the Netherlands, June 1–4, 2026. For information, visit the conference page.
  • Early Bird Registration: Web Archiving Conference 2026 at KBR, the Royal Library of Belgium. Register by March 7 to secure discounted rates, and visit the conference website for full details.
  • Call for Proposals: AI4LAM’s Fantastic Futures 2026: Trust in the Loop, September 15-17, inviting proposals on how libraries, archives, and museums engage with trust and AI. Submissions due April 6.

 

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus conferences and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.

 

  • DLF Born-Digital Access Working Group (BDAWG): Tuesday, 3/2, 2pm ET / 11am PT.
  • DLF Digital Accessibility Working Group (DAWG): Tuesday, 3/2, 2pm ET / 11am PT.
  • DLF AIG Cultural Assessment Working Group: Monday, 3/9, 1pm ET / 10am PT.
  • AIG User Experience Working Group: Friday, 3/20, 11am ET / 8am PT.
  • AIG Metadata Assessment Group: Friday, 3/20, 2pm ET / 11am PT.
  • DLF Digitization Interest Group: Monday, 3/23, 2pm ET / 11am PT.
  • DLF Committee for Equity & Inclusion: Monday, 3/23, 3pm ET / 12pm PT.
  • DLF Open Source Capacity Resources Group: Wednesday, 3/25, 1pm ET / 10am PT.
  • DLF Digital Accessibility Policy & Workflows subgroup: Friday, 3/27, 1pm ET / 10am PT.
  • DAWG IT & Development: Monday, 3/30, 1pm ET / 10am PT.
  • DLF Climate Justice Working Group: Tuesday, 3/31, 1pm ET / 10am PT.

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member organization. Learn more about our working groups on our website. Interested in scheduling an upcoming working group call or reviving a past group? Check out the DLF Organizer’s Toolkit. As always, feel free to get in touch at info@diglib.org

 

Get Involved / Connect with Us

Below are some ways to stay connected with us and the digital library community: 

 

The post DLF Digest: March 2026 appeared first on DLF.

Listening to library leaders: Surveys capture real-time perspectives shaping decisions across the field / HangingTogether

Hands typing on a laptop keyboard with a transparent digital checklist interface overlaid on the screen, showing multiple checked boxes and lines of text.

Funding and resourcing, technology, staffing, community needs and expectations—the pace of change library leaders now need to navigate and lead their organizations through is nothing short of breathtaking. Trends that took years to evolve now demand responses and strategic planning within months, or even days. Grounding those choices in rigorous, in-depth research remains essential.

At the same time, library decision-makers benefit from collective wisdom and insights shared among peers. Knowing how others are responding to similar pressures can help leaders calibrate their strategies and avoid reinventing the wheel. When those insights are confined to personal or regional networks, the limited perspective can restrict leaders’ views of how priorities and decisions are shifting.

OCLC Research leadership insights: Real-time insight for real-world decisions

This tension between the need for deeply researched guidance and the demand for timely, real-world insight creates a gap for the field. Library leaders need to understand not only which frameworks and models exist for long-term decision-making that are supported by our traditional research efforts, but also how their peers are responding to rapidly changing conditions right now.

To help fill this gap, OCLC Research is expanding its approach to gathering and sharing knowledge with a new series of pulse surveys focused on library leadership priorities. These quick, timely surveys aim to gather information on the decisions library leaders are making on a variety of critical topics shaping the future of librarianship.

A complementary approach to longstanding research practices

These short surveys are designed to capture high-level snapshots of the decisions library leaders make in the moment on subjects critical to the field, such as community engagement tactics and the use and implementation of new technologies, including AI. They are intentionally brief, both to respect leaders’ time and to enable us to respond quickly to emerging issues.

This approach does not replace the in-depth, foundational research OCLC Research is known for. Rather, it adds another dimension to it.

Our long-form research projects will continue to provide thoughtful frameworks, deep analysis, and foundational guidance for operational decision-making and long-term innovation. Leadership insights surveys complement that work by:

  • Broadening the range of topics we can address, especially those that are evolving quickly
  • Expanding the pool of voices contributing insight, drawing from library leaders across regions and library types
  • Capturing change as it happens, and tracking how priorities and decisions shift over time

Together, these approaches create a more layered understanding of the field, combining depth with immediacy.

Powered by OCLC’s global membership network

The value of these leadership insights depends on scale. OCLC is uniquely positioned to engage a broad, global network of libraries and library leaders representing diverse viewpoints. This allows us not only to collect perspectives from beyond individual professional networks but also to share results with the field quickly and widely.

The outcomes will be intentionally concise: scannable, easy-to-digest summaries that surface patterns, contrasts, and emerging directions. Think of them as snapshots—ephemeral by design—that help illuminate how decisions are being made today, while also building a record of how those decisions evolve over time.

What this means for library leaders

For library leadership, this new format offers another way to stay oriented in a fast-moving environment:

  • Insight into how peers are prioritizing and responding to shared challenges
  • Timely information that can inform near-term decisions
  • A broader field-level perspective that complements local experience

By adding pulse surveys to our toolkit, OCLC Research is expanding the breadth and increasing the pace of the insights we provide, while remaining grounded in the thoughtful, evidence-based work that has long supported libraries’ strategic and operational decision-making.

We see this as one more way to help library leaders make sense of complexity, learn from one another, and move forward with confidence. Our first pulse survey, focused on AI innovation & culture in libraries, will be fielded with US library leaders in early March 2026.

Subscribe to Hanging Together, the blog of OCLC Research, for updates on the survey series and to follow our latest work.

The post Listening to library leaders: Surveys capture real-time perspectives shaping decisions across the field appeared first on Hanging Together.

Does Clarivate understand what citations are for? / Hugh Rundle

A month ago Clarivate announced a new yet-to-be-released product called Nexus: "Clarivate Nexus acts as a bridge between the convenience of AI and the rigor of academic libraries". This is a pitch to librarians who have correctly identified generative AI chatbots as purveyors of endless bullshit, but also know that students and some researchers are going to use them anyway. Clarivate tells us that we can patch up the fabrications of chatbots with reassuring terms like "trusted sources", "verified academic references", and "authoritative".

Looking more carefully at Clarivate's marketing material, what they are proposing suggests that Clarivate understands neither what citations are for nor why fabricated citations are a problem. This is somewhat surprising for the company that controls and manages such key parts of the scholarly publishing systems as the citation database Web of Science, scholarly publishing and indexing company ProQuest, and the Primo/Summon Central Discovery Index.

Why we cite

It can get a little more complicated than this, but there are essentially two reasons for citations in scholarly work.

The first is to indicate where you got your data. If I write that the population of Australia in June 2025 was 27.6 million people, I need to back up this claim somehow. In this case, I would cite the Australian Bureau of Statistics as the source. This adds credibility to a claim by enabling readers to check the original source and assess whether it actually does make the same claim, and whether that claim is credible. If I said that the population of Australia in 2025 was 100 million people and cited a source which made that claim and in turn cited the ABS as their source, you could follow the chain of references back and identify that the paper I cited is where the error occurred.
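That chain-following idea is essentially a graph walk. Here is a minimal sketch using hypothetical data (the paper names and the `first_divergence` helper are illustrative, not a real tool): each node records a claim and the source it cites, and we walk back until a claim stops matching what its cited source actually says.

```python
# Hypothetical citation chain: each entry records a claimed figure
# (population in millions) and the source it cites. "abs" is the
# primary source and cites nothing.
chain = {
    "paper_a": {"claim_millions": 100.0, "cites": "paper_b"},
    "paper_b": {"claim_millions": 100.0, "cites": "abs"},
    "abs":     {"claim_millions": 27.6,  "cites": None},
}

def first_divergence(start, refs):
    """Follow citations from `start` back toward the primary source.

    Return the (citing, cited) pair where the claim stops matching what
    the cited source actually says, or None if the whole chain agrees.
    """
    node = start
    while refs[node]["cites"] is not None:
        cited = refs[node]["cites"]
        if refs[node]["claim_millions"] != refs[cited]["claim_millions"]:
            return (node, cited)  # this citation misrepresents its source
        node = cited
    return None

print(first_divergence("paper_a", chain))  # ('paper_b', 'abs')
```

The walk pinpoints paper_b as the place where the 100-million figure diverged from the ABS's 27.6 million, which is exactly what a reader tracing references by hand would discover.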

The second reason we cite a source is to give credit for a concept, term, or model for thinking. This is less about checking facts and more about academic norms and manners, though it also indicates how credible a scholar might be in terms of their understanding of a field. For example, I might describe a concept whereby librarians feel that the mission of libraries is good and righteous, and this leads to burnout because they feel they can never complain about their working conditions. If I did not cite Fobazi Ettarh's Vocational Awe and Librarianship: The Lies We Tell Ourselves whilst describing this, I would rightly not be seen as a credible scholar in the field, or alternatively might be seen as surely knowing about Ettarh's work but deliberately ignoring it or even claiming her work as my own idea.

Why fabricated citations are bad

So that's the basics of why scholars include citations in their work. We can now explore why fabricated citations are a problem. There are two related but distinct reasons.

Citations that look real but are actually fake waste the time of already-busy library resource-sharing teams by making them spend time checking whether the citation is real, and sometimes looking for items that don't exist. This aspect of fabrication is bad because the cited item doesn't exist. If we match this to our first reason for citing, we can see that a claim that is backed by a citation to nothing at all is, uh, pretty problematic if the reason we cite is to link to the source data backing up a claim. It's equivalent to simply not providing a citation at all, except worse because we're claiming that our plucked-out-of-the-air "fact" is backed up by some other source.

The second problem with fabricated citations is that there is no real connection between the statement being made and the source being cited. Even when the cited source exists, the link between statement and source is fabricated. This is slightly harder to see because generative AI is based on probability, so in many cases there will appear to be a connection; but without a tightly-controlled RAG system, any genuine match is likely just a lucky guess. The problem here is one of academic integrity – we've cited a source that exists, but it may or may not actually support our claim.

A false nexus

Clarivate seems to be conflating these two issues. Their Nexus product has two core functions: checking citations to see if they are real, and suggesting references for content in chatbot conversations. The first is genuinely useful, though highly constrained – Clarivate only checks their own indexes, and defines anything that doesn't appear in those indexes as either non-existent or "non-scholarly" (it's unclear how it would define, for example, something with a DOI that exists but doesn't appear in Web of Science). Neither academia nor the tech industry are short on hubris, but even in that context, "anything not listed in our proprietary databases isn't credible" is a pretty eyebrow-raising claim.
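The weakness of existence checking against a single proprietary index can be shown in a few lines. This is a hypothetical sketch, not Clarivate's actual implementation; the DOIs and the `classify` function are invented for illustration:

```python
# Hypothetical "citation checker": a citation is only "verified" if its
# DOI appears in the vendor's own index. The problem: a real DOI that
# simply isn't indexed is indistinguishable from a fabricated one.
proprietary_index = {"10.1000/real-paper", "10.1000/another-paper"}

def classify(doi: str) -> str:
    """Classify a citation the way the post describes Nexus doing it."""
    if doi in proprietary_index:
        return "verified"
    # Could be fabricated, or real but unindexed; the checker can't tell.
    return "not found / 'non-scholarly'"

print(classify("10.1000/real-paper"))        # verified
print(classify("10.5555/exists-elsewhere"))  # not found / 'non-scholarly'
```

Note that even a "verified" result only establishes that the source exists – it says nothing about whether the source actually supports the claim it is attached to, which is the second problem described above.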

The second function kicks in when the citation checker defines a citation as failed – it offers to "Find Verified Alternative". That is, Nexus offers to replace both cited sources that don't exist and cited sources that "aren't scholarly" with another real source. This addresses the first problem (cited sources that don't exist) but not the second (cited sources that aren't the real source of a claim or quotation).

With Nexus, Clarivate are essentially integrity-washing synthetic text, giving it an academic sheen without any academic rigour. Far from helping librarians, Clarivate's Nexus threatens to further unravel the hard work we do to teach students information literacy skills and its sparkling variety, "AI literacy". Students are already inclined to write their argument first and go on a fishing expedition for citations to back it up later (I certainly wrote my undergraduate essays this way). The last thing we want to do is direct them to a product that encourages this academically dishonest behaviour.

ChatGPT is designed to provide something that looks like a competent answer to a question. Nexus seems to be designed to amend this answer-shaped text into something that looks like a correctly-cited academic essay. But the point of student assessments isn't to produce essays – it's to produce competent researchers and systematic thinkers. Perhaps Clarivate thinks there is a large potential market of universities who want to help their own students cheat on assignments in ways that look more credible. To that, I would say "[citation needed]".


Memorial for Fobazi Ettarh / In the Library, With the Lead Pipe

It is with heavy hearts and great sadness that we acknowledge the passing of trailblazer and fire-starter Fobazi Ettarh. Her loss will be felt by us all for years to come.  

Fobazi published two articles with us at ITLWTLP. In 2014 she wrote “Making a New Table: Intersectional Librarianship,” one of the first scholarly articles published about viewing librarianship through an intersectional lens. In 2018 she published the hugely influential “Vocational Awe and Librarianship: The Lies We Tell Ourselves.” Since then, we have published many, many articles that cite the concept she identified: vocational awe. She was, to borrow a phrase from bell hooks, a maker of theory and a leader of action. We remember her as one of the great thinkers of her time, and we encourage our readers to spend some time with her words and her work. Additionally, please consider contributing to or sharing the link for her GoFundMe.

Streamlining Open Access Agreement Lookup for U-M Authors / Library Tech Talk (U of Michigan)

Sign hanging in a shop window that says OPEN ACCESS!
Image Caption
University of Michigan Library recently launched a new application to help U-M researchers and authors at our three campuses locate publications covered under institutional open access agreements. This tool aggregates nearly 13,000 titles across publishers, streamlining the process of locating eligible journals. The project involved data-wrangling, application design and development, and usability testing to produce a usable, sustainable tool.

Tesla's Not-A-Robotaxi Service / David Rosenthal

Source
I have now seen the fabled CyberCab three times in real life. It has two seats, one of them fully equipped with human driver interface equipment. In each case a human was using them to drive the car, which is necessary in California because Fake Self-Driving is a Level 2 driver assistance system that requires a human behind the wheel at all times. A Robotaxi that requires a human driver and can carry at most one passenger isn't going to be an economic success.

Fred Lambert has two posts illustrating the distance between Musk's claims and reality. Below the fold I look at both of them:

"Safety monitors" less safe than "drivers"

First, Tesla ‘Robotaxi’ adds 5 more crashes in Austin in a month — 4x worse than humans:
Tesla has reported five new crashes involving its “Robotaxi” fleet in Austin, Texas, bringing the total to 14 incidents since the service launched in June 2025. The newly filed NHTSA data also reveals that Tesla quietly upgraded one earlier crash to include a hospitalization injury, something the company never disclosed publicly.
Even before they were changed, we knew very few of the details:
As with every previous Tesla crash in the database, all five new incident narratives are fully redacted as “confidential business information.” Tesla remains the only ADS operator to systematically hide crash details from the public through NHTSA’s confidentiality provisions. Waymo, Zoox, and every other company in the database provide full narrative descriptions of their incidents.
But what we do know isn't good:
With 14 crashes now on the books, Tesla’s “Robotaxi” crash rate in Austin continues to deteriorate. Extrapolating from Tesla’s Q4 2025 earnings mileage data, which showed roughly 700,000 cumulative paid miles through November, the fleet likely reached around 800,000 miles by mid-January 2026. That works out to one crash every 57,000 miles.
The numbers aren't just not good, they're appalling:
By the company’s own numbers, its “Robotaxi” fleet crashes nearly 4 times more often than a normal driver, and every single one of those miles had a safety monitor who could hit the kill switch. That is not a rounding error or an early-program hiccup. It is a fundamental performance gap.
There are two points that need to be made about how bad this is:
  • However badly, Tesla is trying to operate a taxi service. So it is misleading to compare the crash rate with "normal drivers". The correct comparison is with taxi drivers. The New York Times reported that:
    In a city where almost everyone has a story about zigzagging through traffic in a hair-raising, white-knuckled cab ride, a new traffic safety study may come as a surprise: It finds that taxis are pretty safe.

    So are livery cars, according to the study, which is based on state motor vehicle records of accidents and injuries across the city. It concludes that taxi and livery-cab drivers have crash rates one-third lower than drivers of other vehicles.
    A law firm has a persuasive list of reasons why this is so. So Tesla's "robotaxi" is actually 6 times less safe than a taxi.
  • Fake Self Driving is a Level 2 system that requires a human behind the wheel, and that is the way Tesla's service in California has to operate. But in Austin the human is in the passenger seat, or in a chase car. Tesla has been placing bystanders at risk by deliberately operating in a way that it knows, and the statistics it reports show, is unsafe.
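The two figures quoted above (one crash every 57,000 miles, and "6 times less safe than a taxi") follow directly from the numbers in the post, and the arithmetic is easy to check:

```python
# Checking the arithmetic behind the figures quoted above.

miles = 800_000            # estimated cumulative paid miles by mid-January 2026
crashes = 14
miles_per_crash = miles / crashes
print(round(miles_per_crash))   # ~57,143, i.e. "one crash every 57,000 miles"

# Tesla's fleet crashes ~4x more often than a normal driver; per the NYT
# study, taxi drivers crash at a rate one-third LOWER than normal drivers,
# i.e. 2/3 of the normal rate. So relative to a taxi:
tesla_vs_normal = 4.0
taxi_vs_normal = 2 / 3
tesla_vs_taxi = tesla_vs_normal / taxi_vs_normal
print(round(tesla_vs_taxi))     # 6, i.e. "6 times less safe than a taxi"
```

The 4x figure already embeds the generous assumption that every one of those miles had a safety monitor ready to intervene, so 6x against taxi drivers is, if anything, an understatement of the gap against the relevant comparison class.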

Tesla's Catch-22

Second, Tesla admits it still needs drivers and remote operators — then argues that’s better than Waymo:
Tesla filed new comments with the California Public Utilities Commission that amount to a quiet admission: its “Robotaxi” service still relies on both in-car human drivers and domestic remote operators to function. Rather than downplaying these dependencies, Tesla leans into them — arguing that its multi-layered human supervision model is more reliable than Waymo’s fully driverless system, pointing to the December 2025 San Francisco blackout as proof.

The filing, submitted February 13 in CPUC Rulemaking 25-08-013, reveals the massive operational gap between what Tesla calls a “Robotaxi” and what Waymo actually operates as one.
Tesla's filing admits that the service they market as a "robotaxi" really isn't one:
Tesla operates its service using TCP (Transportation Charter Party) vehicles equipped with FSD (Supervised), a Level 2 ADAS system that, by definition, requires a licensed human driver behind the wheel at all times, actively monitoring and ready to intervene.

On top of that in-car driver, Tesla describes a parallel layer of remote operators. The company states it employs domestically located remote operators in both Austin and the Bay Area, and that these operators are subject to DMV-mandated U.S. driver’s licenses, “extensive background checks and drug and alcohol testing,” and mandatory training. Tesla frames this as a redundancy system, remote operators in two cities backing up the in-car drivers.

That’s two layers of human supervision for a service Tesla markets as a “Robotaxi.”
Compare this with a Waymo:
Waymo’s vehicles have no driver in the car. Waymo uses remote assistance operators who can provide guidance to vehicles in ambiguous situations, but the vehicle drives itself. Waymo’s remote operators don’t control the car, they confirm whether it’s safe to proceed in edge cases like construction zones or unusual road conditions.

... Tesla’s system requires a human to drive the car and has remote operators as backup. Waymo’s system drives itself and has remote operators as backup. Tesla is essentially describing a staffing-intensive taxi service with driver-assist software. Waymo is describing an autonomous transportation network.
This is where Tesla's marketing their service as a "robotaxi" creates a Catch-22:
Tesla argues forcefully that its Level 2 ADAS vehicles should remain outside the scope of this AV rulemaking entirely, agreeing with Lyft that they aren’t “autonomous vehicles” under California law.

At the same time, Tesla is fighting Waymo’s proposal to prohibit Level 2 services from using terms like “driverless,” “self-driving,” or “robotaxi.” Tesla calls this proposal “wholly unnecessary,” arguing that existing California advertising laws already cover misleading marketing.
But note that:
A California judge already ruled in December 2025 that Tesla’s marketing of “Autopilot” and “Full Self-Driving” violated the state’s false advertising laws.
So here is the Catch-22:
Tesla is telling regulators its vehicles are not autonomous and require human drivers, while simultaneously fighting for the right to keep calling the service a “Robotaxi.” Tesla wants the legal protections of being classified as a supervised Level 2 system and the marketing benefits of sounding like a fully autonomous one.
Sadly, this is just par for the course when it comes to Tesla's marketing. Essentially everything Elon Musk has said about not just the schedule but more importantly the capabilities of Fake Self Driving has been a lie, for example a 2016 faked video. These lies have killed many credulous idiots, but they have succeeded in pumping TSLA to a ludicrous PE ratio because of the kind of irresponsible journalism Karl Bode describes in The Media Can't Stop Propping Up Elon Musk's Phony Supergenius Engineer Mythology:
One of my favorite trends in modern U.S. infotainment media is something I affectionately call "CEO said a thing!" journalism.

"CEO said a thing!" journalism generally involves a press outlet parroting the claims of a CEO or billionaire utterly mindlessly without any sort of useful historical context as to whether anything being said is factually correct.

There's a few rules for this brand of journalism. One, you can't include any useful context that might shed helpful light on whether what the executive is saying is true. Two, it's important to make sure you never include a quote from an objective academic or expert in the field you're covering that might challenge the CEO.
After all, if a journalist does include an expert pointing out that the CEO is bullshitting:
statements produced without particular concern for truth, clarity, or meaning
the journalist will lose the access upon which his job depends. But I'm not that journalist, so here is my list of the past and impending failures of the "Supergenius Engineer":
Contrast these with the successes:
  • Tesla's cars: Wikipedia notes that:
    Tesla was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning as Tesla Motors. ... In February 2004, Elon Musk led Tesla's first funding round and became the company's chairman, subsequently claiming to be a co-founder
    Starting in 2008, Franz von Holzhausen designed the Model S, which launched in 2012 and was Tesla's first real success. But the company has since failed to update its line-up; it is now far behind Chinese EV manufacturers and losing market share worldwide, and it will lose US market share too once Chinese manufacturers set up US factories.
  • SpaceX's Falcon 9: Musk's insight that re-usability would transform the space business was a huge success, thanks in large part to significant government support and a great CEO, Gwynne Shotwell.
This history seems like valuable context for journalists to include in reports of Musk's next pronouncement.

2026-02-24: The 10th Computational Archival Science (CAS) Workshop Trip Report / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

IEEE BigData 2025 - The 10th Computational Archival Science (CAS) Workshop Home Page


The 10th Computational Archival Science (CAS) Workshop was part of the 2025 IEEE Big Data Conference (IEEE BigData 2025). It was an online workshop held on Tuesday December 9, 2025. It included close to 70 participants, with a keynote from Dr. Phang Lai Tee, National Archives of Singapore and Chair of the UNESCO Memory of the World Preservation Sub-Committee on Artificial Intelligence, and 18 papers from 27 institutions in 8 countries spanning 5 continents: Canada, USA (North America) / Brazil (South America) / Scotland, Spain, Switzerland (Europe) / South Africa (Africa) / Korea (Asia).

Michael Kurtz, who passed away on December 17, 2022, launched the CAS initiative in 2016 with Victoria Lemieux, Mark Hedges, Maria Esteva, William Underwood, Mark Conrad, and Richard Marciano.

The 10th CAS workshop was organized by the CAS Workshop Chairs: 

  • Mark Hedges from King’s College London UK 
  • Victoria Lemieux from U. British Columbia CANADA
  • Richard Marciano from U. Maryland USA

The workshop started with a 10 minute welcome message from the CAS workshop chairs and then a 20 minute keynote from Dr. Phang Lai Tee, National Archives of Singapore, who presented "Applications and Challenges for Archives and Documentary Heritage in the Age of AI: Some Reflections". Overall, the keynote offered a timely reflection on how AI is reshaping archival and documentary heritage work, highlighting both opportunities and challenges. It was a strong presentation, with emphasis on practical challenges such as scale, access, cybersecurity, and regulation.

The workshop itself was divided into six sessions:

1: Blockchain & Archives [2 papers]

A. Blockchain and Responsible AI: Enhancing Transparency, Privacy, and Accountability through Blockchain Hackathon

Authors: Jiho Lee, Jaehyung Jeong, Victoria Lemieux, Tim Weingartner, and JaeSeung Song

PAPER | VIDEO | SLIDES

The presentation highlights a curriculum initiative in which participants used a blockchain-enabled fair-data ecosystem (Clio-X) in a Blockathon to build privacy-preserving AI chatbots for archival datasets, and demonstrates blockchain's potential to improve transparency and accountability in AI workflows by making all actions traceable on-chain.

B. Cryptographic Provenance and AI-generated Images

Authors: Jessica Bushey, Nicholas Rivard, and Michel Barbeau

PAPER | VIDEO | SLIDES

The presentation highlighted how content credentials and cryptographic provenance frameworks can operationalize archival trustworthiness for born-digital assets and AI-generated images by embedding tamper-evident metadata into assets, which is a highly relevant and timely challenge given the proliferation of synthetic media. It effectively bridges archival theory (authenticity and provenance) with practical systems and discusses how blockchain and content credentials can support verifiable history of digital images, situating the work within computational archival science. Overall, it makes a strong conceptual and methodological contribution to trustworthy preservation of digital content.

2: Processing Analog Archives [4 papers]

A. Using an Ensemble Approach for Layout Detection and Extraction from Historical Newspapers

Authors: Aditya Jadhav, Bipasha Banerjee, and Jennifer Goyne

PAPER | VIDEO | SLIDES

The presentation focused on layout detection and Optical Character Recognition (OCR) for historical newspapers by proposing a modular, detector-agnostic ensemble pipeline combining OpenCV, Newspaper Navigator, and a fine-tuned TextOnly-PRIMA model to improve segmentation and extraction on variable scans. It’s strong in engineering detail and demonstrates practical improvements over commercial baselines like AWS Textract, especially on degraded material. Overall, it’s a solid methodological contribution with clear application value in large-scale digitization efforts.

B. PARDES: Automatic Generation of Descriptive Terms for Logical Units in Historical Handwritten Collections

Authors: Josepa Raventos-Pajares, Joan Andreu Sanchez, and Enrique Vidal

PAPER | VIDEO | SLIDES

The PARDES project presents a practical and scalable method for automatically generating descriptive terms from noisy handwritten text recognition (HTR) outputs in large historical collections, using probabilistic indexing and Zipf's Law to identify important terms. It’s strong in handling uncertainty in HTR.

C. From Analog Records to Computational Research Data: Building the AI-Ready Lab Notebook

Authors: Joel Pepper, Zach Siapno, Jacob Furst, Fernando Uribe-Romo, David Breen, and Jane Greenberg

PAPER | VIDEO | SLIDES

Similar to the previous presentation, this one addressed transforming analog, handwritten lab notebooks into AI-ready digital data to unlock valuable experimental records for computational analysis. It demonstrated promising performance. Overall, it’s a good step toward making analog scientific records computationally accessible and usable for AI systems.

D. Classification of Paper-based Archival Records Using Neural Networks

Authors: Jussara Teixeira, Juliana Almeida, Tania Gava, Raphael Lugon Campo Dall’Orto, and José Márcio Moraes Dorigueto

PAPER | VIDEO | SLIDES

The presentation demonstrates a practical application of supervised machine learning (ML) to classify unprocessed archival records, achieving high accuracy and scalability on a large real-world governmental dataset (Electronic Process System (SEP) of the State of Espírito Santo, Brazil). It effectively shows how a modular ML architecture can be integrated into existing archival systems, and how clustering similar records can reduce manual effort. Overall, it’s a solid empirical case study of ML enhancing a core archival function at scale.

3: Retrieval-augmented Generation [3 papers]

A. Developing a Smart Archival Assistant with Conversational Features and Linguistic Abilities: the Ask_ArchiLab Initiative

Authors: Basma Makhlouf Shabou, Lamia Friha, and Wassila Ramli

PAPER | VIDEO | SLIDES

This talk presented a compelling initiative to modernize archival practice by building a conversational AI assistant that integrates advanced Retrieval Augmented Generation (RAG) and semantic technologies to support fast, contextual, and professional‑level archival queries. It’s strong in conceptualizing how multilingual conversational agents can bridge gaps in access, complex metadata, and diverse user expertise. Overall, it’s an innovative approach with great potential to enhance usability and knowledge discovery in digital archives.

B. Index-aware Knowledge Grounding of Retrieval-Augmented Generation in Conversational Search for Archival Diplomatics

Authors: Qihong Zhou, Binming Li, and Victoria Lemieux

PAPER | VIDEO | SLIDES

This work presents an index‑aware chunking strategy to improve RAG pipelines for conversational search by grounding retrieval on structured index terms extracted from PDFs, aiming to reduce resource demands, accuracy issues, and hallucinations common in standard RAG workflows. It’s a practical contribution that addresses problems with traditional chunking strategies. Overall, it is an interesting methodological refinement with promising implications for archival conversational search but would benefit from broader validation.

C. Retrieval-augmented LLMs for ETD Subject Classification

Authors: Hajra Klair, Fausto German, Amr Ahmed Aboelnaga, Bipasha Banerjee, Hoda Eldardiry, and William A. Ingram

PAPER | VIDEO | SLIDES

This work presents a two‑stage RAG‑based pipeline that uses keyword extraction and guided question generation from Electronic Theses and Dissertations (ETD) abstracts to retrieve and synthesize core document content, tackling the challenge of long, full‑text processing. It addresses the challenge of subject classification at scale for ETD by capturing signatures that go beyond simple lexical similarity to improve classification accuracy and contextual richness. The evaluation shows improvements over traditional approaches. Overall, it’s a promising and well‑structured application of RAG methods to a real-world problem.

4: Archival Theory & Computational Practice [4 papers]

A. Archival Research Theory: Putting Smart Technology to Work for Researchers

Authors: Kenneth Thibodeau, Alex Richmond, and Mario Beauchamp

PAPER | VIDEO | SLIDES

This work extends archival theory beyond traditional archival management to a new Archival Research Theory (ART) framework that models archives as complex informational systems with informative potential responsive to researchers’ questions, grounded in semiotics, Constructed Past Theory, and type theory. It’s conceptually rich, offering a strong theoretical foundation for integrating smart technologies into archival research and emphasizing how meaning and context can be formally modeled to support diverse inquiry. Overall, it makes a thoughtful and potentially foundational contribution to bridging archival theory and computational practice.

B. Systems Thinking, Management Standards, and the Quest for Records and Archives Management Relevance

Author: Shadrack Katuu

PAPER | VIDEO | SLIDES

The presentation makes a case for records and archives management (RAM) within organizations by embedding RAM into widely adopted Management System Standards (MSS) like ISO frameworks, which currently drive visibility and measurable outcomes in areas such as quality and security. It uses systems thinking and standards practice to argue that RAM can gain institutional relevance and leadership buy‑in by aligning with structured MSS processes and the Plan‑Do‑Check‑Act cycle, thereby elevating archival functions beyond marginal roles. Overall, it’s a good management‑focused contribution that highlights the importance of standards and systemic framing for advancing archival relevance.

C. Can GPT-4 Think Computationally about Digital Archival Practices?

Authors: William Underwood and Joan Gage

PAPER | VIDEO | SLIDES

This work investigates whether GPT‑4o demonstrates computational thinking capabilities applied to digital archival tasks, grounding the analysis in a recognized computational thinking taxonomy. It surfaces compelling examples where the model exhibits knowledge across archival processes and computational practices, suggesting its potential as a learning partner or assistant in teaching archival computational methods. Overall, the paper offers a thought‑provoking exploration of LLM capabilities in a computational archival context, with promising avenues for further research.

D. Algorithm Auditing for Reliable AI Authenticity Assessment of Digitized Archival Objects

Author: Daniel F. Fonner

PAPER | VIDEO | SLIDES

This presentation shows how small variations in input image resolution can drastically affect AI‑based art authentication results, highlighting a key vulnerability in applying such models to archival or cultural heritage objects and raising important concerns about reliability and manipulation risk. It makes a strong case that algorithm auditing should be embedded in computational archival science practices to improve transparency, reproducibility, and accountability of automated analyses. Overall, it’s a practical contribution that urges the need for rigorous evaluation frameworks when deploying AI for authenticity and provenance tasks in digital archives.

5: Knowledge Organization & Retrieval [2 papers]

A. Ontologies Applied to Archival Records: a Preliminary Proposal for Information Retrieval

Authors: Thiago Henrique Bragato Barros, Maurício Coelho da Silva, Rafael Rodrigo do Carmo Batista, David Haynes, and Frances Ryan

PAPER | VIDEO (the slides were not posted)

This paper presents an ontology‑driven approach to improve information retrieval (IR) over archival descriptions and digital objects by capturing archival contexts such as provenance, functions, agents, and events within a formal semantic model. It grounds its design in established ontology engineering and archival principles to support semantic indexing, reasoning, and query handling. Overall, it makes a decent conceptual contribution toward ontology‑enhanced archival IR.

B. Operationalizing Context: Contextual Integrity, Archival Diplomatics, and Knowledge Graphs

Authors: Jim Suderman, Frédéric Simard, Nicholas Rivard, Iori Khuhro, Erin Gilmore, Michel Barbeau, Darra Hofman, and Mario Beauchamp

PAPER | VIDEO | SLIDES

This paper lays out a context‑driven privacy framework for archival records that combines theories of contextual integrity, archival diplomatics, and knowledge graphs to make privacy‑relevant relationships machine‑legible and support informed decisions about sensitive information at scale. Its strength lies in operationalizing context rather than content alone, using GraphRAG and knowledge graphs to capture nuanced contextual features that traditional vector embeddings miss, thereby offering a richer basis for privacy assessment. Overall, it’s a promising conceptual advancement toward AI‑enabled privacy support in archives.

6: Web Archiving [3 papers]

This session highlights my contributions. The workshop designated two slots for my papers: the first was for presenting one of the papers, and the second was for summarizing the remaining two, which is why there are three papers but only two videos. The slides for both slots are combined in one file. I want to thank Richard Marciano, Victoria Lemieux, and Mark Hedges for giving me the opportunity to present and for being flexible with the workshop registration, since my work is not funded and we were unable to pay the registration fees.

SLIDES

A. Arabic News Archiving is Catching Up to English: A Quantitative Study

PAPER

In the first paper, I presented a quantitative analysis of web archiving coverage for Arabic versus English news content over a 23‑year period, revealing that while English pages are still archived at a higher rate, Arabic archival coverage has increased significantly in recent years. I showed the heavy dependence on the Internet Archive (IA) for web archiving and that other public web archives contribute very little, exposing a centralization risk where loss of IA would make most archived content inaccessible. This paper is a continuation of previous work "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages".

B. The Gap Continues to Grow Between the Wayback Machine and All Other Web Archives

PAPER

The second paper I presented is a quantitative study showing that the Internet Archive (IA) overwhelmingly dominates public web archiving, preserving 99.74% of the archived Arabic and English news pages in the dataset I constructed (1.5 million URLs), while all other web archives combined account for only a tiny fraction. I highlighted the risk to web archiving: if the IA became unavailable, the vast majority of archived online news would be lost or irretrievable, underscoring a critical vulnerability in web preservation. My analysis offers clear results, but the paper could benefit from a broader discussion of why other web archives are shrinking and what practical strategies could diversify preservation efforts. Overall, it is an important wake‑up call about concentration in web archiving and the fragility of our collective digital memory. This paper is a continuation of previous work "Profiling web archive coverage for top-level domain and content language".

C. Collecting and Archiving 1.5 Million Multilingual News Stories’ URIs from Sitemaps

PAPER

The third paper I presented introduced JANA1.5, a large dataset of 1.5 million Arabic and English news story URLs collected from news site sitemaps, and demonstrated an effective sitemap‑based collection method that outperforms alternatives like RSS, X (formerly Twitter), and web scraping. I also discussed approaches to noise reduction, and ended by explaining how the dataset will be submitted to the IA.

One of the standout aspects of the CAS workshop was its responsiveness and quick turnaround. Reviewers' comments were actionable and came back quickly, decisions were clear, and the entire process moved at a fast pace that made it possible to focus on the work itself rather than waiting on it. The entire process, from submission to publishing and presenting the work, took about a month. It’s the kind of efficiency every venue should strive for. Attending the 10th CAS Workshop was great: it underscored issues central to computational archival science, including centralization, authenticity, and who gets to be remembered. Presenting my work on web archiving’s dependence on the Internet Archive was a rewarding experience. The discussion highlighted just how vital the Internet Archive is to our digital memory, and it was inspiring to see how its work motivates us all to take action and contribute to preserving our online heritage.

Hussam Hallak

Launching the Agent Protocols Tech Tree / Harvard Library Innovation Lab

Agent Protocols Tech Tree

Today I am sharing the Agent Protocols Tech Tree. APTT is a visual, videogame-style tech tree of the evolving protocols supporting AI agents.

Where did this come from?

I made the APTT for a session on “The Role of Protocols in the Agents Ecosystem” at the Towards an Internet Ecosystem for Sane Autonomous Agents workshop at the Berkman Klein Center on February 9th.

It’s a video game tech tree because, while the word “protocols” is boring, the phenomenon of open protocols is fascinating, and I want to make them easier to approach and explore.

What is an open protocol? Why care about them?

An open protocol is a shared language used by multiple software projects so they can interoperate or compete with each other.

Protocols offer an x-ray of an emerging technology — they tell you what the builder community actually cares about, what they are forced to agree on, what is already done, and what is likely to come next.

Open protocols go back to the founding of the internet when basic concepts like “TCP/IP” were standardized — not by a government or company creating and enforcing a rule, but by a community of builders based on “rough consensus and running code.” On the internet no one could force you to use the same standards as everyone else, but if you wanted to be part of the same conversation, you had to speak the same language. That created strong incentives to agree on protocols, from SMTP to DNS to FTP to HTTP to SSL. By tracing each of those protocols, you could see the evolving concerns of the people building the internet.

(For a great discussion of that history, see “The Battle of the Networks” from LIL faculty director Jonathan Zittrain’s book “The Future of the Internet — and How to Stop It.”)

Why are protocols so important for AI agents?

Like the early internet, AI agents today are an emerging, distributed phenomenon that is changing faster than even experts can understand. We’re holding workshops with names like “Towards an Internet Ecosystem for Sane Autonomous Agents” because no one really knows what it will mean to have millions of semi-autonomous computer programs acting and interacting in human-like ways online.

Also like the early internet, it’s tempting to look for some government or company that is in charge and can tame this phenomenon, set the rules of the road. But in many ways there isn’t one. The ingredients of AI agents are just not that complex or that controlled.

This makes sense if you look at Anthropic’s definition of an agent, which is simply “models using tools in a loop.” That is not a complex recipe: it requires a large language model, of which there are now many, including powerful open source ones that can run locally; a fairly small and simple control loop; and a set of “tools,” simple software programs that can interact with the world to do things like run a web search or send a text message. “Agents” as a phenomenon are a technique, like calculus, not a service, like Uber.
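Anthropic's "models using tools in a loop" definition can be sketched in a few lines of Python. Everything here is a stand-in: `fake_model` is a stub rather than a real LLM call, and `web_search` is a toy tool, but the three ingredients named above (model, tools, loop) map directly onto the three parts of the code.

```python
def web_search(query: str) -> str:
    """Toy tool: stands in for a real web search."""
    return f"results for {query!r}"

TOOLS = {"web_search": web_search}  # the "tools"

def fake_model(messages):
    """Stand-in for an LLM: requests one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "web_search", "args": {"query": messages[0]["content"]}}
    return {"answer": "done: " + messages[-1]["content"]}

def run_agent(task: str, model=fake_model, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):           # the "loop"
        reply = model(messages)          # the "model"
        if "answer" in reply:
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # run the tool
        messages.append({"role": "tool", "content": result})
    return "gave up"
```

Swapping `fake_model` for a call to any hosted or local model, and `TOOLS` for real programs, gives you an agent; nothing else about the recipe changes, which is why the phenomenon is so hard to centralize or control.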

That makes agents hard to regulate, and makes protocols incredibly important. It is protocols that give agents the tools they use. It is protocols that the builder community are developing as fast as they can to increase what agents can do. If you want to nudge this technique toward human thriving, it is protocols that might most shape agent behavior by making some agents easier to build than others.

To be sure, protocols aren’t the only way to influence technological development. Larry Lessig’s classic “pathetic dot theory” outlines markets, laws, social norms, and architecture as four separate ways that individual action gets regulated, and protocols are just an aspect of architecture. But the more a technology is dispersed and simple to recreate, the more protocols come into play in how it evolves.

How do I use the APTT?

APTT is designed to be helpful whether you’re a less-technical person who just wants to understand what agents are, or a more technical person who wants to understand exactly what’s getting built.

Either way the pile of agent technologies is confusing, so I recommend starting at the beginning with “Inference API.”

Inference API

Video games are often designed so you start with a simple feature unlocked and then progressively unlock more and more complex options as you learn the game. The same approach works here: imagine that you have just unlocked “Inference API” in this game, and once you’re comfortable with that, explore off to the right to see how each protocol enables or necessitates the next.

You can click each technology to learn what problem it solves (why did people need something like this?), how it’s standardizing (who kicked this off?), and what virtuous cycle it enabled (why did other people want to get on board?).

You can also see visual animations of how the protocol is used — what messages are actually sent back and forth, and between whom?

Inference API animation

If you’re interested in the technical details, you can click any of the messages to see at a wire level what’s actually happening. (Often, something simpler than it sounds.)

Inference API messages

As you move off to the right, you’ll go from widely adopted technologies, like MCP, to technologies that have commercial supporters but not much social proof yet, like Visa TAP, or technologies that don’t even exist but might make sense in the future, like Interoperable Memory, Signed Intent Mandates, or Agent Lingua Franca.

The ragged edge on the right is where I hope you’ll be the most critical: what seems inevitable, what seems like a dead end, and what would you like to see more of?

How accurate is all of this? How do I fix mistakes?

APTT is a work in progress, and to be honest in many ways is a whiteboard sketch. I put it together (and vibe coded much of it) to help support a conversation, first at the workshop and now online. I think whiteboard sketches are useful, so I’m sharing it, but I don’t pretend it’s authoritative; it’s just my rough sense of how things work right now.

(This is a weird thing about the agentic moment — my coding agent has made this tool look more polished and complete than it may really deserve. Think napkin sketch with fancy graphics.)

If you think I got things wrong or missed part of the story, please open an issue on the GitHub repository. I plan to keep this rough and opinionated, and focused on consensus-driven protocols as a lens for understanding what’s happening — so I’ll either pull contributions into the main tool, or just leave them as discussions to represent the range of opinions about how all of this works. I hope it’s fun to play with either way.

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Arke

Arke is a public knowledge network for storing, discovering, and connecting information.

Making content truly accessible is harder than it looks. Meaningful search requires vectors, embeddings, extraction pipelines—infrastructure most people can’t build. And even with that, files sitting on a website or in a folder don’t get found. You end up working alone, disconnected from related work that exists somewhere.

Arke handles all of it. Upload anything—we process it and connect it to a network where similar collections surface automatically. Your information becomes searchable, discoverable, and linked to work you didn’t know existed.

🔖 Community Calendar

Public events are trapped in information silos. The library posts to their website, the YMCA uses Google Calendar, the theater uses Eventbrite, Meetup groups have their own pages. Anyone wanting to know “what’s happening this weekend?” must check a dozen different sites.

Existing local aggregators typically expect event producers to “submit” events via a web form. This means producers must submit to several aggregators to reach their audience — tedious and error-prone. Worse, if event details change, producers must update each aggregator separately.

This project takes a different approach: event producers are the authoritative sources for their own events. They publish once to their own calendar, and individuals and aggregators pull from those sources. When details change, the change propagates automatically. This is how RSS transformed blogging, and iCalendar can do the same for events.

The gold standard is iCalendar (ICS) feeds — a format that machines can read, merge, and republish. If you’re an event producer and your platform can publish an ICS feed, that’s great. But ICS isn’t the only way. The real requirement is to embrace the open web. A clean HTML page with well-structured event data works. What doesn’t work: events locked in Facebook or behind login walls.
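The publish-once, pull-from-source model described above is simple enough to sketch. This is a deliberately naive illustration (no HTTP fetching, no ICS line folding or time zones, feeds inlined as strings, all names hypothetical): each producer publishes one iCalendar feed, and an aggregator merges the VEVENT blocks into a combined feed.

```python
def extract_events(ics_text: str) -> list[str]:
    """Return each VEVENT block (including BEGIN/END lines) as a string."""
    events, current = [], None
    for line in ics_text.splitlines():
        if line.strip() == "BEGIN:VEVENT":
            current = [line]
        elif line.strip() == "END:VEVENT" and current is not None:
            current.append(line)
            events.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    return events

def merge_feeds(feeds: list[str]) -> str:
    """Combine events from many producer feeds into one aggregator feed."""
    body = "\n".join(ev for feed in feeds for ev in extract_events(feed))
    return "BEGIN:VCALENDAR\nVERSION:2.0\n" + body + "\nEND:VCALENDAR"

# Toy producer feeds; a real aggregator would fetch these URLs periodically.
library = "BEGIN:VCALENDAR\nBEGIN:VEVENT\nSUMMARY:Story hour\nEND:VEVENT\nEND:VCALENDAR"
theater = "BEGIN:VCALENDAR\nBEGIN:VEVENT\nSUMMARY:Matinee\nEND:VEVENT\nEND:VCALENDAR"
combined = merge_feeds([library, theater])
```

When the library edits its own feed, the next pull picks up the change automatically; no per-aggregator resubmission is needed, which is the whole point of the design.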

🔖 Engineering Rigor in the LLM Age

What do LLMs mean for the future of software engineering? Will vibe-coded AI slop be the norm? Will software engineers simply be less in-demand? Rain and David join Bryan and Adam to discuss how rigorous use of LLMs can make for much more robust systems.

🔖 Wikipedia blacklists Archive.today, starts removing 695,000 archive links

The English-language edition of Wikipedia is blacklisting Archive.today after the controversial archive site was used to direct a distributed denial of service (DDoS) attack against a blog.

In the course of discussing whether Archive.today should be deprecated because of the DDoS, Wikipedia editors discovered that the archive site altered snapshots of webpages to insert the name of the blogger who was targeted by the DDoS. The alterations were apparently fueled by a grudge against the blogger over a post that described how the Archive.today maintainer hid their identity behind several aliases.

“There is consensus to immediately deprecate archive.today, and, as soon as practicable, add it to the spam blacklist (or create an edit filter that blocks adding new links), and remove all links to it,” stated an update today on Wikipedia’s Archive.today discussion. “There is a strong consensus that Wikipedia should not direct its readers towards a website that hijacks users’ computers to run a DDoS attack (see WP:ELNO#3). Additionally, evidence has been presented that archive.today’s operators have altered the content of archived pages, rendering it unreliable.”

🔖 Megalodon (website)

Megalodon (Japanese: ウェブ魚拓, “web gyotaku”) is an on-demand web citation service based in Japan. It is owned by Affility.

Megalodon’s server can be searched for “web gyotaku” or copies of web pages, by prefixing any URL with “gyo.tc”; the process checks the query against other services as well, including Google’s cached pages and Mementos.

🔖 Exclusive: US plans online portal to bypass content bans in Europe and elsewhere

WASHINGTON, Feb 18 (Reuters) - The U.S. State Department is developing an online portal that will enable people in Europe and elsewhere to see content banned by their governments including alleged hate speech and terrorist propaganda, a move Washington views as a way to counter censorship, three sources familiar with the plan said.

🔖 How An Academic Library Built a Research Impact and Intelligence Team

During recent decades, universities have faced increasing pressure to demonstrate their value and impact by contributing to real-world problem-solving and meeting broader societal needs. The reasons for this increased pressure are complex and numerous—reflecting socio-economic and socio-political considerations, globalization and intensifying competition, and growing demands for accountability and demonstrable public value. At Virginia Tech, our library’s research impact and intelligence team, of which we are all members, supports institutional strategy, researcher visibility, and decision-making in response to these demands. In this article, we’ll outline the emergence of research impact and research intelligence work in libraries, trace the development of our department, and illustrate how analytics, research information management, and consultation services are operationalized alongside ongoing efforts to promote responsible interpretation and use of research metrics.

🔖 Annotorious

Annotorious is a JavaScript library for adding image annotation capabilities to your web application. Try it out below: click or tap the annotation to edit. Click or tap anywhere and drag to create a new annotation.

🔖 Potomac Interceptor Collapse

Collapse of 72” diameter section of pipe caused overflow of more than 200 million gallons of wastewater into Potomac River.

🔖 Inside Claude Code With Its Creator Boris Cherny

A very special guest on this episode of the Lightcone! Boris Cherny, the creator of Claude Code, sits down to share the incredible journey of developing one of the most transformative coding tools of the AI era.

🔖 Current

Every RSS reader I’ve used presents your feeds as a list to be processed. Items arrive. They’re marked unread. Your job is to get that number to zero, or at least closer to zero than it was yesterday.

Current has no unread count. Not because I forgot to add one, or because I thought it would look cleaner without it. There is no count because counting was the problem.

The main screen is a river. Not a river that moves on its own. You’re not watching content drift past like a screensaver. It’s a river in the sense that matters: content arrives, lingers for a time, and then fades away.

🔖 Phantom Obligation

Email’s unread count means something specific: these are messages from real people who wrote to you and are, in some cases, actively waiting for your response. The number isn’t neutral information. It’s a measure of social debt.

But when we applied that same visual language to RSS (the unread counts, the bold text for new items, the sense of a backlog accumulating) we imported the anxiety without the cause.

🔖 ways of working with the Wayback Machine - studio and book talk in Amsterdam

Last week I gave a book talk on Public Data Cultures and co-organised a Wayback studio with the Internet Archive Europe.

As highlighted in the book talk announcement, it was really nice to have this moment there, given my longstanding collaborations with the Internet Archive - and to meet up with others connected to the archive and associated communities in Amsterdam.

🔖 Black Jesus

Black Jesus is an American live-action sitcom created by Aaron McGruder (creator of The Boondocks) and Mike Clattenburg (creator of Trailer Park Boys) that aired on Adult Swim. The series stars Gerald “Slink” Johnson, Charlie Murphy, Corey Holcomb, Kali Hawk, King Bach, Andra Fuller, and John Witherspoon. The series premiered on August 7, 2014. On December 10, 2014, the series was renewed for a second season, which premiered on September 18, 2015. Its third and final season premiered on September 21, 2019.

🔖 Oral History of John Backus

Interviewed by Grady Booch on September 5, 2006, in Ashland, Oregon, X3715.2007

© Computer History Museum

John Backus led a team at IBM in 1957 that created the first successful high-level programming language, FORTRAN. It was designed to solve problems in science and engineering, and many dialects of the language are still in use throughout the world.

Describing the development of FORTRAN, Backus said, “We simply made up the language as we went along. We did not regard language design as a difficult problem, merely a simple prelude to the real problem: designing a compiler which could produce efficient programs . . . We also wanted to eliminate a lot of the bookkeeping and detailed, repetitive planning which hand coding involved.”

The name FORTRAN comes from FORmula TRANslation. The language was designed for solving engineering and scientific problems. FORTRAN IV was first introduced by IBM in the early 1960s and still exists in a number of similar dialects on machines from various manufacturers.

🔖 FreeBSD Mastery: Advanced ZFS

ZFS improves everything about systems administration. Once you peek under the hood, though, ZFS’ bewildering array of knobs and tunables can overwhelm anyone. ZFS experts can make their servers zing—and now you can, too, with FreeBSD Mastery: Advanced ZFS.

🔖 disko-zfs: Declaratively Managing ZFS Datasets

Given a situation where a ZFS pool has just too many datasets for you to comfortably manage, or perhaps you have a few datasets, but you just learned of a property that you really should have set from the start, what do you do? Well, I don’t know what you do, I would love to hear about that, so please do reach out to me, over Matrix preferably.

In any case, what I came up with is disko-zfs. A simple Rust program that will declaratively manage datasets on a zpool. It does this based on a JSON specification, which lists the datasets, their properties and a few pieces of extra information.

🔖 Level of Detail

My hunch is that we’ll spend just as much time and energy carving code back as we will generating it. If generating code is nearly free, then the cost shifts entirely to understanding, maintaining, and pruning it. And sometimes the right move isn’t a better level of detail. It’s fewer polygons in the scene altogether. Delete the sprawling implementation and replace it with something you can actually reason about.

🔖 Poor Deming never stood a chance

The two management giants of the mid-twentieth century were Peter Drucker and W. Edwards Deming. Ironically, while Drucker hails from Austria-Hungary (like me, Drucker emigrated to the U.S. as an adult) and Deming was born in the U.S., it was Drucker who proved to be more influential in America. Deming’s influence was much greater in Japan than it ever was in the U.S. If you’ve ever been at an organization that uses OKRs, then you have worked in the shadow of Drucker’s legacy. While you can tell a story about how Deming influenced Toyota, and Toyota inspired the lean movement, I would still describe management in the U.S. as Deming in exile. Deming explicitly stated that management by objectives isn’t leadership, and I think you’d be hard-pressed to find managers in American companies who would agree with that sentiment.

🔖 Emily St. John Mandel

Emily St. John Mandel (/seɪntˈdʒɒn mænˈdɛl/;[2][3] née Fairbanks;[4] born 1979) is a Canadian novelist and essayist.[5][6] She has written six novels, including Station Eleven (2014), The Glass Hotel (2020), and Sea of Tranquility (2022). Station Eleven, which has been translated into 33 languages,[7] has been adapted into a limited series on HBO Max.[8] The Glass Hotel was translated into twenty languages and was selected by Barack Obama as one of his favorite books of 2020.[9][10] Sea of Tranquility was published in April 2022 and debuted at number three on The New York Times Best Seller list.[11]

🔖 Deb Olin Unferth

Deb Olin Unferth (born November 19, 1968) is an American author. She has published two novels, two books of short stories, a memoir, and a graphic novel. Her fiction and essays have appeared in over fifty magazines and journals, including Harper’s,[1] The New York Times,[2] The Paris Review[3] The Believer,[4] McSweeney’s, Granta[5] The Guardian,[6] and NOON. She was a finalist for the National Book Critics’ Circle Award,[7] and she has received a Guggenheim fellowship,[8] four Pushcart Prizes, a Creative Capital Fellowship for Innovative Literature,[9] and residency fellowships from the MacDowell[10] and Yaddo[11] Foundations.

🔖 Citational Politics and Justice: Introduction

This introduction provides an overview of the thirteen articles which constitute this special issue about “citational politics and justice.” The issue begins with a discussion paper, followed by six research articles, one commentary, one project report, one teaching reflection, and finishes with three conversations. Authors reflect on the history and future of citation practices, and what they mean for the recognition of marginalised scholars, knowledges, and forms of output. The range of contributions offers insights into how more just scholarly practices can be promoted in teaching, research, publishing, and collaboration with academic and societal partners. Together, these articles provide ideas for achieving greater citational justice, and ultimately improving the quality of knowledge.

🔖 Concatenative language

There are many ways to categorize programming languages; one is to define them as either “concatenative” or “applicative”. In an applicative language, things are evaluated by applying functions to arguments. This includes almost all programming languages in wide use, such as C, Python, ML, Haskell, and Java. In a concatenative programming language, things are evaluated by composing several functions which all operate on a single piece of data, passed from function to function. This piece of data is usually in the form of a stack. Additionally, in concatenative languages, this function composition is indicated by concatenating programs. Examples of concatenative languages include Forth, Joy, PostScript, Cat, and Factor.
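The stack-passing model is easy to sketch. Here is a toy concatenative evaluator, written in Ruby for illustration rather than in any of the languages named above: each word is a function from stack to stack, and running a program is just composing those functions left to right.

```ruby
# A toy concatenative evaluator: a "program" is a sequence of words,
# each word is a function from stack to stack, and evaluation is the
# composition of those functions in the order they are concatenated.
WORDS = {
  "dup" => ->(s) { s + [s.last] },                  # duplicate top of stack
  "+"   => ->(s) { s[0..-3] + [s[-2] + s[-1]] },    # pop two, push sum
  "*"   => ->(s) { s[0..-3] + [s[-2] * s[-1]] },    # pop two, push product
}

def run(program, stack = [])
  program.split.reduce(stack) do |s, word|
    if WORDS.key?(word)
      WORDS[word].call(s)     # apply the word's stack function
    else
      s + [Integer(word)]     # numeric literals push themselves
    end
  end
end

# "3 dup *" squares the top of the stack: 3 -> 3 3 -> 9
run("3 dup *")   # => [9]
```

Concatenating two programs composes them: running `"3 dup *"` and then `"1 +"` on the resulting stack is the same as running `"3 dup * 1 +"`.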

🔖 News publishers limit Internet Archive access due to AI scraping concerns

When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.

🔖 Gwtar: a static efficient single-file HTML format

Gwtar is a new polyglot HTML archival format which provides a single, self-contained HTML file that can still be efficiently lazy-loaded by a web browser. This is done by JavaScript in the file’s header making HTTP range requests. It is used on Gwern.net to serve large HTML archives.

🔖 What technology takes from us – and how to take it back

Resisting the annexation of our hearts and minds by Silicon Valley requires us not just to set boundaries on our engagement with what they offer, but to cherish the alternatives. Joy in ordinary things, in each other, in embodied life, and the language with which to value it, is essential to this resistance, which is resistance to dehumanisation.

🔖 Inside Japan’s Most Influential Architect’s Working Studio

Join us for a quiet look inside the workspace of Tadao Ando, offering a brief glimpse into his architectural process.

This studio visit documents the daily rhythms of work and the careful, repetitive making of architectural scale models that sit at the center of his practice. The focus is not on finished buildings, but on process. Time spent refining ideas. Returning to the same forms again and again. Letting work unfold slowly.

Photographed in a restrained, observational way, this project uses still imagery to pay close attention to space, light, and atmosphere. The photographs are not illustrative, but quietly descriptive, allowing the studio to reveal itself as it is.

It is a small window into how creative work happens inside a working architecture studio, and an invitation to slow down and observe the act of making.

🔖 Ambient Videos

Photographer Noah Kalina has put together some long form videos that are meant to be put on a screen and left on.

Toke / Ed Summers

Even as a skeptic, I will admit, it was interesting to hear about how Claude Code was created and how it is being developed now in this interview with its creator Boris Cherny:

Cherny’s instructions to build for the model they will have in six months, coupled with the seeming lack of understanding of what model they will have in six months (either software development goes away or an ASL-4 level catastrophe) was to be expected I guess? Maybe he knows and just isn’t saying? Maybe there isn’t very good understanding of whether one model is working better than another? The question of how these models are being evaluated for particular types of work, like software development, is actually interesting to me.

Of course, Anthropic employees would like nothing better than for people to forget how to develop software, and to become utterly dependent on them in the process. Indeed they are happily leading the way, high on their own supply of limitless tokens. They are counting on employers to follow suit, paying subscription costs to give their employees tokens to spend instead of having software developers on staff. This is following in the footsteps of what we’ve seen happen with cloud computing.

In some ways this is nothing new. Software developers have been dependent on the centralized development of compilers and interpreters for some time. So you could look at the centralization of software development into platforms like Anthropic and OpenAI as the natural next stage of development in information technology. Indeed, I think this is the argument currently being made (somewhat convincingly) by Grady Booch about a Third Golden Age of Computing which got underway with the rise of “platforms” more generally, and which includes recent genAI platform APIs and tooling.

But the big difference, that they want us all to forget, is the amount of resources it takes to build a compiler compared to an LLM and our ability to reason about them, and intentionally improve them. They also want us to forget that we need to, you know, give them all our data and ideas as context for them to do whatever they want (thanks cblgh). And as with cloud computing, they want us to forget about the materiality of computing, where computation runs. Ironically, I think computer programmers are particularly susceptible to this rhetoric of abstraction, or the medial ideology of the digital and the cloud (Kirschenbaum 2008; Hu 2015).

From a sociotechnical perspective I am curious how prompt data is being used to try to improve these models, as people start using them for ordinary tasks, and also in attempts to intentionally shape the model motivated by greed and malice. I guess the details of this process must be well hidden? Pointers would be welcome.

ActiveRecord neighbor vector search, with per-document max / Jonathan Rochkind

I am doing LLM “RAG” with rails ActiveRecord, postgres with the pgvector extension for vector similarity searches, and the neighbor gem. I am fairly new to all of this stuff, figuring it out by doing it.

I realized that for a particular use, I wanted to get some document diversity — so I wanted to do a search of my chunks ranked by embedding vector similarity, getting the top k (say 12) chunks — but in some cases I only want, say, 2 chunks per document. So the top 12 chunks by vector similarity, such that only 2 chunks per interview max are represented in those 12 top chunks.

I decided I wanted to do this purely in SQL, hey, I’m using pgvector, wouldn’t it be most efficient to have pg do the 2-per-document limit?

  • Note: This may be a use case that isn’t a good idea! I have come to realize that maybe I want to just fetch 12*3 or *4 docs into ruby, and apply my “only 2 per document” limit there? Because I may want to do other things there anyway that I can’t do in postgres, like apply a cross-model re-ranker? So I dunno, but for now I did it anyway.
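The in-application alternative from that note might look like the following minimal sketch. The method name and hash keys here are illustrative, not from the post's code: given candidate chunks already sorted by ascending vector distance, it keeps at most `max_per_doc` chunks per document, stopping once `k` have been collected.

```ruby
# Given chunks already ordered by ascending neighbor_distance,
# keep at most max_per_doc chunks per document, up to k total.
def cap_per_document(chunks, k:, max_per_doc:)
  per_doc = Hash.new(0)
  chunks.each_with_object([]) do |chunk, picked|
    break picked if picked.size >= k            # got enough results
    next if per_doc[chunk[:document_id]] >= max_per_doc
    per_doc[chunk[:document_id]] += 1
    picked << chunk
  end
end

# e.g. over-fetch 3-4x k from the vector search, then:
#   top = cap_per_document(candidates, k: 12, max_per_doc: 2)
```

Over-fetching 3–4× k from the vector search and then applying this cap keeps the distance ranking while enforcing document diversity, at the cost of pulling more rows into Ruby.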

So this was some fancy SQL; I was having trouble figuring out how to do it myself, so I asked ChatGPT, sure. It gave me an initial answer that worked, but…

  • Turns out it was over-complicated, a simpler (to my understanding anyway) approach was possible
  • Turns out it was not performant, it was not using my postgres ‘HNSW’ indexes to make vector searches higher performance, and/or was insisting on sorting the entire table first, defeating the point of the indexes. How’d I know? Well, I noticed it was being slower than expected (several seconds or at times much more to return), and then I did postgres explain/analyze… which I had trouble understanding… so I fed the results to ChatGPT and/or Claude, who confirmed, yeah buddy, this is a bad query, it’s not using your vector index properly.

I had to go on a few back and forths with both ChatGPT and Claude (this is just talking to them in a GUI, not actually using Claude Code or whatever), to get to a pattern that did use my index effectively. They kept suggesting things to me that either just didn’t work, or didn’t actually use the index, etc. I had to actually understand what they were suggesting, and tweak it myself, and have a dialog with them…

But i eventually got to this cool method that can take an arbitrary ActiveRecord relation which already has had neighbor nearest_neighbors query applied to it… and wraps it in a larger query using CTE’s that can limit the results to max-per-document.

I wondered if I should try to share this somewhere (would the neighbor gem want a PR?), except… I’m realizing, like I said above, maybe this is not actually a very useful use case, better to do it in ruby… I’m still not necessarily getting the performance I expected either, although the analyze/explain says the indexes should be used properly.

So I just share here. Note the original base_relation may have its own internal joins to enforce additional conditions on retrieval etc. Assuming each Chunk ActiveRecord model has a document_id attribute which we are using to group for max-per-document.

# We need to take base_relation and use it as a Postgres CTE (Common Table Expression)
# to select from, but adding on a ROW_NUMBER window function, that lets us limit
# to top max_per_interview
#
# Kinda tricky, especially to do with good index usage. Got solution from google and talking
# to LLMs, including having them look at pg explain/analyze output.
#
# @param base_relation [ActiveRecord::Relation] original relation, it can have joins and conditions.
#   It MUST have already had vector distance ordering applied to it with `neighbor` gem.
#
# @param max_per_interview [Integer] maximum results to include per interview (document_id)
#
# @param inner_limit [Integer] how many to OVER-FETCH in inner limit, to have enough even after
#    applying max-per-interview.
#
# @return [ActiveRecord::Relation] that's been wrapped in a query to enforce max_per_interview limits. It does
#   not have an overall limit set, caller should do that if desired, otherwise will be effectively
#   limited by inner_limit.
def wrap_relation_for_max_per_interview(base_relation:, max_per_interview:, inner_limit:)
  # In the inner CTE, have to fetch oversampled, so we can wind up with
  # hopefully enough in outer. Leaving inner unlimited would be a performance problem;
  # because of how the indexing works, it doesn't need to calculate them all if limited.
  base_relation = base_relation.limit(inner_limit)

  # Now we have another CTE that assigns doc_rank within partitioned
  # interviews, from base. Raw SQL is just way easier here.
  partitioned_ranked_cte = Arel.sql(<<~SQL.squish)
    SELECT base.*,
      ROW_NUMBER() OVER (
        PARTITION BY document_id
        ORDER BY neighbor_distance
      ) AS doc_rank
    FROM base
  SQL

  # A wrapper SQL that incorporates both those CTEs, limiting to
  # doc_rank of how many we want per-interview, and overall making sure to
  # again order by vector neighbor_distance that must already have been included
  # in the base relation.
  base_relation.klass
    .select("*") # just pass through from underlying CTE queries.
    .with(base: base_relation)
    .with(partitioned_ranked: partitioned_ranked_cte)
    .from("partitioned_ranked")
    .where("doc_rank <= ?", max_per_interview)
    .order(Arel.sql("neighbor_distance"))
end

Like I said, I am new to this LLM stuff, curious what others have to say here.

The Kessler Syndrome / David Rosenthal

LEO in 2019 (NASA)
In 1978 Donald J. Kessler and Burton G. Cour-Palais published Collision Frequency of Artificial Satellites: The Creation of a Debris Belt. Wikipedia notes that:
It describes a situation in which the density of objects in low Earth orbit (LEO) becomes so high due to space pollution that collisions between these objects cascade, exponentially increasing the amount of space debris over time.
This became known as the Kessler Syndrome. Three decades later, shortly after Iridium 33 and Cosmos 2251 collided at 11.6km/s, Kessler published The Kessler Syndrome, writing that the original paper:
predicted that around the year 2000 the population of catalogued debris in orbit around the Earth would become so dense that catalogued objects would begin breaking up as a result of random collisions with other catalogued objects and become an important source of future debris.
And that:
Modeling results supported by data from USAF tests, as well as by a number of independent scientists, have concluded that the current debris environment is “unstable”, or above a critical threshold, such that any attempt to achieve a growth-free small debris environment by eliminating sources of past debris will likely fail because fragments from future collisions will be generated faster than atmospheric drag will remove them.
Below the fold I look into the current situation.

How Likely Is A Kessler Event?

Fast forward another 17 years and Hugh G. Lewis and Donald J. Kessler (in his mid-80s) recently published CRITICAL NUMBER OF SPACECRAFT IN LOW EARTH ORBIT: A NEW ASSESSMENT OF THE STABILITY OF THE ORBITAL DEBRIS ENVIRONMENT. Their abstract states that:
Using data from on-orbit fragmentation events, this paper introduces a revised stability model for altitudes below 1020 km and evaluates the March 2025 population of payloads and rocket stages to identify new regions of instability. The results indicate the current population of intact objects exceeds the unstable threshold at all altitudes between 400 km and 1000 km and the runaway threshold at nearly all altitudes between 520 km and 1000 km.
This and other recent publications attracted the attention not only of two well-known YouTubers, Sabine Hossenfelder and Anton Petrov, but also of me.

Lewis and Kessler's conclusion mirrors that of the ESA Space Environment Report 2025 from 1st April, 2025 (my emphasis):
The amount of space debris in orbit continues to rise quickly. About 40,000 objects are now tracked by space surveillance networks, of which about 11,000 are active payloads.

However, the actual number of space debris objects larger than 1 cm in size – large enough to be capable of causing catastrophic damage – is estimated to be over 1.2 million, with over 50,000 of those larger than 10 cm.
...
The adherence to space debris mitigation standards is slowly improving over the years, especially in the commercial sector, but it is not enough to stop the increase of the number and amount of space debris.

Even without any additional launches, the number of space debris would keep growing, because fragmentation events add new debris objects faster than debris can naturally re-enter the atmosphere.

To prevent this runaway chain reaction, known as Kessler syndrome, from escalating and making certain orbits unusable, active debris removal is required.
Thiele et al Fig. 2
Another of the recent publications is Sarah Thiele et al's An Orbital House of Cards: Frequent Megaconstellation Close Conjunctions which focuses on the requirement for satellites to maneuver to avoid potential collisions, and what would happen if, for example, a solar storm disrupted the necessary command-and-control:
While satellites provide many benefits to society, their use comes with challenges, including the growth of space debris, collisions, ground casualty risks, optical and radio-spectrum pollution, and the alteration of Earth's upper atmosphere through rocket emissions and reentry ablation. There is potential for current or planned actions in orbit to cause serious degradation of the orbital environment or lead to catastrophic outcomes, highlighting the urgent need to find better ways to quantify stress on the orbital environment. Here we propose a new metric, the CRASH Clock, that measures such stress in terms of the timescale for a possible catastrophic collision to occur if there are no satellite manoeuvres or there is a severe loss in situational awareness. Our calculations show the CRASH Clock is currently 5.5 days, which suggests there is limited time to recover from a wide-spread disruptive event, such as a solar storm. This is in stark contrast to the pre-megaconstellation era: in 2018, the CRASH Clock was 164 days.
They estimate that:
In the densest part of Starlink’s 550 km orbital shell, we expect close approaches (< 1 km) every 22 minutes in that shell alone.
For the whole of Earth orbit they estimate the time between < 1 km approaches at 41 seconds.

Will Things Get Worse?

Nehal Malik's Space Is Getting Crowded: Starlink Dodged 300,000 Collisions illustrates the scale of the problem:
According to a recent report filed by SpaceX with the U.S. Federal Communications Commission, Starlink satellites performed roughly 300,000 collision-avoidance maneuvers in 2025 alone. The figures, first reported by New Scientist, offer a rare look at just how crowded low-Earth orbit has become — and how aggressively SpaceX is managing risk as its constellation scales.
...
On average, the 300,000 maneuvers worked out to nearly 40 avoidance actions per satellite last year. That number is rising quickly, with estimates suggesting Starlink could be performing close to one million maneuvers annually by 2027 if growth continues at its current pace.
While it is true that Starlink is careful:
What’s particularly notable is how conservative SpaceX’s approach is compared to the rest of the industry. While the typical standard is to maneuver when the risk of collision reaches one in 10,000, SpaceX reportedly initiates avoidance at a far lower threshold of roughly three in 10 million.
Nevertheless, Starlink's rate of maneuvers is doubling every six months, which seems likely to force a less conservative policy. The average satellite currently moves every 9 days; at this doubling rate, by the end of 2027 the average satellite would move about twice a day.
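The doubling arithmetic can be sketched directly, using the figures above (9 days between moves today, halving every six months):

```ruby
# Interval between maneuvers for an average Starlink satellite, if the
# maneuver rate keeps doubling every six months.
def days_between_moves(start_days, months_elapsed)
  start_days / 2.0**(months_elapsed / 6.0)
end

days_between_moves(9.0, 0)    # => 9.0  (today: a move every ~9 days)
days_between_moves(9.0, 24)   # => 0.5625 (about twice a day, two years out)
```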

Starlink currently has over 10,000 satellites, with plans for 12,000 in the short term. I believe the collision probability goes as the square of the number, so that will mean moving on average every 6.25 days. Their eventual plan for 42,000 would mean twice a day, or in aggregate about one move per second.
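A sketch that reproduces the post's figures: the square of the constellation-size ratio is applied to each satellite's interval between maneuvers, taking the current ~10,000 satellites at one move per 9 days as the baseline.

```ruby
# Square-law scaling of the per-satellite maneuver interval, relative to
# a baseline of 10,000 satellites each moving every 9 days.
def interval_days(n_sats, baseline_sats: 10_000.0, baseline_days: 9.0)
  baseline_days * (baseline_sats / n_sats)**2
end

interval_days(12_000)   # => 6.25 days per satellite
interval_days(42_000)   # ~0.51 days, i.e. about twice a day
# Aggregate at 42,000 satellites: 42_000 moves / 0.51 days / 86_400 s
# works out to roughly one move per second across the constellation.
```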

In order to pump SpaceX/xAI/Twitter stock in preparation for a planned IPO, Musk recently pivoted from cars, weird pickup trucks, self-driving cars, robotaxis, humanoid robots and Mars colonization to data centers in space. He claimed that by 2031 SpaceX/xAI/Twitter would operate a million satellites forming a huge AI data center. Scaling up from the current maneuver rate gets you to about a move every 125ms in aggregate.

How Bad Would A Kessler Event Be?

My friend Robert Kennedy considers the implications of a Kessler event in low Earth orbit:
  • Obviously the national security repercussions for the western world, especially the U.S., would be severe with so many force multipliers going away at once. Presenting an opportunity for adversaries to attack us, maybe.
  • The overall global space market, presently ~$700B/yr & growing fast, would shrink dramatically. This contraction in turn would be amplified in the world's stock markets since space activity is central to so many Big Tech equities now, and space infrastructure is so deeply embedded in many other enterprises' business models. ... Even modest P/E ratios suggest that an order of magnitude more, maybe two (~$10-100T) of paper wealth would disappear.
  • The space insurance market would collapse under the burden of covered claims. Re-insurers could not handle so much at once. Companies that chose to self-insure would probably go under after such a casualty. Without insurance, most enterprises could not afford to conduct space missions.
  • The space launch market would collapse, leaving only national launch capabilities maintained by individual nations for their individual non-market reasons. All those innovative rocket companies popping up to serve the mega-constellations would go away once their prime customers did. Global launch tempos would fall by more than half, from 200+/yr back to the well under 100/yr of a generation ago. Forget the $100 per kg that ... Starship was aiming for, price per kilogram would return to what it was 30 years ago, ~$10-20K/kg. Say goodbye to cheap rideshares to LEO. Even running the gauntlet thru LEO would be fraught, as the Chinese learned just a few months ago when their spacecraft was damaged by debris on the way up, necessitating the premature return of the undamaged pre-deployed spaceship to rescue the earlier crew.
  • Since 99% of Cubesats fly in LEO, the ecology of COTS parts that has sprung up to serve the Cubesat revolution would probably go away, or back into the garage at least. It might even disappear altogether if authorities of various spacefaring nations ban Cubesats. (Literally "throwing out the baby with the bathwater".) Don't underestimate the inherent conservatism of oligarchs to use a crisis to stomp on upstarts.

Can LEO Be Cleaned Up?

ClearSpace-1
In 2027 the ESA plans ClearSpace-1, an experimental mission to deorbit a dead satellite. The plan is to grab the satellite then retrofire. In principle this technique is a workable but expensive way to remove large targets before a collision fragments them, but it isn't viable for most of the results of a collision.

What Else Can Go Wrong?

The frenzy to exploit the commons of Low Earth Orbit doesn't just threaten to cut humanity off from space in general and the benefits that LEO can provide. The process of getting stuff up there and its eventual descent threatens to accelerate the process of trashing the commons of the terrestrial environment.

Going Up

Elon Musk's proposed one million satellite data center is estimated to require launching a Starship about every hour 24/7/365. Laura Revell et al's Near-future rocket launches could slow ozone recovery describes one problem:
Ozone losses are driven by the chlorine produced from solid rocket motor propellant, and black carbon which is emitted from most propellants. The ozone layer is slowly healing from the effects of CFCs, yet global-mean ozone abundances are still 2% lower than measured prior to the onset of CFC-induced ozone depletion. Our results demonstrate that ongoing and frequent rocket launches could delay ozone recovery. Action is needed now to ensure that future growth of the launch industry and ozone protection are mutually sustainable.
Black carbon heats the stratosphere, although the increasing use of methane reduces the amount emitted per ton of propellant. Each Starship launch uses about 4000 tons of LOX and about 1000 tons of methane. Assuming complete combustion, this would emit about 2,750 tons of CO2 into the atmosphere. So Musk's data center plan would dump about 24 megatons/year into the atmosphere, somewhat more than Croatia's annual emissions.
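The combustion arithmetic is straightforward stoichiometry (CH4 + 2O2 → CO2 + 2H2O, molar masses 16 and 44 g/mol), sketched here with the launch figures above:

```ruby
# Complete combustion of methane: each tonne of CH4 (16 g/mol) yields
# 44/16 = 2.75 tonnes of CO2 (44 g/mol).
CH4_TONNES_PER_LAUNCH = 1_000.0
CO2_PER_CH4 = 44.0 / 16.0

co2_per_launch = CH4_TONNES_PER_LAUNCH * CO2_PER_CH4  # => 2750.0 tonnes
launches_per_year = 24 * 365                          # one launch per hour
megatonnes_per_year = co2_per_launch * launches_per_year / 1_000_000.0
# ~24 Mt of CO2 per year
```

Note that the stated 4000 t of LOX is exactly the 4:1 oxygen-to-methane mass ratio that complete combustion requires, so neither propellant is left over.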

Coming Down

All this mass in LEO will eventually burn up in the atmosphere. Jose Ferreira et al's Potential Ozone Depletion From Satellite Demise During Atmospheric Reentry in the Era of Mega-Constellations describes the effects this will have:
This paper investigates the oxidation process of the satellite's aluminum content during atmospheric reentry utilizing atomic-scale molecular dynamics simulations. We find that the population of reentering satellites in 2022 caused a 29.5% increase of aluminum in the atmosphere above the natural level, resulting in around 17 metric tons of aluminum oxides injected into the mesosphere. The byproducts generated by the reentry of satellites in a future scenario where mega-constellations come to fruition can reach over 360 metric tons per year. As aluminum oxide nanoparticles may remain in the atmosphere for decades, they can cause significant ozone depletion.

Can A Kessler Event Be Prevented?

Ozone Hole 10/1/83
Clearly, reducing the risk of a Kessler incident requires international cooperation. We have one somewhat successful example of international cooperation to mitigate a similar "Tragedy of the Commons". Thirty-eight years ago the Montreal Protocol was agreed, phasing out the chemicals that destroy the ozone layer. Wikipedia reports that:
Due to its widespread adoption and implementation, it has been hailed as an example of successful international co-operation.
It has been effective:
Climate projections indicate that the ozone layer will return to 1980 levels between 2040 (across much of the world) and 2066 (over Antarctica).
But note that it will have taken almost 80 years from the agreement for the environment to recover fully. And that it appears to be the exception that proves the rule:
effective burden-sharing and solution proposals mitigating regional conflicts of interest have been among the success factors for the ozone depletion challenge, where global regulation based on the Kyoto Protocol has failed to do so.
The Kyoto Protocol attempted to mitigate the effects of greenhouse gas emissions. Of particular importance was that the Montreal Protocol was an application of the Precautionary Principle because:
In this case of the ozone depletion challenge, there was global regulation already being implemented before a scientific consensus was established.
...
This truly universal treaty has also been remarkable in the expedience of the policy-making process at the global scale, where only 14 years lapsed between a basic scientific research discovery (1973) and the international agreement signed (1985 and 1987).
In 1.5C Here We Come I critiqued the attitudes of the global elite that have crippled the implementation of the Kyoto Protocol. I think it is safe to say that the prospect of applying the Precautionary Principle to Low Earth Orbit is even less likely.

🇮🇩 Open Data Day 2025 in Cianjur: Geospatial Data for Mangrove Rehabilitation / Open Knowledge Foundation

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. Bandung Mappers successfully carried out the Open Data Day 2025 activity on March 6 – 8 with the theme Coastal Resilience through Mangrove Rehabilitation which was held in Cianjur, West Java. This activity was...

The post 🇮🇩 Open Data Day 2025 in Cianjur: Geospatial Data for Mangrove Rehabilitation first appeared on Open Knowledge Blog.

🇹🇿 Open Data Day 2025 in Dodoma: Driving Urban Resilience With Open Data / Open Knowledge Foundation

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. The Open Data Day 2025 event in Dodoma brought together open data advocates, government entities, researchers, NGOs, and YouthMappers under the theme “Open Data for a Resilient Dodoma.” Hosted by OpenGeoCity Tanzania with support...

The post 🇹🇿 Open Data Day 2025 in Dodoma: Driving Urban Resilience With Open Data first appeared on Open Knowledge Blog.

🇳🇵 Open Data Day 2025 in Ilam: Co-Creating Solutions to the Polycrisis with Indigenous and Marginalised Communities / Open Knowledge Foundation

This text, part of the #ODDStories series, tells a story of Open Data Day’s grassroots impact directly from the community’s voices. To celebrate Open Data Day 2025, as part of the Harnessing Opportunities to address Polycrisis through community Engagement (HOPE) project, the Nepal Institute of Research and Communications, in collaboration with the Ilam Municipality, organized a...

The post 🇳🇵 Open Data Day 2025 in Ilam: Co-Creating Solutions to the Polycrisis with Indigenous and Marginalised Communities first appeared on Open Knowledge Blog.

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 How Etsy Uses LLMs to Improve Search Relevance

Search plays a central role in that mission. Historically, Etsy’s search models have relied heavily on engagement signals – such as clicks, add-to-carts, and purchases – as proxies for relevance. These signals are objective, but they can also be biased: popular listings get more clicks, even when they’re not the best match for a specific query.

To address this, we introduce semantic relevance as a complementary perspective to engagement, capturing how well a listing aligns with a buyer’s intent as expressed in their query. We developed a Semantic Relevance Evaluation and Enhancement Framework, powered by large language models (LLMs). It provides a comprehensive approach to measure and improve relevance through three key components:

  • High quality data: we first establish human-curated “golden” labels of relevance categories (we’ll come back to this) for precise evaluation of the relevance prediction models, complemented by data from a human-aligned LLM that scales training across millions of query-listing pairs
  • Semantic relevance models: we use a family of ML models with different trade-offs in accuracy, latency, and cost, tuned for both offline evaluation and real-time search
  • Model-driven applications: we integrate relevance signals directly into Etsy’s search systems, enabling both large-scale offline evaluation and real-time enhancement in production
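The framework's first component pairs human-curated “golden” labels with a human-aligned LLM labeler. A minimal sketch of that idea, with the relevance categories, the keyword-overlap stand-in for the LLM call, and the sample data all being illustrative assumptions rather than Etsy's actual system:

```python
# Hypothetical sketch: validate an automated relevance labeler against
# human-curated "golden" labels before using it to scale training data.
RELEVANCE_CATEGORIES = ["relevant", "partially_relevant", "irrelevant"]

def llm_label(query: str, listing_title: str) -> str:
    """Stand-in for an LLM call: a trivial keyword-overlap heuristic."""
    q = set(query.lower().split())
    t = set(listing_title.lower().split())
    overlap = len(q & t) / max(len(q), 1)
    if overlap > 0.5:
        return "relevant"
    if overlap > 0:
        return "partially_relevant"
    return "irrelevant"

def agreement(golden: list[tuple[str, str, str]]) -> float:
    """Fraction of golden (query, title, label) triples the labeler matches."""
    hits = sum(llm_label(q, t) == label for q, t, label in golden)
    return hits / len(golden)

golden = [
    ("ceramic mug", "handmade ceramic mug", "relevant"),
    ("ceramic mug", "ceramic plant pot", "partially_relevant"),
    ("ceramic mug", "wool scarf", "irrelevant"),
]
print(agreement(golden))
```

Only when agreement with the golden set is high enough would such a labeler be trusted to generate labels across millions of query-listing pairs.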

🔖 Understanding Etsy’s Vast Inventory with LLMs

While our powerful search and discovery algorithms can process unstructured data such as that in descriptions and listing photos, passing in long context and images directly to search poses latency concerns. For these algorithms, every millisecond counts as they work to deliver relevant results to buyers as quickly as possible. Spending time filtering through unstructured data for every query is just not feasible.

These constraints led us to a clear conclusion: to fully unlock the potential of all inventory listed on Etsy’s site, unstructured product information needs to be distilled into structured data to power both ML models and buyer experiences.
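The distillation step can be pictured as mapping free-form listing text onto a fixed schema. A toy sketch of that shape, where the schema, the vocabulary, and the regex “extractor” are assumptions standing in for an LLM, not Etsy's pipeline:

```python
import json
import re

# Illustrative only: distill unstructured listing text into structured
# attributes. A real system would call an LLM; here a keyword lookup
# over a hypothetical schema plays that role.
KNOWN_VALUES = {
    "material": ["ceramic", "wool", "oak"],
    "color": ["blue", "red", "green"],
}

def extract(description: str) -> dict:
    """Return one value per schema field, or None if nothing matches."""
    text = description.lower()
    return {
        field: next((v for v in vals if re.search(rf"\b{v}\b", text)), None)
        for field, vals in KNOWN_VALUES.items()
    }

record = extract("Hand-thrown blue ceramic mug, dishwasher safe")
print(json.dumps(record))
```

The structured record, unlike the raw description, can be filtered and ranked in the milliseconds a search query allows.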

🔖 Unlocking the Codex harness: how we built the App Server

OpenAI’s coding agent Codex exists across many different surfaces: the web app, the CLI, the IDE extension, and the new Codex macOS app. Under the hood, they’re all powered by the same Codex harness—the agent loop and logic that underlies all Codex experiences. The critical link between them? The Codex App Server, a client-friendly, bidirectional JSON-RPC API.

In this post, we’ll introduce the Codex App Server; we’ll share our learnings so far on the best ways to bring Codex’s capabilities into your product to help your users supercharge their workflows. We’ll cover the App Server’s architecture and protocol and how it integrates with different Codex surfaces, as well as tips on leveraging Codex, whether you want to turn Codex into a code reviewer, an SRE agent, or a coding assistant.
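The excerpt describes the App Server as a bidirectional JSON-RPC API. A minimal sketch of the JSON-RPC 2.0 request/response shape such a link uses; the method name "thread/start" and its params are hypothetical placeholders, not the App Server's documented API:

```python
import json

def make_request(req_id: int, method: str, params: dict) -> str:
    """Serialize a JSON-RPC 2.0 request; the id correlates the reply."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

def handle(raw: str) -> str:
    """Toy peer: in a bidirectional protocol either side may receive
    requests; this one acknowledges whatever method it is sent."""
    msg = json.loads(raw)
    return json.dumps({"jsonrpc": "2.0", "id": msg["id"],
                       "result": {"ack": msg["method"]}})

req = make_request(1, "thread/start", {"cwd": "/tmp/project"})
resp = json.loads(handle(req))
print(resp["result"])
```

Because responses carry the request's id, either peer can have several requests in flight at once, which is what makes the bidirectional design client-friendly.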

🔖 OpenAI and Codex with Thibault Sottiaux and Ed Bayes

AI coding agents are rapidly reshaping how software is built, reviewed, and maintained. As large language model capabilities continue to increase, the bottleneck in software development is shifting away from code generation toward planning, review, deployment, and coordination. This shift is driving a new class of agentic systems that operate inside constrained environments, reason over long time horizons, and integrate across tools like IDEs, version control systems, and issue trackers.

OpenAI is at the forefront of AI research and product development. In 2025, the company released Codex, which is an agentic coding system designed to work safely inside sandboxed environments while collaborating across the modern software development stack.

🔖 Little Atoms - 2 February 2026 (George Saunders)

A talk show about ideas and culture, produced and presented by Neil Denny. Each show features guests from the worlds of science or the arts in conversation. This week: George Saunders on his latest novel, Vigil.

🔖 Bridging the Data Discovery Gap: User-Centric Recommendations for Research Data Repositories

Despite substantial investment in research data infrastructure, data discovery remains a fundamental challenge in the era of open science. The proliferation of repositories and the rapid growth of deposited data have not resulted in a corresponding improvement in data findability. Researchers continue to struggle to find data that are relevant to their work, revealing a persistent gap between data availability and data discoverability. Without rich, high-quality metadata, robust and user-centred data discovery systems, and a deeper understanding of how different researchers seek and evaluate data, much of the potential value of open data remains unrealised.

This paper presents a set of practical, evidence-based recommendations for data repositories and discovery service providers aimed at improving data discoverability for both human and machine users. These recommendations emphasise the importance of 1) understanding the search needs and contexts of data users, 2) addressing the roles that data repositories play in enhancing metadata quality to meet users’ data search needs, and 3) designing discovery interfaces that support effective and diverse search behaviours. By bridging the gap between data curation practices, discovery system design, and user-centred approaches, this paper argues for a more integrated and strategic approach to data discovery.

🔖 blevesearch

A modern text/numeric/geo-spatial/vector indexing library for Go

🔖 hister: Web history on steroids

Hister is a web history management tool that provides blazing fast, content-based search for visited websites. Unlike traditional browser history that only searches URLs and titles, Hister indexes the full content of web pages you visit, enabling deep and meaningful search across your browsing history.
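The difference Hister's blurb describes — indexing page content rather than just URLs and titles — is essentially an inverted index. A toy sketch of the idea, with the example pages and naive whitespace tokenization being illustrative assumptions, not Hister's implementation:

```python
from collections import defaultdict

# Map each token to the set of visited URLs whose *content* contains it,
# so searches hit page text, not just URLs and titles.
index: dict[str, set[str]] = defaultdict(set)

def add_page(url: str, text: str) -> None:
    """Index the full text of a visited page."""
    for token in text.lower().split():
        index[token].add(url)

def search(term: str) -> set[str]:
    """Return every indexed URL whose content mentions the term."""
    return index.get(term.lower(), set())

add_page("https://example.com/a", "Blazing fast content search")
add_page("https://example.com/b", "Browser history only stores titles")

print(sorted(search("content")))
```

A traditional history search over URLs and titles would miss the first page entirely; the content index finds it.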

🔖 Alphabet sells rare 100-year bond to fund AI expansion as spending surges

Feb 10 (Reuters) - Alphabet (GOOGL.O) on Tuesday sold a rare 100-year bond, a memo from the lead manager showed, part of a $31.51 billion global bond raise, as artificial intelligence-driven spending sparks a surge in borrowing at U.S. tech giants. Alphabet’s sale of the century bond is the tech industry’s first since Motorola’s (MSI.N) issuance that dates back to 1997, according to LSEG data.

🔖 The Eternal Mainframe

In the computer industry, the Wheel of Reincarnation is a pattern whereby specialized hardware gets spun out from the “main” system, becomes more powerful, then gets folded back into the main system. As the linked Jargon File entry points out, several generations of this effect have been observed in graphics and floating-point coprocessors.

In this essay, I note an analogous pattern taking place, not in peripherals of a computing platform, but in the most basic kinds of “computing platform.” And this pattern is being driven as much by the desire for “freedom” as by any technical consideration.

🔖 VIAF Governance Concerns about the Refurbished VIAF Web and API Interfaces

In January 2025, OCLC made significant changes to the web and application programming interfaces for Virtual International Authority File (VIAF) clusters. This article will compare the old and new interfaces, highlighting the pros and cons introduced and calling attention, especially, to critical errors introduced that compromise the functionality of much of the VIAF product. Consequently, it will raise questions and concerns regarding the governance of VIAF, as well as OCLC’s development model, testing, and feedback before public rollout.

🔖 JupyterLite

JupyterLite is a JupyterLab distribution that runs entirely in the browser built from the ground-up using JupyterLab components and extensions.

🔖 NINeS 2026: Contributed Talks

  • Economic and Human Factors in Internet Design David D. Clark (MIT)
  • Changing Internet Architecture: A Practical Perspective Jana Iyengar (Netflix) and Barath Raghavan (USC/Fastly)
  • Perspectives on Congestion Control Nandita Dukkipati (Google) in conversation with Brighten Godfrey (UIUC)
  • From Networks in Practice to Networks in Principle Jennifer Rexford (Princeton) in conversation with Nate Foster (EPFL)
  • Networking Fabrics for ML: Hardware Advances and Research Questions Arvind Krishnamurthy (University of Washington)
  • Why I was Wrong About Quality-of-Service (QoS) Scott Shenker (UC Berkeley/ICSI, Professional Dilettante)

🔖 Creativity in Conflict: A Multi-Level Exploration of Software Developers’ Capacity to Innovate

The software industry, historically driven by creativity, faces a paradox. While developers are drawn to intellectual challenges, their creativity is increasingly constrained by efficiency-driven methods and so-called productivity metrics. Although positioned as innovation engines, Agile software development (hereinafter referred to as Agile) and open-source software (OSS) approaches may prioritize incrementalism over transformative breakthroughs. This tension between structure and creativity threatens individual potential and the industry’s capacity for meaningful innovation. Without addressing this gap, contemporary development approaches may fail to support the creativity necessary for crafting novel and impactful software. This dissertation examines this gap, investigating how modern development approaches shape individual creativity into project-level innovation. Drawing on multi-level interactionist theories of creativity, we explore the conditions under which individual, team, and organizational interactions foster or constrain creative outcomes. By addressing this critical gap, our research reconceptualizes development methodologies as enablers of radical innovation rather than constraints, ensuring the industry’s continued creative and transformative impact. Using a sequential exploratory mixed-methods design, this dissertation integrates qualitative and quantitative techniques to analyze creativity within software development. The qualitative strand examines individual developer experiences through 31 semi-structured interviews with Agile practitioners. The quantitative strand assesses cognitive conflict’s impact on team performance in OSS development, analyzing 40 projects and 82,949 code commits. The mixed convergent strand evaluates corporate and open governance interplay, leveraging data from 40 projects, 10,862 releases, and 15 interviews. 
By synthesizing insights across these strands, this dissertation delivers theoretical contributions and actionable guidance for fostering creativity in software development. We challenge the myth of developers as lone “rockstars” or “hackers” by demonstrating the critical role of social interactions in shaping creativity and innovation. Empirical findings reveal that review-stage interactions—such as pull requests and code reviews—mediate and transition from creativity to innovation, while project governance moderates this relationship further. This dissertation highlights how individual, team, and organizational dynamics influence creative outcomes by operationalizing cognitive conflict and release commit novelty. These insights advance theoretical understanding and offer practical strategies for unlocking the innovative potential of contemporary development practices.

🔖 US Military Helping Trump to Build Massive Network of ‘Concentration Camps,’ Navy Contract Reveals

In the wake of immigration agents’ killings of three US citizens within a matter of weeks, the Department of Homeland Security is quietly moving forward with a plan to expand its capacity for mass detention by using a military contract to create what Pablo Manríquez, the author of the immigration news site Migrant Insider, calls “a nationwide ‘ghost network’ of concentration camps.”

🔖 unmerdify

Get the content, only the content: unenshittificator for the web.