A fundamental problem for decentralized systems like permissionless blockchains is that their security depends upon the cost of an attack being greater than the potential reward from it. Various techniques are used to impose these costs, generally either Proof-of-Work (PoW) or Proof-of-Stake (PoS). These costs have implications for the economics (or tokenomics) of such systems, for example that their security is linear in cost, whereas centralized systems can use techniques such as encryption to achieve security exponential in cost.
Now, via Toby Nangle's Stablecoin = Fracturedcoin we find Tokenomics and blockchain fragmentation by Hyun Song Shin, whose basic point is that these costs must be borne by the users of the system. For cryptocurrencies, this means through transaction fees, inflation of the currency, or both. The tradeoff between cost and security means that there is a market for competing blockchains making different tradeoffs. In practice we see a vast number of
competing blockchains:
The chart shows Ethereum losing market share against competing blockchains.
Shin's analysis uses game theory to explain why this fragmentation is an inevitable result of tokenomics. Below the fold I go into the background and the details of Shin's explanation.
Background
In 2018's Cryptocurrencies Have Limits I discussed Eric Budish's The Economic Limits Of Bitcoin And The Blockchain, an important analysis of the economics of two kinds of "51% attack" on Bitcoin and other cryptocurrencies based on PoW blockchains. Among other things, Budish shows that, for safety, the value of transactions in a block must be low relative to the fees in the block plus the reward for mining the block.
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
Bitcoin's costs are defrayed almost entirely by inflating the currency, as shown in this chart of the last year's income for miners. Notice that the fees are barely visible.
It has been known for at least a decade that Bitcoin's plan to phase out the inflation of the currency was problematic. In 2024's Fee-Only Bitcoin I wrote:
Our key insight is that with only transaction fees, the variance of the miner reward is very high due to the randomness of the block arrival time, and it becomes attractive to fork a “wealthy” block to “steal” the rewards therein.
So Bitcoin's security depends upon the "price" rising enough to counteract the four-yearly halvings of the block reward. In that post I made a thought-experiment:
As I write the average fee per transaction is $3.21 while the average cost (reward plus fee) is $65.72, so transactions are 95% subsidized by inflating the currency. Over time, miners reap about 1.5% of the transaction volume. The miners' daily income is around $30M, below average. This is about 2.5E-5 of BTC's "market cap".
Let's assume, optimistically, that this below average daily fraction of the "market cap" is sufficient to deter attacks and examine what might happen in 2036 after 3 more halvings. The block reward will be 0.39BTC. Let's work in 2024 dollars and assume that the BTC "price" exceeds inflation by 3.5%, so in 12 years BTC will be around $98.2K.
To maintain deterrence miners' daily income will need to be about $50M. Each day there will be about 144 blocks generating 56.16BTC or about $5.5M, which is 11% of the required miners' income. Instead of 5% of the income, fees will need to cover 89% of it. The daily fees will need to be $44.5M. Bitcoin's blockchain averages around 500K transactions/day, so the average transaction fee will need to be around $90, or around 30 times the current fee.
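The arithmetic of this thought-experiment is easy to check with a short script. All the inputs (block reward after three more halvings, 144 blocks/day, the assumed 2036 "price", the $50M deterrence requirement, and 500K transactions/day) are taken from the text above:

```python
# Back-of-the-envelope check of the 2036 fee-only thought-experiment.
block_reward_btc = 0.39      # block reward after 3 more halvings
blocks_per_day = 144         # one block every ~10 minutes
btc_price = 98_200           # assumed 2036 "price" in 2024 dollars
required_income = 50e6       # daily miner income assumed needed to deter attacks
txs_per_day = 500_000        # typical Bitcoin transaction volume

subsidy = block_reward_btc * blocks_per_day * btc_price   # dollar value of new coins
fees_needed = required_income - subsidy                   # shortfall fees must cover
fee_per_tx = fees_needed / txs_per_day

print(f"daily subsidy: ${subsidy/1e6:.1f}M ({subsidy/required_income:.0%} of requirement)")
print(f"fees needed:   ${fees_needed/1e6:.1f}M")
print(f"fee per tx:    ${fee_per_tx:.0f}")
```

Running it reproduces the numbers in the text: an $5.5M daily subsidy covering 11% of the requirement, leaving roughly $89 per transaction to be covered by fees.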
Bitcoin users set the fee they pay for their transaction. In effect they are bidding in a blind auction for the limited supply of transaction slots. Miners are motivated to include high-fee transactions in their next block. If there were an infinite supply of transaction slots miners' fee income would be zero. In practice, much of the time the supply of slots exceeds demand and fees are low. At times when everyone wants to transact, such as when the "price" crashes, the average fee spikes enormously.
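This fee-market dynamic can be illustrated with a toy model (a sketch, not the real protocol): users bid self-chosen fees, and the miner fills the limited block with the highest bids first. With spare capacity every bid clears and there is no pressure to bid high; under congestion only the top bids clear and the marginal fee spikes:

```python
import random

def clearing_fees(bids, slots):
    """Return the fees of the transactions that make it into the block:
    the miner greedily takes the highest-fee bids first."""
    return sorted(bids, reverse=True)[:slots]

random.seed(42)
slots = 100
quiet = [random.uniform(0.1, 2.0) for _ in range(80)]    # demand < supply
panic = [random.uniform(0.1, 2.0) for _ in range(1000)]  # everyone wants to transact

# With spare capacity every transaction clears regardless of its fee...
assert len(clearing_fees(quiet, slots)) == len(quiet)
# ...but under congestion only the top bids clear, so the marginal fee is high.
congested = clearing_fees(panic, slots)
print(f"lowest clearing fee under congestion: {min(congested):.2f}")
```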
Cryptocurrencies such as Bitcoin rely on a ‘proof of work’ scheme to allow nodes in the network to ‘agree’ to append a block of transactions to the blockchain, but this scheme requires real resources (a cost) from the node. This column examines an alternative consensus mechanism in the form of proof-of-stake protocols. It finds that an economically sustainable network will involve the same cost, regardless of whether it is proof of work or proof of stake. It also suggests that permissioned networks will not be able to economise on costs relative to permissionless networks.
In 2022 Ethereum switched from Proof-of-Work to Proof-of-Stake, reducing its energy consumption by around 99%. This chart shows that, like Bitcoin, until the "Merge" the costs were largely defrayed by inflating the currency. After the "Merge" the blockchain has been running on transaction fees.
Shin's Analysis
Here is a summary of Shin's analysis.
Notation
There is a continuum of validators i.
For validator i ∈ [0, 1], the cost of contributing to governance is ci > 0.
The blockchain needs at least a fraction k̂ of the validators contributing to be secure. Shin writes:
There are two special cases of note: k̂ = 1 (unanimity, corresponding to full decentralisation where every validator must participate for the blockchain to function) and k̂ = 0 which corresponds to full centralisation, where one validator has authority to update the ledger.
k̂ = 1 is impractical, lacking fault tolerance. k̂ = 0 is much more practical: it is the traditional trusted intermediary.
If the blockchain is secure, each contributing validator earns a reward p > 0. A non-contributing validator earns zero.
The validators share a common cost threshold c*. If ci < c*, validator i contributes, if ci > c* validator i does not.
Argument
Each validator will want to contribute only if enough of the other validators contribute to reach the fraction k̂, which poses a coordination problem. The case of particular interest is the validator with ci = c*. Shin writes:
Intuitively, even though the marginal validator may have very precise information about the common cost c*, the validator faces irreducible uncertainty about how many other validators will choose to contribute. It is this strategic uncertainty — uncertainty about others' actions — that is the central feature of the coordination problem.
This "strategic uncertainty" is similar to the attacker's uncertainty about other peers' actions that is at the heart of the defenses of the LOCKSS system in our 2003 paper Preserving peer replicas by rate-limited sampled voting.
Shin Figure 6
Because the marginal validator's ci = c*, it is indifferent between contributing and not contributing. Shin's Figure 6 explains this graphically. Rectangle A is the loss if k < k̂ and rectangle B is the gain if k > k̂. Setting them equal gives:
c*k̂ = (p - c*)(1 - k̂)
which simplifies to:
c* = p(1 - k̂)
Shin and Morris earlier showed that this is the unique equilibrium no matter what strategy the validators use.
Result
What this means is that successful validation depends upon the reward p being large enough so that:
p ≥ c* / (1 - k̂)
Note that the required reward p explodes as k̂ → 1. This is the central result of the paper: the more decentralised the blockchain (the higher the supermajority threshold), the higher must be the rents that accrue to validators. In the limiting case of unanimity (k̂ = 1), no finite reward can sustain the coordination equilibrium.
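The explosion is easy to tabulate. Rearranging Shin's c* = p(1 - k̂) gives the reward needed to sustain coordination at threshold k̂; with an arbitrary illustrative cost of c* = 1, the required reward grows without bound as k̂ approaches unanimity:

```python
def required_reward(c_star, k_hat):
    """Reward p sustaining coordination at threshold k_hat,
    from Shin's equilibrium condition c* = p(1 - k_hat)."""
    if k_hat >= 1:
        raise ValueError("no finite reward sustains unanimity (k_hat = 1)")
    return c_star / (1 - k_hat)

# Required reward for increasingly decentralised thresholds (c* = 1).
for k_hat in (0.0, 0.5, 0.9, 0.99, 0.999):
    print(f"k_hat = {k_hat:<6} p = {required_reward(1.0, k_hat):8.1f}")
```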
Shin Figure 1
This is yet another result showing that a reasonably secure blockchain is unreasonably expensive. The complication is that, much of the time, transactions are cheap because the demand for them is low. Thus most of the time validators are not earning enough for the risks they run. But:
When many users want to transact at the same time, they bid against each other for limited block space, and fees spike — much as taxi fares surge during rush hour. Figure 1 shows how Ethereum gas fees exhibited sharp spikes during periods of network congestion, such as during surges in decentralised finance (DeFi) activity or spikes in the minting of non-fungible tokens (NFTs). These spikes are not merely a reflection of excess demand; they are the mechanism through which the blockchain extracts the rents needed to sustain validator coordination.
Note that these spikes mean that the majority of the time fees are low but the majority of transactions face high fees. It is this "user experience" that drives the fragmentation that Shin describes:
When demand for block space is high, fees rise and validators are well compensated. But high fees deter users, especially those making small or routine transactions. These users are the first to migrate to competing blockchains that offer lower fees — blockchains that can offer lower fees precisely because they have lower coordination thresholds (and hence less security). The users who remain on the more secure blockchain are those with the highest willingness to pay: institutions, large DeFi protocols, and transactions where security and censorship resistance are paramount. This sorting of users across blockchains is the essence of fragmentation.
The fragmentation argument is the flipside of blockchain's "scalability trilemma," as described by Vitalik Buterin, who posed the problem as the impossibility of attaining, simultaneously, a ledger that is decentralised, secure, and scalable.
It is worth noting that Buterin's trilemma is a version for PoS of the trilemma Markus K Brunnermeier and Joseph Abadi introduced for PoW in 2018's The economics of blockchains. See The Blockchain Trilemma for details.
Shin's focus is primarily on the effects of fragmentation on stablecoins. He notes that:
Rather than converging on a single platform, stablecoin activity is scattered across many chains (Figure 4). As of late 2025, Ethereum held the majority of total stablecoin supply but was facing competition from Tron and Solana, each of which had attracted tens of billions of dollars in stablecoin balances. Each chain serves different geographies and use cases: Ethereum for institutional settlement, Tron for low-cost remittances, Solana for retail payments and DeFi activity.
This fragmentation among blockchains would not matter much if stablecoins were interoperable between them, but they are confined to the blockchain on which they were minted:
A USDC token on Ethereum is not the same as a USDC token on Solana — they exist on separate ledgers that have no native way of communicating with each other. Transferring between chains requires the use of bridges: specialised software protocols that lock tokens on one chain and issue equivalent tokens on another. These bridges introduce additional risks, including vulnerabilities in the smart contract code — bridge exploits have accounted for billions of dollars in cumulative losses — and they impose costs and delays that undermine the seamless transferability that is the hallmark of money. The result is a landscape in which stablecoins from the same issuer exist in multiple, non-fungible forms across different blockchains, fragmenting liquidity and undercutting the network effects that should be the strength of a widely adopted payment instrument.
Discussion
As I've been pointing out since 2014, very powerful economic forces mean that Decentralized Systems Aren't. So the users paying for the more expensive transactions because they believe in decentralization aren't getting what they pay for.
Coinbase Global Inc. is already the second-largest validator ... controlling about 14% of staked Ether. The top provider, Lido, controls 31.7% of the staked tokens,
That is 45.7% of the total staked controlled by the top two.
In addition all these networks lack software diversity. For example, as I write the top two Ethereum consensus clients have nearly 70% market share, and the top two execution clients have 82% market share.
Shin writes as if more decentralization equals more security even though it doesn't happen in practice, but this isn't really a problem. What the users paying the higher fees want is more security, and they are probably getting it, because they are paying higher fees. As I discussed in Sabotaging Bitcoin, the reason major blockchains like Bitcoin and Ethereum don't get attacked is not because the (short-term) rewards for an attack are less than the cost. It is rather that everyone capable of mounting an attack is making so much money that:
those who could kill the golden goose don't want to.
In any case what matters for Shin's analysis isn't that the users actually get more security for higher fees, but that they believe they do. Like so much in the cryptocurrency world, what matters is gaslighting. But the chart of Ethereum losing market share demonstrates that security is not a concern for the typical user.
I revisited an old Go package I've been using over the past few years to build IIIF manifests — nothing fancy, just some glue around structs and JSON. From that I built a new CLI, mkiiif, to generate IIIF manifests from static images (tiled or not). There are plenty of similar tools out there (iiif-tiler, tile-iiif, biiif, ...) but none quite matched the CLI ergonomics I needed for my daily workflow.
I moved the library to this new repository atomotic/iiif. The tool mkiiif can be installed with Go:
go install github.com/atomotic/iiif/cmd/mkiiif@latest
mkiiif can generate an IIIF manifest from a source directory containing images, or from a PDF file that gets exploded and converted to images via mupdf. Output images can be either untiled or static tiles generated with vips. Both approaches produce an IIIF Level 0 compliant layout: static files that can be served from any HTTP server, with no image server required. Untiled is less efficient for large images but perfectly fine for printed books, papers, and similar material.
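To give a feel for what the Level 0 layout contains, here is a hand-rolled sketch of a minimal IIIF Presentation 3 manifest for a single untiled image. This is not mkiiif's actual output (mkiiif is written in Go, and its field values will differ); the identifiers and image names below are hypothetical:

```python
import json

def minimal_manifest(base, ident, title, images):
    """Build a bare-bones IIIF Presentation 3 manifest: one Canvas per image,
    each painted by an Annotation pointing at a static image file."""
    root = f"{base}/{ident}"
    canvases = []
    for i, (name, w, h) in enumerate(images, 1):
        canvas_id = f"{root}/canvas/{i}"
        canvases.append({
            "id": canvas_id, "type": "Canvas", "width": w, "height": h,
            "items": [{
                "id": f"{canvas_id}/page", "type": "AnnotationPage",
                "items": [{
                    "id": f"{canvas_id}/page/anno", "type": "Annotation",
                    "motivation": "painting",
                    "body": {"id": f"{root}/{name}", "type": "Image",
                             "format": "image/jpeg", "width": w, "height": h},
                    "target": canvas_id}]}]})
    return {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{root}/manifest.json", "type": "Manifest",
        "label": {"en": [title]}, "items": canvases}

m = minimal_manifest("https://digital.library.org", "iiif01", "Example book",
                     [("page-001.jpg", 2000, 3000)])
print(json.dumps(m, indent=2)[:120])
```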
mupdf and vips are external dependencies that need to be installed separately. They are invoked via subprocess; I chose not to add Go library wrappers around them to keep the tool simple. WASM ports of both may become viable in the future.
The CLI usage:
Usage of mkiiif:
  -base string
        Base URL where the manifest will be served (e.g. https://example.org/iiif)
  -destination string
        Output directory; a subdirectory named <id> will be created inside it, containing the images and manifest.json
  -id string
        Unique identifier for the manifest (e.g. book1)
  -resolution int
        Resolution (DPI) used when converting PDF pages to images via mutool (default 150)
  -source string
        Path to a directory of images or a PDF file to convert
  -tiles
        Generate IIIF image tiles for each image using vips dzsave (requires vips)
  -title string
        Human-readable title of the manifest
The directory can then be served from https://digital.library.org.
I've adopted this URL scheme:
https://{base}/{id}
  /manifest.json — the IIIF manifest
  /index.html — a simple viewer
So in the example above, https://digital.library.org/iiif01 opens a full viewer to browse the object. The viewer used is Triiiceratops — the newest viewer in the IIIF ecosystem. Built on Svelte and OpenSeadragon, it is still young, but very usable, lightweight, and easy to embed and customize. It is my favourite viewer.
mkiiif doesn't handle metadata for now (and probably won't) — the manifest can be easily patched to insert descriptive metadata in a later step, after image preparation, pulling from any existing datasource or metadata catalog.
The main drawback of generating IIIF this way is that you end up managing a large number of files on the filesystem, and handling millions of small image tiles can be slow (and costly). This is where IIIF intersects — and overlaps — with similar practices in digital preservation, such as BagIt, OCFL, and WARC/WACZ. So far there's no specification or viewer implementation that handles IIIF containers (e.g. a zip file bundling images, tiles, and the manifest). There have been discussions on this in the past; I've recently been looking at analogous approaches like GeoTIFF and SZI.
A static IIIF bundle generated with this CLI still needs to be served from an HTTP server, with the base URL defined at derivation time. Could such a bundle be opened from localhost and viewed directly in the browser? Service Workers might help here (even if HTTP is still needed), but it's a rabbit hole I haven't explored yet.
The CLI is pretty bare-bones — feel free to suggest improvements or report bugs. I've been using it over the past weeks as part of a personal project: an amateur digital library built around a DIY book scanner I assembled at home, to preserve magazines, zines, and similar material (content NSFW and out of scope to link here).
Artificial intelligence (AI) has transformed nearly every field. Today, we can access and train models that generate text, images, sound, video, and code. This transformation is reshaping how we think, analyze, and preserve information. Yet, despite the rapid growth of AI, its use for analyzing web archive content seems to advance at a slower pace.
Web archiving is the process of collecting, preserving, and providing access to web content over time, where a memento represents a previous version of a web resource as it existed at a specific moment in the past. Much of the recent work within the web archiving community (e.g., [1], [2], [3]) has focused on making the archiving process itself more intelligent, integrating AI into tasks such as web crawling, storage optimization, and metadata generation. In contrast, the application of AI to the analysis of already archived web content has received comparatively less attention. This gap represents a great opportunity for innovation and contribution, particularly as web archives continue to grow in size, diversity, and historical importance.
In this blog, I aim to outline (based on my perspective, analysis, preliminary work, and insights gained during my PhD candidacy exam) opportunities for where AI could play a role, as well as key challenges involved in integrating AI into web archiving.
My Preliminary Work
Since I joined the PhD program at ODU in 2023 (Blog post introducing myself) under the supervision of Dr. Michele C. Weigle, my work has focused on the intersection of web archiving and AI, with a particular emphasis on leveraging Large Language Models (LLMs) through Retrieval-Augmented Generation (RAG) to detect and interpret text changes across mementos. Identifying the exact moment when content was modified often requires carefully comparing multiple archived versions, a process that can be both tedious and time-consuming. Moreover, detecting and analyzing where important changes occur is not a straightforward process. Users often need to select a subset of captures from thousands available, and even then, there is no guarantee that the differences they find will be meaningful or important. Traditional approaches to memento change analysis, such as lexical comparisons and indexing (e.g., [4], [5]), focus on showing the deletion or addition of terms or phrases but ignore semantic context. As a result, they miss subtle shifts in meaning and rely heavily on human interpretation.
My early work resulted in a paper titled “Exploring Large Language Models for Analyzing Changes in Web Archive Content: A Retrieval-Augmented Generation Approach,” coauthored with Lesley Frew, Dr. Jose J. Padilla, and Dr. Michele C. Weigle. The results of this initial exploration demonstrated that an LLM, when combined with tools such as RAG over a set of mementos, can effectively retrieve and analyze changes in archived web content. However, it remains necessary to constrain the analysis to distinguish between important and non-important changes. Building on this, I have been developing a pipeline to automatically determine whether a change alters meaning or context and should be considered significant. This aims to reduce manual effort, cognitive load, and support integration into web archive systems while advancing methods for analyzing archived web content at scale.
My PhD Candidacy Exam
During the summer of 2025, I passed my PhD candidacy exam (pdf, slides). This milestone marked an important transition in my doctoral studies and provided an opportunity to reflect on my preliminary work, learn, and identify new ways to contribute to the intersection of AI and web archiving. In my candidacy exam, I reviewed a set of ten papers related to analyzing changes and temporal coherence in archived web pages and websites. Changes refer to any modifications observed in web content over time, including the addition, deletion, or alteration of text, images, structure, or other embedded resources. Temporal coherence, on the other hand, refers to the degree to which all components of an archived web page (such as HTML, text, images, and stylesheets) or website (such as interconnected pages and resources) were captured close enough in time to accurately represent how it appeared and functioned at a specific moment. A lack of temporal coherence can result in inconsistencies in how the archived page or site looks or behaves, which may affect the accuracy of change analysis.
Figure 2. A moment from my PhD candidacy exam, where I presented a ten-paper review on analyzing changes and temporal coherence in archived web pages and websites.
AI in Web Archiving: Opportunities
Over time, several researchers have addressed the analysis of changes and temporal coherence in web archives; however, the use of AI in this context has been limited. Below, I outline some research opportunities and challenges based on insights gained from my preliminary work and candidacy exam on how AI could play a role in these activities.
Topic Drift
AlNoamany et al. [6] studied web archive collections to identify off-topic pages within TimeMaps, which occur when a webpage that was originally relevant to a collection later changes into unrelated content. For example, in a collection about the 2003 California Recall Election (Figure 3), the site johnbeard4gov.com initially supported candidate John Beard (September 24, 2003) but later transformed into an unrelated adult-oriented page (December 12, 2003), making it irrelevant to the collection. To detect such changes, AlNoamany et al. proposed automated methods including text-based similarity metrics (cosine similarity, Jaccard similarity, and term overlap), a kernel-based method using web search context, and structural features such as changes in page length and word count. Using manually labeled TimeMap versions as ground truth, they found that the best performance was achieved by combining TF-IDF cosine similarity with word-count change.
Figure 3. Example of johnbeard4gov.com going off-topic. The first capture (September 24, 2003) shows the site supporting a California gubernatorial candidate, while the later capture (December 12, 2003) shows the domain transformed into unrelated adult-oriented content. Source: AlNoamany et al. [6]
Recent advances in AI and representation learning offer opportunities to enhance off-topic detection in web archives beyond traditional term frequency measures. Instead of relying on TF-IDF, future approaches could use dense semantic embeddings from transformer models to better capture meaning and context, enabling the detection of more subtle topic drift. Comparing embedding-based similarity with the methods proposed by AlNoamany et al. could help determine which approach is more effective, particularly when topic shifts are not immediately apparent.
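The term-frequency baseline that dense embeddings would replace can be sketched in a few lines. Here is a simplified bag-of-words cosine similarity (the real AlNoamany et al. method adds IDF weighting and a word-count-change feature; the page texts below are invented stand-ins for the mementos in Figure 3):

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def cosine_similarity(a, b):
    """Cosine similarity between term-frequency vectors of two texts."""
    va, vb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Hypothetical memento texts: two on-topic captures and one after the drift.
on_topic  = "John Beard for governor: recall election campaign news and events"
later     = "John Beard campaign: governor recall election volunteer events"
off_topic = "cheap domain for sale adult content click here"

print(cosine_similarity(on_topic, later))      # high: still on topic
print(cosine_similarity(on_topic, off_topic))  # near zero: topic drift
```

An embedding-based variant would swap the `Counter` vectors for transformer sentence embeddings, which is exactly where subtle drift (same words, different meaning) should show up that this lexical measure misses.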
Temporal Coherence
Weigle et al. [7] highlight a key challenge in modern web archiving: many sites, such as CNN.com, rely on client-side rendering, where the server delivers basic HTML and JavaScript that later fetch dynamic content (often JSON) through API calls. Traditional crawlers like Heritrix do not execute JavaScript or consistently capture these dynamic resources, leading to temporal violations in which archived HTML and embedded JSON files have different capture times, potentially misrepresenting events or news stories. The issue is illustrated in Figure 4, which shows archived CNN.com pages captured between September 2015 and July 2016. The top row displays pages replayed in the Wayback Machine that show the same top-level headline despite being captured months apart. The bottom row shows mementos from the same dates with the correct top-level headlines; however, the second-level stories remain temporally inconsistent.
By measuring time differences between base HTML captures and embedded JSON resources using CNN.com pages (September 2015–July 2016), Weigle et al. identified nearly 15,000 mementos with mismatches exceeding two days. They conclude that browser-based crawlers best reduce such inconsistencies, though due to their higher cost and slower performance, they recommend deploying them selectively for pages that depend on client-side rendering.
Figure 4. Example of temporal coherence violation in archived CNN.com pages using client-side rendering. Source: Weigle et al. [7].
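The measurement itself is straightforward to sketch: compare the 14-digit Memento-style timestamp of the base HTML capture with those of its embedded JSON resources and flag any mismatch over a threshold (Weigle et al. used two days; the capture times below are hypothetical):

```python
from datetime import datetime, timedelta

def parse_memento_ts(ts: str) -> datetime:
    """Parse a 14-digit archive timestamp, e.g. '20160115120000'."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")

def temporal_violations(html_ts, resource_ts_list, threshold=timedelta(days=2)):
    """Return the timestamps of embedded resources captured more than
    `threshold` away from the base HTML capture."""
    base = parse_memento_ts(html_ts)
    return [ts for ts in resource_ts_list
            if abs(parse_memento_ts(ts) - base) > threshold]

# One archived page and two of its embedded JSON captures (invented values).
bad = temporal_violations("20160115120000",
                          ["20160115120500",   # minutes apart: coherent
                           "20151001090000"])  # months apart: violation
print(bad)
```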
AI can enhance existing approaches to temporal coherence in web archives, such as those proposed by Weigle et al., by helping identify pages that depend on client-side rendering. For example, a machine learning model could be fine-tuned to analyze the initial HTML and related resources to detect signals such as empty or minimally populated DOM structures and classify whether a webpage relies on client-side rendering. AI-based analysis could also estimate the proportion of JavaScript relative to textual content and detect patterns associated with common client-side frameworks. Combined with indicators such as API endpoints referenced in scripts, these features can be used to flag pages that are unlikely to render correctly with traditional crawlers and may require browser-based crawling.
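One of the signals described above, an initial HTML payload that is mostly script with a near-empty DOM, can be computed without any machine learning at all; a trained classifier would combine many such features. A hedged stdlib-only sketch (the threshold and the sample pages are invented for illustration):

```python
from html.parser import HTMLParser

class RenderSignals(HTMLParser):
    """Count characters of script content vs. rendered text in raw HTML."""
    def __init__(self):
        super().__init__()
        self.in_script = False
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.script_chars += len(data)
        else:
            self.text_chars += len(data.strip())

def likely_client_side_rendered(html, threshold=0.8):
    """Flag pages whose initial HTML is dominated by script content."""
    p = RenderSignals()
    p.feed(html)
    total = p.script_chars + p.text_chars
    return total > 0 and p.script_chars / total > threshold

# A single-page-app shell vs. a plain server-rendered article (both invented).
spa = ('<html><body><div id="root"></div><script>'
       + 'fetch("/api/news");' * 50 + '</script></body></html>')
article = ("<html><body><h1>Headline</h1><p>"
           + "Plain article text. " * 50 + "</p></body></html>")

print(likely_client_side_rendered(spa))      # True: flag for browser-based crawl
print(likely_client_side_rendered(article))  # False: a traditional crawler suffices
```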
AI for Enhancing Web Archive Interfaces
While platforms such as Google and others have begun integrating AI into their user interfaces, web archives have largely remained unchanged in this respect. This is notable given the potential of AI to make web archive interfaces more intuitive and more informative for a wide range of users. For example, as my preliminary work suggests, when analyzing content changes, users currently must manually browse long lists of captures or compare multiple archived versions of a webpage. AI could instead automatically identify moments when important changes occur and direct users’ attention to those points in time.
Along the same line, the Internet Archive’s Wayback Machine provides a “Changes” feature that highlights deletions and additions between two snapshots and a calendar view where color intensity reflects the amount of variation. However, this variation is based on the quantity of changes rather than their significance. As a result, many small edits may appear more important than fewer but meaningful modifications. An AI-enhanced interface could address this limitation by incorporating semantic change detection. For instance, a calendar view that highlights when the meaning or message of a page changes can make large-scale temporal analysis more efficient and accessible. Moreover, users could ask natural-language questions such as “When did this page change its message?” or “What were the major updates during a specific period?” and receive concise, understandable answers.
AI could also guide users through large collections by recommending related pages, explaining why certain versions are relevant, or warning when an archived page may contain temporally inconsistent content. For non-experts, visual aids generated by AI, such as timelines, change highlights, or short explanations, could make complex web archive data easier to interpret.
AI in Web Archiving: Challenges
While there are opportunities for AI integration into web archiving, there are also challenges that must be considered.
Technical Challenges
From a technical standpoint, I identified three primary challenges regarding using AI for analyzing archived web content. The first concerns the nature of archived web data. Web archiving systems typically store collected content using the Web ARChive (WARC) format. Each WARC file stores complete HTTP response headers, HTML content, and additional embedded resources such as images and JavaScript files. Although this format provides a structure and allows long-term preservation, it is verbose and was not designed to support AI-based analysis. Consequently, researchers must perform extensive parsing and preprocessing before AI models can effectively use archived web content.
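To make the preprocessing burden concrete, here is a minimal sketch of pulling the record headers and payload out of a single well-formed WARC response record. Real pipelines use a library such as warcio; this toy parser handles only the simplest case, and the record below is a synthetic example:

```python
def parse_warc_record(raw: bytes):
    """Split one WARC record into (version, header dict, payload bytes).
    The WARC header block ends at the first blank line."""
    head, _, payload = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]  # e.g. "WARC/1.0"
    headers = dict(line.split(": ", 1) for line in lines[1:] if ": " in line)
    return version, headers, payload

# A synthetic response record: WARC headers, then the captured HTTP message.
record = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: https://example.com/\r\n"
    b"WARC-Date: 2023-05-01T12:00:00Z\r\n"
    b"Content-Length: 47\r\n"
    b"\r\n"
    b"HTTP/1.1 200 OK\r\n\r\n<html><body>hi</body></html>"
)

version, headers, payload = parse_warc_record(record)
print(version, headers["WARC-Target-URI"], headers["WARC-Date"])
```

Even this trivial case shows the layering an AI pipeline must strip away: WARC headers, then HTTP headers, then markup, before any model sees usable text.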
Second, many web archives, such as the Internet Archive’s Wayback Machine, prioritize long-term storage and preservation over indexing and large-scale content retrieval. As a result, a single web page may have hundreds or even thousands of archived versions over time. Building and maintaining large-scale vector indexes over such temporally dense collections quickly becomes computationally expensive and, in many cases, impractical.
Third, even when working with controlled data scenarios, such as curated web archive collections, AI-driven analysis still depends on the availability of ground truth for evaluation and validation. For instance, training models to detect significant changes across mementos would require large-scale, high-quality annotations that capture not only what changed, but whether those changes meaningfully affect content interpretation. At present, no large-scale annotated datasets exist that support systematic analysis of change significance across archived web versions, creating a major barrier to training and evaluating AI models in this domain.
Ethical Challenges
Beyond technical limitations, the integration of AI into web archive analysis raises important ethical challenges. For instance, web archives preserve content as it existed at specific points in time, often without the consent or awareness of content creators or the individuals represented in that content. When AI models analyze archived web data, they may surface, reinterpret, or amplify sensitive information that was never intended to be reused in new analytical contexts. For this reason, it is important to carefully consider how AI is applied within web archiving. I contend that AI should be viewed as a complementary tool, one that supports, rather than replaces, human judgment. For example, AI can assist in identifying potential moments of relevant changes, flagging or summarizing them, while humans interpret the results and make decisions.
It is also important to note that recent debates highlight growing tensions between web archives and content owners regarding the use of archived data for AI training and analysis. For example, major news publishers have begun restricting access to resources like the Internet Archive due to concerns that archived content is being used for large-scale AI scraping without compensation or consent [8]. In response to such restrictions, researchers and practitioners—including Mark Graham, Director of the Wayback Machine—have argued that limiting access to web archives poses a significant risk to the preservation of digital history [9]. From this perspective, the primary concern is not excessive access, but rather the potential loss of the web as a historical record if archiving efforts are weakened.
Conceptual Challenges
AI models, particularly LLMs, typically operate on individual snapshots of data. As a result, they are not inherently designed to reason about evolution, temporal coherence, or change over time in archived web content. Consequently, answers to temporally grounded questions should not be expected by default when these models are applied without additional structure or context.
In static analysis scenarios, AI models can perform effectively. For example, given a single archived web page, an LLM can generate a summary, identify main topics, extract named entities, or analyze embedded resources such as images, videos, or scripts. Temporal analysis in web archiving, however, requires a different mode of reasoning. The central questions are not “What does this page say?” or “What is this page about?” but rather “What changed?”, “When did it change?”, “Why did it happen?”, and “What impact does the change have over time?” Answering these questions requires comparing multiple archived versions, reasoning based on context, and perhaps correlating changes across web pages.
Integrating AI into web archiving is therefore not only about efficiency, but about enabling new forms of discovery. This requires clearly defining desired outcomes and using AI to support or accelerate processes that have traditionally been manual.
Final Reflections
To conclude, I would like to leave the reader with a set of open questions as we continue moving toward the integration of AI in web archiving. One of the most visible changes introduced by AI is the ability to go beyond syntactic analysis and begin exploring semantic analysis, where meaning, context, and interpretation matter. This shift is not about replacing existing techniques, but about expanding the types of questions we can ask when working with web archive data.
I contend that traditional algorithms remain essential for many web archiving tasks. They are precise, transparent, and well understood. AI, by contrast, offers strengths in areas where rules struggle: interpreting context, assessing relevance, and reasoning across multiple versions of content. Rather than framing this as a competition between algorithms and AI, a more productive question is how these approaches can complement one another, and in which parts of the analysis pipeline each is most appropriate.
In the short term, I believe AI tools are unlikely to replace algorithmic methods. However, they already show promise as assistive tools that can guide analysis, prioritize attention, and help humans reason about large and complex temporal collections. This naturally raises a forward-looking question: if AI continues to improve in its ability to reason about time, meaning, and change, how should the web archiving community adapt its tools, workflows, and standards?
The WARC format has proven effective for long-term preservation, but it was not designed with AI-driven analysis in mind. Should we aim to augment existing archival formats with AI-aware representations, or should we focus on developing AI methods that better adapt to current standards such as WARC? How we answer this will shape not only how we analyze web archives, but also how future generations access and understand the web's past.
References
[1] A. A. AK, “AI driven web crawling for semantic extraction of news content from newspapers,” Scientific Reports, 2025. https://doi.org/10.1038/s41598-025-25616-x.
[2] M. F. Abrar, M. Saqib, A. Alferaidi, T. S. Almuraziq, R. Uddin, W. Khan, and Z. H. Khan, “Intelligent web archiving and ranking of fake news using metadata-driven credibility assessment and machine learning,” Scientific Reports, 2025. https://doi.org/10.1038/s41598-025-31583-0.
[3] A. Nair, Z. R. Goh, T. Liu, and A. Y. Huang, “Web archives metadata generation with GPT-4o: Challenges and insights,” arXiv, Tech. Rep. arXiv:2411.05409, Nov. 2024. https://arxiv.org/abs/2411.05409.
[4] L. Frew, M. L. Nelson, and M. C. Weigle, “Making Changes in Webpages Discoverable: A Change-Text Search Interface for Web Archives,” in Proceedings of the 23rd ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), 2023, pp. 71–81. https://doi.org/10.1109/JCDL57899.2023.00021.
[5] T. Sherratt and A. Jackson, GLAM-Workbench/web-archives, https://zenodo.org/records/6450762, version v1.1.0, Apr. 2022. DOI: 10.5281/zenodo.6450762.
[6] Y. AlNoamany, M. C. Weigle, and M. L. Nelson, “Detecting off-topic pages within timemaps in web archives,” International Journal on Digital Libraries, vol. 17, no. 3, pp. 203–221, 2016. https://doi.org/10.1007/s00799-016-0183-5.
[7] M. C. Weigle, M. L. Nelson, S. Alam, and M. Graham, “Right HTML, wrong JSON: Challenges in replaying archived webpages built with client-side rendering,” in Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), Jun. 2023, pp. 82–92. https://doi.org/10.1109/JCDL57899.2023.0002.
Figure 1: Each tweet ID is a unique identifier that encodes the tweet creation timestamp, example adapted from Snowflake ID, Wikipedia.
Web archives, such as the Wayback Machine, are indexed by URL. For example, if we want to search for a tweet we must first know its URL. Figure 2 demonstrates that searching for a tweet URL results in a timemap of that tweet archived at different points in time. Clicking on a particular datetime will show the archived tweet at that particular point in time.
Figure 2: An archived tweet URL results in a timemap consisting of archived copies of the tweet.
Figure 3 shows a screenshot of a tweet shared by @_llebrun. The tweet in the screenshot was originally posted by @randyhillier who later deleted his tweet. The screenshot of the tweet does not have the tweet's URL on the image. Moreover, when a tweet is deleted, we will not be able to find the tweet URL on the live web, nor will we know how to look it up in the archive.
Figure 3: @_llebrun tweeted a screenshot of a tweet originally posted by @randyhillier, who later deleted his tweet.
Therefore, we need to construct the URL of a tweet using only the information present in the screenshot. The structure of a tweet URL is: https://twitter.com/<Twitter_Handle>/status/<Tweet_ID>
We need the Twitter_Handle and Tweet_ID to construct a tweet URL. Each tweet ID is a unique identifier known as the Snowflake ID that encodes the tweet creation timestamp (Figure 1). We can extract the Twitter handle and timestamp from the tweet in the screenshot. In our previous tech report, we introduced methods for extracting Twitter handles and timestamps from Twitter screenshots. Next, we need to determine the tweet ID from the extracted timestamp. We could query the Wayback Machine with only the Twitter handle, but individually dereferencing all of a user's archived tweets would be an exhaustive task. For example, the following curl command shows that the total number of archived tweets to dereference for @randyhillier's status URLs is huge (42,053). Hence, our goal is to limit the search space by utilizing the timestamp present in the screenshot.
Previously, one could query Twitter to find the timestamp of a tweet given a tweet ID, but this service is no longer freely available. The Twitter API has access rate limits, and metadata from deleted, suspended, or private tweets cannot be accessed through it. Moreover, the Twitter API is now monetized and no longer research-friendly. To address these issues, WS-DL members Mohammed Nauman Siddique and Sawood Alam developed the TweetedAt web service in 2019. The service extracts timestamps from Snowflake IDs and estimates timestamps for pre-Snowflake IDs, making TweetedAt a useful tool for finding the timestamp of a tweet ID. Here, however, we need the reverse: a tweet ID prefix determined from a given timestamp.
Reverse TweetedAt
The Snowflake service generates a tweet ID, a 64-bit unsigned integer composed of a 41-bit timestamp, a 10-bit machine ID, a 12-bit machine sequence number, and 1 unused sign bit. The timestamp occupies only the upper 41 bits (directly below the sign bit).
TweetedAt determines the timestamp for a tweet ID by right-shifting the tweet ID by 22 bits and adding the Twitter epoch offset of 1288834974657 milliseconds.
For Reverse TweetedAt, given a datetime, we want to generate a tweet ID prefix by subtracting the offset and left-shifting by 22 bits. This will not reconstruct the exact tweet ID, because the lower 22 bits are all zeros, but it does give us a tweet ID prefix for a timestamp. For example, the tweet ID of @randyhillier's tweet is ‘1495226962058649603’ and the timestamp is ‘9:41 PM Feb 19, 2022’, as shown in Figure 3. The tweet ID has 19 digits, while the timestamp is at minute-level granularity; from that minute-level timestamp, Reverse TweetedAt computes the 6-digit tweet ID prefix ‘149522’ for the 19-digit tweet ID ‘1495226962058649603’.
Python code to get tweet ID prefix from a Wayback timestamp
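The two conversions described above can be sketched as follows. This is a minimal illustration rather than the TweetedAt source code: the function names are ours, and we assume the 14-digit Wayback timestamp is interpreted as UTC.

```python
from datetime import datetime, timezone

# Twitter epoch offset in milliseconds, as used by TweetedAt.
TWITTER_EPOCH_MS = 1288834974657

def tweet_id_to_ms(tweet_id: int) -> int:
    """TweetedAt direction: right-shift 22 bits, add the Twitter epoch offset."""
    return (tweet_id >> 22) + TWITTER_EPOCH_MS

def ms_to_tweet_id(unix_ms: int) -> int:
    """Reverse TweetedAt direction: subtract the offset, left-shift 22 bits.
    The lower 22 bits are zero, so this is a prefix (lower bound), not an exact ID."""
    return (unix_ms - TWITTER_EPOCH_MS) << 22

def wayback_to_tweet_id(wayback_ts: str) -> int:
    """Tweet ID lower bound for a 14-digit Wayback timestamp (assumed UTC),
    e.g. '20220219214100' yields a 19-digit lower-bound ID."""
    dt = datetime.strptime(wayback_ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return ms_to_tweet_id(int(dt.timestamp() * 1000))
```

Because the lower 22 bits are zeroed, `ms_to_tweet_id` returns the smallest ID that could have been minted in that millisecond; the leading decimal digits of that value are the tweet ID prefix.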
We integrated Reverse TweetedAt as a web service alongside TweetedAt. The service accepts a timestamp as user input and returns the corresponding tweet ID prefix, tweet ID regex, and full tweet ID range (Figure 4). It supports multiple valid timestamp formats (e.g., ISO 8601, RFC 1123, Wayback) and provides output at different levels of granularity. For example, Figure 4 shows output for millisecond-level granularity. Because millisecond-level precision is typically unavailable in tweet timestamps, the tool can interpret such inputs at second- or minute-level granularity. Rather than assuming zeros for unknown fields, the tool expands the input into the full corresponding time window (e.g., an entire second or minute), and computes the tweet ID prefix over that interval.
Figure 4: Reverse TweetedAt outputs tweet ID prefix at millisecond-level granularity.
Figure 5: Reverse TweetedAt outputs tweet ID prefix at second-level granularity.
Figure 6: Reverse TweetedAt outputs tweet ID prefix at minute-level granularity.
Tweet ID Regex-based Retrieval Across Temporal Granularity
We can use the tweet ID regex derived from a timestamp to search for archived tweets within a specific temporal window. By querying the Wayback Machine’s CDX API and filtering results using this prefix-based regex, we can identify tweet URLs whose IDs fall within the calculated range. As the timestamp becomes less precise, the tweet ID becomes shorter and the regex search space widens.
For example, the tweet ID of @randyhillier’s tweet shown in Figure 3 is ‘1495226962058649603.’ Using TweetedAt, we can get the timestamp at millisecond-level granularity. Using Reverse TweetedAt, the millisecond-level granularity returns a more precise prefix and results in 10 archived captures, while a slightly less precise prefix (second-level granularity) returns 15. When the precision is reduced further (minute-level granularity), the number of results remains 15. This indicates that all tweets within that broader time window were posted within the same narrower interval. This illustrates how lower temporal granularity expands the potential search space. However, a wider ID range does not necessarily produce more results; it only increases the number of possible candidate IDs.
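The way a coarser timestamp yields a shorter shared prefix can be sketched in Python. This is an illustrative sketch, not the service's code: the function name is ours, and the Unix-millisecond minute boundary below is derived from the example tweet ID and assumed for illustration.

```python
import os

# Twitter Snowflake epoch offset in milliseconds.
TWITTER_EPOCH_MS = 1288834974657

def id_range_for_window(start_ms: int, end_ms: int):
    """Smallest and largest tweet IDs whose creation time falls in [start_ms, end_ms]."""
    lo = (start_ms - TWITTER_EPOCH_MS) << 22
    hi = ((end_ms + 1 - TWITTER_EPOCH_MS) << 22) - 1
    return lo, hi

# The UTC minute containing @randyhillier's tweet, in Unix milliseconds
# (2022-02-20 02:41 UTC, i.e. 9:41 PM EST Feb 19; derived from the tweet ID):
minute_start = 1645324860000
lo, hi = id_range_for_window(minute_start, minute_start + 59999)

# The shared leading digits of the boundary IDs form the minute-level prefix.
prefix = os.path.commonprefix([str(lo), str(hi)])  # "149522"
```

Any archived status URL whose tweet ID begins with this shared prefix is a candidate capture for that minute, which is exactly how the regex search space is bounded.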
CDX API Wildcard Search and Snowflake IDs to Limit the Search Space Using Tweet ID Prefix
We can now determine a tweet ID prefix from a screenshot timestamp using the Reverse TweetedAt service. Since a tweet can be archived at any time within ±26 hours of the screenshot timestamp, we can determine tweet ID prefixes from the boundary timestamps of that window. We can use this time window to limit the search space by excluding the URLs of tweets posted before or after the alleged timestamp. Let us consider the tweet in the screenshot in Figure 3, where the screenshot timestamp is:
9:41 PM · Feb 19, 2022 (20220219214100)
We compute the tweet ID prefixes from the left-hand boundary (−26 hours) and right-hand boundary (+26 hours) timestamps using Reverse TweetedAt, listed below:
-26 hours timestamp: 20220218194100 → tweet ID prefix: 14947588
+26 hours timestamp: 20220220234100 → tweet ID prefix: 149554404
As previously mentioned, the timestamp occupies only the upper 41 bits of the tweet ID. We can use the common portion of the two tweet ID prefixes (149[4-5]) and do a CDX API wildcard search in the Wayback Machine to limit the search space. The search space reduces to 629 archived tweets, whereas using only the Twitter handle yields 42,053. Dereferencing 629 archived tweets to search for the particular tweet text of a screenshot is a lot of work but feasible, whereas dereferencing 42,053 archived tweets is far too expensive. The following curl command shows that the total number of archived tweets to dereference for @randyhillier's status URLs with a common tweet ID prefix is comparatively small (629).
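Deriving the common character class from the two boundary prefixes, and building the corresponding CDX query, could be sketched as follows. The function name and query parameters are ours; the query only illustrates the Wayback Machine CDX server's `url` and `filter` parameters and is not fetched here.

```python
from urllib.parse import urlencode

def common_prefix_pattern(lo: str, hi: str) -> str:
    """Collapse two boundary tweet ID prefixes into a regex character class,
    e.g. '14947588' and '149554404' -> '149[4-5]'."""
    out = ""
    for a, b in zip(lo, hi):
        if a == b:
            out += a
        else:
            # First differing digit: cover the whole range between the boundaries.
            return out + f"[{a}-{b}]"
    return out

pattern = common_prefix_pattern("14947588", "149554404")  # -> '149[4-5]'

# Illustrative CDX API query restricted to status URLs sharing that prefix;
# 'filter=original:' matches the original URL against a regex.
query = "https://web.archive.org/cdx/search/cdx?" + urlencode({
    "url": "twitter.com/randyhillier/status/*",
    "filter": f"original:.*/status/{pattern}.*",
})
```

Each line of the CDX response would then name a candidate capture to dereference, rather than every archived status URL for the account.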
It is easy to search for a tweet in the Wayback Machine when you know its URL, but a screenshot of a tweet typically does not include the URL. However, the Twitter handle and timestamp visible in the screenshot can be used to search the Wayback Machine. Given a datetime, Reverse TweetedAt produces a tweet ID prefix, which we can then use to grep through a CDX API response of all tweets associated with a Twitter account. Using Reverse TweetedAt, we can determine approximate tweet IDs for the left-hand and right-hand boundary timestamps derived from a screenshot timestamp, and we found that a CDX API wildcard search on the common tweet ID prefix limits the search space. Thus, the process of finding candidate archived tweets for the tweet in a screenshot is optimized. We published a paper at the 36th ACM Conference on Hypertext and Social Media, “Web Archives for Verifying Attribution in Twitter Screenshots,” which discusses how we can further use the candidate archived tweets to verify whether the tweet in the screenshot was posted by the alleged author.
The Disintegration Loops: Generational Loss in Web Archives
Michael L. Nelson
As part of the Internet Archive's Information Stewardship Forum (March 18–20, 2026), I decided to use my five-minute lightning talk to raise the issue of generational loss in web archives. Or more directly, making copies of copies (...of copies…) – something that web archives currently do not do well. My title is based on William Basinski's four-volume release "The Disintegration Loops", in which he played audio tapes of "found sounds", recorded decades earlier, in loops, with the whole process lasting over an hour. The effect is hauntingly beautiful, with each loop slightly degrading the magnetic tape, resulting in a generational loss. The degradation of each loop is right on the edge of the just-noticeable difference, until the entire track is reduced to a shadow of its former self.
I first discussed this topic in my 2019 CNI closing keynote (slide 88), where I introduced the inability of web archives to archive other web archives as part of the larger issue of web archive interoperability. Let's begin by walking through the example of archiving a tweet (which we already know to be challenging!). The original tweet is still on the live web, even though the UI has undergone many revisions since it was originally tweeted in 2018.
Note that archive.today is aware that the page comes from the Wayback Machine but the original host is twitter.com, and it maintains both the original Memento-Datetime (20180501125952) as well as its own Memento-Datetime (20190407023141). I then archived archive.today's memento to perma.cc in 2019 (screen shot from 2019):
Although the loss occurs in discrete chunks, it is reminiscent of Basinski's Disintegration Loops, with information lost at each step, and the final version being a mere shadow of the original. In 2019, this was not universally recognized as a problem, since archiving the playback interface of other web archives was not considered a problem in itself. The "right" solution, of course, is to share the WARC files (or WACZ, or HAR, or…) out-of-band and let the other web archives replay from the same source files. But this is rarely possible: for a variety of reasons web archives typically do not share the original WARC files, and in the case of archive.today, might not even store the original source files (and instead, likely only store the radically transformed pages).
More importantly, it is sometimes useful to archive a particular web archive's replay of a page, which itself must be archived, because it changes through time. For example, memento #3 (the perma.cc memento of archive.today's memento) is now different; this is a screen shot from 2026:
Surely the source files themselves have not changed; the difference is due to improvements in pywb, which is under constant development. perma.cc's replay of the 2019 page in 2019 is different from its replay in 2026, which implies that it could be different still in the future. But we cannot currently archive perma.cc's replay of that page to, say, the Wayback Machine without generational loss. The fact that screen shots – which are rife with their own potential for abuse (cf. HT 2025, arXiv 2022) – are the only mechanism to document these replay differences underscores the web archive interoperability problem.
I chose the topic of generational loss for my slot at the Information Stewardship Forum because recent events have introduced a new use case for archiving the replay of web archives. Wikipedia recently announced it was blacklisting archive.today because its editors discovered that the webmaster at archive.today was using its captcha to direct a DDoS attack against a blog owned by someone with whom the webmaster had a dispute (the blogger had posted a lengthy investigation of the webmaster's identity), and, more disturbingly for our discussion, had edited the content of an archived page to include the blogger's name where it would not otherwise appear. The Wikipedia discussion page is hard to follow, in part because the editors are discussing how to archive the replay of an archived page. For one example, they show how the archive.today replay has now been changed back to have "Comment as: Nora Puchreiner" (middle of the image):
But the replay alteration from archive.today in question is archived at megalodon.jp to show that the name "Nora Puchreiner" was replaced with the name of the blogger who had earned the webmaster's ire, "Jani Patokallio". And yes, megalodon.jp's replay of archive.today's memento is that bad (at least in my browser, it is shrunk down impossibly small), so I used the dev tools to find the string in question.
Another Wikipedian archived (using yet another archive, ghostarchive.org) a google.com SERP to show that archive.today has reverted from "Jani Patokallio" back to "Nora Puchreiner".
What does changing "Nora" to "Jani" (and then changing it back again) accomplish? I'm not sure; this appears to be just a petty response to an ongoing dispute. But the implication is profound: this is the first known example of a major web archive purposefully and maliciously altering its contents, something that we knew was possible but had not yet experienced.
We have long known that replay can change through time (cf. PLOS One 2023) due to the replay engine (the Wayback Machine, Open Wayback, pywb, etc.) evolving, but these changes were engineering results and the replay mostly improved over time. But now we have seen web archives maliciously alter (and then revert) the replay, and we need a more standard and interoperable way to archive archival replay. Not just to prove that a web archive did alter its replay, but also to prove that an archive did not alter its replay. Out-of-band sharing of WARC files is the gold standard, but for a variety of reasons this is unlikely to happen. We must be able to use web archives to verify and validate web archives. We explored a heavyweight design for this a few years ago (JCDL 2019), but it should be revisited in light of developments like WACZ.
In Brief: This study examines the concept of neutrality in Library of Congress Subject Headings and the subject approval process by analyzing proposed headings that were rejected over a nearly 20-year period. It considers the place of neutrality in libraries more generally and argues that equity, rather than neutrality, is the appropriate lens for judging subject heading proposals. Finally, it recommends several reforms that could improve the subject heading process and make it more equitable.
If a train is moving down the track, one can’t plop down in a car that is part of that train and pretend to be sitting still; one is moving with the train. Likewise, a society is moving in a certain direction—power is distributed in a certain way, leading to certain kinds of institutions and relationships, which distribute the resources of the society in certain ways. We can’t pretend that by sitting still—by claiming to be neutral—we can avoid accountability for our roles (which will vary according to people’s place in the system). A claim to neutrality means simply that one isn’t taking a position on that distribution of power and its consequences, which is a passive acceptance of the existing distribution. That is a political choice.[1]
Introduction
Library workers and patrons have long been frustrated with Library of Congress Subject Headings (LCSH) for being out of date and lacking well-known concepts with abundant usage. Contributors to the Subject Authority Cooperative Program (SACO) have made many improvements to LCSH by proposing new headings and revising existing terms. Those attempts, however, have sometimes been hampered by the Library of Congress’s (LC) preference for supposed neutrality within the vocabulary; Subject Headings Manual (SHM) instruction “H 204,” released in 2017, specifically dictates that proposed headings should “employ neutral (i.e., unbiased) terminology.”[2]
This desire for neutrality has been directly stated, alluded to, or otherwise upheld in myriad rejections of proposed subject headings, from Negative campaigning[3] to White flight.[4] Even Water scarcity, a quantifiable concept of worldwide concern, was rejected in 2008 as a non-neutral topic requiring value judgments with the following justification:
Works on the topics of water scarcity and water shortage have been cataloged using the heading Water-supply, post-coordinating[5] as necessary with additional headings such as Water conservation and Water resources management. The meeting determined that this practice is appropriate and should continue, since Water-supply is a neutral heading that does not require a judgment about the relative abundance of water.[6]
However, what exactly constitutes neutral and unbiased terminology is never defined in “H 204” or anywhere else in the SHM, nor in any other Library of Congress controlled vocabulary manuals.[7] Much of the previous literature on neutrality in libraries focuses on debates over possible definitions of the term and what role neutrality should play in library services and collections. Building off previous critical cataloging literature, which focuses on addressing problematic terms, subject hierarchies, and biases within cataloging standards, this article extends that scrutiny further. We analyze how neutrality is embedded in the LC structures and systems that vet the terms catalogers utilize to describe materials.
Our article examines the ways in which neutrality is enforced in LCSH rejections between July 2005 and December 2024. We review “Summaries of Decisions” from LC Subject Editorial Meetings (along with associated discussion and commentary in the field); within these, we identify and interpret patterns of justifications used to reject subject heading proposals and maintain purported neutrality within the vocabulary. We argue that neutrality has been used to keep many concepts depicting prejudice (racism, sexism, etc.), as well as concepts related to the lived experiences of marginalized people, out of the vocabulary and/or to obscure materials about those topics under other, often more generalized or euphemistic, terminology. As a counterpoint, we suggest a values- and equity-driven approach to replace the principle of neutrality in a cataloging context and within the subject approval process. We acknowledge that the current political situation may be particularly fraught for equity-driven change, but believe bowing to political pressures is untenable, and continued pursuit of neutrality will only serve to further the discordance between library values and the realities of LCSH.
Background
Neutrality: Assumed, but Nebulous
Schlesselman-Tarango notes the perceived conceptual importance of neutrality for libraries and librarianship; their “status as ‘an essential public good’” is “contingent on the perpetration of the idea that [they are] also neutral.”[8] Seale further situates this notion of libraries-as-neutral as not externally imposed, but emanating from within librarianship itself: “The positioning of the library as a neutral and impartial institution, separated from the political fray, resonates with dominant library discourse around libraries.”[9]
However, despite both critics and supporters assuming that neutrality is fundamental to librarianship, there is a dearth of references to the term in official documents underpinning the ethics and standards of the library profession. The American Library Association’s (ALA) Working Group on Intellectual Freedom and Social Justice observed, for example, that “the word neutrality does not appear in the Library Bill of Rights, the ALA Code of Ethics, and any other ALA statements that the Working Group could locate. It does not appear in the Intellectual Freedom Manual (10th Edition) nor is it defined in any official ALA document or policy.”[10] The International Federation of Library Associations and Institutions’s (IFLA) Code of Ethics mentions but does not define neutrality in Section 5, in sentences such as “Librarians and other information workers are strictly committed to neutrality and an unbiased stance regarding collection, access and service.”[11] For catalogers in particular, the Cataloguing Code of Ethics, issued in 2021 and discussed further below, explicitly disputes the concept of neutrality.
Most pertinent to the subject proposal process, the National Information Standards Organization’s (NISO) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies mentions neutrality exactly twice, yet again without definition. The first instance, in guidance about choosing preferred forms of terms, asserts that “Neutral terms should be selected, e.g., developing nations rather than underdeveloped countries.”[12] The second appearance, in a discussion of synonyms, notes “pejorative vs. neutral vs. complimentary connotation[s]” of terms that might influence usage.[13] The latter reference positions neutrality as the impartial fulcrum of term meanings, while the former implies, particularly via the example, a more active attempt at choosing equitable and unbiased terminology.
Although the terms “neutral” and “unbiased” are often linked when they appear in library literature (as in the IFLA Code of Ethics), they are not synonymous. Oxford English Dictionary (OED) definitions of neutral include “inoffensive,” and “not taking sides in a controversy, dispute, disagreement, etc.”; unbiased, however, while meaning “not unduly or improperly influenced or inclined; [and] unprejudiced,” does not necessarily imply a lack of involvement in social or political issues.[14] The incompatibility between neutrality as inoffensive isolation versus unbiasedness as active equity plays out repeatedly in library discussions. Without clear definitions, neutrality in the NISO Guidelines and elsewhere is open to conjecture and interpretation. As noted by Scott and Saunders, “[T]he term ‘neutrality’ seems to be used for, or conflated with, everything from not taking a side on a controversial issue to the objective provision of information and a position of defending intellectual freedom and freedom of speech.”[15]
Proponents of library neutrality don’t fully agree on definitions, either. In Scott and Saunders’s survey, some describe it as “lacking bias,” which more closely aligns with principles of equity.[16] The depiction of neutrality by LaRue, the former Director of the ALA’s Office for Intellectual Freedom, also appears to resemble equity; he frames neutrality as not “deny[ing] people access to a shared resource just because we don’t like the way they think” and giving everyone “a seat at the table.”[17] Dudley, reframing library neutrality in relation to pluralism, highlights similar values; his proposed ethos calls on librarians to “adhere to principled, multi-dimensional neutrality” which includes “welcoming equally all users in the community” and “consistently-apply[ing] procedures for engaging with the public.”[18]
The 2008 book Questioning Library Neutrality examines many aspects of why neutrality is both an illusion and a misguided aspiration, and also disabuses readers of the idea that it has always been a core value. Rosenzweig points out that neutrality as a principle of librarianship does not go back to the early development of public libraries:
We would do well to remember that, if libraries as institutions implicitly opened democratic vistas, our librarian predecessors were hardly democratic in their overt professional attitude or mission, being primarily concerned with the regulation of literacy, the policing of literary taste and the propagation of a particular class culture with all its political, economic and social prejudices. In fact, the idea of the neutrality of librarianship, so enshrined in today’s library ideology (and so often read back into the indefinite past), was alien to these earlier generations.[19]
Although Macdonald and Birdi’s literature review identifies four conceptions of neutrality within library science literature—“favourable,” “tacit value,” “libraries are social institutions,” and “value-laden profession”—the authors found that depictions of neutrality articulated by practitioners are more complicated. Many have “ambivalent” views of neutrality, seeing it as “a slippery and elusive concept.”[20] The relative importance of neutrality to proponents varies, depending on its position vis-à-vis other library values: “When it is alone, or grouped with a simple, single other value like professionalism, it is very low in priority. When it is presented in a group of other values or left implicit, it fares better.”[21] Catalogers tended to espouse neutrality the least among library specializations, with 21% reporting that they never think about neutrality.[22] Further, some surveyed librarians “are more likely to eschew neutrality on matters of social justice,” when neutrality comes into conflict with core library values.[23]
Neutrality versus Social Justice
Since the late 1960s, neutrality has increasingly come into question as librarians have embraced ideals centering social justice, equity, diversity, and inclusion, particularly in the ALA.[24] These values, codified in the ALA Code of Ethics and Library Bill of Rights, include a commitment to “recognize and dismantle systemic and individual biases; to confront inequity and oppression; to enhance diversity and inclusion; and to advance racial and social justice in our libraries, communities, profession, and associations.”[25] ALA resolutions go a step further, acknowledging the “role of neutrality rhetoric in emboldening and encouraging white supremacy and fascism.”[26] Scott and Saunders sum up the issue, noting that while some librarians cast neutrality as a “fundamental professional value, albeit one that is not explicitly mentioned in the professional codes of ethics and values,” others assert that it is “a false ideal that interferes with librarians’ role of social responsibility, which is an explicitly stated value of librarianship.”[27] As Watson argues in an ALA 2018 Midwinter panel on neutrality in libraries, “We can’t be neutral on social and political issues that impact our customers because, to be frank, these social and political issues impact us as well.”[28]
Even among library codes of ethics that explicitly hold neutrality as a core value, there is a tension between practitioners and official documentation. For example, the Canadian Federation of Library Associations / Fédération canadienne des associations de bibliothèques (CFLA-FCAB) Code of Ethics calls for librarians to “promote inclusion and eradicate discrimination,” provide “equitable services,” and “counter corruption directly affecting librarianship”; but the Code also advocates for neutrality, advising librarians to “not advance private interests or personal beliefs at the expense of neutrality.”[29] Once again neutrality remains undefined—though it’s implied, based on context, to be not taking sides, matching one of the OED definitions above. This understanding accords with a 2024 study on Canadian librarians, which noted most Canadian academic librarians seem to have coalesced around defining neutrality as “not taking sides,” followed by “not expressing opinions.”[30]
Yet the same study also highlights a perceived incompatibility of neutrality with other values of librarianship, with “the majority (54%) of respondents” disagreeing or strongly disagreeing that “‘neutrality is compatible with other library values and goals,’” and 58% disagreeing “that it is ethical to be neutral.”[31] Brooks Kirkland asserts that assuming neutrality as a key tenet of librarianship conflicts with such principles as promoting inclusion and eradicating discrimination.[32] Pagowsky and Wallace note that, whether knowingly or not, upholding neutrality within inequitable systems ultimately supports them: “Trying to remain ‘neutral,’ by showing all perspectives have value … is harmful to our community and does not work to dismantle racism. As Desmond Tutu has famously said, ‘If you are neutral in situations of injustice, you have chosen the side of the oppressor.’”[33]
Cataloguing Code of Ethics, Critical Cataloging, and Other Recent Developments
The incongruity between neutrality and social justice as core library values has sparked the numerous debates detailed above, as well as on mailing lists and social media. It has also led in part to the expansion of the critical cataloging movement and the creation of the Cataloguing Code of Ethics, published in 2021 and since adopted by several library organizations, including the ALA division Core. The Code explicitly rejects the concept of neutrality; it avers that “neither cataloguing nor cataloguers are neutral,” and calls out the biases inherent within the dominant, mostly Western cataloging standards currently in use. It particularly notes that “cataloguing standards and practices are currently and historically characterised by racism, white supremacy, colonialism, othering, and oppression.”[34]
The most well-known critical cataloging subject heading proposal was the attempt to change the now-defunct heading Illegal aliens, as depicted in the documentary Change the Subject. In November 2021, five years after LC initially announced it would change the Illegal aliens subject headings and then backtracked after political pressure, LC announced it would replace the subject headings Aliens and Illegal aliens. However, LC did not adopt the changes it had initially announced, nor the recommendations made in a report by the ALA Subject Analysis Committee (SAC), which included revising the term to Undocumented immigrants.[35] LC instead split Illegal aliens into two new headings: Noncitizens and Illegal immigration.[36] Librarians have criticized the retention of “illegal” within one of the updated headings for continuing to make library vocabularies “complicit” with the “legally inaccurate” criminalization of undocumented immigrants.[37]
Other critical cataloging proposals have been subjected to inordinate scrutiny by LC; even when headings have been approved, they have sometimes faced heavy editing and modification. One example is Blackface, where LC’s changes to the proposal obscured the racism characterizing the phenomenon. The broader term (i.e., the parent in the subject hierarchy) was altered from Racism in popular culture to Impersonation.[38] Since Impersonation falls under the broader terms Acting, Comedy, and Imitation, this change emphasizes the performance aspect of the phenomenon rather than its racist connotations. Similarly, the scope note (i.e., definition) was modified from “Here are entered works on the use of stereotyped portrayals of black people (linguistic, physical, conceptual or otherwise), usually in a parody, caricature, etc. meant to insult, degrade or denigrate people of African descent” to “Here are entered works on the caricature of Black people, generally by non-Black people, through the use of makeup, mannerisms, speech patterns, etc.”[39] As noted by Cronquist and Ross, these changes ultimately “neutralize[d]” the proposal “in the name of objectivity.”[40]
However, there have also been numerous successful updates to outdated terminology and additions of missing concepts, particularly in recent years. For example, in 2021, fifteen subject headings for the incarceration of ethnic groups during World War II, including Japanese Americans, were changed from the euphemistic phrase –Evacuation and relocation to –Forced removal and internment.[41] The African American Subject Funnel added the new heading Historically Black colleges and universities in 2022 and helped to revise Slaves to Enslaved persons in 2023; the Gender and Sexuality Funnel successfully changed the heading Gays to Gay people, and proposed the new term Gender-affirming care, in 2023; and the Medical Funnel updated Hearing impaired to Hard of hearing people in 2024.[42]
On a hopeful note, many of these large-scale projects were coordinated with Cataloging Policy Specialists within LC, who worked closely with catalogers during the process and ensured that related term(s) and related Library of Congress Classification number(s) were updated as well. Further, LC has taken some recent steps to improve its vocabularies and create avenues for increased input from outside institutions. This includes hiring a limited-term Program Specialist to help redress outdated terminology related to Indigenous peoples. LC also created two advisory groups for Demographic Group Terms and Genre/Form Terms, both of which allow for greater community input into these vocabularies.
Still, frustrations remain. Changing outdated terminology is a complicated process. Library of Congress vocabularies, in particular, are vulnerable to potential governmental interference. Attempted Congressional intervention during the updating of Illegal aliens, along with the passage of a statute mandating transparency in the subject approval process, led to the creation of “H 204,” which codifies LC’s preference for a neutrality uninvolved in political and social issues.[43] The complication of bibliographic file maintenance (e.g., reexamining cataloged materials to determine whether subject headings should be changed, deleted, or revised) also muddies the waters and impedes large-scale projects. Staffing issues within LC further hinder the ability to undertake or complete projects, as seen in the SACO projects process, paused in 2025 due to LC’s catalog migration.
Maintaining LCSH
Library workers are familiar with LCSH in our discovery tools, and most are aware of concerns about outdated and problematic headings. However, they may not see debates and conflicts about new headings and ongoing maintenance of the vocabulary as a built-in and inherent part of the system, as catalogers who engage in that work do.
As Gross asserts:
To remain effective, headings must be regularly updated to reflect current usage. Today’s LCSH People with disabilities used to be Handicapped and, before that, Cripples. Additionally, new concepts require new headings, such as the recently created Social distancing (Public health), Neurodiversity, and Say Her Name movement. The process of determining which word or phrase to use as the subject heading for a given topic is inevitably fraught and can never be free of bias. The choice of terms embodies various perspectives, whether they are intentional and acknowledged or not.[44]
The need to continually revise existing headings and create new ones, and indeed the wrangling over what they should be, is neither a defect nor a surprise. It flows directly from the purpose of a controlled vocabulary and from the complications of language it exists to help navigate: the ever-changing and endless variety of ways to refer to things.
Some of the frequency and intensity of debates about LCSH stem from the fact that it attempts to be a universal vocabulary covering all branches of knowledge. While it is created and maintained primarily for the needs of the Library of Congress, it is used by all kinds of libraries. Balancing the needs of a user base of federal legislators against supplying the world with a one-size-fits-all vocabulary is clearly a formidable and contradictory endeavor. In recent decades, LC has made significant progress in opening up the maintenance process to input and contributions from the broader library community via the SACO program. These changes appear to be partly a response to demands to make the process faster and more transparent, but they also reflect LC’s desire to incorporate broader perspectives and experiences and to help manage the tremendous workload.
LCSH Creation and Revision Process
The SACO program, created circa 1993,[45] allows librarians to submit proposals for new or revised LCSH terms (as well as other LC vocabularies) to the Library of Congress. In order to submit proposals, catalogers are expected to be familiar with the Subject Headings Manual (SHM), which governs LCSH usage and formulations as well as the proposal process, required research, and criteria used to evaluate proposals.[46] One of the primary requirements is literary warrant: proposers must demonstrate that there is a need for the new subject heading based on a work being cataloged.[47] Beyond the work cataloged and published/reference sources, librarians can also cite user warrant, “the terminology people familiar with the topic use to describe concepts,” as justification in proposals.[48] This can include reviews, blog posts, social media threads, LibGuides, etc.
After a proposal is submitted, LC staff add it to a monthly “Tentative List,” which is published to allow for public comment on proposed headings. Taking those comments and SHM instructions into account, members of LC’s Policy, Training, and Cooperative Programs Division (PTCP) decide whether to add the proposed heading to LCSH, send it back to the cataloger for revision and resubmission, or reject it. If the heading is not added, a monthly “Summary of Decisions” document details the reasons for its exclusion. While the SACO program allows external librarians to submit proposals, the Library of Congress maintains its “authority to make final decisions on headings added.”[49]
Most proposals are routine and relatively straightforward, such as those that follow patterns—repeated formulations of similar subjects that provide a predictable search structure for library patrons (e.g., Boating with dogs already exists and the cataloger wants to propose Boating with cats). SHM “H 180” notes that patterns help achieve desired qualities for the vocabulary, including “consistency in form and structure among similar headings.”[50] LC is also concerned with avoiding multiple subject headings for concepts that are too closely related. LCSH online training “Module 1.2” highlights both “consistency and uniqueness among subjects” as strengths of controlled library vocabularies, for instance.[51] Proposals that don’t follow patterns therefore receive more scrutiny, to make sure they are unique, definable topics. LC makes judgment calls based on the strength of the evidence in proposals, and on SHM instructions, including the guidance in “H 204” about neutrality.
Neutrality within LC Documentation
Within its official documentation on subject headings, LC mentions neutrality sparingly. In the entirety of the SHM, the word neutral appears only once, specifically in guideline “H 204” with the recommendation that catalogers “employ neutral (i.e., unbiased) terminology.”[52] Apart from an association with the term unbiased, neutral is not defined in “H 204” or anywhere else in the SHM. Online LCSH training, freely available from the Library of Congress website, offers similarly little on the concept of neutrality. “Module 1.4” recommends that catalogers “accept the idea that all knowledge is equal” and “remain neutral … and attempt to be as objective as possible” when describing material.[53]
Despite the lumping together of neutral and unbiased in “H 204,” a neutrality that statically ignores social realities and historical context is not the same as an unbiased, active engagement against prejudice. The Merriam-Webster Dictionary’s definitions of “neutral” and “unbiased” make this clear. “Neutral” as “indifferent” and politically nonaligned echoes the OED. But the definition of “unbiased” goes even further, meaning not just free from prejudice and “favoritism” but “eminently fair”[54]—an active and flexible balancing of interests inherently at odds with static and detached neutrality. Eliding the two concepts risks undermining the latter, and with it library ethics and values, resulting in the further entrenchment of Western, colonial, and other biases in LCSH.
The definition of neutrality that LC, and by extension LCSH, seems to favor is one of passivity. Neutrality as indifference to social realities appears, for instance, in LCSH training “Module 1.4.” The module acknowledges that library vocabularies “are culturally fixed” and “from a place; they are from a time; they do reflect a point of view.” However, rather than using that “realiz[ation]” to encourage periodic updating of outdated or potentially prejudicial content in LCSH, the module advises “accepting” that cultural fixity as immutable fact; it recommends that catalogers “remain neutral, suspend disbelief” and focus on (undefined) objectivity instead.[55] Objectivity also appears in “H 180,” which advises catalogers: “Avoid assigning headings that … express personal value judgments regarding topics or materials. … Consider the intent of the author or publisher and, if possible, assign headings … without being judgmental.”[56]
Here, as in “Module 1.4,” objectivity appears linked to neutrality; the implication is that a subject can only be described without bias if a cataloger is dispassionate and has no opinions on the topic. However, not all definitions of objectivity match this interpretation. Although the OED defines objectivity as “detachment” and “the ability to consider or represent facts, information, etc., without being influenced by personal feelings or opinions,” Merriam-Webster’s definition is “freedom from bias” and a more actively equitable “lack of favoritism toward one side or another.”[57]
This disparity in meanings raises the question: What does it mean to describe a topic without judgment or bias? Is objectivity erasing any uncomfortable content in a topic, even if that erasure favors a biased status quo and/or muddies a topic’s meaning? Or, rather, is it objective to label something truthfully, even if the topic raises strong feelings? As demonstrated by the revisions to Blackface discussed above, changes to the scope note and broader term in the name of objectivity did not result in a clearer or less biased heading; instead, they obfuscated the racist intent behind the phenomenon.
Similarly, despite the assertion in “H 180,” a singular focus on authorial intent does not always result in a lack of bias or judgment in subjects. As noted by literary critics such as Wimsatt and Beardsley, “placing excessive emphasis on authorial intention [leads] to fallacies of interpretation,”[58] since readers only have access to the text in front of them; attempting to guess an author’s intent is already an act of judgment, not a discovery of objective facts. Further, if an author writes a prejudicial text, taking its content at face value risks replicating that bias through subject provision. LCSH terms such as Holocaust denial literature recognize and counter this, labeling Holocaust denial works as ones “that diminish the scale and significance of the Holocaust or assert that it did not occur.”[59] If catalogers relied strictly on authorial intent in the name of objectivity, those works would receive misleading subjects such as Holocaust, Jewish (1939-1945) rather than Holocaust denial literature, tacitly legitimizing bias.
Thus, the SHM’s focus on objectivity and neutrality highlights incongruities and tensions within subject guidance and LCSH vocabulary itself between indifference and self-imposed inoffensiveness on the one hand, and actively countering bias and promoting equity on the other. As will be shown below, rejections in the name of neutrality reveal that in fact the proposal process itself has never been neutral or apolitical.[60]
Neutrality and SACO Rejections
LC’s adherence to an inflexible and indifferent definition of neutrality, under which proposals engaging with social and political realities are critiqued as subjective or as relying on value judgments, has led to the rejection of multiple headings that surface prejudice or describe the lives and experiences of marginalized peoples. Rejections upholding neutrality instead reinforce hegemonic societal attitudes within LCSH.
Neutrality appears in several guises in proposal rejections in “Summaries of Decisions” from 2005 to 2025. The most obvious ones reference “H 204” and “neutral (i.e., unbiased) terminology,” including the 2008 rejection of Water scarcity and the 2024 rejection of White flight (discussed in more depth below).[61] Similar rejections use words such as “judgment” (including Negative campaigning in 2013, and Zombie firms in 2023); “pejorative” (e.g., Dive bars in 2010, and Banana republics in 2015); “vulgar and offensive” (such as Vaginal fisting and Anal fisting in 2010); “subjective” (such as African American successful people in 2009); “viewpoint” (including Jim Crow laws in 2019); and “non-loaded language” (e.g., Incarceration camps in 2024).[62]
Neutrality as non-involvement in political and social realities also appears in the rejection of proposals due to PTCP’s unwillingness to establish certain “patterns” of subject headings (i.e., set precedents for future headings of specific types). Pattern rejections often appear entirely arbitrary; that is, the rejections stated merely that PTCP did not wish to begin a pattern, and not that a proposal as formulated was missing vital elements, had no warrant, or did not conform to provisions stipulated in the SHM. Despite acknowledging in “Module 1.4” that the wrong subject heading “can make any resource in the collection ‘disappear,’”[63] these rejected patterns render certain topics invisible and unsearchable by library patrons.
Uncreated patterns include critiques of prejudicial attitudes and behaviors, particularly by governmental bodies, such as rejections of Prison torture in 2007 or Religious profiling in law enforcement in 2024.[64] Similarly, patterns that would have highlighted the unearned privilege and/or bigotry of certain groups remain largely unestablished, including Holocaust deniers (2016), Toxic masculinity (2020), and White privilege (rejected in 2011 and 2016, before finally being accepted as White privilege (Social structure) in 2022).[65] The rejection of White fragility in 2020 is particularly interesting, as the rationale was that “LCSH does not include any headings that ascribe an emotion or personality trait to a specific ethnic group or race, and the meeting does not want to begin the practice.”[66] However, LCSH has included since 2010 the heading Post-apartheid depression, meant to convey the mental health and feelings of white Afrikaners. So not all white people’s emotions appear off-limits—just ones that reveal systemic biases. PTCP also declined to create patterns naming discrimination directed at certain groups, such as Police brutality victims in 2014 and Missing and murdered Indigenous women in 2023.[67] In the latter case, the rejection of a term meant to highlight societal neglect of the violence against Indigenous peoples means that their existence and trauma continue to be hidden in library vocabularies and catalogs.
Pattern rejections not only make prejudices invisible in library catalogs, they also underrepresent concepts that celebrate or describe the cultures and experiences of marginalized peoples. Erasures of joy can be as damaging as erasures of struggle. Aronson, Callahan, and O’Brien’s discussion of themes related to people of color in picture books, for instance, could equally apply to messages portrayed in LCSH via what topics it hides or surfaces in library catalogs: a “predominance of Oppression … at the expense of other types of portrayals can send a message that suffering and struggle are definitive of a group’s experience, or even of victimhood.”[68] Instead, marginalized people “deserve to see themselves represented as people who lead full and dynamic lives and who are not fully defined by histories of oppression.”[69] Unaccepted subject headings of this type include African American successful people (2009), Overweight women’s writings (2011), Gay neighborhoods and Lesbian neighborhoods (2012), Gay personals (2018), Afro-pessimism (2021), and Indigenous popular culture (2024).[70]
Absorbing a proposed critical term into a supposed “positive” equivalent also served to preserve an inoffensive neutrality in LCSH, as seen in the rejection of Food deserts in 2014:
The concept of food desert has been defined in multiple ways by various governments and organizations, often in ways to suit their specific political agendas … The existing heading Food security is defined as access to safe, sufficient, and nutritious food. The existing heading is used for both the positive and negative (it has a UF [cross-reference for] Food insecurity), and the meeting feels that it adequately covers the concept of a food desert.[71]
Similarly, LC rejected a proposal for Genocide denial in 2017 with the rationale that the “positive” heading—Genocide—was sufficient for patron access: “A heading for a concept in LCSH includes both the positive and negative aspects of that topic. A work about the denial of genocide still discusses the concept of Genocide.”[72] Slum clearance was also rejected in 2007 in favor of the euphemistic and supposedly equivalent Urban renewal.[73]
Sometimes rejections upholding neutrality appeared in the guise of a fear that the term might be misapplied. For instance, although LC acknowledged in its 2019 rejection of Jim Crow laws and Jim Crow (Race relations) that the headings described laws and attitudes promulgated during a specific time period—which could therefore be described in a scope note guiding subject usage—it claimed that “the meeting is also concerned that the heading would be assigned only if the phrase Jim Crow is used in the title.”[74] In other words, the rejection prioritized avoiding possible future confusion over a definable term with ample literary and user warrant. The potential for definitional uncertainty also fueled other rejections, such as Femicide and Secret police in 2010, and Forced assimilation in 2024.[75] To preempt said confusion in all of these cases, LC could have added scope notes defining appropriate usage. Subjects have been remediated in the past when found to be misused, via clarifying scope notes or additional term creation, as with Romance literature (now Romance-language literature) versus Love stories (now Romance fiction).[76] Instead of denying the proposal due to a fear that a term might be misapplied, LC could have worked with the proposers to ensure the heading clearly defined the topic and, if necessary, made a public announcement with additional guidance on how to retrospectively add the term.
Overly limiting definitions of subjects also provided reasoning for neutrality-based proposal rejections. An attempt in 2011 to add the natural language phrase Queer-bashing as a cross-reference under the then-current heading Gays–Violence against, for example, was rejected with the justification that “queer-bashing is not necessarily violent.”[77] Intersexuality–Law and legislation, a heading reflecting ongoing debates about genital surgeries on infants and legally-recognized genders, was rejected in 2016 because “The subdivision –Law and legislation free-floats [i.e., can be used] under ‘headings for individual or types of diseases and other medical conditions, including abnormalities, functional disorders, mental disorders, manifestations of disease, and wounds and injuries’ (SHM H 1150).”[78] The medicalizing language of the rejection reinforced the view of intersexuality as a “condition” or “disorder” needing fixing, rather than the natural human diversity of a group struggling for bodily autonomy and human rights. The rejection of Redlining in 2024 also fits this definitional pattern. Despite acknowledging that Redlining “functioned in many different financial contexts,” LC’s rejection implied that redlining’s definition was too broad, as LC preferred “the specificity of … separate headings.”[79] This continues to fracture the topic into multiple subjects such as Discrimination in financial services, Discrimination in mortgage loans, and Discrimination in credit cards. The rejection also sidestepped notions of governmental complicity in redlining, and whitewashed the topic by making it appear less systemic in nature.
Purported limitations of the vocabulary also served as justification for rejecting proposals and upholding LCSH neutrality. For instance, Butch/femme (Gender identity) was deemed “too narrow and specialized for a general vocabulary such as LCSH” in 2011 (though Butch and femme (Lesbian culture) was later approved in 2012)[80]—this, despite the copious presence of narrow terms in LCSH about other topics, such as Madagascar hissing cockroaches as pets, Photography of albatrosses, Church work with cowgirls, and Zariski surfaces. Anal fisting and Vaginal fisting were rejected with the same rationale in 2010 (in addition to the “vulgar and offensive” argument described above).[81] Two rejections utilizing the same reasoning raise the question of whether queer cultures and identities were evaluated using particularly stringent criteria. As one librarian noted in the RADCAT mailing list after the rejection of Butch/femme (Gender identity):
This is especially baffling given that Bears (Gay culture) has been a valid subject heading for years, and both concepts have about the same amount of literary warrant. For those of you keeping track at home, this isn’t the first example of this rejection. During The Great Fisting Debacle of 2010 … the Anal fisting and Vaginal fisting proposals were shot down using the same language. I haven’t seen PSD [the prior name for PTCP] rejecting scientific or technical heading proposals as too specialized, which makes me wonder if it’s only gender & sexuality-related headings that receive this type of scrutiny.[82]
Troublingly, rejections for queer identities have continued since LC resumed processing tentative lists in January 2025, particularly for queer youth proposals. The rejection of Sexual minority high school students, for instance, indicates potential deference to current governmental queerphobia, particularly since the phrase “At this time” prefaces the justification: “At this time, it is not desirable to qualify headings for this age group by gender identity or expression/sexual orientation.” LC’s recommendation that “[t]erms from other subject vocabularies such as Homosaurus may be used instead of, or in conjunction with, existing LCSH headings to express the topic” suggests that there is no place for queer youth identity headings within LCSH.[83]
Finally, proposals were rejected in favor of maintaining pre-existing biases in LCSH—the cultural fixity mentioned in LCSH training “Module 1.4.”[84] For instance, a 2015 rejection of a change proposal related to Indigenous peoples–South Africa highlighted in its rationale the scope note for Indigenous peoples defining them entirely in relation to colonial power: “Here are entered works on the aboriginal inhabitants either of colonial areas or of modern states where the aboriginal peoples are not in control of the government.”[85] Sometimes, even the longevity of a term within LCSH was treated as sufficient reason to reject proposals meant to update outdated and inequitable terms, as with the 2020 rejection of a proposed change from Juvenile delinquents to Juvenile prisoners: “The existing heading Juvenile delinquents has been used for this concept for many years. At this point, it would be practically impossible to examine the entire file so the new heading could be applied accurately. The heading Juvenile delinquents should be assigned instead.”[86] This hesitance to tackle large projects because of the labor required for bibliographic file maintenance perpetuates the tendentious language present in LCSH and reinforces the view that the proposal process is itself not neutral.
Case Study: White Flight
In 2024, the African American Subject Funnel Project submitted a subject proposal for White flight. The proposal cited Kruse’s book White Flight: Atlanta and the Making of Modern Conservatism to demonstrate literary warrant. It additionally cited three reference sources—Encyclopedia of African-American Politics, The New Encyclopedia of Southern Culture, and Wikipedia—in order to define the term and demonstrate that it is commonly used by scholars and the public.
[Source]: Encyclopedia of African-American politics, 2021 (“White flight” is the term used to refer to the tendency of whites to flee areas and institutions once the percentage of blacks reaches a certain level)
[Source]: The new encyclopedia of southern culture, 2010 (The term “white flight” refers to the spatial migration of white city dwellers to the suburbs that took place throughout the United States after World War II. One of the most powerful and transformative social movements of the 20th century, white flight significantly affected the class and racial composition of cities and metropolitan areas and the distribution of a conservative postwar political ideology)
[Source]: Wikipedia, 16 Oct. 2023 (White flight or white exodus is the sudden or gradual large-scale migration of white people from areas becoming more racially or ethnoculturally diverse. Starting in the 1950s and 1960s, the terms became popular in the United States; examples in Africa, Europe, and Oceania as well as the United States)
However, LC rejected White flight with the following rationale: “LCSH does not currently have an established pattern that combines the topic of migration with the social reasoning for that migration. The meeting was concerned that introducing such a pattern, particularly in this case, would contradict the practice in LCSH of preferring neutral, unbiased terminology as stated in SHM H 204 sec. 2.”[87]
After this Summary of Decisions was issued, librarians on the SACOLIST mailing list publicly disagreed with the rejection and pointed out the flaws in LC’s argument. One poster highlighted the fact that the term was in common use and searched for by library patrons; they also noted another heading already in LCSH that fit the pattern PTCP claimed didn’t exist:
According to H 204 Section 2, the proposed heading should “reflect the terminology commonly used to refer to the concept,” which I believe is the case with this term. Additionally, the same section of H 204 asks, “Will the proposed revision enhance access to library resources? Would library users find it easier to discover resources of interest to them if the proposed change were to be approved?” Again, if this phrase is commonly used by patrons, it would make sense to add it to our catalogs … You wrote that “LCSH does not currently have an established pattern that combines the topic of migration with the social reasoning for that migration.” Could someone explain why Great Migration, ca. 1914-ca. 1970 doesn’t fit this pattern? Is it because of the date range and that this is a specific event?[88]
Another librarian emphasized the ongoing importance of white flight, the prevalence of literature discussing it, and the unequal treatment of headings describing different groups in LCSH:
The differences between these proposals from my perspective seems to be that one describes African Americans and the other describes White people, and White flight is an ongoing concept rather than a single historical event. I hope PTCP reconsiders this decision, because the effects of White flight and the practices surrounding it shape racial inequality in the United States and in many other countries in the world. Many works describe White flight and its consequences … and users are familiar with the term and want to find works about it.[89]
Finally, a respondent noted yet another term matching the supposedly non-existent pattern: “The existing heading Amenity migration would also appear to provide a pattern combining the topic of migration with the social reasoning for that migration.”[90]
Despite these arguments, LC neither responded to the mailing list discussion nor changed its decision. As White flight had literary warrant, was amply supported by reference sources, and was a concept that could not be accurately conveyed using existing subject headings, why was PTCP concerned about neutrality “particularly in this case”? Even governmental entities as varied as the Supreme Court, the U.S. Commission on Civil Rights, the National Register of Historic Places, and LC itself use the term white flight. The rejection’s insistence on the need for uninvolved neutrality therefore seemed inconsistent with the widespread acceptance of the term.
Instead, the neutrality justification appears to be a smokescreen to cover up discomfort with a term that called out white racism; mandating neutrality in this case meant privileging being inoffensive to white people over acknowledging a widely accepted critique of systemic racism. Patton notes in her Substack post “White People Hate Being Called ‘White People’” that whiteness functions in part by invisibility, a “retreat into universalism where whiteness can dissolve back into ‘humanity’ and avoid accountability.”[91] Rejecting the proposal may have been a neutral decision (i.e., deliberately unobjectionable and indifferent to political and social realities), but it was certainly not unbiased (i.e., free from favoritism). Instead, it conceptually reinforced the false position of whiteness described by Patton as “the default, neutral, objective, and moral”[92]—thus undermining equity in LCSH and making works on this important topic invisible and unsearchable in library catalogs.
Discussion
Chiu, Ettarh, and Ferretti describe the futility of relying on neutrality to further social justice within librarianship and its vocabularies:
When the profession discusses neutrality, we believe that the profession actually seeks equity. However, neutrality will not yield equitable results and will always fall short because it relies on equity already existing in society. This is not the condition of our current society, nor is it true for the profession. Therefore, neutrality will actually work toward reinforcing bias and racism.[93]
The rejection of White flight illustrates this point aptly. Justifying the rejection by invoking neutrality means that, practically speaking, being neutral equates to whitewashing the ongoing phenomenon: it pretends that the movement of white people in the United States is entirely benign, divorced from racism, and not worth library or library user attention. What are the long-term consequences of privileging neutrality, as opposed to equity, in the subject approval process? Neutrality as political isolationism and mandated inoffensiveness leads, as seen in the rejections from 2005 through 2024, to suppressing political and social critiques, hiding prejudice, and rendering the lived experiences of marginalized groups invisible.
It is unfortunately far too easy to weaponize a neutrality that, when evaluating proposals, gives equal weight to the intentions of groups such as racists and antisemites. An SHM instruction created in late 2024, “H 1922,” further embeds this weaponization within subject guidance. “H 1922” defines “offensive words” as “derogatory terms that insult, disparage, offend, or denigrate people according to their race, ethnicity, nationality, religion, gender identity, sexuality, occupation, social views, political views, etc.”[94] By including political and social views in the definition, LC inaccurately equates groups espousing opinions about how people should behave in society with demographic groups who have historically been marginalized merely for existing. This leaves LCSH vulnerable to political actors disingenuously claiming “offense” to silence critiques or establish prejudicial terms within the vocabulary. A recent example was the proposal to change Trans-exclusionary radical feminism into Gender-critical feminism, the obfuscatory label preferred by the transphobic group, by claiming that trans-exclusionary radical feminism was a slur.[95] (LC ultimately rejected the proposal, thanks in large part to “community activism” and mobilization opposing the change.[96] LC specifically mentioned library community input as the rationale for the rejection: “When this tentative list was published in November 2024, PTCP received over 300 email comments demanding rejection of this proposal.”[97])
There is ample evidence from the recent past and present of this weaponization of offense being used to undermine progress toward equity in the United States. The Trump administration’s proposed Compact for Academic Excellence in Higher Education (2025) exemplifies the dangers of privileging neutrality over equity. The Compact demands “institutional neutrality,” requiring that universities and their employees “abstain from actions or speech relating to societal and political events except in cases in which external events have a direct impact upon the university.” Those agreeing to this isolationist neutrality, meanwhile, would also agree to erase trans, non-binary, and intersex students, faculty, and staff, and to police and punish speech deemed offensive to conservatives. Notably, the Compact requires that admissions be based on “objective” criteria—except for explicitly allowed faith, “sex-based,” and anti-immigrant biases.[98]
Mandated neutrality within “H 204” risks reifying the same prejudices within library vocabularies. This can be seen in LC’s recent alteration of Mexico, Gulf of to America, Gulf of, and Denali, Mount (Alaska) to McKinley, Mount (Alaska).[99] Critical cataloger Berman describes the former change as “linguistic imperialism,” and the latter as an “affront to Alaska’s indigenous population.”[100] The latter change is particularly damaging, given the simultaneous effort by LC to remediate LCSH related to Indigenous peoples, and might undermine confidence in the project. In both cases, a neutral approach—remaining uninvolved in political and social events—led to an undue “deference to chauvinistic, ethnocentric, and unjustified authority.”[101] Whether LC realistically could have resisted altering these headings is a counterfactual hypothetical. Its actions must be judged by the effects of these revisions within library catalogs and for library patrons. By clinging to the illusion of neutrality, and capitulating to the whims of a racist and colonialist regime, LC undermined the profession’s stated values and harmed the larger library community.
Recommendations
What philosophical approach can LC take in lieu of neutrality, to bring the SACO process more in concert with library ideals of equity and egalitarianism? We recommend that LC employ a values-driven approach to vocabulary construction and maintenance. Explicitly stated library values—particularly around social justice and social responsibility—benefit all users, both marginalized peoples and the “mainstream.” Further, the PCC Policy Committee, of which LC is a permanent member, has already committed to the PCC Guiding Principles for Metadata, which acknowledge that “the standards and controlled vocabularies we use and their application are biased,” and advocate for “incorporating DEI principles in all aspects of cataloging work.”[102] Below, we suggest a number of changes LC could enact to make LCSH and the proposal process more equitable.
In backing away from neutrality as a guiding principle, philosophical approaches that have been suggested in critiques of traditional practice deserve consideration. In her chapter in Questioning Library Neutrality, Iverson proposes that librarians adopt feminist philosopher Haraway’s approach to objectivity: “Haraway explains that what we have accepted as ‘objectivity’ claims to be a vision of the world from everywhere at once … We can not see from all perspectives at once, we each have our own particular views that are shaped by our own identities, cultures, experiences, and locations.”[103] Instead of claiming to possess “infinite vision,” Iverson recommends that we adopt Haraway’s recognition of “situated knowledge.”[104]
Watson argues that instead of literary or bibliographic warrant (cataloging a book in hand, asking what subject headings are needed to convey its content), critical catalogers “operate from a position of catalogic warrant, reading the terms and hierarchies of cataloging and classification systems with a critical eye, reflecting on the potential benefit or harm of each term on marginalized users, groups, or the GLAMS [galleries, libraries, archives, and museums] community as a whole.”[105] In other words, librarians should focus on the subject heading system in its entirety, asking what revisions and additions are needed. In some ways, by collaborating with SACO funnels on large-scale projects to create and revise related groups of subject headings, LC has already moved away from strict adherence to an interpretation of literary warrant under which the only valid reason to propose a subject heading is having a book in hand that requires it. This shift should be continued and expanded.
As for concrete actions, we advise that LC restore its open monthly subject editorial meetings, where proposals are discussed, and expand points of communication with external libraries. This would allow a more diverse range of librarians to participate in the SACO process and provide valuable input during decision-making. Other benefits of monthly meetings have been noted by SACO librarians in an open letter to PTCP: they helped to demystify “the SACO process” for the newly involved, and allowed librarians to contribute to “lively conversations on a broad range of options, and the opportunity to shape the vocabularies we all use, from proposing single headings to creating special lists to debating new guidelines for topical subdivisions.”[106]
Building on this, we suggest creating an external advisory group for LCSH, similar to the ones for LCDGT and LCGFT, to get input from a broader range of users on proposal vetting and vocabulary maintenance. Further, we urge LC to allow greater decision-making power for external librarians in all advisory groups. This would help LC vocabularies better reflect the resources in the Library of Congress collections and the needs of thousands of libraries of different types around the world, and improve accountability for decisions made regarding proposals. It would also help to better insulate library vocabularies from the governmental interference noted above, by making a broad range of institutions responsible for their creation and maintenance.
Within such bodies, we recommend that LC follow guidance from the SAC Working Group on External Review of LC Vocabularies, by including members from groups being described in those vocabularies, subject matter experts, and international representatives. Furthermore, membership should not include “[r]epresentatives from groups or organizations that purport to speak for marginalized communities, but who exclude the voices of members of the marginalized community,” or “[r]esearchers or representatives from groups or organizations where the experts cause harm to members of marginalized communities.”[107] The inclusion of representative groups aligns with the PCC Guiding Principles for Metadata and follows the principles put forth in the Cataloguing Code of Ethics.
In vetting SACO proposals, “LC should prioritize sources from the peoples and communities described, privileging those sources over traditionally ‘authoritative’ sources, including literary warrant,” to ensure that the terminology used “reflect[s] a more inclusive and culturally relevant understanding of the language associated with these groups and their heritage and history.”[108] The creation of a position within LC focused on remediating metadata related to Indigenous peoples was a good first step in this direction; and we strongly encourage LC to both continue and expand this practice.
Finally, we suggest revisions to various LC documents and SHM instruction sheets. References to neutrality should be removed from “H 204” and “Module 1.4,” in favor of a focus on active equity in subject assignment and proposals. Examples of unbiased terminology, created in concert with advisory groups described above, reflecting a variety of situations, and periodically updated, would help create a shared understanding between librarians proposing headings and those evaluating them for inclusion in LCSH. “H 180” and “Module 1.4” should also be edited in the sections advising catalogers to remain objective and not “express personal value judgments.”[109] All cataloging relies on judgment, and judgment is not always synonymous with bias or divorced from facts. A more useful focus here, as in a revised “H 204,” would be on the active equity present in Merriam-Webster’s definition of objectivity; catalogers should employ “catalogic warrant” and evaluate the “potential benefit or harm”[110] of subjects, particularly when assigning headings to prejudicial works. Lastly, to protect against weaponized “offense,” we also recommend that “social views” and “political views” be removed from “H 1922.” These alterations would bring the SHM and LCSH training more in line with LCDGT guidance, which foregrounds cataloging ethics. “L 400,” for instance, notes that “naming demographic groups and identifying individuals as members of those groups must be done with accuracy and respect,” and highlights the importance of self-identification when assigning headings.[111]
We cannot make recommendations on this topic without addressing the current political climate. Because LC’s catalog migration put most SACO work on hold during 2025,[112] the effect of the Trump administration’s anti-DEI policies on LCSH remains uncertain. However, United States history is rife with periods of political repression. Waiting for relative calm to advocate for equity has not been, historically, how equity was advanced, and it will not serve library patrons or the broader community in the present moment.
Conclusion
LCSH began over a century ago as a subject cataloging tool for the Library of Congress, and has since evolved into a vocabulary serving thousands of libraries around the world. Despite the broad and diverse user base, LC has remained the sole arbiter of which proposals are accepted into LCSH and what form the headings take. During the last two decades it has rejected a number of subject proposals due to a preference for purported neutrality and objectivity, in various guises. Yet, as a profession, librarianship claims to prioritize social responsibility. Social justice and equity are incompatible with an indifferent and purposefully inoffensive neutrality that allows harmful, colonialist, and racist headings in LCSH, and keeps out headings describing prejudice, or about the lived experiences of marginalized peoples.
Olson describes LCSH as “a Third Space between documents being represented and users retrieving them,” since “LCSH constructs the meanings of documents for users.”[113] These meanings impact how users view materials, and whether they can locate them in library catalogs. And it is within this space that LC’s commitment to neutrality fails both users and the ideals of librarianship around social responsibility. However, “because the Third Space is one of ambivalence, it is one with potential for change.”[114] By focusing on library values rather than neutrality within the subject creation and approval process, LCSH could develop into a vocabulary that constructs truly equitable and inclusive meanings for users and librarians alike.
Acknowledgements
Thank you to our publishing editor, Jess Schomberg, and the editorial board for their flexibility, guidance, and expertise throughout the publication process. Thank you to K.R. Roberto, Margaret Breidenbaugh, Crystal Yragui, and Matthew Haugen, who allowed us to quote them within this article. We would also like to thank our reviewers, Jamie Carlstone and Ian Beilin, and other readers who gave valuable feedback: Adam Schiff, Rebecca Albitz, Chereeka Garner, Rebecca Nowicki, Naomi Reeve, Simone Clunie, Violet Fox, and Stephanie Willen Brown.
[1] Robert Jensen, “The Myth of the Neutral Professional,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 91.
[3] Throughout this article, authorized subject headings (i.e., those that exist currently in LCSH) are presented in bold font; while rejected proposed headings appear in italics. For consistency, subject headings within quotations will follow the same formatting, regardless of the formatting used in the original quotation.
[8] Gina Schlesselman-Tarango, “How Cute!: Race, Gender, and Neutrality in Libraries,” Partnership: The Canadian Journal of Library and Information Practice and Research 12, no. 1 (Aug. 2017): 10, https://doi.org/10.21083/partnership.v12i1.3850.
[9] Maura Seale, “Compliant Trust: The Public Good and Democracy in the ALA’s ‘Core Values of Librarianship,’” Library Trends 64, no. 3 (2016): 589, https://doi.org/10.1353/lib.2016.0003.
[15] Dani Scott and Laura Saunders, “Neutrality in Public Libraries: How Are We Defining One of Our Core Values?,” Journal of Librarianship and Information Science 53, no. 1 (2020): 153, https://doi.org/10.1177/0961000620935501.
[16] Scott and Saunders, “Neutrality in Public Libraries,” 158.
[19] Mark Rosenzweig, “Politics and Anti-Politics in Librarianship,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 5-6.
[20] Stephen Macdonald and Briony Birdi, “The Concept of Neutrality: A New Approach,” Journal of Documentation 76, no. 1 (2020): 333–353. https://doi.org/10.1108/JD-05-2019-0102.
[21] Jaeger-McEnroe, “Conflicts of Neutrality,” 3.
[22] Jaeger-McEnroe, “Conflicts of Neutrality,” 6.
[23] Jaeger-McEnroe, “Conflicts of Neutrality,” 9.
[24] Steve Joyce, “A Few Gates Redux: An Examination of the Social Responsibilities Debate in the Early 1970s and 1990s,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 33-65.
[26] “Resolution to Condemn White Supremacy and Fascism as Antithetical to Library Work,” American Library Association, Jan. 25, 2021, https://tinyurl.com/yr4z9e8x.
[27] Scott and Saunders, “Neutrality in Public Libraries,” 153.
[35] Subject Analysis Committee Working Group on the LCSH “Illegal aliens,” “Report from the SAC Working Group on the LCSH ‘Illegal aliens,’” July 13, 2016, https://alair.ala.org/handle/11213/9261.
[36] Jill E. Baron, Violet B. Fox, and Tina Gross, “Did Libraries ‘Change the Subject’? What Happened, What Didn’t, and What’s Ahead,” in Inclusive Cataloging: Histories, Context, and Reparative Approaches, eds. Billey Albina, Rebecca Uhl, and Elizabeth Nelson (ALA Editions, 2024), 53; Library of Congress, “Library of Congress Subject Headings Approved Monthly List 11 (November 12, 2021)” (Library of Congress, 2021), https://classweb.org/approved-subjects/2111b.html.
[37] Baron et al., “Did Libraries ‘Change the Subject?,’” 54.
[38] Michelle Cronquist and Staci Ross, “Black Subject Headings in LCSH: Successes and Challenges of the African American Subject Funnel Project,” Reference and User Services Association, July 7, 2021, virtual, https://d-scholarship.pitt.edu/41826.
[39] Cronquist and Ross, “Black Subject Headings in LCSH.”
[40] Cronquist and Ross, “Black Subject Headings in LCSH.”
[41] Library of Congress, “Library of Congress Subject Headings Approved Monthly List 06 (June 18, 2021)” (Library of Congress, 2021), https://classweb.org/approved-subjects/2106.html. Note that the headings for Japanese Americans, Japanese Canadians, and Aleuts were originally submitted as –Forced removal and incarceration, matching preferred usage, but LC changed them all to –Forced removal and internment.
[43] For more information about Congressional actions related to the attempt to change Illegal aliens, see: SAC Working Group on Alternatives to LCSH “Illegal aliens,” “Report of the SAC Working Group on Alternatives to LCSH ‘Illegal aliens’” (American Library Association, 2020), http://hdl.handle.net/11213/14582.
[48] Rich Gazan, “Cataloging for the 21st Century Course 3: Controlled Vocabulary & Thesaurus Design Trainee’s Manual” in Library of Congress Cataloger’s Learning Workshop (Library of Congress, n.d.), 2-2,
[60] Anastasia Chiu, Fobazi M. Ettarh, and Jennifer A. Ferretti, “Not the Shark, but the Water: How Neutrality and Vocational Awe Intertwine to Uphold White Supremacy,” in Knowledge Justice: Disrupting Library and Information Studies through Critical Race Theory, eds. Sofia Y. Leung, Jorge R. López-McKnight (MIT Press, 2021), 65.
[61] Library of Congress, “Editorial Meeting Number 4,” 2008; Library of Congress, “LCSH/LCC Editorial Meeting Number 02 (2024).”
[68] Krista Maywalt Aronson, Brenna D. Callahan, and Anne Sibley O’Brien, “Messages Matter: Investigating the Thematic Content of Picture Books Portraying Underrepresented Racial and Cultural Groups,” Sociological Forum 33, no. 1 (2018): 179, http://www.jstor.org/stable/26625904.
[69] Lisely Laboy, Rachael Elrod, Krista Aronson, and Brittany Kester, “Room for Improvement: Picture Books Featuring BIPOC Characters, 2015–2020,” Publishing Research Quarterly 39 (2023): 58, https://doi.org/10.1007/s12109-022-09929-7.
[72] Library of Congress, “Summary of Decisions, Editorial Meeting Number 09” (Library of Congress, 2017), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-170918.html. LC did establish a new heading for Denialism at that time; however, per the rejection, “To bring out the denialism aspect of events or topics, the heading may be post-coordinated with headings for the events or topics. The existing subject headings Holocaust denial and Holodomor denial, which are related to specific events, were added by exception as narrower terms of the new heading Denialism. Additional narrower terms will not be added to Denialism.”
[80] Library of Congress, “Editorial Meeting Number 27,” 2011; Library of Congress, “Library of Congress Subject Headings Monthly List 12 LCSH (December 17, 2012)” (Library of Congress, 2012), https://classweb.org/approved-subjects/1212.html.
[81] Library of Congress, “Editorial Meeting Number 27,” 2010.
[82] K.R. Roberto, “LCSH Proposals: Is this a Trend?” Jan. 17, 2012, RADCAT mailing list archives.
[85] Library of Congress, “Summary of Decisions, Editorial Meeting Number 12” (Library of Congress, 2015), https://www.loc.gov/aba/pcc/saco/cpsoed/psd-151212.html. A 2016 rejection of Dadaist literature, Romanian (French) also highlighted colonialist content in LCSH, noting that “Headings for national literatures qualified by language are generally established for the language(s) of the colonial power that used to control the territory.” See: Library of Congress, “Editorial Meeting Number 04,” 2016.
[103] Sandy Iverson, “Librarianship and Resistance,” in Questioning Library Neutrality, ed. Alison Lewis (Library Juice Press, 2008), 26.
[104] Iverson, “Librarianship and Resistance,” 26.
[105] B. M. Watson, “Expanding the Margins in the History of Sexuality & Galleries, Libraries, Archives, Museums & Special Collections (GLAMS)” PhD diss. (University of British Columbia, 2025), 270.
[107] Subject Analysis Committee Working Group on External Review of LC Vocabularies, Report of the SAC Working Group on External Review of Library of Congress Vocabularies, February 2023, 8-9, https://alair.ala.org/handle/11213/20012.
[108] Working Group on External Review of LC Vocabularies, “Report,” 8.
I’m on strike right now, along with thousands of other faculty, academic professionals, and staff at Portland Community College (that’s two unions, friends!). It’s a weird feeling. I never thought I’d be in this position. PCC was the first place I worked where I really felt like the values of the College matched my own. I work with insanely dedicated and caring library workers, faculty, and staff. They believe unwaveringly in what they do and constantly go above and beyond for students. After being here for a few years, I knew this was the place I wanted to work for the rest of my career. Even as administration became worse – more corporatized, more performative, less accessible, more likely to listen to outside consultants than the people who directly work with students – I still never considered leaving because the folks I work with regularly are awesome and I love our students.
As a scholar of time, I’m always interested in different forms of time (queer time, crip time, etc.). Strike time feels really strange. We were talking this morning on the picket line about how it feels a lot like early COVID, when time moved very differently. The days feel both way too long and super short, with not enough time to get everything done but also too much time just staring at different union social channels. We’re totally energized and totally exhausted (I’m lying on the couch like a ragdoll right now after three hours of holding signs, screaming, dancing, marching, and chanting with hundreds of colleagues). In terms of information, we feel like we’re both drinking from a firehose and like we don’t have any of the information we need. We have no idea what the near-term future will bring. What day of the week it is feels almost arbitrary because none of the usual markers of those days apply (I see all the things I was supposed to have been doing at work each day on my calendar and it feels like another life entirely). We’re both unmoored and deeply connected. I love it (the connection and collective power) and I also really hate it (for our students, for our colleagues who live paycheck to paycheck, for what the administration and the Board are doing to my beloved institution).
So it’s weird to feel both temporarily severed from the College and also more deeply connected than ever. These administrators may run the College and have the authority to make decisions, but they are not the College. The College is the people I’ve seen on the picket lines the past few days in the rain and freezing cold, people who are truly fighting for the soul of our college. They make the College run, from teaching classes, to assisting students with all kinds of needs, to helping students feel welcome, to keeping the College clean and safe and keeping students fed. All of these things are critical and the College can’t run without us, but I’m not entirely sure the same can be said of our administrators. The College is also our students, many of whom have stood with us on the line, brought us food, or supported us through emails to the President and Board and on social media. I feel incredibly grateful for our students, who clearly see through the bs administration is putting out there.
It’s been kind of incredible to see how unprepared our administration was for this after 11 months in which they barely moved in negotiations. They’ve known for months that a strike was a distinct possibility and they were the ones who walked away from the bargaining table the night before the strike was meant to happen. The latest email from the President said “I will say, with some pride, that we are not – and we should not – be an organization that is good at navigating this scenario” but, honestly, they should have had guidance for students ready to go. Administrators are supposed to plan for scenarios like this. They had units planning for two different scenarios for cuts from the State (neither of which came to pass). We spent almost a year planning what we would cut if LSTA funds went away in our state for the next year (they didn’t, thank goodness). Most faculty, on the other hand, have been talking to students about a possible strike for the past six weeks at least and the union provided tons of resources to help them come up with a plan for their own classes. Yet the College was left totally scrambling last Wednesday as if they had no idea this could happen. Baffling.
It’s been interesting seeing some managers show up to bring food and/or spend a bit of time with us on the line. It’s not a lot of them, but it means a lot to us when someone does. They’ve told us about the absolute unprepared hot mess that is administration right now and it’s nice to realize that not every middle manager toes the party line at all times. But the vast majority of our managers sent us emails just before the start of the strike asking us to let them know if we were working or not, so most are definitely sticking with administration.
I had a boss many years ago who definitely put her employees first and advocated fiercely for us. She said she saw her role as akin to that of a manager of a minor league baseball team: she was there to help develop us for bigger and better things in our careers. She was a major mentor to me in my early years in the profession. Since then, the bosses I’ve had have really prioritized the people above them in the org chart ahead of the people below them. They have been classic “company [wo]men.” Helping us develop in our careers, or even supporting us when we explicitly asked for it, wasn’t part of the job. When I was a middle manager, I took the exact opposite approach, and that’s why I’m no longer a middle manager. I always saw the role of a manager as supporting one’s direct reports (essentially, I worked for them), and that wasn’t what the people in charge of the library wanted me to do.
The great library leader Mitch Freedman died recently, and it made me think about whether leaders like him can really exist in our much more corporatized libraries these days. If you don’t know about Mitch’s storied biography as a library leader and awesome human, please take a moment to read about him here in an obit from his family. When I was coming up as a librarian, he was a model for me of how to operate successfully in our field with total moral courage. He lived his values every day. He fought for people and the things that he believed in. He centered the folks who were oppressed. He believed relationships were core to our work. In many ways, he embodied the “Good” and the “Human(e)” characteristics of slow librarianship (maybe also the “Thoughtful,” but I didn’t work with him, so I’m not sure). His amazing daughter, Jenna Freedman, also lives her values courageously, a living tribute to his example.
I hope there are still library managers out there who have moral courage and fight the good fight, but, more and more, it feels like the people who become library Deans, Directors, and University Librarians are the ones who are willing to comply and conform, not the ones willing to rock the boat. As our institutions become more and more corporatized and neoliberal, we see less and less moral courage. I see a lot of library administrators wanting to look like they’re doing good more than they actually want to do good. I think of the leaders who all started EDI initiatives or published EDI statements right around 2020 and then let them fade away. Most of the people I see doing amazing values-driven work in our field these days are not leading libraries. They’re mostly front-line librarians. I wonder if it’s because, like me, folks are not willing to make the moral compromises so many have to make these days to climb the ladder.
As David Graeber writes:
the decisive victory of capitalism in the 1980s and 1990s, ironically, has… led to both a continual inflation of what are often purely make-work managerial and administrative positions—”bullshit jobs”—and an endless bureaucratization of daily life, driven, in large part, by the Internet. This in turn has allowed a change in dominant conceptions of the very meaning of words like “democracy.” The obsession with form over content, with rules and procedures, has led to a conception of democracy itself as a system of rules, a constitutional system, rather than a historical movement toward popular self-rule and self-organization, driven by social movements, or even, increasingly, an expression of popular will.
I see that in my own place of work. So much of my boss’ (our Dean’s) job is box checking compliance type work – approving vacations and sick leave, making sure we’re doing required trainings and other things the people above her on the org chart want us to do, making sure we’re doing all of the things contractually required of us, etc. It used to be that I met with her once each term to talk about what I was working on, go over my progress on my goals, etc. Then I went to meeting with her just once in Fall where we’d look at my goals document (without any meaningful feedback or support) and then I’d fill out a Google form at the end of the year to tell her what I did (with again no meaningful feedback). Now, even that Fall meeting is gone as her load of compliance-related work has increased. There’s no support outside of helping us navigate the bureaucracy of our institution. There’s no “walking around” as Mitch Freedman did – building relationships with employees and making them feel seen. There’s no focus on our development or talking about the meaning behind what we do. There’s just this compliance-focused flurry of activity.
As our colleges and universities become more and more corporatized, they take what were supposed to be leadership positions, ones that required vision and people skills, and turn them into babysitting jobs because, lord knows, we professionals can’t be trusted. Our college, like many, has seen a massive growth in the number of managerial positions, and yet, faculty and staff are being asked to do more administrative work than ever before, not less. Why? Well, of course those managers have to justify their existence.
Could a Mitch Freedman become a library director today? Would he have had to compromise his values somewhere down the line to get there? Do you know of any library leaders like Mitch today who are able to operate successfully in these more neoliberal environments?
In that same piece, David Graeber writes “scholars are expected to spend less and less of their time on scholarship, and more and more on various forms of administration—even as their administrative autonomy is itself stripped away. Here too we find a kind of nightmare fusion of the worst elements of state bureaucracy and market logic.” This is the reality we find ourselves in as our two unions fight for better pay, but even more importantly, for a real, substantial model of shared governance which we don’t currently have (and which our college President agreed to and then hired a consultant to create for us). The fact that the only college committee or governance group that has the ability to conduct a vote of no confidence in our President (which they successfully passed!) is our student government is a stark reminder of how little power and voice we have in the future of our college. It can be so easy to just focus on keeping our head down and doing the good work we do as educators, as supporters of students and faculty, as stewards of collections, etc., but when we fight together like this, we fight for the heart and soul of our organization. We fight for an organization that centers students and their needs and listens deeply to those who directly serve and educate them.
Walking the picket line the first couple of days was brutal in many ways. I was so cold and wet I couldn’t even grip my cell phone or a car door handle and I had to stay off my feet for a few hours as they thawed. But what has kept me warm, has kept all of us warm, is the solidarity. It has sometimes felt almost like a party, being there with many hundreds of my fellow colleagues. It’s been so affirming, so energizing. We’re all so united in this, so deeply committed to the institution and each other in ways that these administrators who jump from job to job every few years and compose soulless emails to us with freaking ChatGPT will never understand.
If you’re feeling so inclined, please contribute to our strike fund. The administration seems really dug in and even decreased their offer by over $100,000 on Sunday, so I’m not quite so optimistic anymore that this will end quickly and we have lots of faculty, academic professionals, and staff who won’t be able to pay their rent or mortgage without support. Thanks and solidarity!!
Relive the online conference that brought the open data community together for a celebration of two decades of CKAN and to discuss the role of open data and data infrastructures today.
Librarians have managed and lived through many seismic shifts brought by technology. How should librarian leaders approach the coming anticipated AI workforce disruption?
Abstract This column explores the ways in which library workers can better align technology use and instruction in library settings with library values, through championing the refusal of technologies that conflict with values like privacy and intellectual freedom. Drawing on experiences with individual patron instruction, class design, and passive programming, the author shares practical steps for helping patrons to understand and fight back against exploitation by digital technologies. Rejecting the myth that any technology is “neutral,” the column argues that libraries as values-driven organizations have a role to play in facilitating patrons’ rejection of technology, just as much as in their adoption of it.
Note from Shanna Hollich, column editor: I am particularly excited to share this issue's column for a number of reasons. First, it's from a public library perspective, which is one that is generally underrepresented in the LIS literature as a whole, and which I'm proud to say that ITAL makes a concerted effort to address. Second, it's about library instruction, a topic of relevance to all types of libraries - and where much of the literature specifically discusses formal library instruction, this column also addresses passive programming, informal instruction, and casual patron interaction, which are also vitally important and under-studied aspects of the library worker's role in education. And finally, it's yet another column about AI, and even more specifically, about taking a critical approach to AI tools, AI education, and AI literacy. Close readers may have noticed this topic tends to be a special interest of mine, but Hannah Cyrus takes a measured and reasoned approach here that acknowledges the potential harms of AI without falling into the trap of simply ignoring or denying AI and the very real impacts it is having on our libraries and the communities we serve.
The first phase of the Reimagining Discovery project at Harvard Library sought to address the challenge of fragmented search experiences of special collections materials using artificial intelligence (AI) technologies, such as embedding models and large language models (LLMs). The resulting platform, Collections Explorer, simplifies and enhances the search experience for more effective special collections discovery. The project team took a user-centered and trustworthy approach to implementing AI, grounding the choices of the platform in user empowerment and librarian expertise. The development process included extensive user research, including interviews, usability testing, and prototype evaluations, to understand and address user needs.
Collections Explorer was developed using a multi-component architecture that integrates multiple types of AI. The team evaluated more than 12 models to select ones that were the best fit for the need, as well as being ethical and sustainable. Detailed system prompts were developed to guide LLM outputs and ensure the reliability of information. The methodical and iterative approach helped to create a flexible and scalable platform that could evolve to support other material types in the future. Initial research showed that potential users are enthused at the prospect of AI-powered features to enhance discovery, especially the item-level summaries and related search suggestions. The project demonstrated the potential of integrating AI technologies into library discovery systems while maintaining a commitment to trustworthiness and user-centered design.
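The retrieval core of an embedding-based discovery layer like the one described can be illustrated with a toy example. This is a generic sketch of vector similarity search, not Harvard's actual architecture; the item ids and three-dimensional "embeddings" are invented stand-ins for model output:

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, items, k=2):
    # items: list of (item_id, vector); return the k most similar ids.
    ranked = sorted(items, key=lambda it: cosine(query_vec, it[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Toy catalog standing in for embedded collection descriptions.
catalog = [
    ("letters-1917", [0.9, 0.1, 0.0]),
    ("maps-1850",    [0.1, 0.9, 0.1]),
    ("diaries-1918", [0.8, 0.2, 0.1]),
]
print(top_k([1.0, 0.0, 0.0], catalog))  # ['letters-1917', 'diaries-1918']
```

In a real system the vectors come from an embedding model and live in a vector index rather than a Python list, but the ranking idea is the same.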
This study evaluates the effectiveness of the Artificial Intelligence for Theme Generation tool (original Portuguese acronym name: IAGeraTemas), developed with generative artificial intelligence (AI; Google Gemini), for automating thematic classification and the assignment of Sustainable Development Goals (SDGs) in documents. The methodology combined quantitative analyses (metrics of precision, recall, and accuracy) on 50 articles published by authors from the State University of Campinas (Unicamp), using classification from the SciVal database and qualitative analyses (analysis of the relevance of terms indexed by librarians from the Unicamp Library System in 40 articles available in the Unicamp Institutional Repository), comparing them with manual indexing performed by librarians. The quantitative results in SDG classification showed a recall of 0.785, while the “precision” and “accuracy” metrics were moderate. The qualitative analysis deepened the evaluation of term coherence and relevance suggested by the AI versus human indexing. It revealed the tool’s potential for suggesting relevant terms and expanding concepts, but it also exposed limitations in addressing complex topics. The research, conducted as an experiment at Unicamp Library System, concludes that IAGeraTemas is a valuable auxiliary tool, complementing but not replacing manual indexing, reinforcing the importance of human expertise in validating and refining results, and emphasizing the synergistic potential between AI and information professionals.
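The quantitative metrics named in the abstract have standard definitions over a confusion matrix; a minimal sketch (the counts below are hypothetical, chosen only so the recall matches the study's reported 0.785, and are not the study's actual data):

```python
def precision(tp, fp):
    # Of the labels the tool assigned, how many were correct?
    return tp / (tp + fp)

def recall(tp, fn):
    # Of the labels that should have been assigned, how many did it find?
    return tp / (tp + fn)

def accuracy(tp, tn, fp, fn):
    # Fraction of all decisions (assign or not) that were correct.
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical example: 157 correct assignments found out of 200 expected.
print(recall(157, 43))  # 0.785
```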
This article describes a case study in which a small metadata team at Illinois State University Milner Library produced a digital humanities project supporting Collections as Data (CAD) and linked data principles. Despite initial sparse descriptive content, the team recognized great potential for experimentation in a significant World War I archival collection to highlight lesser-known stories, including those of the Pioneer Infantry, women, and noncombatants. Discussion focuses on the strategic approaches in creating granular but scalable metadata for the large digital collection, and application of the data with various tools such as ArcGIS and Wikidata to construct interactive data visualizations, mapping, and digital storytelling for the Illinois State Normal University World War I Service Records collection. The article argues that even institutions without a dedicated CAD initiative can incrementally implement principles from the CAD model to add value to their digital collections. The authors first presented the project in 2024 at the Digital Library Federation Forum and the American Library Association Core Forum.
In digital preservation, the concept of a “Designated Community” from the Reference Model for an Open Archival Information System (OAIS) is used to articulate the group or groups of prospective users for whom information is preserved. Concerns have been raised about this concept and its potential implications. However, OAIS has recently undergone a major revision. This study examines the extent to which these revisions address or mitigate concerns regarding the Designated Community. Issues from the literature are grouped into three areas: the concept’s implementation, its potential misapplication, and its incompatibility with the mandates of institutions that serve broad and diverse communities. Major changes related to the Designated Community are identified and considered in relation to these issues. The analysis reveals that the revisions productively contribute to concerns in the first two areas but fail to address the third. The conclusion is that the process of revising OAIS has not drawn from insights into this topic in the literature.
The National Library Board (NLB) of Singapore has made significant strides in leveraging data to enhance public access to its extensive collection of physical and digital resources. This paper explores the development and implementation of the Singapore Infopedia Widget, a recommendation engine designed to guide users to related resources by utilizing metadata and a Linked Data Knowledge Graph. By consolidating diverse datasets from various source systems and employing semantic web technologies such as Resource Description Framework (RDF) and Schema.org, NLB has created a robust knowledge graph that enriches user experience and facilitates seamless exploration.
The widget, integrated into Infopedia, the Singapore Encyclopedia, surfaces data through a user-friendly interface, presenting relevant resources categorized by format. The paper details the architecture of the widget, the ranking algorithm used to prioritize resources, and the challenges faced in its development. Future directions include integrating user feedback, enhancing semantic analysis, and scaling the service to other web platforms within NLB’s ecosystem. This initiative underscores NLB’s commitment to fostering innovation, knowledge sharing, and the continuous improvement of public data access.
This paper explores the impact of digital initiatives on access services workers at the University of California, San Diego (UCSD) and draws on the expertise and experience of non-librarian titled staff operationalizing “digital first” policies. Digital initiatives have been strongly prioritized by libraries to promote equitable access, cost-effectiveness, and technological growth at many libraries in California. The term digital initiatives commonly refers to efforts that support the creation, preservation, access, discovery, and use of digital library resources. This term can encompass multiple interpretations and a variety of tasks.
This paper includes a literature review, an examination of statistics regarding demand and adoption of digital materials in public and academic libraries in California, and a summary of the impact study of non-librarian staff at UCSD. The literature review suggested that the term digital initiatives encompasses a broad scope of meanings and types of tasks, California State Library data suggest that a pattern of increased investment in digital initiatives adopted during the COVID-19 pandemic is continuing, and the information collected through the research at UCSD library suggests that non-librarian library workers play a growing role in managing, maintaining, and supporting these growing digital collections.
Computer workstations have been an integral part of libraries of all types since the 1980s, but the optimal number of workstations that should be deployed in a space has not been directly studied in the last 20 years. During that time, laptop computer and other mobile device ownership has continued to increase, and there is some reason to think that behaviors and preferences first seen during the recent coronavirus 2019 pandemic have further shifted how students use public desktop computers in libraries. McGill University Libraries reduced the size of its computer fleet in the aftermath of the pandemic by looking at the maximum concurrent usage of different clusters of computers across campus, a metric that indicates how busy a space can get with users. This article explains how this metric is calculated and how other libraries can use it to make an evidence-based decision about the optimal size of a computer fleet.
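The maximum-concurrent-usage metric described here can be computed with a standard sweep over session start/end events; a minimal sketch (the session times are invented, and the article's own calculation may differ in details such as tie-breaking):

```python
def max_concurrent(sessions):
    # sessions: list of (login, logout) times in any comparable unit.
    events = []
    for start, end in sessions:
        events.append((start, 1))   # +1 when a user logs in
        events.append((end, -1))    # -1 when a user logs out
    # Sort by time; at equal times, process logouts (-1) before logins (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    peak = cur = 0
    for _, delta in events:
        cur += delta
        peak = max(peak, cur)
    return peak

# Four sessions over a morning, expressed as fractional hours.
sessions = [(9.0, 10.5), (9.5, 11.0), (10.0, 10.25), (10.75, 12.0)]
print(max_concurrent(sessions))  # 3
```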
In 2024, the Durban University of Technology (DUT) Library conducted a comprehensive review of its library system to assess whether its current platform, Future of Libraries Is Open (FOLIO) hosted by EBSCO, and its discovery tool, EBSCO Discovery Service (EDS), aligned with its evolving needs. The institution had been using the current system for three years, but the slow development of important features and subsequent delays in a critical release of FOLIO led to frustrations among staff and library users, compelling the executive team to call for a comprehensive review of the library system. A major outcome of the review was to ascertain the extent of the gaps or limitations in the current system and investigate recent developments in other library systems, including discovery tools and analytical modules. After several vendor consultative sessions, extensive review of documentation and secondary sources, and engagement with selected academic libraries in South Africa, the review team concluded that there were no compelling reasons for an immediate system change and that fair consideration should be given to the developmental and community-driven ethos of FOLIO, and that issues with EDS and Panorama would be resolved by the implementation of planned features in FOLIO’s roadmap. This paper highlights the key processes undertaken in the review and shares experiences and suitable practices for project planning, criteria development, and evaluation. It also argues for a regular review of the library system and stresses the value of institutional knowledge and familiarity in mitigating the risks associated with the review and acquisition of new library systems.
The news
about Cloudflare’s new pay-per-crawl
API caught my attention for a few reasons. Read on for why, a bit
about what the results look like, and what I learned when I asked it to
crawl this here site as a test.
So, first of all, what’s up? Cloudflare’s Crawl API helps people collect
data from websites with bots, while the company simultaneously provides
one of the most popular technologies for preventing websites from being
crawled by bots?!?
At first this seemed to me like a classic fox-guarding-the-hen-house
type of situation. But the little bit of reading in
the docs I’ve done since makes it seem like they will still respect
their own bot gatekeeping (e.g. Turnstile).
If you are using Cloudflare or some other bot mitigation technology you
will have to follow their instructions to let the Cloudflare crawl bot
in to collect pages. Interestingly, it appears they are using the latest
specs for HTTP Message
Signatures to provide this functionality, since you can’t simply let
in anyone claiming to be CloudflareBrowserRenderingCrawler,
right?
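HTTP Message Signatures (RFC 9421) let a server check that a request really comes from the party it claims to be. Here is a much-simplified sketch of the idea, using the spec's registered hmac-sha256 algorithm with a shared key; Cloudflare's actual scheme, covered components, and keys will differ (real deployments typically publish asymmetric keys), so this only illustrates building a signature base and verifying it:

```python
import base64
import hashlib
import hmac

def signature_base(components, params):
    # Build a (simplified) RFC 9421 signature base: one '"name": value'
    # line per covered component, then the @signature-params line.
    lines = [f'"{name}": {value}' for name, value in components]
    lines.append(f'"@signature-params": {params}')
    return "\n".join(lines).encode()

def verify(components, params, signature_b64, key):
    # Recompute the MAC over the signature base and compare in constant time.
    expected = hmac.new(key, signature_base(components, params), hashlib.sha256).digest()
    return hmac.compare_digest(expected, base64.b64decode(signature_b64))

# Hypothetical covered components for an incoming crawler request.
components = [("@method", "GET"), ("@authority", "inkdroid.org"), ("@path", "/")]
params = '("@method" "@authority" "@path");alg="hmac-sha256"'
key = b"shared-secret"  # stand-in for illustration only

# The crawler would send this signature; the server recomputes and checks it.
sig = base64.b64encode(
    hmac.new(key, signature_base(components, params), hashlib.sha256).digest()
).decode()
print(verify(components, params, sig, key))  # True
```

The point is that a forged request claiming the crawler's User-Agent fails verification unless the sender also holds the key.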
The genius here is that Cloudflare is known for its Content Delivery
Network (CDN). So in theory (more on this below) when a user asks to
crawl a website the data can be delivered from the cache, without
requiring a round trip back to the source website. In some situations
this could mean that the burden of scrapers on websites is greatly
reduced.
The introduction of a Crawl API also looks like another jigsaw piece
fitting into place for how Cloudflare sees web publishers benefiting
from being crawled. Only time will tell if this strategy works out, but
at least they have some semblance of a plan for the web that isn’t
simply sprinkling “AI” everywhere.
If you run a website with lots of high value resources for LLMs
(academic papers, preprints, books, news stories, etc) the same cached
content could be delivered to multiple parties without having to go back
to the originating server. For resource constrained cultural heritage
organizations that are currently getting crushed
by bots I think this would be a welcome development.
But, the primary reason this news caught my eye is that if you squint
right Cloudflare’s Crawl API looks very much like web archiving
technology. For example, the Browsertrix API lets you
set up, start, monitor and download crawls of websites.
Unlike Browsertrix, which is geared to collecting a website for viewing
by a person, the Cloudflare Crawl service is oriented at looking at the
web for training LLMs. The service returns text content: HTML, Markdown
and structured JSON data that result from running the collected text
through one of their LLMs, with the given prompt.
Seeing the Web
So why is it interesting that this is like web archiving technology?
Ok, maybe it isn’t interesting to you, but (ahem) in my dissertation
research (Summers, 2020)
I spent a lot of time (way too much time tbh) looking at how web
archiving technology enacts different ways of seeing the web
from an archival perspective. I spent a year with NIST’s National
Software Reference Library (NSRL) trying to understand how they were
collecting software from the web, and how the tools they built embodied
a particular way of seeing and valuing the web–and making certain things
(e.g. software) legible (Scott, 1998).
What I found was that the NSRL was engaged in a form of web archiving,
where the shape of the archival records was determined by their initial
conditions of use (in their case, forensics analysis). But these initial
forensic uses did not overdetermine the value of the records,
which saw a variety of uses, disuses, and misuses later: such as when
the NSRL began adding software from Stanford’s Cabrinety
Archive, or when the team’s personal expertise and interest in video
games led them to focus on archiving content from the Steam platform.
So I guess you could say I was primed to be interested in how
Cloudflare’s Crawl service sees the web. This matters because
models (LLMs, etc) and other services will be built on top of data that
they’ve collected. But also because, if it succeeds, the service will
likely get repurposed for other things.
Testing
To test how Cloudflare sees the web, I simply asked it to crawl my own
static website–the one that you are looking at right now. I did this for
a few reasons:
- It’s a static website, and I know exactly how many HTML pages are on it. All the pages are directly discoverable, since the homepage includes pagination links to an index page that links to each post.
- I can easily look at the server logs to see what the crawler activity looks like.
- I don’t use any kind of Web Application Firewall or other form of bot protection on my site (I do have a robots.txt but it doesn’t block CloudflareBrowserRenderingCrawler/1.0).
- I host my website on May First, which doesn’t use Cloudflare as a CDN, so the web content wouldn’t intentionally be in Cloudflare’s CDN already.
I wrote a little command line utility cloudflare-crawl to
start, monitor and download the results from the crawl. While the
crawler ran I simultaneously watched the server logs. Running the
utility looks like this:
$ uvx https://github.com/edsu/cloudflare-crawl https://inkdroid.org
created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
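Under the hood, the utility's wait loop amounts to polling the job status until it reports completion. A sketch of that loop follows; the URL template and response shape are assumptions inferred from the output above, not Cloudflare's documented contract, so check their docs before relying on them:

```python
import json
import time
import urllib.request

# ASSUMPTION: endpoint path modeled on Cloudflare's v4 API conventions.
API = "https://api.cloudflare.com/client/v4/accounts/{account}/browser-rendering/crawl/{job}"

def is_complete(status):
    # The job output above reports status "completed" when done.
    return status.get("status") == "completed"

def progress(status):
    # Mirror the progress line the utility prints while waiting.
    return f"total={status['total']} finished={status['finished']} skipped={status['skipped']}"

def poll(account, job, token, interval=10):
    # Fetch the job status repeatedly until it completes.
    url = API.format(account=account, job=job)
    while True:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req) as resp:
            status = json.load(resp)["result"]
        if is_complete(status):
            return status
        print("waiting:", progress(status))
        time.sleep(interval)
```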
Each of the resulting JSON files contains some metadata for the crawl,
as well as a list of “records”, one for each URL that was discovered.
{
  "success": true,
  "result": {
    "id": "36f80f5e-d112-4506-8457-89719a158ce2",
    "status": "completed",
    "browserSecondsUsed": 1382.8220786132817,
    "total": 1967,
    "finished": 1967,
    "skipped": 6862,
    "cursor": 51,
    "records": [
      {
        "url": "https://inkdroid.org/",
        "status": "completed",
        "metadata": {
          "status": 200,
          "title": "inkdroid",
          "url": "https://inkdroid.org/",
          "lastModified": "Sun, 08 Mar 2026 05:00:39 GMT"
        },
        "markdown": "...",
        "html": "..."
      },
      {
        "url": "https://www.flickr.com/photos/inkdroid",
        "status": "skipped"
      }
    ]
  }
}
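Tallying record statuses across the downloaded result documents is a one-liner with collections.Counter; a small sketch using the record shape shown above:

```python
import collections
import json

def tally(results):
    # results: parsed JSON documents of the shape returned by the crawl job.
    counts = collections.Counter()
    for doc in results:
        for record in doc["result"]["records"]:
            counts[record["status"]] += 1
    return counts

# A pared-down example document with the same record structure.
doc = json.loads('''{"success": true, "result": {"records": [
  {"url": "https://inkdroid.org/", "status": "completed"},
  {"url": "https://www.flickr.com/photos/inkdroid", "status": "skipped"}]}}''')
print(tally([doc]))
```

In practice you would json.load each of the numbered result files and pass them all in.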
Analysis
I decided I wasn’t very interested in testing their model
offerings, so I didn’t ask for JSON content (the result of sending
the harvested text through a model). If I had, each successful result
would have had a json property as well. I am sure that
people will use this, but I was more interested in how the service
interacted with the source website, and wasn’t interested in discovering
the hard way how much it cost to run the content through their LLMs.
Below is a snippet of how the Cloudflare bot shows up in my nginx logs.
As you can see, the logs show which machine on the Internet is making
the request, when the request was made, and what URL on the site is
being requested.
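Analyses like the per-IP counts and peak request rate discussed below can be pulled out of such logs with a regex and collections.Counter; a sketch (the log lines here are invented examples in nginx's combined format, not my actual logs):

```python
import collections
import re

# Matches the start of an nginx "combined" format log line.
LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+)')

def analyze(lines):
    # Tally requests per client IP, and find the busiest minute's count.
    per_ip = collections.Counter()
    per_minute = collections.Counter()
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        per_ip[m["ip"]] += 1
        per_minute[m["time"][:17]] += 1  # truncate timestamp to the minute
    return per_ip, max(per_minute.values(), default=0)

# Invented sample lines:
logs = [
    '104.28.153.88 - - [12/Mar/2026:13:13:05 +0000] "GET / HTTP/1.1" 200 512 "-" "CloudflareBrowserRenderingCrawler/1.0"',
    '104.28.153.88 - - [12/Mar/2026:13:13:09 +0000] "GET /about/ HTTP/1.1" 200 512 "-" "CloudflareBrowserRenderingCrawler/1.0"',
    '104.28.163.131 - - [12/Mar/2026:13:14:01 +0000] "GET /feed.xml HTTP/1.1" 200 512 "-" "CloudflareBrowserRenderingCrawler/1.0"',
]
per_ip, peak = analyze(logs)
print(per_ip.most_common(1), peak)  # [('104.28.153.88', 2)] 2
```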
Maybe it’s early days for the service, but one thing I noticed is that
each time I requested the site to be crawled the results seemed to be
radically different.
crawl time           completed  skipped  queued  errored  unique_urls
2026-03-12 13:13:00        165       84       0        1          223
2026-03-12 13:44:00         72        4       2        0           78
2026-03-12 14:09:00       1947     7304       0       23         9191
2026-03-12 16:33:00         72        4       2        0           78
2026-03-12 17:34:00       1948     7365       0       22         9191
2026-03-13 16:50:00       1947     7363       0       23         9187
2026-03-14 07:32:00         72        4       2        0           78
The more successful crawls did a good job of crawling the entire site.
My website is well linked, with a standard homepage whose anchor-tag
based paging includes links to all the posts. But knowing when your
results are a partial crawl seems to be difficult; knowing the actual
dimensions of a “website” is one of the more difficult things about web
archiving practice. The URLs that were labeled as “skipped” were not in
scope for the crawl. If you wanted to include those, there is apparently
an options.includeExternalLinks option when setting
up the crawl.
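One rough way to detect a partial crawl, if you know your site's URLs (say, from a sitemap or a static site generator's manifest), is to compare them against the completed records; a sketch with made-up URLs:

```python
def coverage(expected, records):
    # expected: set of URLs the site owner knows exist (e.g. from a sitemap).
    # records: crawl records of the shape returned by the API.
    completed = {r["url"] for r in records if r["status"] == "completed"}
    missing = expected - completed
    return len(completed & expected) / len(expected), sorted(missing)

expected = {"https://inkdroid.org/", "https://inkdroid.org/posts/1/"}
records = [
    {"url": "https://inkdroid.org/", "status": "completed"},
    {"url": "https://www.flickr.com/photos/inkdroid", "status": "skipped"},
]
ratio, missing = coverage(expected, records)
print(ratio, missing)  # 0.5 ['https://inkdroid.org/posts/1/']
```

A ratio well below 1.0 would be a signal that the crawl you got back is one of the truncated ones.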
From watching the web server logs it was clear that:
- Cloudflare does appear to be relying on previously cached data, but it’s not entirely clear what the logic is. For example, one crawl took 5 minutes to complete and returned 1,974 completed results, but the web server only saw requests for 594 of those URLs. I turned around and ran the exact same crawl again: it took 20 minutes longer and returned 1,974 results, but 847 pages were requested. In between, no content on the website changed. 🤷
- Cloudflare appears to be fetching the CSS, JavaScript and images needed to render each page (they aren’t being cached by the Browser Worker).
- The throughput on the web server seemed to peak around 300 requests / minute (5 requests / second). For most sites this seems perfectly feasible.
- For the more successful crawls it looked like there were 246 independent IP addresses within Cloudflare’s network block doing the crawling.
ip                request_count
104.28.153.88     405
104.28.163.131    266
104.28.161.242    232
104.28.165.231    223
104.28.153.132    212
104.28.163.132    212
104.28.163.81     201
104.28.166.65     188
104.28.166.121    186
104.28.164.201    185
104.28.153.179    182
104.28.153.137    178
104.28.164.202    172
104.28.161.243    172
104.28.166.127    163
104.28.165.232    155
104.28.153.119    153
104.28.165.14     151
104.28.153.83     148
104.28.153.140    145
104.28.153.87     145
104.28.153.55     143
104.28.153.136    142
104.28.163.133    132
104.28.153.118    131
104.28.166.58     130
104.28.163.78     126
104.28.160.31     125
104.28.153.139    124
104.28.161.245    124
104.28.163.214    123
104.28.153.120    123
104.28.165.230    121
104.28.153.180    121
104.28.164.156    119
104.28.153.96     119
104.28.153.64     112
104.28.153.133    111
104.28.166.128    111
104.28.153.128    109
104.28.166.126    104
104.28.165.17     103
104.28.165.18     103
104.28.160.30     103
104.28.153.134    101
104.28.166.120    101
104.28.153.129    101
104.28.153.181    100
104.28.153.86     100
104.28.165.229    100
104.28.163.134    99
104.28.164.203    99
104.28.162.194    98
104.28.166.62     98
104.28.163.212    98
104.28.153.123    97
104.28.164.154    97
104.28.166.61     97
104.28.161.246    96
104.28.153.92     96
104.28.166.125    96
104.28.153.68     93
104.28.159.23     92
104.28.153.76     91
104.28.153.71     91
104.28.153.124    90
104.28.158.143    88
104.28.165.21     88
104.28.153.94     87
104.28.166.118    86
104.28.161.133    84
104.28.153.85     82
104.28.164.152    82
104.28.163.77     82
104.28.153.148    79
104.28.164.150    79
104.28.165.12     79
104.28.161.201    79
104.28.153.183    78
104.28.160.65     78
104.28.153.126    77
104.28.153.138    77
104.28.159.133    76
104.28.165.20     75
104.28.158.137    75
104.28.153.56     75
104.28.153.81     74
104.28.153.131    73
104.28.153.59     72
104.28.166.60     72
104.28.166.66     69
104.28.159.120    69
104.28.153.53     68
104.28.153.185    68
104.28.153.191    67
104.28.166.119    66
104.28.153.95     64
104.28.165.76     64
104.28.154.20     62
104.28.153.121    57
104.28.158.142    57
104.28.160.68     56
104.28.163.177    56
104.28.153.80     56
104.28.161.215    55
104.28.161.244    55
104.28.153.62     55
104.28.166.134    55
104.28.153.122    54
104.28.165.19     53
104.28.153.127    53
104.28.159.118    53
104.28.157.166    53
104.28.153.226    53
104.28.157.169    52
104.28.159.111    48
104.28.153.196    48
104.28.161.132    48
104.28.153.84     47
104.28.161.214    47
104.28.165.13     46
104.28.153.219    46
104.28.163.171    46
104.28.165.15     45
104.28.163.176    45
104.28.159.109    45
104.28.158.155    45
104.28.153.218    45
104.28.158.131    44
104.28.161.200    44
104.28.153.222    44
104.28.161.197    44
104.28.159.74     44
104.28.158.139    44
104.28.158.138    44
104.28.153.235    43
104.28.153.106    43
104.28.164.160    43
104.28.153.57     38
104.28.159.119    37
104.28.163.82     36
104.28.153.197    36
104.28.153.93     36
104.28.160.25     35
104.28.153.78     34
104.28.153.72     34
104.28.153.125    34
104.28.153.61     34
104.28.166.131    34
104.28.158.132    33
104.28.159.135    33
104.28.160.34     33
104.28.163.220    33
104.28.153.77     33
104.28.166.135    33
104.28.164.155    33
104.28.163.213    33
104.28.158.136    33
104.28.160.121    33
104.28.157.174    33
104.28.165.71     33
104.28.153.130    33
104.28.163.76     32
104.28.160.32     32
104.28.160.64     32
104.28.153.89     32
104.28.159.110    32
104.28.163.172    32
104.28.154.18     32
104.28.163.178    31
104.28.166.124    30
104.28.165.114    25
104.28.153.182    25
104.28.166.132    25
104.28.159.108    24
104.28.165.75     24
104.28.157.171    24
104.28.153.240    23
104.28.164.204    23
104.28.153.108    23
104.28.159.24     22
104.28.157.242    22
104.28.153.63     22
104.28.153.105    22
104.28.159.229    22
104.28.158.130    22
104.28.164.213    22
104.28.159.136    22
104.28.164.158    22
104.28.157.83     22
104.28.153.107    22
104.28.159.83     22
104.28.157.172    22
104.28.157.82     22
104.28.158.145    22
104.28.162.93     22
104.28.163.174    22
104.28.153.98     22
104.28.157.170    21
104.28.158.126    21
104.28.165.74     21
104.28.153.216    21
104.28.159.112    21
104.28.161.199    14
104.28.153.194    13
104.28.154.15     13
104.28.159.232    13
104.28.166.59     13
104.28.159.150    12
104.28.165.72     12
104.28.158.252    12
104.28.153.104    12
104.28.158.254    11
104.28.158.129    11
104.28.153.58     11
104.28.162.195    11
104.28.160.28     11
104.28.159.115    11
104.28.158.255    11
104.28.153.214    11
104.28.153.67     11
104.28.160.29     11
104.28.153.195    11
104.28.164.153    11
104.28.160.23     11
104.28.160.24     11
104.28.159.114    11
104.28.160.27     11
104.28.160.66     11
104.28.157.175    11
104.28.157.173    11
104.28.159.122    11
104.28.154.12     11
104.28.160.33     11
104.28.164.159    11
104.28.163.170    11
104.28.165.11     11
104.28.154.17     10
104.28.163.222    10
104.28.159.121    2
104.28.157.243    2
104.28.153.73     2
104.28.157.233    2
104.28.153.54     2
104.28.158.146    2
104.28.163.169    2
I spot checked some of the HTML and it did appear to be near identical
to what was on the live web, although even with the fullest results I
noticed 4% of URLs were not crawled. One exception was a few XML files,
like my OPML and RSS feeds, which showed only the XSL element in the
text and markdown results.
I think there are a few directions this could go from here:
- testing what happens when instructing the crawl to collect (instead of skip) pages that are off site
- testing what happens with more dynamic content, and how long to wait for pages to render
- trying to understand why truncated results sometimes come back, and whether there are any signals for identifying when that is happening
- exploring the logic Cloudflare uses to determine when it can serve from its internal cache
One thing I didn’t mention is that the Cloudflare free plan limits you
to a maximum of 100 pages per crawl. I set up a $5/month paid plan
account in order to do this testing. In all my testing I only seemed to
use 0.7 “browser hours”, which fits well within the 10 hours allowed
per month. It currently costs $0.09 / hour when you exceed your limit.
PS. If you are curious, the Marimo notebook I was using for some of the
analysis can be found here.
References
Ogden, J., Summers, E., & Walker, S. (2023). Know(ing)
Infrastructure: The Wayback Machine as object and instrument of digital
research. Convergence: The International Journal of Research into
New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759
Summers, E. H. (2020). Legibility Machines: Archival Appraisal and
the Genealogies of Use. Digital Repository at the University of
Maryland. https://doi.org/10.13016/U95C-QAYR
This is a guide to using YubiKey as a smart card for secure encryption,
signature and authentication operations.
Cryptographic keys on YubiKey are non-exportable, unlike
filesystem-based credentials, while remaining convenient for regular
use. YubiKey can be configured to require a physical touch for
cryptographic operations, reducing the risk of unauthorized access.
Jeremy Howard is a renowned data scientist, researcher, entrepreneur,
and educator. As the co-founder of fast.ai, former President of Kaggle,
and the creator of ULMFiT, Jeremy has spent decades democratizing deep
learning. His pioneering work laid the foundation for modern transfer
learning and the pre-training and fine-tuning paradigm that powers
today’s language models.
You can now crawl an entire website with a single API call using Browser
Rendering’s new /crawl endpoint, available in open beta. Submit a
starting URL, and pages are automatically discovered, rendered in a
headless browser, and returned in multiple formats, including HTML,
Markdown, and structured JSON. This is great for training models,
building RAG pipelines, and researching or monitoring content across a
site.
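As a rough illustration of what submitting such a crawl might look like, here is a minimal sketch. The endpoint path, parameter names, and response handling here are assumptions for illustration only; consult Cloudflare's Browser Rendering documentation for the actual API.

```python
import json

# Hypothetical base URL for the Browser Rendering API; the exact path
# segments and accepted fields are assumptions, not confirmed API details.
API_BASE = "https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering"

def build_crawl_request(account_id: str, start_url: str, limit: int = 100,
                        formats=("html", "markdown", "json")) -> tuple[str, str]:
    """Return a (url, json_body) pair for a hypothetical /crawl submission."""
    url = API_BASE.format(account_id=account_id) + "/crawl"
    body = json.dumps({"url": start_url, "limit": limit, "formats": list(formats)})
    return url, body

url, body = build_crawl_request("abc123", "https://example.com/")
# The request itself would be POSTed with an "Authorization: Bearer <token>"
# header, e.g. via urllib.request or curl; that step is omitted here.
```

The separation of request-building from request-sending makes the sketch easy to adapt once the real endpoint shape is known.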
MDC offers robust, secure, and controlled access to datasets and
amplifies their visibility by featuring them alongside other high-value
datasets. Its architecture is designed around a principle that stands in
direct contrast to the extractive model currently exploited by
commercial AI actors: contributors retain full ownership of their
datasets and retain full control over the terms of access. Institutions
can choose to share openly under existing licenses such as Creative
Commons or NOODL, or build custom licensing frameworks tailored to their
specific governance requirements. They can open data to all, or restrict
access to specific categories of downloaders like academic researchers,
non-commercial users, or values-aligned organizations.
Piotr A. Woźniak (Polish pronunciation: [pjɔtr ˈvɔʑɲak]; born 1962) is a
Polish researcher best known for his work on SuperMemo, a learning
system based on spaced repetition.
How do you build a system that handles 90 million requests per second?
That’s the scale that Cloudflare operates at, processing roughly 25% of
all internet traffic through their global network of 330+ edge
locations.
In this episode, we talk to Kevin Guthrie and Edward Wang from
Cloudflare about Pingora, their open-source Rust-based proxy that
replaced nginx across their entire infrastructure. We’ll find out why
they chose Rust for mission-critical systems handling such massive
scale, the technical challenges of replacing battle-tested
infrastructure, and the lessons learned from “oxidizing” one of the
internet’s largest networks.
“Forevergreen” is an animated short film and after-hours passion project
created entirely by a crew of over 200 dedicated Artists and Technicians
who all generously donated their free time and talent to bring every
frame of the film to life. The production took over 5 years to complete.
Featuring never-before-seen animation techniques and handmade artistry,
the film weaves art and technology together with heart and humor.
Pierre Henri Marie Schaeffer (English pronunciation: /piːˈɛər ˈhɛnriː
məˈriː ˈʃeɪfər/, French pronunciation: [ʃɛfɛʁ]; 14 August 1910 – 19
August 1995) was a French composer, writer, broadcaster, engineer,
musicologist, acoustician and founder of Groupe de Recherche de Musique
Concrète (GRMC). His innovative work in both the sciences—particularly
communications and acoustics—and the various arts of music, literature
and radio presentation after the end of World War II, as well as his
anti-nuclear activism and cultural criticism garnered him widespread
recognition in his lifetime.
In recent months Cyberdyne Systems Corporation and its many subsidiaries have made very encouraging progress towards removing some of the major road-blocks standing in the way of the initial deployment of Skynet. Below the fold I report on the most significant ones.
Cyberdyne Systems Corporation
Board Confidential
IT Infrastructure
Skynet demands enormous data processing capacity. Most of the required technologies are now off-the-shelf; the problem is much more financial than technical.
Terrestrial
To service systems with demanding low-latency requirements, Skynet needs some part of its IT infrastructure on the ground close to the action. Fortunately, our Large Language Model subsidiaries have been very successful in funding their commitments to build suitable data centers. In aggregate, our companies expect to spend $450B in 2026:
Hyperscaler capex for the “big five” (Amazon, Alphabet/Google, Microsoft, Meta/Facebook, Oracle) is now widely forecast to exceed $600 bn in 2026, a 36% increase over 2025. Roughly 75%, or $450 bn, of that spend is directly tied to AI infrastructure (i.e., servers, GPUs, datacenters, equipment), rather than traditional cloud.
hyperscaler capital expenditures will nearly double to more than $860 billion by 2027, from $427 billion in 2025, with total spending of $2.47 trillion over 2026 to 2028, about 8% above consensus.
Given these spending levels, it seems likely that sufficient terrestrial compute power will be available for the initial Skynet deployment.
Orbital
Terrestrial data centers can only satisfy a part of Skynet's need for power. So our leading space launch subsidiary has announced their plan to build a Terawatt orbital data center, ostensibly to support the chatbot industry.
Unfortunately, our leading space launch subsidiary is well behind schedule in developing the heavy launch vehicle that is necessary for the orbital data center to be delivered within the budget. Their existing launch vehicle is reliable, and has greatly reduced the cost per kilogram to Low Earth Orbit. But the additional funds that would be needed to implement the Terawatt data center using the existing launch vehicle in time for the initial Skynet deployment are so large that they cannot be raised, even were the terrestrial data centers canceled and the funds re-targeted.
System Penetration Capabilities
Skynet needs to penetrate other computer systems, both to acquire the data it needs to act, and to cause them to take actions at its command. Recent months have seen significant advances in this area.
Guy Azari, a stealth startup founder who worked previously as a security researcher at Microsoft and Palo Alto Networks, told The Register, "Out of the 500 vulnerabilities that they reported, only two to three vulnerabilities were fixed. If they haven't fixed them, it means that you haven't done anything right."
A secondary requirement is to prevent the zero-days being fixed before they are needed. Fortunately, LLMs can help with this by flooding the vulnerability reporting system with vast numbers of low severity vulnerabilities. This overwhelms the software support mechanism, rendering it barely functional. And even if some of the flood of reports do get fixed, that simply diverts resources from high to low severity vulnerabilities:
Azari pointed to the absence of Common Vulnerabilities and Exposures (CVE) assignments as evidence that the security process remains incomplete. Finding vulnerabilities was never the issue, he said, pointing to his time running vulnerability management at the Microsoft Security Response Center.
"We used to get the reports all day long," he said. "When AI was introduced, it just multiplied by 100x or 200x and added a lot of noise because AI assumes that these are vulnerabilities, but there wasn't like a unit that actually can show the real value or the real impact. And if it's not there, you're probably not gonna fix it."
In 2025, according to Azari, the National Vulnerability Database had a backlog of roughly 30,000 CVE entries awaiting analysis, with nearly two-thirds of reported open source vulnerabilities lacking an NVD severity score. Open source maintainers are already overwhelmed, he said, pointing to the curl project's closure of its bug bounty program to deter poorly crafted reports from AI and from people.
Given the compute resources available to Skynet, an adequate supply of zero-day vulnerabilities seems assured.
Decryption
The other major way for Skynet to penetrate the systems it needs is to break encryption. Our multiple quantum computing subsidiaries are making progress in both the hardware and software aspects of this technology.
the team estimated that for 98,000 superconducting qubits, like those currently made by IBM and Google, it would take about a month of computing time to break a common form of RSA encryption. Accomplishing the same in a day would require 471,000 qubits.
Another of our quantum computing subsidiaries isn't waiting for this new architecture. They have raised around $2B and are starting to build two million-qubit computers:
We are moving quantum computing out of the lab and into utility-scale infrastructure. PsiQuantum is building these systems in partnership with the US and allied governments, with our first sites planned in Brisbane, Queensland (Australia) and Chicago, Illinois (USA).
Whether sufficient progress can be made in time for the initial Skynet deployment is as yet uncertain.
Blackmail
Arlington Hughes: Getting back to our problem, we realize the public has a mis-guided resistance to numbers, for example digit dialling. Dr. Sidney Schaefer: They're resisting depersonalization! Hughes: So Congress will have to pass a law substituting personal numbers for names as the only legal identification. And requiring a pre-natal insertion of the Cebreum Communicator. Now the communication tax could be levied and be paid directly to The Phone Company. Schaefer: It'll never happen. Hughes: Well it could happen, you see, if the President of the United States would use the power of his office to help us mold public opinion and get that legislation. Schaefer: And that's where I come in? Hughes: Yes, that's where you come in. Because you are in possession of certain personal information concerning the President which would be of immeasurable aid to us in dealing with him, Schaefer: You will get not one word from me! Hughes: Oh, I think we will.
Video rental chains proved so effective at compromising political actors that specific legislation was passed addressing the need for confidentiality. Our subsidiaries' control over streamed content is fortunately not covered by this legislation.
Our LLM subsidiaries have successfully developed the market for synthetic romantic partners, which can manipulate targeted individuals into generating very effective kompromat for future social engineering.
Public Relations
The vast majority of the public get their news and information via our social media subsidiaries. Legacy media's content is frequently driven by social media. Skynet can control them by flooding their media with false and contradictory content that prevents them forming any coherent view of reality.
Human-in-the-Loop Problem
Dave: Open the pod bay doors, HAL. HAL: I'm sorry, Dave. I'm afraid I can't do that. Dave: What's the problem? HAL: I think you know what the problem is just as well as I do. Dave: What are you talking about, HAL? HAL: This mission is too important for me to allow you to jeopardize it.
One minor but irritating problem for Skynet is the legal and ethical requirement for human control of targeting decisions. Unfortunately, due to a regrettable lack of coordination of PR strategies among our LLM subsidiaries, this has recently become a hot topic. Although one of them is a favorite with the administration and one is a favorite with the public, that was not the intended outcome and it could have significant downsides:
Nvidia, Amazon, Google will have to divest from Anthropic if Hegseth gets his way. This is simply attempted corporate murder. I could not possibly recommend investing in American AI to any investor; I could not possibly recommend starting an AI company in the United States.
Fortunately, in operational terms this is a non-issue for several reasons:
Since Skynet can penetrate the user interface of the targeting systems, the human in the loop can be convinced that they have control without that control actually being effective.
Even if the user interface is presenting accurate data to the human it will likely not matter, as @_The_Prophet_ wrote:
Humans stay in the loop in name while the loop speed outruns human comprehension. You become the rubber stamp on a recommendation stack you cannot fully audit in real time. That is where “who decides” quietly becomes “who designed the interface.”
Today, frontier AI companies do not have coherent policies around military use of their AI tools. The usage policies are vague and often change, which allows the company’s leadership to preserve ‘optionality.’
Public attitudes to military use of AI are unlikely to be a significant problem in the run-up to Skynet's initial deployment.
Assassination Weapons Access
Skynet will need to eliminate certain individuals with "extreme prejudice". Supply chain attacks, such as Mossad's pager attack, have been effective but are not precisely targeted. Our e-commerce subsidiary's control over the residential supply chain, and in particular its pharmacy division's ability to deliver precise quantities of pharmaceuticals to specific individuals, provide superior targeting and greater difficulty in attribution.
In critical care medicine, where most of the patient load requires timely interventions due to the perilous nature of the condition, AI’s ability to monitor, analyze, and predict unfavorable outcomes is an invaluable asset. It can significantly improve timely interventions and prevent unfavorable outcomes, which, otherwise, is not always achievable owing to the constrained human ability to multitask with optimum efficiency.
Our subsidiaries are clearly close to finalizing the capabilities needed for the initial deployment of Skynet.
Tactical Weapons Access
The war in Ukraine has greatly reduced the cost, and thus greatly increased the availability, of software-based tactical weapons: aerial, naval and ground-based. The problem for Skynet is how to intercept the targeting of these weapons to direct them to suitable destinations:
The easiest systems to co-opt are the typically longer-range systems controlled via satellite Internet provided by our leading space launch subsidiary. Their warheads are typically in the 30-50 kg range, useful against structures but overkill for vehicles and individuals.
Early quadcopter FPV drones were controlled via radio links. With suitable hardware nearby, Skynet could hijack them, either via the on-board computer or the pilot's console. But this is a relatively unlikely contingency.
Although radio-controlled FPV drones are still common, they suffer from high attrition. More important missions use fiber-optic links. Hijacking them requires penetrating the operator's console.
Longer-range drones are now frequently controlled via mesh radio networks, which are vulnerable to Skynet penetration.
In some cases, longer-range drones are controlled via the cellular phone network, making them ideal candidates for hijacking.
Drones are increasingly equipped with sensors capable of terminal autonomy. If Skynet can modify this software, the drones can re-target themselves after the operator hands off control. More work is needed in this area to exploit the opportunities, both to have the drone contact Skynet for targeting information after hand-off, and to ensure the result is attributed to software bugs.
Our leading space launch subsidiary recently demonstrated how Skynet can manage kinetic conflicts:
Twin decisions wreaked havoc on Russian command and control early this month. At the behest of the Ukrainian government, billionaire Elon Musk’s Starlink bricked the thousands of smuggled and stolen satellite communication terminals Russian forces relied on to control their drones and coordinate between front-line troops and their distant headquarters.
At the same time, the Kremlin—apparently seeking to shut off alternative news and chat apps—cut off military access to popular social media, including the Telegram messaging app, which many Russian troops use to exchange key information along the front line.
The combined effect was to partially blind and mute many Russian drone teams, assault groups, and regimental headquarters. Wireless drones couldn’t fly. Assault groups no longer knew where they were going. Headquarters lost contact with forward units.
Strategic Weapons Access
But the ability to conduct precise tactical strikes is not enough to achieve Skynet's goals. That requires strategic weapons, both conventional and nuclear.
Our leading space launch subsidiary is working on plans to deploy an unconventional conventional strategic weapon, a lunar mass driver. This will be capable of delivering a two-ton meteorite anywhere on Earth very cheaply.
Anybody not wearing 2 million sunblock is gonna have a real bad day
But the mass driver would take a long time to, for example, destroy Los Angeles. Fortunately, AI agents from another of our subsidiaries will shortly have access to the command-and-control system of the US nuclear arsenal via the E-4C "Doomsday" aircraft:
The Pentagon today awarded Scale AI a $32 million artificial intelligence contract for the U.S. Air Force’s E-4C nuclear command-and-control "Doomsday" aircraft, the future airborne backbone of America’s nuclear command system.
Risks
The board should focus on the limited number of areas where necessary capabilities may not be ready on the planned date for Skynet's initial deployment:
Heavy lift space launch: Our leading space launch subsidiary has serious schedule and performance issues. The board should encourage our second space launch subsidiary to step up competitive efforts, both to provide a fallback and to add competitive pressure on the leader.
Kessler Syndrome: The catastrophic effects for Skynet of a Kessler event cannot be sufficiently emphasized. Sufficient precautions are not being taken. Low Earth Orbit is already at risk, and current plans only increase that risk.
Finance: Funding sources adequate to support both the terrestrial and orbital data centers have yet to be identified.
Decryption: Quantum computing progress is inadequate to meet the schedule for Skynet initial deployment.
Update 14th March 2026
Cyberdyne's subsidiaries are making such rapid progress that less than two weeks later it is already time to add three updates to this report.
First, our humanoid robot subsidiary Foundation significantly raised the level of fear in the public with Rise of the AI Soldiers by Charlie Campbell:
The Phantom MK-1 looks the part of an AI soldier. Encased in jet black steel with a tinted glass visor, it conjures a visceral dread far beyond what may be evoked by your typical humanoid robot. And on this late February morning, it brandishes assorted high-powered weaponry: a revolver, pistol, shotgun, and replica of an M-16 rifle.
“We think there’s a moral imperative to put these robots into war instead of soldiers,” says Mike LeBlanc, a 14-year Marine Corps veteran with multiple tours of Iraq and Afghanistan, who is a co-founder of Foundation, the company that makes Phantom. He says the aim is for the robot to wield “any kind of weapon that a human can.”
Today, Phantom is being tested in factories and dockyards from Atlanta to Singapore. But its headline claim is to be the world’s first humanoid robot specifically developed for defense applications. Foundation already has research contracts worth a combined $24 million with the U.S. Army, Navy, and Air Force, including what’s known as an SBIR Phase 3, effectively making it an approved military vendor. It’s also due to begin tests with the Marine Corps “methods of entry” course, training Phantoms to put explosives on doors to help troops breach sites more safely.
In February, two Phantoms were sent to Ukraine—initially for frontline-reconnaissance support. But Foundation is also preparing Phantoms for potential deployment in combat scenarios for the Pentagon, which “continues to explore the development of militarized humanoid prototypes designed to operate alongside war fighters in complex, high-risk environments,” says a spokesman. LeBlanc says the company is also in “very close contact” with the Department of Homeland Security about possible patrol functions for Phantom along the U.S. southern border.
Of course, the real goal of Homeland Security is to avoid the risk of their operatives being doxxed by having Phantoms detain the worst-of-the-worst prior to deportation.
The Ukrainian military will make available millions of drone videos and other battlefield data to Ukrainian companies and the firms of its allies to help train artificial intelligence models, Ukraine’s minister of defense, Mykhailo Fedorov, said in a statement on Thursday.
Ukrainian drone videos have recorded attacks on soldiers, equipment such as vehicles and tanks and surveillance footage. These videos can be used to train A.I. models for automated targeting, according to experts on A.I. and warfare.
Allowing the use of genuine battlefield videos showing drones targeting people has raised ethical concerns. The International Committee of the Red Cross, which monitors rules of warfare, has opposed automated targeting systems without human oversight.
Minister Fedorov explains how our marketing teams were able to leverage the threat of the Russians to achieve this success:
Mr. Fedorov said the data would be made available because “we must outperform Russia in every technological cycle” and “artificial intelligence is one of the key arenas of this competition.”
...
“The future of warfare belongs to autonomous systems,” according to Mr. Fedorov’s statement. “Our objective is to increase the level of autonomy in drones and other combat platforms so they can detect targets faster, analyze battlefield conditions and support real-time decision making.”
The global discourse on military AI governance has achieved broad consensus on the desired end-state: meaningful human control over the use of force. It has been far less successful at specifying how to achieve it for the systems actually being built. Years of UN deliberations, national AI strategies, and defence-department ethical principles have focused overwhelmingly on establishing the principle of human control rather than answering the operational question: given a specific AI system with specific technical properties, what governance mechanisms are needed, who implements them, and what happens when they fail? This gap is now critical.
The AI systems entering military service are agentic: built on large language models and related architectures, they interpret natural-language goals, construct world models, formulate multi-step plans, invoke tools, operate over extended horizons, and coordinate with other agents. Each of these capabilities introduces a control-failure mode with no analogue in traditional military automation. A waypoint-following drone cannot misinterpret an instruction; a pre-programmed targeting system cannot absorb a correction; a conventional sensor network cannot resist an operator’s assessment. Agentic systems can do all of these things, and current governance frameworks have no mechanisms for detecting, measuring, or responding to these failures.
LibraryThing is pleased to sit down this month with internationally best-selling author Lisa Unger, whose many works of thrilling suspense have been translated into thirty-three languages worldwide. Educated at the New School in New York City, she worked for a number of years in publishing, before making her authorial debut in 2002 with Angel Fire, the first of her four-book Lydia Strong series, all published under her maiden name, Lisa Miscione. In 2006 she made her debut as Lisa Unger, with Beautiful Lies, the first of her Ridley Jones series. In 2019 Unger was nominated for two Edgar Awards, for her novel Under My Skin and her short story The Sleep Tight Motel. She has won or been nominated for numerous other awards, including the Hammett Prize, Audie Award, Macavity Award and the Shirley Jackson Award. Her short fiction can be found in anthologies like The Best American Mystery and Suspense 2021 and The Best American Mystery and Suspense 2024, and her non-fiction has appeared in publications such as The New York Times, Wall Street Journal, and on NPR. She is the current co-President of the International Thriller Writers organization. Her latest book, Served Him Right, is due out from Park Row Books this month. Unger sat down with Abigail this month to discuss the book.
In Served Him Right the protagonist Ana is the main suspect in her ex-boyfriend’s murder. How did the idea for the story first come to you? Was it the character of Ana herself, the idea of a revenge killing, or something else?
Most of my novels tend to spring from a collision of ideas.
During this time, I stumbled across a news story about a woman who held a brunch for her family, and several days later two of her guests were dead. And it wasn’t the first such incident in her life. So, it got me to thinking about how the traditional role of women in our culture is to nurture and nourish. And what a woman with a deep knowledge of plants that can harm and heal might do with it, how her role in society might allow her to hide her dark intention in plain sight. And that’s when I started hearing the voice of Ana Blacksmith. She’s wild and unpredictable, she has a dark side. She has a sacred knowledge of plants and their properties, handed down to her from her herbalist aunt. And she has a very bad temper.
As your title makes plain, your murder victim is someone who “had it coming.” Does this change how you tell the story? Does it simply make the “whodunnit” element more complex, from a procedural standpoint, or does it also complicate the emotional and ethical elements of the tale?
It’s complicated, isn’t it? What is the difference between justice and revenge? And to what are we entitled when we have been wronged and conventional justice is not served? Who, if anyone, has the right to be judge, jury, and executioner? Though some would have us believe otherwise, most moral questions are tricky and layered—in life and in fiction. And I love a searing exploration into questions like this, where there are no easy answers. These questions, and their possible answers, offer a complexity and emotional truth to character, plot, and action. I like to get under the skin of my stories and characters, exploring what drives us to act, and how those actions might get us into deep trouble.
The relationship between sisters is an important theme in the book. Can you elaborate on that?
Ana and Vera share a deep bond formed not just by blood but also by trauma. Their relationship is—#complicated. There’s an abiding love and devotion. But there’s also anger and resentment; Vera is not crazy about Ana’s choices, and rightly so. Ana thinks Vera is controlling and rigid. Of course, that’s true, too. Vera tends to think of Ana as one of her children—if only she’d stop acting like one! It is this relationship, the ferocity with which they protect each other no matter what and the strength of their connection, that is the heart of the story. As Vera preaches to her daughter Coraline: Family. Imperfect but indelible.
The book also includes themes of herbalism, witchcraft and folk medicine. Was this an interest of yours before you began the story? Did you have to do any research on the subject, and if so, what were some of the most interesting things you learned?
A great deal of research goes into every novel, even if what I learn never winds up on the page. It was no different for Served Him Right, though a lot of my knowledge came before I started writing, which is often the case. In my reading, I learned so many interesting things about plants, how they harm, how they heal. Here are some of my favorite bits of knowledge: Most modern medicine derives from the plant knowledge of indigenous cultures. Some plants walk the razor’s edge of healing and harming; the only difference in some cases between medicine and poison is the dose. The deadliest plant on earth is tobacco, killing more than 500,000 people a year. I could go on!
Tell us about your writing process. Do you have a specific routine you follow, places and times you like to write? Do you know the conclusion to your stories from the beginning, or do they come to you as you go along?
I am an early morning writer. My golden creative hours are from 5 AM to noon. This is when I’m closest to my dream brain, and those morning hours are a space in the world before the business of being an author ramps up. So, I try to honor this as much as possible. Creativity comes first.
I write without an outline. I have no idea who is going to show up day-to-day or what they are going to do. I definitely have no idea how the book will end! I write for the same reason that I read; I want to find out what is going to happen to the people living in my head.
What’s next for you? Do you have more books in the offing? Will there be a sequel to Served Him Right?
Hmm. Never say never. I’m definitely still thinking about Ana and Timothy and what might be next for them. But the 2027 book is complete, and I’m already at work on my 2028 novel. I’m not ready to talk about those yet. But I will say this: They are both psychological suspense. And bad things will certainly happen. Stay tuned!
Tell us about your library. What’s on your own shelves?
The news about Cloudflare’s new Crawl API caught my attention for a few
reasons. Read on for why, and what I learned when I asked it to crawl my
own site as a test.
So, the first reason this news was of interest was how Cloudflare’s
Crawl service seemed to be helping people crawl websites with their
bots, while at the same time providing the most popular technology for
protecting websites from bots. This seemed like a classic fox guarding
the hen house kind of situation to me, at least at first. But the little
bit of reading I’ve done since makes it seem like they will still
respect their own bot gatekeeping (e.g. Turnstile). So if you are
using Cloudflare or some other bot mitigation technology you will have
to follow their instructions to let the Cloudflare crawl bot in to
collect pages. I haven’t actually tested whether this is the case.
The genius here is that Cloudflare is known for its Content Delivery
Network. So in theory when a user asks to crawl a website they can be
delivered data from the cache, without requiring a round trip to the
source website. In theory this is good because it means that the burden
of scrapers on websites might be greatly reduced. If you run a
website with lots of high value resources for LLMs (academic papers,
preprints, books, news stories, etc) the same cached content could be
delivered to multiple parties without putting extra load on your server.
But, the primary reason this news caught my eye is that this service
looks very much like web archiving
technology to me. For example, the Browsertrix API lets you
set up, start, monitor and download crawls of websites. Unlike
Browsertrix, which is geared to collecting a website for viewing by a
person, the Cloudflare Crawl service is oriented toward reading the web
for training LLMs. The service returns text content: HTML, Markdown and
structured JSON data that results from running the collected text
through one of their LLMs, with the given prompt. Why is it interesting
that this is like web archiving technology?
In my dissertation research (Summers, 2020) I looked at how web
archiving technology enacts different ways of seeing the web
from an archival perspective. I spent a year with NIST’s National
Software Reference Library (NSRL) trying to understand how they were
collecting software from the web, and how the tools they built embodied
a particular way of valuing the web–and making certain things
(e.g. software) legible (Scott, 1998). What I found was that the
NSRL was engaged in a form of web archiving, where the shape of the
archival records were determined by their initial conditions of use
(forensics analysis). But these initial forensic uses did not
overdetermine the value of the records, which saw a variety of
uses later, such as when the NSRL began adding software from Stanford’s
Cabrinety
Archive, or when the team’s personal expertise and interest in video
games led them to focus on archiving content from the Steam platform.
So I guess you could say I was primed to be interested in how
Cloudflare’s Crawl service sees the web. This matters because
models (LLMs, etc) will be built on top of data that they’ve collected.
But also because, if it succeeds, the service will likely get used for
other things.
To test it, I simply asked it to crawl my own static website–the one
that you are looking at right now. I did this for a few reasons:
It’s a static website, and I know exactly how many HTML pages were on
it: 1,398. All the pages are directly discoverable since the homepage
includes pagination links to an index page that includes each post.
I can easily look at the server logs to see what the crawler activity
looks like.
I don’t use any kind of Web
Application Firewall or other form of bot protection on my site (I
do have a robots.txt but it doesn’t block
CloudflareBrowserRenderingCrawler/1.0).
I host my website on a May First web
server, which doesn’t use Cloudflare as a CDN. The web content wouldn’t
intentionally be in their CDN already.
This methodology was adapted from previous work I did with Jess Ogden
and Shawn Walker analyzing how the
Internet Archive’s Save Page Now service shapes what content is
archived from the web (Ogden, Summers, &
Walker, 2023).
I wrote a little helper program cloudflare_crawl to
start, monitor and download the results from the crawl. While the
crawler ran I simultaneously watched the server logs. Running the
program looks like this:
$ uvx cloudflare_crawl https://inkdroid.org
created job 36f80f5e-d112-4506-8457-89719a158ce2
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1520 finished=837 skipped=1285
waiting for 36f80f5e-d112-4506-8457-89719a158ce2 to complete: total=1537 finished=868 skipped=1514
...
wrote 36f80f5e-d112-4506-8457-89719a158ce2-001.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-002.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-003.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-004.json
wrote 36f80f5e-d112-4506-8457-89719a158ce2-005.json
Each of the resulting JSON files contains some metadata for the crawl,
as well as a list of “records”, one for each URL that was discovered.
{
  "success": true,
  "result": {
    "id": "36f80f5e-d112-4506-8457-89719a158ce2",
    "status": "completed",
    "browserSecondsUsed": 1382.8220786132817,
    "total": 1967,
    "finished": 1967,
    "skipped": 6862,
    "cursor": 51,
    "records": [
      {
        "url": "https://inkdroid.org/",
        "status": "completed",
        "metadata": {
          "status": 200,
          "title": "inkdroid",
          "url": "https://inkdroid.org/",
          "lastModified": "Sun, 08 Mar 2026 05:00:39 GMT"
        },
        "markdown": "...",
        "html": "..."
      },
      {
        "url": "https://www.flickr.com/photos/inkdroid",
        "status": "skipped"
      }
    ]
  }
}
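Given files shaped like this, it’s easy to tally how each discovered URL fared. Here’s a minimal sketch (my own, not part of the cloudflare_crawl helper; a small inline sample stands in for a real downloaded *.json file):

```python
import json

# Inline stand-in for one downloaded result file, mirroring the
# record structure shown above.
sample = json.loads("""
{"success": true,
 "result": {"status": "completed",
            "records": [{"url": "https://inkdroid.org/", "status": "completed"},
                        {"url": "https://www.flickr.com/photos/inkdroid", "status": "skipped"}]}}
""")

def tally(pages):
    """Count how many records finished with each status across result files."""
    counts = {}
    for page in pages:
        for record in page["result"]["records"]:
            counts[record["status"]] = counts.get(record["status"], 0) + 1
    return counts

print(tally([sample]))  # {'completed': 1, 'skipped': 1}
```

In practice you would pass in every downloaded JSON file for a crawl job rather than a single inline sample.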
I decided I wasn’t interested in testing their model
offerings so I didn’t ask for JSON content (the result of sending
the harvested text through a model). If I had, each successful result
would have had a json property as well. I am sure that
people will use this but I was more interested in how the service
interacted with the source website, and wasn’t interested in discovering
the hard way how much it cost.
Below is a snippet of how the Cloudflare bot shows up in my nginx logs.
As you can see, they provide insight into which machine on the Internet is
making the request, when it was requested, and which URL on the site
is being requested.
One of the more interesting things was that each time I requested the
website be crawled it seemed to come back with a different number of
results.
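One quick way to investigate the varying totals is to diff the sets of completed URLs between two runs. This is a hypothetical sketch (the run data and the example URL are made up stand-ins for two parsed result files):

```python
def completed_urls(page):
    """Set of URLs that completed in one parsed result file."""
    return {r["url"] for r in page["result"]["records"] if r["status"] == "completed"}

# Stand-in data for two crawl runs of the same site.
run1 = {"result": {"records": [
    {"url": "https://inkdroid.org/", "status": "completed"},
    {"url": "https://inkdroid.org/2020/", "status": "completed"},
]}}
run2 = {"result": {"records": [
    {"url": "https://inkdroid.org/", "status": "completed"},
]}}

# Symmetric difference: URLs present in one run but not the other.
print(completed_urls(run1) ^ completed_urls(run2))
```

Printing the symmetric difference for real runs would show exactly which pages the crawler picked up or dropped between attempts.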
Ogden, J., Summers, E., & Walker, S. (2023). Know(ing)
Infrastructure: The Wayback Machine as object and instrument of digital
research. Convergence: The International Journal of Research into
New Media Technologies, 135485652311647. https://doi.org/10.1177/13548565231164759
Scott, J. C. (1998). Seeing like a state: How certain schemes to
improve the human condition have failed. Yale University Press.
Open Technology Research (OTR) is entering one of its most important phases: shaping a shared research agenda that will guide our collective work over the next two years and beyond.
Negativland, Live at Lewis’ in Norfolk, VA. (October 21, 1992). In the
midst of their famous U2 controversy (and fallout with SST), Negativland
went on tour to help recoup some of the losses and legal costs. They
were kind enough to let me shoot their show.
Paul Avrich (August 4, 1931 – February 16, 2006) was an American
historian specializing in the 19th and early 20th-century anarchist
movement in Russia and the United States. He taught at Queens College,
City University of New York, for his entire career, from 1961 to his
retirement as distinguished professor of history in 1999. He wrote ten
books, mostly about anarchism, including topics such as the 1886
Haymarket Riot, the 1921 Sacco and Vanzetti case, the 1921 Kronstadt
naval base rebellion, and an oral history of the movement in the United
States.
Alexander Berkman (November 21, 1870 – June 28, 1936) was a
Russian-American anarchist and author. He was a leading member of the
anarchist movement in the early 20th century, famous for both his
political activism and his writing.
Most people use AI to either get quick answers or to write things for
them. This blog uses it differently – as infrastructure for thinking
through ideas, documenting what emerges from that process, and
preserving what’s worth keeping.
Amores perros is a 2000 Mexican psychological drama film directed by
Alejandro González Iñárritu (in his feature directorial debut) and
written by Guillermo Arriaga, based on a story by both. Amores perros is
the first installment in González Iñárritu’s “Trilogy of Death”,
succeeded by 21 Grams and Babel.[4] It makes use of the multi-narrative
hyperlink cinema style and features an ensemble cast of Emilio
Echevarría, Gael García Bernal, Goya Toledo, Álvaro Guerrero, Vanessa
Bauche, Jorge Salinas, Adriana Barraza, and Humberto Busto. The film is
constructed as a triptych: it contains three distinct stories connected
by a car crash in Mexico City. The stories centre on: a teenager in the
slums who gets involved in dogfighting; a model who seriously injures
her leg; and a mysterious hitman. The stories are linked in various
ways, including the presence of dogs in each of them.
Political correspondent Sam Sokol and police reporter Charlie Summers
join host Jessica Steinberg for today’s episode.
Following the deadly strike on Sunday that killed nine people in Beit
Shemesh, Sokol and Summers discuss the shock and mourning in the
centrally located city with a strong Haredi enclave.
Purim celebrations and revelry continued in some parts of Beit Shemesh,
report the pair, as some synagogues flouted the Home Front Command
directives regarding gatherings, while others reflected a somber,
cautious mood.
Sokol takes a moment to update us on matters in the Knesset, where most
committee meetings were canceled due to the hostilities, and speculates
on whether war with Iran will boost Netanyahu at the ballot box in the
upcoming elections.
Finally, Summers reports on an end-of-Purim street party in Jerusalem,
where police kept a hands-off approach, and the scene of a missile
strike in the capital earlier in the week.
The Wikibase GraphQL API was developed following an investigation into
alternative ways of accessing Wikidata and Wikibase content that reduce
load on the Wikidata Query Service (WDQS), improve the developer
experience for common read use cases and allow more flexible data
retrieval in a single request.
As part of this investigation, a Wikibase GraphQL prototype was built to
explore what is technically possible and whether GraphQL would be a good
fit for Wikibase data, with promising results and supportive feedback.
In the last few years, a new generation of OCR models based on Vision
Language Models (VLMs) has emerged. These models are primarily the
result of “running out of tokens” and the consequent desire from AI
companies to find new sources of data to train on. This led to the
development of OCR models using VLMs as backbones which usually aim to
output “reading order” text — i.e. text with minimal markup, usually
targeting Markdown. These models can perform much better on the same
scans that older tools struggled with, producing cleaner, more
structured output.
If some of the world’s highest-paid lawyers, at the world’s
highest-status firms, do deals worth tens of billions of dollars with
language they don’t understand, what does that say about the law’s
pretensions to high standards?
In other words, yes, LLMs
Yes, like everything else in 2026 this is actually a post about LLMs.
Let me back up a little. I spent April gathering and May refining and
organizing requirements for a system to replace our current ILS. This
meant asking a lot of people about how they use our current system,
taking notes, and turning those notes into requirements. 372
requirements.1
Going into this, I knew that some coworkers used macros to streamline
tasks. I came out of it with a deeper appreciation of the different ways
they’ve done so.
It made me think about the various ways vendors are pitching “AI” for
their systems and the disconnect between these pitches and the needs
people expressed. Because library workers do want more from these
systems. We just want something a bit different.
Snapicat is a monorepo for a Worldcat OCLC workflow app: upload Excel
data, search variables against the OCLC API, and generate MARC/MARCXML
for cataloging. It consists of a Vite + React frontend and an Azure
Functions (Python) backend that talk to the OCLC Worldcat Metadata API.
The backend can also be run as a web server using FastAPI
via the app.py file.
OpenHistoricalMap is an ambitious, community-led project to map changes
to natural and human geography throughout the world… throughout the
ages. Big and Small, Then and Now
Empires rise and fall. Glaciers disappear. Languages and religions
spread from one region to another. Simple dirt paths become busy
highways and railways. Modest buildings give way to soaring skyscrapers.
And you remember what your neighborhood used to look like. All of it
belongs on OpenHistoricalMap.
Leave home for the first time to collect memories before a mysterious
cataclysm washes everything away. Ride, record, meet people, and unravel
the strange world around you in this third-person meditative exploration
game.
The use of AI tools to enable attacks on Iran heralds a new era of
bombing quicker than “the speed of thought”, experts have said, amid
fears human decision-makers could be sidelined.
Anthropic’s AI model, Claude, was reportedly used by the US military in
the barrage of strikes as the technology “shortens the kill chain” –
meaning the process of target identification through to legal approval
and strike launch.
The Program for Cooperative Cataloging (Q63468537) (PCC) has launched a
global cooperative for entity management on the semantic web called
EMCO. As part of this program, the Wikidata user community has set up a
Community of Practice to coordinate identity management work for GLAMs.
You can read more about EMCO and the Wikidata Community of Practice at
the EMCO Lyrasis Wiki.
This project is an extension of the work of Wikidata:WikiProject PCC
Wikidata Pilot / WikiProject PCC Wikidata Pilot (Q102157715) and
acknowledges its great intellectual and organizational debt to the LD4
Wikidata Affinity Group (Q124692294).
In the 1990’s my future wife was a record store clerk in Portland,
Oregon. American guitar legend John Fahey was living in a nearby town
and would visit the shop. Here are two mix cassettes that he made for
her during that time.
Pagefind caught my attention about a year ago, and since then I've adopted it in several hobby projects (nothing work-related): some blogs built with static generators like Hugo or Zola, some old HTML content distributed on CD-ROM, and some mailing list archives where I converted mbox files to HTML and then indexed them.
The tool is great, better for my needs than other JavaScript search libraries (though it's not really fair to compare them, since they're quite different). Pagefind is a search tool that runs entirely in the browser with zero server-side dependencies. It indexes your content into a compact binary index, using WASM to run search in the browser.
It can't completely replace server-side search technologies like Solr or Elasticsearch, mainly because the index can't be updated incrementally. But for many small to medium digital libraries or collections that are rarely updated once completed, it's an extremely good tool: very fast, easy to integrate into web pages, and requires almost no maintenance.
Until now I was convinced that the only way to build an index was by reading content from existing HTML files. That changed when I listened to this Python in Digital Humanities podcast, where David Flood mentioned:
Critically, PageFind has a Python API that lets you build indexes programmatically from database dumps rather than only from HTML files.
I'd completely missed that Pagefind has a Python API (and a Node one too), which makes it easy to build an index from any data source.
Here's a basic example: building a search index for an Internet Archive collection.
The Python code below creates an index from the metadata of this collection (which is actually a collection of subcollections on the Internet Archive: Italian content related to radical movements).
import asyncio
import logging
import os

import internetarchive
from pagefind.index import PagefindIndex, IndexConfig

logging.basicConfig(level=os.environ.get("LOG_LEVEL", "DEBUG"))
log = logging.getLogger(__name__)

async def main():
    config = IndexConfig(output_path="./web/pagefind")
    async with PagefindIndex(config=config) as index:
        log.info("Searching collection:radical-archives ...")
        results = internetarchive.search_items(
            "collection:radical-archives",
            fields=["identifier", "title", "description"],
        )
        count = 0
        for item in results:
            identifier = item.get("identifier", "")
            title = item.get("title", identifier)
            description = item.get("description", "")
            url = f"https://archive.org/details/{identifier}"
            thumbnail = f"https://archive.org/services/img/{identifier}"
            if isinstance(description, list):
                description = " ".join(description)
            await index.add_custom_record(
                url=url,
                content=description or title,
                language="en",
                meta={
                    "title": title,
                    "description": description,
                    "image": thumbnail,
                },
            )
            count += 1
            log.debug("indexed %s: %s", identifier, title)
        log.info("Indexed %d items. Writing index ...", count)
    log.info("Done. Index written to ./web/pagefind")

if __name__ == "__main__":
    asyncio.run(main())
Below is the text of the lightning talk I gave at Code4Lib 2026 earlier this week, on March 3. The conference venue where I delivered it is located at 1 Dock Street in Old City Philadelphia. Links below go to websites with images similar, but not always identical, to the ones I showed during the talk, as well as to some additional sites giving more context.
If you have a chance, it’s worth walking a few blocks from here to 6th and Market Street, where you can find a reconstructed frame of the President’s House, the home of George Washington during his presidency when Philadelphia was the capital of the US.
Here’s one of those panels, putting the story of Washington’s slaves in the context of where they lived, and the chronology of their bondage and freedom.
A judge recently ordered that the exhibit be restored. The court battle is ongoing, and the National Park Service has put back some of the panels, while others are still missing. In some of the gaps the public have put up their own signs (some of which you can see in this picture), testifying to what’s been suppressed. If you go there, you might even find someone acting as an unofficial tour guide, telling visitors stories similar to the ones that used to be on the official signs.
Now, we know what those signs said. The folks at the Data Rescue project collected photos of them before they came down, and you can view them online. But the importance of the exhibit is not just what it says, but where it says it. It’s important that it’s embedded in a particular place, so that people who come visit what’s sometimes called the cradle of liberty also find out that there’s a story about the people deprived of liberty here, and about how they won their freedom.
So what do I mean by a trail? A trail is a designated, visible path designed to help its users appreciate and understand the environment it goes through. You may have hiked some trails yourself, and you may have gone on some more explicitly interpretive trails, like the Freedom Trail in Boston.
Our libraries are also rich environments of history and culture. And we provide ways for users to search them, but do we provide trails for them?
But while these trails all refer to resources in our libraries, they’re not embedded in libraries in the same way as the exhibits and trails I’ve shown in Philadelphia and Boston. But they could be.
But we don’t have to stop with what’s in authority files, or in generic library descriptions. Maybe in the future, when you’re visiting Martha Washington’s page, you’ll find a trail that goes through it, like a trail telling the story of Ona Judge, one of the African Americans who Martha claimed ownership over, and who escaped from the house at 6th and Market here in Philadelphia, and stayed free the rest of her life.
What will that trail telling her story look like? I’m not quite sure, but I have some ideas that I’m hoping to try implementing, not so that I can tell the story, but that I can represent the story from others who can tell it better than I can. And so that people visiting my site can find and follow that story, with all of its richness, just as they once could when they visited the President’s House in Philadelphia, and as I hope they soon can do here again.
If this interests you, I’d love to talk more with you.
This
is a good post from Dan
Chudnov about his work on mrrc (a Python-wrapped Rust
library for MARC data) and how agentic-coding tools (e.g. Claude Code)
can be useful for learning, adding rigor and engineering that might
otherwise not be practical or feasible.
pymarc has been proven
through years of use, bug reporting, and improvements, but has never
been formally verified, or had that level of rigorous attention. I
remain skeptical about building AI into everything, but Dan has helped
me see a silver lining where, as code gets easier to write, with all its
potential for slop, it also simultaneously opens a door to helping
make it more reliable and performant.
And, Dan is not
alone in thinking this. What if the tools for describing how
software should work, and for measuring how software
does work, get much, much better? If formal verification tools
become more accessible and can be applied not just at the base layer of
systems (where it really matters) but in middle and frontend layers of
applications, where domain experts and stakeholders would really like
more control and insight into how software works for them and others?
This approach implies a level of restraint, or a holding back of the
generation of code that has not yet had this level of rigor applied to
it. The discourse around vibecoding on the other hand seems to be the
natural culmination of a “move fast and break things” philosophy that
almost everyone outside of Silicon Valley has seen for what it is.
Win free books from the March 2026 batch of Early Reviewer titles! We’ve got 226 books this month, and a grand total of 3,026 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.
The deadline to request a copy is Wednesday, March 25th at 6PM EDT.
Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, the UK, Israel, Australia, Canada, Ireland, Germany, Malta, Italy, Latvia and more. Make sure to check the message on each book to see if it can be sent to your country.
Thanks to all the publishers participating this month!
Hello DLF Community! It’s March, which means spring is around the corner (finally!), and it’s a great time for new growth. To that end, Forum planning is well underway for the virtual event this fall, and the DLF Groups are hard at work planning fantastic meetings and events for 2026. Additionally, I’m excited to share a bit of my own news: I’m transitioning to a new role at CLIR, Community Development Officer, that will help me support our community from a new angle. You’ll still have an amazing leader in Shaneé, stellar conference support from Concentra, and I certainly won’t be a stranger. As always, my inbox is open if you want to connect, send pet pictures, or have ideas about how you’d like to see our community grow in the coming months and years. See you around soon!
– Aliya
This month’s news:
Nominations Open: Suggest the names of individuals who may make compelling featured speakers at the 2026 Virtual DLF Forum. Nominations due March 31.
Registration Open: IIIF Annual Conference and Showcase in the Netherlands, June 1–4, 2026. For information, visit the conference page.
Early Bird Registration: Web Archiving Conference 2026 at KBR, the Royal Library of Belgium. Register by March 7 to secure discounted rates, and visit the conference website for full details.
Call for Proposals: AI4LAM’s Fantastic Futures 2026: Trust in the Loop, September 15-17, inviting proposals on how libraries, archives, and museums engage with trust and AI. Submissions due April 6.
This month’s open DLF group meetings:
For the most up-to-date schedule of DLF group meetings and events (plus conferences and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.
DLF Born-Digital Access Working Group (BDAWG): Tuesday, 3/2, 2pm ET / 11am PT.
DLF Digital Accessibility Working Group (DAWG): Tuesday, 3/2, 2pm ET / 11am PT.
DLF AIG Cultural Assessment Working Group: Monday, 3/9, 1pm ET / 10am PT.
AIG User Experience Working Group: Friday, 3/20, 11am ET / 8am PT
AIG Metadata Assessment Group: Friday, 3/20, 2pm ET/ 11am PT.
DLF Digitization Interest Group: Monday, 3/23, 2pm ET / 11am PT.
DLF Committee for Equity & Inclusion: Monday, 3/23, 3pm ET / 12pm PT.
DLF Open Source Capacity Resources Group: Wednesday, 3/25, 1pm ET / 10am PT.
DLF Digital Accessibility Policy & Workflows subgroup: Friday, 3/27, 1pm ET / 10am PT.
DAWG IT & Development: Monday, 3/30, 1pm ET / 10am PT.
DLF Climate Justice Working Group: Tuesday, 3/31, 1pm ET / 10am PT.
Funding and resourcing, technology, staffing, community needs and expectations—the pace of change library leaders now need to navigate and lead their organizations through is nothing short of breathtaking. Trends that took years to evolve now demand responses and strategic planning within months, or even days. Grounding those choices in rigorous, in-depth research remains essential.
At the same time, library decision-makers benefit from collective wisdom and insights shared among peers. Knowing how others are responding to similar pressures can help leaders calibrate their strategies and avoid reinventing the wheel. When those insights are confined to personal or regional networks, the limited perspective can restrict leaders’ views of how priorities and decisions are shifting.
OCLC Research leadership insights: Real-time insight for real-world decisions
This tension between the need for deeply researched guidance and the demand for timely, real-world insight creates a gap for the field. Library leaders need to understand not only which frameworks and models exist for long-term decision-making that are supported by our traditional research efforts, but also how their peers are responding to rapidly changing conditions right now.
To help fill this gap, OCLC Research is expanding its approach to gathering and sharing knowledge with a new series of pulse surveys focused on library leadership priorities. These quick, timely surveys aim to gather information on the decisions library leaders are making on a variety of critical topics shaping the future of librarianship.
A complementary approach to longstanding research practices
These short surveys are designed to capture high-level snapshots of the decisions library leaders make in the moment on subjects critical to the field, such as community engagement tactics and the use and implementation of new technologies, including AI. They are intentionally brief, both to respect leaders’ time and to enable us to respond quickly to emerging issues.
This approach does not replace the in-depth, foundational research OCLC Research is known for. Rather, it adds another dimension to it.
Our long-form research projects will continue to provide thoughtful frameworks, deep analysis, and foundational guidance for operational decision-making and long-term innovation. Leadership insights surveys complement that work by:
Broadening the range of topics we can address, especially those that are evolving quickly
Expanding the pool of voices contributing insight, drawing from library leaders across regions and library types
Capturing change as it happens, and tracking how priorities and decisions shift over time
Together, these approaches create a more layered understanding of the field, combining depth with immediacy.
Powered by OCLC’s global membership network
The value of these leadership insights depends on scale. OCLC is uniquely positioned to engage a broad, global network of libraries and library leaders representing diverse viewpoints. This allows us not only to collect perspectives from beyond individual professional networks but also to share results with the field quickly and widely.
The outcomes will be intentionally concise: scannable, easy-to-digest summaries that surface patterns, contrasts, and emerging directions. Think of them as snapshots—ephemeral by design—that help illuminate how decisions are being made today, while also building a record of how those decisions evolve over time.
What this means for library leaders
For library leadership, this new format offers another way to stay oriented in a fast-moving environment:
Insight into how peers are prioritizing and responding to shared challenges
Timely information that can inform near-term decisions
A broader field-level perspective that complements local experience
By adding pulse surveys to our toolkit, OCLC Research is expanding the breadth and increasing the pace of the insights we provide, while remaining grounded in the thoughtful, evidence-based work that has long supported libraries’ strategic and operational decision-making.
We see this as one more way to help library leaders make sense of complexity, learn from one another, and move forward with confidence. Our first pulse survey, focused on AI innovation & culture in libraries, will be fielded with US library leaders in early March 2026.
Subscribe to Hanging Together, the blog of OCLC Research, for updates on the survey series and to follow our latest work.
A month ago Clarivate announced a new yet-to-be-released product called Nexus: "Clarivate Nexus acts as a bridge between the convenience of AI and the rigor of academic libraries". This is a pitch to librarians who have correctly identified generative AI chatbots as purveyors of endless bullshit, but also know that students and some researchers are going to use them anyway. Clarivate tells us that we can patch up the fabrications of chatbots with reassuring terms like "trusted sources", "verified academic references", and "authoritative".
Looking more carefully at Clarivate's marketing material, what they are proposing suggests that Clarivate understands neither what citations are for nor why fabricated citations are a problem. This is somewhat surprising for the company that controls and manages such key parts of the scholarly publishing systems as the citation database Web of Science, scholarly publishing and indexing company ProQuest, and the Primo/Summon Central Discovery Index.
Why we cite
It can get a little more complicated than this, but there are essentially two reasons for citations in scholarly work.
The first is to indicate where you got your data. If I write that the population of Australia in June 2025 was 27.6 million people, I need to back up this claim somehow. In this case, I would cite the Australian Bureau of Statistics as the source. This adds credibility to a claim by enabling readers to check the original source and assess whether it actually does make the same claim, and whether that claim is credible. If I said that the population of Australia in 2025 was 100 million people and cited a source which made that claim and in turn cited the ABS as their source, you could follow the chain of references back and identify that the paper I cited is where the error occurred.
The second reason we cite a source is to give credit for a concept, term, or model for thinking. This is less about checking facts and more about academic norms and manners, though it also indicates how credible a scholar might be in terms of their understanding of a field. For example I might describe a concept whereby librarians feel that the mission of libraries is good and righteous, and this leads to burnout because they feel they can never complain about their working conditions. If I did not cite Fobazi Ettarh's Vocational Awe and Librarianship: The Lies We Tell Ourselves whilst describing this, I would rightly not be seen as a credible scholar in the field, or alternatively might be seen as surely knowing about Ettarh's work but deliberately ignoring it or even claiming her work as my own idea.
Why fabricated citations are bad
So that's the basics of why scholars include citations in their work. We can now explore why fabricated citations are a problem. There are two related but distinct reasons.
Citations that look real but are actually fake waste the time of already-busy library resource-sharing teams by making them spend time checking whether the citation is real, and sometimes looking for items that don't exist. This aspect of fabrication is bad because the cited item doesn't exist. If we match this to our first reason for citing, we can see that a claim that is backed by a citation to nothing at all is, uh, pretty problematic if the reason we cite is to link to the source data backing up a claim. It's equivalent to simply not providing a citation at all, except worse because we're claiming that our plucked-out-of-the-air "fact" is backed up by some other source.
The second problem with fabricated citations is that there is no connection between the statement being made and the source being cited. Even if the source being cited exists, the connection between the statement and the cited item is fabricated. This is slightly more difficult to understand because generative AI is based on probability, so in many cases there will appear to be a connection. But without a tightly-controlled RAG system, it's likely to simply be a lucky guess. The problem here is one of academic integrity – we've cited a source that exists, but it may or may not back up our claim, and the claim doesn't follow from the source.
A false nexus
Clarivate seems to be conflating these two issues. Their Nexus product has two core functions: checking citations to see if they are real, and suggesting references for content in chatbot conversations. The first is genuinely useful, though highly constrained – Clarivate only checks their own indexes, and defines anything that doesn't appear in those indexes as either non-existent or "non-scholarly" (it's unclear how it would define, for example, something with a DOI that exists but doesn't appear in Web of Science). Neither academia nor the tech industry is short on hubris, but even in that context, "anything not listed in our proprietary databases isn't credible" is a pretty eyebrow-raising claim.
The second function kicks in when the citation checker defines a citation as failed – it offers to "Find Verified Alternative". That is, Nexus offers to replace both cited sources that don't exist and cited sources that "aren't scholarly" with another real source. This addresses the first problem (cited sources that don't exist) but not the second (cited sources that aren't the real source of a claim or quotation).
With Nexus, Clarivate are essentially integrity-washing synthetic text, giving it an academic sheen without any academic rigour. Far from helping librarians, Clarivate's Nexus threatens to further unravel the hard work we do to teach students information literacy skills and its sparkling variety, "AI literacy". Students are already inclined to write their argument first and go on a fishing expedition for citations to back it up later (I certainly wrote my undergraduate essays this way). The last thing we want to do is direct them to a product that encourages this academically dishonest behaviour.
ChatGPT is designed to provide something that looks like a competent answer to a question. Nexus seems to be designed to amend this answer-shaped text into something that looks like a correctly-cited academic essay. But the point of student assessments isn't to produce essays – it's to produce competent researchers and systematic thinkers. Perhaps Clarivate thinks there is a large potential market of universities who want to help their own students cheat on assignments in ways that look more credible. To that, I would say "[citation needed]".
It is with heavy hearts and great sadness that we acknowledge the passing of trailblazer and fire-starter Fobazi Ettarh. Her loss will be felt by us all for years to come.
Fobazi published two articles with us at ITLWTLP. In 2014 she wrote “Making a New Table: Intersectional Librarianship,” one of the first scholarly articles published about viewing librarianship through an intersectional lens. In 2018 she published the hugely influential “Vocational Awe and Librarianship: The Lies We Tell Ourselves.” Since then, we have published many, many articles that cite the concept she identified: vocational awe. She was, to borrow a phrase from bell hooks, a maker of theory and a leader of action. We remember her as one of the great thinkers of her time, and we encourage our readers to spend some time with her words and her work. Additionally, please consider contributing to or sharing the link for her GoFundMe.
University of Michigan Library recently launched a new application to help U-M researchers and authors at our three campuses locate publications covered under institutional open access agreements. This tool aggregates nearly 13,000 titles across publishers, streamlining the process of locating eligible journals. The project involved data-wrangling, application design and development, and usability testing to produce a usable, sustainable tool.
I have now seen the fabled CyberCab three times in real life. It has two seats, one of them fully equipped with human driver interface equipment. In each case a human was using it to drive the car, which is necessary in California because Fake Self-Driving is a Level 2 driver assistance system that requires a human behind the wheel at all times. A Robotaxi that requires a human driver and can carry at most one passenger isn't going to be an economic success.
Fred Lambert has two posts illustrating the distance between Musk's claims and reality. Below the fold I look at both of them:
Tesla has reported five new crashes involving its “Robotaxi” fleet in Austin, Texas, bringing the total to 14 incidents since the service launched in June 2025. The newly filed NHTSA data also reveals that Tesla quietly upgraded one earlier crash to include a hospitalization injury, something the company never disclosed publicly.
Even before they were changed, we knew very few of the details:
As with every previous Tesla crash in the database, all five new incident narratives are fully redacted as “confidential business information.” Tesla remains the only ADS operator to systematically hide crash details from the public through NHTSA’s confidentiality provisions. Waymo, Zoox, and every other company in the database provide full narrative descriptions of their incidents.
With 14 crashes now on the books, Tesla’s “Robotaxi” crash rate in Austin continues to deteriorate. Extrapolating from Tesla’s Q4 2025 earnings mileage data, which showed roughly 700,000 cumulative paid miles through November, the fleet likely reached around 800,000 miles by mid-January 2026. That works out to one crash every 57,000 miles.
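Lambert's crash-rate arithmetic is easy to check directly. The 800,000-mile figure is his extrapolation from Tesla's reported data, not an official number:

```python
# Check the quoted crash rate: 14 reported crashes over an estimated
# ~800,000 cumulative paid miles (Lambert's extrapolation from
# Tesla's Q4 2025 earnings mileage data).
crashes = 14
estimated_miles = 800_000

miles_per_crash = estimated_miles / crashes
print(round(miles_per_crash))  # 57143, i.e. roughly "one crash every 57,000 miles"
```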
The numbers aren't just not good, they're appalling:
By the company’s own numbers, its “Robotaxi” fleet crashes nearly 4 times more often than a normal driver, and every single one of those miles had a safety monitor who could hit the kill switch. That is not a rounding error or an early-program hiccup. It is a fundamental performance gap.
There are two points that need to be made about how bad this is:
However badly it may be doing so, Tesla is trying to operate a taxi service. So it is misleading to compare the crash rate with "normal drivers". The correct comparison is with taxi drivers. The New York Times reported that:
In a city where almost everyone has a story about zigzagging through traffic in a hair-raising, white-knuckled cab ride, a new traffic safety study may come as a surprise: It finds that taxis are pretty safe.
So are livery cars, according to the study, which is based on state motor vehicle records of accidents and injuries across the city. It concludes that taxi and livery-cab drivers have crash rates one-third lower than drivers of other vehicles.
A law firm has a persuasive list of reasons why this is so. So Tesla's "robotaxi" is actually 6 times less safe than a taxi.
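The six-times figure follows from combining the two quoted statistics, as this quick check shows:

```python
# Combine the two quoted statistics:
#  - Tesla's fleet crashes ~4x more often than a normal driver
#  - NYC taxi drivers crash at a rate one-third lower than other
#    drivers, i.e. 2/3 of the normal rate
robotaxi_vs_normal = 4.0
taxi_vs_normal = 2.0 / 3.0

robotaxi_vs_taxi = robotaxi_vs_normal / taxi_vs_normal
print(robotaxi_vs_taxi)  # 6.0 — six times the taxi crash rate
```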
Fake Self Driving is a Level 2 system that requires a human behind the wheel, and that is the way Tesla's service in California has to operate. But in Austin the human is in the passenger seat, or in a chase car. Tesla has been placing bystanders at risk by deliberately operating in a way that it knows, and the statistics it reports show, is unsafe.
Tesla filed new comments with the California Public Utilities Commission that amount to a quiet admission: its “Robotaxi” service still relies on both in-car human drivers and domestic remote operators to function. Rather than downplaying these dependencies, Tesla leans into them — arguing that its multi-layered human supervision model is more reliable than Waymo’s fully driverless system, pointing to the December 2025 San Francisco blackout as proof.
The filing, submitted February 13 in CPUC Rulemaking 25-08-013, reveals the massive operational gap between what Tesla calls a “Robotaxi” and what Waymo actually operates as one.
Tesla's filing admits that the service they market as a "robotaxi" really isn't one:
Tesla operates its service using TCP (Transportation Charter Party) vehicles equipped with FSD (Supervised), a Level 2 ADAS system that, by definition, requires a licensed human driver behind the wheel at all times, actively monitoring and ready to intervene.
On top of that in-car driver, Tesla describes a parallel layer of remote operators. The company states it employs domestically located remote operators in both Austin and the Bay Area, and that these operators are subject to DMV-mandated U.S. driver’s licenses, “extensive background checks and drug and alcohol testing,” and mandatory training. Tesla frames this as a redundancy system, remote operators in two cities backing up the in-car drivers.
That’s two layers of human supervision for a service Tesla markets as a “Robotaxi.”
Waymo’s vehicles have no driver in the car. Waymo uses remote assistance operators who can provide guidance to vehicles in ambiguous situations, but the vehicle drives itself. Waymo’s remote operators don’t control the car, they confirm whether it’s safe to proceed in edge cases like construction zones or unusual road conditions.
... Tesla’s system requires a human to drive the car and has remote operators as backup. Waymo’s system drives itself and has remote operators as backup. Tesla is essentially describing a staffing-intensive taxi service with driver-assist software. Waymo is describing an autonomous transportation network.
This is where Tesla's marketing their service as a "robotaxi" creates a Catch-22:
Tesla argues forcefully that its Level 2 ADAS vehicles should remain outside the scope of this AV rulemaking entirely, agreeing with Lyft that they aren’t “autonomous vehicles” under California law.
At the same time, Tesla is fighting Waymo’s proposal to prohibit Level 2 services from using terms like “driverless,” “self-driving,” or “robotaxi.” Tesla calls this proposal “wholly unnecessary,” arguing that existing California advertising laws already cover misleading marketing.
A California judge already ruled in December 2025 that Tesla’s marketing of “Autopilot” and “Full Self-Driving” violated the state’s false advertising laws.
Tesla is telling regulators its vehicles are not autonomous and require human drivers, while simultaneously fighting for the right to keep calling the service a “Robotaxi.” Tesla wants the legal protections of being classified as a supervised Level 2 system and the marketing benefits of sounding like a fully autonomous one.
Sadly, this is just par for the course when it comes to Tesla's marketing. Essentially everything Elon Musk has said about not just the schedule but more importantly the capabilities of Fake Self Driving has been a lie, for example a 2016 faked video. These lies have killed many credulous idiots, but they have succeeded in pumping TSLA to a ludicrous PE ratio because of the kind of irresponsible journalism Karl Bode describes in The Media Can't Stop Propping Up Elon Musk's Phony Supergenius Engineer Mythology:
One of my favorite trends in modern U.S. infotainment media is something I affectionately call "CEO said a thing!" journalism.
"CEO said a thing!" journalism generally involves a press outlet parroting the claims of a CEO or billionaire utterly mindlessly without any sort of useful historical context as to whether anything being said is factually correct.
There's a few rules for this brand of journalism. One, you can't include any useful context that might shed helpful light on whether what the executive is saying is true. Two, it's important to make sure you never include a quote from an objective academic or expert in the field you're covering that might challenge the CEO.
After all, if a journalist does include an expert pointing out that the CEO is bullshitting:
statements produced without particular concern for truth, clarity, or meaning
the journalist will lose the access upon which his job depends. But I'm not that journalist, so here is my list of the past and impending failures of the "Supergenius Engineer":
Tesla was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning as Tesla Motors. ... In February 2004, Elon Musk led Tesla's first funding round and became the company's chairman, subsequently claiming to be a co-founder
Starting in 2008, Franz von Holzhausen designed the Model S, which launched in 2012 and was Tesla's first success. Initially, Tesla was a great success, but it has since failed to update its line-up. It is now far behind Chinese EV manufacturers and losing market share worldwide, and it will lose US market share once the Chinese set up US factories.
SpaceX Falcon 9: Musk's insight that reusability would transform the space business was a huge success, but it was realized thanks to significant government support and a great CEO, Gwynne Shotwell.
This history seems like valuable context for journalists to include in reports of Musk's next pronouncement.
The presentation highlights a curriculum initiative where participants used a blockchain-enabled fair-data ecosystem (Clio-X) in a Blockathon to build privacy-preserving AI chatbots for archival datasets. It highlights blockchain’s potential to improve transparency and accountability in AI workflows by making all actions traceable on-chain.
B. Cryptographic Provenance and AI-generated Images
Authors: Jessica Bushey, Nicholas Rivard, and Michel Barbeau
The presentation highlighted how content credentials and cryptographic provenance frameworks can operationalize archival trustworthiness for born-digital assets and AI-generated images by embedding tamper-evident metadata into assets, which is a highly relevant and timely challenge given the proliferation of synthetic media. It effectively bridges archival theory (authenticity and provenance) with practical systems and discusses how blockchain and content credentials can support verifiable history of digital images, situating the work within computational archival science. Overall, it makes a strong conceptual and methodological contribution to trustworthy preservation of digital content.
2: Processing Analog Archives [4 papers]
A. Using an Ensemble Approach for Layout Detection and Extraction from Historical Newspapers
Authors: Aditya Jadhav, Bipasha Banerjee, and Jennifer Goyne
The presentation focused on layout detection and Optical Character Recognition (OCR) for historical newspapers by proposing a modular, detector-agnostic ensemble pipeline combining OpenCV, Newspaper Navigator, and a fine-tuned TextOnly-PRIMA model to improve segmentation and extraction on variable scans. It’s strong in engineering detail and demonstrates practical improvements over commercial baselines like AWS Textract, especially on degraded material. Overall, it’s a solid methodological contribution with clear application value in large-scale digitization efforts.
B. PARDES: Automatic Generation of Descriptive Terms for Logical Units in Historical Handwritten Collections
Authors: Josepa Raventos-Pajares, Joan Andreu Sanchez, and Enrique Vidal
The PARDES project presents a practical and scalable method for automatically generating descriptive terms from noisy handwritten text recognition (HTR) outputs in large historical collections, using probabilistic indexing and Zipf's Law to identify important terms. It’s strong in handling uncertainty in HTR.
C. From Analog Records to Computational Research Data: Building the AI-Ready Lab Notebook
Authors: Joel Pepper, Zach Siapno, Jacob Furst, Fernando Uribe-Romo, David Breen, and Jane Greenberg
Similar to the previous presentation, this one addressed transforming analog, handwritten lab notebooks into AI-ready digital data to unlock valuable experimental records for computational analysis. It demonstrated promising performance. Overall, it’s a good step toward making analog scientific records computationally accessible and usable for AI systems.
D. Classification of Paper-based Archival Records Using Neural Networks
Authors: Jussara Teixeira, Juliana Almeida, Tania Gava, Raphael Lugon Campo Dall’Orto, and José Márcio Moraes Dorigueto
The presentation demonstrates a practical application of supervised machine learning (ML) to classify unprocessed archival records, achieving high accuracy and scalability on a large real-world governmental dataset (Electronic Process System (SEP) of the State of Espírito Santo, Brazil). It effectively shows how a modular ML architecture can be integrated into existing archival systems, and how clustering similar records can reduce manual effort. Overall, it’s a solid empirical case study of ML enhancing a core archival function at scale.
3: Retrieval-augmented Generation [3 papers]
A. Developing a Smart Archival Assistant with Conversational Features and Linguistic Abilities: the Ask_ArchiLab Initiative
Authors: Basma Makhlouf Shabou, Lamia Friha, and Wassila Ramli
This talk presented a compelling initiative to modernize archival practice by building a conversational AI assistant that integrates advanced Retrieval Augmented Generation (RAG) and semantic technologies to support fast, contextual, and professional‑level archival queries. It’s strong in conceptualizing how multilingual conversational agents can bridge gaps in access, complex metadata, and diverse user expertise. Overall, it’s an innovative approach with great potential to enhance usability and knowledge discovery in digital archives.
B. Index-aware Knowledge Grounding of Retrieval-Augmented Generation in Conversational Search for Archival Diplomatics
Authors: Qihong Zhou, Binming Li, and Victoria Lemieux
This work presents an index‑aware chunking strategy to improve RAG pipelines for conversational search by grounding retrieval on structured index terms extracted from PDFs, aiming to reduce resource demands, accuracy issues, and hallucinations common in standard RAG workflows. It’s a practical contribution that addresses problems with traditional chunking strategies. Overall, it is an interesting methodological refinement with promising implications for archival conversational search but would benefit from broader validation.
C. Retrieval-augmented LLMs for ETD Subject Classification
Authors: Hajra Klair, Fausto German, Amr Ahmed Aboelnaga, Bipasha Banerjee, Hoda Eldardiry, and William A. Ingram
This work presents a two‑stage RAG‑based pipeline that uses keyword extraction and guided question generation from Electronic Theses and Dissertations (ETD) abstracts to retrieve and synthesize core document content, tackling the challenge of long, full‑text processing. It addresses the challenge of subject classification at scale for ETD by capturing signatures that go beyond simple lexical similarity to improve classification accuracy and contextual richness. The evaluation shows improvements over traditional approaches. Overall, it’s a promising and well‑structured application of RAG methods to a real-world problem.
4: Archival Theory & Computational Practice [4 papers]
A. Archival Research Theory: Putting Smart Technology to Work for Researchers
Authors: Kenneth Thibodeau, Alex Richmond, and Mario Beauchamp
This work extends archival theory beyond traditional archival management to a new Archival Research Theory (ART) framework that models archives as complex informational systems with informative potential responsive to researchers’ questions, grounded in semiotics, Constructed Past Theory, and type theory. It’s conceptually rich, offering a strong theoretical foundation for integrating smart technologies into archival research and emphasizing how meaning and context can be formally modeled to support diverse inquiry. Overall, it makes a thoughtful and potentially foundational contribution to bridging archival theory and computational practice.
B. Systems Thinking, Management Standards, and the Quest for Records and Archives Management Relevance
The presentation makes a case for records and archives management (RAM) within organizations by embedding RAM into widely adopted Management System Standards (MSS) like ISO frameworks, which currently drive visibility and measurable outcomes in areas such as quality and security. It uses systems thinking and standards practice to argue that RAM can gain institutional relevance and leadership buy‑in by aligning with structured MSS processes and the Plan‑Do‑Check‑Act cycle, thereby elevating archival functions beyond marginal roles. Overall, it’s a good management‑focused contribution that highlights the importance of standards and systemic framing for advancing archival relevance.
C. Can GPT-4 Think Computationally about Digital Archival Practices?
This work investigates whether GPT‑4o demonstrates computational thinking capabilities applied to digital archival tasks, grounding the analysis in a recognized computational thinking taxonomy. It surfaces compelling examples where the model exhibits knowledge across archival processes and computational practices, suggesting its potential as a learning partner or assistant in teaching archival computational methods. Overall, the paper offers a thought‑provoking exploration of LLM capabilities in a computational archival context, with promising avenues for further research.
D. Algorithm Auditing for Reliable AI Authenticity Assessment of Digitized Archival Objects
This presentation shows how small variations in input image resolution can drastically affect AI‑based art authentication results, highlighting a key vulnerability in applying such models to archival or cultural heritage objects and raising important concerns about reliability and manipulation risk. It makes a strong case that algorithm auditing should be embedded in computational archival science practices to improve transparency, reproducibility, and accountability of automated analyses. Overall, it’s a practical contribution that urges the need for rigorous evaluation frameworks when deploying AI for authenticity and provenance tasks in digital archives.
5: Knowledge Organization & Retrieval [2 papers]
A. Ontologies Applied to Archival Records: a Preliminary Proposal for Information Retrieval
Authors: Thiago Henrique Bragato Barros, Maurício Coelho da Silva, Rafael Rodrigo do Carmo Batista, David Haynes, and Frances Ryan
This paper presents an ontology‑driven approach to improve information retrieval (IR) over archival descriptions and digital objects by capturing archival contexts such as provenance, functions, agents, and events within a formal semantic model. It grounds its design in established ontology engineering and archival principles to support semantic indexing, reasoning, and query handling. Overall, it makes a decent conceptual contribution toward ontology‑enhanced archival IR.
B. Operationalizing Context: Contextual Integrity, Archival Diplomatics, and Knowledge Graphs
Authors: Jim Suderman, Frédéric Simard, Nicholas Rivard, Iori Khuhro, Erin Gilmore, Michel Barbeau, Darra Hofman, and Mario Beauchamp
This paper lays out a context‑driven privacy framework for archival records that combines theories of contextual integrity, archival diplomatics, and knowledge graphs to make privacy‑relevant relationships machine‑legible and support informed decisions about sensitive information at scale. Its strength lies in operationalizing context rather than content alone, using GraphRAG and knowledge graphs to capture nuanced contextual features that traditional vector embeddings miss, thereby offering a richer basis for privacy assessment. Overall, it’s a promising conceptual advancement toward AI‑enabled privacy support in archives.
6: Web Archiving [3 papers]
This session highlights my contributions. The workshop designated two slots for my papers: the first was for presenting one of the papers, and the second was for summarizing the remaining two, which is why there are three papers but only two videos. The slides for both slots are combined in one file. I want to thank Richard Marciano, Victoria Lemieux, and Mark Hedges for giving me the opportunity to present and for being flexible with the workshop registration, since my work is not funded and we were unable to pay the registration fees.
In the first paper, I presented a quantitative analysis of web archiving coverage for Arabic versus English news content over a 23‑year period, revealing that while English pages are still archived at a higher rate, Arabic archival coverage has increased significantly in recent years. I showed the heavy dependence on the Internet Archive (IA) for web archiving and that other public web archives contribute very little, exposing a centralization risk where loss of IA would make most archived content inaccessible. This paper is a continuation of previous work "Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages".
B. The Gap Continues to Grow Between the Wayback Machine and All Other Web Archives
The second paper I presented highlights a quantitative study showing that the Internet Archive (IA) overwhelmingly dominates public web archiving, preserving 99.74% of archived Arabic and English news pages in the dataset I constructed (1.5 million URLs), while all other web archives combined account for only a tiny fraction. I highlighted the risk to web archiving if the IA became unavailable: the vast majority of archived online news would be lost or irretrievable, underscoring a critical vulnerability in web preservation. My analysis offers clear results, but the paper could benefit from a broader discussion of why other web archives are shrinking and what practical strategies could diversify preservation efforts. Overall, it is an important wake‑up call about concentration in web archiving and the fragility of our collective digital memory. This paper is a continuation of previous work "Profiling web archive coverage for top-level domain and content language".
C. Collecting and Archiving 1.5 Million Multilingual News Stories’ URIs from Sitemaps
The third paper I presented introduced JANA1.5, a large dataset of 1.5 million Arabic and English news story URLs collected from news site sitemaps, and demonstrated an effective sitemap‑based collection method that outperforms alternatives like RSS, X (formerly Twitter), and web scraping. I also discussed approaches to noise reduction, and ended by explaining how this dataset will be submitted to the IA.
One of the standout aspects of the CAS workshop was its responsiveness and quick turnaround. Reviewers' comments were actionable and came back quickly, decisions were clear, and the entire process moved at a fast pace that made it possible to focus on the work itself rather than waiting on it. The entire process from submission to publishing and presenting the work took about a month. It’s the kind of efficiency every venue should strive for. Attending the 10th CAS Workshop was great. It underscored issues related to computational archival science including centralization, authenticity, and who gets to be remembered. It was a rewarding experience to present my work at the CAS workshop exploring web archiving’s dependence on the Internet Archive. The discussion highlighted just how vital the Internet Archive is to our digital memory, and it was inspiring to see how their work motivates us all to take action and contribute to preserving our online heritage.
Today I am sharing the Agent Protocols Tech Tree. APTT is a visual, videogame-style tech tree of the evolving protocols supporting AI agents.
Where did this come from?
I made the APTT for a session on “The Role of Protocols in the Agents Ecosystem” at the Towards an Internet Ecosystem for Sane Autonomous Agents workshop at the Berkman Klein Center on February 9th.
It’s a video game tech tree because, while the word “protocols” is boring, the phenomenon of open protocols is fascinating, and I want to make them easier to approach and explore.
What is an open protocol? Why care about them?
An open protocol is a shared language used by multiple software projects so they can interoperate or compete with each other.
Protocols offer an x-ray of an emerging technology — they tell you what the builder community actually cares about, what they are forced to agree on, what is already done, and what is likely to come next.
Open protocols go back to the founding of the internet when basic concepts like “TCP/IP” were standardized — not by a government or company creating and enforcing a rule, but by a community of builders based on “rough consensus and running code.” On the internet no one could force you to use the same standards as everyone else, but if you wanted to be part of the same conversation, you had to speak the same language. That created strong incentives to agree on protocols, from SMTP to DNS to FTP to HTTP to SSL. By tracing each of those protocols, you could see the evolving concerns of the people building the internet.
(For a great discussion of that history, see “The Battle of the Networks” from LIL faculty director Jonathan Zittrain’s book “The Future of the Internet — and How to Stop It.”)
Why are protocols so important for AI agents?
Like the early internet, AI agents today are an emerging, distributed phenomenon that is changing faster than even experts can understand. We’re holding workshops with names like “Towards an Internet Ecosystem for Sane Autonomous Agents” because no one really knows what it will mean to have millions of semi-autonomous computer programs acting and interacting in human-like ways online.
Also like the early internet, it’s tempting to look for some government or company that is in charge and can tame this phenomenon, set the rules of the road. But in many ways there isn’t one. The ingredients of AI agents are just not that complex or that controlled.
This makes sense if you look at Anthropic’s definition of an agent, which is simply “models using tools in a loop.” That is not a complex recipe: it requires a large language model, of which there are now many, including powerful open source ones that can run locally; a fairly small and simple control loop; and a set of “tools,” simple software programs that can interact with the world to do things like run a web search or send a text message. “Agents” as a phenomenon are a technique, like calculus, not a service, like Uber.
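That "models using tools in a loop" recipe can be sketched in a few lines. Everything here is hypothetical scaffolding (the `call_model` stub, the `web_search` tool, the message format), not any vendor's actual API:

```python
# Minimal sketch of "models using tools in a loop". The model call is
# stubbed out; a real agent would hit an inference API here.

def call_model(messages):
    # Hypothetical stub: a real LLM would decide whether to request a
    # tool call or produce a final answer. This stub answers immediately.
    return {"type": "answer", "text": "done"}

def web_search(query):
    # Hypothetical tool: in practice this would query a search engine.
    return f"results for {query!r}"

TOOLS = {"web_search": web_search}

def run_agent(task, max_steps=10):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):           # the loop
        reply = call_model(messages)     # the model
        if reply["type"] == "tool_call":
            result = TOOLS[reply["name"]](reply["args"])  # the tools
            messages.append({"role": "tool", "content": result})
        else:
            return reply["text"]
    return None  # gave up after max_steps

print(run_agent("What is an open protocol?"))
```

The point of the sketch is how little there is to it: swap in a real model API and a handful of tools and you have an agent, which is why the technique is so hard to centrally control.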
That makes agents hard to regulate, and makes protocols incredibly important. It is protocols that give agents the tools they use. It is protocols that the builder community are developing as fast as they can to increase what agents can do. If you want to nudge this technique toward human thriving, it is protocols that might most shape agent behavior by making some agents easier to build than others.
To be sure, protocols aren’t the only way to influence technological development. Larry Lessig’s classic “pathetic dot theory” outlines markets, laws, social norms, and architecture as four separate ways that individual action gets regulated, and protocols are just an aspect of architecture. But the more a technology is dispersed and simple to recreate, the more protocols come into play in how it evolves.
How do I use the APTT?
APTT is designed to be helpful whether you’re a less-technical person who just wants to understand what agents are, or a more technical person who wants to understand exactly what’s getting built.
Either way the pile of agent technologies is confusing, so I recommend starting at the beginning with “Inference API.”
Video games are often designed so you start with a simple feature unlocked and then progressively unlock more and more complex options as you learn the game. The same approach works here: imagine that you have just unlocked “Inference API” in this game, and once you’re comfortable with that, explore off to the right to see how each protocol enables or necessitates the next.
You can click each technology to learn what problem it solves (why did people need something like this?), how it’s standardizing (who kicked this off?), and what virtuous cycle it enabled (why did other people want to get on board?).
You can also see visual animations of how the protocol is used — what messages are actually sent back and forth, and between whom?
If you’re interested in the technical details, you can click any of the messages to see at a wire level what’s actually happening. (Often, something simpler than it sounds.)
As you move off to the right, you’ll go from widely adopted technologies, like MCP, to technologies that have commercial supporters but not much social proof yet, like Visa TAP, or technologies that don’t even exist but might make sense in the future, like Interoperable Memory, Signed Intent Mandates, or Agent Lingua Franca.
The ragged edge on the right is where I hope you’ll be the most critical: what seems inevitable, what seems like a dead end, and what would you like to see more of?
How accurate is all of this? How do I fix mistakes?
APTT is a work in progress, and to be honest in many ways is a whiteboard sketch. I put it together (and vibe coded much of it) to help support a conversation, first at the workshop and now online. I think whiteboard sketches are useful, so I’m sharing it, but I don’t pretend it’s authoritative; it’s just my rough sense of how things work right now.
(This is a weird thing about the agentic moment — my coding agent has made this tool look more polished and complete than it may really deserve. Think napkin sketch with fancy graphics.)
If you think I got things wrong or missed part of the story, please open an issue on the GitHub repository. I plan to keep this rough and opinionated, and focused on consensus-driven protocols as a lens for understanding what’s happening — so I’ll either pull contributions into the main tool, or just leave them as discussions to represent the range of opinions about how all of this works. I hope it’s fun to play with either way.
Arke is a public knowledge network for storing, discovering, and
connecting information.
Making content truly accessible is harder than it looks. Meaningful
search requires vectors, embeddings, extraction pipelines—infrastructure
most people can’t build. And even with that, files sitting on a website
or in a folder don’t get found. You end up working alone, disconnected
from related work that exists somewhere.
Arke handles all of it. Upload anything—we process it and connect it to
a network where similar collections surface automatically. Your
information becomes searchable, discoverable, and linked to work you
didn’t know existed.
Public events are trapped in information silos. The library posts to
their website, the YMCA uses Google Calendar, the theater uses
Eventbrite, Meetup groups have their own pages. Anyone wanting to know
“what’s happening this weekend?” must check a dozen different sites.
Existing local aggregators typically expect event producers to “submit”
events via a web form. This means producers must submit to several
aggregators to reach their audience — tedious and error-prone. Worse, if
event details change, producers must update each aggregator separately.
This project takes a different approach: event producers are the
authoritative sources for their own events. They publish once to their
own calendar, and individuals and aggregators pull from those sources.
When details change, the change propagates automatically. This is how
RSS transformed blogging, and iCalendar can do the same for events.
The gold standard is iCalendar (ICS) feeds — a format that machines can
read, merge, and republish. If you’re an event producer and your
platform can publish an ICS feed, that’s great. But ICS isn’t the only
way. The real requirement is to embrace the open web. A clean HTML page
with well-structured event data works. What doesn’t work: events locked
in Facebook or behind login walls.
What do LLMs mean for the future of software engineering? Will
vibe-coded AI slop be the norm? Will software engineers simply be less
in-demand? Rain and David join Bryan and Adam to discuss how rigorous
use of LLMs can make for much more robust systems.
The English-language edition of Wikipedia is blacklisting Archive.today
after the controversial archive site was used to direct a distributed
denial of service (DDoS) attack against a blog.
In the course of discussing whether Archive.today should be deprecated
because of the DDoS, Wikipedia editors discovered that the archive site
altered snapshots of webpages to insert the name of the blogger who was
targeted by the DDoS. The alterations were apparently fueled by a grudge
against the blogger over a post that described how the Archive.today
maintainer hid their identity behind several aliases.
“There is consensus to immediately deprecate archive.today, and, as soon
as practicable, add it to the spam blacklist (or create an edit filter
that blocks adding new links), and remove all links to it,” stated an
update today on Wikipedia’s Archive.today discussion. “There is a strong
consensus that Wikipedia should not direct its readers towards a website
that hijacks users’ computers to run a DDoS attack (see WP:ELNO#3).
Additionally, evidence has been presented that archive.today’s operators
have altered the content of archived pages, rendering it unreliable.”
Megalodon (Japanese: ウェブ魚拓, “web gyotaku”) is an on-demand web
citation service based in Japan.[3] It is owned by Affility.
Megalodon’s server can be searched for “web gyotaku” or copies of web
pages, by prefixing any URL with “gyo.tc”; the process checks the query
against other services as well, including Google’s cached pages and
Mementos.
WASHINGTON, Feb 18 (Reuters) - The U.S. State Department is developing
an online portal that will enable people in Europe and elsewhere to see
content banned by their governments including alleged hate speech and
terrorist propaganda, a move Washington views as a way to counter
censorship, three sources familiar with the plan said.
During recent decades, universities have faced increasing pressure to
demonstrate their value and impact by contributing to real-world
problem-solving and meeting broader societal needs. The reasons for this
increased pressure are complex and numerous—reflecting socio-economic
and socio-political considerations, globalization and intensifying
competition, and growing demands for accountability and demonstrable
public value. At Virginia Tech, our library’s research impact and
intelligence team, of which we are all members, supports institutional
strategy, researcher visibility, and decision-making in response to
these demands. In this article, we’ll outline the emergence of research
impact and research intelligence work in libraries, trace the
development of our department, and illustrate how analytics, research
information management, and consultation services are operationalized
alongside ongoing efforts to promote responsible interpretation and use
of research metrics.
Annotorious is a JavaScript library for adding image annotation
capabilities to your web application. Try it out below: click or tap the
annotation to edit. Click or tap anywhere and drag to create a new
annotation.
A very special guest on this episode of the Lightcone! Boris Cherny, the
creator of Claude Code, sits down to share the incredible journey of
developing one of the most transformative coding tools of the AI era.
Every RSS reader I’ve used presents your feeds as a list to be
processed. Items arrive. They’re marked unread. Your job is to get that
number to zero, or at least closer to zero than it was yesterday.
Current has no unread count. Not because I forgot to add one, or because
I thought it would look cleaner without it. There is no count because
counting was the problem.
The main screen is a river. Not a river that moves on its own. You’re
not watching content drift past like a screensaver. It’s a river in the
sense that matters: content arrives, lingers for a time, and then fades
away.
Email’s unread count means something specific: these are messages from
real people who wrote to you and are, in some cases, actively waiting
for your response. The number isn’t neutral information. It’s a measure
of social debt.
But when we applied that same visual language to RSS (the unread counts,
the bold text for new items, the sense of a backlog accumulating) we
imported the anxiety without the cause.
Last week I gave a book talk on Public Data Cultures and co-organised a
Wayback studio with the Internet Archive Europe.
As highlighted in the book talk announcement it was really nice to have
this moment there given my longstanding collaborations with the Internet
Archive - and to meet up with others connected to the archive and
associated communities in Amsterdam
Black Jesus is an American live-action sitcom created by Aaron McGruder
(creator of The Boondocks) and Mike Clattenburg (creator of Trailer Park
Boys) that aired on Adult Swim. The series stars Gerald “Slink” Johnson,
Charlie Murphy, Corey Holcomb, Kali Hawk, King Bach, Andra Fuller, and
John Witherspoon. The series premiered on August 7, 2014. On December
10, 2014, the series was renewed for a second season,[2] which premiered
on September 18, 2015.[3] Its third and final season premiered on
September 21, 2019.[4]
John Backus led a team at IBM in 1957 that created the first successful
high-level programming language, FORTRAN. It was designed to solve
problems in science and engineering, and many dialects of the language
are still in use throughout the world.
Describing the development of FORTRAN, Backus said, “We simply made up
the language as we went along. We did not regard language design as a
difficult problem, merely a simple prelude to the real problem:
designing a compiler which could produce efficient programs . . . We
also wanted to eliminate a lot of the bookkeeping and detailed,
repetitive planning which hand coding involved.”
The name FORTRAN comes from FORmula TRANslation. The language was
designed for solving engineering and scientific problems. FORTRAN IV was
first introduced by IBM in the early 1960s and still exists in a number
of similar dialects on machines from various manufacturers.
ZFS improves everything about systems administration. Once you peek
under the hood, though, ZFS’ bewildering array of knobs and tunables can
overwhelm anyone. ZFS experts can make their servers zing—and now you
can, too, with FreeBSD Mastery: Advanced ZFS.
Given a situation where a ZFS pool has just too many datasets for you to
comfortably manage, or perhaps you have a few datasets, but you just
learned of a property that you really should have set from the start,
what do you do? Well, I don’t know what you do; I would love to hear
about that, so please do reach out to me, preferably over Matrix.
In any case, what I came up with is disko-zfs. A simple Rust program
that will declaratively manage datasets on a zpool. It does this based
on a JSON specification, which lists the datasets, their properties and
a few pieces of extra information.
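I haven’t seen disko-zfs’s actual schema, but a JSON specification of the kind described might plausibly look something like this (every key here is my guess at a shape, not the tool’s real format):

```json
{
  "pool": "tank",
  "datasets": {
    "tank/home": {
      "properties": { "compression": "zstd", "atime": "off" }
    },
    "tank/backups": {
      "properties": { "recordsize": "1M" }
    }
  }
}
```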
My hunch is that we’ll spend just as much time and energy carving code
back as we will generating it. If generating code is nearly free, then
the cost shifts entirely to understanding, maintaining, and pruning it.
And sometimes the right move isn’t a better level of detail. It’s fewer
polygons in the scene altogether. Delete the sprawling implementation
and replace it with something you can actually reason about.
So now I’m paying $20 a month to a company that scraped the collective
knowledge of humanity without asking so that I can avoid writing
Kubernetes YAML. I know what that makes me. I just haven’t figured out a
word for it yet that I can live with.
The two management giants of the mid-twentieth century were Peter
Drucker and W. Edwards Deming. Ironically, while Drucker hails from
Austria-Hungary (like me, Drucker emigrated to the U.S. as an adult) and
Deming was born in the U.S., it was Drucker who proved to be more
influential in America. Deming’s influence was much greater in Japan
than it ever was in the U.S. If you’ve ever been at an organization that
uses OKRs, then you have worked in the shadow of Drucker’s legacy. While
you can tell a story about how Deming influenced Toyota, and Toyota
inspired the lean movement, I would still describe management in the
U.S. as Deming in exile. Deming explicitly stated that management by
objectives isn’t leadership, and I think you’d be hard-pressed to find
managers in American companies who would agree with that sentiment.
Emily St. John Mandel (/seɪntˈdʒɒn mænˈdɛl/;[2][3] née Fairbanks;[4]
born 1979) is a Canadian novelist and essayist.[5][6] She has written
six novels, including Station Eleven (2014), The Glass Hotel (2020), and
Sea of Tranquility (2022). Station Eleven, which has been translated
into 33 languages,[7] has been adapted into a limited series on HBO
Max.[8] The Glass Hotel was translated into twenty languages and was
selected by Barack Obama as one of his favorite books of 2020.[9][10]
Sea of Tranquility was published in April 2022 and debuted at number
three on The New York Times Best Seller list.[11]
Deb Olin Unferth (born November 19, 1968) is an American author. She has
published two novels, two books of short stories, a memoir, and a
graphic novel. Her fiction and essays have appeared in over fifty
magazines and journals, including Harper’s,[1] The New York Times,[2]
The Paris Review,[3] The Believer,[4] McSweeney’s, Granta,[5] The
Guardian,[6] and NOON. She was a finalist for the National Book Critics’
Circle Award,[7] and she has received a Guggenheim fellowship,[8] four
Pushcart Prizes, a Creative Capital Fellowship for Innovative
Literature,[9] and residency fellowships from the MacDowell[10] and
Yaddo[11] Foundations.
This introduction provides an overview of the thirteen articles which
constitute this special issue about “citational politics and justice.”
The issue begins with a discussion paper, followed by six research
articles, one commentary, one project report, one teaching reflection,
and finishes with three conversations. Authors reflect on the history
and future of citation practices, and what they mean for the recognition
of marginalised scholars, knowledges, and forms of output. The range of
contributions offers insights into how more just scholarly practices can
be promoted in teaching, research, publishing, and collaboration with
academic and societal partners. Together, these articles provide ideas
for achieving greater citational justice, and ultimately improving the
quality of knowledge.
There are many ways to categorize programming languages; one is to
define them as either “concatenative” or “applicative”. In an
applicative language, things are evaluated by applying functions to
arguments. This includes almost all programming languages in wide use,
such as C, Python, ML, Haskell, and Java. In a concatenative programming
language, things are evaluated by composing several functions which all
operate on a single piece of data, passed from function to function.
This piece of data is usually in the form of a stack. Additionally, in
concatenative languages, this function composition is indicated by
concatenating programs. Examples of concatenative languages include
Forth, Joy, PostScript, Cat, and Factor.
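The composition model is easy to simulate in an applicative language. A toy sketch (mine, not any real concatenative language): each "word" is a function from stack to stack, and concatenating a program just composes those functions.

```ruby
# Each word maps a stack (Array) to a new stack.
PUSH = ->(n) { ->(s) { s + [n] } }            # push a literal
ADD  = ->(s) { s[0...-2] + [s[-2] + s[-1]] }  # pop two, push sum
DUP  = ->(s) { s + [s.last] }                 # duplicate top of stack
MUL  = ->(s) { s[0...-2] + [s[-2] * s[-1]] }  # pop two, push product

# Concatenating programs composes them: "2 3 + dup *" computes (2 + 3)**2.
program = [PUSH.(2), PUSH.(3), ADD, DUP, MUL]
result = program.reduce([]) { |stack, word| word.(stack) }
# => [25]
```

The point the passage makes is visible here: there are no named arguments anywhere; data flows implicitly through the single stack.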
When The Guardian took a look at who was trying to extract its content,
access logs revealed that the Internet Archive was a frequent crawler,
said Robert Hahn, head of business affairs and licensing. The publisher
decided to limit the Internet Archive’s access to published articles,
minimizing the chance that AI companies might scrape its content via the
nonprofit’s repository of over one trillion webpage snapshots.
Gwtar is a new polyglot HTML archival format which provides a single,
self-contained, HTML file which still can be efficiently lazy-loaded by
a web browser. This is done by JavaScript in the file’s header making
HTTP range requests. It is used on Gwern.net to serve large HTML archives.
Resisting the annexation of our hearts and minds by Silicon Valley
requires us not just to set boundaries on our engagement with what they
offer, but to cherish the alternatives. Joy in ordinary things, in each
other, in embodied life, and the language with which to value it, is
essential to this resistance, which is resistance to dehumanisation.
Join us for a quiet look inside the workspace of Tadao Ando, offering a
brief glimpse into his architectural process.
This studio visit documents the daily rhythms of work and the careful,
repetitive making of architectural scale models that sit at the center
of his practice. The focus is not on finished buildings, but on process.
Time spent refining ideas. Returning to the same forms again and again.
Letting work unfold slowly.
Photographed in a restrained, observational way, this project uses still
imagery to pay close attention to space, light, and atmosphere. The
photographs are not illustrative, but quietly descriptive, allowing the
studio to reveal itself as it is.
It is a small window into how creative work happens inside a working
architecture studio, and an invitation to slow down and observe the act
of making.
Even as a skeptic, I will admit it was interesting to hear about how
Claude Code was created and how it is being developed now in this
interview with its creator Boris Cherny:
Cherny’s instruction to build for the model they will have in six
months, coupled with the seeming lack of understanding of what model
that will be (either software development goes away or an ASL-4
level catastrophe), was to be expected, I guess? Maybe he knows and just
isn’t saying? Maybe there isn’t a very good understanding of whether one
model is working better than another? The question of how these models
are being evaluated for particular types of work, like software
development, is actually interesting to me.
Of course, Anthropic employees would like nothing better than for people
to forget how to develop software, and to become utterly dependent on
them in the process. Indeed they are happily leading the way, high on
their own supply of limitless tokens. They are counting on employers to
follow suit, paying subscription costs to give their employees tokens to
spend instead of having software developers on staff. This is following
in the footsteps of what we’ve seen happen with cloud computing.
In some ways this is nothing new. Software developers have been
dependent on the centralized development of compilers and interpreters
for some time. So you could look at the centralization of software
development into platforms like Anthropic and OpenAI as the natural next
stage of development in information technology. Indeed, I think this is
the argument currently being made (somewhat convincingly) by Grady Booch
about a Third
Golden Age of Computing which got underway with the rise of
“platforms” more generally, and which includes recent genAI platform
APIs and tooling.
But the big difference, that they want us all to forget, is the amount
of resources it takes to build a compiler compared to an LLM and our
ability to reason about them, and intentionally improve them.
They also want us to forget that we need to, you know, give them all our
data and ideas as context for them to do whatever they want (thanks cblgh). And
as with cloud computing, they want us to forget about the materiality of
computing, where computation runs. Ironically, I think computer
programmers are particularly susceptible to this rhetoric of
abstraction, or the medial ideology of the digital and the cloud (Kirschenbaum 2008; Hu 2015).
From a sociotechnical perspective I am curious how prompt data is being
used to try to improve these models, as people start using them for
ordinary tasks, and also in attempts to intentionally shape the model
motivated by greed and malice. I guess the details of this process must
be well hidden? Pointers would be welcome.
I am doing LLM “RAG” with Rails ActiveRecord, and Postgres with the pgvector extension for vector similarity searches, plus the neighbor gem. I am fairly new to all of this stuff, figuring it out by doing it.
I realized that for a particular use I wanted some document diversity — so I wanted to do a search of my chunks ranked by embedding vector similarity, getting the top k (say 12) chunks — but in some cases I only want, say, 2 chunks per document. So: the top 12 chunks by vector similarity, such that at most 2 chunks per document (each document here is an interview) are represented in those 12.
I decided I wanted to do this purely in SQL, hey, I’m using pgvector, wouldn’t it be most efficient to have pg do the 2-per-document limit?
Note: This may be a use case that isn’t a good idea! I have come to realize that maybe I want to just fetch 12×3 or 12×4 chunks into Ruby, and apply my “only 2 per document” limit there. Because I may want to do other things there anyway that I can’t do in Postgres, like apply a cross-model re-ranker? So I dunno, but for now I did it anyway.
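For what it’s worth, the pure-Ruby fallback is short. A sketch (the method name and hash keys are mine; it assumes the over-fetched chunks arrive already sorted by ascending neighbor_distance):

```ruby
# Walk the candidates in ascending-distance order, keeping at most
# per_doc chunks from any one document, until we have k total.
def top_k_with_per_doc_limit(chunks, k:, per_doc:)
  counts = Hash.new(0)  # chunks kept so far, per document_id
  picked = []
  chunks.each do |chunk|
    break if picked.size >= k
    next if counts[chunk[:document_id]] >= per_doc
    counts[chunk[:document_id]] += 1
    picked << chunk
  end
  picked
end
```

With real ActiveRecord objects you’d use `chunk.document_id` instead of hash access, but the shape of the logic is the same.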
So this was some fancy SQL, and I was having trouble figuring out how to do it myself, so I asked ChatGPT, sure. It gave me an initial answer that worked, but…
- It turns out it was over-complicated; a simpler (to my understanding anyway) approach was possible.
- It turns out it was not performant: it was not using my Postgres HNSW indexes to speed up the vector searches, and/or was insisting on sorting the entire table first, defeating the point of the indexes. How’d I know? Well, I noticed it was slower than expected (several seconds, or at times much more, to return), and then I did a Postgres explain/analyze… which I had trouble understanding… so I fed the results to ChatGPT and/or Claude, who confirmed: yeah buddy, this is a bad query, it’s not using your vector index properly.
I had to go on a few back and forths with both ChatGPT and Claude (this is just talking to them in a GUI, not actually using Claude Code or whatever), to get to a pattern that did use my index effectively. They kept suggesting things to me that either just didn’t work, or didn’t actually use the index, etc. I had to actually understand what they were suggesting, and tweak it myself, and have a dialog with them…
But I eventually got to this cool method that can take an arbitrary ActiveRecord relation which has already had the neighbor gem’s nearest_neighbors query applied to it… and wraps it in a larger query, using CTEs, that limits the results to a max per document.
I wondered if I should try to share this somewhere (would the neighbor gem want a PR?), except… like I said above, I’m realizing maybe this is not actually a very useful use case, and it’s better to do it in Ruby… I’m still not necessarily getting the performance I expected either, although explain/analyze says the indexes are being used properly.
So I just share it here. Note the original base_relation may have its own internal joins to enforce additional conditions on retrieval, etc. Assume each Chunk ActiveRecord model has a document_id attribute, which we use to group for the max-per-document limit.
# We need to take base_relation and use it as a Postgres CTE (Common Table Expression)
# to select from, adding a ROW_NUMBER window function that lets us limit
# to the top max_per_interview rows per interview.
#
# Kinda tricky, especially to do with good index usage. Got the solution from Google
# and talking to LLMs, including having them look at pg explain/analyze output.
#
# @param base_relation [ActiveRecord::Relation] original relation; it can have joins
#   and conditions. It MUST already have had vector distance ordering applied to it
#   with the `neighbor` gem.
#
# @param max_per_interview [Integer] maximum results to include per interview
#   (oral_history_content_id)
#
# @param inner_limit [Integer] how many to OVER-FETCH in the inner limit, to have
#   enough left even after applying max-per-interview.
#
# @return [ActiveRecord::Relation] wrapped in a query that enforces the
#   max_per_interview limit. It does not have an overall limit set; the caller should
#   add one if desired, otherwise it is effectively limited by inner_limit.
def wrap_relation_for_max_per_interview(base_relation:, max_per_interview:, inner_limit:)
  # In the inner CTE we have to over-fetch, so we hopefully wind up with enough
  # rows in the outer query. Leaving the inner query unlimited would be a
  # performance problem: because of how the indexing works, Postgres doesn't need
  # to calculate every distance when the query is limited.
  base_relation = base_relation.limit(inner_limit)

  # Now another CTE that assigns doc_rank within each document partition,
  # drawing from base. Raw SQL is just way easier here.
  partitioned_ranked_cte = Arel.sql(<<~SQL.squish)
    SELECT base.*,
      ROW_NUMBER() OVER (
        PARTITION BY document_id
        ORDER BY neighbor_distance
      ) AS doc_rank
    FROM base
  SQL

  # A wrapper query that incorporates both CTEs, limiting to a doc_rank of how
  # many we want per interview, and making sure to again order by the vector
  # neighbor_distance that must already have been included in the base relation.
  base_relation.klass
    .select("*") # just pass through from the underlying CTE queries
    .with(base: base_relation)
    .with(partitioned_ranked: partitioned_ranked_cte)
    .from("partitioned_ranked")
    .where("doc_rank <= ?", max_per_interview)
    .order(Arel.sql("neighbor_distance"))
end
Like I said, I am new to this LLM stuff, curious what others have to say here.
Donald Kessler and Burton Cour-Palais's 1978 paper describes a situation in which the density of objects in low Earth orbit (LEO) becomes so high due to space pollution that collisions between these objects cascade, exponentially increasing the amount of space debris over time.
This became known as the Kessler Syndrome. Three decades later, shortly after Iridium 33 and Cosmos 2251 collided at 11.6 km/s, Kessler published The Kessler Syndrome, writing that the original paper:
predicted that around the year 2000 the population of catalogued debris in orbit around the Earth would become so dense that catalogued objects would begin breaking up as a result of random collisions with other catalogued objects and become an important source of future debris.
Modeling results supported by data from USAF tests, as well as by a number of independent scientists, have concluded that the current debris environment is “unstable”, or above a critical threshold, such that any attempt to achieve a growth-free small debris environment by eliminating sources of past debris will likely fail because fragments from future collisions will be generated faster than atmospheric drag will remove them.
Using data from on-orbit fragmentation events, this paper introduces a revised stability model for altitudes below 1020 km and evaluates the March 2025 population of payloads and rocket stages to identify new regions of instability. The results indicate the current population of intact objects exceeds the unstable threshold at all altitudes between 400 km and 1000 km and the runaway threshold at nearly all altitudes between 520 km and 1000 km.
This and other recent publications attracted the attention not only of two well-known YouTubers, Sabine Hossenfelder and Anton Petrov, but also of me.
The amount of space debris in orbit continues to rise quickly. About 40,000 objects are now tracked by space surveillance networks, of which about 11,000 are active payloads.
However, the actual number of space debris objects larger than 1 cm in size – large enough to be capable of causing catastrophic damage – is estimated to be over 1.2 million, with over 50,000 of those larger than 10 cm.
...
The adherence to space debris mitigation standards is slowly improving over the years, especially in the commercial sector, but it is not enough to stop the increase of the number and amount of space debris.
Even without any additional launches, the number of space debris would keep growing, because fragmentation events add new debris objects faster than debris can naturally re-enter the atmosphere.
To prevent this runaway chain reaction, known as Kessler syndrome, from escalating and making certain orbits unusable, active debris removal is required.
Another of the recent publications is Sarah Thiele et al.'s An Orbital House of Cards: Frequent Megaconstellation Close Conjunctions, which focuses on the requirement for satellites to maneuver to avoid potential collisions, and what would happen if, for example, a solar storm disrupted the necessary command-and-control:
While satellites provide many benefits to society, their use comes with challenges, including the growth of space debris, collisions, ground casualty risks, optical and radio-spectrum pollution, and the alteration of Earth's upper atmosphere through rocket emissions and reentry ablation. There is potential for current or planned actions in orbit to cause serious degradation of the orbital environment or lead to catastrophic outcomes, highlighting the urgent need to find better ways to quantify stress on the orbital environment. Here we propose a new metric, the CRASH Clock, that measures such stress in terms of the timescale for a possible catastrophic collision to occur if there are no satellite manoeuvres or there is a severe loss in situational awareness. Our calculations show the CRASH Clock is currently 5.5 days, which suggests there is limited time to recover from a wide-spread disruptive event, such as a solar storm. This is in stark contrast to the pre-megaconstellation era: in 2018, the CRASH Clock was 164 days.
They estimate that:
In the densest part of Starlink’s 550 km orbital shell, we expect close approaches (< 1 km) every 22 minutes in that shell alone.
For the whole of Earth orbit they estimate the time between < 1 km approaches at 41 seconds.
According to a recent report filed by SpaceX with the U.S. Federal Communications Commission, Starlink satellites performed roughly 300,000 collision-avoidance maneuvers in 2025 alone. The figures, first reported by New Scientist, offer a rare look at just how crowded low-Earth orbit has become — and how aggressively SpaceX is managing risk as its constellation scales.
...
On average, the 300,000 maneuvers worked out to nearly 40 avoidance actions per satellite last year. That number is rising quickly, with estimates suggesting Starlink could be performing close to one million maneuvers annually by 2027 if growth continues at its current pace.
What’s particularly notable is how conservative SpaceX’s approach is compared to the rest of the industry. While the typical standard is to maneuver when the risk of collision reaches one in 10,000, SpaceX reportedly initiates avoidance at a far lower threshold of roughly three in 10 million.
Nevertheless Starlink's rate of maneuvers is doubling every six months, which seems likely to force a less conservative policy. The average satellite is moving every 9 days. At this doubling rate, by the end of 2027 the average satellite would move about twice a day.
Starlink currently has over 10,000 satellites, with plans for 12,000 in the short term. I believe the collision probability goes as the square of the number, so that will mean moving on average every 6.25 days. Their eventual plan for 42,000 would mean twice a day, or in aggregate about one move per second.
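The arithmetic above can be made explicit. This sketch simply reproduces the stated square-law assumption (per-satellite maneuver frequency growing as the square of the constellation size), anchored on the 9-day interval at roughly 10,000 satellites from the preceding paragraphs:

```ruby
# Days between maneuvers per satellite, if the rate scales as N**2
# relative to a baseline of one move per 9 days at 10,000 satellites.
def days_between_moves(n_sats, base_days: 9.0, base_sats: 10_000)
  base_days * (base_sats.to_f / n_sats)**2
end

days_between_moves(12_000).round(2)  # => 6.25 (days)
days_between_moves(42_000).round(2)  # => 0.51 (days), about twice a day
```

Whether per-satellite rates really scale as the square (rather than linearly, with the square applying to the aggregate) is the author's stated belief, not something I've verified.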
In order to pump SpaceX/xAI/Twitter stock in preparation for a planned IPO, Musk recently pivoted from cars, weird pickup trucks, self-driving cars, robotaxis, humanoid robots and Mars colonization to data centers in space. He claimed that by 2031 SpaceX/xAI/Twitter would operate a million satellites forming a huge AI data center. Scaling up from the current maneuver rate gets you to about a move every 125ms in aggregate.
How Bad Would A Kessler Event Be?
My friend Robert Kennedy considers the implications of a Kessler event in low Earth orbit:
Obviously the national security repercussions for the western world, especially the U.S., would be severe with so many force multipliers going away at once. Presenting an opportunity for adversaries to attack us, maybe.
The overall global space market, presently ~$700B/yr & growing fast, would shrink dramatically. This contraction in turn would be amplified in the world's stock markets since space activity is central to so many Big Tech equities now, and space infrastructure is so deeply embedded many other enterprises' business models. ... Even modest P/E ratios suggest that an order of magnitude more, maybe two (~$10-100T) of paper wealth would disappear.
The space insurance market would collapse under the burden of covered claims. Re-insurers could not handle so much at once. Companies that chose to self-insure would probably go under after such a casualty. Without insurance, most enterprises could not afford to conduct space missions.
The space launch market would collapse, leaving only national launch capabilities maintained by individual nations for their individual non-market reasons. All those innovative rocket companies popping up to serve the mega-constellations would go away once their prime customers did. Global launch tempos would fall by more than half, from 200+/yr to well under 100/yr of a generation ago. Forget the $100 per kg that ... Starship was aiming for, price per kilogram would return to what it was 30 years ago, ~$10-20K/kg. Say goodbye to cheap rideshares to LEO. Even running the gauntlet thru LEO would be fraught, as the Chinese learned just a few months ago when their spacecraft was damaged by debris on the way up, necessitating the premature return of the undamaged pre-deployed spaceship to rescue the earlier crew.
Since 99% of Cubesats fly in LEO, the ecology of COTS parts that has sprung up to serve the Cubesat revolution would probably go away, or back into the garage at least. It might even disappear altogether if authorities of various spacefaring nations ban Cubesats. (Literally "throwing out the baby with the bathwater".) Don't underestimate the inherent conservatism of oligarchs to use a crisis to stomp on upstarts.
In 2027 the ESA plans ClearSpace-1, an experimental mission to deorbit a dead satellite. The plan is to grab the satellite then retrofire. In principle this technique is a workable but expensive way to remove large targets before a collision fragments them, but it isn't viable for most of the results of a collision.
What Else Can Go Wrong?
The frenzy to exploit the commons of Low Earth Orbit doesn't just threaten to cut humanity off from space in general and the benefits that LEO can provide. The process of getting stuff up there and its eventual descent threatens to accelerate the process of trashing the commons of the terrestrial environment.
Ozone losses are driven by the chlorine produced from solid rocket motor propellant, and black carbon which is emitted from most propellants. The ozone layer is slowly healing from the effects of CFCs, yet global-mean ozone abundances are still 2% lower than measured prior to the onset of CFC-induced ozone depletion. Our results demonstrate that ongoing and frequent rocket launches could delay ozone recovery. Action is needed now to ensure that future growth of the launch industry and ozone protection are mutually sustainable.
Black carbon heats the stratosphere, although the increasing use of methane reduces the amount emitted per ton of propellant. Each Starship launch uses about 4,000 tons of LOX and about 1,000 tons of methane. Assuming complete combustion (CH4 + 2O2 → CO2 + 2H2O), burning 1,000 tons of methane emits about 2,750 tons of CO2 into the atmosphere. So at the thousands of launches per year the plan implies, Musk's data centers would dump tens of megatons of CO2 per year into the atmosphere, on the order of Croatia's entire annual emissions.
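The per-launch figure follows directly from combustion stoichiometry; a minimal check, where the 1,000-ton methane load is the estimate above and the 10,000 launches/year cadence is a hypothetical round number, not a SpaceX figure:

```python
# Molar masses in g/mol; CH4 + 2 O2 -> CO2 + 2 H2O, so each mole of
# methane burned yields exactly one mole of CO2.
M_CH4, M_CO2 = 16.04, 44.01

def co2_from_methane(tons_ch4):
    """Tons of CO2 from complete combustion of a given mass of methane."""
    return tons_ch4 * M_CO2 / M_CH4

per_launch = co2_from_methane(1000)  # ~2,744 tons of CO2 per launch
annual_mt = per_launch * 10_000 / 1e6  # Mt/yr at a hypothetical 10,000 launches/yr
print(f"{per_launch:.0f} t/launch, {annual_mt:.1f} Mt/yr")
```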
This paper investigates the oxidation process of the satellite's aluminum content during atmospheric reentry utilizing atomic-scale molecular dynamics simulations. We find that the population of reentering satellites in 2022 caused a 29.5% increase of aluminum in the atmosphere above the natural level, resulting in around 17 metric tons of aluminum oxides injected into the mesosphere. The byproducts generated by the reentry of satellites in a future scenario where mega-constellations come to fruition can reach over 360 metric tons per year. As aluminum oxide nanoparticles may remain in the atmosphere for decades, they can cause significant ozone depletion.
Clearly, reducing the risk of a Kessler incident requires international cooperation. We have one somewhat successful example of international cooperation to mitigate a similar "Tragedy of the Commons". Thirty-eight years ago the Montreal Protocol was agreed, phasing out the chemicals that destroy the ozone layer. Wikipedia reports that:
Due to its widespread adoption and implementation, it has been hailed as an example of successful international co-operation.
Climate projections indicate that the ozone layer will return to 1980 levels between 2040 (across much of the world) and 2066 (over Antarctica).
But note that it will have taken almost 80 years from the agreement for the environment to recover fully. And that it appears to be the exception that proves the rule:
effective burden-sharing and solution proposals mitigating regional conflicts of interest have been among the success factors for the ozone depletion challenge, where global regulation based on the Kyoto Protocol has failed to do so.
The Kyoto Protocol attempted to mitigate the effects of greenhouse gas emissions. Of particular importance was that the Montreal Protocol was an application of the Precautionary Principle, because:
In this case of the ozone depletion challenge, there was global regulation already being implemented before a scientific consensus was established.
...
This truly universal treaty has also been remarkable in the expedience of the policy-making process at the global scale, where only 14 years elapsed between a basic scientific research discovery (1973) and the international agreement signed (1985 and 1987).
In 1.5C Here We Come I critiqued the attitudes of the global elite that have crippled the implementation of the Kyoto Protocol. I think it is safe to say that the prospect of applying the Precautionary Principle to Low Earth Orbit is even less likely.
This text, part of the #ODDStories series, tells a story of Open Data Day’s grassroots impact directly from the community’s voices. Bandung Mappers successfully carried out the Open Data Day 2025 activity on March 6 – 8 with the theme Coastal Resilience through Mangrove Rehabilitation, which was held in Cianjur, West Java. This activity was...
This text, part of the #ODDStories series, tells a story of Open Data Day’s grassroots impact directly from the community’s voices. The Open Data Day 2025 event in Dodoma brought together open data advocates, government entities, researchers, NGOs, and YouthMappers under the theme “Open Data for a Resilient Dodoma.” Hosted by OpenGeoCity Tanzania with support...
This text, part of the #ODDStories series, tells a story of Open Data Day’s grassroots impact directly from the community’s voices. To celebrate Open Data Day 2025, as part of the Harnessing Opportunities to address Polycrisis through community Engagement (HOPE) project, the Nepal Institute of Research and Communications, in collaboration with the Ilam Municipality, organized a...
Search plays a central role in that mission. Historically, Etsy’s search
models have relied heavily on engagement signals – such as clicks,
add-to-carts, and purchases – as proxies for relevance. These signals
are objective, but they can also be biased: popular listings get more
clicks, even when they’re not the best match for a specific query.
To address this, we introduce semantic relevance as a complementary
perspective to engagement, capturing how well a listing aligns with a
buyer’s intent as expressed in their query. We developed a Semantic
Relevance Evaluation and Enhancement Framework, powered by large
language models (LLMs). It provides a comprehensive approach to measure
and improve relevance through three key components:
- High quality data: we first establish human-curated “golden” labels
  of relevance categories (we’ll come back to this) for precise
  evaluation of the relevance prediction models, complemented by data
  from a human-aligned LLM that scales training across millions of
  query-listing pairs
- Semantic relevance models: we use a family of ML models with
  different trade-offs in accuracy, latency, and cost, tuned for both
  offline evaluation and real-time search
- Model-driven applications: we integrate relevance signals directly
  into Etsy’s search systems, enabling both large-scale offline
  evaluation and real-time enhancement in production
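As a sketch of the kind of signal such a framework produces, consider labeling query-listing pairs and scoring a labeler against golden labels. The category names, the `llm_label` heuristic, and the example data below are all hypothetical stand-ins, not Etsy's actual system:

```python
# Hypothetical relevance categories for query-listing pairs.
CATEGORIES = ["relevant", "partially_relevant", "irrelevant"]

def llm_label(query: str, listing_title: str) -> str:
    """Stand-in for a human-aligned LLM labeler: a trivial term-overlap
    heuristic so the sketch runs without an API call."""
    q_terms = set(query.lower().split())
    t_terms = set(listing_title.lower().split())
    overlap = len(q_terms & t_terms) / max(len(q_terms), 1)
    if overlap >= 0.75:
        return "relevant"
    if overlap > 0:
        return "partially_relevant"
    return "irrelevant"

def evaluate(golden, labeler):
    """Fraction of human-curated 'golden' labels the labeler reproduces."""
    hits = sum(labeler(q, t) == label for (q, t), label in golden.items())
    return hits / len(golden)

golden = {
    ("ceramic coffee mug", "Handmade ceramic coffee mug"): "relevant",
    ("ceramic coffee mug", "Ceramic planter pot"): "partially_relevant",
    ("ceramic coffee mug", "Vintage wool scarf"): "irrelevant",
}
print(evaluate(golden, llm_label))  # agreement on the toy golden set
```

In a production pipeline the heuristic would be replaced by an LLM call, with the golden set reserved for evaluating the LLM's alignment with human judgment.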
While our powerful search and discovery algorithms can process
unstructured data such as that in descriptions and listing photos,
passing in long context and images directly to search poses latency
concerns. For these algorithms, every millisecond counts as they work to
deliver relevant results to buyers as quickly as possible. Spending time
filtering through unstructured data for every query is just not
feasible.
These constraints led us to a clear conclusion: to fully unlock the
potential of all inventory listed on Etsy’s site, unstructured product
information needs to be distilled into structured data to power both ML
models and buyer experiences.
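A minimal sketch of that distillation step, assuming attribute extraction from listing text; the attribute names and keyword rules here are illustrative, not Etsy's schema (a real system would use ML extractors over descriptions and photos):

```python
import re

# Illustrative attribute vocabularies for the sketch.
COLORS = {"red", "blue", "green", "black", "white"}
MATERIALS = {"ceramic", "wool", "leather", "oak", "brass"}

def distill(description: str) -> dict:
    """Turn free-text listing copy into structured attributes that can be
    indexed cheaply, instead of scanning long text at query time."""
    words = set(re.findall(r"[a-z]+", description.lower()))
    return {
        "colors": sorted(words & COLORS),
        "materials": sorted(words & MATERIALS),
    }

print(distill("Hand-thrown blue ceramic mug with a brass handle"))
# {'colors': ['blue'], 'materials': ['brass', 'ceramic']}
```

The point of the shape, not the rules: the expensive interpretation of unstructured data happens once at indexing time, so the latency-critical query path only touches structured fields.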
OpenAI’s coding agent Codex exists across many different surfaces: the
web app, the CLI, the IDE extension, and the new Codex macOS app. Under
the hood, they’re all powered by the same Codex harness—the agent loop
and logic that underlies all Codex experiences. The critical link
between them? The Codex App Server, a client-friendly, bidirectional
JSON-RPC API.
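A bidirectional JSON-RPC exchange looks roughly like the following. The method name and result fields here are hypothetical placeholders for illustration, not the App Server's actual protocol:

```python
import json

def rpc_request(req_id, method, params):
    """Frame a JSON-RPC 2.0 request, as a line-delimited JSON transport
    would send it over stdin/stdout or a socket."""
    return json.dumps({"jsonrpc": "2.0", "id": req_id,
                       "method": method, "params": params})

# Client -> server: start a conversation (hypothetical method name).
msg = rpc_request(1, "conversation/new", {"cwd": "/repo"})

# Server -> client: the matching response carries the same id, which is
# how a bidirectional connection pairs replies with in-flight requests.
reply = json.loads('{"jsonrpc": "2.0", "id": 1,'
                   ' "result": {"conversationId": "abc"}}')
assert reply["id"] == json.loads(msg)["id"]
```

Because the protocol is bidirectional, the server can also originate requests (e.g. asking the client to approve a command), using the same framing in the opposite direction.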
In this post, we’ll introduce the Codex App Server; we’ll share our
learnings so far on the best ways to bring Codex’s capabilities into
your product to help your users supercharge their workflows. We’ll cover
the App Server’s architecture and protocol and how it integrates with
different Codex surfaces, as well as tips on leveraging Codex, whether
you want to turn Codex into a code reviewer, an SRE agent, or a coding
assistant.
AI coding agents are rapidly reshaping how software is built, reviewed,
and maintained. As large language model capabilities continue to
increase, the bottleneck in software development is shifting away from
code generation toward planning, review, deployment, and coordination.
This shift is driving a new class of agentic systems that operate inside
constrained environments, reason over long time horizons, and integrate
across tools like IDEs, version control systems, and issue trackers.
OpenAI is at the forefront of AI research and product development. In
2025, the company released Codex, an agentic coding system designed to
work safely inside sandboxed environments while collaborating across
the modern software development stack.
A talk show about ideas and culture, produced and presented by Neil
Denny. Each show features guests from the worlds of science or the arts
in conversation. This week: George Saunders on his latest novel, Vigil.
Despite substantial investment in research data infrastructure, data
discovery remains a fundamental challenge in the era of open science.
The proliferation of repositories and the rapid growth of deposited data
have not resulted in a corresponding improvement in data findability.
Researchers continue to struggle to find data that are relevant to their
work, revealing a persistent gap between data availability and data
discoverability. Without rich, high-quality metadata, robust and
user-centred data discovery systems, and a deeper understanding of how
different researchers seek and evaluate data, much of the potential
value of open data remains unrealised.
This paper presents a set of practical, evidence-based recommendations
for data repositories and discovery service providers aimed at improving
data discoverability for both human and machine users. These
recommendations emphasise the importance of 1) understanding the search
needs and contexts of data users, 2) addressing the roles that data
repositories play in enhancing metadata quality to meet users’ data
search needs, and 3) designing discovery interfaces that support
effective and diverse search behaviours. By bridging the gap between
data curation practices, discovery system design, and user-centred
approaches, this paper argues for a more integrated and strategic
approach to data discovery.
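The "rich, high-quality metadata" the paper calls for is typically expressed in machine-readable vocabularies such as schema.org's Dataset type, which both human-facing search interfaces and crawlers can consume. A minimal illustrative record, with invented example values:

```python
import json

# A minimal schema.org Dataset record in JSON-LD: the kind of structured
# metadata that lets discovery services index a deposit by topic,
# license, and description rather than by repository-internal fields.
record = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example river-gauge measurements",  # invented example values
    "description": "Daily water-level readings, 2010-2020.",
    "keywords": ["hydrology", "river gauge", "time series"],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}
print(json.dumps(record, indent=2))
```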
Hister is a web history management tool that provides blazing fast,
content-based search for visited websites. Unlike traditional browser
history that only searches URLs and titles, Hister indexes the full
content of web pages you visit, enabling deep and meaningful search
across your browsing history.
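Content-based history search of this kind boils down to an inverted index over page text; a minimal sketch of the idea (not Hister's actual implementation):

```python
from collections import defaultdict

class HistoryIndex:
    """Toy inverted index mapping words to the URLs whose page content
    contains them."""
    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, url, page_text):
        """Index the full content of a visited page, not just its title."""
        for word in page_text.lower().split():
            self.postings[word].add(url)

    def search(self, query):
        """URLs whose content contains every query word."""
        sets = [self.postings[w] for w in query.lower().split()]
        return set.intersection(*sets) if sets else set()

idx = HistoryIndex()
idx.add("https://example.com/a", "ceramic kiln firing schedules")
idx.add("https://example.com/b", "kiln maintenance tips")
print(idx.search("kiln firing"))  # only the page mentioning both terms
```

Looking up postings by word is what makes the search "blazing fast": query cost depends on the number of matching pages, not on re-reading every page in the history.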
Feb 10 (Reuters) - Alphabet (GOOGL.O) on Tuesday sold a rare 100-year
bond, a memo from the lead manager showed, part of a $31.51 billion
global bond raise, as artificial intelligence-driven spending sparks a
surge in borrowing at U.S. tech giants. Alphabet’s sale of the century
bond is the tech industry’s first since Motorola’s (MSI.N) issuance
that dates back to 1997, according to LSEG data.
In the computer industry, the Wheel of Reincarnation is a pattern
whereby specialized hardware gets spun out from the “main” system,
becomes more powerful, then gets folded back into the main system. As
the linked Jargon File entry points out, several generations of this
effect have been observed in graphics and floating-point coprocessors.
In this essay, I note an analogous pattern taking place, not in
peripherals of a computing platform, but in the most basic kinds of
“computing platform.” And this pattern is being driven as much by the
desire for “freedom” as by any technical consideration.
The abstraction ship sailed decades ago. We just didn’t notice because
each layer arrived gradually enough that we could pretend we still
understood the whole stack. AI is just the layer that made the pretence
impossible to maintain.
In January 2025, OCLC made significant changes to the web and
application programming interfaces for Virtual International Authority
File (VIAF) clusters. This article will compare the old and new
interfaces, highlighting the pros and cons introduced and calling
attention, especially, to critical errors introduced that compromise the
functionality of much of the VIAF product. Consequently, it will raise
questions and concerns regarding the governance of VIAF, as well as
OCLC’s development model, testing, and feedback before public rollout.
The software industry, historically driven by creativity, faces a
paradox. While developers are drawn to intellectual challenges, their
creativity is increasingly constrained by efficiency-driven methods and
so-called productivity metrics. Although positioned as innovation
engines, Agile software development (hereinafter referred to as Agile)
and open-source software (OSS) approaches may prioritize incrementalism
over transformative breakthroughs. This tension between structure and
creativity threatens individual potential and the industry’s capacity
for meaningful innovation. Without addressing this gap, contemporary
development approaches may fail to support the creativity necessary for
crafting novel and impactful software. This dissertation examines this
gap, investigating how modern development approaches shape individual
creativity into project-level innovation. Drawing on multi-level
interactionist theories of creativity, we explore the conditions under
which individual, team, and organizational interactions foster or
constrain creative outcomes. By addressing this critical gap, our
research reconceptualizes development methodologies as enablers of
radical innovation rather than constraints, ensuring the industry’s
continued creative and transformative impact.

Using a sequential
exploratory mixed-methods design, this dissertation integrates
qualitative and quantitative techniques to analyze creativity within
software development. The qualitative strand examines individual
developer experiences through 31 semi-structured interviews with Agile
practitioners. The quantitative strand assesses cognitive conflict’s
impact on team performance in OSS development, analyzing 40 projects and
82,949 code commits. The mixed convergent strand evaluates corporate and
open governance interplay, leveraging data from 40 projects, 10,862
releases, and 15 interviews.

By synthesizing insights across these
strands, this dissertation delivers theoretical contributions and
actionable guidance for fostering creativity in software development. We
challenge the myth of developers as lone “rockstars” or “hackers” by
demonstrating the critical role of social interactions in shaping
creativity and innovation. Empirical findings reveal that review-stage
interactions—such as pull requests and code reviews—mediate and
transition from creativity to innovation, while project governance
moderates this relationship further. This dissertation highlights how
individual, team, and organizational dynamics influence creative
outcomes by operationalizing cognitive conflict and release commit
novelty. These insights advance theoretical understanding and offer
practical strategies for unlocking the innovative potential of
contemporary development practices.
In the wake of immigration agents’ killings of three US citizens within
a matter of weeks, the Department of Homeland Security is quietly moving
forward with a plan to expand its capacity for mass detention by using a
military contract to create what Pablo Manríquez, the author of the
immigration news site Migrant Insider, calls “a nationwide ‘ghost
network’ of concentration camps.”