First, let's look at the live web goo.gl URLs. The ones that were not lucky enough to receive traffic in "late 2024" no longer return a sunset message, and now return a garden variety HTTP 404 (image above, HTTP response below):
Recall that goo.gl/0R8XX6 was one of the 26 shortened URLs from a 2017 survey of data sets for self-driving cars that was not lucky enough to have received traffic during late 2024, and thus would no longer redirect (the other 25 shortened URLs are still redirecting). One reason I had put off posting about this finding is that, other than confirming that Google did the thing they said they were going to do, there didn't seem to be a surprising or interesting outcome. But it turns out I was wrong: it appears that you can look at the HTML entity to determine whether there was ever a redirection at the now-404 shortened URL.
I wanted to test if goo.gl would return a 410 Gone response for URLs that no longer redirect. The semantics of a 410 are slightly stronger than a 404, in that a 410 allows you to infer that there used to be a resource identified by this URL, but there isn't now. A regular 404 doesn't allow you to distinguish something that used to be 200 (or 302*, in the case of goo.gl) from something that was never 200 (or 301, 302, etc.). Unfortunately, 410s are rare on the live web, but goo.gl deprecating some of its URLs seemed like a perfect opportunity to use them. In my testing of shortened URLs, however, I discovered that you get a different HTML entity depending on whether the goo.gl URL ever existed.
Let's take a look at the HTML entity that comes back via curl (I've created a gist with the full responses, but here I'll just show byte count):
% curl -s https://goo.gl/0R8XX6 | wc -c
1652
Doing the same thing for a shortened URL that presumably never existed, we get a response that's about 5X bigger (9,237 bytes vs. 1,652 bytes), even though it's still an HTTP 404:
Note that the 404 page shown in the top image is the same Google-branded 404 page that one gets from google.com; for example google.com/asdkfljlsdjfljasdljfl.
It's possible there's a regular expression that checks for goo.gl-style hashes in the URLs and "asdkfljlsdjfljasdljfl" was handled differently. So next I tested a pair of six-character hashes, goo.gl/111111 vs. goo.gl/111112, and got the same behavior: both 404, but 111112's HTML was about 5X bigger than 111111's HTML:
Turns out that I was lucky with my first pair of random strings: goo.gl/111111 has an archived redirection and goo.gl/111112 does not, with 111111 also not being popular in "late 2024". While the archived redirection proves that there was a redirection for 111111, the lack of an archived redirection for 111112 technically does not prove that there was never a redirection (there could have been one and it wasn't archived). While I could spend more time trying to reverse engineer goo.gl and Firebase, I will be satisfied with my initial guess and trust my intuition, which says that the different returned HTML entities allow you to determine which goo.gl URLs used to redirect (i.e., de facto HTTP 410s) vs. which goo.gl URLs never redirected (i.e., correct HTTP 404s).
So the current status of goo.gl is even crazier than it first seems: rather than simply having all the goo.gl URLs redirect, Google is keeping a separate list of goo.gl URLs that do not redirect. We now have:
goo.gl URLs that still redirect correctly
goo.gl URLs that no longer redirect, but goo.gl knows they used to redirect, because they return a Google-branded 404 page
goo.gl URLs that never redirected (i.e., were never really goo.gl shortened URLs), for which goo.gl returns a Firebase-branded 404 page
I suppose we should be happy that they did not deprecate all of the goo.gl URLs, but surely keeping all of them would have been easier.
Fortunately, web archives, specifically IA's Wayback Machine in this case, have archived these redirections. The Wayback Machine is especially important in the case of goo.gl/0R8XX6, since its redirection target, 3dvis.ri.cmu.edu/data-sets/localization/, no longer resolves, and the page is not unambiguously discoverable via a Google search. In this case, we need the Wayback Machine to get both the goo.gl URL and the cmu.edu URL.
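As an aside, checking whether the Wayback Machine holds a capture can itself be scripted. Here is a minimal sketch against the Wayback Machine availability API; the jq path follows that API's documented JSON response, and the script name is just illustrative:

#!/bin/bash
# Usage: ./wayback-check.sh goo.gl/0R8XX6
URL="$1"
# The availability API returns the closest archived snapshot, if any.
curl -s "https://archive.org/wayback/available?url=$URL" |
    jq -r '.archived_snapshots.closest.url // "no capture found"'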
So there is a possible, but admittedly unlikely, use case for this bit of knowledge. If you're resolving goo.gl URLs and get a 404 instead of a 302, check the Wayback Machine; it probably has the redirect archived. If the Wayback Machine doesn't have the redirect archived, you can check the HTML entity returned in the goo.gl 404 response: Google-branded 404s (deprecated goo.gl URLs) are much smaller than Firebase-branded 404s (never-valid goo.gl URLs). A small, Google-branded 404 page is a good indicator that there used to be a redirection, and if the Wayback Machine doesn't have it archived, maybe another web archive does.
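That heuristic is easy to turn into a quick shell sketch. The 5,000-byte cutoff below is my own guess based on the 1,652- vs. 9,237-byte responses shown above, not anything Google documents, so treat it as illustrative:

#!/bin/bash
# Usage: ./googl-check.sh 0R8XX6
HASH="$1"
STATUS=$(curl -s -o /dev/null -w '%{http_code}' "https://goo.gl/$HASH")
if [ "$STATUS" != "404" ]; then
    echo "goo.gl/$HASH returned HTTP $STATUS (probably still redirecting)"
    exit 0
fi
# For 404s, compare the size of the HTML entity: a small, Google-branded page
# suggests a deprecated redirect (de facto 410); a large, Firebase-branded page
# suggests the hash never existed.
BYTES=$(curl -s "https://goo.gl/$HASH" | wc -c | tr -d ' ')
if [ "$BYTES" -lt 5000 ]; then
    echo "goo.gl/$HASH: 404, $BYTES bytes -- likely used to redirect; try the Wayback Machine"
else
    echo "goo.gl/$HASH: 404, $BYTES bytes -- likely never a valid goo.gl shortened URL"
fi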
Archives and special collections contain a wide range of resource types requiring different metadata workflows. Resources may be described in library catalogs, digital repositories, or finding aids, and the metadata can vary greatly because of platforms, collection priorities, and institutional policies. Providing online access and discovery for these unique resources presents an ongoing challenge because of inconsistent or incomplete metadata and new digital accessibility standards. AI presents new possibilities for providing access to unique resources in archives and special collections, where it may be used to generate data—like captions and transcriptions—that plays to the strengths of large language models (LLMs).
This blog post—the second in our series on the work of the OCLC Research Library Partnership (RLP) Managing AI in Metadata Workflows Working Group—focuses on the “Metadata for Special and Distinctive Collections” workstream. It shares current uses of AI by members, insights on assessing whether AI is suitable for a task, and open questions about accuracy and data provenance.
Participants
This workstream brought together metadata professionals from diverse institutions, including academic libraries, national archives, and museums. Their collective expertise and the use cases they shared provided valuable insights into how AI tools can address the unique challenges of special and distinctive collections. Members of this group included:
Helen Baer, Colorado State University
Jill Reilly, National Archives and Records Administration
Amanda Harlan, Nelson-Atkins Museum of Art
Mia Ridge, British Library
Miloche Kottman, University of Kansas
Tim Thompson, Yale University
Integration in existing tools
Participants primarily described using tools already available to them through existing licensing agreements with their parent institution. While this works for proof-of-concept experimentation, these ad hoc approaches do not scale up to production levels or provide the desired increases in efficiency. Participants expressed that they want integrated tools within the library workflow products they are already using.
Using multiple tools is a long-standing feature of metadata work. In the days of catalog cards, a cataloger might have a bookcase full of LCSH volumes (i.e., the big red books), LCC volumes, AACR2, LCRIs, a few language dictionaries, a few binders of local policy documents, and, of course, a typewriter manual. Today, a cataloger may have four or five applications open on their computer, including a browser with several tabs. Working with digital collections compounds this complexity, requiring additional tools for content management, file editing, and project tracking. Since AI has already been integrated into several popular applications, including search engines, metadata managers hope to see similar functionality embedded within their existing workflows, potentially reducing the burden of managing so many passwords, windows, and tabs.
Entity management
Many metadata managers, including our subgroup members, dream of automated reconciliation against existing entity databases. This becomes even more important for archives, which often contain collections of family papers with multiple members with the same names. A participant observed that URIs are preferable for disambiguation due to the requirement to create unique authorized access points for persons using a limited set of data elements. The natural question then becomes, “How can AI help us do this?”
Yale University’s case study explored this question, noting that it used AI in combination with many other tools, as using an LLM alone for this work would have been prohibitively expensive. The technology stack, shared in the entity resolution pipeline, includes a purpose-built vector database for text embeddings. The results included a 99% precision rate in determining whether two bibliographic records with different headings (e.g., “Schubert, Franz” and “Schubert, Franz, 1797-1828”) referred to the same person, and the approach did not make the traditional string-match errors that occur when identical name strings refer to different persons. This case study demonstrated how AI can be used effectively in combination with multiple tools, but it may also require technical expertise beyond that of many librarians and archivists.
Readiness and need
All participants indicated some level of organizational interest in experimenting with AI to address current metadata needs. Due to distinct workflows and operations common in special collections and archives, there were fewer concerns about AI replacing human expertise than in the general cataloging subgroup.
We identified three factors influencing their willingness to experiment with AI:
Traditional divisions of labor
Quantity of resources to be described
Meeting accessibility requirements
Traditional divisions of work
In archival work, item-level description elements, such as image captions and transcripts, have often been created selectively by volunteers and student workers rather than metadata professionals, due to the volume of items and because this work does not require specialized skills.* For example, the United States’ National Archives and Records Administration (NARA) relies on its Citizen Archivist volunteer program to provide tagging and transcription of digitized resources. Even with these dedicated volunteers, NARA uses AI-generated descriptions because of the extensive number of resources. However, NARA’s volunteers provide quality control on the AI-generated metadata, and the amount of metadata generated by AI ensures that these volunteers continue to be needed and appreciated.
Quantity of resources
Archival collections may range from a single item to several thousand items, resulting in significant variation in the type and level of description provided. Collection contents are often summarized with statements such as “45 linear feet,” “mostly typescripts,” and “several pamphlets in French.” However, when collections are digitized, more granular description is required to support discovery and access. The workflow at NARA is a good demonstration of how an archive uses AI to provide description at a scale that is not feasible for humans. Many archivists have been open to the idea of using AI for these tasks because the quantity of resources meant that detailed metadata was not possible.
Meeting accessibility requirements
Accessibility is a growing priority for libraries and archives, driven by legal requirements such as the ADA Title II compliance deadline in the US. For digital collections, this may mean providing alt text for images, embedded captions and audio descriptions for video recordings, and full transcripts for audio recordings.
A participant observed that, in their experience with AI-generated transcripts, AI does well transcribing single-language, spoken word recordings. However, the additional nuances with singing and multiple-language recordings are too complex for AI. This provides a natural triage for audio transcript workflows in their institution.
Creating transcripts of audio recordings is time-consuming, and archives have largely relied on student workers and volunteers for this work. Many institutions have a backlog of recordings with no transcriptions available. Thus, using AI for transcripts enables them to meet accessibility requirements and increase discovery of these resources.
Challenges and open questions around the use of AI
While AI offers opportunities, the group also identified several challenges and open questions that must be addressed for successful implementation. Metadata quality and data provenance were the top issues emerging for special and distinctive collections.
Assessing metadata quality
What is an acceptable error rate for AI-generated metadata? Participants noted that while perfection is unattainable, even for human catalogers, institutions need clear benchmarks for evaluating AI outputs. Research providing comparative studies of error rates between AI and professional catalogers would prove valuable for informing AI adoption decisions, but few such findings currently exist. High precision remains critical for maintaining quality in library catalogs, as misidentification of an entity will provide users with incorrect information about a resource.
The subgroup also discussed the concept of “accuracy” in transcription. For instance, AI-generated transcripts may be more literal, while human transcribers often adjust formatting to improve context and readability. An example from NARA showing a volunteer-created transcription and the AI data (labeled as “Extracted Text”) illustrates these differences. The human transcription moves the name “Lily Doyle Dunlap” to the same line as “Mrs.”, but the AI transcribes line by line. While the human transcriber noted untranscribed text as “[illegible],” the AI transcribed it as “A.” Neither reflects what was written, so both could be described as not completely accurate. Unlike cataloging metadata, there has never been an expectation that transcriptions of documents or audiovisual records would be perfect in all cases for various reasons, including handwriting legibility and audio quality. One participant characterized their expectations for AI-generated transcripts as “needed to be good, but not perfect.”
One case study used confidence scores as a metric to determine whether the AI-generated metadata should be provided to users without review. Confidence scores provide a numerical value indicating the probability that the AI output is correct. For example, a value of over 70% might be set as a threshold for providing data without review. Because confidence scores are provided by the models themselves, they are as much a reflection of the model’s training as its output.
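As a concrete illustration of this kind of triage, here is a minimal shell sketch using jq. The file name and the "confidence" field are hypothetical, standing in for whatever structure a given tool actually emits, and the 0.7 threshold is simply the example value mentioned above:

#!/bin/bash
# Split AI-generated records into a publish queue and a human-review queue,
# based on a per-record confidence score between 0 and 1.
THRESHOLD=0.7

# Records at or above the threshold are candidates for release without review.
jq --argjson t "$THRESHOLD" '[.[] | select(.confidence >= $t)]' ai_output.json > publish_queue.json

# Everything below the threshold is routed to human reviewers.
jq --argjson t "$THRESHOLD" '[.[] | select(.confidence < $t)]' ai_output.json > review_queue.json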
Providing data provenance
Data provenance—the story of how metadata is created—is a critical concern for AI-generated outputs. Given the risk of AI “hallucinations” (generating incorrect or fabricated data), it is important to provide information to users about AI-created metadata. Working group members whose institutions are currently providing such data provenance shared their practices. NARA indicates that a document transcript is AI-generated using the standard text “Contributed by FamilySearch NARA Partner AI / Machine-Generated” (see this example for extracted text of a printed and handwritten document).
OCLC recognizes the importance of this issue to the community and is providing support in these ways:
Updated WorldCat documentation: Section 3.5 of the Bibliographic Formats and Standards (BFAS) now includes guidance on recording AI-generated metadata.
AskQC Office Hours webinar: The August 2025 session focused on providing data provenance in bibliographic records, including AI use cases.
Metadata professionals have a long-standing interest in the use of automation to provide and improve metadata, and AI joins macros, controlling headings, and batch updates as the latest technology tool in this effort. Our subgroup’s case studies demonstrated that AI tools can be used in special collections workflows in cases where AI is well-suited to the metadata needed. The most compelling applications involved transcribing documents and recordings, where AI capabilities, such as automatic speech recognition (ASR) and natural language processing (NLP), make it a good fit for such tasks.
NB: As you might expect, AI technologies were used extensively throughout this project. We used a variety of tools—Copilot, ChatGPT, and Claude—to summarize notes, recordings, and transcripts. These were useful for synthesizing insights for each of the three subgroups and for quickly identifying the types of overarching themes described in this blog post.
*It is worth noting that the labor available to national and university archives includes volunteers and student workers, whereas a smaller stand-alone archive like a historical society would not have access to so many human resources.
While here, basking inside the grandeur of the Sainte-Geneviève Library (Paris), I have finished curating four Distant Reader study carrels:
Emma by Jane Austen
The Iliad and the Odyssey by Homer
Moby Dick by Herman Melville
Walden by Henry David Thoreau
Introduction
Distant Reader study carrels are data sets intended to be read by people as well as computers. They are created through the use of a tool of my own design -- the Distant Reader Toolbox. Given an arbitrary number of files in a myriad of formats, the Toolbox caches the files, transforms them into plain text files, performs feature extractions against the plain text, and finally saves the results as sets of tab-delimited files as well as an SQLite database. The files and the database can then be computed against -- modeled -- in a myriad of ways: extents (sizes in words and readability scores), frequencies (unigrams, bigrams, keywords, parts-of-speech, named entities), topic modeling, network analysis, and a growing number of indexes (concordances, full-text searching, semantic indexing, and more recently, large language model embeddings).
I call these data sets "study carrels", and they are designed to be platform- and network-independent. Study carrel functionality requires zero network connectivity, and study carrel files can be read by any spreadsheet, database, analysis program (like OpenRefine), or programming language. Heck, I could even compute against study carrels on my old Macintosh SE 30 (circa 1990) if I really desired. For more detail regarding study carrels, see the readme file included with each carrel. All that said, once a carrel is created, it lends itself to all sorts of analysis, automated or not. The "not automated" analysis I call "curation", which is akin to a librarian curating any of their print collections.
With this in mind, I have curated four study carrels. I divided each of the four books (above) into its individual chapters, created study carrels from the results, and applied distant reading against each: I observed the results, summarized my observations, and documented what I learned. Since each curation details what I learned, I won't go into all of it here, but I will highlight some of the results of my topic modeling.
Topic modeling
In a sentence, topic modeling is an unsupervised machine learning process used to enumerate the latent themes in any corpus. Given an integer (T), topic modeling divides a corpus into T topics and outputs the words associated with each topic. Like most machine learning techniques, the topic modeling process is nuanced and therefore the results are not deterministic. Still, topic modeling can be quite informative. For example, once a model has been created, the underlying documents can be associated with ordinal values (such as dates or sequences of chapters). The model can then be pivoted so the ordinal values and the topic model weights are compared. Finally, the pivoted table can be visualized in the form of a line chart. Thus, a person can address the age-old question, "How did such and such and so and so topic ebb and flow over time?" This is exactly what I did with Emma, the Iliad and the Odyssey, Moby Dick, and Walden. In each and every case, my topic modeling described the ebb and flow of the given book, which, in the end, was quite informative and helped me characterize each.
Emma
I topic modeled Emma with only four topics, and I assert the novel is about "emma", "engagement", "charade", and "jane". Moreover, these topics can be visualized as a pie chart as well as a line chart. Notice how "emma" dominates. From my point of view, it is all about Emma and her interactions/relationships with the people around her. For more elaboration, see the curated carrel.
label    weight    features
emma     3.0353    emma harriet weston knightley elton time great woodhouse quite nothing dear always
I did the same thing with the Iliad and the Odyssey, but this time I modeled with a value of eight. From this process, I assert the epic poems are about "man", "trojans", "achaeans", "achilles", "sea", "ulysses", "horses", and "alcinous". This time "man" dominates, but "trojans" and "achaeans" are a close second. More importantly, plotting the topics over the sequence of the books (time), I can literally see how the two poems are distinct stories; notice how the first part of the line chart is all about "trojans", and the second is all about "man". See the curated carrel for an elaboration.
I topic modeled Moby Dick with a value of ten, and the resulting topics included: "ahab", "whales", "soul", "pip", "boats", "queequeg", "cook", "whaling", "jonah", and "bildad". The topics of "ahab" and "whales" dominate, and if you know the story, then this makes perfect sense. Topic modeling over time illustrates how the book's themes alternate, and thus I assert the book is not only about Ahab's obsession with the white whale, but it is also about the process of whaling, kinda like an instruction manual. Again, see the curated carrel for an elaboration.
labels      weights    features
ahab        0.99807    ahab man ship sea time stubb head men
whales      0.2336     whales sperm leviathan time might fish world many
soul        0.08189    soul whiteness dick moby brow mild times wild
pip         0.08068    pip carpenter coffin sun fire blacksmith doubloon try-works
boats       0.07229    boats line air spout water oars tashtego leeward
queequeg    0.06133    queequeg bed room landlord harpooneer door tomahawk bedford
cook        0.0569     cook sharks dat blubber mass tun bucket bunger
whaling     0.05234    whaling ships gabriel voyage whale-ship whalers fishery english
jonah       0.04218    jonah god loose-fish fast-fish law shipmates guernsey-man woe
bildad      0.02847    bildad peleg steelkilt sailor gentlemen lakeman radney don
Unlike the other books, Walden is not a novel but instead a set of essays. Set against the backdrop of a pond (but we would call it a lake), Thoreau elaborates on his observations of nature and what it means to be human. In this case I modeled with seven topics, and the results included: "man", "water", "woods", "beans", "books", "purity", and "shelter". Yet again, the topic of "man" dominates, but notice how closely each of the chapters' titles corresponds to each of the computed topics. As I alluded to previously, pivoting a topic model on some other categorical value often brings very interesting details to light. See the curated carrel for more detail.
labels    weights    features
man       1.64       man life men house time day part world get morning work thought
water     0.56165    water pond ice shore surface walden spring deep bottom snow winter summer
woods     0.2046     woods round fox pine door bird snow evening winter night suddenly near
beans     0.19471    beans hoe fields seed cultivated soil john corn field planted labor dwelt
books     0.19032    books forever words language really things men learned concord intellectual news wit
I have applied both traditional as well as distant reading to four well-known books. I have documented what I learned, and this documentation has been manifested as a set of four curated Distant Reader study carrels. I assert traditional reading's value will never go away. After all, novels and sets of essays are purposely designed to be consumed through traditional ("close") reading. On the other hand, the application of distant reading can quickly and easily highlight all sorts of characteristics which are not, at first glance, very evident. The traditional and distant reading processes complement each other.
In the context of electronic books, I've always been frustrated by how reading applications relegate table-of-contents navigation to a minor feature in their UI/UX.
(Note: Throughout history, indexes — those alphabetical listings at the back of books — have been crucial for knowledge access, as Dennis Duncan explores in "Index, A History of the". But this post focuses on tables of contents, which show the hierarchical structure of chapters and sections.)
I find it very useful to view the table of contents before opening a book. I often do this in the terminal. You can easily create your own script in any programming language using an existing library for EPUB files (or PDF, or whatever format you need to read). For EPUB files, the simplest approach I have found is to use Readium CLI and jq to print a tree-like structure of the book. This is the script I use:
#!/bin/bash
# Usage: ./epub-toc.sh <epub-file>
if [ $# -eq 0 ]; then
echo "Usage: $0 <epub-file>" >&2
exit 1
fi
EPUB_FILE="$1"
if [ ! -f "$EPUB_FILE" ]; then
echo "Error: File '$EPUB_FILE' not found" >&2
exit 1
fi
if ! command -v readium &> /dev/null; then
echo "Error: 'readium' command not found. Please install readium-cli." >&2
exit 1
fi
if ! command -v jq &> /dev/null; then
echo "Error: 'jq' command not found. Please install jq." >&2
exit 1
fi
readium manifest "$EPUB_FILE" | jq -r '
def tree($items; $prefix):
$items | to_entries[] |
(if .key == (($items | length) - 1) then
$prefix + "└── "
else
$prefix + "├── "
end) + .value.title,
(if .value.children then
tree(.value.children; $prefix + (if .key == (($items | length) - 1) then " " else "│ " end))
else
empty
end);
if .toc then
tree(.toc; "")
else
"Error: No .toc field found in manifest" | halt_error(1)
end
'
Example of a book with a long and nested table of contents
~ readium-toc La_comunicazione_imperfetta_-_Peppino_Ortoleva_Gabriele_Balbi.epub
├── Copertina
├── Frontespizio
├── LA COMUNICAZIONE IMPERFETTA
├── Introduzione
│ ├── 1. I percorsi movimentati, e accidentati, del comunicare.
│ ├── 2. Teorie lineari della comunicazione: una breve archeologia.
│ ├── 3. Oltre la linearità, verso l’imperfezione.
│ └── 4. La struttura del libro.
├── Parte prima. Una mappa
│ ├── I. Malintesi
│ │ ├── 1. Capirsi male. Un’introduzione al tema.
│ │ ├── 2. Una prima definizione, anzi due.
│ │ ├── 3. A chi si deve il malinteso.
│ │ ├── 4. Il gioco dei ruoli.
│ │ ├── 5. L’andamento del malinteso.
│ │ ├── 6. Le cause del malinteso.
│ │ │ ├── 6.1. Errori e deformazioni materiali.
│ │ │ ├── 6.2. Parlare lingue diverse.
│ │ │ ├── 6.3. La comunicazione non verbale: toni, espressioni, gesti.
│ │ │ ├── 6.4. La comunicazione verbale: l’inevitabile ambiguità del parlare.
│ │ │ ├── 6.5. Detto e non detto.
│ │ │ └── 6.6. Sovra-interpretare.
│ │ ├── 7. Le conseguenze: il disagio e l’ostilità.
│ │ ├── 8. La spirale del non capirsi.
│ │ ├── 9. Uscire dal malinteso.
│ │ └── 10. Il ruolo del malinteso nella comunicazione umana.
│ ├── II. Malfunzionamenti
│ │ ├── 1. Malfunzionamenti involontari.
│ │ ├── 2. Malfunzionamenti intenzionali.
│ │ ├── 3. (In)tollerabilità del malfunzionamento.
│ │ ├── 4. Contrastare il malfunzionamento: manutenzione e riparazione.
│ │ ├── 5. Produttività del malfunzionamento.
│ │ └── 6. Relativizzare il malfunzionamento: per una conclusione.
│ ├── III. Scarsità e sovrabbondanza
│ │ ├── 1. Il peso della quantità.
│ │ │ ├── 1.1. La scarsità informativa: effetti negativi e produttivi.
│ │ │ ├── 1.2. La sovrabbondanza informativa: effetti negativi e produttivi.
│ │ │ └── 1.3. Qualche principio generale.
│ │ ├── 2. Politiche della scarsità e politiche dell’abbondanza.
│ │ │ ├── 2.1. Accesso all’informazione, accesso al potere.
│ │ │ └── 2.2. Controllare la circolazione dell’informazione: limitare o sommergere.
│ │ ├── 3. Scarsità e abbondanza nell’economia della comunicazione.
│ │ │ ├── 3.1. Il valore dell’informazione tra domanda e offerta.
│ │ │ ├── 3.2. L’economia dell’attenzione.
│ │ │ └── 3.3. I padroni della quantità.
│ │ ├── 4. Le basi tecnologiche della scarsità e della sovrabbondanza.
│ │ │ ├── 4.1. Effluvio comunicativo e scarsità materiale.
│ │ │ └── 4.2. Scarsità e abbondanze oggettive o create ad arte.
│ │ └── 5. Gestire il troppo e il troppo poco.
│ │ ├── 5.1. Il troppo stroppia o melius est abundare quam deficere?
│ │ ├── 5.2. Colmare un ambiente povero di informazioni.
│ │ └── 5.3. Due concetti relativi.
│ └── IV. Silenzi
│ ├── 1. La comunicazione zero.
│ │ ├── 1.1. La presenza dell’assenza.
│ │ ├── 1.2. I silenzi comunicano.
│ │ └── 1.3. Silenzi codificati e silenzi enigmatici.
│ ├── 2. Il silenzio del mittente.
│ ├── 3. Il silenzio del ricevente.
│ ├── 4. Il silenzio dei pubblici.
│ ├── 5. Silenzi parziali: le omissioni.
│ ├── 6. Il valore del silenzio: il segreto.
│ │ ├── 6.1. Una breve tipologia dei segreti.
│ │ ├── 6.2. Preservare e carpire i segreti.
│ │ └── 6.3. Ancora sulla fragilità del segreto.
│ └── 7. I paradossi del silenzio.
├── Parte seconda. Verso una teoria
│ └── V. La comunicazione è imperfetta
│ ├── 1. L’imperfezione inevitabile.
│ ├── 2. Correggere, rimediare.
│ │ ├── 2.1. Prima dell’invio: le correzioni umane, e non.
│ │ ├── 2.2. Durante l’invio.
│ │ └── 2.3. E quando la comunicazione ha già raggiunto il destinatario o l’arena pubblica?
│ ├── 3. Le vie dell’adattamento.
│ │ ├── 3.1. Avere tempo.
│ │ ├── 3.2. Adattarsi e adattare a sé.
│ │ └── 3.3. Tra le persone, con gli strumenti.
│ └── 4. Dal lineare al non lineare e all’imperfetto.
├── Bibliografia
├── Il libro
├── Gli autori
└── Copyright
The serious discussion I'd like to engage in — though I'm not sure where or which community would be best for this — is whether online book catalogs, from both stores and public libraries, publish their books' table of contents and whether those are searchable.
Is this a technical limitation, a licensing restriction from publishers, or simply an overlooked feature? Being able to search within tables of contents would significantly improve book discovery and research workflows.
Here are some examples I know so far:
DigiTocs by the University of Bologna
DigiTocs is a service launched by the University of Bologna in 2009 that provides online access to indexes, tables of contents, and supplementary pages from books cataloged in their library system.
The service works through a distributed network of participating university libraries, each responsible for digitizing and uploading pages along with OCR-generated text and metadata. The platform is integrated with the library's OPAC (online catalog), allowing users to view and search digitized indexes and tables of contents directly from catalog records (example book and its TOC).
Neural Archive
Neural Archive is the online catalog of the library maintained by Neural Magazine. For each book they review, they publish high-quality cover images, minimal metadata, and the book's TOC.
In the summer of 2025, I was selected for Google Summer of Code (GSoC), a program that introduces new contributors to open source software development. I had the opportunity to contribute to the Internet Archive, an organization I have long admired for its efforts to preserve digital knowledge for all.
Numerous open source organizations annually participate in the program as mentoring organizations (2025 mentoring organizations), and that includes the Internet Archive. As a GSoC contributor, I was mentored by Dr. Sawood Alam, Research Lead of the Wayback Machine and WS-DL alum. Over the coding period, our project focused on detecting social media content in TV news, specifically through logo and screenshot detection. My work as a contributor is documented in my previous blog post, while this post highlights the GSoC program and my experience in it.
Becoming a GSoC contributor
Becoming a GSoC contributor is open to beginners in open source (students and non-students alike) who meet a few basic requirements: you must be at least 18 years old at the time of registration, be a student or a newcomer to open source, be eligible to work in your country of residence during the program, and not reside in a country currently under U.S. embargo. The application process begins by exploring project ideas listed by mentoring organizations, drafting a proposal, and submitting it to Google for review. The project ideas are published on each organization’s page, and contributors can choose one (or more) of these ideas to develop into a proposal. Alternatively, you can propose your own project idea (this is the option I chose) that may be of interest to the organization you are applying to. Contributors are encouraged to share their drafts with mentors from the organization to get feedback before submitting to Google. Once accepted, contributors spend the summer coding under the guidance of a mentor.
Working on My Project
Information diffusion on social media has been widely studied, but little is known about how social media is referenced in traditional TV news. Our project addresses this gap by analyzing broadcasts for such references by detecting social media logos and screenshots of user posts.
My original proposal to GSoC involved training object detection and image classification models. However, we then pivoted to using large language models (LLMs), specifically GPT-4o, for logo and screenshot detection. This change was worthwhile, as we realized that LLMs could perform logo and screenshot detection tasks with significantly less manual data labeling and setup than traditional machine learning approaches. It also taught me to stay flexible and adapt my methods when needed.
This was my first time working with LLMs. I have learned a lot, and am still learning about creating effective prompts and integrating this model into a functional pipeline.
Beyond coding, GSoC taught me several valuable lessons. It is really important to stay flexible and to communicate regularly with your mentors. It is also crucial to prioritize your work, deferring less critical tasks to future work in order to maintain steady progress. And of course, effective time management is key, since juggling work and life requires careful planning.
The Best Part
For me, the most exciting part of GSoC was working with the Internet Archive team. I had weekly meetings with my mentors: Dr. Sawood Alam, my assigned GSoC mentor, and Will Howes, a software engineer at the Internet Archive. Will was mentoring two other GSoC students who joined the same sessions. Both mentors were very helpful, very responsive through Slack, and always offered advice whenever needed. The Internet Archive leadership, such as Mark Graham, the Director of the Wayback Machine, and Roger Macdonald, the founder of the TV News Archive, created a welcoming environment for contributors and always made sure we had the resources we needed.
Being added to the TV News Archive guest Slack channel and invited to join the weekly TV News Archive team meetings during the Summer were great opportunities for me as a student researcher interested in this field. It was nice to observe how the team curates and preserves broadcast news content, and to learn about their ongoing projects.
Final Thoughts
GSoC was more than just a coding program: it was a huge opportunity for me to learn from great mentors and contribute to the open source community. I hope to stay involved with the Internet Archive and its team. The technical and collaborative skills I gained, especially from working with LLMs, boosted my confidence as a student researcher. Finally, being selected as a GSoC contributor was a great experience, not to mention a notable addition to my resume, and I would definitely consider applying again.
In 2019, I packed my bags and flew from Sri Lanka to Virginia to begin my Ph.D. in Computer Science at Old Dominion University. I did not have a clear roadmap or any prior research experience; all I had was the hope that I would be able to figure things out along the way. After six years, I found myself diving deep into eye-tracking, human-computer interaction, and machine learning; eventually completing my dissertation in multi-user eye-tracking using commodity cameras, with the support of my advisor, Dr. Sampath Jayarathna, NIRDS Lab, and ODU Web Science and Digital Libraries Research group.
When I started my Ph.D. at ODU, I had limited knowledge and experience in eye tracking and computer vision research. After learning about ongoing research at the lab on cognitive load using eye tracking, I was fascinated by how we could use technology to better understand humans in terms of their intentions, focus, attention, and interactions with the world. That curiosity, combined with my liking for working with hardware, eventually led me to eye-tracking research.
Early on, I realized that most eye-tracking studies focused on single users, highly controlled environments, and expensive hardware. That works for lab studies, but the real world is messy, as we experienced during our first event participation, STEAM on Spectrum at VMASC. Our demo application for eye tracking was successful for a single user in the laboratory environment, but it did not perform well in the real world. Also, since we had only one eye tracker for the demo, only one person could experience eye tracking, while the others had to wait in line away from the tracker. These problems led us to question how we could enable two or more people to interact with an eye tracker while also measuring their joint attention, which a traditional eye tracker could not do. That was when the idea for Multi-Eyes started to take shape.
First, we started with the trivial approach of having a dedicated eye tracker for each user. It worked well until the users started moving, which sometimes prevented the eye trackers from capturing valid eye-tracking data and gave us incorrect values. Movement constraints and the high cost of eye trackers made the setup very expensive and difficult to use in real-life applications. The approach was less suitable when participants were physically together, but it worked well when they were online, which we later published at CHIIR 2023 as "DisETrac: Distributed Eye-tracking for Online Collaboration."
Due to the limitations of this approach, mainly the need for a dedicated device for each participant, we attempted to create Multi-Eyes using low-cost, commodity cameras, such as webcams, thereby eliminating the need for specialized eye-tracking hardware. Although modern eye trackers made the process appear simple, there were numerous challenges to overcome when building Multi-Eyes.
The first challenge was developing a gaze estimation model that can identify where a person is looking in various environments, such as poorly lit rooms, different camera hardware, extreme head angles, and different facial features. To address this, we developed a gaze model that utilizes unsupervised domain adaptation techniques, providing robust gaze estimates across a wide range of environmental conditions. Additionally, we focused on achieving parameter efficiency through existing model architectures. We validated this through a series of experiments on publicly available gaze estimation datasets, with our approach and findings published in IEEE IRI 2024 (Multi-Eyes: A Framework for Multi-User Eye-Tracking using Webcameras), and IEEE IRI 2025 (Unsupervised Domain Adaptation for Appearance-based Gaze Estimation).
Beyond gaze estimates, we had to solve the problem of mapping each user’s gaze direction onto a shared display, a commonly discussed scenario in multi-user human-computer interaction. The mapping process required transforming gaze information from the user coordinate frame into the display coordinate frame. We designed a simple yet effective learnable mapping function, eliminating the need for complex setup procedures. Our approach achieved on-screen gaze locations with horizontal and vertical gaze errors of 319 mm and 219 mm, respectively, using 9-point, 9-sample calibration. For large shared displays, this error is sufficient and stable for gaze classification or coarse-grained gaze estimation tasks.
By combining these approaches, we developed a prototype application that can run at ~17 gaze samples per second on commodity hardware, without utilizing GPU acceleration or a specialized installation. We replicated an existing study in the literature using a setup that traditionally requires expensive hardware, demonstrating that Multi-Eyes could serve as a viable low-cost alternative.
Throughout the Multi-Eyes project, we contributed to advancements in the field of eye tracking through conference presentations and publications. Notably, our review paper on eye tracking and pupillary measures helped us set the requirements for Multi-Eyes, which later received the Computer Science Editor’s Pick award. We first proposed the Multi-Eyes architecture at ETRA 2022 and then refined the approach, showcasing its feasibility at IEEE IRI 2024. Along with the papers, we also published our research on gaze estimation approaches, capsule-based gaze estimation at Augmented Human 2020, parameter-efficient gaze estimation at IEEE IRI 2024, and parameter-efficient gaze estimation with domain adaptation in IEEE IRI 2025.
Beyond the main framework, Multi-Eyes sparked several spin-off projects. Our work, utilizing a dedicated eye tracker-based approach, resulted in published research in ACM CHIIR 2023, IEEE IRI 2023, and IJMDEM 2024. In addition, through my work with eye trackers, I contributed to several publications on visual search patterns, published in JCDL 2021, ETRA 2022, and ETRA 2025, as well as drone navigation, published in Augmented Humans 2023.
Looking back, I’m grateful that my work has had a positive impact on the broader community by advancing research in eye tracking and making the technology more accessible. After a journey of over five years, I’m starting a new chapter as a lecturer at the Department of Computer Science at ODU. While teaching is my primary role, I plan to continue my research, exploring new directions in eye tracking and human-computer interaction.
While I have documented most of my research findings, I am adding a few tips for myself, in case I ever happen to do it again or travel through time, which someone else might find helpful.
Collaboration is key: Collaborators can bring together the missing pieces of the puzzle, offering fresh perspectives that may lead to new ideas. Additionally, they can serve as your free reviewer before rejection ;).
Embrace rejections: Every 'no' is a part of the process, as all of my work comes from ideas that were initially rejected but later accepted after refinement.
Prototype early, fail fast: Building something tangible, even if it’s not perfect, helps you identify problems sooner and will aid in your next step.
Document everything: A half-forgotten experiment is as good as lost. Notes and version control have saved me many times, especially when refining after a rejection. You will thank yourself for explaining why you have used that weird design or that random number.
I am immensely grateful to my dissertation committee members and mentors: Dr. Sampath Jayarathna, Dr. Michael Nelson, Dr. Michele Weigle, Dr. Vikas Ashok, and Dr. Yusuke Yamani for their invaluable feedback, which greatly contributed to my success. I also owe my heartfelt thanks to my family, friends, and research collaborators, whose encouragement kept me going through the highs and lows of this journey.
Card Division of the Library of Congress, ca. 1900–1920. Source: Wikimedia Commons.
In February, the Library Innovation Lab announced its archive of the federal data clearinghouse Data.gov. Today, we’re pleased to share Data.gov Archive Search, an interface for exploring this important collection of government datasets. Our work builds on recent advancements in lightweight, browser-based querying to enable discovery of more than 311,000 datasets comprising some 17.9 terabytes of data on topics ranging from automotive recalls to chronic disease indicators.
Traditionally, supporting search across massive collections has required investment in dedicated computing infrastructure, such as a server running a database or search index. In recent years, innovative tools and methods for client-side querying have opened a new path. With these technologies, users can execute fast queries over large volumes of static data using only a web browser.
This interface joins a host of recent efforts not only to preserve government data, but also to make it accessible in independent interfaces. The recently released Data Rescue Project Portal offers metadata-level search of the more than 1,000 datasets it has preserved. Most of these datasets live in DataLumos, the archive for valuable government data resources maintained by the University of Michigan’s Institute for Social Research.
LIL has chosen Source Cooperative as the ideal repository for its Data.gov archive for a number of reasons. Built on cloud object storage, the repository supports direct publication of massive datasets, making it easy to share the data in its entirety or as discrete objects. Additionally, LIL has used the Library of Congress standard for the transfer of digital files. The “BagIt” principles of archiving ensure that each object is digitally signed and retains detailed metadata for authenticity and provenance. Our hope is that these additional steps will make it easier for researchers and the public to cite and access the information they need over time.
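For readers unfamiliar with BagIt, here is a minimal sketch of the layout the specification (RFC 8493) describes: a payload directory plus tag files carrying fixity information. The file and directory names below are illustrative, not taken from the Data.gov archive itself:

#!/bin/bash
# Create a minimal "bag": a data/ payload directory, a bag declaration,
# and a checksum manifest that lets anyone verify the payload later.
mkdir -p mybag/data
cp dataset.csv mybag/data/

printf 'BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n' > mybag/bagit.txt

# One "<checksum> <path>" line per payload file (sha256sum also works).
(cd mybag && shasum -a 256 data/* > manifest-sha256.txt)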
In the coming month, we will continue our work, fine-tuning the interface and incorporating feedback. We also continue to explore various modes of access to large government datasets, including, for example, how we might create greater access to the 710 TB of Smithsonian collections data we recently copied. Please be in touch with questions or feedback.
Libraries face persistent challenges in managing metadata, including backlogs of uncataloged resources, inconsistent legacy metadata, and difficulties in processing resources in languages and scripts for which there is not staff expertise. These issues limit discovery and strain staff capacity. At the same time, advances in artificial intelligence (AI) provide opportunities for streamlining workflows and amplifying human expertise—but how can AI assist cataloging staff in working more effectively?
To address these questions, the OCLC Research Library Partnership (RLP) formed the Managing AI in Metadata Workflows Working Group earlier this year. This group brought together metadata managers from around the globe to examine the opportunities and risks of integrating AI into their workflows. Their goal: to engage collective curiosity, identify key challenges, and empower libraries to make informed choices about how and when it is appropriate to adopt AI tools to enhance discovery, improve efficiency, and maintain the integrity of metadata practices.
This blog post—the first in a four-part series—focuses on one of the group’s critical workstreams: primary cataloging workflows. We share insights, recommendations, and open questions from the working group on how AI may address primary cataloging challenges, such as backlogs and metadata quality, all while keeping human expertise at the core of cataloging.
The “Primary Cataloging Workflows” group was the largest of our three workstreams, comprising seven participants from Australia, Canada, the United States, and the United Kingdom. Participants represented institutions in primarily English-speaking countries in which libraries may lack needed capacity to provide metadata for resources written in non-Latin scripts like Chinese and Arabic.
Jenn Colt, Cornell University
Chingmy Lam, University of Sydney
Elly Cope, University of Leeds
Yasha Razizadeh, New York University
Susan Dahl, University of Calgary
Cathy Weng, Princeton University
Michela Goodwin, National Library of Australia
Motivations: shared (and persistent) needs
Working group members are turning to AI to help solve a set of familiar cataloging challenges that result from a combination of resource constraints and limited access to specific skills. These challenges include:
Increasing cataloging efficiency
Improving legacy metadata
Obtaining assistance with resources in certain scripts where expertise is limited
Members of the working group assessed both the capabilities and limitations of AI tools in addressing these challenges by examining specific tools and workflows that could support this work.
Increasing cataloging efficiency
Backlogs of uncataloged resources prevent users from discovering valuable resources. Even experienced, dedicated staff are unable to keep up with the amount of resources awaiting description. AI offers the potential to address this problem by streamlining and accelerating the cataloging workflow for these materials. The working group identified key backlog use cases, including legal deposits, gifts, self-published resources, and resources lacking ISBNs.
Copy cataloging is critical to addressing backlog issues, but the key challenge here is to identify the “best record.” Working group participants discussed how AI could streamline these workflows by automating record selection based on criteria such as the number of holdings or metadata completeness.
When original cataloging is required, AI-generated brief records for these materials can enable them to appear in discovery systems earlier, accelerating the process of making hidden collections discoverable and supporting local inventory control. This approach addresses the immediate need for discovery while allowing records to be completed, enriched, or refined over time.
Improving legacy metadata
Legacy metadata may contain errors, inconsistencies, or outdated terminology, which hinders discovery and fails to connect users with relevant resources. AI could assist with metadata cleanup and enrichment, reducing manual effort while maintaining high standards. This was an area where working group members had not experimented directly with AI tools, but could imagine a number of use cases, including:
Identifying and replacing outdated terms in existing metadata
Using AI tools to flag duplicates, diacritic errors, or anomalies to streamline cleanup efforts and improve data quality
Suggesting additional metadata fields or descriptions to enhance discovery
Supplying matching headings from local authority files to existing authorized headings or validated entities
Improving metadata quality, including reducing the number of duplicate records, has also been an area where OCLC has devoted considerable effort, including the development and use of human-informed machine learning processes, as illustrated in this recent blog post on “Scaling de-duplication in WorldCat: Balancing AI innovation with cataloging care.”
Providing support for scripts
Language and script expertise is a long-standing cataloging issue. In English-speaking countries, this manifests as difficulty describing resources written in languages using non-Latin scripts and those that are not often taught in local schools. AI tools could assist with transliteration, transcription, and language identification, enabling the more efficient processing of these materials. Some tools lack the basic functionality or support for specific, required languages. Even when AI tools confidently provide transliteration, human expertise is still very much required to evaluate AI-generated work. A library looking to AI to fill an expertise gap for these languages faces a double challenge of not fully trusting AI tools and also lacking access to internal language skills to effectively evaluate and correct its work.
Working group members brainstormed ways to address the needs in this situation. Research libraries collect resources in dozens or even hundreds of languages to support established academic programs. Although the library may lack direct access to language proficiency, this expertise may be abundant across campus, among students, faculty, and researchers who are experts in the languages in which hard-to-catalog resources are acquired. These campus community members could help address a specific skill gap and safeguard the accuracy of AI-assisted workflows, while fostering community involvement and ensuring that humans are in the loop. In implementing such a program, libraries would need to create an engagement framework that includes rewards and incentives—such as compensation, course credit, or public acknowledgment—to encourage participation.
Open questions around the use of AI
Unsurprisingly, as with any new technology, opportunities come paired with questions and concerns. Metadata managers shared that some of their staff expressed uncertainty about adopting AI workflows, feeling they need more training and confidence-building support. Others wondered whether shifting from creating metadata to reviewing AI-generated records might make their work less engaging or meaningful.
Metadata managers themselves raised a particularly important question: If AI handles foundational tasks like creating brief records—work that traditionally serves as essential training for new catalogers—how do we ensure new professionals still develop the core skills they’ll need to effectively evaluate AI outputs?
These are important considerations as we explore the implementation of AI tools as amplifiers of human expertise, rather than replacements for it. The goal is to create primary cataloging workflows where AI manages routine tasks at scale, freeing qualified staff for higher-level work while preserving the meaningful aspects of metadata creation that make this field rewarding.
Conclusion
While not a panacea, AI offers significant potential to address primary cataloging challenges, including backlogs, support for scripts, and metadata cleanup. By adopting a pragmatic approach and emphasizing the continued relevance of human expertise, libraries can leverage AI with care to address current capacity issues that will make materials available more easily and improve discovery for users.
NB: As you might expect, AI technologies were used extensively throughout this project. We used a variety of tools—including Copilot, ChatGPT, and Claude—to summarize notes, recordings, and transcripts. These were useful for synthesizing insights for each of the three subgroups and for quickly identifying the types of overarching themes described in this blog post.
LibraryThing is pleased to sit down this month with British mystery novelist S.J. Bennett, whose Her Majesty the Queen Investigates series, casting Queen Elizabeth II as a secret detective, has sold more than half a million copies worldwide, across more than twenty countries. Educated at London University and Cambridge University, where she earned a PhD in Italian Literature, she has worked as a lobbyist and management consultant, as well as a creative writing instructor. As Sophia Bennett she made her authorial debut with the young adult novel Threads, which won the Times Chicken House Children’s Fiction Competition in 2009, going on to publish a number of other young adult and romance novels under that name. In 2017 her Love Song was named Romantic Novel of the Year by the RNA (Romantic Novelists’ Association). She made her debut as S.J. Bennett in 2020 with The Windsor Knot, the first of five books in the Her Majesty the Queen Investigates series. The fifth and latest title, The Queen Who Came In From the Cold, is due out next month from Crooked Lane Books. Bennett sat down with Abigail this month to discuss the book.
The Queen Who Came In From the Cold is the latest entry in your series depicting Queen Elizabeth II’s secret life as a detective. How did the idea for the series first come to you? What is it about the Queen that made you think of her as a likely sleuth?
The Queen was alive and well when I first had the idea to incorporate her into fiction. She was someone who fascinated people around the world, and she was getting a lot of attention because of The Crown.
I was looking for inspiration for a new series, and I suddenly thought that she would fit well into the mold of a classic Golden Age detective, because she lived in a very specific, self-contained world and she had a strong sense of public service, which I wanted to explore. Her family didn’t always live up to it, but she tried! What’s great for a novelist is that everyone thinks they know her, but she didn’t give interviews, so it leaves a lot of room to imagine what she was really thinking and doing behind the scenes.
I often get asked if I was worried about including her as a real figure, and I was a bit, to start with. But then I realized that she has inspired a long line of novelists and playwrights – from Alan Bennett’s The Uncommon Reader and A Question of Attribution, to Peter Morgan’s The Queen, The Crown and The Audience, and Sue Townsend’s The Queen and I. I think they were also attracted by that combination of familiarity and mystery, along with the extraordinary life she led, in which she encountered most of the great figures of the twentieth century.
My own books are about how a very human public figure, with heavy expectations on her, juggles her job, her beliefs, her interests and her natural quest for justice. The twist is, she can’t be seen to do it, so she has to get someone else to take the credit for her Miss Marple-like genius.
Unlike many other detectives, yours is based on a real-life person. Does this influence how you tell your stories? Do you feel a responsibility to get things right, given the importance of your real-world inspiration, and what does that mean, in this context?
I do feel that responsibility. I chose Elizabeth partly because I admired her steady, reliable leadership, in a world where our political leaders often take us by surprise, and not always in a good way. So, I wanted to do justice to that.
The Queen’s circumstances are so interesting, combining the constraints of a constitutional monarch who can’t ever step out of line with the glamour of living in a series of castles and palaces. Weaving those contrasts into the book keeps me pretty busy, in a fun way. Plus, of course, there’s a murder, and only her experience and intelligence can solve it.
I made the decision at the start that I wouldn’t make any of the royals say or do anything we couldn’t imagine them saying or doing in real life. Anyone who has to behave oddly or outrageously to fit my plots is an invented character. But it helps that the royal family contained some big characters who leap off the page anyway. Prince Philip, Princess Margaret and the Queen Mother have lots of scenes that make me giggle, but that I hope are still true to how they really were. I would honestly find it much harder to write about the current generations, because their lives are more normal in many ways, and also, because we already know about their inner lives, because they tell us. The Queen and Prince Philip were the last of the ‘mythical’ royals, I think.
I love referencing other writers, and someone on the train in this novel is reading Thunderball, by Ian Fleming, which came out in 1961 and deals with one of the themes that’s present in my book too, namely the threat of nuclear war. At that point, The Queen Who Came In From the Cold is very much still in the Agatha Christie mold, where a murder is supposedly seen from the train, but Fleming’s book hints at the more modern spy story that the book will become in the second half.
As well as Fleming and John le Carré, whose debut novel came out that year, I read a lot of Len Deighton when I was growing up, so I hope some of his sense of adventure is in there too. But another big influence was film. I love the comedy and graphic design of The Pink Panther, and the London-centered louche photography of Blow-Up. I asked if the jacket designer (a brilliant Spanish illustrator called Iker Ayesteran) could bring some of that Sixties magic to the cover, and I like to think he has done … even if the lady in the tiara isn’t an exact replica of the Queen.
Unlike the earlier books in your series, which were contemporaneous, your latest is set during the Cold War. Did you have to do a great deal of additional research to write the story? What are some of the most interesting things you learned?
I hadn’t realized there were quite so many Russian spy rings on the go in and around London at the time! One of my characters is based on a real-life Russian agent called Konon Molody, who embedded himself in British culture as an entrepreneur (set up by the KGB) selling jukeboxes and vending machines. According to his own account, he became a millionaire out of it before he was caught. His world was a classic one of microdots and dead-letter drops.
As a teenager, I lived in Berlin in the 1980s, when the Berlin Wall literally ran around the edge of our back garden. We were at the heart of the Cold War, but by then it was obvious the West was winning, so I didn’t personally feel under threat – although people were still dying trying to escape from East Germany to the West. I hadn’t fully realized how much more unsafe people must have felt a generation earlier. I don’t think the western world has felt so unstable since those days … until now, perhaps.
It fascinates me that Peter Sellers, who was so entertaining as Inspector Clouseau in the Pink Panther films, was also the star in Dr Strangelove, which was based on an early thriller about the threat of nuclear annihilation called Red Alert, by Peter George. That dichotomy between fear and fun seemed to characterize the early 1960s, and is exactly what I’m trying to capture in the book.
On a different note, it was a surprise to see how well Russia was doing in the Space Race. At that time, the Soviet Union was always a step ahead. Yuri Gagarin was the first person to go into orbit, and the Queen and Prince Philip were as awestruck as anyone else. When Gagarin visited the UK in the summer of 1961, they invited him to lunch at the palace and afterwards, it was Elizabeth who asked for a picture with him, not the other way around.
The Soviet success was largely down to the brilliance of the man they called the Chief Designer. His real name was Sergei Korolev, but the West didn’t find this out for years, because the Soviets kept his identity a closely-guarded secret. He was an extraordinary figure – imprisoned in the gulags by Stalin, and then brought out to run their most important space program. I’d call that pretty forgiving! Their space program never recovered after he died. I’m a big fan of his ingenuity, and he has a place in the book.
Tell us a little bit about your writing process. Do you have a particular writing spot and routine? Do you know the solution to your mysteries from the beginning? Do you outline your story, or does it come to you as you go along?
I went to an event recently, where Richard Osman and Mick Herron – both British writers whose work I enjoy – talked about how they are ‘pantsers’, who are driven purely by the relationships between the characters they create. I tried that early in my writing life and found I usually ran out of steam after about five thousand words, so now I plot in a reasonable amount of detail before I start.
I always know who did it and how, and I’ve given myself the challenge of fitting the murder mystery alongside everything the Queen was really doing at the time, so I need a spreadsheet to keep track of it all. Nevertheless, red herrings will occur to me during the writing process, and I adapt the plot to fit. I find if I know too much detail, then the act of writing each chapter loses its fun. I need to leave room for discoveries along the way.
If in doubt, I get Prince Philip on the scene to be furious or reassuring about something. He’s always a joy to write. So is the Queen Mother, as I mentioned. It’s the naughty characters who always give the books their bounce.
Her Majesty the Queen Investigates was published as part of a five-book deal. Will there be more books? Do you have any other projects in the offing?
I was very lucky to get that first deal from Bonnier in the UK. My editor had never done a five-book deal before, and I’m not sure he’s done one since! I always knew I wanted the series to be longer, though. I’ve just persuaded him to let me write two more, so book six, set in the Caribbean in 1966, will be out next year, and another one, set in Balmoral back in 2017, will hopefully be out the year after. I miss Captain Rozie Oshodi, the Queen’s sidekick in the first three books, and so do lots of readers, so it’ll be great to be in her company again for one last outing.
Tell us about your library. What’s on your own shelves?
My bookshelves are scattered around the house and my writing shed, wherever they’ll fit. I studied French and Italian at university, so there are a lot of twentieth century books from both countries. I love the fact that French spines read bottom up, whereas English ones read top down. I bought really cool blue and white editions of my favourite authors from Editions de Minuit in the 1990s and it’s lovely to have them on my shelves.
I’ve always loved classical literature, so there are plenty of Everyman editions of Jane Austen, George Eliot and Henry James, but equally, the books that got me through stressful times like exams were Jilly Cooper and Jackie Collins, so they have their place. These are the books that inspired the kind of literature I wanted to write: escapist, absorbing and fun. They’re near the travel guides, for all the real-life escaping I love to do.
I have two bookcases dedicated to crime fiction, packed with Christie, Dorothy L. Sayers, Ngaio Marsh, P.D. James, Rex Stout (Nero Wolfe was a big inspiration for the way I write the Queen and her sidekicks), Donna Leon and Chris Brookmyre. I inherited my love of the mystery genre from my mother, who has a library full of books I’ve also loved, by other authors such as Robert B. Parker and Sue Grafton, as well as her own shelf of Le Carrés. She decided to start clearing them out recently, but I begged her not to: I still love seeing them there.
What have you been reading lately, and what would you recommend to other readers?
Thanks to my book club, I’ve been re-reading Jane Austen, and am reminded of what a fabulous stylist she was. But in terms of new writers, I’ve recently enjoyed The Art of a Lie by Laura Shepherd-Robinson, set in Georgian London, and A Case of Mice and Murder by Sally Smith, set in the heart of legal London at the turn of the twentieth century. Both Laura and Sally write vivid characters with aplomb, and create satisfying, twisty plots that are a joy to follow. I definitely recommend them both.
This essay examines the extractive practices employed in biomedical research to reconsider how librarians, archivists, and knowledge professionals engage with the unethical materials found in their collections. We anchor this work in refusal—a practice upheld by Indigenous researchers that denies or limits scholarly access to personal, communal or sacred knowledges. We refuse to see human remains in the biomedical archive as research objects. Presenting refusal as an ethical and methodological intervention that responds to the often stolen biomatter and biometrics in medical collections, this essay creates frameworks for scholars working with archival or historical materials that were obtained through violent, deceitful, or otherwise unethical means.
There are many photographs of doctors at the turn of the twentieth century posing with dead human subjects. Medicine’s visual culture in this period is marked by a nonchalance toward the deceased subjects who constituted their research materials. Medical students posed with their anatomical cadavers (Warner 2014) (fig. 1), and doctors were framed in candid shots in ways that displayed their wet specimens (fig. 2). John Harley Warner, writing of anatomical students’ group photographs, noted how, for American doctors who often acquired their cadavers from Black graveyards, these photographs mimicked the composition of the lynching photograph:
The practices represented in photographs of this other “strange fruit” involved not just dismemberment of dead bodies but also constant threat to certain black communities of postmortem violation, actual trauma inflicted on those still living. (16)
Because these human remains were obtained prior to the codification of informed consent (Lederer 1995), and because medical science historically depended on theft as a means to forward epistemic, cultural, and monetary value (Richardson 1987; Sappol 2002; Redman 2016; Alberti 2011), there remains an open wound caused by the use of stolen human material in the creation of biomedical argument.
Figure 1. In the early twentieth century, medical students often posed with anatomical cadavers. These images often included written elements related to the gaining of knowledge through sacrifice (Warner, 2014). The cadaver in the foreground has been made opaque, as consent was likely not obtained from the individual pre-mortem. Photograph courtesy of the Medical Historical Library, Harvey Cushing/John Hay Whitney Medical Library, Yale University.
Figure 2. A candid portrait of a doctor working in the neurology lab of the Henry Phipps Institute. In the foreground are dozens of jars filled with human brains extracted at autopsy as part of the Phipps Institute’s research into tuberculosis. The jars have been made opaque, as they contain stolen human tissues. Report of the Henry Phipps Institute for the Study, Treatment and Prevention of Tuberculosis. Philadelphia: Henry Phipps Institute, 1905.
This essay describes ethical and methodological interventions developed in response to the extractivist program employed by medical scientists at the turn of the twentieth century. Our intervention, the Opaque Publisher (OP), introduces a theoretical framework that lets professionals whose work engages with stolen material choose which sections of their collections need to be redacted. The framework also provides readers a way to engage with these ethical decisions through a toggling interface (fig. 3). This essay is the first of two essays written for Lead Pipe on the ways digital methods afford different approaches to ethical problems. In our second essay we will go into more detail on the design-based methodology that led to the development of the OP, as well as DigitalArc, the community archiving platform from which the OP was originally built.
Figure 3. A gif showing how users interact with images and text made opaque for The Tuberculosis Specimen. Link to the example page.
We ground our argument in a case study: a dissertation that examines biomedical extractivism in tuberculosis research at the turn of the twentieth century (Purcell 2025). Tuberculosis has been at the center of exclusionary and anti-immigrant policies employed by nations, states, and cities. These policies tend to target brown and Black populations, creating an apparatus to more easily deny immigration from those communities (Abel 2007). The disease has also been used to leverage eugenicist discourses in the United States (Feldberg 1995), and to manufacture middle- and upper-class aesthetics of health and wellness (Bryder 1988). The dissertation makes a strong case study for our digital-methods intervention because it examines how biomedical and public health professionals studied the disease, and how these research programs fit into America’s expanding biomedical and public health infrastructures.
The process of medical research, especially the research employed by medical scientists at the turn of the twentieth century, sees research subjects as valuable epistemic resources. We use the term ‘epistemic’ to refer to the philosophical tradition of epistemology—or the study of how knowledge is created—with a particular stress on implicit historical, cultural, and ideological assumptions that Michel Foucault frames in the discursive épistémè (Foucault 1994). Building on histories of anatomy that describe the commodification and exploitation of postmortem subjects, we argue that biomedical science depends on the theft of human material. These extractive methods were built out of historical practices that disregarded the autonomy of non-white communities, seeing their lives, cultures, and histories as a resource to be mined (Redman 2016; Washington 2006; Sappol 2002).
Megan Rosenbloom, in her excellent book on anthropodermic bibliopegy—or books bound in human skin—describes the problem we address in an anecdote about a book challenge brought against Eduard Pernkopf’s Topographische Anatomie des Menschen. The book was written by a Nazi scientist with illustrations that may have been drawn using the bodies of subjects killed by the Nazi regime. Describing USC’s Norris Medical Library’s decision to keep the book, while adding additional information about the history of the text, Rosenbloom writes, “if books have to be removed from a medical library because the bodies depicted in them were obtained through unethical and nonconsensual means, there might not be an anatomical text left on the shelf” (170). Central to Rosenbloom’s logic is a presumption that knowledge–medical, historical, cultural knowledge–supersedes the needs of abused historical subjects, their communities, and their descendants.
Rosenbloom’s careful attention to historical violences in the history of medicine points to a broader problem facing knowledge workers in medical libraries, archives, and museums. Knowledge workers are obligated to maintain and preserve these materials because of their epistemic and cultural value, in spite of their awful, nonconsensual origins. We wanted to create an ethical and methodological framework that enabled the divestment of stolen human biomatter and biometrics from institutions whose collecting histories harmed Black, brown, and Indigenous communities (Monteiro 2023). Knowing that the majority of these research materials—subjects depicted in medical atlases, described in research reports, and whose remains have been collected and maintained in medical museums—were extracted from people who never consented to that research, we present a model that calls attention to that theft. We ask, is it possible to do research in the history of medicine that respects our interlocutors’ autonomy?
Our answer to this question is a methodological one: we argue for librarians, archivists, and knowledge workers to refuse the object. While biomedical researchers saw the materials that populated their journals, textbooks, and archives as objects, we advocate for an approach that reestablishes the human base upon which these disciplines are built. Refusing the object is a countermethod to the reductive, dehistoricizing, and decontextualizing processes that harm humans caught in biomedicine’s dragnet.
This approach builds on frameworks around refusal. Refusal is a practice described and employed by Indigenous researchers and academics working with Indigenous communities that denies academic access to personal, communal, and sacred knowledges (Simpson 2007; Tuck & Yang 2014; Liboiron 2021). In its most broad definition, refusal is a generative, socially embedded practice of saying ‘no’, akin to, but distinct from, resistance. It is a critique that is levied in different ways by different actors, circumscribed by their social and political context (McGranahan 2016). In its original contexts, refusal refers to the gestures made by research subjects to disrupt and disallow research (Liboiron 2021, 143). We argue that knowledge workers have ethical obligations to their interlocutors that require unique, case-by-case interventions (Caswell & Cifor 2016), and that sometimes these obligations force us, as Audra Simpson argues, to work through a calculus of “what you need to know and what I refuse to write in” (2007, 72). We argue for frameworks that enable knowledge workers to refuse materials that depend on the objectification of, and through that objectification the commodification of, human subjects.
Building from arts-based approaches to opacity (Blas 2014; Purcell 2022), we developed protocols—structured methods applied uniformly across our primary materials—for refusing the objectifying practices employed in the creation of our primary sources. These protocols highlighted the ways opacity would be scaffolded in a final published work, imagining how norms of anonymity and consent might be applied post hoc. For text, we redacted words where our primary sources revealed too much about their subjects (fig. 3). For images, we erased parts of people’s bodies depending on who was in the frame (fig. 4). What drove our design was a desire to scaffold the effects of refusal in ways that were obvious and intentional. We wanted to show the effects of refusal, rather than hypothesize about what might be lost in the process.
Figure 4. For images, a labor-intensive step-by-step omission was practiced in order to protect patients. Reading from left to right, the image becomes blacked out based on the level of opacity applied when accessing the site. For the dissertation project, partial opacity was defined as matching contemporary needs for anonymity in research; full opacity scrubbed patients’ bodies from the images, but tried to maintain any material produced by researchers. A more detailed description of the opacity process is available on the dissertation’s website. Crofton, W. M. Pulmonary Tuberculosis: Its Diagnosis, Prevention and Treatment. Philadelphia: P. Blakiston’s Son & Co., 1917.
For this essay, we will begin with a discussion of objectifying practices in biomedical epistemics, before talking through refusal-as-method. We will finish with a discussion of ethics audits, which can be applied late in a research project using the concepts we have outlined in this article.
Pathology’s Objects
One of the messier epistemic contradictions that enable the collection of biomatter, biometrics, images, and histories from patients is that the process of collection transforms the patient or subject into an object. Object, as we use the term, refers to a representation of phenomena used in scientific research that has been divorced from its historical, cultural origin (Daston & Galison 2007, 17). Biomedical research depends on multiple objectifying practices, the most famous of which is known as the clinical gaze. As described by Michel Foucault in The Birth of the Clinic, this visual practice refers to the ways doctors are trained to see the difference between a patient’s body and an assumed ‘normal’ human anatomy as disease. The first issue with this visual practice is that it imagines a single supposedly perfect human anatomy (the body of a cis, heterosexual, white, nondisabled man), and that this model treats anyone whose body differs from this constructed normal (in sexuality, gender, race, or ability) as diseased.
The second issue with the practice comes from the clinical method. This method ties case histories to postmortem examination: patients would visit a clinic, doctors would track their symptoms, collect relevant information—their family histories, the progression of the disease—and then, if the patient died under their care, doctors would try to link the patient’s symptoms to phenomena found at autopsy. A good example of this practice can be seen in the work of René Laennec, a French doctor who practiced clinical research in the post-revolutionary period (Foucault 1994, 135-36). Laennec observed tubercles—hard, millet-sized growths—in the lungs of the consumptive patients he autopsied, and he connected the symptoms experienced by these patients to these pathologies (fig. 5).
Figure 5. This illustration comes from René Laennec’s research into diseases of the chest. Underneath the opaque redaction applied by the research team are images that show the formation of tubercles in the autopsied lungs of patients who did not consent. Treatise on the Diseases of the Chest in Which they are Described According to their Anatomical characters and their Diagnosis Established on a New Principle by Means of Acoustick Instruments, with plates. Translated by: Forbes, John. Philadelphia: James Webster, 1823. Image courtesy of the New York Academy of Medicine.
The clinical gaze has long been described as one that alienates patients because doctors are only trained to see them as nests of symptoms. What is important to remember is that the clinical method, as it is described by Foucault, is similarly alienating, insomuch as it sees patient symptoms as data to be gathered, analyzed, and extrapolated for medical progress. Even case histories, filled as they are with intimate details of an individual’s life, are described in such a way as to flatten that life into possible causes that may be examined in the abstract for future biomedical argument.
What Foucault neglects to mention, but which anatomical historians have made clear, is that developing in the same period was a commodification of human remains in medical contexts. Ruth Richardson has linked the popularization of the Parisian anatomical method, which required medical students to anatomize a cadaver in their training, to the rise of graverobbing in England and Scotland in the late eighteenth and early nineteenth centuries. In the same period, medical schools were deemed more or less prestigious based on the scale and quality of their medical museums and specimen collections (Alberti 2011). The production of a valuable, commodifiable object went hand-in-hand with the epistemic framework that dehumanized patients in diagnosis. The creation of a pathological specimen—a representational object that purports to show some aspect of a disease’s progression—splits the disease from the human subject whose life, death, and afterlife was necessary in the collection of that phenomenon. In denying this connection, biomedical argument enables a specimen to stand in as an objective representation for observation and study.
This objectification extends beyond medical contexts. The problem that arises is that to engage with these historical materials as academics, even as practitioners of subjective, qualitative research, we have to approach them as research objects—as representational materials that describe the phenomena we critique. To refuse the object in the history of medicine is to refuse to decouple the biomedical object from the subject from whose body this specimen was taken. It is a refusal and denial of the material’s ultimate epistemic value, both for the sciences and for humanistic, historical, or qualitative research.
Opaque Protocols
Our methods to refuse the objectifying practices in medicine began with a speculative approach to the history of medicine. We use “speculative” to refer to the methodological interventions into archival research argued for by Saidiya Hartman. These methods ask historians to read against the grain of the archive, and to see archival omissions as part and parcel of broader carceral, colonial histories (2008). Krista Thompson has built on this scholarship to advocate for “speculative art history,” which practices historical fabulation—the manipulation of archival materials—to imagine histories that otherwise would never be seen (Thompson 2017; Lafont et al. 2017).
The speculative historical method enables us to intervene on primary materials in critical, reparative ways. It allows us to shift our understanding of the primary document from a concrete, essential thing to something that comes from structural practices that denied the humanity of certain subjects. By applying opacity—these methods of conspicuous, obdurate erasure—to primary sources, we reassert the centrality of the patient in our argument (fig. 4). This term, opacity, derives from Édouard Glissant’s critique of western academic essentialism. To be opaque is to refuse access to a phenomenon’s root and the totalitarian possibility afforded by control of that essential character (1997, 11-22; 189-94).
We extend this practice beyond the platform—using this method in the images we have supplied for this article—as a way to continue the same critique: Were these images necessary for our argument? Are our claims lessened if they are intentionally marked or changed?
Refusing the Object
Our approach to opacity came about from a nagging discomfort we experienced when engaging with materials in the history of medicine. So much of medicine’s violence has been practiced in the open (Washington 2006, 12), and its harms are felt as a “bruise” by the communities whose bodies were subjected to research and ignored by the institutions that benefited from those practices (Richardson 1987, xvi). Taking primary evidence at face value, accepting the harms, and deeming them necessary for revelatory research felt hypocritical, especially because academic research so often only benefits those doing the research and not their subjects (Hale 2006).
The opaque protocols we used to redact images and text (figs. 3, 4, 6 show multiple layers of opacity) were also moments of refusal—of denying the reader access to stolen, coerced, and unethically extracted materials produced in biomedical research. Where refusal is a mode adopted by research interlocutors (Liboiron 2021, 143), it is also a tool for knowledge workers working in obligation to the people and communities who inform their research (Simpson 2007). For Max Liboiron, in the context of community peer review, refusal “refers to ethical and methodological considerations about how and whether findings should be shared with and within academia at all” (2021, 142). Premised on this idea is the realization that not all knowledge needs to be known within academic systems. As Liboiron writes, “[g]iving up the entitlement and perceived right to data is a central—the central!—ethic of anticolonial sciences” (Ibid., 142, footnote 96).
Refusal, for us, is predicated on an understanding that our current knowledge infrastructure depends on extraction enacted through theft and hidden in plain sight. Roopika Risam, in her keynote for DH2025, notes that the digital humanities’ long quest to make collections accessible has its own ideological basis. She writes,
Because access without accountability risks becoming a kind of digital settler colonialism: where archives are opened but not contextualized, where stories are extracted from communities but not returned to them, where knowledge circulates but the people who shaped it are left behind. It is access that takes, not access that gives back. (2025)
There is a broader need to acknowledge that the materials we maintain, use, and reproduce are so defined by their extraction—thefts of people’s biomatter, their history, and their secrets. Refusal is to say ‘no’ to this extraction, and to critique why we reveal materials in such ways.
Knowing that biomedical materials are linked to human subjects with cultures and histories, we need to acknowledge that in order to respect a community or patient’s consent we may have to lose those materials. We refuse the processes that turn people into objects. We refuse to place the value these materials offer our institutions and disciplines above the people whose bodies were made into valuable epistemic resources.
Ethics Audits
The application of opacities—obvious redactions of text and images—to the dissertation, The Tuberculosis Specimen, occurred at the end of the research and writing process. It was only after each chapter had been approved by the dissertation chair that images and text would be made opaque. Every image had to be reviewed for content, and if an image included sensitive material—human subjects undergoing treatment, children who could not have consented to having their image taken, or human remains—it would need to be edited multiple times for the final published website (figs. 4, 6, 7 illustrate this editing process). Primary quotations were also reviewed for sensitive materials. For the final publication, the text that was deemed unethical, and which needed opacity applied, was changed in the final markdown (.md) file uploaded to the site. Span classes, or hypertext markup language (HTML) wrappers that flag certain stylistic or functional changes on the final site, were added to the text to enable the redaction of that text.
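To make the mechanics concrete, here is a minimal sketch, in Python, of the kind of preprocessing this describes: wrapping flagged passages of a Markdown file in HTML spans so the published site can toggle their visibility. It is not the Opaque Publisher’s actual code, and the class name and file names are hypothetical.

# A sketch (not the actual Opaque Publisher code) of wrapping flagged phrases
# in a Markdown file with an HTML span so the site can toggle their visibility.
# The "opaque" class name and the file names are hypothetical.
from pathlib import Path

def apply_opacity(md_path: str, phrases: list[str], css_class: str = "opaque") -> str:
    """Return the Markdown text with each flagged phrase wrapped in a span."""
    text = Path(md_path).read_text(encoding="utf-8")
    for phrase in phrases:
        text = text.replace(phrase, f'<span class="{css_class}">{phrase}</span>')
    return text

# Example (hypothetical file and phrases):
# redacted = apply_opacity("chapter1.md", ["patient's full name", "diagnostic details"])
# Path("chapter1_opaque.md").write_text(redacted, encoding="utf-8")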
The original goal was never to actually erase the materials. Many of the images that were used in that project, and which we shared in this essay, were obtained through HathiTrust—a digital library made up of many university collections which are, and will remain, available for academic research. The speculative turn was a means of thinking through the argumentative need for such materials. It helped us reconsider our roles as scholars and stewards, digital humanists and critical practitioners.
In preparing the dissertation to be published using the OP, there emerged two parallel reflective practices: both the text and image had to be made opaque. This process required a great deal of labor, editing multiple versions of each image (fig. 7). This labor was a boon insomuch as it afforded time and effort to think through each image: What is being shown? Who is central in the image? Who has agency and who does not? What needs to be erased? And what needs to be maintained (fig. 7)?
Figure 6. As we edited this image from Bellevue Hospital’s tuberculosis clinic, we became aware of small details that changed our understanding of the image. What is the nurse at the center of the photograph doing? Was it a choice to frame patients with their backs turned? For each level of opacity, we had to consider who was in the image, and make decisions about whom we would remove. Image courtesy of the New York Academy of Medicine.
Figure 7. The framing in this photograph similarly gave us pause. Is the doctor’s expression that way because he did not want a camera in the examining room? Why does the doctor return the camera’s gaze while the patient does not? Crowell, F. Elisabeth. The Work of New York’s Tuberculosis Clinics. 1910. Image courtesy of the New York Academy of Medicine.
The process of digitally editing these photos, of cutting away bodies with the help of image editing tools, was also a process of touching upon them, enabling a haptic relationship between ourselves and the primary source. As Tina Campt argues, holding archival materials, grasping them in our hands, forces us “to attend to the quiet frequencies of austere images that reverberate between images, statistical data, and state practices of social regulation” (90). The process of making images and text opaque was a laborious one, not a rote one like cropping an image or changing a .tiff into a .jpg. It was a moment of reflection and communion. It let us touch upon the lives of our interlocutors, to say: I see you. You did not want to be here.
This work enabled us to better understand the practices that captured subjects within academic research. Our conclusions regarding the continuity between subject and object, between patient and stolen remains, are the result of this hands-on process.
From this experience, we advocate for the use of a research audit. This is a moment of retrospective reflection that occurs at the end of a research program but before publication. During a research audit, a researcher or team of researchers review their work. They conduct a close reading of their evidentiary materials with the goal of teasing apart the epistemic assumptions in their research that reduce, dehumanize, or alienate their human interlocutors. In this phase of the work, researchers ask, whose lives, deaths, and afterlives are integral to my research? Why am I using them? And who benefits from this work? Am I working in obligation with those who make my own research possible?
Obligation, as we use it, is indebted to Indigenous axiology, epistemology, ontology, and methodology (Wilson 2008), particularly to the kinds of relationships it calls for between researchers and subjects. Kim TallBear, describing these relationships, reminds us that we are obligated to care for everyone our research touches upon, both those from subaltern groups and those from powerful positionalities (2014). Our research is not our own, but a collaboration that ties us to the people, institutions, histories, and moments that inform our arguments.
This end-of-project reflection helps us attend to everyone who is entangled in our research. Importantly, it occurs after completing research, and it is intentionally separate from the requirements of institutional review board (IRB) approval. In an ethics audit, researchers review their work, and mark the finished product in ways that show the fraught relationship between their jobs as knowledge workers and their obligations to their interlocutors. The opacity protocols developed for the OP and The Tuberculosis Specimen were a means of showing that the evidentiary requirements for a completed dissertation conflicted with an ethics of care. They were designed to make clear to the reader that 1) as scholars we have to show our work, and 2) this practice of showing is often at the expense of those whose lives and deaths are entangled in our research programs.
The ethics audit is a moment to embrace hypocrisy as a critical method, acknowledging that knowledge work is always partial, contested, and conflicted. This idea is built out of the feminist, anticolonial approach to ethics described by Max Liboiron and their colleagues. Seeing a need to navigate the complex, impossible, overlapping, and contradicting ethical demands in a research project, they write,
our obligations and relations are often compromised, meaning we are beholden to some over others, and reproduce problematic parts of dominant frameworks while reproducing good relations at other scales. Compromise is not a mistake or a failure—it is the condition for action in a diverse field of relations. (137-38)
The opaque methodology we described above was premised on an assumption that we cannot do ethically perfect research. We instead chose to do the best research we could, while considering the historical, cultural, and epistemic violence that we addressed and within which we are enmeshed (Caswell & Cifor 2016). We inhabit a hypocritical position by design, revealing and concealing in the same breath.
Conclusion
This intervention was developed to advocate for the return of stolen materials in the history of medicine. In working on this project, we found ourselves in a double bind: we have built an environment for more and new scholarship, and find ourselves arguing against that kind of output. Caring for the lives, deaths, and afterlives of research subjects means not using their bodies, histories, and images to further our own agenda, and yet, in this essay, we have. We wanted to show our work, and to convey the importance of these problems, but realize that our evidence is, itself, the problem.
Refusal, in academic contexts, is multi-leveled in practice. Our interlocutors and subjects can refuse us. We knowledge workers can also refuse (Simpson 2007; Liboiron 2021). Academic refusal does not have a one size fits all model. It is an imperfect patchwork assembled through care, attention, and practice. The approach to opacity outlined in this essay is flawed by design. We write from institutions with their own histories of collection and exploitation.
Modes and methods of refusal do not have to be clean, or perfect, nor do they foreclose the creation of knowledge. To that end, we developed the OP as a platform that refuses and reveals, in an attempt to make these contradictions more visible. The process forced us, as knowledge workers, to more carefully assess our materials, and to work in obligation to the people whose lives, deaths, cultures, and histories are necessary to making our arguments.
Acknowledgements
We would like to thank Emily Clark, Vanessa Elias, and all of our colleagues at the Institute for Digital Arts and Humanities at Indiana University Bloomington for their help at different times during the research process. We are also thankful to Marisa Hicks-Alcaraz who assisted in the early phases of our research. This article was vastly improved through a generous and generative open-peer review process. Thanks to Roopika Risam, Jessica Schomberg and Pamella Lach for their constructive feedback. This research was made possible thanks to funding from the New York Academy of Medicine, the Center for Research on Race Ethnicity and Society, and with support from the American Council of Learned Societies’ (ACLS) Digital Justice grant program.
This essay is the first in a two-part series of articles developed for Lead Pipe. Where this essay focuses on the theoretical grounding of our critique, our follow-up essay will describe and detail the specific technological and methodological approaches we developed while creating the Opaque Publisher (OP) and DigitalArc. Together these essays will show how ethical frameworks and methodologies can be produced through a design-based approach to research.
References
Abel, Emily K. 2007. Tuberculosis & the Politics of Exclusion: A History of Public Health & Migration to Los Angeles. Rutgers University Press.
Alberti, Samuel J. M. M. 2011. Morbid Curiosities: Medical Museums in Nineteenth-Century Britain. Oxford University Press.
Bryder, Linda. 1988. Below the Magic Mountain: A Social History of Tuberculosis in Twentieth-Century Britain. Clarendon Press.
Campt, Tina M. 2017. Listening to Images. Duke University Press.
Caswell, Michelle, and Marika Cifor. 2016. “From Human Rights to Feminist Ethics: Radical Empathy in the Archives.” Archivaria 81: 23–43.
Daston, Lorraine, and Peter Galison. 2007. Objectivity. Zone Books.
Feldberg, Georgina D. 1995. Disease and Class: Tuberculosis and the Shaping of Modern North American Society. Rutgers University Press.
Foucault, Michel. 1994a. The Birth of the Clinic: An Archaeology of Medical Perception. Translated by A. M. Sheridan Smith. Vintage Books.
Foucault, Michel. 1994b. The Order of Things: An Archaeology of the Human Sciences. Vintage Books.
Glissant, Édouard. 1997. Poetics of Relation. Translated by Betsy Wing. University of Michigan Press.
Hartman, Saidiya. 2008. “Venus in Two Acts.” Small Axe 12 (2): 1–14.
Lederer, Susan. 1995. Subjected to Science: Human Experimentation in America before the Second World War. The Johns Hopkins University Press.
Liboiron, Max. 2021. Pollution Is Colonialism. Duke University Press.
Liboiron, Max, Emily Simmonds, Edward Allen, et al. 2021. “Doing Ethics with Cod.” In Making & Doing: Activating STS through Knowledge Expression and Travel, edited by Gary Lee Downey and Teun Zuiderent-Jerak. The MIT Press.
McGranahan, Carole. 2016. “Theorizing Refusal: An Introduction.” Cultural Anthropology 31 (3): 319–25.
Monteiro, Lyra. 2023. “Open Access Violence: Legacies of White Supremacist Data Making at the Penn Museum, from the Morton Cranial Collection to the MOVE Remains.” International Journal of Cultural Property 30: 105–37.
Purcell, Sean. 2025. “The Tuberculosis Specimen: The Dying Body and Its Use in the War Against the ‘Great White Plague.’” Indiana University. tuberculosisspecimen.github.io/diss.
Redman, Samuel J. 2016. Bone Rooms: From Scientific Racism to Human Prehistory in Museums. Harvard University Press.
Richardson, Ruth. 1987. Death, Dissection and the Destitute. The University of Chicago Press.
Rosenbloom, Megan. 2020. Dark Archives: A Librarian’s Investigation into the Science and History of Books Bound in Human Skin. Farrar, Straus and Giroux.
Sappol, Michael. 2002. A Traffic of Dead Bodies: Anatomy and Embodied Social Identity in Nineteenth-Century America. Princeton University Press.
Sutton, Jazma, and Kalani Craig. 2022. “Reaping the Harvest: Descendant Archival Practice to Foster Sustainable Digital Archives for Rural Black Women.” Digital Humanities Quarterly 16 (3).
TallBear, Kim. 2014. “Standing With and Speaking as Faith: A Feminist-Indigenous Approach to Inquiry.” Journal of Research Practice 10 (2).
Tuck, Eve, and K. Wayne Yang. 2014. “Unbecoming Claims: Pedagogies of Refusal in Qualitative Research.” Qualitative Inquiry 20 (6): 811–18.
Warner, John Harley. 2014. “The Aesthetic Grounding of Modern Medicine.” Bulletin of the History of Medicine 88 (1): 1–47.
Washington, Harriet A. 2006. Medical Apartheid: The Dark History of Medical Experimentation on Black Americans from Colonial Times to the Present. Harlem Moon & Broadway Books.
Wilson, Shawn. 2008. Research Is Ceremony: Indigenous Research Methods. Fernwood Publishing.
A monk asked Chao Chou, “‘The Ultimate Path has no difficulties–just avoid picking and choosing. As soon as there are words and speech, this is picking and choosing.’ So how do you help people, Teacher?”

Chou said, “Why don’t you quote this saying in full?” The monk said, “I only remember up to here.”

Chou said, “It’s just this: ‘This Ultimate Path has no difficulties–just avoid picking and choosing.’” (Cleary, 2005, p. 337)
Cleary, T. (2005). The Blue Cliff Record. Boston: Shambhala.
As has now been widely reported, the White House has sent a number of universities, including the one I work at, a set of terms it wants them to agree to, which indicate that not doing so may mean they “forego federal benefits”. It’s not entirely clear what criteria were used to select the universities, though I suspect in my university’s case it may have had something to do with its recent willingness to give in to earlier demands from the Trump regime when it looked like the only community members they’d have to sell out were their transgender student athletes.
Now, as Martin Niemöller’s readers could have predicted, they’re coming back for more. As I write this, I’ve heard no word from our university administration, either in response or acknowledgement, but we also didn’t hear a lot from them before they made their previous deal with the White House. (Another university’s board chair, though, suggested eagerness to comply.)
But it isn’t just university faculty and research centers that would be muzzled by the agreement. The libraries would be too. That’s the implication of section 4 of the proposal, which mandates that “all of the university’s academic units, including all colleges, faculties, schools, departments, programs, centers, and institutes” comply with what the White House calls “institutional neutrality”. University libraries are among those centers, and the proposal says they would have to “abstain from actions or speech relating to societal and political events except in cases in which external events have a direct impact upon the university.”
Academic libraries are full of speech relating to societal and political events that don’t have a “direct impact on the university”. It’s obviously in many of the books in our collection, which deal with societal and political events of all kinds. But it’s also in what we do to build our collections, put them in context, and invite our community to engage with them. It’s in the exhibits we create, the web pages we publish, the events we host, and the speakers we invite. Much of it is usually not particularly controversial; I’ve heard no protests about our Revolution at Penn? exhibit, for instance. But exhibits honestly dealing with revolution cannot avoid talking about political events, and while that might be welcome when they discuss how Revolutionary leaders fought for America’s freedom, we’ve seen how the White House reacts when they also discuss how they denied some Americans’ freedom. (I’ll note that a similar subject is also addressed in another exhibit our library hosts.)
The proposal also calls for “transforming or abolishing institutional units that purposefully… belittle… conservative ideas”. Many of the recent calls to ban books in US libraries and schools are the ideas of self-proclaimed conservatives, and libraries of all kinds speak out against these “societal and political” events. To date, most American research libraries have not yet been directly impacted by these bans, which have largely been imposed on public and K-12 school libraries. But they still have every right to object to them, and this proposal could easily be used to chill such objections. Indeed, much to my chagrin, even without this agreement my university’s library has already taken down online statements championing other important library values, out of concern over government reaction. I hope the statements will return online before long, but agreeing to the White House’s new terms would increase, rather than reduce, an already unacceptable expressive chill.
Research libraries also cannot assume their own collections will be safe from censorship, should their universities sign on to the White House’s proposal. Recently a controversial Fifth Circuit court decision upheld a book ban in part by accepting an argument that “a library’s collection decisions are government speech”— which is to say, official speech. The White House could use this argument to interfere with collection decisions they also consider to be official institutional speech on societal and political events, should a library’s sponsoring institution sign on to this agreement.
A university might argue that some of these restraints on library activities and collections aren’t a reasonable interpretation of the terms of the White House proposal. But the agreement takes the decision on what’s reasonable out of the university’s hands. Instead, “adherence to this agreement shall be subject to review by the Department of Justice”, which has the power to compel the return of “all monies advanced by the U.S. government during the year of any violation”, large or small, whether in the library or elsewhere. The Department of Justice is not an agency with particular expertise in education, librarianship, or research. And it’s also no longer an agency independent of the White House, and a number of commentators (including some former GOP-appointed officials) have noted that it is now carrying out “vindictive retribution” against Donald Trump’s enemies.
Academic libraries are often called “the heart of the university” because of how their collections, spaces, and people sustain the university’s intellectual life. As I’ve shown above, both the terms of the White House’s proposed agreement and its context threaten to cut off the free inquiry, dialogue, and innovation that our libraries sustain. Universities that accept its extreme demands even as a basis for negotiation, rather than completely rejecting them, risk being drawn into haggling over the shape of the noose they are asked to put on. They should refuse the noose outright.
Today I’m introducing new pages for people and other authors on The Online Books Page. The new pages combine and augment information that’s been on author listings and subject pages. They let readers see in one place books both about and by particular people. They also let readers quickly see who the authors are and learn more about them. And they encourage readers to explore to find related authors and books online and in their local libraries. They draw on information resources created by librarians, Wikipedians, and other people online who care about spreading knowledge freely. I plan to improve on them over time, but I think they’re developed enough now to be useful to readers. Below I’ll briefly explain my intentions for these pages, and I hope to hear from you if you find them useful, or have suggestions for improvement.
Who is this person?
Readers often want to know more about the people who created the books they’re interested in. If they like an author, they might want to learn more about them and their works– for instance, finding out what Mark Twain did besides creating Tom Sawyer and Huckleberry Finn. For less familiar authors, it helps to know what background, expertise, and perspectives the author has to write about a particular subject. For instance, Irving Fisher, a famous economist in the early 20th century, wrote about various subjects, not just ones dealing with economics, but also with health and public policy. One might treat his writings on these various topics in different ways if one knows what areas he was trained in and in what areas he was an interested amateur. (And one might also reassess his predictive abilities even in economics after learning from his biography that he’d famously failed to anticipate the 1929 stock market crash just before it happened.)
The Wikipedia and the Wikimedia Commons communities have created many articles, and uploaded many images, of the authors mentioned in the Online Books collection, and they make them freely reusable. We’re happy to include their content on our pages, with attribution, when it helps readers better understand the people whose works they’re reading. Wikipedia is of course not the last word on any person, but it’s often a useful starting point, and many of its articles include links to more authoritative and in-depth sources. We also link to other useful free references in many cases. For example, our page on W. E. B. Du Bois includes links to articles on Du Bois from the Encyclopedia of Science Fiction, the Internet Encyclopedia of Philosophy, BlackPast, and the Archives and Records center at the University of Pennsylvania, each of which describes him from a different perspective. Our goal in including these links on the page is not to exhaustively present all the information we can about an author, but to give readers enough context and links to understand who they are reading, and to encourage them to find out more.
Find more books and authors
Part of encouraging readers to find out more is to give them ways of exploring books and authors beyond the ones they initially find. Our page on Rachel Carson, for example, includes a number of works she co-wrote as an employee of the US Fish and Wildlife Service, as well as a public domain booklet on her prepared by the US Department of State. But it doesn’t include her most famous works like Silent Spring and The Sea Around Us, which are still under copyright without authorized free online editions, as are many recent biographies and studies of Carson. But you can find many of these books in libraries near you. Links we have on the left of her page will search library catalogs for works about her, and links on the bottom right will search them for works by her, via our Forward to Libraries service.
Readers might also be interested in Carson’s colleagues. The “Associated authors” links on the left side of Carson’s page go to other pages about people that Carson collaborated with who are also represented in our collection, like Bob Hines and Shirley Briggs. Under the “Example of” heading, you can also follow links to other biologists and naturalists, doing similar work to Carson.
Metadata created with care by people, processed with care by code
I didn’t create, and couldn’t have created (let alone maintained) all of the links you see on these pages. They’re the work of many other people. Besides the people who wrote the linked books, collaborated on the linked reference articles, and created the catalog and authority metadata records for the books, there are lots of folks who created the linked data technology and data that I use to automatically pull together these resources on The Online Books Page. I owe a lot to the community that has created and populated Wikidata, which much of what you see on these pages depends on, and to the LD4 library linked data community, which has researched, developed, and discussed much of the technology used. (Some community members have themselves produced services and demonstrations similar to the ones I’ve put on Online Books.) Other crucial parts of my services’ data infrastructure come from the Library of Congress Linked Data Service and the people that create the records that go into that. The international VIAF collaboration has also been both a foundation and inspiration for some of this work.
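As a rough illustration of the kind of linked-data lookup these sources make possible (and not the actual code behind The Online Books Page), the sketch below queries Wikidata’s public SPARQL endpoint for a person’s English label, description, and image, keyed on a VIAF identifier (Wikidata property P214). The VIAF ID in the example is a placeholder.

# A rough sketch, not The Online Books Page's actual code: look up a person on
# Wikidata by VIAF identifier (property P214) and pull an English label,
# description, and image (property P18) if one exists.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def person_by_viaf(viaf_id: str) -> list[dict]:
    query = f"""
    SELECT ?person ?personLabel ?personDescription ?image WHERE {{
      ?person wdt:P214 "{viaf_id}" .
      OPTIONAL {{ ?person wdt:P18 ?image . }}
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}
    """
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "people-pages-sketch/0.1 (example contact)"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

# Example with a placeholder VIAF ID (substitute a real identifier):
# for row in person_by_viaf("00000000"):
#     print(row["personLabel"]["value"], row.get("image", {}).get("value", ""))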
These days, you might expect a new service like this to use or tout artificial intelligence somehow. I’m happy to say that the service does not use any generative AI to produce what readers see, either directly, or (as far as I’m aware) indirectly. There’s quite a bit of automation and coding behind the scenes, to be sure, but it’s all built by humans, using data produced in the main by humans, who I try to credit and cite appropriately. We don’t include statistically plausible generated text that hasn’t actually been checked for truth, or that appropriates other people’s work without permission or credit. We don’t have to worry about unknown and possibly unprecedented levels of power and water consumption to power our pages, or depend on crawlers for AI training so aggressive that they’re knocking library and other cultural sites offline. (I haven’t yet had to resort to the sorts of measures that some other libraries have taken to defend themselves against aggressive crawling, but I’ve noticed the new breed of crawlers seriously degrading my site’s performance, to the point of making it temporarily unusable, on more than one occasion.) With this and my other services, I aim to develop and use code that serves people (rather than selfishly or unthinkingly exploiting them), and that centers human readers and authors.
Work in progress
I hope readers find the new “people” pages on The Online Books Page useful in discovering and finding out more about books and authors of interest to them. I’ve thought of a number of ways we can potentially extend and build on what we’re providing with these new pages, and you’ll likely see some of them in future revisions of the service. I’ll be rolling the new pages out gradually, and plan to take some time to consider what features improve readers’ experience, and don’t excessively get in their way. The older-style “books by” and “books about” people pages will also continue to be available on the site for a while, though these new integrated views of people may eventually replace them.
If you enjoy the new pages, or have thoughts on how they could be improved, I’d enjoy hearing from you! And as always, I’m also interested in your suggestions for more books and serials — and people! — we can add to the Online Books collection.
About 0.5% of websites publish their content in Arabic, placing Arabic 20th among languages on the web; however, Arabic is the 6th most spoken language in the world, at 3.4%. A considerable portion of Arabs live in English-speaking countries. For example, Arabs make up roughly 1.2% of the U.S. population. Some of them, mainly first generation, are able to consume news in Arabic in addition to English. Second, third, and fourth generation Arabs might be interested in the Arabic narrative of news stories, but they prefer English since it is their first language. In this post, we present a quantitative study of the archival rate of news webpages published in Arabic, compared to news pages published in English by Arabic media, from 1999 to 2022. We reveal that, contrary to the general conjecture that web archives favor English webpages, the archival rate of Arabic webpages is increasing more rapidly than the archival rate for English webpages.
The Dataset
Our dataset consists of 1.5 million multilingual news stories' URLs, collected in September of 2022 from the sitemaps of four prominent news websites: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. Using sitemaps yielded the largest number of stories' URLs among the methods we examined for fetching URLs, which included RSS, Twitter, GDELT, web crawling, and sitemaps. We selected a sample of our dataset based on the median number of stories published each day in each year: the day whose number of published stories equals the yearly median represents that year. For example, because the median number of stories published per day in 2002 is 93, we selected the stories published on that day as the sample representing 2002. For all 23 years we studied, the median number of published stories per day is very close to the mean for the year. Our sample contains 4116 URIs to news stories (2684 Arabic and 1432 English). The dataset is available on GitHub.
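As a concrete illustration of that sampling step, here is a minimal sketch (not the code used in the study) of the median-day selection with pandas. The file name and the "url", "language", and "published" columns are assumptions made for the example.

# Minimal sketch of the median-day sampling described above (illustrative only;
# the file name and the "url", "language", "published" columns are assumptions).
import pandas as pd

stories = pd.read_csv("stories.csv", parse_dates=["published"])
stories["year"] = stories["published"].dt.year
stories["day"] = stories["published"].dt.date

samples = []
for (year, lang), group in stories.groupby(["year", "language"]):
    per_day = group.groupby("day").size()              # stories published on each day
    median_count = per_day.median()                    # the yearly median
    # pick the day whose story count is closest to (ideally equal to) the median
    median_day = (per_day - median_count).abs().idxmin()
    samples.append(group[group["day"] == median_day])

sample = pd.concat(samples)
print(len(sample), "URIs in the sample")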
Our dataset, collected in September of 2022, consists of 1.5 million news stories in Arabic and English published between 1999 and 2022. We found that 47% of stories published in Arabic were not archived, while only 42% of the stories published in English were not archived. However, the archival rate of Arabic stories increased from 24% for stories published between 1999 and 2013 to 53% for stories published between 2013 and 2022, whereas the archival rate of news stories published in English only increased from 47% to 58%. For Arabic webpages, our results are similar to those of a 2017 study that found Arabic webpages were archived at a rate of 53%, for a different dataset consisting of general Arabic web pages from website directories including DMOZ, Raddadi, and Star28 (now defunct). There has been a notable increase in the percentage of archived pages from Arabic websites in the last 10 years.
We discovered that 47% of English news stories published between 1999 and 2013 were archived. This differs from another study, with a different dataset, which found in 2017 that 72% of English webpages were archived. It is possible that the discrepancy arises because our dataset only included English news stories published by Arabic media, while their dataset consisted of general English web pages from the website directory DMOZ.
58% of English news stories published between 1999 and 2022 in our dataset were archived. While there is an increase in the archival rate for English pages, it is not as large as the increase for Arabic ones; for English news stories, the increase could be considered normal/expected over a 10-year timeframe. It is worth mentioning that, because websites have used more and more JavaScript over the last 10 years, archived mementos have more missing resources such as images and other multimedia. The increase is therefore an overall improvement, but there is a chance that less content per page is captured in recent years. We did not study missing resources in the archived mementos we found, so we cannot confirm whether or not missing resources are still on the rise in archived web pages.
Arabic and English news stories' URIs archival rate

Category             Arabic Language URIs    English Language URIs
URIs Queried         2684                    1432
URIs Archived        1435                    834
URIs Not Archived    1249                    598
Archival Rate        0.53                    0.58
While we were sampling from our dataset, we noticed an increase in Arabic stories published per day (median) for each year. The increase in the number of collected stories over time is expected due to news outlets moving towards publishing on the web in the last 20+ years.
The lower number in the following figure for 2022 is due to our dataset only spanning stories published between January 1999 and September 2022.
The number of collected Arabic stories per day (median)
We could not observe a consistent increase or decrease in the number of published stories in English per day (median) for each year because Arab News did not include any stories published after 2013 in its sitemap. Only Aljazeera English, in our dataset, included stories published after 2013 in its sitemap. The other two news websites, Aljazeera Arabic and Alarabiya, publish news in Arabic.
The number of collected English stories per day (median)
For Arabic news stories published on the median day in our dataset, nothing was archived for 1999, 2000, and 2004. We decided to sample using the median day for the number of stories published per day each year because the median is very close to the mean value for the number of stories published per day. Moreover, using the median day, we obtained a relatively small sample, 4116 URIs, that spans and represents 23 years' worth of news stories from four news networks in two languages (1.5 million URIs), which would otherwise not have been feasible to study.
The min, max, median, and mean for the number of collected stories' URIs each day by year
We found only a small increase in the Arabic webpage archival rate until 2010; the rate fluctuates after 2013, but it remained above 40% from 2014 to 2022. Overall, the increase in the Arabic news webpage archival rate over the last 20 years is significant.
Arabic webpages archival rate by year
For English news stories, nothing was collected for 1999 and 2000 because these news outlets had little to no presence on the web during those years. We noticed even more fluctuation in the archival rate for English webpages, but less overall increase than for Arabic webpages.
English webpages archival rate by year
We measured the archival rate for Arabic webpages in our dataset by web archive to find the contribution of each archive to the archiving of these URIs. Using MemGator to check whether the collected news stories were archived by public web archives (a sketch of this query step follows the list), we studied the following archives:
1. waext.banq.qc.ca: Libraries and National Archives of Quebec
2. warp.da.ndl.go.jp: National Diet Library, Japan
3. wayback.vefsafn.is: Icelandic Web Archive
4. web.archive.bibalex.org: Bibliotheca Alexandrina Web Archive
5. web.archive.org.au: Australian Web Archive
6. webarchive.bac-lac.gc.ca: Library and Archives Canada
7. webarchive.loc.gov: Library of Congress
8. webarchive.nationalarchives.gov: UK National Archives Web Archive
9. webarchive.nrscotland.gov.uk: National Records of Scotland
10. webarchive.org.uk: UK Web Archive
11. webarchive.parliament.uk: UK Parliament Web Archive
12. wayback.nli.org.il: National Library of Israel
13. archive.today: Archive Today
14. arquivo.pt: The Portuguese Web Archive
15. perma.cc: Perma.cc Archive
16. swap.stanford.edu: Stanford Web Archive
17. wayback.archive-it.org: Archive-It (powered by the Internet Archive)
18. web.archive.org: the Internet Archive
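The check itself amounts to asking an aggregator for a TimeMap per URI and seeing whether it lists any mementos. Here is a minimal sketch, assuming a MemGator instance reachable at the host and port below (both assumptions; adjust them to whatever instance you run or are permitted to use):

# Minimal sketch: ask a MemGator aggregator for a link-format TimeMap and count
# the mementos it reports. The host/port and the example URI are assumptions.
import requests

MEMGATOR = "http://localhost:1208"   # assumed local MemGator instance

def memento_count(uri: str) -> int:
    """Return the number of mementos the aggregator reports for uri (0 if none)."""
    resp = requests.get(f"{MEMGATOR}/timemap/link/{uri}", timeout=120)
    if resp.status_code != 200:      # no TimeMap means no known mementos
        return 0
    count = 0
    for line in resp.text.splitlines():
        # memento entries carry a rel attribute containing "memento"
        if 'rel="' in line and "memento" in line.split('rel="', 1)[1]:
            count += 1
    return count

uri = "https://www.aljazeera.net/"   # illustrative URI, not from the sample
print(uri, "has", memento_count(uri), "mementos")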
Besides the Internet Archive, only archive.today and arquivo.pt returned any mementos for the 2684 Arabic URIs we queried; together they returned a total of seven mementos for six different URIs.
We found that the Internet Archive has archived more Arabic news pages than all other archives combined by a large margin. Other archives hardly contributed to archiving Arabic stories' URIs.
The percentage of archived Arabic news stories in web archives
As for English news webpages, looking at the archival rate by web archive, the Internet Archive returned mementos for a much larger number of URIs than the sum of all other web archives, but the gap in contribution between the IA and the sum of all other web archives is not as large as it is for Arabic news webpages in our dataset.
The percentage of archived English news stories in web archives
Furthermore, we found that the union of all other archives' URI-Rs is a proper subset of the IA's URI-Rs. In other words, only the IA had exclusive copies of URIs of Arabic news stories; all other archives had no exclusive copies. This doesn't necessarily mean that the union of all other archives' URI-Ms is a proper subset of the IA's URI-Ms, because the same URI-R could have been archived at different times by different web archives. This finding indicates that losing all web archives besides the IA would cause almost no loss of information, while losing the IA would be disastrous for Arabic pages' web archiving.
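The subset relation itself is straightforward set algebra over the URI-Rs each archive returned mementos for. A minimal sketch with made-up toy data (the sets below are illustrative stand-ins, not our results):

# Minimal sketch of the exclusivity check. The sets below are toy examples;
# in practice they would hold the URI-Rs each archive returned mementos for.
ia_uris = {
    "https://example.net/story1",
    "https://example.net/story2",
    "https://example.net/story3",
}
other_archives = {
    "archive.today": {"https://example.net/story1"},
    "arquivo.pt": {"https://example.net/story2"},
}

others_union = set().union(*other_archives.values())

# Empty iff every URI-R held elsewhere is also held by the IA
exclusive_to_others = others_union - ia_uris
# What would be lost (at the URI-R level) if the IA disappeared
exclusive_to_ia = ia_uris - others_union

print("others are a proper subset of IA:", others_union < ia_uris)
print("URI-Rs exclusive to other archives:", len(exclusive_to_others))
print("URI-Rs exclusive to the IA:", len(exclusive_to_ia))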
The percentage of exclusively archived Arabic news stories
For English news webpages in our sample, the IA had many more exclusive copies of URIs than all other archives combined, which indicates that losing all web archives besides the IA causes very little loss in information, but the opposite, losing the IA, is catastrophic.
The percentage of exclusively archived English news stories
Our finding is different from an earlier study by Alsum et al. (2014), where they found that it is possible to retrieve full TimeMaps for 93% of their dataset using the top nine web archives without the IA.
Conclusions
The archival rate of Arabic news pages was, and still is, lower than that of English news pages, but the gap is much smaller than it used to be. The archival rate of Arabic news pages has increased from 24% between 1999 and 2013 to 53% between 2013 and 2022. Our study shows that most of the increase is due to the IA's growth over time, while other web archives did not grow comparably. There was also more room for improvement in archiving Arabic news than English news. We show that losing all archives except the IA would cause no loss in archived Arabic news pages, but the loss would be irreversible if the IA no longer existed. For English webpages too, the majority of archived copies would be lost forever if the IA were crippled.
2025-10-03 edit: I replaced all graphs in this post with graphs that are more visually appealing.
Last May, I spoke at the Pecha Kucha portion of Bike Windsor Essex's AGM. Today I presented another 20 slides @ 20 seconds each, this time about games and libraries.
Win free books from the October 2025 batch of Early Reviewer titles! We’ve got 206 books this month, and a grand total of 2,416 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.
The deadline to request a copy is Monday, October 27th at 6PM EDT.
Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, Canada, the UK, Australia, Germany, New Zealand, Ireland, Italy, Finland, Czechia and more. Make sure to check the message on each book to see if it can be sent to your country.
Thanks to all the publishers participating this month!
Hello DLF Community! I’m excited to join you as Senior Program Officer and to be a part of this vibrant member network. I really enjoyed meeting the Working Group chairs at the September meeting, in some cases seeing familiar faces! I’m looking forward to continuing these conversations as we dive into preparations for the DLF Forum, a great opportunity to learn, share and connect. I’m grateful to be on this journey with you and eager to see where we go next as a community.
— Shaneé from Team DLF
This month’s news:
Forum registration closing soon: Registration for the 2025 DLF Forum in Denver, CO closes on Friday, 10/31. Register now to secure your spot.
Conference hotel filling: Book your room in the DLF Forum conference hotel for the lowest available room rates and to stay close to the action. Block closes 10/24.
Please join the DAWG Advocacy and Education group on Thursday, October 2, 2025 at 11:30 am ET to learn about the Accessibility Ambassadors Project @UMich.
Tiffany Harris (she/her) is the Accessibility Program Assistant for the University of Michigan library, and she is pursuing her Master’s of Science in Environmental Justice. During her presentation, she will be discussing the accessibility training that she and other members of the Library Accessibility team are leading for the Accessibility teams within the library. She is hosting training on Learning about People with Disabilities, Accessible Documents, Accessible Presentations, and Accessible Spreadsheets. She will also be discussing some of the Accessibility Ambassador projects such as assessing the end cap signage throughout the library, live captioning on slides and PowerPoints, and sensory friendly maps and floor plans. We give student staff a wide variety of projects to choose from to ensure that they are working on projects that relate to their interests and build out skills to develop for their resume.
You must Register in advance for this meeting. After registering, you will receive a confirmation email about joining the meeting. Please review the DLF Code of Conduct prior to attending.
This month’s open DLF group meetings:
For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.
DLF Digital Accessibility IT Subgroup (DAWG-IT): Monday, 10/6, 1pm ET / 10am PT
DLF Born-Digital Access Working Group (BDAWG): Tuesday, 10/7, 2pm ET / 11am PT
DLF Digital Accessibility Working Group (DAWG): Tuesday, 10/7, 2pm ET / 11am PT
DLF AIG Cultural Assessment Working Group: Monday, 10/13, 1pm ET / 10am PT
DLF AIG User Experience Working Group: Friday, 10/17, 11am ET / 8am PT
DLF Digital Accessibility IT Subgroup (DAWG-IT): Monday, 10/27, 1pm ET / 10am PT
James Grant invited me to address the annual conference of Grant's Interest Rate Observer. This was an intimidating prospect: the previous year's conference featured billionaires Scott Bessent and Bill Ackman. As usual, below the fold is the text of my talk, with the slides, links to the sources, and additional material in footnotes. Yellow background indicates textual slides.
The Gaslit Asset Class
Before I explain that much of what you have been told about cryptocurrency technology is gaslighting, I should stress that I hold no long or short positions in cryptocurrencies, their derivatives or related companies. Unlike most people discussing them, I am not "talking my book".
To fit in the allotted time, this talk focuses mainly on Bitcoin and omits many of the finer points. My text, with links to the sources and additional material in footnotes, will go up on my blog later today.
Why Am I Here?
I imagine few of you would understand why a retired software engineer with more than forty years in Silicon Valley was asked to address you on cryptocurrencies[1].
I was an early employee at Sun Microsystems then employee #4 at Nvidia, so I have been long Nvidia for more than 30 years. It has been a wild ride. I quit after 3 years as part of fixing Nvidia's first near-death experience and immediately did 3 years as employee #12 at another startup, which also IPO-ed. If you do two in six years in your late 40s you get seriously burnt out.
So my wife and I started a program at Stanford that is still running 27 years later. She was a career librarian at the Library of Congress and the Stanford Library. She was part of the team that, 30 years ago, pioneered the transition of academic publishing to the Web. She was also the person who explained citation indices to Larry and Sergey, which led to Page Rank.
The academic literature has archival value. Multiple libraries hold complete runs on paper of the Philosophical Transactions of the Royal Society starting 360 years ago[2].
The interesting engineering problem we faced was how to enable libraries to deliver comparable longevity to Web-published journals.
Five Years Before Satoshi Nakamoto
I worked with a group of outstanding Stanford CS Ph.D. students to design and implement a system for stewardship of Web content modeled on the paper library system. The goal was to make it extremely difficult for even a powerful adversary to delete or modify content without detection. It is called LOCKSS, for Lots Of Copies Keep Stuff Safe; a decentralized peer-to-peer system secured by Proof-of-Work. We won a "Best Paper" award for it five years before Satoshi Nakamoto published his decentralized peer-to-peer system secured by Proof-of-Work. When he did, LOCKSS had been in production for a few years and we had learnt a lot about how difficult decentralization is in the online world.
The fundamental problem of representing cash in digital form is that a digital coin can be endlessly copied, thus you need some means to prevent each of the copies being spent. When you withdraw cash from an ATM, turning digital cash in your account into physical cash in your hand, the bank performs an atomic transaction against the database mapping account numbers to balances. The bank is trusted to prevent multiple spending.
There had been several attempts at a cryptocurrency before Bitcoin. The primary goals of the libertarians and cypherpunks were that a cryptocurrency be as anonymous as physical cash, and that it not have a central point of failure that had to be trusted. The only one to get any traction was David Chaum's DigiCash; it was anonymous but it was centralized to prevent multiple spending and it involved banks.
Nakamoto's magnum opus
Bitcoin claims:
The system was trustless because it was decentralized.
It was a medium of exchange for buying and selling in the real world.
Transactions were faster and cheaper than in the existing financial system.
It was secured by Proof-of-Work and cryptography.
It was privacy-preserving.
When in November 2008 Nakamoto published Bitcoin: A Peer-to-Peer Electronic Cash System it was the peak of the Global Financial Crisis, and people were very aware that the financial system was broken (and it still is). Because it solved many of the problems that had dogged earlier attempts at electronic cash, it rapidly attracted a clique of enthusiasts. When Nakamoto went silent in 2010 they took over proselytizing the system. The main claims they made were:
The system was trustless because it was decentralized.
It was a medium of exchange for buying and selling in the real world.
Transactions were faster and cheaper than in the existing financial system.
It was secured by Proof-of-Work and cryptography.
It was privacy-preserving.
They are all either false or misleading. In most cases Nakamoto's own writings show he knew this. His acolytes were gaslighting.
Trustless because decentralized (1)
Assuming that the Bitcoin network consists of a large number of roughly equal nodes, it randomly selects a node to determine the transactions that will form the next block. There is no need to trust any particular node because the chance that they will be selected is small.[3]
At first, most users would run network nodes, but as the network grows beyond a certain point, it would be left more and more to specialists with server farms of specialized hardware. A server farm would only need to have one node on the network and the rest of the LAN connects with that one node.
Satoshi Nakamoto 2nd November 2008
The current system where every user is a network node is not the intended configuration for large scale. ... The design supports letting users just be users. The more burden it is to run a node, the fewer nodes there will be. Those few nodes will be big server farms. The rest will be client nodes that only do transactions and don’t generate.
Satoshi Nakamoto: 29th July 2010
But only three days after publishing his white paper, Nakamoto understood that this assumption would become false:
At first, most users would run network nodes, but as the network grows beyond a certain point, it would be left more and more to specialists with server farms of specialized hardware.
He didn't change his mind. On 29th July 2010, less than five months before he went silent, he made the same point:
The current system where every user is a network node is not the intended configuration for large scale. ... The design supports letting users just be users. The more burden it is to run a node, the fewer nodes there will be. Those few nodes will be big server farms.
"Letting users be users" necessarily means that the "users" have to trust the "few nodes" to include their transactions in blocks. The very strong economies of scale of technology in general and "big server farms" in particular meant that the centralizing force described in W. Brian Arthur's 1994 book Increasing Returns and Path Dependence in the Economy resulted in there being "fewer nodes". Indeed, on 13th June 2014 a single node controlled 51% of Bitcoin's mining, the GHash pool.[4]
The same centralizing economic forces apply to Proof-of-Stake blockchains such as Ethereum. Grant's Memo to the bitcoiners explained the process last February.
Trustless because decentralized (3)
Another centralizing force drives pools like GHash. The network creates a new block and rewards the selected node about every ten minutes. Assuming they're all state-of-the-art, there are currently about 15M rigs mining Bitcoin[6]. Their economic life is around 18 months, so only 0.5% of them will ever earn a reward. The owners of mining rigs pool their efforts, converting a small chance of a huge reward into a steady flow of smaller rewards. On average GHash was getting three rewards an hour.
A medium of exchange (1)
Quote from: Insti, July 17, 2010, 02:33:41 AM
How would a Bitcoin snack machine work?
You want to walk up to the machine. Send it a bitcoin.
?
Walk away eating your nice sugary snack. (Profit!)
You don’t want to have to wait an hour for you transaction to be confirmed.
The vending machine company doesn’t want to give away lots of free candy.
How does step 2 work?
I believe it’ll be possible for a payment processing company to provide as a service the rapid distribution of transactions with good-enough checking in something like 10 seconds or less.
Satoshi Nakamoto: 17th July 2010
Bitcoin's ten-minute block time is a problem for real-world buying and selling[7], but the problem is even worse. Network delays mean a transaction isn't final when you see it in a block. Assuming no-one controlled more than 10% of the hashing power, Nakamoto required another 5 blocks to have been added to the chain, so 99.9% finality would take an hour. With a more realistic 30%, the rule should have been 23 blocks, with finality taking 4 hours[8].
Nakamoto's 17th July 2010 exchange with Insti shows he understood that the Bitcoin network couldn't be used for ATMs, vending machines, buying drugs or other face-to-face transactions because he went on to describe how a payment processing service layered on top of it would work.
A medium of exchange (2)
assuming that the two sides are rational actors and the smart contract language is Turing-complete, there is no escrow smart contract that can facilitate this exchange without either relying on third parties or enabling at least one side to extort the other.
two-party escrow smart contracts are ... simply a game of who gets to declare their choice first and commit it on the blockchain sooner, hence forcing the other party to concur with their choice. The order of transactions on a blockchain is essentially decided by the miners. Thus, the party with better connectivity to the miners or who is willing to pay higher transaction fees, would be able to declare their choice to the smart contract first and extort the other party.
The situation is even worse when it comes to buying and selling real-world objects via programmable blockchains such as Ethereum[9]. In 2021 Amir Kafshdar Goharshady showed that[10]:
assuming that the two sides are rational actors and the smart contract language is Turing-complete, there is no escrow smart contract that can facilitate this exchange without either relying on third parties or enabling at least one side to extort the other.
He also observed that, on the Ethereum blockchain, escrows with trusted third parties are used more often than two-party escrows, presumably because they allow dispute resolution by a human.
Goharshady goes on to show that in practice trusted third-party escrow services are essential because two-party escrow smart contracts are:
simply a game of who gets to declare their choice first and commit it on the blockchain sooner, hence forcing the other party to concur with their choice. The order of transactions on a blockchain is essentially decided by the miners. Thus, the party with better connectivity to the miners or who is willing to pay higher transaction fees, would be able to declare their choice to the smart contract first and extort the other party.
The choice being whether or not the good had been delivered. Given the current enthusiasm for tokenization of physical goods the market for trusted escrow services looks bright.
Fast transactions
Actually the delay between submitting a transaction and finality is unpredictable, and can be much longer than an hour. Transactions are validated by miners and then added to the mempool of pending transactions, where each transaction waits until either:
The selected network node chooses it as one of the most profitable to include in its block.
It reaches either its specified timeout or the default of 2 weeks.
This year the demand for transactions has been low, typically under 4 per second, so the backlog has been low, around 40K or under three hours. Last October it peaked at around 14 hours worth.
The distribution of transaction wait times is highly skewed. The median wait is typically around a block time. The proportion of low-fee transactions means the average wait is normally around 10 times that. But when everyone wants to transact the ratio spikes to over 40 times.
There are two ways miners can profit from including a transaction in a block:
The fee the user chose to include in the transaction, which is paid to the miner. In effect, transaction slots are auctioned off.
The transactions the miner included in the block to front- and back-run the user's transaction, called Maximal Extractable Value[11]:
Maximal extractable value (MEV) refers to the maximum value that can be extracted from block production in excess of the standard block reward and gas fees by including, excluding, and changing the order of transactions in a block.
The block size limit means there is a fixed supply of transaction slots, about 7 per second, but the demand for them varies, and thus so does the price. In normal times the auction for transaction fees means they are much smaller than the block reward. But when everyone wants to transact they suffer massive spikes.
Secured by Proof-of-Work (1)
In cryptocurrencies "secured" means that the cost of an attack exceeds the potential loot. The security provided by Proof-of-Work is linear in its cost, unlike techniques such as encryption, whose security is exponential in cost. It is generally believed that it is impractical to reverse a Bitcoin transaction after about an hour because the miners are wasting such immense sums on Proof-of-Work. Bitcoin pays these immense sums, but it doesn't get the decentralization they ostensibly pay for.
Monero, a privacy-focused blockchain network, has been undergoing an attempted 51% attack — an existential threat to any blockchain. In the case of a successful 51% attack, where a single entity becomes responsible for 51% or more of a blockchain's mining power, the controlling entity could reorganize blocks, attempt to double-spend, or censor transactions.
A company called Qubic has been waging the 51% attack by offering economic rewards for miners who join the Qubic mining pool. They claim to be "stress testing" Monero, though many in the Monero community have condemned Qubic for what they see as a malicious attack on the network or a marketing stunt.
The advent of "mining as a service" about 7 years ago made 51% attacks against smaller Proof-of-Work alt-coin such as Bitcoin Gold endemic. In August Molly White reported that Monero faces 51% attack:
Note that only Bitcoin and Ethereum among cryptocurrencies with "market cap" over $100M would cost more than $100K to attack. The total "market cap" of these 8 currencies is $271.71B and the total cost to 51% attack them is $1.277M or 4.7E-6 of their market cap.
Eric Budish's key insight was that, to ensure that 51% attacks are uneconomic, the reward for a block (implicitly a transaction tax) plus the fees has to be greater than the maximum value of the transactions in it. The total transaction cost (reward + fee) typically peaks around 1.8% but is normally between 0.6% and 0.8%, around 150 times less than Budish's safety criterion. The result is that a conspiracy between a few large pools could find it economic to mount a 51% attack.
Secured by Proof-of-Work (2)
However, ∆attack is something of a “pick your poison” parameter. If ∆attack is small, then the system is vulnerable to the double-spending attack ... and the implicit transactions tax on economic activity using the blockchain has to be high. If ∆attack is large, then a short time period of access to a large amount of computing power can sabotage the blockchain.
But everyone assumes the pools won't do that. Budish further analyzed the effects of a multiple spend attack. It would be public, so it would in effect be sabotage, decreasing the Bitcoin price by a factor ∆attack. He concludes that if the decrease is small, then double-spending attacks are feasible and the per-block reward plus fee must be large, whereas if it is large then access to the hash power of a few large pools can quickly sabotage the currency.
The implication is that miners, motivated to keep fees manageable, believe ∆attack is large. Thus Bitcoin is secure because those who could kill the golden goose don't want to.
Secured by Proof-of-Work (3)
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
As Raphael Auer showed:
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
In other words, the security of Bitcoin's blockchain depends upon inflating the currency with block rewards. This problem is exacerbated by Bitcoin's regular "halvenings" reducing the block reward. To maintain miners' current income after the next halvening, in less than three years, the "price" would need to be over $200K; security depends upon the "price" appreciating faster than 20%/year.
Once the block reward gets small, safety requires the fees in a block to be worth more than the value of the transactions in it. But everybody has decided to ignore Budish and Auer.
showed that (i) a successful block-reverting attack does not necessarily require ... a majority of the hash power; (ii) obtaining a majority of the hash power ... costs roughly 6.77 billion ... and (iii) Bitcoin derivatives, i.e. options and futures, imperil Bitcoin’s security by creating an incentive for a block-reverting/majority attack.
They assume that an attacker would purchase enough state-of-the-art hardware for the attack. Given Bitmain's dominance in mining ASICs, such a purchase is unlikely to be feasible.
But it would not be necessary. Mining is a very competitive business, and power is the major cost[13]. Making a profit requires both cheap power and early access to the latest, most efficient chips. So it wasn't a surprise that Ferreira et al's Corporate capture of blockchain governance showed that:
As of March 2021, the pools in Table 1 collectively accounted for 86% of the total hash rate employed. All but one pool (Binance) have known links to Bitmain Technologies, the largest mining ASIC producer.
[14]
Bitmain, a Chinese company, exerts significant control over Bitcoin. China has firmly suppressed domestic use of cryptocurrencies, whereas the current administration seems intent on integrating them (and their inevitable grifts) into the US financial system. Except for Bitmain, no-one in China gets eggs from the golden goose. This asymmetry provides China with a way to disrupt the US financial system.
It would be important to prevent the disruption being attributed to China. A necessary precursor would therefore be to obscure the extent of the Bitmain-affiliated pools' mining power. This has been a significant trend in the past year; note the change in the "unknown" category in the graphs from 38 to 305. There could be other explanations, but whether or not it is intentional, this is creating a weapon.[15]
Secured by cryptography (1)
The dollars in your bank account are simply an entry in the bank's private ledger tagged with your name. You control this entry, but what you own is a claim on the bank[16]. Similarly, your cryptocurrency coins are effectively an entry in a public ledger tagged with the public half of a key pair. The two differences are that:
No ownership is involved, so you have no recourse if something goes wrong.
Anyone who knows the secret half of the key pair controls the entry. Since it is extremely difficult to stop online secrets leaking, something is likely to go wrong[17].
Satoshi, That would indeed be a solution if SHA was broken (certainly the more likely meltdown), because we could still recognize valid money owners by their signature (their private key would still be secure).
However, if something happened and the signatures were compromised (perhaps integer factorization is solved, quantum computers?), then even agreeing upon the last valid block would be worthless.
True, if it happened suddenly. If it happens gradually, we can still transition to something stronger. When you run the upgraded software for the first time, it would re-sign all your money with the new stronger signature algorithm. (by creating a transaction sending the money to yourself with the stronger sig)
Satoshi Nakamoto: 10th July 2010
On 10th July 2010 Nakamoto addressed the issue of what would happen if either of these algorithms were compromised. There are three problems with his response: compromise is likely in the near future; when it happens, Nakamoto's fix is inadequate; and there is a huge incentive for it to happen suddenly:
the elliptic curve signature scheme used by Bitcoin is much more at risk, and could be completely broken by a quantum computer as early as 2027, by the most optimistic estimates.
Their "most optimistic estimates" are likely to be correct; PsiQuantum expects to have two 1M qubit computers operational in 2027[19]. Each should be capable of breaking an ECDSA key in under a week.
Bitcoin's transition to post-quantum cryptography faces a major problem because, to transfer coins from an ECDSA wallet to a post-quantum wallet, you need the key for the ECDSA wallet. Chainalysis estimates that:
about 20% of all Bitcoins have been "lost", or in other words are sitting in wallets whose keys are inaccessible
An example is the notorious hard disk in the garbage dump. A sufficiently powerful quantum computer could recover the lost keys.
The incentive for it to happen suddenly is that, even if Nakamoto's fix were in place, someone with access to the first sufficiently powerful quantum computer could transfer 20% of all Bitcoin, currently worth $460B, to post-quantum wallets they controlled. This would be a 230x return on the investment in PsiQuantum.
Privacy-preserving
privacy can still be maintained by breaking the flow of information in another place: by keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to anyone.
As an additional firewall, a new key pair should be used for each transaction to keep them from being linked to a common owner.
Some linking is still unavoidable with multi-input transactions, which necessarily reveal that their inputs were owned by the same owner. The risk is that if the owner of a key is revealed, linking could reveal other transactions that belonged to the same owner.
Nakamoto addressed the concern that, unlike DigiCash, because Bitcoin's blockchain was public it wasn't anonymous:
privacy can still be maintained by breaking the flow of information in another place: by keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to anyone.
This is true but misleading. In practice, users need to use exchanges and other services that can tie them to a public key.
There is a flourishing ecosystem of companies that deanonymize wallets by tracing the web of transactions. Nakamoto added:
As an additional firewall, a new key pair should be used for each transaction to keep them from being linked to a common owner.
funds in a wallet have to come from somewhere, and it’s not difficult to infer what might be happening when your known wallet address suddenly transfers money off to a new, empty wallet.
Some linking is still unavoidable with multi-input transactions, which necessarily reveal that their inputs were owned by the same owner. The risk is that if the owner of a key is revealed, linking could reveal other transactions that belonged to the same owner.
I have steered clear of the financial risks of cryptocurrencies. It may appear that the endorsement of the current administration has effectively removed their financial risk. But the technical and operational risks remain, and I should note another technology-related risk.
That risk is a crash of the AI bubble. History shows a fairly strong and increasing correlation between equities and cryptocurrencies, so if an AI crash drags down equities, cryptocurrencies will get dragged down too. The automatic liquidation of leveraged long positions in DeFi will start, causing a self-reinforcing downturn. Periods of heavy load such as this tend to reveal bugs in IT systems, and especially in "smart contracts", as their assumptions of adequate resources and timely responses are violated.
Experience shows that Bitcoin's limited transaction rate and the fact that the Ethereum computer that runs all the "smart contracts" is 1000 times slower than a $50 Raspberry Pi 4[24] lead to major slow-downs and fee spikes during panic selling, exacerbated by the fact that the panic sales are public[25].
Conclusion
The fascinating thing about cryptocurrency technology is the number of ways people have developed, and how much they are willing to pay, to avoid actually using it. What other transformative technology has had people desperate not to use it?
The whole apparatus of TradFi has been re-erected on top of this much worse infrastructure, including exchanges, closed-end funds, ETFs, rehypothecation, and derivatives. Clearly, the only reason for doing so is to escape regulation and extract excess profits from what would otherwise be crimes.
The paper library system is a model fault-tolerant system: it is highly replicated and decentralized. Libraries cooperate via inter-library loan and copying to deliver a service that is far more reliable than any individual library.
The root problem with conventional currency is all the trust that's required to make it work. The central bank must be trusted not to debase the currency, but the history of fiat currencies is full of breaches of that trust. Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve. We have to trust them with our privacy, trust them not to let identity thieves drain our accounts. Their massive overhead costs make micropayments impossible.
The problem with this ideology is that trust (but verify) is an incredibly effective optimization in almost any system. For example, Robert Putnam et al's Making Democracy Work: Civic Traditions in Modern Italy shows that the difference between the economies of Northern and Southern Italy is driven by the much higher level of trust in the North.
Bitcoin's massive cost is a result of its lack of trust. Users pay this massive cost but they don't get a trustless system, they just get a system that makes the trust a bit harder to see.
In response to Nakamoto's diatribe, note that:
"trusted not to debase the currency", but Bitcoin's security depends upon debasing the currency.
"waves of credit bubbles", is a pretty good description of the cryptocurrency market.
"massive overhead costs". The current cost per transaction is around $100.
I rest my case.
The problem of trusting mining pools is actually much worse. There is nothing to stop pools coordinating. In 2017 Vitalik Buterin, co-founder of Ethereum, published The Meaning of Decentralization:
In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently. If any one actor gets more than 1/3 of the mining power in a proof of work system, they can gain outsized profits by selfish-mining. However, can we really say that the uncoordinated choice model is realistic when 90% of the Bitcoin network’s mining power is well-coordinated enough to show up together at the same conference?
In all, it seems unlikely that up to nine major bitcoin mining pools use a shared custodian for coinbase rewards unless a single entity is behind all of their operations.
The "single entity" is clearly Bitmain.
Peter Ryan, a reformed Bitcoin enthusiast, noted another form of centralization in Money by Vile Means:
Bitcoin is anything but decentralized: Its functionality is maintained by a small and privileged clique of software developers who are funded by a centralized cadre of institutions. If they wanted to change Bitcoin’s 21 million coin finite supply, they could do it with the click of a keyboard.
His account of the politics behind the argument over raising the Bitcoin block size should dispel any idea of Bitcoin's decentralized nature. He also notes:
By one estimate from Hashrate Index, Foundry USA and Singapore-based AntPool control more than 50 percent of computing power, and the top ten mining pools control over 90 percent. Bitcoin blogger 0xB10C, who analyzed mining data as of April 15, 2025, found that centralization has gone even further than this, “with only six pools mining more than 95 percent of the blocks.”
The Bitmain S17 comes in 4 versions with hash rates from 67 to 76 TH/s; let's assume 70 TH/s. As I write, the Bitcoin hash rate is about 1 billion TH/s, so if they were all mid-range S17s there would be around 15M rigs mining. If their economic life were 18 months, there would be 77,760 rewards in that time. Thus only about 0.5% of them would ever earn a reward.
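A quick back-of-the-envelope check of that arithmetic (the 70 TH/s per rig and the 18-month economic life are the assumptions stated above):

# Back-of-the-envelope check of the footnote's arithmetic.
network_hashrate_ths = 1e9          # ~1 billion TH/s, as quoted above
rig_hashrate_ths = 70               # assumed mid-range S17
rigs = network_hashrate_ths / rig_hashrate_ths        # ~14.3 million rigs

blocks_per_hour = 6                 # one block reward roughly every ten minutes
economic_life_days = 18 * 30        # ~18-month economic life
rewards = blocks_per_hour * 24 * economic_life_days   # 77,760 rewards

print(f"rigs ~ {rigs / 1e6:.1f}M, rewards in that time {rewards:,}, "
      f"share of rigs ever rewarded ~ {rewards / rigs:.2%}")   # ~0.5%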
In December 2021 Alex de Vries and Christian Stoll estimated that:
The average time to become unprofitable sums up to less than 1.29 years.
It has been obvious since mining ASICs first hit the market that, apart from access to cheap or free electricity, there were two keys to profitable mining:
Having close enough ties to Bitmain to get the latest chips early in their 18-month economic life.
Having the scale to buy Bitmain chips in the large quantities that get you early access.
See David Gerard's account of Steve Early's experiences accepting Bitcoin in his chain of pubs in Attack of the 50 Foot Blockchain Page 94.
The share of U.S. consumers who report using cryptocurrency for payments—purchases, money transfers, or both—has been very small and has declined slightly in recent years. The light blue line in Chart 1 shows that this share declined from nearly 3 percent in 2021 and 2022 to less than 2 percent in 2023 and 2024.
User DeathAndTaxes on Stack Exchange explains the 6 block rule:
p is the chance of attacker eventually getting longer chain and reversing a transaction (0.1% in this case). q is the % of the hashing power the attacker controls. z is the number of blocks to put the risk of a reversal below p (0.1%).
So you can see if the attacker has a small % of the hashing power 6 blocks is sufficient. Remember 10% of the network at the time of writing is ~100GH/s. However if the attacker had greater % of hashing power it would take increasingly longer to be sure a transaction can't be reversed.
If the attacker had significantly more hashpower say 25% of the network it would require 15 confirmation to be sure (99.9% probability) that an attacker can't reverse it.
For example, last May Foundry USA had more than 30% of the hash power, so the rule should have been 24 not 6, and finality should have taken 4 hours.
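For reference, these block counts come from the attacker-success calculation in section 11 of the Bitcoin white paper. A minimal sketch of that calculation, which reproduces the thresholds quoted above, looks like this:

# Minimal sketch of the attacker-success probability from section 11 of the
# Bitcoin white paper: q is the attacker's share of the hash power, z is the
# number of blocks the honest chain is ahead (the confirmations waited for).
from math import exp, factorial

def attacker_success(q: float, z: int) -> float:
    """Probability an attacker with hash share q ever overtakes an honest lead of z blocks."""
    p = 1.0 - q
    lam = z * (q / p)    # expected attacker progress while the honest chain mines z blocks
    prob = 1.0
    for k in range(z + 1):
        poisson = exp(-lam) * lam ** k / factorial(k)
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob

# Reproduces the thresholds quoted above: keeping the risk below 0.1% needs
# z=5 at q=0.10, z=15 at q=0.25, and z=24 at q=0.30.
for q, z in [(0.10, 5), (0.25, 15), (0.30, 24)]:
    print(f"q={q:.2f} z={z:2d} P={attacker_success(q, z):.6f}")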
We show quantitatively how transaction atomicity increases the arbitrage revenue. We moreover analyze two existing attacks with ROIs beyond 500k%. We formulate finding the attack parameters as an optimization problem over the state of the underlying Ethereum blockchain and the state of the DeFi ecosystem. We show how malicious adversaries can efficiently maximize an attack profit and hence damage the DeFi ecosystem further. Specifically, we present how two previously executed attacks can be “boosted” to result in a profit of 829.5k USD and 1.1M USD, respectively, which is a boost of 2.37× and 1.73×, respectively.
They predicted an upsurge in attacks since "flash loans democratize the attack, opening this strategy to the masses". They were right, as you can see from Molly White's list of flash loan attacks.
This is one of a whole series of Impossibilities, many imposed on Ethereum by fundamental results in computer science because it is a Turing-complete programming environment.
For details of the story behind Miners' Extractable Value (MEV), see these posts:
The first links to two must-read posts. The first is from Dan Robinson and Georgios Konstantopoulos, Ethereum is a Dark Forest:
It’s no secret that the Ethereum blockchain is a highly adversarial environment. If a smart contract can be exploited for profit, it eventually will be. The frequency of new hacks indicates that some very smart people spend a lot of time examining contracts for vulnerabilities.
But this unforgiving environment pales in comparison to the mempool (the set of pending, unconfirmed transactions). If the chain itself is a battleground, the mempool is something worse: a dark forest.
(i) 73% of private transactions hide trading activity or re-distribute miner rewards, and 87.6% of MEV collection is accomplished with privately submitted transactions, (ii) our algorithm finds more than $6M worth of MEV profit in a period of 12 days, two thirds of which go directly to miners, and (iii) MEV represents 9.2% of miners' profit from transaction fees.
Furthermore, in those 12 days, we also identify four blocks that contain enough MEV profits to make time-bandit forking attacks economically viable for large miners, undermining the security and stability of Ethereum as a whole.
When they say "large miners" they mean more than 10% of the power.
Our key insight is that with only transaction fees, the variance of the miner reward is very high due to the randomness of the block arrival time, and it becomes attractive to fork a “wealthy” block to “steal” the rewards therein.
The leading source of data on which to base Bitcoin's carbon footprint is the Cambridge Bitcoin Energy Consumption Index. As I write their central estimate is that Bitcoin consumes 205TWh/year, or between Thailand and Vietnam.
AntPool and BTC.com are fully-owned subsidiaries of Bitmain. Bitmain is the largest investor in ViaBTC. Both F2Pool and BTC.TOP are partners of BitDeer, which is a Bitmain-sponsored cloud-mining service. The parent companies of Huobi.pool and OkExPool are strategic partners of Bitmain. Jihan Wu, Bitmain’s founder and chairman, is also an adviser of Huobi (one of the largest cryptocurrency exchanges in the world and the owner of Huobi.pool).
See Who Is Mining Bitcoin? for more detail on the state of mining and its gradual obfuscation.
In this context to say you "control" your entry in the bank's ledger is an oversimplification. You can instruct the bank to perform transactions against your entry (and no-one else's) but the bank can reject your instructions. For example if they would overdraw your account, or send money to a sanctioned account. The key point is that your ownership relationship with the bank comes with a dispute resolution system and the ability to reverse transactions. Your cryptocurrency wallet has neither.
Web3 is Going Just Great is Molly White's list of things that went wrong. The cumulative losses she tracks currently stand at over $79B.
Your secrets are especially at risk if anyone in your software supply chain uses a build system implemented with AI "vibe coding". David Gerard's Vibe-coded build system NX gets hacked, steals vibe-coders' crypto details is a truly beautiful example of the extraordinary level of incompetence this reveals.
Molly White's Abuse and harassment on the blockchain is an excellent overview of the privacy risks inherent to real-world transactions on public blockchain ledgers:
Imagine if, when you Venmo-ed your Tinder date for your half of the meal, they could now see every other transaction you’d ever made—and not just on Venmo, but the ones you made with your credit card, bank transfer, or other apps, and with no option to set the visibility of the transfer to “private”. The split checks with all of your previous Tinder dates? That monthly transfer to your therapist? The debts you’re paying off (or not), the charities to which you’re donating (or not), the amount you’re putting in a retirement account (or not)? The location of that corner store right by your apartment where you so frequently go to grab a pint of ice cream at 10pm? Not only would this all be visible to that one-off Tinder date, but also to your ex-partners, your estranged family members, your prospective employers. An abusive partner could trivially see you siphoning funds to an account they can’t control as you prepare to leave them.
In The Risks Of HODL-ing I go into the details of the attack on the parents of Veer Chetal, who had unwisely live-streamed the social engineering that stole $243M from a resident of DC.
Whether AI delivers net value in most cases is debatable. "Vibe coding" is touted as the prime example of increasing productivity, but the experimental evidence is that it decreases productivity. Kate Niederhoffer et al's Harvard Business Review article AI-Generated "Workslop" Is Destroying Productivity explains one effect:
Employees are using AI tools to create low-effort, passable looking work that ends up creating more work for their coworkers. On social media, which is increasingly clogged with low-quality AI-generated posts, this content is often referred to as “AI slop.” In the context of work, we refer to this phenomenon as “workslop.” We define workslop as AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task.
Here’s how this happens. As AI tools become more accessible, workers are increasingly able to quickly produce polished output: well-formatted slides, long, structured reports, seemingly articulate summaries of academic papers by non-experts, and usable code. But while some employees are using this ability to polish good work, others use it to create content that is actually unhelpful, incomplete, or missing crucial context about the project at hand. The insidious effect of workslop is that it shifts the burden of the work downstream, requiring the receiver to interpret, correct, or redo the work. In other words, it transfers the effort from creator to receiver.
Unfortunately, this article pretends to be a writeup of a study — but it’s actually a promotional brochure for enterprise AI products. It’s an unlabeled advertising feature.
And goes on to explain where the workslop comes from:
Well, you know how you get workslop — it’s when your boss mandates you use AI. He can’t say what he wants you to use it for. But you’ve been told. You’ve got metrics on how much AI you use. They’re watching and they’re measuring.
Return on investment has evaded chief information officers since AI started moving from early experimentation to more mature implementations last year. But while AI is still rapidly evolving, CIOs are recognizing that traditional ways of recognizing gains from the technology aren’t cutting it.
Tech leaders at the WSJ Leadership Institute’s Technology Council Summit on Tuesday said racking up a few minutes of efficiency here and there don’t add up to a meaningful way of measuring ROI.
Given the hype and the massive sunk costs, admitting that there is no there there would be a career-limiting move.
we picked ten historical bubbles and assessed them on factors including spark, cumulative capex, capex durability and investor group. By our admittedly rough-and-ready reckoning, the potential AI bubble lags behind only the three gigantic railway busts of the 19th century.
For now, the splurge looks fairly modest by historical standards. According to our most generous estimate, American AI firms have invested 3-4% of current American GDP over the past four years. British railway investment during the 1840s was around 15-20% of GDP. But if forecasts for data-centre construction are correct, that will change. What is more, an unusually large share of capital investment is being devoted to assets that depreciate quickly. Nvidia’s cutting-edge chips will look clunky in a few years’ time. We estimate that the average American tech firm’s assets have a shelf-life of just nine years, compared with 15 for telecoms assets in the 1990s.
I think they are over-estimating the shelf-life. Like Bitcoin mining, power is a major part of AI opex. Thus the incentive to (a) retire older, less power-efficient hardware, and (b) adopt the latest data-center power technology, is overwhelming. Note that Nvidia is moving to a one-year product cadence, and even when they were on a two-year cadence Jensen claimed it wasn't worth running chips from the previous cycle. Note also that the current generation of AI systems is incompatible with the power infrastructure of older data centers, and this may well happen again in a future product generation. For example, Caiwei Chen reports in China built hundreds of AI data centers to catch the AI boom. Now many stand unused:
The local Chinese outlets Jiazi Guangnian and 36Kr report that up to 80% of China’s newly built computing resources remain unused.
An AI-bubble crash could be different. AI-related investments have already surpassed the level that telecom hit at the peak of the dot-com boom as a share of the economy. In the first half of this year, business spending on AI added more to GDP growth than all consumer spending combined. Many experts believe that a major reason the U.S. economy has been able to weather tariffs and mass deportations without a recession is because all of this AI spending is acting, in the words of one economist, as a “massive private sector stimulus program.” An AI crash could lead broadly to less spending, fewer jobs, and slower growth, potentially dragging the economy into a recession.
In 2021 Nicholas Weaver estimated that the Ethereum computer was 5000 times slower than a Raspberry Pi 4. Since then the gas limit has been raised making his current estimate only 1000 times slower.
if people do start dumping blockchain-based assets in fire sales, everyone will know immediately because the blockchain is publicly visible. This level of transparency will only add to the panic (at least, that’s what happened during the run on the Terra stablecoin in 2022).
...
We also saw ... that assets on a blockchain can be pre-programmed to execute transactions without the intervention of any human being. In good times, this makes things more efficient – but the code will execute just as quickly in bad situations, even if everyone would be better off if it didn’t.
When things are spiraling out of control like this, sometimes the best medicine is a pause. Lots of traditional financial markets close at the end of the day and on weekends, which provides a natural opportunity for a break (and if things are really bad, for emergency government intervention). But one of blockchain-based finance’s claims to greater efficiency is that operations continue 24/7. We may end up missing the pauses once they’re gone.
In the 26th September Grant's, Joel Wallenberg notes that:
Lucrative though they may be, the problem with stablecoin deposits is that exposure to the crypto-trading ecosystem makes them inherently correlated to it and subject to runs in a new “crypto winter,” like that of 2022–23. Indeed, since as much as 70% of gross stablecoin-transaction volume derives from automated arbitrage bots and high-speed trading algorithms, runs may be rapid and without human oversight. What may be worse, the insured banks that could feed a stablecoin boom are the very ones that are likely to require taxpayer support if liquidity dries up, and Trump-style regulation is likely to be light.
So the loophole in the GENIUS Act for banks is likely to cause contagion from cryptocurrencies via stablecoins to the US banking system.
Acknowledgments
This talk benefited greatly from critiques of drafts by Hilary Allen, David Gerard, Jon Reiter, Joel Wallenberg, and Nicholas Weaver.
The Open Government Partnership (OGP) Global Summit is a key gathering for government representatives, civil society leaders, and open government advocates from around the world. It serves as a crucial platform for sharing ideas, forging partnerships, and setting the agenda for more transparent, participatory, and accountable governance. Our team and key allies will be on...
Our study aims to address this gap by analysing TV news broadcasts for references to social media. These references can be in many representations - on-screen text, verbal mentions by the hosts/speakers, or visual objects displayed on the screen (Figure 1). In this GSoC project, our focus was on the visual objects, specifically detecting social media logos and screenshots of user posts appearing in TV news.
Figure 1: Different representations of social media on TV news (On-screen text, verbal mentions by the hosts/speakers, and visual objects such as logos and screenshots)
Datasets
TV News Data
Internet Archive’s TV News Archive provides access to over 2.6 million U.S. news broadcasts dating back to 2009. Each broadcast at the TV News Archive is uniquely identified by an episode ID that encodes the channel, show name, and timestamp.
A show within the TV News Archive is represented by 1-minute clips, each accessible via a URL with arguments in the path, where time is specified in seconds:
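For illustration only, such a clip URL might look roughly like this (an assumed format based on the description above, not taken from the original post):

https://archive.org/details/CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon/start/120/end/180

where start and end mark the clip boundaries in seconds from the beginning of the broadcast.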
For our visual detection tasks, we used the full-resolution frames extracted every second throughout the entirety of the broadcast, provided by Dr. Kalev Leetaru of The GDELT Project.
Our Sample Data (Selected Episodes)
Between 2020 and 2024, we sampled one day per year during primetime hours (8–11pm) across three major cable news channels (Fox News, MSNBC, and CNN), resulting in 45 episodes.
After excluding 9 episodes consisting of documentaries or special programs (which fall outside the scope of regular prime-time news coverage), the final sample contained 36 news episodes. Of these, 15 were from Fox News, 12 from MSNBC, and 9 from CNN. Table 1 presents the full list of episodes, with the excluded episodes highlighted in red.
Gold Standard Dataset
To create the gold standard dataset, we labeled every 1-second frame of each episode with the presence of a logo and/or screenshot, along with the mentioned social media platform name. The labeling process was facilitated by a previously compiled dataset, which I had created through manual review of TV news broadcasts. In that dataset, each 60-second clip was annotated for any social media references, including text, host mentions, or visual elements such as logos and screenshots. By cross-referencing these annotations, we constructed the gold standard dataset. The gold standard dataset includes only those frames that contain at least one social media reference (either a logo or a screenshot), rather than every second of an episode.
Below is a snippet of the gold standard for CNN episodes (Figure 2).
CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon is the episode ID, and 000136.jpg indicates the frame taken at the 136th second of that 60-minute episode.
The Logo and Screenshot columns indicate their presence, while their Type columns specify the platform.
For example, the entry CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg shows that the frame contains both a Twitter logo and a Twitter screenshot.
The complete gold standard datasets for all three channels can be accessed via the following links:
We implemented a system to automatically detect social media logos and user post screenshots in television news images using the ChatGPT API (GPT-4o is a multimodal model capable of processing both text and image input). The workflow is summarized below.
1. System Setup
We accessed the API of the GPT-4o model (using an access token provided by the Internet Archive) to process image frames and return structured text output.
Image-to-Text Pipeline:
Input: TV news frame image in .jpg format.
Output: Structured CSV file containing the fields:
Social Media Logo (Yes/No)
Logo Detection Confidence (0–1)
Social Media Logo Type (e.g., Instagram, Twitter (bird logo), X (X logo))
Social Media Post Screenshot (Yes/No)
Screenshot Detection Confidence (0–1)
Social Media Screenshot Type (e.g., Instagram, Twitter)
2. Image Preprocessing
We extracted the full-resolution frames (per second) of each episode for processing. The raw frames were provided as a .tar file per episode.
Since video content such as TV news broadcasts often contains long segments of visually static or near-identical scenes, processing every extracted frame independently can introduce significant redundancy and is computationally expensive. To address this, we applied perceptual hashing to detect and eliminate duplicate or near-duplicate frames.
We used the Python library ImageHash (with average hashing) to reduce the number of frames that need to be processed (code). To measure how close two frames were, we calculated the Hamming distance between their hashes. A low Hamming distance means the frames are almost the same, while a higher value means they are more different. By setting a threshold t (for example, treating frames with a distance of t ≤ 5 as duplicates), we were able to keep just one representative frame from a group of similar ones.
To identify duplicate groups, we defined a threshold parameter t, such that any two frames with a Hamming distance ≤ t were considered equivalent. Within each group of near-duplicates, only a single representative frame was retained. We evaluated multiple thresholds (t=5,4,3). We also explored whether keeping the middle frame or the last frame from a group of similar frames made any difference in the results. While these choices did not significantly impact our initial findings, it is an aspect that requires further investigation and will be considered as part of future work.
For the final configuration of the deduplication process, we used t=3 as a relatively strict threshold to minimize the chance of discarding distinct, relevant content. Within each group, we retained the middle frame, guided by the intuition that the last frame of a group often coincides with transition boundaries (e.g., cuts, fades), whereas the middle frame is less likely to be affected.
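As a rough sketch of this deduplication step (assuming consecutive frames are compared in order; the project's actual grouping code may differ), the following Python uses ImageHash's average hashing and a Hamming-distance threshold t, and keeps the middle frame of each run of near-duplicates:

# Sketch only: group consecutive near-duplicate frames by average hash
# and keep the middle frame of each group.
from pathlib import Path
from PIL import Image
import imagehash

def dedupe_frames(frame_dir, t=3):
    frames = sorted(Path(frame_dir).glob("*.jpg"))
    groups, current, prev_hash = [], [], None
    for frame in frames:
        h = imagehash.average_hash(Image.open(frame))
        # A Hamming distance greater than t starts a new group
        if prev_hash is not None and (h - prev_hash) > t:
            groups.append(current)
            current = []
        current.append(frame)
        prev_hash = h
    if current:
        groups.append(current)
    # Retain the middle frame of each group of near-duplicates
    return [group[len(group) // 2] for group in groups]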
3. Prompt Design and Iterations
To automatically detect social media logos and user post screenshots using the ChatGPT API, we designed a structured prompt. We iteratively refined the prompt over seven versions (link to commits) to ensure strict and reproducible detection of social media logos and user post screenshots. Each version introduced improvements, and the changes made in each version are documented in the commit descriptions.
A major change was made from prompt version 3 (link to v3) to prompt version 4 (link to v4). The update narrowed the task to focus strictly on logo and user post screenshot detection. Previous versions included additional attributes such as textual mentions of social media, post context, and profile mentions, but version 4 and subsequent versions disregarded these elements, emphasizing visual detection only.
After several iterations and refinements based on the results of earlier versions, the final prompt we used was version 7 (link to v7).
The final version of the prompt instructed the model to output the following fields:
Social Media Logo (Yes/No)
Logo Detection Confidence (0–1)
Social Media Logo Type
Social Media Post Screenshot (Yes/No)
Screenshot Detection Confidence (0–1)
Social Media Screenshot Type
The final prompt reflects the following considerations:
Scope of platforms
Only considered the following social media platforms for detection:
Facebook, Instagram, Twitter (bird logo), X (X logo), Threads, TikTok, Truth Social, LinkedIn, Meta, Parler, Pinterest, Rumble, Snapchat, YouTube, and Discord. These are the platforms that appeared in the gold standard.
Explicitly excluded other platforms including messaging apps like WhatsApp or Messenger.
Logo detection rules
Only count the official graphical logos. Text mentions of platform names within the image were explicitly not considered as logos.
Logos had to match official design, color, and proportions.
Only count logos that are clearly identifiable. Any partial, ambiguous, or unclear elements were excluded.
Specific rules were added for X: only consider the stylized ‘X’ logo of the social media platform, excluding other uses of ‘X’ (e.g., ‘X’ in FOX News logo, Xfinity logo)
Post screenshot rules
Instructed the model to mark only actual user post screenshots, not interface elements like buttons, menus, or platform logos.
Visual cues such as profile pictures, usernames, timestamps, reactions, and layout elements could indicate a screenshot. However, these features alone do not guarantee that the image is an actual post screenshot.
Confidence Scores
For both logos and screenshots, we prompted the model to provide a confidence score from 0 to 1, indicating how certain it is about its detection. These scores were recorded but not yet used in the analysis; they will be considered in future work.
4. API Interaction
Each request consisted of a single user message containing: (1) the analysis prompt (text instructions), and (2) the image (base64-encoded) as an inline image_url.
Figure 3 shows a snippet of the code used to encode images and send requests to the API (full code).
max_tokens=1000, # Set a reasonable max_tokens for response length
temperature=0.2
)
content = response.choices[0].message.content
parsed_fields = parse_response(content)
successful_request = True
break # If successful, exit the outer retry loop
Figure 3: A snippet of the code
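To make the request pattern in Figure 3 easier to follow, here is a minimal, self-contained sketch of the same kind of call (assuming the OpenAI Python SDK; the detection prompt and the retry loop of the actual script are omitted):

# Sketch only: send one frame plus the analysis prompt to GPT-4o and
# return the model's raw text response.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_frame(image_path, prompt):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
        max_tokens=1000,
        temperature=0.2,
    )
    return response.choices[0].message.content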
5. Response Parsing and Output
After the API returns a response for each frame, we parse the model’s output into a CSV file for each episode containing all six fields as listed in the prompt design and iterations section. We used a flexible regex-based parser that extracts all fields reliably, even if the model’s formatting varies slightly (L93-L159 of code).
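As an illustration of what such tolerant parsing can look like (a sketch only; the parser at L93-L159 of the linked code is the authoritative version), the idea is to match each "Field: value" pair even when spacing or markdown decoration varies:

# Sketch only: pull the six expected fields out of the model's free-form reply.
import re

FIELDS = [
    "Social Media Logo",
    "Logo Detection Confidence",
    "Social Media Logo Type",
    "Social Media Post Screenshot",
    "Screenshot Detection Confidence",
    "Social Media Screenshot Type",
]

def parse_response(text):
    parsed = {}
    for field in FIELDS:
        # Accept "Field: value" or "Field - value", ignoring case and stray markdown
        m = re.search(rf"{re.escape(field)}\s*[:\-]\s*(.+)", text, re.IGNORECASE)
        parsed[field] = m.group(1).strip().strip("*") if m else "N/A"
    return parsed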
Next, we cleaned the ChatGPT output (code). The script standardizes file paths and normalizes the binary columns (Social Media Logo and Social Media Post Screenshot) by converting variations of “Yes”, “No”, and “N/A” into a consistent format. It also normalizes platform names, replacing standalone “X” with “Twitter (X)” and updating Twitter bird logos to “Twitter”, to align with the labels in the gold standard dataset for evaluation.
After cleaning each episode’s results, we combined them into a single CSV file per channel (code). The script iterates through all individual CSV files for a given channel and merges them into one consolidated CSV.
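A condensed sketch of this cleanup and per-channel merge (pandas assumed; the binary and Type column names follow the prompt fields, while file locations are illustrative):

# Sketch only: normalize the ChatGPT output columns and merge one channel's CSVs.
import glob
import pandas as pd

def clean(df):
    yes_no = {"yes": "Yes", "no": "No", "n/a": "N/A"}
    for col in ["Social Media Logo", "Social Media Post Screenshot"]:
        df[col] = df[col].astype(str).str.strip().str.lower().map(yes_no).fillna("No")
    # Align platform labels with the gold standard
    platform_map = {"X": "Twitter (X)", "Twitter (bird logo)": "Twitter"}
    for col in ["Social Media Logo Type", "Social Media Screenshot Type"]:
        df[col] = df[col].replace(platform_map)
    return df

def merge_channel(csv_dir, out_path):
    parts = [clean(pd.read_csv(p)) for p in sorted(glob.glob(f"{csv_dir}/*.csv"))]
    pd.concat(parts, ignore_index=True).to_csv(out_path, index=False)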
6. Evaluation
We started small, restricting our initial tests to a single news episode: Fox News at Night With Shannon Bream, March 13, 2020, 8-9 PM (results). This allowed us to experiment with different prompts before scaling to the full dataset. Across these runs, we varied both the prompt structure (Prompt v1–v4) and the decoding temperature (0.0 and 0.2). The decoding temperature controls randomness in LLM output: lower values (such as 0.0 and 0.2) are more deterministic, higher values more creative. At temperature 0.0, the output is essentially greedy; the same input will likely produce the same output. For the final version, we used a temperature of 0.2 to allow some flexibility to interpret edge cases without introducing instability.
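For reference, per-frame detections can be scored against the gold standard along these lines (a sketch assuming scikit-learn and binary Yes/No labels per frame for one task; the project's evaluation scripts may differ in detail):

# Sketch only: accuracy, precision, recall, and F1 for one detection task.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(gold_labels, predicted_labels):
    acc = accuracy_score(gold_labels, predicted_labels)
    p, r, f1, _ = precision_recall_fscore_support(
        gold_labels, predicted_labels, pos_label="Yes", average="binary"
    )
    return {"Accuracy": acc, "Precision": p, "Recall": r, "F1-score": f1}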
Single Episode Evaluation Results
For logo detection, results from the single episode show a clear improvement across prompt versions.
Prompt v1 (Runs 1–7): The baseline instruction set produced very high recall but extremely low precision, with many false positives (results). For example, in Run 4, the model achieved a recall of 0.9155 but precision of only 0.1167, yielding an overall F1-score of 0.2070.
Prompt v2 (Run 8): Refining the prompt substantially reduced false positives, increasing precision to 0.1700 while recall remained high at 0.9577 (results).
Prompt v3 (Run 9): Further refinements to the prompt yielded a significant improvement in balance: precision rose to 0.3571 while recall remained strong (0.9155), resulting in an F1-score of 0.5138 (results).
Prompt v4 (Run 10): Explicitly narrowing the scope to only logos and screenshots, without any questions related to additional context, improved our results drastically (results). This change increased precision (0.9315) while maintaining high recall (0.9577), producing near-perfect accuracy (0.9978) and an overall F1-score of 0.9444.
Table 2 shows the key results for logo detection. Results show a clear trajectory of improvement across prompt versions (v1–v4).
For screenshot detection (Table 3), performance was consistently perfect for this episode. The model maintained 100% accuracy, precision, recall, and F1-score across all versions. This also suggests that screenshot detection is a relatively straightforward task compared to logo detection, at least for this particular episode.
Version              Accuracy   Precision   Recall   F1-score
Run 4 (Prompt v1)    0.8640     0.1167      0.9155   0.2070
Run 8 (Prompt v2)    0.9085     0.1700      0.9577   0.2887
Run 9 (Prompt v3)    0.9664     0.3571      0.9155   0.5138
Run 10 (Prompt v4)   0.9978     0.9315      0.9577   0.9444
Table 2: Logo detection key results on a single episode (temperature = 0.2)
Version              Accuracy   Precision   Recall   F1-score
Run 4 (Prompt v1)    1.0000     1.0000      1.0000   1.0000
Run 8 (Prompt v2)    1.0000     1.0000      1.0000   1.0000
Run 9 (Prompt v3)    1.0000     1.0000      1.0000   1.0000
Run 10 (Prompt v4)   1.0000     1.0000      1.0000   1.0000
Table 3: Screenshot detection key results on a single episode (temperature = 0.2)
With appropriate constraints, the model could reliably perform logo detection, while screenshot detection required minimal intervention.
All Episodes Evaluation Results
After establishing stability with Prompt v4, we scaled to all 36 episodes across three channels (results). Performance metrics for logo detection are provided in Table 4, and those for screenshot detection are shown in Table 5.
Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9912     0.1705      0.9565   0.2895
FOX News    0.9903     0.5039      0.9324   0.6542
MSNBC       0.9931     0.5238      0.9649   0.6790
Table 4: Performance metrics for logo detection (Prompt version 4, all episodes)
Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9984     0.5405      0.8696   0.6667
FOX News    0.9922     0.1750      1.0000   0.2979
MSNBC       0.9986     0.8022      0.9605   0.8743
Table 5: Performance metrics for screenshot detection (Prompt version 4, all episodes)
The results showed:
CNN: Very high recall (>0.95) but extremely low precision for logo detection (0.1705), leading to a weak F1-score (0.2895). This reflects over-detection: the model flagged many non-logo elements as logos.
Fox News: Precision and recall were more balanced (precision 0.5039, recall 0.9324), producing an F1 of 0.6542.
MSNBC: The best performer, with precision = 0.5238, recall = 0.9649, and F1 = 0.6790.
For screenshot detection, MSNBC again outperformed, with an F1 of 0.8743. CNN (F1 = 0.667) and Fox News (F1 = 0.298) were more prone to over-detection.
The same model and prompt performed better on MSNBC content. This may be related to differences in on-screen visual style, such as clearer or less ambiguous logo and screenshot cues, but this remains speculative and warrants further study.
We made further refinements to the prompt to improve precision:
Prompt v5 (changes): This version of the prompt sets a fixed list of valid platforms, adds confidence scores for detections, and tightens logo detection rules with stricter visual checks.
Prompt v6 (changes): Explicit X logo rules were introduced, which reduced false positives. We further clarified the confidence score instructions to ensure consistent numeric outputs for all detections. We also refined the screenshot criteria to include only user posts, reducing mislabeling; this marked the first substantial prompt change for screenshots, as their performance had previously been consistently stable. From Prompt v5 to Prompt v6, CNN saw a slight drop in logo precision but improved screenshot F1, while MSNBC showed minor gains in screenshot detection with stable logo performance.
Prompt v7 (changes): This final configuration produced the most stable results across channels. It simplifies the X logo rules by removing exceptions for black, white, or inverted colors, while keeping strict guidance to avoid other confusable logos (the X in the Xfinity logo, the FOX News logo, or other stray X letters). It also clarifies the confidence score questions to always require a numeric answer between 0 and 1, explicitly prohibiting “N/A” responses for consistency, and explicitly states that OCR-detected platform names are not counted as logos.
Results from Prompt version 4 (Tables 4 and 5) to Prompt version 5 (Tables 6 and 7) show improved precision across all channels while maintaining high recall. Specifically,
CNN: logo F1 increased from 0.29 to 0.46.
Fox News: logo F1 improved from 0.65 to 0.73.
MSNBC: logo F1 rose from 0.68 to 0.78.
Screenshot detection remained stable. Overall, version 5 produced more balanced detections, reducing over-detection, particularly for logos (results).
Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9954     0.3056      0.9167   0.4583
FOX News    0.9941     0.6595      0.8138   0.7286
MSNBC       0.9966     0.6800      0.9239   0.7834
Table 6: Performance metrics for logo detection (Prompt version 5, all episodes)
Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9982     0.5263      0.8696   0.6557
FOX News    0.9926     0.1818      1.0000   0.3077
MSNBC       0.9986     0.7714      0.9474   0.8504
Table 7: Performance metrics for screenshot detection (Prompt version 5, all episodes)
Results from the final prompt version (Prompt version 7) are shown in Table 8 (for logos) and Table 9 (for screenshots). CNN shows an increase in logo F1 from 0.46 to 0.51 and screenshot F1 from 0.66 to 0.70. Fox News experiences a slight decrease in logo F1 (0.73 to 0.69) but an improvement in screenshot precision (0.30 to 0.39). MSNBC achieves a logo F1 of 0.89 (up from 0.78) and a screenshot F1 of 0.91 (up from 0.85). This version achieved the best balance of precision and recall, particularly for MSNBC (results). However, this also shows that no single prompt configuration is optimal for all channels; some adjustments may be required to maximize performance per channel.
Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9965     0.3529      0.8889   0.5053
FOX News    0.9926     0.5963      0.8298   0.6940
MSNBC       0.9980     0.8535      0.9306   0.8904
Table 8: Performance metrics for logo detection (Prompt version 7, all episodes)
Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9986     0.5789      0.8800   0.6984
FOX News    0.9945     0.2456      1.0000   0.3944
MSNBC       0.9988     0.8496      0.9697   0.9057
Table 9: Performance metrics for screenshot detection (Prompt version 7, all episodes)
Overall, these results underscore the value of iterative prompt engineering, temperature tuning, and task-specific constraints in achieving high-quality, reproducible detection outcomes in multimedia content.
Future Work
Several directions remain for extending this work.
Prompt Refinement and Channel-Specific Tuning: We will continue refining the analysis prompt to increase accuracy and consistency in detecting social media logos and user post screenshots. Early observations suggest that performance varies across channels (and programs), likely due to differences in how each channel visually presents social media. This indicates that channel- or program-specific prompt tuning could further enhance results.
Decoding Temperature Exploration: While our experiments primarily used low decoding temperatures (0.0 and 0.2), future work can explore a range of temperatures to evaluate whether controlled increases in randomness improve recall in edge cases without significantly raising false positives.
Frame Selection Strategies: We conducted preliminary observations using different Hamming distance thresholds (t=5,4,3) to group similar frames and experimented with selecting the first, middle, or last frame from each group. While these initial explorations provided some insights, they were not systematically analyzed. Future work will investigate the effects of different frame selection strategies to determine the optimal approach for reducing redundancy without losing relevant content.
Confidence Scores: The confidence scores for logos and screenshots (ranging from 0 to 1) were recorded but not yet utilized. Future work will explore integrating these scores into the analysis to weigh detections and potentially improve precision.
Dataset Expansion: Future work includes manually labeling additional episodes from multiple days of prime-time TV news to expand the gold standard dataset. This will uncover more instances of social media references. We will also be able to evaluate the performance of our logo and screenshot detection pipeline across diverse broadcast content.
Advertisement Filtering: With access to advertisement segments, we plan to exclude ad images before the evaluation step. This will improve our results, as currently, the pipeline includes ads, so ChatGPT may label social media references in ads that are not annotated in the gold standard. As a result, some apparent false positives are actually correct detections, highlighting the need to filter ads for accurate evaluation.
Complementary Detection Methods: In addition to logo and screenshot detection, future work will focus on other approaches such as analyzing OCR-extracted text from video frames and analysing closed-caption transcripts for social media references.
Compare Against Other Multimodal Models: We aim to explore other vision-language APIs, such as Google’s Gemini Pro, to compare detection performance across different Large Language Models (LLMs).
Acknowledgement
I sincerely thank the Internet Archive and the Google Summer of Code Program for providing this amazing opportunity. In particular, I would like to thank Sawood Alam, Research Lead, and Will Howes, Software Engineer, at the Internet Archive’s Wayback Machine for their guidance and mentorship. I also acknowledge Mark Graham, Director of the Wayback Machine at the Internet Archive, and Roger Macdonald, Founder of the Internet Archive’s TV News Archive, for their invaluable support. I am grateful to the TV News Archive team for welcoming me into their meetings, which allowed me to gain a deeper understanding of the archive and its work. I am especially grateful to Kalev Leetaru (Founder, the GDELT Project) for providing the necessary Internet Archive data, which were processed through the GDELT Project. Finally, I would like to thank my PhD advisors, Dr. Michele Weigle and Dr. Michael Nelson (Old Dominion University), and Dr. Alexander Nwala (William & Mary) for their continued guidance.
One of the key things aspiring librarians are taught is "the reference interview". The basic premise is that often when people ask for help or information, what they ask for isn't what they actually want. They ask for books on Spain, when they actually want to understand the origins of the global influenza pandemic of 1918-20. They ask if you have anything on men's fashion, when they want to know how to tie a cravat. They ask if you have anything by Charles Dickens, when they are looking for a primer on Charles Darwin's theory of evolution. The "reference interview" is a technique to gently ask clarifying questions so as to ensure that you help the library user connect to what they are really looking for, rather than what they originally asked.
Sometimes this vagueness is deliberate – perhaps they don't want the librarian to know which specific medical malady they are suffering from, or they're embarrassed about their hobby. Sometimes it's because people "don't want to bother" librarians, who they perceive have more important things to do than our literal job of connecting people to the information and cultural works they are interested in – so they'll ask a vague question hoping for a quick answer. Often it's simply that the less we know about something, the harder it is to formulate a clear question about it (like getting our Charleses mixed up). Some of us are merely over-confident that we will figure it out if someone points us in vaguely the right direction. But for many people, figuring out what it is we actually want to know, and whether it was even a good question, is the point.
I was thinking about this after reading Ernie Smith's recent post about Google AI summaries, which are at the centre of a legal case brought against Alphabet Inc by Penske Media. Ernie asks, rhetorically:
Does Google understand why people look up information?
I thought this was an interesting question, because – in the context of the rest of the post – the implication is that Google does not understand why people look up information, despite their gigantic horde of data about what people search for and how they behave after Google loads a page in response to that query. How could this be? Isn't "behavioural data" supposed to tell us about people's "revealed preferences"? Can analytics from billions of searches really be wrong? Maybe if we compare Google's approach to centuries of library science we might find out.
In this work, Wilson described bibliographic work as two powers: Descriptive and Exploitative. The first is the evaluatively neutral description of books called Bibliographic control. The second is the appraisal of texts, which facilitates the exploitation of the texts by the reader.
Professor Hope Olson memorably called this "descriptive power" the power to name. The words we use to describe our reality have a material impact on shaping that reality, or at least our relationship with it. In August this year the Australian Stock Exchange accidentally wiped $400 million off the market value of a listed company because they mixed up the names of TPG Telecom Limited and TPG Capital Asia. Defamation law continues to exist because of the power of being described as a criminal or a liar. Description matters.
Web search too originally sought to describe, and on that basis to return likely results for your search.
Different search engines have approached the creation of indexes using their own strategies, but the purpose was always to map individual web pages both to concepts or keywords and – since Google – to each other. A key difference between web search engines and the systems created and used by libraries is that the latter make use of controlled metadata, whereas the former cannot and do not. Google in particular has made various attempts at outsourcing the work of creating more structured metadata and even controlled vocabularies. All have struggled to varying degrees on the basis that ordinary people creating websites aren't much interested in and don't know how to do this work (for free), and businesses can't see much, if any, profit in it. At least, their own profit.
Whilst there is a widespread feeling that Google search used to be much better, the fact that "Search Engine Optimisation" became a concept soon after the creation of web search engines points to the fundamental limitation of uncontrolled indexes. Librarians describe what creative works are about, thus connecting items to each other through their descriptions. Search engines approach the problem from the other direction: describing the connections between works, thus inferring what concepts they are most strongly associated with. Are either of these approaches really "evaluatively neutral"?
Long held up as a core tenet of librarianship, neutrality or objectivity has been fiercely debated in recent years. Mirroring similar criticisms of journalistic standards, many have pointed out that "objectivity" in social matters very often simply means upholding the status quo. We are social animals. Nothing we can say about ourselves is ever "neutral" or objective, because everything is contested and relational. Yet humans on the whole are anthropocentric in how we see the world. We are a species that frequently thinks it can see the literal face of god in a piece of toast. Everything we say about anything can be seen as being about us, whether it's the mating habits of penguins or the movement of celestial bodies.
So as soon as we recognise Descriptive Power, arguing over semantics is inevitable. In democratic systems an enormous amount of effort is expended attempting to move the Overton Window. Was Australia "settled" or "invaded" in 1788? Are certain countries "poor", "developing", "in the third world" or "from the global south"? Is a political movement "conservative", "alt-right" or "fascist"? Is it still "folk music" if it's played with an electric guitar? We argue about these descriptions because they also say something about how we see ourselves and want to be seen by others. When description has power, you can't really be "neutral" when wielding it. Much like judges or umpires, when using the power to name the librarian's task is to be fair and factual. We are members of the Reality Based Community.
Memory holes
Mita goes on to explore the idea that the library profession has generally seen our job as beginning and ending with descriptive power:
Library catalogues don’t tell you what is true or not. While libraries facilitate claims of authorship, we do not claim ownership of the works we hold. We don’t tell you if the work is good or not. It is up to authors to choose to cite in their bibliographies to connect their work with others and it is up to readers to follow the citation trails that best suit their aims.
Here is where "neutrality" comes in. We might describe a book as being about something, written by a particular person, or connected to other works and people. But we make no claims as to the veracity of the assertions you might read within it. Assessing the truth or artistry of a certain work is, as they say, "left as an exercise for the reader". This is where the (not always exercised) professional commitment to reader privacy comes in. If it's up to the reader to glean their own meaning from the works in our collections, then we can't know and should not assume their purpose or their conclusions. And if people are to be given the freedom to explore these meanings, they can't have the fear of persecution for reading "the wrong things" hanging over their heads.
We can see an almost perfect case study of this clash of approaches in the debacle of Clarivate's "Alma Research Assistant". Aaron Tay lays this out clearly in The AI powered Library Search That Refused to Search:
Imagine a first‑year student typing “Tulsa race riot” into the library search box and being greeted with zero results—or worse, an error suggesting the topic itself is off‑limits. That is exactly what Jay Singley reported in ACRLog when testing Summon Research Assistant, and it matches what I’ve found in my own tests with the functionally similar Primo Research Assistant (both are by Clarivate)...According to Ex Libris, the culprit is a content‑filtering layer imposed by Azure OpenAI, the service that underpins Summon Research Assistant and its Primo cousin.
Want to research something that might be "controversial"? I can't let you do that Dave, computer says no. What's worse in the case of Primo Research Assistant is that rather than declaring to the user that it won't search, the system is designed to simply memory hole any relevant results and claim that it cannot find anything.
A library is a provocation
Whilst they are often associated with a certain stodginess, every library is a provocation. With all these books, how could you restrict yourself to reading only one or two? Look how many different ideas there are. See how many ways there are to describe our worlds. The thing that differentiates a library from a random pile of books is that all these ideas, concepts and stories are organised. They have indexes, catalogues, and classification systems. They are arranged into subjects, genres, and sub-collections. The connections between them and the patterns they map are made legible.
The pattern of activity that digital networks, ranging from the internet to the web, encourage is building connections, the creation of more complex networks. The work of making connections both among websites and in a person’s own thinking is what AI chatbots are designed to replace.
The most exciting developments in library science right now are exploring not how to provide "better answers" but rather how to provide richer opportunities to understand and map the connections between things. With linked data and modern catalogue interfaces we can overlay multiple ontologies onto the same collection, making different kinds of connections based on different ways of understanding the world.
LLMs are mid
Clarifying the connections between people and works. Disambiguating names. Mapping concepts to works, and re-mapping and organising them in line with different ontological understandings. All of this requires precision. Because they under-estimate the skill of manual classification and indexing, and over-estimate their own technologies, AI Bros thought they could combine Wilson's two powers. But description for the purpose of identifying information sources is an exact science. This is why we have invested so much energy and time in things like controlled vocabularies, authority files, and the ever-growing list of Persistent Identifiers (PIDs) like DOI, ISNI, and ORCID. It's why we've been slowly and carefully talking about Linked Open Data.
And then these arrogant little napoleons think they can just YOLO it.
The problem is not the core computing concepts behind machine learning but rather the implementation details and the claims made about the resulting systems. I agree with Aaron Tay's take on embedding vector search – this is an incredibly compelling idea that has been stretched far beyond what it is most useful for. Transformer models of any kind are ultimately intended to find the most probable match to an input. Not the best match. Not a perfect match. Not a complete list of matches. An arbitrary number of "most likely" results. In contrast to the exactness of library science, these new approaches are merely averaging devices.
Answering machines
Let us now return to Ernie Smith's question: Does Google understand why people look up information?
Google thinks people look up information in order to find "the most likely answer". The company has to think like this, because the key purpose of Google's search tool is to sell audiences to advertisers via automated, hundred-millisecond-long auctions for the purpose of one-shot advertising. Everything else we observe about how it works stems from this key fact about their business. Endless listicles and slop generated by both humans and machines is the inevitable result. And since the purpose of the site is simply to display ads, why not make it "more efficient" by providing an "answer" immediately?
What I think Ernie is gesturing at is that when we search for information what we often want is to know what other people think. We want to explore ideas and expand our horizons. We want to know how things are connected. We want to understand our world. When you immediately answer the question you think someone asked instead of engaging with them to work out what they are actually looking for, it's likely to be unhelpful. Asking good questions is harder than giving great answers.
Chat bots and LLMs can't solve the problem. They're just guessing.
Image created using Open Peeps by Pablo Stanley and Team. Licensed under CC0
The second Library Search Benchmark Survey launched earlier this year and we gathered feedback from students, faculty, staff and library employees about finding and accessing materials through Library Search. We also measured changes in user satisfaction, ease of use, user challenges and wins from our first survey in 2022. Participation from users outside the library grew significantly and responses helped us identify clear and actionable insights we’re excited to share and act on in the coming year.
What is unique about Dagstuhl is that it is located in a wood, very isolated from big cities. The nearest town, Wadern, is 30 minutes on foot, and there is no public transportation. After landing at the huge Frankfurt airport, one has to take the train for 2 hrs from the regional train station to Türkismühle. Then you must take a taxi because there is no public transportation. Based on the recommendation of the Dagstuhl website, I had to reserve a taxi from Taxi Martin at least 3 days in advance. Of course, I had to reserve it again when I left at 4 am!
The whole building contains two parts: an old building, which was built 260 years ago, and a new building, built in 2001. The two buildings are connected with a bridge on the second floor. In addition to lecture rooms and living rooms, the old building has a dining room, a kitchen, a piano room, and a backyard, which is very nice for afternoon teas and outdoor discussions. In addition to lectures and living rooms, the new building has a laundry room, a sauna, and a gym. Both buildings are covered by Wifi. Coffee (including espresso drinks), sparkling water, and wine are available 24/7. The center has everything you need for research, except cars. You have to walk if you need to get out.
The core organization team consists of five people around the world, including Hannah Bast (chair, Germany), Marcel Ackermann (coordinator, Germany), Guillaume Cabanac (co-chair, France), Paolo Manghi (co-chair, Italy), and me (co-chair, United States). We started drafting the proposal back in March 2023. The original goal was to celebrate the 32nd anniversary of DBLP, a legendary digital library for computer and information sciences. Later on, the plan evolved to assemble about 40 scholars to discuss a broader topic: open scholarly information systems (OSIs). To carefully control the size of the seminar and guarantee attendance, the organization team sent invitations in at least 3 rounds. The invited people included PIs, core technical people, and directors of well-known digital libraries (e.g., Google Scholar, arXiv, CORE, OpenAIRE, OpenReview, NDLTD, and CiteSeerX), researchers in particular domains (e.g., natural language processing, semantic web, digital libraries, information retrieval), and companies working with open scholarly data (e.g., Digital Science).
Unlike computer science conferences and workshops, the seminar was organized around short talks (10–20 min), with plenty of time for plenary and small-group discussions, and social activities. The final report is collaboratively edited. The activities each day are outlined below.
Perspectives on How to Achieve the Sustainability of Open Scholarly Infrastructure (OSI)
Software tagging in DBLP
Nanopublications to track acknowledgements of DBLP/ Dagstuhl
Find Ghost #96
Take one topic as an example. The manifesto contains the following.
Topic: Perspectives on How to Achieve the Sustainability of Open Scholarly Infrastructure (OSI) (See also Barcelona Declaration, arguments for open scholarly infrastructure )
People Involved: Jian Wu, Petr Knoth, Lynda Hardman, Carole Goble, Bianca Kramer, Daniel Mietchen, Martin Fenner, Nees Jan van Eck, Paolo Manghi, Mario Petrella, etc.
Summary of Outcome:
In general, the group discussion on OSI sustainability indicated that this problem has drawn the attention of the open scholarship community, and affirmed the initial concerns in a broader context. The main outcomes are itemized below.
Several OSIs (e.g., CiteSeerX, NDLTD, CORE) are facing sustainability challenges in the foreseeable or longer-term future due to financial, administrative, and/or human resource issues, which may lead to the loss of valuable data, software, and services to the scholarly communities.
In future proposals to secure financial support for OSIs, PIs should emphasize how the OSIs can bring social and economic impacts that are well aligned with national priorities (e.g., digital sovereignty, security, AI) or the priorities of private funding institutions, instead of only justifying the needs from the researchers' point of view.
Consolidation of under-supported OSIs may be necessary to sustain the data, software, and services. PIs should also consider business models (e.g., donations, subscriptions) to support the growth and maintenance of OSIs.
New OSIs should avoid duplicating data and services of existing OSIs and have a longer vision of sustainability.
Next Steps:
Collaboration with other OSIs (e.g., CORE) for research and developmental work on scholarly big data.
The last topic is "Find Ghost #96", which originated from a tradition of capturing the "Ghost" located everywhere in the Dagstuhl Castle! If you want to know the details, come and visit Dagstuhl!
The seminar was highly rated by the participants, not only because of the free food, drink, and lodging but also because of the short talks and ample time for free-form discussions. Thanks to the chair, Hannah Bast! This format proved more efficient for scholars to have in-depth discussions that go into detail and cover many aspects of concrete problems. This is in contrast with most computer and information science conferences, with their pre-scheduled 20-minute or 30-minute presentations, which appear to cover lots of material but in fact make it easy for attendees to get bored. These presentations are usually followed by a very short amount of time for Q&A and discussion. Most of the time, the session chair has to call time and move the discussion "offline", which rarely actually happens.
Information is essential, and these days so readily available through the internet that we all suffer from a surfeit of it. But Information without selection and critique is like a relationship without love: everything you need is there, except the central element. And without it, nothing has real value. (Patterson, 2025, pp. 56–57).
Patterson, I. (2025). Books: A Manifesto. London: Weidenfeld & Nicolson.
ETRA 2025 focuses on all aspects of eye movement research across a wide range of disciplines and brings together computer scientists, engineers, and behavioral scientists to advance eye-tracking research and applications.
Dr. Yukie Nagai, Project Professor at The University of Tokyo, delivered the first keynote of ETRA 2025, titled “How People See the World: An Embodied Predictive Processing Theory”. Dr. Nagai introduced a neuro-inspired theory of human visual perception based on embodied predictive processing. She explained how sensorimotor learning drives the development of visual perception and attention, providing new insights into the mechanisms underlying how people see and interpret the world.
🧠 An Embodied Predictive Processing Theory
Incredible keynote by Prof. Nagai from @UTokyo_News!
From computational neuroscience to understanding how neurodiverse individuals perceive the world differently - her research bridges robotics, cognition, and eye tracking beautifully. pic.twitter.com/W53XYftCjK
— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 27, 2025
Tutorial Highlight
Dr. Andrew T. Duchowski delivered a tutorial “Gaze Analytics: A Data Science Perspective.” The session covered a wide range of topics, from PsychoPy setup to advanced transition entropy analysis. Participants gained hands-on experience with the complete pipeline, engaging directly with both methodology and stimulus generation. This tutorial gave attendees practical exposure to data science approaches in gaze analytics, reinforcing its value for eye-tracking research.
🎯 #ETRA2025 Tutorial Highlight: "Gaze Analytics: A Data Science Perspective" by Prof. Andrew T. Duchowski covered everything from PsychoPy setup to advanced transition entropy analysis! 📊 Participants got hands-on experience with the complete pipeline 👁️🗨️ #EyeTracking pic.twitter.com/K8ua4v0uAC
The first paper session featured groundbreaking research covering diverse applications of eye tracking. Presentations included work on Large Language Model (LLM) alignment analysis, laser-based eye tracking for smart glasses, and methods for detecting expertise through gaze patterns. The session demonstrated how eye-tracking research continues to expand its scope, integrating with fields such as artificial intelligence, wearable technology, and cognitive modeling. The variety of topics sparked engaging discussions, highlighting the breadth and depth of current innovations in the community.
🔬 Capturing the moment: Paper Session 1 featured groundbreaking work spanning LLM alignment analysis, laser-based eye tracking for smart glasses, and expertise detection through gaze patterns. The diversity of applications demonstrates the expanding scope of our field. #ETRA2025 pic.twitter.com/I9knZDOFQK
Dr. Gavindya, a postdoctoral fellow at The University of Texas at Austin working with Dr. Jacek Gwizdka, presented her research on advancing real-time measures of visual attention through the ambient/focal coefficient K, titled "A Real-Time Approach to Capture Ambient and Focal Attention in Visual Search". Their work introduced a robust parametrization and alternative estimation method, along with two new real-time measures analogous to K.
Building on her broader research interests in Eye-Tracking, Human-Computer Interaction, Human-Information Interaction, Data Science, and Machine Learning, Gavindya’s presentation demonstrated how neuro-physiological evidence and cognitive load detection approaches can be integrated into applied systems.
Johannes Meyer presenting his work titled "Ambient Light Robust Eye-Tracking for Smart Glasses Using Laser Feedback Interferometry Sensors with Elongated Laser Beams" at @ETRA_conference
Virmarie Maquiling presenting her work "Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images" at @ETRA_conference in collaboration with Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati, @EnkelejdaKasne1 #ETRA2025 🇯🇵 pic.twitter.com/NLfuWnARTV
— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 28, 2025
Sponsor Exhibitions
The sponsors showcase highlighted cutting-edge developments in eye-tracking technology. Exhibitors demonstrated a range of innovative solutions, including immersive VR setups, wearable eye-tracking devices, and advanced software platforms. Attendees had the chance to interact directly with the technology, exploring applications that bring the future of eye tracking to life. These exhibitions not only showcased sponsor contributions but also emphasized the vital role industry partnerships play in driving forward eye-tracking research and applications.
🔬 Exploring cutting-edge eye tracking solutions at our sponsor showcase! Huge thanks to all our amazing sponsors. Their technology demonstrations are bringing the future of eye tracking to life! Don't miss the amazing sponsor exhibitions at #ETRA2025! pic.twitter.com/epBPkIXWsx
Dr. Jean-Marc Odobez, a senior researcher at IDIAP and EPFL, and Head of the Perception and Activity Understanding Group, delivered the second keynote, “Looking Through Their Eyes: Decoding Gaze and Attention in Everyday Life.” Dr. Odobez, a leading expert in multimodal perception systems and co-founder of Eyeware SA, presented his team’s work in the areas of gaze analysis and visual focus of attention. His keynote showcased how gaze data can be decoded to better understand naturalistic interactions, emphasizing both the technical challenges and practical applications of attention modeling in real-world settings.
@ETRA_conference Day 2, excellent keynote by Jean-Marc Odobez @Idiap_ch "Looking Through Their Eyes: Decoding Gaze and Attention in Everyday Life", some of the work in the areas of gaze and Visual Focus of Attention. #etra2025 pic.twitter.com/nOcgJciOsj
🔬 Fascinating keynote by Prof. Odobez exploring the challenges of estimating visual attention in natural environments. His work demonstrates how we can decode not just where people look, but what they're truly attending to in complex scenes. #ETRA2025 #EyeTracking #HCI #Tokyo pic.twitter.com/wo0vpioEmv
The first poster session showcased 36 innovative research studies highlighting the role of data, machine learning, and AI in eye-tracking applications. The session featured diverse approaches, ranging from neural networks for gaze prediction to advanced analytics methods for interpreting eye-movement data. The room was filled with lively discussions as researchers exchanged ideas, explained methodologies, and received feedback from peers and experts.
📊 Poster Session 1 at #ETRA2025 featured outstanding research on Data, ML, and AI applications in eye tracking. Researchers presented 36 innovative studies, from neural networks for gaze prediction to advanced analytics methods. Great discussions throughout the session. #AI #HCI pic.twitter.com/UCZHabvSqW
The study opened up a promising line of inquiry into how eye-movement data can serve as non-invasive indicators for health conditions, particularly in detecting hypoglycemia during nighttime. Pahan’s poster presentation drew attention from conference attendees, sparking discussions on the medical applications of eye tracking beyond traditional HCI contexts. His contribution marked an important step in extending the reach of eye-tracking research into healthcare and biomedical domains.
Excellent job presenting his first publication at the Premier ACM conference for eye tracking titled "Nocturnal Diabetic Hypoglycemia Detection Using Eye Tracking", a new way of looking at diabetic hypoglycemia using eye tracking https://t.co/Jwh9akYHTw pic.twitter.com/N3l2nxi6RP
The second paper session emphasized the importance of methodological rigor in eye-tracking research. Presentations ranged from evaluating detection algorithms to applying LLMs for cognitive processing tasks. These studies highlighted how careful methodological design provides the analytical foundations for deriving meaningful insights from eye movement data. The session demonstrated both technical depth and practical relevance, reinforcing the role of rigorous analysis in advancing the reliability and impact of eye-tracking research.
📈 Paper Session 2 highlighted the importance of methodological rigor in eye tracking research. From evaluating detection algorithms to leveraging LLMs for cognitive processing, these studies provide the analytical foundations that enable insights from eye movement data. #ETRA2025 pic.twitter.com/6kQp9EgvaK
— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025
Workshop Session: GenAI Meets Eye Tracking
The GenAI Workshop focused on the intersection of generative AI and eye-tracking research. The session began with a keynote by Xi Wang on “Decoding Human Behavior Through Gaze Patterns”, which explored how gaze data can be harnessed to understand complex aspects of human behavior.
Following the keynote, the workshop featured contributions from Gjergji Kasneci, Enkelejda Kasneci, Aranxa Villanueva, and Yusuke Sugano. Together, the speakers emphasized how generative AI techniques can advance eye-tracking applications, including cognitive modeling, behavioral prediction, and new opportunities for human-computer interaction.
The workshop was well-attended and highly interactive, bringing together perspectives from academia and industry. It highlighted how AI-driven methods are shaping the future of gaze research and opened discussions about challenges in integrating these tools into practical applications.
Happening now, the first session of #GenEAI at @ETRA_conference with three papers on 1) enhancing the medical domain, 2) detecting learning disorders and finally, 3) enhancing eye tracking with LLM insights #ETRA2025 🇯🇵 pic.twitter.com/fefhn7GPR7
— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025
Starting session 2 of the #GenEAI workshop at @ETRA_conference with a keynote by Xucong Zhang on
— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025
Ending the #GenEAI workshop at @ETRA_conference with awards — congratulations to Quoc-Toan Nguyen on their best paper award titled "Learning Disorder Detection Using Eye Tracking: Are Large Language Models Better Than Machine Learning?" #ETRA2025 🇯🇵 pic.twitter.com/dkh7kuPkc8
— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025
Papers Session 3
The third paper session highlighted innovative Computer Vision approaches applied to eye-tracking research. Presentations explored a range of groundbreaking topics, including iris style transfer for privacy preservation and zero-shot pupil segmentation with SAM 2 applied to over 14 million images. These studies showcased how modern vision-based techniques can push the boundaries of both data-driven analysis and practical applications in eye tracking.
By combining advanced AI methods with large-scale datasets, the session emphasized the critical role of computer vision in addressing challenges of scalability, accuracy, and privacy within the field. Researchers also demonstrated how these developments could translate into real-world solutions, reinforcing the strong connection between technical innovation and human-centered applications in eye-tracking research.
💻 Paper Session 3 showcased innovative Computer Vision approaches in eye tracking research at #ETRA2025. The session featured groundbreaking work from iris style transfer for privacy preservation to zero-shot pupil segmentation using SAM 2 across 14 million images. #HCI #CV pic.twitter.com/LAgYukrQlp
ETRA 2025 wasn’t just about research and presentations—it also provided memorable opportunities for networking and community building. One of the highlights was the conference banquet, held aboard a dinner cruise in Tokyo Bay. Attendees enjoyed an evening of fine Japanese cuisine, sake, and vibrant discussions while taking in breathtaking views of the city skyline.
The social events gave researchers, practitioners, and students a chance to relax, connect, and exchange ideas in a more informal setting. These moments of shared meals and conversations strengthened the sense of community within the eye-tracking research field, ensuring that collaborations extend beyond the technical sessions into long-lasting professional relationships.
🚢✨ What an incredible evening at the #ETRA2025 conference banquet! Our dinner cruise in Tokyo Bay was the perfect setting to celebrate eye tracking research with colleagues from around the world. Amazing Japanese cuisine, sake, and unforgettable conversations! 🍶🗾 #EyeTracking pic.twitter.com/V8JQR36VJS
ETRA 2025 emphasized inclusivity and sustainability even during mealtimes. Attendees were served Halal-friendly, vegan, and regular bento options, ensuring that everyone had accessible and culturally sensitive food choices. The beautifully prepared bento boxes showcased a variety of Japanese flavors while catering to diverse dietary needs.
We're serving HALAL-friendly, VEGAN, and regular bento options for everyone 🌱✨ Please enjoy your meal and help us stay sustainable by disposing of trash in designated areas afterward. Arigatou gozaimasu! 🙏 #Tokyo #Sustainability pic.twitter.com/j7YekD6DcV
🥂 Last night's reception was the perfect start to #ETRA2025! Our attendees enjoyed a wonderful spread of food and beverages while networking with fellow researchers. Nothing beats good food and great conversations to kick off four days of cutting-edge eye tracking research! 🍰🍷 pic.twitter.com/7cRWesl6Mv
🍣 Fresh sushi being crafted right before our eyes at the #ETRA2025 reception! Our talented sushi chefs brought authentic Japanese culinary artistry to welcome our international researchers. Nothing says "Welcome to Tokyo" like freshly made sushi! 🇯🇵👨🍳 #Tokyo #EyeTracking pic.twitter.com/qyREnrDvBJ
Beyond the formal sessions, networking and coffee breaks also fostered an environment for informal yet impactful exchanges. They provided a space where researchers, students, and industry professionals gathered to share ideas, sketch concepts on whiteboards, and even run quick demos on their phones. These animated discussions often sparked new collaborations and innovative research directions.
🎧 When your General Chair @_ysugano doubles as the DJ! 🎵 The #ETRA2025 reception had the perfect soundtrack for networking and lively discussions. Nothing like great music to set the mood for connecting researchers from around the globe. Academic conferences with style! 🔥 pic.twitter.com/pDO8QDLEt4
In conclusion, ETRA 2025 in Tokyo was an inspiring event that showcased the latest advancements in eye-tracking research. The conference fostered meaningful discussions, collaborations, and the exchange of ideas, setting the stage for future developments. As we look ahead, the excitement for ETRA 2026 in Marrakesh promises even more groundbreaking research and opportunities for growth. The experiences gained from this event will undoubtedly shape the future of eye-tracking technology and research.
Lawrence Obiuwevwi
Graduate Research Assistant
Virginia Modeling, Analysis, & Simulation Center
Department of Computer Science
Old Dominion University, Norfolk, VA 23529
Email: lobiu001@odu.edu
Web: lawobiu.com
This presentation is an update on the Language Data Commons of Australia (LDaCA) technical architecture for the LDaCA Steering Committee meeting of 22 August 2025, written by members of the LDaCA team (me, Moises Sacal, and Ben Foley) and edited by Bridey Lea. This version has the slides we presented and our notes, edited for clarity. There's a more compact version of this up on the LDaCA site.
The architecture for LDaCA has not changed significantly for the last couple of years. We are still basing our design on the PILARS protocols.
This presentation will report on some recent developments, mostly in behind-the-scenes improvements to our software stack. It will give a brief refresh of the principles behind the LDaCA approach, and talk about our decentralised approach to data management and how it fits with the metadata standards we have been developing for the last few years. We will also show how the open source tools used across LDaCA’s network of collaborators are starting to be harmonised and shared between services, reducing development and maintenance costs and improving sustainability.
The big news is a new RO-Crate API ("An RO-Crate API", hence AROCAPI), which offers a standardised interface to PILARS-style storage where data is stored as RO-Crates, organised into "Collections" of "Objects" according to the Portland Common Data Model (PCDM) specification, which is built in to RO-Crate.
A concrete example is that PARADISEC will implement different authentication routes (using the existing "Nabu" catalog) than the LDaCA data portal, which uses CADRE (REMS, https://www.elixir-finland.org/en/aai-rems-2/).
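To make the Collection/Object structure concrete, here is a minimal sketch, built as a plain Python dictionary, of how a crate's metadata might describe a Collection with two member Objects. This is not taken from the AROCAPI specification; the type and property names (RepositoryCollection, RepositoryObject, hasMember, memberOf) are assumptions based on the PCDM-style collections profile.

import json

# Hypothetical sketch only: a crate whose root dataset is a PCDM-style Collection
# with two member Objects. Type/property names are assumptions, not the AROCAPI spec.
crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": ["Dataset", "RepositoryCollection"],
            "name": "Example language collection",
            "hasMember": [{"@id": "#object-001"}, {"@id": "#object-002"}],
        },
        {
            "@id": "#object-001",
            "@type": ["Dataset", "RepositoryObject"],
            "name": "Recording session 001",
            "memberOf": {"@id": "./"},
        },
        {
            "@id": "#object-002",
            "@type": ["Dataset", "RepositoryObject"],
            "name": "Recording session 002",
            "memberOf": {"@id": "./"},
        },
    ],
}

print(json.dumps(crate, indent=2))

In a PILARS-style store, each Collection and Object would carry metadata along these lines alongside its payload files, whatever portal or API sits in front of it.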
Promising discussions are taking place with one of our partners about taking on LDaCA data long-term (instead of having to distribute the collections across partner institutions). This would give a consolidated basis for a Language Data repository and a broader Humanities data service.
This slide shows the LDaCA execution strategy. All of the strands (Collect & organise, Conserve, Find, Access, Analyse, Guide) are relevant to the technical architecture.
From the very beginning of the project, the LDaCA architecture has been designed around the principle that to build a “Research Data Commons” we need to look after data above all else. We took an approach that considered long-term data management separately from current uses of the data.
This resulted in some design choices which are markedly different from those commonly seen in software development for research.
Effort was put into:
Organising and describing data using open specifications BEFORE building features into applications;
Designing an access-control system with long-term adaptability in mind (read the story about that as presented at eResearch Australasia 2022);
Batch-conversion of existing data to the new approach; and
Developing a metadata framework and tools to implement it.
With this foundation, and the new interoperability we gain from our collaboration on the AROCAPI API, we are well placed to move into a phase of rapid expansion of the data assets and of the workspace services built around them. For example:
The new LDaCA analytics forum will drive analytical workspaces
Work by the LDaCA technical team will continue to improve data preparation workspaces, possibly by collaborating to adapt the Nyingarn Workspace for general purpose use.
In 2024, we released the Protocols for Implementing Long Term Archival Repositories (PILARS), described in this 2024 presentation at Open Repositories. The first principle of PILARS is that data should be portable, not locked in to a particular interface, service or mode of storage. Following the lead of PARADISEC two decades ago, the protocols call for storing data in commodity storage services such as file systems or (cloud) object storage services. This means that data is available independently of any specific software.
For the rest of this presentation, we will focus on recent developments in the “Green zone” – the Archival Repository functions of the LDaCA architecture. We will not be talking about the analysis stream as that will be discussed in detail in the newly established Analytics Forum.
I (PT) wanted to throw in a personal story here. This is an unstaged picture of my (PT Sefton’s) garage this morning. The box of hard drives contains some old backups of mine just in case, and also my late father Ian Sefton’s physics education research data, stuff like student feedback from lab programs in the 80s trialling different approaches to teaching core physics concepts and extensive literature reviews. These HAVE been handed on to his younger colleagues but could easily have ended up only available here in this garage. I wanted to remind us all that this project is a once in a career opportunity to develop processes for organising data and putting it somewhere alongside other data in a Data Commons where (a) your descendants are not made responsible for it and put it in a box in the shed or chuck it in a skip; and (b) others can find it, use it (subject to the clear “data will” license permissions you left with the data to describe who should be allowed to do what with it), and build on your legacy.
Remember:
Storage is not data management (particularly if the storage is a shopping bag full of mistreated hard drives)
Passing boxes of storage devices hand to hand is NOT a good strategy to conserve data
Hard drives are not archives
The first principle of PILARS is that data should be portable, not locked-in to a particular mode of storage, interface or service. Following the lead of PARADISEC two decades ago, the protocols call for storing data in commodity storage services such as file systems or (cloud) object storage services. This means that data is available independently of any specific software. This diagram is a sketch of how this approach allows for a wide range of architectures – data stored according to the protocols can be indexed and served over an API (with appropriate access controls). Over the next few slides, we will show some of the architectures that have emerged over the last couple of years at LDaCA.
The first example is the LDaCA data portal, which is a central access-controlled gateway to the data that we have been collecting.
NOTE: during the project it has been unclear how we would look after data at its conclusion. No single organisation had put its hand up to host data for the medium to long term, but as noted in the News section we have had some positive talks with one of our partner institutions indicating that they may have an appetite for hosting data that otherwise does not have a home, and/or providing some redundancy for at-risk collections where data custodians are comfortable with a copy residing at the university (we won't say which one until negotiations are more advanced).
This slide shows a demo of two different portal designs accessing the same PARADISEC data, which has been accomplished using the new AROCAPI API. The API will speed development of new PILARS-compliant Research Data Commons deployments, using a variety of storage services and portals that can be adapted and "mixed and matched" via a common API.
Alongside the data portal, we have explored other ways of sharing data assets, including local distribution via portable computers such as Raspberry Pi with a local wireless network. We have also discussed establishing regional cooperative networks where communities reduce risk by holding data for each other.
With our partners, we have developed and adapted a suite of other technical resources, including:
Oni portal software for mid-to-large deployments. Version 1 is live and Version 2 is currently under development with PARADISEC, involving a new shared API and code base that can be used across LDaCA and beyond.
REMS overlaid with CADRE to manage access control for identified users. A service agreement between LDaCA and CADRE has been signed. REMS is still the backend of this tool, but CADRE's wrapper makes it more user-friendly. CADRE version 2 will replace the admin component of REMS and is in the testing phase now.
‘Corpus tools’ for migrating data from existing formats to LDaCA-ready RO-Crates are available on github. These reduce the cost of developing new migration tools by adapting existing corpus tools, provide reproducible migration processes and are a strong foundation for quality assurance checks.
Software libraries for managing data in RO-Crate and maintaining schemas, available on our GitHub organisation.
Nyingarn (focussed on creating searchable text from manuscripts)
Our next steps will involve a multi-modal workspace, for audio and video transcription.
This diagram shows how the PILARS principles have been implemented by different organisations. Each example uses open source software, and accepted standards for metadata and storage, meaning that data is portable.
This slide shows one potential view of LDaCA’s architecture in 2026. There may be an opportunity to deepen the collaboration between the UQ LDaCA team and the PARADISEC team at Melbourne, sharing the development of more code.
For example, Nyingarn's incomplete repository function could be provided by a stand-alone instance of the Oni portal or, as shown here, added to the LDaCA portal as a collection.
Likewise, the not-yet-built user-focussed data preparation functions of Nyingarn, where a user can describe an object and submit it, could be generalised for use in LDaCA.
Changes shown in this diagram:
Remove the “NOCFL” storage service from Nyingarn and replace with either OCFL or an Object Store solution
Upgrade Nyingarn workspace to be a generic data onboarding app for all kinds of data (rather than only manuscript transcription focus)
To conclude, we have an opportunity now to consider how the distributed LDaCA technical team can collaborate on key pieces of re-deployable infrastructure. This work is having an impact in other Australian Research Data Commons (ARDC) co-investments.
We at OKFN are questioning our own practices and learning how to do things anew. This new publication compiles some of these processes and invites you to put your community first.
On September 20th, 2025, I spoke on a panel at The Legacy of CCH Canadian Ltd. v. Law Society of Upper Canada and Future of Copyright Law Conference 2025. Here is my talk.
Smithsonian Institution building, from Wikimedia Commons
We are excited to announce today that the Library Innovation Lab has expanded our Public Data Project beyond datasets available through Data.gov to include 710 TB of data from the Smithsonian Institution — the complete open access portion of the Smithsonian’s collections. This marks an important step in our long-running mission to preserve large scale public collections both for our patrons and for posterity.
From the National Museum of American History. Creative Commons 0 License
The Smithsonian has had the mission, since its founding in 1846, to pursue “the increase and diffusion of knowledge.” In the past, this could only be done by visiting Smithsonian museums in person. Now that its collections are also digital, we are grateful to be able to do our part in preserving and sharing our nation’s cultural heritage.
Our initial collection includes some 5.1 million collection items and 710 TB of data. As is always our practice, we have cryptographically signed these items to ensure provenance and are exploring resilient techniques to share access to them, which we plan to launch in the future.
From the National Museum of African American History and Culture. Creative Commons 0 License
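The announcement doesn't describe the signing scheme, but as a rough sketch of what item-level fixity and provenance signing can look like (the Ed25519 key handling, item name, and byte content below are illustrative assumptions, not the Lab's actual pipeline):

import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Stand-in for one downloaded collection item (hypothetical name and bytes).
item_name = "smithsonian_item_0001.tif"
item_bytes = b"...image bytes would go here..."

# 1. Fixity: record a SHA-256 digest of the item in a manifest line.
digest = hashlib.sha256(item_bytes).hexdigest()
manifest_line = f"{item_name} sha256:{digest}".encode()

# 2. Provenance: sign the manifest line (in practice the key would be long-lived and managed).
private_key = Ed25519PrivateKey.generate()
signature = private_key.sign(manifest_line)

# 3. Anyone holding the public key can later verify the manifest has not been altered.
private_key.public_key().verify(signature, manifest_line)  # raises InvalidSignature on tampering
print(f"verified {item_name} sha256:{digest[:16]}...")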
The computing market is absolutely ablaze with AI-driven growth. Regardless of how sustainable it might be, companies are spending untold amounts of wealth on hardware, with most headlines revolving around GPUs. But the storage market is also under pressure, especially hard drive vendors who purportedly haven't done much to increase manufacturing capacity in a decade. TrendForce says lead times for high-capacity "nearline" hard drives have ballooned to over 52 weeks — more than a full year.
warning of "unprecedented demand for every capacity in [its] portfolio," and stating that it is raising prices on all of its hard drives.
The unprecedented demand from AI farms is because:
You don't just need the data required to run inference. You also need the history of everything to prove to regulators that you're not laundering bias, to retrain when new data comes in, and to roll back to a previous checkpoint if your fine-tuned model goes feral and, say, starts referring to itself as MechaHitler. This stuff can't go to offline storage until you're certain it isn't needed in the short term. But it's too big to live in the primary storage of all but the beefiest servers. Thus, the need for nearline hard drives.
WD's projection
At the meeting, Western Digital's Dave Landsman made the same point in his presentation "HDDs are here to stay", with this graph using data from IDC and TrendFocus. They are projecting that both disk and enterprise SSD will grow in the low 20%/year range, so the vast bulk of data in data centers will remain on disk. Landsman claims that SSDs are and will remain 6 times as expensive per bit as hard disk and that 81% of installed data center capacity is on hard disk.
Keeping the data on hard disk might actually be a good idea. "Sustainability in Datacenters" by Shruti Sethi presented a joint Microsoft/Carnegie Mellon study of the scope 2 (operational) and scope 3 (embodied) carbon emissions of compute, SSD and HDD racks in Azure's data centers. The study, "A Call for Research on Storage Emissions" by Sara MacAllister et al., concluded that:
an SSD storage rack has approximately 4× the operational emissions per TB of an HDD storage rack. Storage devices (SSDs and HDDs) are the largest single contributor of operational emissions. For SSD racks, storage devices account for 39% of emissions, whereas for HDD racks they account for 48% of emissions. These numbers contradict the conventional wisdom that processing units dominate energy consumption: storage servers carry so many storage devices that they become the dominant energy consumers.
...
SSD racks emit approximately 10× the embodied emissions per TB as that of HDD storage racks. The storage devices themselves dominate embodied emissions, accounting for 81% and 55% of emissions in SSD and HDD racks, respectively.
As usual, the authoritative word on the performance of the storage industry comes from IBM. Georg Lauhoff & Sassan Shahidi's Data Storage Trends: NAND, HDD and Tape Storage added another year's data points to their invaluable graphs and revealed that:
NAND areal density continues to increase rapidly, because 3D scales faster than the 2D of disk and tape.
Disk's 8%/year areal density increase continued, but note that, although their graph includes Seagate's 32TB HAMR drive, the effect of Seagate's (and later WD's) deployment of HAMR didn't really start until later in 2025.
Tape continued its 27%/year increase.
Coming from a tape supplier this comment isn't surprising but it is correct:
Despite the promise of alternative archive storage technologies, challenges persist, [underscoring the] enduring relevance of tape storage, which itself is rapidly evolving.
The main problem being that the huge investment and long time horizon needed to displace tape's 7% of the storage market can't generate the necessary return.
One fascinating graph shows the difference between demonstrations and products for tape and disk. I keep pointing out the very long timescales in the storage industry. In January's Storage Roundup I noted that HAMR was just starting to be deployed 23 years after Seagate demonstrated it. Lauhoff & Shahidi's graph shows that the current tape density was demo-ed in 2006 and shipped in 2022, and that disk's current density was demo-ed in 2012.
This graph reinforces that tape's roadmap is credible, but the good Dr. Pangloss noticed the optimism of the NAND and disk roadmaps. New technologies tend to scale faster at first, then slower as they age. So it is likely that the advent of HAMR will accelerate disk's areal density increase somewhat. And it is possible that the difficulty of moving from 3D NAND to 4D NAND will slow its increase.
Lauhoff & Shahidi's cost ratio graph shows that the relative costs of the different media were roughly stable. If Killian is right that the disk manufacturers are increasing prices and lengthening lead times because of demand from AI, this could be different in next year's graph. But Killian also notes that, despite the fact that QLC SSDs are at least "four times the cost per gigabyte":
Trendforce reports that memory suppliers are actively developing SSD products intended for deployment in nearline service. These should help bring costs down once they hit the market. But in the short term, we can expect the storage crunch to cause rising SSD prices as well, at least for enterprise drives.
Lauhoff & Shahidi's bit shipment graph is interesting for two reasons:
Disk's proportion of total bit shipments increased.
They started tracking the proportion of NAND flash bits shipped as SSDs; these represented only about 30% of disk's bit shipments. The claim that the bulk of data still lives on hard disk is true and looks set to continue. Disk ships mostly to the nearline enterprise market, while SSDs ship mostly to the online enterprise market. Disk is shipping nearly three times as many bits.
Shipments of tape storage media increased again in 2024, according to HPE, IBM, and Quantum – the three companies that back the Linear Tape-Open (LTO) Format.
The three companies on Tuesday claimed they shipped 176.5 Exabytes worth of tape during 2024, a 15.4 percent increase on 2023’s 152.9 Exabytes.
I have been writing skeptically about the medium-term prospects for DNA storage since 2012 and Lauhoff & Shahidi share my skepticism in their graph of the technology's progress in the lab. DNA can only compete in the archival storage market, so the relevant comparison is with LTO tape. Even if you believe Wang's estimate, DNA is more than ten million times too expensive.
DNA is a very dense storage medium and storage researchers have tried to use it for data storage, but without much success, because it’s hard to find info within DNA and read times are slow.
Jiang's team claims to have addressed that problem, establishing a sequence of data partitions on the tape and identifying each of these with a bar code.
The Shenzhen team's focus on reading is a misunderstanding of the fundamental requirements of the hyperscaler archive market, which are, in order:
a completely automated closed-loop operation involving addressing, recovery, removal, subsequent file deposition, and file recovery again, all accomplished within 50 min.
Jiang's team only wrote 156.6 kilobytes of data to a test tape for their experiment, consisting of four "puzzle pieces" depicting a Chinese lantern. If the data were damaged, it wouldn't assemble correctly. The researchers managed to recover the lantern image without issue, but it took two and a half hours, or roughly one kilobyte per minute.
Because archival data is guaranteed to be written but is very likely never read, the cost of storing data in DNA is dominated by synthesis. The team effectively admits that they can't compete.
Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.
Finance team, your challenge is to persuade the company to spend $24M a year for the next 10 years for a product that can then earn about $216M a year for 10 years.
Write bandwidth and cost remain the core problems of DNA storage, and while progress has been made in other areas, both are still many orders of magnitude away from competing with hard disk, let alone LTO tape.
Nevertheless, as I concluded more than seven years ago:
Traditional AI models are often viewed as “black boxes” whose decision-making processes are not revealed to humans, leading to a lack of trust and reliability. Explainable artificial intelligence (XAI) is a set of methods humans can use to understand the reasoning behind the decisions or predictions of models, which typically include SHAP, LIME, and PI. We will start from a basic model and then talk about those XAI methods.
Logistic regression
Logistic regression (LR) is a classification model, particularly for problems with a binary outcome. It maps the output of linear regression to the interval [0, 1] through the sigmoid function, indicating the probability for a particular class.
The logistic regression model is simple and highly interpretable. The coefficient of each feature intuitively reflects that feature's impact, which is easy to understand and explain. A positive weight indicates that the feature is positively related to the positive class; a negative weight indicates a negative relationship. LR outputs the classification result as well as the probability of that result. All of this shows that LR is an inherently interpretable model.
Logistic regression assumes that the features are linearly related to the log odds (ln(p/(1-p)), where p is the probability of an event occurring. This can be extended into a method for explainability: the odds ratio. Odds ratios are not formally considered part of the XAI toolkit since they only work for LR, but it is practical and widely used in medical research and other fields.
An odds ratio is calculated by dividing the odds of an event occurring in one group by the odds of the event occurring in another group. For example, if the odds ratio for developing lung cancer is 81 for smokers compared to non-smokers, it means smokers are 81 times more likely to develop lung cancer. The OR value is calculated by exponentiating the regression coefficient. For example, if the coefficient of a feature in the logistic regression model is 0.5, the OR value is exp(0.5) ≈ 1.65.
It means that for every unit increase in the feature, the odds of the event occurring increase by approximately 64.87%.
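As a quick sketch of how this works in practice (using scikit-learn on synthetic data; the feature names and coefficients are made up for illustration), odds ratios can be read directly off a fitted logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: 500 samples, two made-up features with "true" coefficients 0.5 and -1.0.
X = rng.normal(size=(500, 2))
log_odds = 0.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.random(500) < 1 / (1 + np.exp(-log_odds))).astype(int)

model = LogisticRegression().fit(X, y)

# Odds ratio = exp(coefficient): the multiplicative change in the odds of the
# positive class for a one-unit increase in the feature, holding the others fixed.
for name, coef in zip(["feature_a", "feature_b"], model.coef_[0]):
    print(f"{name}: coefficient = {coef:+.2f}, odds ratio = {np.exp(coef):.2f}")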
The OR value can be used to find features with the highest influence on prediction results, and further used for feature selection or optimization. However, it only works for the LR model. In the rest of the blog, we will talk about other machine learning models and model-agnostic XAI methods.
We will discuss three machine learning models, each representing a distinct approach based on probability, tree models, and spatial distance, respectively.
Machine learning model 1: Naive Bayes
Naive Bayes is a classification model based on probability and Bayes' theorem. It assumes that features are independent of each other, which is not always true in reality. This "naive" assumption simplifies the problem but can potentially reduce the accuracy.
Naive Bayes obtains the probability of each class, and then selects the class with the highest probability as the output. It calculates posterior probability with prior probability and conditional probability. For example, the probability of 'win' appearing in spam emails is 80%, and 10% in regular emails. Then we can calculate the probabilities of 'is spam email' and 'is regular email' through a series of calculations and pick the one with the higher probability.
Naive Bayes is insensitive to missing data, so it can still work effectively when there are missing values or when features are incomplete. It has good performance in high-dimensional data due to the independence assumption. However, it also has disadvantages. It's sensitive to input data distribution. Performance may decrease if the data does not follow a Gaussian distribution.
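A tiny worked version of the spam example (the 80% and 10% likelihoods come from the text above; the 20% spam prior is an assumed figure, purely for illustration):

# Likelihoods from the example above: P('win' | spam) = 0.8, P('win' | regular) = 0.1.
# The prior P(spam) = 0.2 is an assumed figure for illustration.
p_win_given_spam, p_win_given_regular = 0.8, 0.1
p_spam = 0.2
p_regular = 1 - p_spam

# Bayes' theorem: P(spam | 'win') = P('win' | spam) * P(spam) / P('win')
p_win = p_win_given_spam * p_spam + p_win_given_regular * p_regular
p_spam_given_win = p_win_given_spam * p_spam / p_win

print(f"P(spam | email contains 'win') = {p_spam_given_win:.2f}")  # about 0.67

With a library such as scikit-learn, GaussianNB or MultinomialNB performs the same posterior comparison across many features at once.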
Machine learning model 2: random forest
A decision tree is a learning algorithm with a tree-like structure to make predictions. A random forest uses a bagging of decision trees to make predictions. It randomly draws samples from the training set to train each decision tree. When each decision tree node splits, it randomly selects features to make the best split. It repeats the above steps to build multiple decision trees and form a random forest.
By integrating multiple decision trees, a random forest achieves better performance than a single decision tree. It can reduce overfitting with random sampling and random feature selection. It is insensitive to missing values and outliers and can handle high-dimensional data. But compared with a single decision tree, the training time is longer. In addition, random forests rely on large amounts of data and may not perform well with small datasets.
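A minimal sketch of the bagging-of-trees idea with scikit-learn (the dataset is synthetic and the hyperparameters are illustrative choices):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data; shapes and scores are illustrative only.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 trees, each fit on a bootstrap sample, with a random subset of features
# considered at every split (the two sources of randomness described above).
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

print("held-out accuracy:", accuracy_score(y_test, forest.predict(X_test)))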
Machine learning model 3: SVM (support vector machine)
The core of SVM is to find the hyperplane that best separates data points into different classes and to maximize the boundary between classes. It can be used for both classification and regression tasks.
SVM has good performance with high-dimensional sparse data (such as text data), as well as nonlinear classification problems, so it's particularly suitable for text classification and image recognition. In addition, SVMs are relatively robust against overfitting. Overall, SVM is a good choice for high-dimensional data with a small number of samples, but for large-scale data sets, SVM training takes a long time and thus is not a good choice.
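A small sketch of the text classification use case with a linear SVM over TF-IDF features (the toy corpus and labels below are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy corpus: TF-IDF turns these into high-dimensional sparse vectors,
# the regime where a linear SVM tends to work well with few samples.
texts = [
    "eye tracking study of reading behaviour",
    "pupil segmentation with deep learning",
    "quarterly earnings beat market expectations",
    "stock prices fell after the announcement",
]
labels = ["research", "research", "finance", "finance"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)

print(model.predict(["gaze estimation with a webcam"]))  # likely ['research']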
After discussing the representative machine learning models, we will see the model-agnostic XAI methods, which can be applied to any machine learning model, including linear models, tree models, neural networks, etc.
XAI method 1: SHAP
SHAP (Shapley Additive Explanations) is a model interpretation method based on cooperative game theory. Shapley values are calculated to quantify the importance of each feature by evaluating its marginal contributions to the model. SHAP local explanations reveal how specific features contribute to the individual prediction for each sample. SHAP global explanations describe a model's overall behavior across the entire dataset by aggregating the SHAP values of all individual samples. Figure 1 demonstrates the importance ranking of features in a global explanation, where 'Elevation' ranks first among all features. We can further look at each feature in detail with the dependence plot shown in Figure 2, which shows the relationship between the target and the feature. It could be linear, monotonic, or a more complex relationship. In addition, there are more visualization methods in the SHAP toolkit depending on your needs.
Figure 1: Ranking of influencing features (Fig. 10 in Zhang et al.)
Figure 2: SHAP dependence plot of annual average rainfall (Fig. 14 in Zhang et al.)
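A minimal sketch of producing plots like Figures 1 and 2 with the shap package (the model, data, and feature names below are synthetic stand-ins, not the data from Zhang et al.):

import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import RandomForestRegressor

# Synthetic tabular data with made-up feature names standing in for
# features like 'Elevation' or annual average rainfall.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["elevation", "rainfall", "slope"])
y = 2.0 * X["elevation"] - 1.0 * X["rainfall"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(random_state=0).fit(X, y)

# One Shapley value per feature per sample: its additive contribution to that prediction.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X)                 # global importance ranking (cf. Figure 1)
shap.dependence_plot("rainfall", shap_values, X)  # single-feature relationship (cf. Figure 2)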
XAI method 2: Permutation importance
Permutation importance (PI) is a method of global analysis; it does not focus on local results as SHAP does. It assesses the importance of a feature by measuring the decrease or increase in model performance when the values of that feature are randomly permuted while keeping the other features unchanged. By comparing the change in performance to the baseline performance, permutation importance provides insight into the relative importance of each feature in the model. The difference from baseline performance is the importance value, and it can be positive, negative, or zero. If the value is zero, the model performs the same with the feature completely shuffled as it does with the original data; the feature is therefore of low importance. If the value is negative, the model actually performs better with the feature shuffled, so it would be better not to include this feature at all. Figure 1 shows one example of a ranking of features by permutation importance.
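A short sketch of permutation importance with scikit-learn (synthetic data; the number of repeats is an arbitrary choice):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data; any fitted model that can be scored on held-out data works here.
X, y = make_classification(n_samples=1000, n_features=6, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn (20 repeats) and measure the drop in held-out accuracy.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature_{i}: importance = {mean:+.3f} +/- {std:.3f}")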
XAI method 3: LIME
LIME (Local Interpretable Model-agnostic Explanations) is essentially a method for local analysis. It builds a simple interpretable model (such as a linear model) around the target sample, and the contribution of each feature to the prediction can then be approximated by interpreting the simple model's coefficients. It consists of the following steps (a short code sketch follows the list below):
Randomly select a sample x to be explained.
Generate perturbed samples x′ near x.
Use the complex model to predict each perturbed sample x′ and get the predicted value f(x′).
Use the perturbed samples x′ and the corresponding predictions f(x′) to train a simple interpretable model (such as logistic regression).
Interpret the complex model using the coefficients of the simple model.
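A brief sketch of these steps using the lime package (the model, data, feature names, and class names are made up; in practice you would pass your own trained model's predict_proba):

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# A "complex" model to explain, trained on synthetic data with made-up feature names.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(
    X, feature_names=feature_names, class_names=["low risk", "high risk"], mode="classification"
)

# Explain one sample: LIME perturbs it, queries the model, and fits a local linear surrogate.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=3)
for feature, weight in explanation.as_list():
    print(f"{feature}: {weight:+.3f}")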
LIME is generally used in local cases. For example, a bank can use LIME to determine the major factors that contribute to a customer being identified as risky by a machine learning model.
To sum up, we mainly have the following XAI methods: permutation importance, SHAP, LIME and odds ratio. Permutation importance and SHAP can give global explanations based on the whole dataset, while LIME can only provide local explanations based on a particular sample. Permutation importance measures how important a feature is for the model, and SHAP measures the marginal contribution of a feature. The first three methods are model-agnostic, while the odds ratio is only used for logistic regression and gives global explanations. We can choose to use one or more of the most suitable methods in real-life applications.
In late August 2025, interlibrary loan staff at libraries across the United States found themselves facing an unprecedented situation. Revocation of the De Minimis tariff exemption for packages worth less than $800, due to become effective on 29 August 2025, threw a blanket of uncertainty over global international shipping operations. More than a dozen countries abruptly paused all shipping to the US; document suppliers and book vendors announced that they, too, would stop shipping to the US until the practical impacts became known. ILL folks had reason to wonder if physical library materials in transit across borders would ever reach their destinations and if new shipments in either direction would be hit with tariffs, incurring unbudgeted and unpredictable expenses.
This global kerfuffle is now well into its third week. The SHARES community, a multinational resource sharing consortium whose members are being impacted in different ways depending on their local context, responded as resource sharing practitioners always do: by banding together, pooling uncertainties, sharing strategies and workarounds, and supporting each other with facts, encouragement, and good humor.
Daily challenges countered by sustained real-time ILL collaboration
The disruption first surfaced on the SHARES-L mailing list on 25 August, when libraries began reporting that major European shipping companies like DHL and Deutsche Post were pausing shipments to the US. The University of Tennessee shared that GEBAY, a major German document supplier, had already begun canceling US loan requests, citing the loss of the under-$800 exemption.
Libraries immediately began sharing their approaches and real-time results. The University of Waterloo in Canada reported experiencing occasional tariff issues on incoming items but planned to continue sharing with US partners. Pennsylvania State University established review processes for international requests and began using specific language on customs forms—”Any value stated is for insurance purposes only”—with some initial success. The University of Pennsylvania, a prolific borrower and supplier of library materials across borders, took a more cautious tack, temporarily pausing all international sharing after having an item stuck in Hong Kong customs, requiring $500 for its release. The University of Glasgow began changing their customs forms from “temporary export” to “personal, not for resale,” which seemed to help avoid additional shipping charges on packages shipped to the US. Yale University and the University of Michigan reported receiving direct notifications from additional European suppliers about temporary service suspensions.
As coordinator of the RLP SHARES community, I synthesized the threads each day and created a shared document where SHARES members could add updates. I also added the De Minimis exemption revocation to the agenda of an upcoming SHARES town hall.
On 26 August 2025, the day after the topic surfaced on SHARES-L, 32 participants attended SHARES Town Hall #264 to compare notes on the latest intelligence coming from shippers and overseas libraries and to share their current strategies. The University of Kansas suggested sending conditional responses to prospective overseas borrowers of physical items, asking for confirmation that they would be able to ship items back to the US once they receive them, and offering to scan tables of contents and indexes as a short-term alternative to physical loans. Recognizing that the complexity of the situation varied by carrier and region, the University of Pittsburgh commenced tracking the statuses of individual shipping companies and countries rather than implementing blanket restrictions. SHARES folks renewed their commitment to updating the shared document with all the latest developments.
The situation continued to evolve rapidly. Later that same week, Princeton University reported that several major international book vendors had informed them they would not be shipping new books to the US until customs procedures were clarified, indicating the impact extended far beyond interlibrary loans to also impact academic acquisitions. The CUNY Graduate Center added Brazil to the growing list of countries that have suspended all shipments to the US. This prompted a suggestion to integrate the evolving country-by-country shipping status into the existing International ILL Toolkit, a crowd-sourced tool used by libraries across the world, created by SHARES during a town hall in 2022.
By 4 September, practical advice from shipping companies began to emerge. The Getty Research Institute shared the following, which they’d just received from FedEx:
The traditional wording (“loan between libraries, no commercial value”) is no longer sufficient on its own. Going forward, they should:
1. Always include a numeric HTS code (4901.x for books; 9801.00.10 for U.S. goods returned).
2. Declare a nominal value rather than “no commercial value.”
3. Add clarifying language like “interlibrary loan – not for sale – temporary export/return.”
This ensures [domestic and foreign customs] process the shipments correctly as duty-free, non-commercial library loans.
Other libraries reported successfully receiving packages from Australia with a tariff of only $10.
By 9 September, 15 days after the topic first surfaced on SHARES-L, participants at SHARES Town Hall #266 reported feeling confident they can once again share physical items across most borders with, at worst, minimal disruption and modest fees. Later in the week, Brown University and the University of Pennsylvania each reported having to reimburse DHL $18.38 for paying duties on packages coming back to them in the US from Canada; Penn plans to dispute the charge retroactively, as these are shared library materials, not commercial imports. A few universities are still pausing their international sharing, but most are back at it, full speed ahead.
Two paths for collaboration
The community response emerged through two distinct but interconnected channels: the asynchronous SHARES-L mailing list and the SHARES town halls. The mailing list discussion centered on the immediate sharing and problem-solving, with institutions reporting their individual circumstances and strategies. This allowed all SHARES members a chance to participate at their convenience. The town halls provided a crucial real-time forum where a subset of SHARES practitioners could engage in dynamic discussion, ask questions, coordinate responses, and coalesce around a set of preferred practices, with the outcomes being cycled back to all SHARES participants for comment via the SHARES-L mailing list.
The power of community
The SHARES response to the recent disruption of international shipping exemplifies the extraordinary power of community. Through information sharing, collaborative problem-solving, and mutual support, the SHARES network transformed individual institutional confusion into collective wisdom. Time and again, connections to trusted peers have proven to be every bit as essential as all the other types of infrastructure we depend upon to do our jobs.
The need to capture patrons' attention with interesting flyers and advertisements is critical to library staff's work, so having an easy-to-use graphical program that can help even the most novice designer can elevate designs to the next level. Two free online graphic design programs, Canva and Adobe Express, make it easy to take on any creative project. While the two programs are fairly similar, the few differences between them may help you decide why to choose one over the other.
The contributions of women to the printing trade during the hand press era have long been under-documented, leaving significant historical gaps in our understanding of early print culture. This article presents a project that uses ChatGPT-4o, a generative artificial intelligence (AI) chatbot, to help bridge those gaps by identifying, analyzing, and contextualizing the work of women printers represented in the University of Notre Dame’s rare book collections.
Mountain West Digital Library (MWDL) was founded in 2001 and offers a public search portal supporting discovery of over a million items from digitized historical collections throughout the US Mountain West. This aggregation work necessitates a metadata application profile (MAP) to ensure metadata consistency and interoperability from the regional member network of libraries, archives, and cultural heritage organizations. Unique issues arise in combining metadata from diverse local digital repository platforms and aggregation technology infrastructure introduces further constraints, challenges, and opportunities. Upstream aggregation of metadata in the Digital Public Library of America (DPLA) also influences local and regional metadata modeling decisions. This article traces the history of MWDL’s MAPs, comparing and contrasting five published standards to date. In particular, it will focus primarily on decisions and changes made in the most recent version, published in early 2020.
This paper considers modular approaches to building library software and situates these practices within the context of the rationalizing logics of modern programming. It briefly traces modularism through its elaboration in the social sciences, in computer science and ultimately describes it as it is deployed in contemporary academic libraries. Using the methodology of a case study, we consider some of the very tangible and pragmatic benefits of a modular approach, while also touching upon some of the broader implications. We find that the modularism is deeply integrated into modern software practice, and that it can help support the work of academic librarians.
New York University Libraries recently completed a redesign for their finding aids publishing service to replace an outdated XSLT stylesheet publishing method. The primary design goals focused on accessibility and usability for patrons, including improving the presentation of digital archival objects. In this article, we focus on the iterative process devised by a team of designers, developers, and archivists. We discuss our process for creating a data model to map Encoded Archival Description files exported from ArchivesSpace into JSON structured data for use with Hugo, an open-source static site generator. We present our overall systems design for the suite of microservices used to automate and scale this process. The new solution is available for other institutions to leverage for their finding aids.
This mixed-method study investigates the representation of race and ethnicity within the J. Willard Marriott Digital Library at the University of Utah. The digital collections analyzed in this study come from the Marriott Library’s Special Collections, which represent only a fraction of the library’s physical material (less than 1 percent), albeit those most public facing. Using a team-based approach with librarians from various disciplines and areas of expertise, this project yielded dynamic analysis and conversation combined with heavy contemplation. These investigations are informed by contemporary efforts in librarianship focused on inclusive cataloging, reparative metadata, and addressing archival silences. By employing a data-intensive approach, the authors sought methods of analyzing both the content and individuals represented in our collections. This article introduces a novel approach to metadata analysis—as well as a critique of the team’s initial experiments—that may guide future digital collection initiatives toward enhanced diversity and inclusion.
This paper aims to utilize historical newspapers through the application of computer vision and machine/deep learning to extract the headlines and illustrations from newspapers for storytelling. This endeavor seeks to unlock the historical knowledge embedded within newspaper contents while simultaneously utilizing cutting-edge methodological paradigms for research in the digital humanities (DH) realm. We aimed to provide another facet beyond the traditional search and browse interfaces and incorporated those DH tools with place- and time-based visualizations. Experimental results showed that our proposed methodologies, OCR (optical character recognition) with scraping and deep learning object detection models, can be used to extract the necessary textual and image content for more sophisticated analysis. Timeline and geodata visualization products were developed to facilitate a comprehensive exploration of our historical newspaper data. The timeline-based tool spanned the period from July 1942 to July 1945, enabling users to explore the evolving narratives through the lens of daily headlines. The interactive geographical tool enables users to identify geographic hotspots and patterns. Combining both products can enrich users’ understanding of the events and narratives unfolding across time and space.
In 2022, a task group at the University of Victoria Libraries moved reference service off the desk and into an appointments model. We used Springshare’s LibCal to create a public web calendar and booking system, with librarians setting office hours and appointments conducted over Zoom, phone, or in person. LibCal allows us to send feedback follow-up requests after appointments and to keep statistics and assess usage. This article is a practical case study in implementing a service model change, with emphasis on how we adapted LibCal to support the service for librarians and students.
Over the last year, I’ve written a few times about what began as a small sabbatical project called ManoWhisper, or mano-whisper, or manowhisper 😅. Naming things is hard.
This project started with a simple but intriguing question from my DigFemNet/SIGNAL collaborators and colleagues Shana MacDonald and Brianna Wiens: “Have you ever done anything with podcast transcripts?”
At the time, my answer was no. But, I was curious about experimenting with Whisper, and that curiosity quickly grew into something much larger.
Some transcripts are sourced from the Knowledge Fight Interactive Search Tool and flagged accordingly on their show and episode pages (more about fight.fudgie.org below). To complement the website, a companion command-line tool (manowhisper) generates classifications and statistics.
42 podcast series currently being transcribed, classified, and indexed;
35,000+ transcripts available for search and interaction;
A significant backlog of additional series and episodes waiting to be processed.
The scale of this dataset/corpus will continue to grow as we’re able to resource it, hopefully offering collaborators and researchers new ways to examine the narratives and ideologies in the corpus.
Support, Collaboration, and Thank Yous!
This project would not be possible without the financial and in-kind support of:
I want to extend a special thank you to Erlend Simonsen for his generous work and support. If it’s not obvious, my work here was HEAVILY inspired by his amazing work on the Knowledge Fight Interactive Search Tool. I encourage everyone to explore that project. It is regularly updated with new shows, episodes, features, and insights.
Working in close dialogue with colleagues in the Digital Feminist Network/SIGNAL Network, we’ve expanded this scope to include more Canadian context. This aligns with our broader research goals: to understand how digital gender discourse ecosystems, including the Manosphere, Femosphere, and incel groups, shape and influence local institutions and communities.
ManoWhisper’s journey from a “fun little question” to a research platform has been a bit of a surprise, and incredibly rewarding. What began as a collection of scripts is now a living, expanding infrastructure for studying how discourse circulates through podcasts.
This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. On 7th March 2025, I organized and celebrated a community Open Data Day 2025 in Eastern Province, Southeast of Kigali, through a project entitled WIKI-SHE EVENT RWANDA under the theme “Promoting gender equity and increasing the...
This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. On the 4th of March 2025 the University of Aruba participated in Open Data Day 2025! About 20 researchers and students gathered to hear all about Open Data, Research Data Management and the FAIR...