October 15, 2025

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-10-14: Goodbye, goo.gl/0R8XX6

goo.gl/0R8XX6 is now 404 on the live web. 

It's been nearly two months since Google stopped redirecting some of its goo.gl short URLs (August 25, 2025). In previous posts, I looked at how many of these goo.gl URLs the Wayback Machine had archived (at least 1.3M at the time), and estimated that 4,000 goo.gl short links used in scholarly publications would be lost.  In early September, I verified that Google had indeed implemented this terrible, horrible, no good, very bad idea. It's now mid-October, and I'm just now getting a chance to write up this behavior. While the redirections are missing from the live web, many (most? all?) have been archived, and it also turns out they leave a live web tombstone. 

First, let's look at the live web goo.gl URLs. The ones that were not lucky enough to receive traffic in "late 2024" no longer return a sunset message, and now return a garden variety HTTP 404 (image above, HTTP response below): 

% curl -Is https://goo.gl/0R8XX6 | head -7

HTTP/2 404 

content-type: text/html; charset=utf-8

cache-control: no-cache, no-store, max-age=0, must-revalidate

pragma: no-cache

expires: Mon, 01 Jan 1990 00:00:00 GMT

date: Tue, 14 Oct 2025 22:31:25 GMT

content-length: 0

Recall that goo.gl/0R8XX6 was one of the 26 shortened URLs from a 2017 survey of data sets for self-driving cars that was not lucky enough to have received traffic during late 2024, and thus was no longer going to continue to redirect (the other 25 shortened URLs are still redirecting).  One reason I had put off posting about this finding is that, other than confirming that Google did the thing they said they were going to do, there wasn't a surprise or interesting outcome. But it turns out I was wrong: it appears that you can look at the HTML entity to determine whether there was ever a redirection at the now-404 shortened URL.

I wanted to test if goo.gl would return a 410 Gone response for URLs that no longer redirect. The semantics of a 410 are slightly stronger than a 404, in that a 410 allows you to infer that there used to be a resource identified by this URL, but there isn't now.  A regular 404 doesn't allow you to distinguish something that used to be 200 (or 302*, in the case of goo.gl) from something that was never 200 (or 301, 302, etc.). Unfortunately, 410s are rare on the live web, but goo.gl deprecating some of its URLs seemed like a perfect opportunity to use them.  But in my testing of shortened URLs, I discovered that you get a different HTML entity depending on whether the goo.gl URL ever existed or not.  

Let's take a look at the HTML entity that comes back via curl (I've created a gist with the full responses, but here I'll just show byte count):

% curl -s https://goo.gl/0R8XX6 | wc -c

    1652   

Doing the same thing for a shortened URL that presumably never existed, we get a response that's about 5X bigger (9,237 bytes vs. 1,652 bytes), even though it's still an HTTP 404:

% curl -s https://goo.gl/asdkfljlsdjfljasdljfl | wc -c

    9237


% curl -Is https://goo.gl/asdkfljlsdjfljasdljfl | head -7

HTTP/2 404 

content-type: text/html; charset=utf-8

vary: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site

cache-control: no-cache, no-store, max-age=0, must-revalidate

pragma: no-cache

expires: Mon, 01 Jan 1990 00:00:00 GMT

date: Wed, 15 Oct 2025 00:00:10 GMT

We can see that goo.gl/asdkfljlsdjfljasdljfl produces a completely different (Firebase-branded**) HTML page:

goo.gl/asdkfljlsdjfljasdljfl -- still 404, but a different HTML entity


Note that the 404 page shown in the top image is the same Google-branded 404 page that one gets from google.com; for example google.com/asdkfljlsdjfljasdljfl.

https://google.com/asdkfljlsdjfljasdljfl


It's possible there's a regular expression that checks for goo.gl-style hashes in the URLs and "asdkfljlsdjfljasdljfl" was handled differently.  So next I tested a pair of six-character hashes, goo.gl/111111 vs. goo.gl/111112, and got the same behavior: both 404, but 111112's HTML was about 5X bigger than 111111's HTML:

% curl -Is https://goo.gl/111111 | head -7 

HTTP/2 404 

content-type: text/html; charset=utf-8

cache-control: no-cache, no-store, max-age=0, must-revalidate

pragma: no-cache

expires: Mon, 01 Jan 1990 00:00:00 GMT

date: Wed, 15 Oct 2025 00:05:28 GMT

content-length: 0


% curl -Is https://goo.gl/111112 | head -7 

HTTP/2 404 

content-type: text/html; charset=utf-8

vary: Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site

cache-control: no-cache, no-store, max-age=0, must-revalidate

pragma: no-cache

expires: Mon, 01 Jan 1990 00:00:00 GMT

date: Wed, 15 Oct 2025 00:05:34 GMT


% curl -s https://goo.gl/111111 | wc -c   

    1652


% curl -s https://goo.gl/111112 | wc -c 

    9222

https://goo.gl/111111

https://goo.gl/111112


Turns out that I was lucky with my first pair of random strings: goo.gl/111111 has an archived redirection and goo.gl/111112 does not, with 111111 also not being popular in "late 2024".  While the archived redirection proves that there was a redirection for 111111, the lack of an archived redirection for 111112 technically does not prove that there was never a redirection (there could have been one and it wasn't archived).  While I could spend more time trying to reverse engineer goo.gl and Firebase, I will be satisfied with my initial guess and trust my intuition, which says that the different returned HTML entities allow you to determine what goo.gl URLs used to redirect (i.e.,  de facto HTTP 410s) vs. the goo.gl URLs that never redirected (i.e., correct HTTP 404s). 

https://web.archive.org/web/*/https://goo.gl/111111

https://web.archive.org/web/*/https://goo.gl/111112

So the current status of goo.gl is even crazier than it first seems: rather than simply having all the goo.gl URLs redirect, they are keeping a separate list of goo.gl URLs that do not redirect.  We now have:

  1. goo.gl URLs that still redirect correctly
  2. goo.gl URLs that no longer redirect, but goo.gl knows they used to redirect, because they return a Google-branded 404 page
  3. goo.gl URLs that never redirected (i.e., were never really goo.gl shortened URLs), for which goo.gl returns a Firebase-branded 404 page

I suppose we should be happy that they did not deprecate all of the goo.gl URLs, but surely keeping all of them would have been easier. 

Fortunately, web archives, specifically IA's Wayback Machine in this case, have archived these redirections. The Wayback Machine is especially important in the case of goo.gl/0R8XX6, since its redirection target, 3dvis.ri.cmu.edu/data-sets/localization/, no longer resolves, and the page is not unambiguously discoverable via a Google search.  In this case, we need the Wayback Machine to get both the goo.gl URL and the cmu.edu URL.  

https://web.archive.org/web/*/https://goo.gl/0R8XX6

https://web.archive.org/web/20231125001435/https://goo.gl/0R8XX6
https://web.archive.org/web/20190107062345/http://3dvis.ri.cmu.edu/data-sets/localization/


So there is a possible, but admittedly unlikely, use case for this bit of knowledge.  If you're resolving goo.gl URLs and get a 404 instead of a 302, then check the Wayback Machine; it probably has the redirect archived.  If the Wayback Machine doesn't have the redirect archived, you can check the HTML entity returned in the goo.gl 404 response: Google-branded 404s (deprecated goo.gl URLs) are much smaller than Firebase-branded 404s (never-valid goo.gl URLs). A small, Google-branded 404 page is a good indicator that there used to be a redirection, and if the Wayback Machine doesn't have it archived, maybe another web archive does.  
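If you want to script this triage, here is a minimal sketch in Python (using the requests library and the Wayback Machine availability API); the 5,000-byte cutoff is my own assumption based on the sizes observed above (roughly 1.6 KB vs. 9 KB) and could change if Google alters either 404 page.

# Rough sketch of the triage described above. The 5,000-byte cutoff is an
# assumption based on sizes observed in October 2025 (~1,652 vs. ~9,237 bytes).
import requests

WAYBACK_API = "https://archive.org/wayback/available?url="

def triage_goo_gl(short_url):
    r = requests.get(short_url, allow_redirects=False, timeout=30)
    if r.status_code in (301, 302):
        return "still redirects to " + r.headers.get("location", "?")
    # Otherwise (404): first check the Wayback Machine for an archived redirect.
    wb = requests.get(WAYBACK_API + short_url, timeout=30).json()
    closest = wb.get("archived_snapshots", {}).get("closest")
    if closest:
        return "404 on the live web, but archived at " + closest["url"]
    # No archived copy: fall back to the size of the 404 HTML entity.
    if len(r.content) < 5000:
        return "small Google-branded 404: likely a deprecated (once-valid) goo.gl URL"
    return "large Firebase-branded 404: likely never a valid goo.gl URL"

print(triage_goo_gl("https://goo.gl/0R8XX6"))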

So goodbye, goo.gl/0R8XX6. Happily, you were in archive.org a full two years before goo.gl knew you were dead.


--Michael



* While we're at it, why did goo.gl use 302s and not the more standard practice of 301s?!

** At some point, goo.gl URLs were implemented as "Firebase Dynamic Links", which were also deprecated on 2025-08-25.

by Michael L. Nelson (noreply@blogger.com) at October 15, 2025 02:46 AM

October 14, 2025

HangingTogether

Exploring AI uses in archives and special collections: Integration, entities, and addressing need

This is the second in a short blog series on what we learned from the OCLC RLP Managing AI in Metadata Workflows Working Group.

A vibrant digital illustration featuring the name "Mary Shelley" repeated in various languages and scripts, overlaid on a glowing network of interconnected nodes and lines. The colorful nodes emit light in hues of yellow, pink, green, and blue, symbolizing data connections and metadata integration.

Archives and special collections contain a wide range of resource types requiring different metadata workflows. Resources may be described in library catalogs, digital repositories, or finding aids, and the metadata can vary greatly because of platforms, collections priorities, and institutional policies. Providing online access and discovery for these unique resources presents an ongoing challenge because of inconsistent or incomplete metadata and new digital accessibility standards. AI presents new possibilities for providing access to unique resources in archives and special collections, where it may be used to generate data, like captions and transcriptions, that plays to the strengths of large language models (LLMs).

This blog post—the second in our series on the work of the OCLC Research Library Partnership (RLP) Managing AI in Metadata Workflows Working Group—focuses on the “Metadata for Special and Distinctive Collections” workstream. It shares current uses of AI by members, insights on assessing whether AI is suitable for a task, and open questions about accuracy and data provenance.

Participants

This workstream brought together metadata professionals from diverse institutions, including academic libraries, national archives, and museums. Their collective expertise and the use cases they shared provided valuable insights into how AI tools can address the unique challenges of special and distinctive collections. Members of this group included:

Helen Baer, Colorado State University
Jill Reilly, National Archives and Records Administration
Amanda Harlan, Nelson-Atkins Museum of Art
Mia Ridge, British Library
Miloche Kottman, University of Kansas
Tim Thompson, Yale University

Integration in existing tools

Participants primarily described using tools already available to them through existing licensing agreements with their parent institution. While this works for proof-of-concept experimentation, these ad hoc approaches do not scale up to production levels or provide the desired increases in efficiency. Participants expressed that they want integrated tools within the library workflow products they are already using.

Using multiple tools is a long-standing feature of metadata work. In the days of catalog cards, a cataloger might have a bookcase full of LCSH volumes (i.e., the big red books), LCC volumes, AACR2, LCRIs, a few language dictionaries, a few binders of local policy documents, and, of course, a typewriter manual. Today, a cataloger may have four or five applications open on their computer, including a browser with several tabs. Working with digital collections compounds this complexity, requiring additional tools for content management, file editing, and project tracking. Since AI has already been integrated into several popular applications, including search engines, metadata managers hope to see similar functionality embedded within their existing workflows, potentially reducing the burden of managing so many passwords, windows, and tabs.

Entity management

Many metadata managers, including our subgroup members, dream of automated reconciliation against existing entity databases. This becomes even more important for archives, which often contain collections of family papers with multiple family members who share the same name. A participant observed that URIs are preferable for disambiguation due to the requirement to create unique authorized access points for persons using a limited set of data elements. The natural question then becomes, “How can AI help us do this?”

Yale University’s case study explored this question, noting that it used AI in combination with many other tools, as using an LLM for this work would have been prohibitively expensive. The technology stack is shared in the entity resolution pipeline and includes a purpose-built vector database for text embeddings. The results included a 99% precision rate in determining whether two bibliographic records with different headings (e.g., “Schubert, Franz” and “Schubert, Franz, 1797-1828”) referred to the same person and did not make traditional string match errors that occur when identical name strings refer to different persons. This case study demonstrated how AI could be effectively used in combination with multiple tools, but it may also require technical expertise beyond that of many librarians and archivists.

Readiness and need

All participants indicated some level of organizational interest in experimenting with AI to address current metadata needs. Due to distinct workflows and operations common in special collections and archives, there were fewer concerns about AI replacing human expertise than in the general cataloging subgroup.

We identified three factors influencing their willingness to experiment with AI:

Traditional divisions of work

In archival work, item-level description tasks, such as creating image captions and transcripts, have often been done selectively by volunteers and student workers rather than metadata professionals due to the volume of items and the lack of specialized skills needed.* For example, the United States’ National Archives and Records Administration (NARA) relies on its Citizen Archivist volunteer program to provide tagging and transcription of digitized resources. Even with these dedicated volunteers, NARA uses AI-generated descriptions because of the extensive number of resources. However, NARA’s volunteers provide quality control on the AI-generated metadata, and the amount of metadata generated by AI ensures that these volunteers continue to be needed and appreciated.

Quantity of resources

Archival collections may range from a single item to several thousand items, resulting in significant variation in the type and level of description provided. Collection contents are often summarized with statements such as “45 linear feet,” “mostly typescripts,” and “several pamphlets in French.” However, when collections are digitized, more granular description is required to support discovery and access. The workflow at NARA is a good demonstration of how an archive uses AI to provide description at a scale that is not feasible for humans. Many archivists have been open to the idea of using AI for these tasks because the quantity of resources meant that detailed metadata was not possible.

Meeting accessibility requirements

Accessibility is a growing priority for libraries and archives, driven by legal requirements such as the ADA Title II compliance deadline in the US. For digital collections, this may mean providing alt text for images, embedded captions and audio descriptions for video recordings, and full transcripts for audio recordings.

A participant observed that, in their experience with AI-generated transcripts, AI does well transcribing single-language, spoken word recordings. However, the additional nuances with singing and multiple-language recordings are too complex for AI. This provides a natural triage for audio transcript workflows in their institution.

Creating transcripts of audio recordings is time-consuming, and archives have largely relied on student workers and volunteers for this work. Many institutions have a backlog of recordings with no transcriptions available. Thus, using AI for transcripts enables them to meet accessibility requirements and increase discovery of these resources.

Challenges and open questions around the use of AI

While AI offers opportunities, the group also identified several challenges and open questions that must be addressed for successful implementation. Metadata quality and data provenance were the top issues emerging for special and distinctive collections.

Assessing metadata quality

What is an acceptable error rate for AI-generated metadata? Participants noted that while perfection is unattainable, even for human catalogers, institutions need clear benchmarks for evaluating AI outputs. Research providing comparative studies of error rates between AI and professional catalogers would prove valuable for informing AI adoption decisions, but few such findings currently exist. High precision remains critical for maintaining quality in library catalogs, as misidentification of an entity will provide users with incorrect information about a resource.

The subgroup also discussed the concept of “accuracy” in transcription. For instance, AI-generated transcripts may be more literal, while human transcribers often adjust formatting to improve context and readability. An example from NARA showing a volunteer-created transcription and the AI data (labeled as “Extracted Text”) illustrates these differences. The human transcription moves the name “Lily Doyle Dunlap” to the same line as “Mrs.”, but the AI transcribes line by line. While the human transcriber noted untranscribed text as “[illegible],” the AI transcribed it as “A.” Neither reflects what was written, so both could be described as not completely accurate. Unlike cataloging metadata, there has never been an expectation that transcriptions of documents or audiovisual records would be perfect in all cases for various reasons, including handwriting legibility and audio quality. One participant characterized their expectations for AI-generated transcripts as “needed to be good, but not perfect.”

One case study used confidence scores as a metric to determine whether the AI-generated metadata should be provided to users without review. Confidence scores provide a numerical value indicating the probability that the AI output is correct. For example, a value of over 70% might be set as a threshold for providing data without review. Because confidence scores are provided by the models themselves, they are as much a reflection of the model’s training as its output.
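As a toy illustration of such a gate (the 70% threshold, field names, and records below are purely illustrative, not any institution's actual workflow), a few lines of Python suffice:

# Hypothetical AI-generated metadata records with model-reported confidence scores.
REVIEW_THRESHOLD = 0.70

def route(record):
    """Publish high-confidence AI metadata directly; queue the rest for human review."""
    return "publish" if record["confidence"] >= REVIEW_THRESHOLD else "human review"

records = [
    {"id": "item-001", "ai_transcript": "Mrs. Lily Doyle Dunlap ...", "confidence": 0.91},
    {"id": "item-002", "ai_transcript": "[uncertain text]", "confidence": 0.42},
]
for r in records:
    print(r["id"], "->", route(r))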

Providing data provenance

Data provenance—the story of how metadata is created—is a critical concern for AI-generated outputs. Given the risk of AI “hallucinations” (generating incorrect or fabricated data), it is important to provide information to users about AI-created metadata. Working group members whose institutions are currently providing such data provenance shared their practices. NARA indicates that a document transcript is AI-generated using the standard text “Contributed by FamilySearch NARA Partner AI / Machine-Generated” (see this example for extracted text of a printed and handwritten document).

OCLC recognizes the importance of this issue to the community and is providing support in several ways.

Conclusion

Metadata professionals have a long-standing interest in the use of automation to provide and improve metadata, and AI joins macros, controlling headings, and batch updates as the latest technology tool in this effort. Our subgroup’s case studies demonstrated that AI tools can be used in special collections workflows in cases where AI is well-suited to the metadata needed. The most compelling applications involved transcribing documents and recordings, where AI capabilities, such as automatic speech recognition (ASR) and natural language processing (NLP), make it a good fit for such tasks.

NB: As you might expect, AI technologies were used extensively throughout this project. We used a variety of tools—Copilot, ChatGPT, and Claude—to summarize notes, recordings, and transcripts. These were useful for synthesizing insights for each of the three subgroups and for quickly identifying the types of overarching themes described in this blog post.


*It is worth noting that the labor available to national and university archives includes volunteers and student workers, whereas a smaller stand-alone archive like a historical society would not have access to so many human resources.


by Kate James at October 14, 2025 02:03 PM

October 13, 2025

Mita Williams

How I use ActivityPub (and Why)

§1 The Why Before the How §2 The (RSS) Feed is Dead. Long Live the (RSS) Feed §3 "Social networks consist of people who are connected by a shared object" §4 ActivityPub needs local champions §5 An ActivityPub Membership Drive

by Mita Williams at October 13, 2025 09:00 PM

Distant Reader Blog

Four curated study carrels: Emma, the Iliad and Odyssey, Moby Dick, and Walden

While here, basking inside the grandeur of the Sainte-Geneviève Library (Paris), I have finished curating four Distant Reader study carrels:

  1. Emma by Jane Austen
  2. The Iliad and the Odyssey by Homer
  3. Moby Dick by Herman Melville
  4. Walden by Henry David Thoreau

Introduction

Distant Reader study carrels are data sets intended to be read by people as well as computers. They are created through the use of a tool of my own design -- the Distant Reader Toolbox. Given an arbitrary number of files in a myriad of formats, the Toolbox caches the files, transforms them into plain text files, performs feature extractions against the plain text, and finally saves the results as sets of tab-delimited files as well as an SQLite database. The files and the database can then be computed against -- modeled -- in a myriad of ways: extents (sizes in words and readability scores), frequencies (unigrams, bigrams, keywords, parts-of-speech, named entities), topic modeling, network analysis, and a growing number of indexes (concordances, full-text searching, semantic indexing, and more recently, large language model embeddings).

I call these data sets "study carrels", and they are designed to be platform- and network-independent. Study carrel functionality requires zero network connectivity, and study carrel files can be read by any spreadsheet, database, analysis program (like OpenRefine), or programming language. Heck, I could even compute against study carrels on my old Macintosh SE/30 (circa 1990) if I really desired. For more detail regarding study carrels, see the readme file included with each carrel. All that said, once a carrel is created, it lends itself to all sorts of analysis, automated or not. The "not automated" analysis I call "curation", which is akin to a librarian curating any of their print collections.

With this in mind, I have curated four study carrels. I divided each of the four books (above) into their individual chapters, created study carrels from the results, and did distant reading against each: I applied distant reading, observed the results, summarized my observations, and documented what I learned. Since each curation details what I learned, I won't go into all of it here, but I will highlight some of the results of my topic modeling.

Topic modeling

In a sentence, topic modeling is an unsupervised machine learning process used to enumerate the latent themes in any corpus. Given an integer (T), topic modeling divides a corpus into T topics and outputs the words associated with each topic. Like most machine learning techniques, the topic modeling process is nuanced and therefore the results are not deterministic. Still, topic modeling can be quite informative. For example, once a model has been created, the underlying documents can be associated with ordinal values (such as dates or sequences of chapters). The model can then be pivoted so the ordinal values and the topic model weights are compared. Finally, the pivoted table can be visualized in the form of a line chart. Thus, a person can address the age-old question, "How did such and such and so and so topic ebb and flow over time?" This is exactly what I did with Emma, the Iliad and the Odyssey, Moby Dick, and Walden. In each and every case, my topic modeling described the ebb and flow of the given book, which, in the end, was quite informative and helped me characterize each.
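To make the pivoting step concrete, here is a minimal, generic sketch using scikit-learn and pandas; the Distant Reader Toolbox has its own implementation and file formats, so the toy chapters and variable names below are purely illustrative.

# Generic sketch: topic model a sequence of chapters, then pivot the topic
# weights over chapter order to see how themes ebb and flow. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import pandas as pd

chapters = [  # one plain-text string per chapter, in reading order
    "emma harriet weston knightley visit woodhouse",
    "engagement affection attachment letter happiness heart",
    "charade likeness picture lines maid smith",
    "jane fairfax bates dancing instrument crown",
]
T = 4  # number of topics to enumerate

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(chapters)

lda = LatentDirichletAllocation(n_components=T, random_state=42)
weights = lda.fit_transform(dtm)          # rows = chapters, columns = topic weights

# Label each topic with its single strongest word, then pivot by chapter order.
terms = vectorizer.get_feature_names_out()
labels = [terms[lda.components_[t].argmax()] for t in range(T)]
over_time = pd.DataFrame(weights, columns=labels)
over_time.index.name = "chapter"

print(over_time.round(3))                 # over_time.plot() draws the line chart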

Emma

I topic modeled Emma with only four topics, and I assert the novel is about "emma", "engagement", "charade", and "jane". Moreover, these topics can be visualized as a pie chart as well as a line chart. Notice how "emma" dominates. From my point of view, it is all about Emma and her interactions/relationships with the people around her. For more elaboration, see the curated carrel.

label weight features
emma 3.0353 emma harriet weston knightley elton time great woodhouse quite nothing dear always
engagement 0.2443 engagement affection attachment snow circumstances happiness behaviour letter feeling heart resolution reflection
charade 0.18906 charade likeness sit eye sea lines alone wingfield maid picture smith's south
jane 0.16189 jane fairfax bates campbell dixon colonel cole dancing campbells instrument dance crown
Figures: topics (emma_topics.png) and topics over time (emma_topics-over-time.png)

Iliad and Odyssey

I did the same thing with the Iliad and the Odyssey, but this time I modeled with a value of eight. From this process, I assert the epic poems are about "man", "trojans", "achaeans", "achilles", "sea", "ulysses", "horses", and "alcinous". This time "man" dominates, but "trojans" and "achaeans" are a close second. More importantly, plotting the topics over the sequence of the books (time), I can literally see how the two poems are distinct stories; notice how the first part of the line chart is all about "trojans", and the second is all about "man". See the curated carrel for an elaboration.

labels weights features
man 1.08328 man house men father ulysses home people gods
trojans 0.38826 trojans spear hector achaeans fight jove ships battle
achaeans 0.16146 achaeans ships agamemnon atreus jove king held host
achilles 0.14746 achilles peleus hector city priam body river women
sea 0.11063 sea ship men circe island cave wind sun
ulysses 0.10853 ulysses telemachus suitors penelope eumaeus stranger house bow
horses 0.10553 horses diomed tydeus nestor agamemnon chariot menelaus ulysses
alcinous 0.0444 alcinous phaeacians clothes stranger demodocus vulcan girl nausicaa
Figures: topics (homer_topics.png) and topics over time (homer_topics-over-time.png)

Moby Dick

I topic modeled Moby Dick with a value of ten, and the resulting topics included: "ahab", "whales", "soul", "pip", "boats", "queequeg", "cook", "whaling", "jonah", and "bildad". The topics of "ahab" and "whales" dominate, and if you know the story, then this makes perfect sense. Topic modeling over time illustrates how the book's themes alternate, and thus I assert the book is not only about Ahab's obsession with the white whale, but it is also about the process of whaling, kinda like an instruction manual. Again, see the curated carrel for an elaboration.

labels weights features
ahab 0.99807 ahab man ship sea time stubb head men
whales 0.2336 whales sperm leviathan time might fish world many
soul 0.08189 soul whiteness dick moby brow mild times wild
pip 0.08068 pip carpenter coffin sun fire blacksmith doubloon try-works
boats 0.07229 boats line air spout water oars tashtego leeward
queequeg 0.06133 queequeg bed room landlord harpooneer door tomahawk bedford
cook 0.0569 cook sharks dat blubber mass tun bucket bunger
whaling 0.05234 whaling ships gabriel voyage whale-ship whalers fishery english
jonah 0.04218 jonah god loose-fish fast-fish law shipmates guernsey-man woe
bildad 0.02847 bildad peleg steelkilt sailor gentlemen lakeman radney don
Figures: topics (moby_topics.png) and topics over time (moby_topics-over-time.png)

Walden

Unlike the other books, Walden is not a novel but instead a set of essays. Set against the backdrop of a pond (but we would call it a lake), Thoreau elaborates on his observations of nature and what it means to be human. In this case I modeled with seven topics, and the results included: "man", "water", "woods", "beans", "books", "purity", and "shelter". Yet again, the topic of "man" dominates, but notice how each of the chapters' titles very closely corresponds to one of the computed topics. As I alluded to previously, pivoting a topic model on some other categorical value often brings very interesting details to light. See the curated carrel for more detail.

labels weights features
man 1.64 man life men house time day part world get morning work thought
water 0.56165 water pond ice shore surface walden spring deep bottom snow winter summer
woods 0.2046 woods round fox pine door bird snow evening winter night suddenly near
beans 0.19471 beans hoe fields seed cultivated soil john corn field planted labor dwelt
books 0.19032 books forever words language really things men learned concord intellectual news wit
purity 0.16002 purity evening body warm laws gun humanity streams sensuality hunters vegetation animal
shelter 0.10274 shelter clothes furniture cost labor fuel clothing free houses people works boards
Figures: topics (walden_topics.png) and topics over time (walden_topics-over-time.png)

Summary

I have used both traditional and distant reading against four well-known books. I have documented what I learned, and this documentation has been manifested as a set of four curated Distant Reader study carrels. I assert traditional reading's value will never go away. After all, novels and sets of essays are purposely designed to be consumed through traditional ("close") reading. On the other hand, the application of distant reading can quickly and easily highlight all sorts of characteristics which are not, at first glance, very evident. The traditional and distant reading processes complement each other.

October 13, 2025 04:00 AM

October 12, 2025

Raffaele Messuti

The missing feature in digital libraries: searchable tables of contents

In the context of electronic books, I've always been frustrated by how reading applications relegate table-of-contents navigation to a minor feature in their UI/UX.

(Note: Throughout history, indexes — those alphabetical listings at the back of books — have been crucial for knowledge access, as Dennis Duncan explores in "Index, A History of the". But this post focuses on tables of contents, which show the hierarchical structure of chapters and sections.)

I find it very useful to view the table of contents before opening a book. I often do this in the terminal. You can easily create your own script in any programming language using an existing library for EPUB files (or PDF, or whatever format you need to read). For EPUB files, the simplest approach I have found is to use Readium CLI and jq to print a tree-like structure of the book. This is the script I use:

#!/bin/bash

# Usage: ./epub-toc.sh <epub-file>

if [ $# -eq 0 ]; then
    echo "Usage: $0 <epub-file>" >&2
    exit 1
fi

EPUB_FILE="$1"

if [ ! -f "$EPUB_FILE" ]; then
    echo "Error: File '$EPUB_FILE' not found" >&2
    exit 1
fi

if ! command -v readium &> /dev/null; then
    echo "Error: 'readium' command not found. Please install readium-cli." >&2
    exit 1
fi

if ! command -v jq &> /dev/null; then
    echo "Error: 'jq' command not found. Please install jq." >&2
    exit 1
fi

readium manifest "$EPUB_FILE" | jq -r '
  def tree($items; $prefix):
    $items | to_entries[] |
    (if .key == (($items | length) - 1) then
      $prefix + "└── "
    else
      $prefix + "├── "
    end) + .value.title,
    (if .value.children then
      tree(.value.children; $prefix + (if .key == (($items | length) - 1) then "    " else "│   " end))
    else
      empty
    end);

  if .toc then
    tree(.toc; "")
  else
    "Error: No .toc field found in manifest" | halt_error(1)
  end
'
Example of a book with a long and nested table of contents
~ readium-toc La_comunicazione_imperfetta_-_Peppino_Ortoleva_Gabriele_Balbi.epub
├── Copertina
├── Frontespizio
├── LA COMUNICAZIONE IMPERFETTA
├── Introduzione
│   ├── 1. I percorsi movimentati, e accidentati, del comunicare.
│   ├── 2. Teorie lineari della comunicazione: una breve archeologia.
│   ├── 3. Oltre la linearità, verso l’imperfezione.
│   └── 4. La struttura del libro.
├── Parte prima. Una mappa
│   ├── I. Malintesi
│   │   ├── 1. Capirsi male. Un’introduzione al tema.
│   │   ├── 2. Una prima definizione, anzi due.
│   │   ├── 3. A chi si deve il malinteso.
│   │   ├── 4. Il gioco dei ruoli.
│   │   ├── 5. L’andamento del malinteso.
│   │   ├── 6. Le cause del malinteso.
│   │   │   ├── 6.1. Errori e deformazioni materiali.
│   │   │   ├── 6.2. Parlare lingue diverse.
│   │   │   ├── 6.3. La comunicazione non verbale: toni, espressioni, gesti.
│   │   │   ├── 6.4. La comunicazione verbale: l’inevitabile ambiguità del parlare.
│   │   │   ├── 6.5. Detto e non detto.
│   │   │   └── 6.6. Sovra-interpretare.
│   │   ├── 7. Le conseguenze: il disagio e l’ostilità.
│   │   ├── 8. La spirale del non capirsi.
│   │   ├── 9. Uscire dal malinteso.
│   │   └── 10. Il ruolo del malinteso nella comunicazione umana.
│   ├── II. Malfunzionamenti
│   │   ├── 1. Malfunzionamenti involontari.
│   │   ├── 2. Malfunzionamenti intenzionali.
│   │   ├── 3. (In)tollerabilità del malfunzionamento.
│   │   ├── 4. Contrastare il malfunzionamento: manutenzione e riparazione.
│   │   ├── 5. Produttività del malfunzionamento.
│   │   └── 6. Relativizzare il malfunzionamento: per una conclusione.
│   ├── III. Scarsità e sovrabbondanza
│   │   ├── 1. Il peso della quantità.
│   │   │   ├── 1.1. La scarsità informativa: effetti negativi e produttivi.
│   │   │   ├── 1.2. La sovrabbondanza informativa: effetti negativi e produttivi.
│   │   │   └── 1.3. Qualche principio generale.
│   │   ├── 2. Politiche della scarsità e politiche dell’abbondanza.
│   │   │   ├── 2.1. Accesso all’informazione, accesso al potere.
│   │   │   └── 2.2. Controllare la circolazione dell’informazione: limitare o sommergere.
│   │   ├── 3. Scarsità e abbondanza nell’economia della comunicazione.
│   │   │   ├── 3.1. Il valore dell’informazione tra domanda e offerta.
│   │   │   ├── 3.2. L’economia dell’attenzione.
│   │   │   └── 3.3. I padroni della quantità.
│   │   ├── 4. Le basi tecnologiche della scarsità e della sovrabbondanza.
│   │   │   ├── 4.1. Effluvio comunicativo e scarsità materiale.
│   │   │   └── 4.2. Scarsità e abbondanze oggettive o create ad arte.
│   │   └── 5. Gestire il troppo e il troppo poco.
│   │       ├── 5.1. Il troppo stroppia o melius est abundare quam deficere?
│   │       ├── 5.2. Colmare un ambiente povero di informazioni.
│   │       └── 5.3. Due concetti relativi.
│   └── IV. Silenzi
│       ├── 1. La comunicazione zero.
│       │   ├── 1.1. La presenza dell’assenza.
│       │   ├── 1.2. I silenzi comunicano.
│       │   └── 1.3. Silenzi codificati e silenzi enigmatici.
│       ├── 2. Il silenzio del mittente.
│       ├── 3. Il silenzio del ricevente.
│       ├── 4. Il silenzio dei pubblici.
│       ├── 5. Silenzi parziali: le omissioni.
│       ├── 6. Il valore del silenzio: il segreto.
│       │   ├── 6.1. Una breve tipologia dei segreti.
│       │   ├── 6.2. Preservare e carpire i segreti.
│       │   └── 6.3. Ancora sulla fragilità del segreto.
│       └── 7. I paradossi del silenzio.
├── Parte seconda. Verso una teoria
│   └── V. La comunicazione è imperfetta
│       ├── 1. L’imperfezione inevitabile.
│       ├── 2. Correggere, rimediare.
│       │   ├── 2.1. Prima dell’invio: le correzioni umane, e non.
│       │   ├── 2.2. Durante l’invio.
│       │   └── 2.3. E quando la comunicazione ha già raggiunto il destinatario o l’arena pubblica?
│       ├── 3. Le vie dell’adattamento.
│       │   ├── 3.1. Avere tempo.
│       │   ├── 3.2. Adattarsi e adattare a sé.
│       │   └── 3.3. Tra le persone, con gli strumenti.
│       └── 4. Dal lineare al non lineare e all’imperfetto.
├── Bibliografia
├── Il libro
├── Gli autori
└── Copyright

The readium manifest command prints a unified representation of a publication, in JSON format.

But this is just a hacky trick, nothing more.

The serious discussion I'd like to engage in — though I'm not sure where or which community would be best for this — is whether online book catalogs, from both stores and public libraries, publish their books' tables of contents and whether those are searchable. Is this a technical limitation, a licensing restriction from publishers, or simply an overlooked feature? Being able to search within tables of contents would significantly improve book discovery and research workflows.

Here are some examples I know so far:

DigiTocs by the University of Bologna

DigiTocs is a service launched by the University of Bologna in 2009 that provides online access to indexes, tables of contents, and supplementary pages from books cataloged in their library system. The service works through a distributed network of participating university libraries, each responsible for digitizing and uploading pages along with OCR-generated text and metadata. The platform is integrated with the library's OPAC (online catalog), allowing users to view and search digitized indexes and tables of contents directly from catalog records (example book and its TOC).

Neural Archive

Neural Archive is the online catalog of the library maintained by Neural Magazine. For each book they review, they publish high-quality cover images, minimal metadata, and the book's TOC.

Out of context

Read J.G. Ballard's short story, The Index (1977)


This blog has no comments or webmentions, so let's continue the discussion on the fediverse; I am @raffaele@digipres.club.

October 12, 2025 12:00 AM

October 10, 2025

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-10-10: An Internship Experience With the Internet Archive as a Google Summer of Code Contributor

 


In the summer of 2025, I was selected for Google Summer of Code (GSoC), a program that introduces new contributors to open source software development. I had the opportunity to contribute to the Internet Archive, an organization I have long admired for its efforts to preserve digital knowledge for all.


Numerous open source organizations annually participate in the program as mentoring organizations (2025 mentoring organizations), and that includes the Internet Archive. As a GSoC contributor, I was mentored by Dr. Sawood Alam, Research Lead of the Wayback Machine and WS-DL alum. Over the coding period, our project focused on detecting social media content in TV news, specifically through logo and screenshot detection. My work as a contributor is documented in my previous blog post, while this post highlights the GSoC program and my experience in it.

Becoming a GSoC contributor

Becoming a GSoC contributor is open to anyone new to open source (student or non-student) who meets a few basic requirements: you must be at least 18 years old at the time of registration, be a student or a newcomer to open source, be eligible to work in your country of residence during the program, and not reside in a country currently under U.S. embargo. The application process begins by exploring project ideas listed by mentoring organizations, drafting a proposal, and submitting it to Google for review. The project ideas are published on each organization’s page, and contributors can choose one (or more) of these ideas to develop into a proposal. Alternatively, you can propose your own project idea (this is the option I chose) that may be of interest to the organization you are applying to. Contributors are encouraged to share their drafts with mentors from the organization to get feedback before submitting to Google. Once accepted, contributors spend the summer coding under the guidance of a mentor.

Working on My Project

Information diffusion on social media has been widely studied, but little is known about how social media is referenced in traditional TV news. Our project addresses this gap by analyzing broadcasts for such references by detecting social media logos and screenshots of user posts. 


My original proposal to GSoC involved training object detection and image classification models. However, we then pivoted to using large language models (LLMs), specifically ChatGPT-4o, for logo and screenshot detection. This change was worthwhile, as we realized that LLMs could perform logo and screenshot detection tasks with significantly less manual data labeling and setup than traditional machine learning approaches. It also taught me to stay flexible and adapt my methods when needed.


You can find the final work product here:

https://summerofcode.withgoogle.com/programs/2025/projects/j0CKIRCi


And a blog post on the technical details here:

https://ws-dl.blogspot.com/2025/09/2025-09-29-summer-project-as-google.html


This was my first time working with LLMs. I have learned a lot, and am still learning about creating effective prompts and integrating this model into a functional pipeline.


Beyond coding, GSoC taught me several valuable lessons. It is really important to stay flexible and to communicate regularly with your mentors. It is also crucial to prioritize your work, deferring less critical tasks to future work in order to maintain steady progress. And of course, effective time management is key, since juggling work and life requires careful planning.

The Best Part

For me, the most exciting part of GSoC was working with the Internet Archive team. I had weekly meetings with my mentors: Dr. Sawood Alam, my assigned GSoC mentor, and Will Howes, a software engineer at the Internet Archive. Will was mentoring two other GSoC students who joined the same sessions. Both mentors were very helpful, very responsive through Slack, and always offered advice whenever needed. The Internet Archive leadership, such as Mark Graham, the Director of the Wayback Machine, and Roger Macdonald, the founder of the TV News Archive, created a welcoming environment for contributors and always made sure we had the resources we needed. 

Being added to the TV News Archive guest Slack channel and invited to join the weekly TV News Archive team meetings during the Summer were great opportunities for me as a student researcher interested in this field. It was nice to observe how the team curates and preserves broadcast news content, and to learn about their ongoing projects.

Final Thoughts

GSoC was more than just a coding program - it was a huge opportunity for me to learn from great mentors and contribute to the open source community. I hope to stay involved with the Internet Archive and its team. The technical and collaborative skills I gained, especially from working with LLMs, boosted my confidence as a student researcher. Finally, being selected as a GSoC contributor was a great experience, not to mention a notable addition to my resume; I would definitely consider applying again.


~ Himarsha Jayanetti (HimarshaJ)


by Himarsha Jayanetti (noreply@blogger.com) at October 10, 2025 10:28 PM

2025-10-10: Six Years, Countless Experiments, One Framework: The Story of Multi-Eyes

In 2019, I packed my bags and flew from Sri Lanka to Virginia to begin my Ph.D. in Computer Science at Old Dominion University. I did not have a clear roadmap or any prior research experience; all I had was the hope that I would be able to figure things out along the way. After six years, I found myself diving deep into eye-tracking, human-computer interaction, and machine learning, eventually completing my dissertation on multi-user eye-tracking using commodity cameras, with the support of my advisor, Dr. Sampath Jayarathna, the NIRDS Lab, and the ODU Web Science and Digital Libraries Research Group.
 

 
 
When I started my Ph.D. at ODU, I had limited knowledge and experience in eye tracking and computer vision research. After learning about ongoing research at the lab on cognitive load using eye tracking, I was fascinated by how we could use technology to better understand humans in terms of their intentions, focus, attention, and interactions with the world. That curiosity, combined with my liking for working with hardware, eventually led me to eye-tracking research.

Early on, I realized that most eye-tracking studies focused on single users, highly controlled environments, and expensive hardware. That works for lab studies, but the real world is messy, as we experienced during our first event participation, STEAM on Spectrum at VMASC. Our demo application for eye tracking was successful for a single user in the laboratory environment, but it did not perform well in the real world. Also, since we had only one eye tracker for the demo, only one person could experience eye tracking, while the others had to wait in line away from the tracker. These problems led us to question how we could enable two or more people to interact with an eye tracker while also measuring their joint attention, which a traditional eye tracker could not do. That was when the idea for Multi-Eyes started to take shape.

 

All of my students (@NirdsLab) are at @vmasc_odu for the "STEAM on Spectrum 2019" #inclusive event for Autistic kids. Nearly 200 participants and 50+ volunteers. @WebSciDL @oducs @ODUSCI @sheissheba @DynamicMelody1 @LalitaSharkey @yasithmilinda @Gavindya2 @mahanama94 pic.twitter.com/f131qYGjkA

— Sampath Jayarathna (@OpenMaze) October 12, 2019

First, we started with the trivial approach of having a dedicated eye tracker for each user. It worked well until the users started moving, which sometimes prevented the eye trackers from capturing valid eye-tracking data, giving us incorrect values. Movement constraints and the high cost of eye trackers made the setup very expensive and difficult to use in real-life applications. While this setup is disadvantageous when participants are physically together, it worked best when they collaborated online, which we later published in CHIIR 2023, "DisETrac: Distributed Eye-tracking for Online Collaboration." 


 

Due to the limitations of this approach, mainly the need for a dedicated device for each participant, we attempted to create Multi-Eyes using low-cost, commodity cameras, such as webcams, thereby eliminating the need for specialized eye-tracking hardware. Although modern eye trackers made the process appear simple, there were numerous challenges to overcome when building Multi-Eyes.

The first challenge was developing a gaze estimation model that can identify where a person is looking under varied conditions, such as poorly lit rooms, different camera hardware, extreme head angles, and diverse facial features. To address this, we developed a gaze model that utilizes unsupervised domain adaptation techniques, providing robust gaze estimates across a wide range of environmental conditions. Additionally, we focused on achieving parameter efficiency through existing model architectures. We validated this through a series of experiments on publicly available gaze estimation datasets, with our approach and findings published in IEEE IRI 2024 (Multi-Eyes: A Framework for Multi-User Eye-Tracking using Webcameras) and IEEE IRI 2025 (Unsupervised Domain Adaptation for Appearance-based Gaze Estimation). 

Beyond gaze estimates, we had to solve the problem of mapping each user’s gaze direction onto a shared display, a commonly discussed scenario in multi-user interaction within human-computer interaction. The mapping process required transforming gaze information from the user coordinate frame into the display coordinate frame. We designed a simple yet effective learnable mapping function, eliminating the need for complex setup procedures. Our approach achieved on-screen gaze locations with horizontal and vertical gaze errors of 319 mm and 219 mm, respectively, using 9-point 9-sample calibration. Considering large shared displays, the error is sufficient and stable for gaze classification or coarse-grained gaze estimation tasks.
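As a rough illustration of what such a learnable mapping can look like, here is a generic least-squares polynomial calibration sketch in Python; this is not the actual Multi-Eyes implementation, and all numbers below are synthetic.

import numpy as np

def poly_features(yaw, pitch):
    """Second-order polynomial features of the gaze angles (one common calibration choice)."""
    yaw, pitch = np.atleast_1d(yaw), np.atleast_1d(pitch)
    return np.stack([np.ones_like(yaw), yaw, pitch, yaw * pitch, yaw**2, pitch**2], axis=1)

def fit_mapping(yaw, pitch, screen_xy):
    """Least-squares fit from calibration samples: gaze angles -> on-screen (x, y)."""
    coeffs, *_ = np.linalg.lstsq(poly_features(yaw, pitch), screen_xy, rcond=None)
    return coeffs  # shape (6, 2): one column of coefficients per screen axis

def map_gaze(coeffs, yaw, pitch):
    """Project new gaze estimates onto the shared display."""
    return poly_features(yaw, pitch) @ coeffs

# Synthetic 9-point calibration: 9 targets x 9 samples per target (81 samples).
rng = np.random.default_rng(0)
targets = np.array([[x, y] for x in (100, 960, 1820) for y in (100, 540, 980)], dtype=float)
screen_xy = np.repeat(targets, 9, axis=0)
yaw = 0.01 * screen_xy[:, 0] + rng.normal(0, 0.2, len(screen_xy))    # made-up gaze angles
pitch = 0.01 * screen_xy[:, 1] + rng.normal(0, 0.2, len(screen_xy))

coeffs = fit_mapping(yaw, pitch, screen_xy)
print(map_gaze(coeffs, yaw[:3], pitch[:3]))   # predicted screen positions for 3 samples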

By combining these approaches, we developed a prototype application that can run at ~17 gaze samples per second on commodity hardware, without utilizing GPU acceleration or a specialized installation. We replicated an existing study in the literature using a setup that traditionally requires expensive hardware, demonstrating that Multi-Eyes could serve as a viable low-cost alternative. 

Throughout the Multi-Eyes project, we contributed to advancements in the field of eye tracking through conference presentations and publications. Notably, our review paper on eye tracking and pupillary measures helped us set the requirements for Multi-Eyes, which later received the Computer Science Editor’s Pick award. We first proposed the Multi-Eyes architecture at ETRA 2022 and then refined the approach, showcasing its feasibility at IEEE IRI 2024. Along with the papers, we also published our research on gaze estimation approaches, capsule-based gaze estimation at Augmented Human 2020, parameter-efficient gaze estimation at IEEE IRI 2024, and parameter-efficient gaze estimation with domain adaptation in IEEE IRI 2025. 

Beyond the main framework, Multi-Eyes sparked several spin-off projects. Our work utilizing a dedicated eye tracker-based approach resulted in published research in ACM CHIIR 2023, IEEE IRI 2023, and IJMDEM 2024. In addition, through my work with eye trackers, I contributed to several publications on visual search patterns, published in JCDL 2021, ETRA 2022, and ETRA 2025, as well as drone navigation, published in Augmented Humans 2023.

 


 

Throughout my Ph.D., I also contributed to the research community by serving as a program committee member and a reviewer for conferences, including ACM ETRA, ACM CUI, ACM CIKM, ACM/IEEE JCDL, ACM CHI, and ACM CHIIR. In addition, I participated in various university events and summer programs, including ODU CS REU, STRS: Student ThinSat Research Camp, Trick or Research, Science Connection Day, and STEAM on Spectrum.

We are getting ready for the very first Trick-or-Research @oducs, candy bags ready, passport printed...!!! @ODUSCI @WebSciDL @odu pic.twitter.com/qCNMG9TS22

— Sampath Jayarathna (@OpenMaze) October 30, 2019

Looking back, I’m grateful that my work has had a positive impact on the broader community by advancing research in eye tracking and making the technology more accessible. After a journey of over five years, I’m starting a new chapter as a lecturer at the Department of Computer Science at ODU. While teaching is my primary role, I plan to continue my research, exploring new directions in eye tracking and human-computer interaction. 

While I have documented most of my research findings, I am adding a few tips for myself, in case I ever happen to do it again or travel through time, which someone else might find helpful. 

I am immensely grateful to my dissertation committee members and mentors: Dr. Sampath Jayarathna, Dr. Michael Nelson, Dr. Michele Weigle, Dr. Vikas Ashok, and Dr. Yusuke Yamani for their invaluable feedback, which greatly contributed to my success. I also owe my heartfelt thanks to my family, friends, and research collaborators, whose encouragement kept me going through the highs and lows of this journey. 

--Bhanuka (@mahanama94)

by Bhanuka Mahanama (noreply@blogger.com) at October 10, 2025 08:05 PM

Harvard Library Innovation Lab

Welcome to LIL’s Data.gov Archive Search

Card Division of the Library of Congress, ca. 1900–1920. Source: Wikimedia Commons.

In February, the Library Innovation Lab announced its archive of the federal data clearinghouse Data.gov. Today, we’re pleased to share Data.gov Archive Search, an interface for exploring this important collection of government datasets. Our work builds on recent advancements in lightweight, browser-based querying to enable discovery of more than 311,000 datasets comprising some 17.9 terabytes of data on topics ranging from automotive recalls to chronic disease indicators.

Traditionally, supporting search across massive collections has required investment in dedicated computing infrastructure, such as a server running a database or search index. In recent years, innovative tools and methods for client-side querying have opened a new path. With these technologies, users can execute fast queries over large volumes of static data using only a web browser.

This interface joins a host of recent efforts not only to preserve government data, but also to make it accessible in independent interfaces. The recently released Data Rescue Project Portal offers metadata-level search of the more than 1,000 datasets it has preserved. Most of these datasets live in DataLumos, the archive for valuable government data resources maintained by the University of Michigan’s Institute for Social Research.

LIL has chosen Source Cooperative as the ideal repository for its Data.gov archive for a number of reasons. Built on cloud object storage, the repository supports direct publication of massive datasets, making it easy to share the data in its entirety or as discrete objects. Additionally, LIL has used the Library of Congress standard for the transfer of digital files. The “BagIt” principles of archiving ensure that each object is digitally signed and retains detailed metadata for authenticity and provenance. Our hope is that these additional steps will make it easier for researchers and the public to cite and access the information they need over time.
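For readers unfamiliar with BagIt, here is a small sketch of the packaging and verification steps using the Library of Congress bagit-python library; the directory name and metadata values are illustrative, not LIL's actual configuration.

# pip install bagit
import bagit

# Turn an existing directory of dataset files into a bag in place: the payload
# moves into data/, and checksum manifests plus bag-info.txt metadata are written.
bag = bagit.make_bag(
    "example-dataset",                                   # illustrative directory name
    {"Source-Organization": "Example Library",
     "External-Description": "Mirror of one Data.gov dataset"},
    checksums=["sha256"],
)

# Anyone who later receives the bag can verify its completeness and fixity.
received = bagit.Bag("example-dataset")
print("complete and unaltered:", received.is_valid())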

Still frame from the Library of Congress video BagIt: Transferring Content for Digital Preservation, depicting a satellite dish transmitting waves into space.

In the coming month, we will continue our work, fine-tuning the interface and incorporating feedback. We also continue to explore various modes of access to large government datasets, for example, how we might create greater access to the 710 TB of Smithsonian collections data we recently copied. Please be in touch with questions or feedback.

October 10, 2025 08:00 PM

October 09, 2025

HangingTogether

Backlogs and beyond: AI in primary cataloging workflows

This is the first post in a short blog series on what we learned from the OCLC RLP Managing AI in Metadata Workflows Working Group. This post was co-authored by Merrilee Proffitt and Annette Dortmund.

Libraries face persistent challenges in managing metadata, including backlogs of uncataloged resources, inconsistent legacy metadata, and difficulties in processing resources in languages and scripts for which there is no staff expertise. These issues limit discovery and strain staff capacity. At the same time, advances in artificial intelligence (AI) provide opportunities for streamlining workflows and amplifying human expertise—but how can AI assist cataloging staff in working more effectively?

To address these questions, the OCLC Research Library Partnership (RLP) formed the Managing AI in Metadata Workflows Working Group earlier this year. This group brought together metadata managers from around the globe to examine the opportunities and risks of integrating AI into their workflows. Their goal: to engage collective curiosity, identify key challenges, and empower libraries to make informed choices about how and when it is appropriate to adopt AI tools to enhance discovery, improve efficiency, and maintain the integrity of metadata practices.

This blog post—the first in a four-part series—focuses on one of the group’s critical workstreams: primary cataloging workflows. We share insights, recommendations, and open questions from the working group on how AI may address primary cataloging challenges, such as backlogs and metadata quality, all while keeping human expertise at the core of cataloging.

The “Primary Cataloging Workflows” group was the largest of our three workstreams, comprising seven participants from Australia, Canada, the United States, and the United Kingdom. Participants represented institutions in primarily English-speaking countries in which libraries may lack needed capacity to provide metadata for resources written in non-Latin scripts like Chinese and Arabic.

Jenn Colt, Cornell University
Chingmy Lam, University of Sydney
Elly Cope, University of Leeds
Yasha Razizadeh, New York University
Susan Dahl, University of Calgary
Cathy Weng, Princeton University
Michela Goodwin, National Library of Australia

Motivations: shared (and persistent) needs

Working group members are turning to AI to help solve a set of familiar cataloging challenges that result from a combination of resource constraints and limited access to specific skills. These challenges include backlogs of uncataloged resources, legacy metadata in need of cleanup, and limited language and script expertise, each discussed below.

Members of the working group assessed both the capabilities and limitations of AI tools in addressing these challenges by examining specific tools and workflows that could support this work.

Increasing cataloging efficiency

Backlogs of uncataloged resources prevent users from discovering valuable materials. Even experienced, dedicated staff are unable to keep up with the volume of resources awaiting description. AI offers the potential to address this problem by streamlining and accelerating the cataloging workflow for these materials. The working group identified common sources of backlogs, including legal deposit materials, gifts, self-published resources, and resources lacking ISBNs.

Copy cataloging is critical to addressing backlog issues, but the key challenge here is to identify the “best record.” Working group participants discussed how AI could streamline these workflows by automating record selection based on criteria such as the number of holdings or metadata completeness.
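As a purely illustrative sketch (not OCLC's or any vendor's actual implementation), the kind of automated "best record" selection discussed above could be expressed as a simple scoring function over candidate records; the field names, weights, and cap below are hypothetical.

# Hypothetical "best record" scoring for copy cataloging. Candidate records
# are plain dicts; the fields and weights are illustrative assumptions only.
from typing import Dict, List

DESCRIPTIVE_FIELDS = ["title", "author", "publisher", "date", "subjects", "isbn"]

def completeness(record: Dict) -> float:
    """Fraction of desired descriptive fields that are present and non-empty."""
    filled = sum(1 for field in DESCRIPTIVE_FIELDS if record.get(field))
    return filled / len(DESCRIPTIVE_FIELDS)

def score(record: Dict) -> float:
    """Blend a capped holdings count with completeness into one ranking value."""
    holdings = min(record.get("holdings", 0), 1000) / 1000
    return 0.5 * completeness(record) + 0.5 * holdings

def best_record(candidates: List[Dict]) -> Dict:
    """Return the highest-scoring candidate for a cataloger to review."""
    return max(candidates, key=score)

Any record chosen this way would still go to a human cataloger for review, in keeping with the group's emphasis on keeping expertise in the loop.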

When original cataloging is required, AI-generated brief records for these materials can enable them to appear in discovery systems earlier, accelerating the process of making hidden collections discoverable and supporting local inventory control. This approach addresses the immediate need for discovery while allowing records to be completed, enriched, or refined over time.

Improving legacy metadata

Legacy metadata may contain errors, inconsistencies, or outdated terminology, which hinders discovery and fails to connect users with relevant resources. AI could assist with metadata cleanup and enrichment, reducing manual effort while maintaining high standards. This was an area where working group members had not experimented directly with AI tools, but they could imagine a number of use cases, such as correcting errors, reconciling inconsistencies, and updating outdated terminology.

Improving metadata quality, including reducing the number of duplicate records, has also been an area where OCLC has devoted considerable effort, including the development and use of human-informed machine learning processes, as illustrated in this recent blog post on “Scaling de-duplication in WorldCat: Balancing AI innovation with cataloging care.”

Providing support for scripts

Language and script expertise is a long-standing cataloging issue. In English-speaking countries, this manifests as difficulty describing resources written in languages that use non-Latin scripts and that are not often taught in local schools. AI tools could assist with transliteration, transcription, and language identification, enabling more efficient processing of these materials. Some tools, however, lack basic functionality or support for specific required languages. Even when AI tools confidently provide transliteration, human expertise is still very much required to evaluate AI-generated work. A library looking to AI to fill an expertise gap for these languages faces a double challenge: it cannot fully trust the AI tools, and it lacks the internal language skills to effectively evaluate and correct their work.
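As one small, hedged illustration of the language-identification piece (not a transliteration or transcription tool, and not any specific product the group evaluated), a script could flag the probable language of transcribed title text so the material can be routed to someone with the right expertise; the langdetect library and confidence threshold below are assumptions.

# Hypothetical language-identification triage using the langdetect library
# (pip install langdetect). Results still require review by a person with
# the relevant language expertise.
from langdetect import DetectorFactory, detect_langs

DetectorFactory.seed = 0  # make detection deterministic between runs

def probable_language(text: str, min_confidence: float = 0.90) -> str:
    """Return an ISO 639-1 code, or 'needs-review' when confidence is low."""
    top = detect_langs(text)[0]       # best guess, e.g. ar:0.99
    return top.lang if top.prob >= min_confidence else "needs-review"

print(probable_language("كتاب في تاريخ المكتبات"))   # likely 'ar'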

Working group members brainstormed ways to address these needs. Research libraries collect resources in dozens or even hundreds of languages to support established academic programs. Although the library may lack direct access to language proficiency, this expertise is often abundant across campus, among the students, faculty, and researchers for whose programs these hard-to-catalog resources are selected in the first place. These campus community members could help address a specific skill gap and safeguard the accuracy of AI-assisted workflows, while fostering community involvement and ensuring that humans are in the loop. In implementing such a program, libraries would need to create an engagement framework that includes rewards and incentives—such as compensation, course credit, or public acknowledgment—to encourage participation.

Open questions around the use of AI

Unsurprisingly, as with any new technology, opportunities come paired with questions and concerns. Metadata managers shared that some of their staff expressed uncertainty about adopting AI workflows, feeling they need more training and confidence-building support. Others wondered whether shifting from creating metadata to reviewing AI-generated records might make their work less engaging or meaningful.

Metadata managers themselves raised a particularly important question: If AI handles foundational tasks like creating brief records—work that traditionally serves as essential training for new catalogers—how do we ensure new professionals still develop the core skills they’ll need to effectively evaluate AI outputs?

These are important considerations as we explore the implementation of AI tools as amplifiers of human expertise, rather than replacements for it. The goal is to create primary cataloging workflows where AI manages routine tasks at scale, freeing qualified staff for higher-level work while preserving the meaningful aspects of metadata creation that make this field rewarding.

Conclusion

While not a panacea, AI offers significant potential to address primary cataloging challenges, including backlogs, support for scripts, and metadata cleanup. By adopting a pragmatic approach and emphasizing the continued relevance of human expertise, libraries can leverage AI with care to address current capacity issues, making materials available more quickly and improving discovery for users.

NB: As you might expect, AI technologies were used extensively throughout this project. We used a variety of tools—including Copilot, ChatGPT, and Claude—to summarize notes, recordings, and transcripts. These were useful for synthesizing insights from each of the three subgroups and for quickly identifying the overarching themes described in this blog post.

The post Backlogs and beyond: AI in primary cataloging workflows appeared first on Hanging Together.

by Merrilee Proffitt at October 09, 2025 03:01 PM

Journal of Web Librarianship

Mapping AI Literacy Among Library Professionals: A Cross-Regional Study of South Asia and the Middle East


by Zafar Imam Khan, Zakir Hossain, Md. Sakib Biswas, and Md. Emdadul Islam (Library, Hamdan Bin Mohammed Smart University, Dubai, United Arab Emirates; Independent Researcher, Le Bocage International School, Mauritius; Institute of Information Sciences, Noakhali Science and Technology University, Bangladesh) at October 09, 2025 02:18 AM

October 08, 2025

LibraryThing (Thingology)

Author Interview: S.J. Bennett

S.J. Bennett

LibraryThing is pleased to sit down this month with British mystery novelist S.J. Bennett, whose Her Majesty the Queen Investigates series, casting Queen Elizabeth II as a secret detective, has sold more than half a million copies worldwide, across more than twenty countries. Educated at London University and Cambridge University, where she earned a PhD in Italian Literature, she has worked as a lobbyist and management consultant, as well as a creative writing instructor. As Sophia Bennett she made her authorial debut with the young adult novel Threads, which won the Times Chicken House Children’s Fiction Competition in 2009, going on to publish a number of other young adult and romance novels under that name. In 2017 her Love Song was named Romantic Novel of the Year by the RNA (Romantic Novelists’ Association). She made her debut as S.J. Bennett in 2020 with The Windsor Knot, the first of five books in the Her Majesty the Queen Investigates series. The fifth title thus far, The Queen Who Came In From the Cold, is due out next month from Crooked Lane Books. Bennett sat down with Abigail this month to discuss the book.

The Queen Who Came In From the Cold is the latest entry in your series depicting Queen Elizabeth II’s secret life as a detective. How did the idea for the series first come to you? What is it about the Queen that made you think of her as a likely sleuth?

The Queen was alive and well when I first had the idea to incorporate her into fiction. She was someone who fascinated people around the world, and she was getting a lot of attention because of The Crown.

I was looking for inspiration for a new series, and I suddenly thought that she would fit well into the mold of a classic Golden Age detective, because she lived in a very specific, self-contained world and she had a strong sense of public service, which I wanted to explore. Her family didn’t always live up to it, but she tried! What’s great for a novelist is that everyone thinks they know her, but she didn’t give interviews, so it leaves a lot of room to imagine what she was really thinking and doing behind the scenes.

I often get asked if I was worried about including her as a real figure, and I was a bit, to start with. But then I realized that she has inspired a long line of novelists and playwrights – from Alan Bennett’s The Uncommon Reader and A Question of Attribution, to Peter Morgan’s The Queen, The Crown and The Audience, and Sue Townsend’s The Queen and I. I think they were also attracted by that combination of familiarity and mystery, along with the extraordinary life she led, in which she encountered most of the great figures of the twentieth century.

My own books are about how a very human public figure, with heavy expectations on her, juggles her job, her beliefs, her interests and her natural quest for justice. The twist is, she can’t be seen to do it, so she has to get someone else to take the credit for her Miss Marple-like genius.

Unlike many other detectives, yours is based on a real-life person. Does this influence how you tell your stories? Do you feel a responsibility to get things right, given the importance of your real-world inspiration, and what does that mean, in this context?

I do feel that responsibility. I chose Elizabeth partly because I admired her steady, reliable leadership, in a world where our political leaders often take us by surprise, and not always in a good way. So, I wanted to do justice to that.

The Queen’s circumstances are so interesting, combining the constraints of a constitutional monarch who can’t ever step out of line with the glamour of living in a series of castles and palaces. Weaving those contrasts into the book keeps me pretty busy, in a fun way. Plus, of course, there’s a murder, and only her experience and intelligence can solve it.

I made the decision at the start that I wouldn’t make any of the royals say or do anything we couldn’t imagine them saying or doing in real life. Anyone who has to behave oddly or outrageously to fit my plots is an invented character. But it helps that the royal family contained some big characters who leap off the page anyway. Prince Philip, Princess Margaret and the Queen Mother have lots of scenes that make me giggle, but that I hope are still true to how they really were. I would honestly find it much harder to write about the current generations, because their lives are more normal in many ways, and also, because we already know about their inner lives, because they tell us. The Queen and Prince Philip were the last of the ‘mythical’ royals, I think.

With a murder seen from a train, and the title The Queen Who Came In From the Cold, your book suggests both Agatha Christie and John le Carré. Are there other authors and works of mystery and espionage fiction that influenced your story?

I love referencing other writers, and someone on the train in this novel is reading Thunderball, by Ian Fleming, which came out in 1961 and deals with one of the themes that’s present in my book too, namely the threat of nuclear war. At that point, The Queen Who Came In From the Cold is very much still in the Agatha Christie mold, where a murder is supposedly seen from the train, but Fleming’s book hints at the more modern spy story that the book will become in the second half.

As well as Fleming and John le Carré, whose debut novel came out that year, I read a lot of Len Deighton when I was growing up, so I hope some of his sense of adventure is in there too. But another big influence was film. I love the comedy and graphic design of The Pink Panther, and the London-centered louche photography of Blow-Up. I asked if the jacket designer (a brilliant Spanish illustrator called Iker Ayesteran) could bring some of that Sixties magic to the cover, and I like to think he has done … even if the lady in the tiara isn’t an exact replica of the Queen.

Unlike the earlier books in your series, which were contemporaneous, your latest is set during the Cold War. Did you have to do a great deal of additional research to write the story? What are some of the most interesting things you learned?

I hadn’t realized there were quite so many Russian spy rings on the go in and around London at the time! One of my characters is based on a real-life Russian agent called Konon Molody, who embedded himself in British culture as an entrepreneur (set up by the KGB) selling jukeboxes and vending machines. According to his own account, he became a millionaire out of it before he was caught. His world was a classic one of microdots and dead-letter drops.

As a teenager, I lived in Berlin in the 1980s, when the Berlin Wall literally ran around the edge of our back garden. We were at the heart of the Cold War, but by then it was obvious the West was winning, so I didn’t personally feel under threat – although people were still dying trying to escape from East Germany to the West. I hadn’t fully realized how much more unsafe people must have felt a generation earlier. I don’t think the western world has felt so unstable since those days … until now, perhaps.

It fascinates me that Peter Sellers, who was so entertaining as Inspector Clouseau in the Pink Panther films, was also the star of Dr Strangelove, which was based on an early thriller about the threat of nuclear annihilation called Red Alert, by Peter George. That dichotomy between fear and fun seemed to characterize the early 1960s, and is exactly what I’m trying to capture in the book.

On a different note, it was a surprise to see how well Russia was doing in the Space Race. At that time, the Soviet Union was always a step ahead. Yuri Gagarin was the first person to go into orbit, and the Queen and Prince Philip were as awestruck as anyone else. When Gagarin visited the UK in the summer of 1961, they invited him to lunch at the palace and afterwards, it was Elizabeth who asked for a picture with him, not the other way around.

The Soviet success was largely down to the brilliance of the man they called the Chief Designer. His real name was Sergei Korolev, but the West didn’t find this out for years, because the Soviets kept his identity a closely-guarded secret. He was an extraordinary figure – imprisoned in the gulags by Stalin, and then brought out to run their most important space program. I’d call that pretty forgiving! Their space program never recovered after he died. I’m a big fan of his ingenuity, and he has a place in the book.

Tell us a little bit about your writing process. Do you have a particular writing spot and routine? Do you know the solution to your mysteries from the beginning? Do you outline your story, or does it come to you as you go along?

I went to an event recently, where Richard Osman and Mick Herron – both British writers whose work I enjoy – talked about how they are ‘pantsers’, who are driven purely by the relationships between the characters they create. I tried that early in my writing life and found I usually ran out of steam after about five thousand words, so now I plot in a reasonable amount of detail before I start.

I always know who did it and how, and I’ve given myself the challenge of fitting the murder mystery alongside everything the Queen was really doing at the time, so I need a spreadsheet to keep track of it all. Nevertheless, red herrings will occur to me during the writing process, and I adapt the plot to fit. I find if I know too much detail, then the act of writing each chapter loses its fun. I need to leave room for discoveries along the way.

If in doubt, I get Prince Philip on the scene to be furious or reassuring about something. He’s always a joy to write. So is the Queen Mother, as I mentioned. It’s the naughty characters who always give the books their bounce.

Her Majesty the Queen Investigates was published as part of a five-book deal. Will there be more books? Do you have any other projects in the offing?

I was very lucky to get that first deal from Bonnier in the UK. My editor had never done a five-book deal before, and I’m not sure he’s done one since! I always knew I wanted the series to be longer, though. I’ve just persuaded him to let me write two more, so book six, set in the Caribbean in 1966, will be out next year, and another one, set in Balmoral back in 2017, will hopefully be out the year after. I miss Captain Rozie Oshodi, the Queen’s sidekick in the first three books, and so do lots of readers, so it’ll be great to be in her company again for one last outing.

Tell us about your library. What’s on your own shelves?

My bookshelves are scattered around the house and my writing shed, wherever they’ll fit. I studied French and Italian at university, so there are a lot of twentieth century books from both countries. I love the fact that French spines read bottom up, whereas English ones read top down. I bought really cool blue and white editions of my favourite authors from Editions de Minuit in the 1990s and it’s lovely to have them on my shelves.

I’ve always loved classical literature, so there are plenty of Everyman editions of Jane Austen, George Eliot and Henry James, but equally, the books that got me through stressful times like exams were Jilly Cooper and Jackie Collins, so they have their place. These are the books that inspired the kind of literature I wanted to write: escapist, absorbing and fun. They’re near the travel guides, for all the real-life escaping I love to do.

I have two bookcases dedicated to crime fiction, packed with Christie, Dorothy L. Sayers, Ngaio Marsh, P.D. James, Rex Stout (Nero Wolfe was a big inspiration for the way I write the Queen and her sidekicks), Donna Leon and Chris Brookmyre. I inherited my love of the mystery genre from my mother, who has a library full of books I’ve also loved, by other authors such as Robert B. Parker and Sue Grafton, as well as her own shelf of Le Carrés. She decided to start clearing them out recently, but I begged her not to: I still love seeing them there.

Finally, my bedroom is awash with overfull shelves and teetering piles of contemporary novels and non-fiction that I really must sort out one day. Highlights include Golden Hill by Francis Spufford, which someone at my book club recommended, A Visit From the Goon Squad by Jennifer Egan and Where’d You Go Bernadette by Maria Semple. They’re all books whose inventiveness inspires me.

What have you been reading lately, and what would you recommend to other readers?

Thanks to my book club, I’ve been re-reading Jane Austen, and am reminded of what a fabulous stylist she was. But in terms of new writers, I’ve recently enjoyed The Art of a Lie by Laura Shepherd-Robinson, set in Georgian London, and A Case of Mice and Murder by Sally Smith, set in the heart of legal London at the turn of the twentieth century. Both Laura and Sally write vivid characters with aplomb, and create satisfying, twisty plots that are a joy to follow. I definitely recommend them both.

by Abigail Adams at October 08, 2025 06:59 PM

In the Library, With the Lead Pipe

The Digital Opaque: Refusing the Biomedical Object

In Brief

This essay examines the extractive practices employed in biomedical research to reconsider how librarians, archivists, and knowledge professionals engage with the unethical materials found in their collections. We anchor this work in refusal—a practice upheld by Indigenous researchers that denies or limits scholarly access to personal, communal or sacred knowledges. We refuse to see human remains in the biomedical archive as research objects. Presenting refusal as an ethical and methodological intervention that responds to the often stolen biomatter and biometrics in medical collections, this essay creates frameworks for scholars working with archival or historical materials that were obtained through violent, deceitful, or otherwise unethical means.

By Sean Purcell, Kalani Craig, and Michelle Dalmau


Introduction

There are many photographs of doctors at the turn of the twentieth century posing with dead human subjects. Medicine’s visual culture in this period is marked by a nonchalance toward the deceased subjects who constituted their research materials. Medical students posed with their anatomical cadavers (Warner 2014) (fig. 1), and doctors were framed in candid shots in ways that displayed their wet specimens (fig. 2). John Harley Warner, writing of anatomical students’ group photographs, noted how, for American doctors who often acquired their cadavers from Black graveyards, these photographs mimicked the composition of the lynching photograph:

The practices represented in photographs of this other “strange fruit” involved not just dismemberment of dead bodies but also constant threat to certain black communities of postmortem violation, actual trauma inflicted on those still living. (16)

Because these human remains were obtained prior to the codification of informed consent (Lederer 1995), and because medical science historically depended on theft as a means to forward epistemic, cultural, and monetary value (Richardson 1987; Sappol 2002; Redman 2016; Alberti 2011), there remains an open wound caused by the use of stolen human material in the creation of biomedical argument.

This black and white image shows a group of seven men standing over an anatomical cadaver. The cadaver’s body has been edited out of the image, leaving a black shape where it was. The men have words written on their coats--”Maine”, “CAN.”, “VA”, “Maine”, “W VA”, “CAN.”, “ME.”. The words “He Lived for Others But Died for Us ‘1914.’” are written on the autopsy table.
Figure 1. In the early twentieth century, medical students often posed with anatomical cadavers. These images often included written elements related to the gaining of knowledge through sacrifice (Warner, 2014). The cadaver in the foreground has been made opaque, as consent was likely not obtained from the individual pre-mortem. Photograph courtesy of the Medical Historical Library, Harvey Cushing/John Hay Whitney Medical Library, Yale University.
This black and white image shows the neurological laboratory at the Henry Phipps Institute in Philadelphia. The lab is open, with tables covered with various jars, shelves, and chemistry ephemera. In the background a scientist is working, with his back to the camera. In front of him, in the foreground are dozens of wet specimen jars. These jars contain human brains taken at autopsy, but have been erased from the image, leaving a black shape where they were.
Figure 2. A candid portrait of a doctor working in the neurology lab of the Henry Phipps Institute. In the foreground are dozens of jars filled with human brains extracted at autopsy as part of the Phipp’s Institute’s research into tuberculosis. The jars in the foreground have been made opaque, as they contain stolen human tissues. Report of the Henry Phipps Institute for the Study, Treatment and Prevention of Tuberculosis. Philadelphia: Henry Phipps Institute, 1905.

This essay describes ethical and methodological interventions developed in response to the extractivist program employed by medical scientists at the turn of the twentieth century. Our intervention, the Opaque Publisher (OP), introduces a theoretical framework that lets professionals whose work engages with stolen material choose which sections of material in their collections need to be redacted. The framework also provides readers a way to engage with these ethical decisions through a toggling interface (fig. 3). This essay is the first of two essays written for Lead Pipe on the ways digital methods afford different approaches to ethical problems. In our second essay we will go into more detail on the design-based methodology that led to the development of the OP, as well as DigitalArc, the community archiving platform from which the OP was originally built.

A GIF of a webpage. The webpage has text and images related to the history of tuberculosis at the turn of the century. A mouse moves to press on three buttons--”Transparent”, “Partially Opaque”, and “Opaque”--in the upper left hand corner of the image. When the cursor clicks on the “Partially Opaque” button, parts of the text become formatted with the strike through class, and the human patients in the images have their eyes blacked out for anonymity purposes. When the cursor clicks on the “Opaque” button, parts of the text become completely redacted, and the entire body of the patients in the images are removed. When the cursor clicks on the “Transparent” button, all of the effects of the previous buttons are removed.
Figure 3. A gif showing how users interact with images and text made opaque for the Tuberculosis Specimen. Link to the example page.

We ground our argument in a case study: a dissertation that examines biomedical extractivism in tuberculosis research at the turn of the twentieth century (Purcell 2025). Tuberculosis has been at the center of exclusionary and anti-immigrant policies employed by nations, states, and cities. These policies tend to target Black and brown populations, creating an apparatus to more easily deny immigration from those communities (Abel 2007). The disease has also been used to leverage eugenicist discourses in the United States (Feldberg 1995), and to manufacture middle- and upper-class aesthetics of health and wellness (Bryder 1988). The dissertation makes a strong case study for our digital-methods intervention because it examines how biomedical and public health professionals studied the disease, and how these research programs fit into America’s expanding biomedical and public health infrastructures.

The process of medical research, especially the research employed by medical scientists at the turn of the twentieth century, sees research subjects as valuable epistemic resources. We use the term ‘epistemic’ to refer to the philosophical tradition of epistemology—or the study of how knowledge is created—with a particular stress on implicit historical, cultural, and ideological assumptions that Michel Foucault frames in the discursive épistémè (Foucault 1994). Building on histories of anatomy that describe the commodification and exploitation of postmortem subjects, we argue that biomedical science depends on the theft of human material. These extractive methods were built out of historical practices that disregarded the autonomy of non-white communities, seeing their lives, cultures, and histories as a resource to be mined (Redman 2016; Washington 2006; Sappol 2002).

Megan Rosenbloom, in her excellent book on anthropodermic bibliopegy—or books bound in human skin—describes the problem we address in an anecdote about a book challenge brought against Eduard Pernkopf’s Topographische Anatomie des Menschen. The book was written by a Nazi scientist with illustrations that may have been drawn using the bodies of subjects killed by the Nazi regime. Describing USC’s Norris Medical Library’s decision to keep the book, while adding additional information about the history of the text, Rosenbloom writes, “if books have to be removed from a medical library because the bodies depicted in them were obtained through unethical and nonconsensual means, there might not be an anatomical text left on the shelf” (170). Central to Rosenbloom’s logic is a presumption that knowledge–medical, historical, cultural knowledge–supersedes the needs of abused historical subjects, their communities, and their descendants.

Rosenbloom’s careful attention to historical violences in the history of medicine describes a broader problem practiced by knowledge workers in medical libraries, archives, and museums. Knowledge workers are obligated to maintain and preserve these materials because of their epistemic and cultural value, in spite of their awful, nonconsensual origins. We wanted to create an ethical and methodological framework that enabled the divestment of stolen human biomatter and biometrics from institutions, whose collecting histories harmed Black, brown, and Indigenous communities (Monteiro 2023). Knowing that the majority of these research materials—subjects depicted in medical atlases, described in research reports, and whose remains have been collected and maintained in medical museums—were extracted from people who never consented to that research, we present a model that calls attention to that theft. We ask, is it possible to do research in the history of medicine that respects our interlocutors’ autonomy?

Our answer to this question is a methodological one: we argue for librarians, archivists, and knowledge workers to refuse the object. While biomedical researchers saw the materials that populated their journals, textbooks, and archives as objects, we advocate for an approach that reestablishes the human base upon which these disciplines are built. Refusing the object is a countermethod to the reductive, dehistoricizing, and decontextualizing processes that harm humans caught in biomedicine’s dragnet.

This approach builds on frameworks around refusal. Refusal is a practice described and employed by Indigenous researchers and academics working with Indigenous communities that denies academic access to personal, communal, and sacred knowledges (Simpson 2007; Tuck & Yang 2014; Liboiron 2021). In its broadest definition, refusal is a generative, socially embedded practice of saying ‘no’, akin to, but distinct from, resistance. It is a critique that is levied in different ways by different actors, circumscribed by their social and political context (McGranahan 2016). In its original contexts, refusal refers to the gestures made by research subjects to disrupt and disallow research (Liboiron 2021, 143). We argue that knowledge workers have ethical obligations to their interlocutors that require unique, case-by-case interventions (Caswell & Cifor 2016), and that sometimes these obligations force us, as Audra Simpson argues, to work through a calculus of “what you need to know and what I refuse to write in” (2007, 72). We argue for frameworks that enable knowledge workers to refuse materials that depend on the objectification of, and through that objectification the commodification of, human subjects.

Building from arts-based approaches to opacity (Blas 2014; Purcell 2022), we developed protocols—structured methods applied uniformly across our primary materials—for refusing the objectifying practices employed in the creation of our primary sources. These protocols highlighted the ways opacity would be scaffolded in a final published work, imagining how norms of anonymity and consent might be applied post hoc. For text, we redacted words where our primary sources revealed too much about their subjects (fig. 3). For images, we erased parts of people’s bodies depending on who was in the frame (fig. 4). What drove our design was a desire to scaffold the effects of refusal in ways that were obvious and intentional. We wanted to show the effects of refusal, rather than hypothesize about what might be lost in the process.

There are three images lined up next to one another. They are the same image, of a man looking at the camera, with diagrams drawn on his chest. These diagrams are for doctors to become better at diagnosis. The left-most version of this image is as it is found in the primary source. The center image has the man’s identity protected, with a black bar placed across his eyes. The right most image has the man’s body removed, leaving a black shape in his place. The diagrams remain unedited in this removal.
Figure 4. For images, a labor-intensive step-by-step omission was practiced in order to protect patients. Reading from left to right, the image becomes blacked out based on the level of opacity applied when accessing the site. For the dissertation project, partial opacity was defined as matching contemporary needs for anonymity in research; full opacity scrubbed patient’s bodies from the images, but tried to maintain any material produced by researchers. A more detailed description of the opacity process is described in the dissertation’s website. Crofton, W. M.. Pulmonary Tuberculosis: Its Diagnosis, Prevention and Treatment. Philadelphia: P. Blakiston’s Son & Co., 1917.

For this essay, we will begin with a discussion of objectifying practices in biomedical epistemics, before talking through refusal-as-method. We will finish with a discussion of ethics audits, which can be applied late in a research project using the concepts we have outlined in this article.

Pathology’s Objects

One of the messier epistemic contradictions which enables the collection of biomatter, biometrics, images, and histories from patients is that the process of collection transforms the patient or subject into an object. Object, as we use the term, refers to a representation of phenomena used in scientific research that has been divorced from its historical, cultural origin (Daston & Galison 2007, 17). Biomedical research depends on multiple objectifying practices, the most famous of which is known as the clinical gaze. As described by Michel Foucault in The Birth of the Clinic, this visual practice refers to the ways doctors are trained to see the difference between a patient’s body and an assumed ‘normal’ human anatomy as disease. The first issue with this visual practice is that it imagines a single supposedly perfect human anatomy (the body of a cis, heterosexual, white, nondisabled man), and that it treats anyone whose body differs from this constructed normal (in sexuality, gender, race, or ability) as diseased.

The second issue with the practice comes from the clinical method. This method ties case histories with postmortem examination: patients would visit a clinic, doctors would track their symptoms, collect relevant information—their family histories, the progression of the disease—and then, if the patient died under their care, doctors would try to link the patient’s symptoms to phenomena found at autopsy. A good example of this practice can be seen in the work of René Laennec, a French doctor who practiced clinical research in the post-revolutionary period (Foucault 1994, 135-36). Laennec observed tubercles—hard, millet sized growths—in the lungs of the consumptive patients he autopsied, and he connected the symptoms experienced by these patients to these pathologies (fig. 5).

Figure 5. This illustration comes from René Laennec’s research into diseases of the chest. Underneath the opaqued redaction applied by the research team are images that show the formations of tubercles in autopsied lungs from non-consented patients. Treatise on the Diseases of the Chest in Which they are Described According to their Anatomical characters and their Diagnosis Established on a New Principle by Means of Acoustick Instruments, with plates. Translated by: Forbes, John. Philadelphia: James Webster, 1823. Image courtesy of the New York Academy of Medicine.

The clinical gaze has long been described as one that alienates patients because doctors are only trained to see them as nests of symptoms. What is important to remember is that the clinical method, as it is described by Foucault, is similarly alienating, insomuch as it sees patient symptoms as data to be gathered, analyzed, and extrapolated for medical progress. Even case histories, filled as they are with intimate details of an individual’s life, are described in such a way as to flatten that life into possible causes that may be examined in the abstract for future biomedical argument.

What Foucault neglects to mention, but which anatomical historians have made clear, is that developing in the same period was a commodification of human remains in medical contexts. Ruth Richardson has linked the popularization of the Parisian anatomical method, which required medical students to anatomize a cadaver in their training, to the rise of graverobbing in England and Scotland in the late eighteenth and early nineteenth centuries. In the same period, medical schools were deemed more or less prestigious based on the scale and quality of their medical museums and specimen collections (Alberti 2011). The production of a valuable, commodifiable object went hand-in-hand with the epistemic framework that dehumanized patients in diagnosis. The creation of a pathological specimen—a representational object that purports to show some aspect of a disease’s progression—splits the disease from the human subject whose life, death, and afterlife was necessary in the collection of that phenomenon. In denying this connection, biomedical argument enables a specimen to stand-in as an objective representation for observation and study.

This objectification extends beyond medical contexts. The problem that arises is that to engage with these historical materials as academics, even as practitioners of subjective, qualitative research, we have to approach them as research objects—as representational materials that describe the phenomena we critique. To refuse the object in the history of medicine is to refuse to decouple the biomedical object from the subject from whose body this specimen was taken. It is a refusal and denial of the material’s ultimate epistemic value, both for the sciences but also for humanistic, historical, or qualitative research.

Opaque Protocols

Our methods to refuse the objectifying practices in medicine began with a speculative approach to the history of medicine. We use “speculative” to refer to the methodological interventions into archival research argued for by Saidiya Hartman. These methods ask for historians to read against the grain of the archive, and to see the archival omissions as being part and parcel of broader carceral, colonial histories (2008). Krista Thompson has built on this scholarship to advocate for “speculative art history” which practices historical fabulation—the manipulation of archival materials—to imagine histories that otherwise would never be seen (Thompson 2017; Lafont et al. 2017).

The speculative historical method enables us to intervene on primary materials in critical, reparative ways. It allows us to shift our understanding of the primary document from a concrete, essential thing to something that comes from structural practices that denied the humanity of certain subjects. By applying opacity—these methods of conspicuous, obdurate erasure—to primary sources, we reassert the centrality of the patient in our argument (fig. 4). This term, opacity, derives from Édouard Glissant’s critique of western academic essentialism. To be opaque is to refuse access to a phenomenon’s root and the totalitarian possibility afforded by control of that essential character (1997, 11-22; 189-94).

We extend this practice beyond the platform—using this method in the images we have supplied for this article—as a way to continue the same critique: Were these images necessary for our argument? Are our claims lessened if they are intentionally marked or changed?

Refusing the Object

Our approach to opacity came about from a nagging discomfort we experienced when engaging with materials in the history of medicine. So much of medicine’s violence has been practiced in the open (Washington 2006, 12), and its harms are felt as a “bruise” by the communities whose bodies were subjected to research and ignored by the institutions that benefited from those practices (Richardson 1987, xvi). Taking primary evidence at face value, accepting the harms, and deeming them necessary for revelatory research felt hypocritical, especially because academic research so often only benefits those doing the research and not their subjects (Hale 2006).

The opaque protocols we used to redact images and text (figs. 3, 4, 6 show multiple layers of opacity) were also moments of refusal—of denying the reader access to stolen, coerced, and unethically extracted materials produced in biomedical research. Where refusal is a mode adopted by research interlocutors (Liboiron 2021, 143), it is also a tool for knowledge workers working in obligation to the people and communities who inform their research (Simpson 2007). For Max Liboiron, in the context of community peer review, refusal “refers to ethical and methodological considerations about how and whether findings should be shared with and within academia at all” (2021, 142). Premised on this idea is the realization that not all knowledge needs to be known within academic systems. As Liboiron writes, “[g]iving up the entitlement and perceived right to data is a central—the central!—ethic of anticolonial sciences” (Ibid., 142, footnote 96).

Refusal, for us, is predicated on an understanding that our current knowledge infrastructure depends on extraction enacted through theft and hidden in plain sight. Roopika Risam, in her keynote for DH2025, notes that the digital humanities’ long quest to make collections accessible has its own ideological basis. She writes,

Because access without accountability risks becoming a kind of digital settler colonialism: where archives are opened but not contextualized, where stories are extracted from communities but not returned to them, where knowledge circulates but the people who shaped it are left behind. It is access that takes, not access that gives back. (2025)

There is a broader need to acknowledge that the materials we maintain, use, and reproduce are so defined by their extraction—thefts of people’s biomatter, their history, and their secrets. Refusal is to say ‘no’ to this extraction, and to critique why we reveal materials in such ways.

Knowing that biomedical materials are linked to human subjects with cultures and histories, we need to acknowledge that in order to respect a community or patient’s consent we may have to lose those materials. We refuse the processes that turn people into objects. We refuse to place the value these materials offer our institutions and disciplines above the people whose bodies were made into valuable epistemic resources.

Ethics Audits

The application of opacities—obvious redactions of text and images—to the dissertation, The Tuberculosis Specimen, occurred at the end of the research and writing process. It was only after each chapter had been approved by the dissertation chair that images and text would be made opaque. Every image had to be reviewed for content, and if an image included sensitive material—human subjects undergoing treatment, children who could not have consented to having their image taken, or human remains—it would need to be edited multiple times for the final published website (figs. 4, 6, 7 illustrate this editing process). Primary quotations were also reviewed for sensitive materials. For the final publication, the text that was deemed unethical, and which needed opacity applied, was changed in the final markdown (.md) file uploaded to the site. Span classes, or hypertext markup language (HTML) wrappers that flag certain stylistic or functional changes on the final site, were added to the text to enable its redaction.
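To make that last step concrete, here is a minimal sketch, under an invented inline marker syntax, of how flagged passages in a markdown file might be wrapped in span classes before upload; it is not the Opaque Publisher's actual code, only an illustration of the kind of wrapping described above.

# Hypothetical sketch: wrap flagged passages with an HTML span class so the
# published site's styles and toggle buttons can redact them. The
# {{opaque:...}} marker syntax is invented for this illustration.
import re

MARKER = re.compile(r"\{\{opaque:(.*?)\}\}", re.DOTALL)

def apply_opacity_spans(markdown_text: str) -> str:
    """Replace inline markers with spans that the site's CSS/JS can target."""
    return MARKER.sub(r'<span class="opaque">\1</span>', markdown_text)

example = "The patient, {{opaque:a 34-year-old admitted in 1904}}, was photographed."
print(apply_opacity_spans(example))
# -> The patient, <span class="opaque">a 34-year-old admitted in 1904</span>, was photographed.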

The original goal was never to actually erase the materials. Many of the images that were used in that project, and which we shared in this essay, were obtained through HathiTrust—a digital library made up of many university collections which are, and will remain, available for academic research. The speculative turn was a means of thinking through the argumentative need for such materials. It helped us reconsider our roles as scholars and stewards, digital humanists and critical practitioners.      

In preparing the dissertation to be published using the OP, there emerged two parallel reflective practices: both the text and image had to be made opaque. This process required a great deal of labor, editing multiple versions of each image (fig. 7). This labor was a boon insomuch as it afforded time and effort to think through each image: What is being shown? Who is central in the image? Who has agency and who does not? What needs to be erased? And what needs to be maintained (fig. 7)?

Three versions of the same image that have been placed side by side. Each image shows a scene from Bellevue Hospital's tuberculosis clinic. Patients are waiting to be seen while three medical professionals appear to be at work. From left to right there are different levels of erasure applied to the images: the leftmost is as it appears in the primary document. In the center image, patients have been granted some anonymity through the removal of their eyes from the image. In the rightmost image the bodies of patients have been removed.
Figure 6. As we edited this image from Bellevue Hospital’s tuberculosis clinic, we became aware of small details that changed our understanding of the image. What is the nurse at the center of the photograph doing? Was it a choice to frame patients with their backs turned? For each level of opacity, we had to consider who was in the image, and make decisions of who we would remove. Image courtesy of the New York Academy of Medicine.
A photograph of a medical exam room in the early twentieth century. A doctor stares at the camera from behind his desk. On the right side a nurse is collecting the weight of a patient. The patient has been removed from the image.
Figure 7. The framing in this photograph similarly gave us pause. Is the doctor’s expression that way because he did not want a camera in the examining room? Why does the doctor return the camera’s gaze while the patient does not? Crowell, F. Elisabeth. The Work of New York’s Tuberculosis Clinics. 1910. Image courtesy of the New York Academy of Medicine.

The process of digitally editing these photos, of cutting away bodies with the help of image editing tools, was also a process of touching upon them, enabling a haptic relationship between ourselves and the primary source. As Tina Campt argues, holding archival materials, grasping them in our hands, forces us “to attend to the quiet frequencies of austere images that reverberate between images, statistical data, and state practices of social regulation” (90). The process of making images and text opaque was a laborious one, not a rote one like cropping an image or changing a .tiff into a .jpg. It was a moment of reflection and communion. It let us touch upon the lives of our interlocutors, to say: I see you. You did not want to be here.

This work enabled us to better understand the practices that captured subjects within academic research. Our conclusions regarding the continuity between subject and object, between patient and stolen remains, are the result of this hands-on process.

From this experience, we advocate for the use of a research audit. This is a moment of retrospective reflection that occurs at the end of a research program but before publication. During a research audit, a researcher or team of researchers review their work. They conduct a close reading of their evidentiary materials with the goal to tease apart the epistemic assumptions in their research that reduce, dehumanize, or alienate their human interlocutors. In this phase of the work, researchers ask, whose lives, deaths, and afterlives are integral to my research? Why am I using them? And who benefits from this work? Am I working in obligation with those who make my own research possible?

Obligation, as we use it, is indebted to Indigenous axiology, epistemology, ontology, and methodology (Wilson 2008), particularly to the kind of relationships it calls for between researchers and subjects. Kim TallBear, describing these relationships, reminds us that we are obligated to care for everyone who our research touches upon, both those from subaltern groups and those from powerful positionalities (2014). Our research is not our own, but a collaboration that ties us to the people, institutions, histories, and moments that inform our arguments.

This end-of-project reflection helps us attend to everyone who is entangled in our research. Importantly, it occurs after completing research, and it is intentionally separate from the requirements of institutional review board (IRB) approval. In an ethics audit, researchers review their work, and mark the finished product in ways that show their fraught relationship between their jobs as knowledge workers and their obligations to their interlocutors. The opacity protocols developed for the OP and The Tuberculosis Specimen were a means of showing that the evidentiary requirements for a completed dissertation conflicted with an ethics of care. They were designed to make clear to the reader that 1) as scholars we have to show our work, and 2) this practice of showing is often at the expense of those whose lives and deaths are entangled in our research programs.

The ethics audit is a moment to embrace hypocrisy as a critical method, acknowledging that knowledge work is always partial, contested, and conflicted. This idea is built out of the feminist, anticolonial approach to ethics described by Max Liboiron and their colleagues. Seeing a need to navigate the complex, impossible, overlapping, and contradicting ethical demands in a research project, they write,

our obligations and relations are often compromised, meaning we are beholden to some over others, and reproduce problematic parts of dominant frameworks while reproducing good relations at other scales. Compromise is not a mistake or a failure—it is the condition for action in a diverse field of relations. (137-38)

The opaque methodology we described above was premised on an assumption that we cannot do ethically perfect research. We instead chose to do the best research we could, while considering the historical, cultural, and epistemic violence that we addressed and within which we are enmeshed (Caswell & Cifor 2016). We inhabit a hypocritical position by design, revealing and concealing in the same breath.

Conclusion

This intervention was developed to advocate for the return of stolen materials in the history of medicine. In working on this project, we found ourselves in a double bind: we have built an environment for more and new scholarship, and find ourselves arguing against that kind of output. Caring for the lives, deaths, and afterlives of research subjects means not using their bodies, histories, and images to further our own agenda, and yet, in this essay, we have. We wanted to show our work, and to convey the importance of these problems, but realize that our evidence is, itself, the problem.

Refusal, in academic contexts, is multi-leveled in practice. Our interlocutors and subjects can refuse us. We knowledge workers can also refuse (Simpson 2007; Liboiron 2021). Academic refusal does not have a one size fits all model. It is an imperfect patchwork assembled through care, attention, and practice. The approach to opacity outlined in this essay is flawed by design. We write from institutions with their own histories of collection and exploitation.

Modes and methods of refusal do not have to be clean, or perfect, nor do they foreclose the creation of knowledge. To that end, we developed the OP as a platform that refuses and reveals, in an attempt to make these contradictions more visible. The process forced us, as knowledge workers, to more carefully assess our materials, and to work in obligation to the people whose lives, deaths, cultures, and histories are necessary to making our arguments.


Acknowledgements

We would like to thank Emily Clark, Vanessa Elias, and all of our colleagues at the Institute for Digital Arts and Humanities at Indiana University Bloomington for their help at different times during the research process. We are also thankful to Marisa Hicks-Alcaraz who assisted in the early phases of our research. This article was vastly improved through a generous and generative open-peer review process. Thanks to Roopika Risam, Jessica Schomberg and Pamella Lach for their constructive feedback. This research was made possible thanks to funding from the New York Academy of Medicine, the Center for Research on Race Ethnicity and Society, and with support from the American Council of Learned Societies’ (ACLS) Digital Justice grant program.


This essay is the first in a two-part series of articles developed for Lead Pipe. Where this essay focuses on the theoretical grounding of our critique, our follow-up essay will describe and detail the specific technological and methodological approaches we developed while creating the Opaque Publisher (OP) and DigitalArc. Together these essays will show how ethical frameworks and methodologies can be produced through a design-based approach to research.


References

Abel, Emily K. 2007. Tuberculosis & the Politics of Exclusion: A History of Public Health & Migration to Los Angeles. Rutgers University Press.

by Sean Purcell at October 08, 2025 12:38 PM

October 06, 2025

Ed Summers

Quotation

A monk asked Chao Chou, “‘The Ultimate Path has no difficulties–just avoid picking and choosing.’ As soon as there are words and speech, this is picking and choosing. So how do you help people, Teacher?”

Chou said, “Why don’t you quote this saying in full?” The monk said, “I only remember up to here.”

Chou said, “It’s just this: ‘This Ultimate Path has no difficulties–just avoid picking and choosing.’” (Cleary, 2005, p. 337)

Cleary, T. (2005). The Blue Cliff Record. Boston: Shambhala.

October 06, 2025 04:00 AM

October 03, 2025

John Mark Ockerbloom

Will universities let Trump dictate what their libraries can do?

As has now been widely reported, the White House has sent a number of universities, including the one I work at, a set of terms it wants them to agree to, which indicate that not doing so may mean they “forego federal benefits”. It’s not entirely clear what criteria were used to select the universities, though I suspect in my university’s case it may have had something to do with its recent willingness to give in to earlier demands from the Trump regime when it looked like the only community members they’d have to sell out were their transgender student athletes.

Now, as Martin Niemöller’s readers could have predicted, they’re coming back for more. As I write this, I’ve heard no word from our university administration, either in response or acknowledgement, but we also didn’t hear a lot from them before they made their previous deal with the White House. (Another university’s board chair, though, suggested eagerness to comply.)

I am happy to see that a number of our faculty have been quick to call attention to the proposal’s threats to the academic freedom it claims to champion. Notable early responses include the AAUP-Penn Executive Committee’s statement (with its accompanying petition that Penn community members can sign) and Professor Jonathan Zimmerman’s op-ed in the Philadelphia Inquirer, which notes some of the traps the agreement would set for universities that sign on to it.

But it isn’t just university faculty and research centers that would be muzzled by the agreement. The libraries would be too. That’s the implication of section 4 of the proposal, which mandates that “all of the university’s academic units, including all colleges, faculties, schools, departments, programs, centers, and institutes” comply with what the White House calls “institutional neutrality”. University libraries are among those centers, and the proposal says they would have to “abstain from actions or speech relating to societal and political events except in cases in which external events have a direct impact upon the university.”

Academic libraries are full of speech relating to societal and political events that don’t have a “direct impact on the university”. It’s obviously in many of the books in our collection, which deal with societal and political events of all kinds. But it’s also in what we do to build our collections, put them in context, and invite our community to engage with them. It’s in the exhibits we create, the web pages we publish, the events we host, and the speakers we invite. Much of it is usually not particularly controversial; I’ve heard no protests about our Revolution at Penn? exhibit, for instance. But exhibits honestly dealing with revolution cannot avoid talking about political events, and while that might be welcome when they discuss how Revolutionary leaders fought for America’s freedom, we’ve seen how the White House reacts when they also discuss how they denied some Americans’ freedom. (I’ll note that a similar subject is also addressed in another exhibit our library hosts.)

The proposal also calls for “transforming or abolishing institutional units that purposefully… belittle… conservative ideas”. Many of the recent calls to ban books in US libraries and schools are the ideas of self-proclaimed conservatives, and libraries of all kinds speak out against these “societal and political” events. To date, most American research libraries have not yet been directly impacted by these bans, which have largely been imposed on public and K-12 school libraries. But they still have every right to object to them, and this proposal could easily be used to chill such objections. Indeed, much to my chagrin, even without this agreement my university’s library has already taken down online statements championing other important library values, out of concern over government reaction. I hope the statements will return online before long, but agreeing to the White House’s new terms would increase, rather than reduce, an already unacceptable expressive chill.

Research libraries also cannot assume their own collections will be safe from censorship, should their universities sign on to the White House’s proposal. Recently a controversial Fifth Circuit court decision upheld a book ban in part by accepting an argument that “a library’s collection decisions are government speech”— which is to say, official speech. The White House could use this argument to interfere with collection decisions they also consider to be official institutional speech on societal and political events, should a library’s sponsoring institution sign on to this agreement.

A university might argue that some of these restraints on library activities and collections aren’t a reasonable interpretation of the terms of White House proposal. But the agreement takes the decision on what’s reasonable out of the university’s hands. Instead, “adherence to this agreement shall be subject to review by the Department of Justice”, which has the power to compel the return of “all monies advanced by the U.S. government during the year of any violation”, large or small, whether in the library or elsewhere. The Department of Justice is not an agency with particular expertise in education, librarianship, or research. And it’s also no longer an agency independent of the White House, and a number of commentators (including some former GOP-appointed officials) have noted that it is now carrying out “vindictive retribution” against Donald Trump’s enemies.

Academic libraries are often called “the heart of the university” because of how their collections, spaces, and people sustain the university’s intellectual life. As I’ve shown above, both the terms of the White House’s proposed agreement and its context threaten to cut off the free inquiry, dialogue, and innovation that our libraries sustain. Universities that accept its extreme demands even as a basis for negotiation, rather than completely rejecting them, risk being distracted into haggling over the shape of the noose they are asked to get into. They should refuse the noose outright.

by John Mark Ockerbloom at October 03, 2025 09:39 PM

Meet the people behind the books

Today I’m introducing new pages for people and other authors on The Online Books Page. The new pages combine and augment information that’s been on author listings and subject pages. They let readers see in one place books both about and by particular people. They also let readers quickly see who the authors are and learn more about them. And they encourage readers to explore to find related authors and books online and in their local libraries. They draw on information resources created by librarians, Wikipedians, and other people online who care about spreading knowledge freely. I plan to improve on them over time, but I think they’re developed enough now to be useful to readers. Below I’ll briefly explain my intentions for these pages, and I hope to hear from you if you find them useful, or have suggestions for improvement.

Who is this person?

Readers often want to know more about the people who created the books they’re interested in. If they like an author, they might want to learn more about them and their works– for instance, finding out what Mark Twain did besides creating Tom Sawyer and Huckleberry Finn. For less familiar authors, it helps to know what background, expertise, and perspectives the author has to write about a particular subject. For instance, Irving Fisher, a famous economist in the early 20th century, wrote about various subjects, not just ones dealing with economics, but also with health and public policy. One might treat his writings on these various topics in different ways if one knows what areas he was trained in and in what areas he was an interested amateur. (And one might also reassess his predictive abilities even in economics after learning from his biography that he’d famously failed to anticipate the 1929 stock market crash just before it happened.)

The Wikipedia and the Wikimedia Commons communities have created many articles, and uploaded many images, of the authors mentioned in the Online Books collection, and they make them freely reusable. We’re happy to include their content on our pages, with attribution, when it helps readers better understand the people whose works they’re reading. Wikipedia is of course not the last word on any person, but it’s often a useful starting point, and many of its articles include links to more authoritative and in-depth sources. We also link to other useful free references in many cases. For example, our page on W. E. B. Du Bois includes links to articles on Du Bois from the Encyclopedia of Science Fiction, the Internet Encyclopedia of Philosophy, BlackPast, and the Archives and Records center at the University of Pennsylvania, each of which describes him from a different perspective. Our goal in including these links on the page is not to exhaustively present all the information we can about an author, but to give readers enough context and links to understand who they are reading, and to encourage them to find out more.

Find more books and authors

Part of encouraging readers to find out more is to give them ways of exploring books and authors beyond the ones they initially find. Our page on Rachel Carson, for example, includes a number of works she co-wrote as an employee of the US Fish and Wildlife Service, as well as a public domain booklet on her prepared by the US Department of State. But it doesn’t include her most famous works like Silent Spring and The Sea Around Us, which are still under copyright without authorized free online editions, as are many recent biographies and studies of Carson. But you can find many of these books in libraries near you. Links we have on the left of her page will search library catalogs for works about her, and links on the bottom right will search them for works by her, via our Forward to Libraries service.

Readers might also be interested in Carson’s colleagues. The “Associated authors” links on the left side of Carson’s page go to other pages about people that Carson collaborated with who are also represented in our collection, like Bob Hines and Shirley Briggs. Under the “Example of” heading, you can also follow links to other biologists and naturalists, doing similar work to Carson.

Metadata created with care by people, processed with care by code

I didn’t create, and couldn’t have created (let alone maintained) all of the links you see on these pages. They’re the work of many other people. Besides the people who wrote the linked books, collaborated on the linked reference articles, and created the catalog and authority metadata records for the books, there are lots of folks who created the linked data technology and data that I use to automatically pull together these resources on The Online Books Page. I owe a lot to the community that has created and populated Wikidata, which much of what you see on these pages depends on, and to the LD4 library linked data community, which has researched, developed, and discussed much of the technology used. (Some community members have themselves produced services and demonstrations similar to the ones I’ve put on Online Books.) Other crucial parts of my services’ data infrastructure come from the Library of Congress Linked Data Service and the people that create the records that go into that. The international VIAF collaboration has also been both a foundation and inspiration for some of this work.

These days, you might expect a new service like this to use or tout artificial intelligence somehow. I’m happy to say that the service does not use any generative AI to produce what readers see, either directly, or (as far as I’m aware) indirectly. There’s quite a bit of automation and coding behind the scenes, to be sure, but it’s all built by humans, using data produced in the main by humans, who I try to credit and cite appropriately. We don’t include statistically plausible generated text that hasn’t actually been checked for truth, or that appropriates other people’s work without permission or credit. We don’t have to worry about unknown and possibly unprecedented levels of power and water consumption to power our pages, or depend on crawlers for AI training so aggressive that they’re knocking library and other cultural sites offline. (I haven’t yet had to resort to the sorts of measures that some other libraries have taken to defend themselves against aggressive crawling, but I’ve noticed the new breed of crawlers seriously degrading my site’s performance, to the point of making it temporarily unusable, on more than one occasion.) With this and my other services, I aim to develop and use code that serves people (rather than selfishly or unthinkingly exploiting them), and that centers human readers and authors.

Work in progress

I hope readers find the new “people” pages on The Online Books Page useful in discovering and finding out more about books and authors of interest to them. I’ve thought of a number of ways we can potentially extend and build on what we’re providing with these new pages, and you’ll likely see some of them in future revisions of the service. I’ll be rolling the new pages out gradually, and plan to take some time to consider what features improve readers’ experience, and don’t excessively get in their way. The older-style “books by” and “books about” people pages will also continue to be available on the site for a while, though these new integrated views of people may eventually replace them.

If you enjoy the new pages, or have thoughts on how they could be improved, I’d enjoy hearing from you! And as always, I’m also interested in your suggestions for more books and serials — and people! — we can add to the Online Books collection.

by John Mark Ockerbloom at October 03, 2025 04:56 PM

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-06-10: Comparing the Archival Rate of Arabic and English News Stories Published Between 1999 and 2022

Aljazeera Arabic Timemap

About 0.5% of websites publish their content in Arabic, ranking it 20th among languages on the web; however, Arabic is the 6th most spoken language in the world, at 3.4% of speakers. A considerable portion of Arabs live in English-speaking countries. For example, Arabs make up roughly 1.2% of the U.S. population. Some of them, mainly first generation, are able to consume news in Arabic in addition to English. Second, third, and fourth generation Arabs might be interested in the Arabic narrative of news stories, but they prefer English since it is their first language. In this post, we present a quantitative study of the archival rate of news webpages published in Arabic compared to news pages published in English by Arabic media from 1999 to 2022. We reveal that, contrary to the general conjecture that web archives favor English webpages, the archival rate of Arabic webpages is increasing more rapidly than the archival rate for English webpages.

The Dataset

Our dataset consists of 1.5 million multilingual news story URLs, collected in September of 2022 from the sitemaps of four prominent news websites: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. Using sitemaps yielded the largest number of story URLs; I examined multiple methods to fetch URLs, including RSS, Twitter, GDELT, web crawling, and sitemaps. We selected a sample of our dataset based on the median number of stories published each day by year: the day whose story count equals that year's median daily count represents the year. For example, because the median number of stories published per day in 2002 is 93, we selected the stories published on that day as the sample representing 2002. For all 23 years we studied, the median number of stories published per day is very close to the mean for the year. Our sample contains 4116 URIs to news stories (2684 Arabic and 1432 English). The dataset is available on GitHub.
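To make the sampling procedure concrete, here is a minimal Python sketch (not the code used in the study; the (url, date) data layout is hypothetical) that, for each year, keeps the stories from the day whose story count is closest to that year's median daily count:

```python
# A minimal sketch of median-day sampling, assuming a hypothetical list of
# (url, publication_date) pairs; not the code used in the study.
from collections import defaultdict
from datetime import date
from statistics import median

def sample_by_median_day(stories: list[tuple[str, date]]) -> list[str]:
    """For each year, return the story URLs published on the day whose
    daily story count is closest to that year's median daily count."""
    per_day: dict[date, list[str]] = defaultdict(list)
    for url, day in stories:
        per_day[day].append(url)

    sample: list[str] = []
    for year in sorted({d.year for d in per_day}):
        days = {d: urls for d, urls in per_day.items() if d.year == year}
        med = median(len(urls) for urls in days.values())
        # the day whose count is closest to that year's median daily count
        chosen = min(days, key=lambda d: abs(len(days[d]) - med))
        sample.extend(days[chosen])
    return sample
```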

Our dataset, collected in September of 2022, consists of 1.5 million news stories in Arabic and English published between 1999 and 2022. We found that 47% of stories published in Arabic were not archived. On the other hand, only 42% of the stories published in English were not archived. However, the archival rate of Arabic stories has increased from 24% to 53% from 2013 to 2022. Conversely, the archival rate of news stories published in English only increased from 47% to 58%. For Arabic webpages, our results are similar to those from a study published in 2017 where Arabic webpages were found to be archived at a rate of 53% for a different dataset which consists of general Arabic web pages from websites directories including DMOZ, Raddadi, and Star28 (defunct). There is a notable increase in the percentage of archived pages from Arabic websites in the last 10 years. 

We discovered that 47% of English news stories published between 1999 and 2013 were archived. This differs from a 2017 study (with a different dataset), which found that 72% of English webpages were archived. It is possible that the discrepancy comes from the fact that our dataset only included English news stories published by Arabic media, while their dataset consisted of general English web pages drawn from the DMOZ websites directory.

58% of English news stories between 1999 and 2022 in our dataset were archived. While there is an increase in the archival rate for English pages, it is not as large as the increase for Arabic ones; for English news stories, the increase could be considered normal/expected over a 10-year timeframe. It is worth mentioning that, since websites have used more and more JavaScript over the last 10 years, archived mementos have more missing resources such as images and other multimedia. The increase is therefore an overall improvement, but there is a chance that less content per page is captured in recent years. We did not study missing resources in the archived mementos we found, and cannot confirm whether or not missing resources are still on the rise in archived web pages.

Archival rate of Arabic and English news story URIs

Category             Arabic URIs   English URIs
URIs queried         2684          1432
URIs archived        1435          834
URIs not archived    1249          598
Archival rate        0.53          0.58

While we were sampling from our dataset, we noticed an increase in the median number of Arabic stories published per day for each year. The increase in the number of collected stories over time is expected, given that news outlets have moved towards publishing on the web over the last 20+ years.

The lower number in the following figure for 2022 is due to our dataset only spanning stories published between January 1999 and September 2022.

The number of collected Arabic stories per day (median)

We could not observe a consistent increase or decrease in the number of published stories in English per day (median) for each year because Arab News did not include any stories published after 2013 in its sitemap. Only Aljazeera English, in our dataset, included stories published after 2013 in its sitemap. The other two news websites, Aljazeera Arabic and Alarabiya, publish news in Arabic.

The number of collected English stories per day (median)

For Arabic news stories published on the median day in our dataset, nothing was archived for 1999, 2000, and 2004. The decision to sample using the median day for the number of stories published per day each year was based on the median being very close to the mean number of stories published per day. Moreover, using the median day, we were able to obtain a relatively small sample, 4116 URIs, that spans and represents 23 years' worth of news stories (1.5 million URIs) from four news networks in two languages, which would otherwise not have been feasible to study.

The min, max, median, and mean for the number of collected stories' URIs each day by year

We found that there was little increase in the Arabic webpages' archival rate until 2010, and the rate fluctuates after 2013, but it remained above 40% from 2014 to 2022. Overall, the increase in the archival rate of Arabic news webpages is significant over the last 20 years.

Arabic webpages archival rate by year

For English news stories, nothing was collected for 1999 and 2000 because these news outlets had little to no presence on the web in those years. We noticed even more fluctuation in the archival rate for English webpages, but less of an overall increase than for Arabic webpages.

English webpages archival rate by year

We measured the archival rate for Arabic webpages in our dataset by web archive to find each archive's contribution to the archiving of these URIs. Using MemGator to check whether the collected news stories were archived by public web archives (a sketch of such a query follows the list), we studied the following archives:

1. waext.banq.qc.ca: Libraries and National Archives of Quebec
2. warp.da.ndl.go.jp: National Diet Library, Japan
3. wayback.vefsafn.is: Icelandic Web Archive
4. web.archive.bibalex.org: Bibliotheca Alexandrina Web Archive
5. web.archive.org.au: Australian Web Archive
6. webarchive.bac-lac.gc.ca: Library and Archives Canada
7. webarchive.loc.gov: Library of Congress
8. webarchive.nationalarchives.gov: UK National Archives Web Archive
9. webarchive.nrscotland.gov.uk: National Records of Scotland
10. webarchive.org.uk: UK Web Archive
11. webarchive.parliament.uk: UK Parliament Web Archive
12.  wayback.nli.org.il: National Library of Israel
13. archive.today: Archive Today
14. arquivo.pt: The Portuguese Web Archive
15. perma.cc: Perma.cc Archive
16. swap.stanford.edu: Stanford Web Archive
17. wayback.archive-it.org: Archive-It (powered by the Internet Archive)
18. web.archive.org: the Internet Archive
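For illustration, here is a minimal Python sketch of how such a check can be done against a MemGator aggregator. It is not the code used in the study, and the public instance URL and the /timemap/link/ route are assumptions that may need adjusting for your own deployment:

```python
# Query a MemGator memento aggregator for a link-format TimeMap and tally
# mementos per archive host. The instance URL and route are assumptions.
from collections import Counter
from urllib.error import HTTPError
from urllib.parse import urlparse
from urllib.request import urlopen

MEMGATOR = "https://memgator.cs.odu.edu/timemap/link/"  # assumed public instance

def mementos_per_archive(uri_r: str) -> Counter:
    counts: Counter = Counter()
    try:
        with urlopen(MEMGATOR + uri_r, timeout=60) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except HTTPError:  # the aggregator answers 404 when no archive holds a memento
        return counts
    for line in body.splitlines():
        if 'rel="' not in line:
            continue
        rel = line.split('rel="', 1)[1].split('"', 1)[0]
        if "memento" not in rel:  # skip original, timegate, and timemap links
            continue
        target = line.split(">", 1)[0].lstrip(" <")
        counts[urlparse(target).netloc] += 1
    return counts

if __name__ == "__main__":
    for archive, n in mementos_per_archive("https://www.aljazeera.net/").most_common():
        print(f"{archive}\t{n}")
```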

Apart from the Internet Archive, only archive.today and arquivo.pt returned any mementos for the 2684 Arabic URIs we queried; between them they returned a total of seven mementos for six different URIs.

We found that the Internet Archive has archived more Arabic news pages than all other archives combined by a large margin. Other archives hardly contributed to archiving Arabic stories' URIs.

The percentage of archived Arabic news stories in web archives

As for English news webpages, looking at the archival rate by web archive, the Internet Archive returned mementos for a much larger number of URIs than the sum of all other web archives, but the gap in contribution between the IA and the sum of all other web archives is not as large as it is for Arabic news webpages in our dataset.

The percentage of archived English news stories in web archives

Furthermore, we found that the union of all other archives' URI-Rs is a proper subset of the IA's URI-Rs. In other words, only the IA had exclusive copies of URIs of Arabic news stories; all other archives had no exclusive copies. This doesn't necessarily mean that the union of all other archives' URI-Ms is a proper subset of the IA's URI-Ms, because URIs could have been archived at different times by different web archives. This finding indicates that losing all web archives besides the IA would cause almost no loss of information. On the other hand, losing the IA would be disastrous for the web archiving of Arabic pages.

The percentage of exclusively archived Arabic news stories

For English news webpages in our sample, the IA had many more exclusive copies of URIs than all other archives combined, which indicates that losing all web archives besides the IA causes very little loss in information, but the opposite, losing the IA, is catastrophic.

The percentage of exclusively archived English news stories

While it is no secret that the IA is the largest web archive on the internet, our study shows that the bulk of archived webpages could be lost forever if the Internet Archive were killed by legal threats or crippled by repeated cyber attacks. The most recent DDoS attack and data breach happened in October 2024. Luckily, the DDoS attack was mitigated, but the incident caused the IA to be down or partially down while keeping the data safe.

Our finding is different from an earlier study by Alsum et al. (2014), where they found that it is possible to retrieve full TimeMaps for 93% of their dataset using the top nine web archives without the IA. 

Conclusions

The archival rate of Arabic news pages was, and still is, lower than that of English news pages, but the gap is much smaller than it used to be. The archival rate of Arabic news pages increased from 24% between 1999 and 2013 to 53% between 2013 and 2022. Our study shows that most of the increase is due to the IA's growth over time, while other web archives did not grow comparably. There was also more room for improvement in archiving Arabic news than English news. We show that losing all archives except the IA would cause no loss in archived Arabic news pages, but the loss would be irreversible if the IA no longer existed. For English webpages, the majority of archived copies would be lost forever if the IA were crippled.

2025-10-03 edit: I replaced all graphs in this post with graphs that are more visually appealing.

-Hussam Hallak



by Hussam Hallak (noreply@blogger.com) at October 03, 2025 03:36 PM

October 01, 2025

Mita Williams

Libraries deliver the freedom that games promise

Last May, I spoke at the Bike Windsor Essex's Pecha Kucha portion of their AGM. Today I presented another 20 slides @ 20 seconds each about games and libraries.

by Mita Williams at October 01, 2025 10:11 PM

LibraryThing (Thingology)

October 2025 Early Reviewers Batch Is Live!

Win free books from the October 2025 batch of Early Reviewer titles! We’ve got 206 books this month, and a grand total of 2,416 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.

If you haven’t already, sign up for Early Reviewers. If you’ve already signed up, please check your mailing/email address and make sure they’re correct.

» Request books here!

The deadline to request a copy is Monday, October 27th at 6PM EDT.

Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, Canada, the UK, Australia, Germany, New Zealand, Ireland, Italy, Finland, Czechia and more. Make sure to check the message on each book to see if it can be sent to your country.

The Little Drummer Girl: An Unexpected Christmas StoryStones Still Speak: How Biblical Archaeology Illuminates the Stories You Thought You KnewA Sun Behind Us / Un sol caído avanza82nd DivisionCoyotes and Culture: Essays from Old MalibuWhere Heaven Sinks: PoemsNervous Systems: Spiritual Practices to Calm Anxiety in Your Body, the Church, and PoliticsThe Future Begins with Z: Nine Strategies to Lead Generation Z As They Disrupt the WorkplaceBy Other MeansThe Gardener's Wife's MistressPosthumously YoursGap Yearsomething out there in the distanceTethered Spirits: Wiaqtaqne’wasultijik na KjijaqmijinaqStokerCon 2025 Souvenir AnthologyStokerCon 2025 Souvenir AnthologyEvery Pawn A QueenDivrei Halev: Thoughts of Rabbi Professor David Weiss Halivni on the Weekly Torah PortionPirkei Hallel: A Shared Journey for Bat Mitzvah Girls and Their MothersMilo's Moonlight MissionIf a Bumblebee Lands on Your ToeAn eighty year old's poems for a poundThe Arrows of FealtyRational Ideas Book TwoOctober 7: A Story of Courage and ResurrectionThe Accidental HeroThe Happiness Files: Insights on Work and Life by Arthur C. BrooksImagine WagonsThey Kill People: Bonnie and Clyde, a Hollywood Revolution, and America's Obsession with Guns and OutlawsA Method of Reaching Extreme Altitudes and Other StoriesEpic Disruptions: 11 Innovations That Shaped Our Modern WorldStory Work: Field Notes on Self-Discovery and Reclaiming Your NarrativeSerial Fixer: Break Free from the Habit of Solving Other People's ProblemsThe Curse of the Cole Women“Moonrakers” – The Great War Story Of The 2nd Battalion, The Wiltshire RegimentLove Me Like She Loves YouSilence, Not For SaleChurch of the Last LambDiyas at Circle Time: A Celebration of South Asian Festivals Around Diwali TimeBro ken Rengay: Unruly Poetry101 Stories of Love: Poetry CollectionQuintus Huntley: BotanyDeath Is a Hungry AngelThe Music of CreationMeet RebaThe Little Red BookAnd Life Goes OnChasing the GoddessA Future for Ferals: A Charity AnthologyAn Eye for VengeanceUntil Death Taps You on the ShoulderOne Night in BethlehemIn Plain SightThe Lost Star of FaewyrnBib and TuckerPossession PointThe PacifistMoving to My Dog's Hometown: Stories of Everything I Didn't Know I WantedBrokenJust Another Perfect DayLena the Chicken (But Really a Dinosaur!)The Newest GnomeNot-So-Sweetie PieThe One about the BlackbirdPluto Rocket: Over the MoonRobot IslandTeam ParkWhispers in the Currents: A Poetry CollectionThe HobbetteSilent ExtractionThey Tried Their BestThrough the Darkness: A Story of Love from the Other SideLes Aventures d'Emma Brown : Le Village D'AmyvilleLa Famille GoodhartTo Catch a SpyJust a ChanceAtannaThe Adventures of Syd: Lost in Bone CaveBlood & GunsThe DrawAdapt, Panic, or Profit?: Hilariously Stressful Quizzes About the FutureThe 30 Day Creative Writing Workbook for Kids Ages 8-12: Fun Daily Lessons and Prompts to Build Confidence and Teach Great StorytellingAve Molech: Ex TenebrisThe Taken PathMargo's CaféLawless Game Of LiesTo Eternity and Back: Discovering and Decoding the Map of the MultiverseAmelia Armadillo: Appearances Aren't EverythingBernie Bear: A Story about Best FriendsThe Undoing: Who Shall StandSimply DeliciousWrecking BallDragon RogueSabrina Tells Maddie the Truth About Her PastPerfectly HugoThe Road to RedemptionWhen the Lights Are Off: Lessons from the Quiet MomentsOvercoming BPH: The Man's Guide to Beating an Enlarged Prostate: Proven Ways to Shrink Your Prostate, Improve Urine Flow, and Reclaim Your VitalityMekatilili wa Menza: The Woman Warrior Who Led the 
Giriama of KenyaEye in the Blue BoxThe LiminalisThe Strategic Customer Success Manager: A Blueprint for Elevating Your Impact and Advancing Your CareerLow FODMAP Diet Cookbook: Easy and Healthy FODMAP Recipes for Beginners and IBS Relief - Gluten-Free Meal Plan with Gut-Friendly Breakfast, Lunch and Dinner IdeasVegan Mexican Cookbook: Easy and Authentic Plant-Based Recipes from MexicoThe Absolute Path: A Spiritual Guide to Eternal FulfillmentMooncrowThe Breath of the Machine: To Do or To Be – What the Machine Can’t GraspHavoc: Trouble on the TrailThe Broken Woman Cycle: Complete SeriesWalking the Standing Stones: Wiltshire's Sarsen Way and Cranborne Droves WayDo It for Beauty: A Practical Guide to Sustainable LivingThe Final ShelterThe Lessons of Legions: How the Devil Interferes in the Lives of HumansThe War for Every Soul: True Encounters with the Spirit RealmEveOut of Gaza: A Tale of Love, Exile, and FriendshipThe Music MakersInkbound InheritanceThe Tapestry of TimeAbsolute TriumphWhy We Choose Freedom from Nine to Five: Transforming Ordinary Days Into Extraordinary FreedomSouthwest Gothic: The Harvesting Angel of the PlainsSouthwest Gothic: The Harvesting Angel of the PlainsRare Mamas: Empowering Strategies for Navigating Your Child's Rare DiseaseTapoutWayWard: The Valley WarLead Anyway: Teaching Through the Fog When the System Stops Seeing YouMystic Nomad: A Woman's Wild Journey to True ConnectionThe Hidden Vegetables Cookbook: 90 Tasty Recipes for Veggie-Averse AdultsCalming Teenage Anxiety: A Parent's Guide to Helping Your Teenager Cope with WorryWe the People: A PremonitionOuter Chaos Inner CalmKids These Days: Understanding and Supporting Youth Mental HealthMilo's Pet Problem: A Laugh-Out-Loud Pet Adventure for Curious KidsA National Park Love Story: A Journey of Love, Healing, and Second Chances Across America’s 63 National ParksBot CampHow to Focus Energy and Go All In: Build Unstoppable Momentum and Dominate Your LifeShadow HeirThe Human, Not the Man: How A Mistranslation Shaped CivilizationRaising Genius: Mozart, Einstein, Jobs: The Price of BrillianceThe Reckoning: A Definitive History of the COVID-19 Pandemic and Other AbsurditiesBut Will It Fly?: The History and Science of Unconventional Aerial Power and PropulsionNala Roonie Goes To Richmond ParkThe AI of the Beholder: Art and Creativity in the Age of AlgorithmsSuch an Odd Word to UseHalf-TruthsDaedalusThe Lavender Blade: An Exorcist's ChronicleTo Outwit Them AllCarried AwayWing HavenWild SunPale PiecesThe Shepherd DescendsThree Years to FreedomAuroraAn Artsy Girl's Guide to FootballVan Gogh's LoverSilly Zoo Baby Mix-Ups: A Hilarious Rhyming and Movement BookBroken BondsThe Gariboldi AffairDo Not Kill a SpiderThe Orsini AffairThe Tale of the Young WitchPiper's Fort FrenzyPhantom AlgebraBlack Girls Day OffThe GLP-1 Weight Loss Cookbook for Beginners: Simple, Quick, Healthy Recipes and Meal Plans For Rapid Fat Loss Using Wegovy, Ozempic and Mounjaro, Burn Belly Fat, Curb Cravings and Keep FitThe Town That Feared Dusk5 Ingredients Mediterranean Slow Cooker Cookbook for Busy Families: Simple, Quick and Healthy Mediterranean Recipes You Can Set and Forget, Perfect For Beginners and Families on a Tight ScheduleThe Flames of DarknessOne Grain of SandCampus of ShadowsThree Faces of Noir Curse Crime CringeThe Flown Bird Society: An Illuminated Story10 E 10 : El Tiempo Que Eligió AmarFour Corners, Volume 2Sea of Stats: An Introduction to Descriptive StatisticsThe IdolDeclined but DivineThe Road UnveiledNarc 101: The Illustrated 
Practical Guide to Identifying and Healing from Narcissistic AbuseStormy Normy Goes ReiningThe Light Switch Myth: A Beginner's Guide to Creating Realistic and Sustainable ChangeThe Lady of the RingsThe Road Unveiled : A NovelForever from AfarThe Last Library of MidnightMaya's Diary: The Lost JournalThe Explorer's Guid To The Galaxy: How to Tame Your ADHD Chaos, Build a Decency & Determination: How Nicholas Winton Saved Hundreds of Children from the HolocaustSeleneThe Heart That Found YouRoderick RecursiveNotes on Letting GoClass Is In Session: The Expectant Teacher Survival HandbookRevelation: Worship the LambVamparencyBlue Helmet: My Year As a UN Peacekeeper in South SudanBlue Helmet: My Year As a UN Peacekeeper in South SudanThe Pansy ParadoxSet Point SeductionShadowbound: An Indian Superhero ThrillerSolitude: Four Unsettling Tales of Love, Obsession and HorrorBehind the MirrorThe Enchanted Suitcase: Vade Satana

Thanks to all the publishers participating this month!

Akashic Books Alcove Press Baker Books
Broadleaf Books Egret Lake Books eSpec Books
Gefen Publishing House Harbor Lane Books, LLC. HarperCollins Leadership
Harvard Business Review Press Kinkajou Press Modern Marigold Books
OC Publishing Picnic Heist Publishing Prolific Pulse Press LLC
PublishNation Revell Rippple Books
Riverfolk Books Rootstock Publishing Shadow Dragon Press
Somewhat Grumpy Press Spinning Wheel Stories Tundra Books
Type Eighteen Books Underground Voices University of Nevada Press
University of New Mexico Press UpLit Press Wise Media Group
WorthyKids Yeehoo Press

by Abigail Adams at October 01, 2025 06:18 PM

Digital Library Federation

DLF Digest: October 2025

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation. See all past Digests here

Hello DLF Community! I’m excited to join you as Senior Program Officer and to be a part of this vibrant member network. I really enjoyed meeting the Working Group chairs at the September meeting, in some cases seeing familiar faces! I’m looking forward to continuing these conversations as we dive into preparations for the DLF Forum, a great opportunity to learn, share and connect. I’m grateful to be on this journey with you and eager to see where we go next as a community. 

— Shaneé from Team DLF

 

This month’s news:

 

This month’s DLF group events:

DAWG Accessibility Ambassadors Project Overview

Thursday, October 2, 2025 at 11:30am ET / 8:30am PT

https://sjny.zoom.us/meeting/register/Rfc7pOoCTjKzaB94hXHCSw  

Please join the DAWG Advocacy and Education group on Thursday, October 2, 2025 at 11:30 am ET to learn about the Accessibility Ambassadors Project @UMich. 

Tiffany Harris (she/her) is the Accessibility Program Assistant for the University of Michigan Library, and she is pursuing her Master of Science in Environmental Justice. During her presentation, she will discuss the accessibility training that she and other members of the Library Accessibility team are leading for the accessibility teams within the library. She is hosting training on Learning about People with Disabilities, Accessible Documents, Accessible Presentations, and Accessible Spreadsheets. She will also discuss some of the Accessibility Ambassador projects, such as assessing the end cap signage throughout the library, live captioning on slides and PowerPoints, and sensory-friendly maps and floor plans. We give student staff a wide variety of projects to choose from to ensure that they are working on projects that relate to their interests and build skills for their resumes.

You must Register in advance for this meeting. After registering, you will receive a confirmation email about joining the meeting. Please review the DLF Code of Conduct prior to attending.

 

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member organization. Learn more about our working groups on our website. Interested in scheduling an upcoming working group call or reviving a past group? Check out the DLF Organizer’s Toolkit. As always, feel free to get in touch at info@diglib.org

 

Get Involved / Connect with Us

Below are some ways to stay connected with us and the digital library community: 

The post DLF Digest: October 2025 appeared first on DLF.

by Aliya Reich at October 01, 2025 01:00 PM

David Rosenthal

The Gaslit Asset Class

James Grant invited me to address the annual conference of Grant's Interest Rate Observer. This was an intimidating prospect; the previous year's conference featured billionaires Scott Bessent and Bill Ackman. As usual, below the fold is the text of my talk, with the slides, links to the sources, and additional material in footnotes. Yellow background indicates textual slides.

The Gaslit Asset Class

Before I explain that much of what you have been told about cryptocurrency technology is gaslighting, I should stress that I hold no long or short positions in cryptocurrencies, their derivatives or related companies. Unlike most people discussing them, I am not "talking my book".

To fit in the allotted time, this talk focuses mainly on Bitcoin and omits many of the finer points. My text, with links to the sources and additional material in footnotes, will go up on my blog later today.

Why Am I Here?

I imagine few of you would understand why a retired software engineer with more than forty years in Silicon Valley was asked to address you on cryptocurrencies[1].

NVDA Log Plot
I was an early employee at Sun Microsystems then employee #4 at Nvidia, so I have been long Nvidia for more than 30 years. It has been a wild ride. I quit after 3 years as part of fixing Nvidia's first near-death experience and immediately did 3 years as employee #12 at another startup, which also IPO-ed. If you do two in six years in your late 40s you get seriously burnt out.

So my wife and I started a program at Stanford that is still running 27 years later. She was a career librarian at the Library of Congress and the Stanford Library. She was part of the team that, 30 years ago, pioneered the transition of academic publishing to the Web. She was also the person who explained citation indices to Larry and Sergey, which led to PageRank.

The academic literature has archival value. Multiple libraries hold complete runs on paper of the Philosophical Transactions of the Royal Society starting 360 years ago[2]. The interesting engineering problem we faced was how to enable libraries to deliver comparable longevity to Web-published journals.

Five Years Before Satoshi Nakamoto

I worked with a group of outstanding Stanford CS Ph.D. students to design and implement a system for stewardship of Web content modeled on the paper library system. The goal was to make it extremely difficult for even a powerful adversary to delete or modify content without detection. It is called LOCKSS, for Lots Of Copies Keep Stuff Safe; a decentralized peer-to-peer system secured by Proof-of-Work. We won a "Best Paper" award for it five years before Satoshi Nakamoto published his decentralized peer-to-peer system secured by Proof-of-Work. When he did, LOCKSS had been in production for a few years and we had learnt a lot about how difficult decentralization is in the online world.

Bitcoin built on more than two decades of research. Neither we nor Nakamoto invented Proof-of-Work; Cynthia Dwork and Moni Naor published it in 1992. Nakamoto didn't invent blockchains; Stuart Haber and W. Scott Stornetta patented them in 1991. He was extremely clever in assembling well-known techniques into a cryptocurrency, but his only major innovation was the Longest Chain Rule.

Digital cash

The fundamental problem of representing cash in digital form is that a digital coin can be endlessly copied, thus you need some means to prevent each of the copies being spent. When you withdraw cash from an ATM, turning digital cash in your account into physical cash in your hand, the bank performs an atomic transaction against the database mapping account numbers to balances. The bank is trusted to prevent multiple spending.
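For illustration only, here is a minimal sketch of that kind of trusted, atomic ledger update (hypothetical table and column names): the balance check and the debit happen in a single transaction, so the same funds cannot be spent twice.

```python
# A toy trusted ledger: the balance check and the debit are one atomic
# transaction, so the same funds cannot be spent twice. Table and column
# names are hypothetical.
import sqlite3

def withdraw(db: sqlite3.Connection, account: int, amount: int) -> bool:
    with db:  # one transaction: committed on success, rolled back on error
        cur = db.execute(
            "UPDATE accounts SET balance = balance - ? "
            "WHERE id = ? AND balance >= ?",
            (amount, account, amount),
        )
        return cur.rowcount == 1  # 0 rows updated means nothing was spent

if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    db.execute("INSERT INTO accounts VALUES (1, 100)")
    print(withdraw(db, 1, 60), withdraw(db, 1, 60))  # True False: no double spend
```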

There had been several attempts at a cryptocurrency before Bitcoin. The primary goals of the libertarians and cypherpunks were that a cryptocurrency be as anonymous as physical cash, and that it not have a central point of failure that had to be trusted. The only one to get any traction was David Chaum's DigiCash; it was anonymous but it was centralized to prevent multiple spending and it involved banks.

Nakamoto's magnum opus

Bitcoin claims:
When in November 2008 Nakamoto published Bitcoin: A Peer-to-Peer Electronic Cash System it was the peak of the Global Financial Crisis and people were very aware that the financial system was broken (and it still is). Because it solved many of the problems that had dogged earlier attempts at electronic cash, it rapidly attracted a clique of enthusiasts. When Nakamoto went silent in 2010 they took over proselytizing the system. The main claims they made were that Bitcoin was trustless because decentralized, a medium of exchange with fast, cheap transactions, and secured by Proof-of-Work and by cryptography (the claims examined in the sections that follow).
They are all either false or misleading. In most cases Nakamoto's own writings show he knew this. His acolytes were gaslighting.

Trustless because decentralized (1)

Assuming that the Bitcoin network consists of a large number of roughly equal nodes, it randomly selects a node to determine the transactions that will form the next block. There is no need to trust any particular node because the chance that any particular node will be selected is small.[3]

At first, most users would run network nodes, but as the network grows beyond a certain point, it would be left more and more to specialists with server farms of specialized hardware. A server farm would only need to have one node on the network and the rest of the LAN connects with that one node.
Satoshi Nakamoto 2nd November 2008
The current system where every user is a network node is not the intended configuration for large scale. ... The design supports letting users just be users. The more burden it is to run a node, the fewer nodes there will be. Those few nodes will be big server farms. The rest will be client nodes that only do transactions and don’t generate.
Satoshi Nakamoto: 29th July 2010
But only three days after publishing his white paper, Nakamoto understood that this assumption would become false:
At first, most users would run network nodes, but as the network grows beyond a certain point, it would be left more and more to specialists with server farms of specialized hardware.
He didn't change his mind. On 29th July 2010, less than five months before he went silent, he made the same point:
The current system where every user is a network node is not the intended configuration for large scale. ... The design supports letting users just be users. The more burden it is to run a node, the fewer nodes there will be. Those few nodes will be big server farms.
"Letting users be users" necessarily means that the "users" have to trust the "few nodes" to include their transactions in blocks. The very strong economies of scale of technology in general and "big server farms" in particular meant that the centralizing force described in W. Brian Arthur's 1994 book Increasing Returns and Path Dependence in the Economy resulted in there being "fewer nodes". Indeed, on 13th June 2014 a single node controlled 51% of Bitcoin's mining, the GHash pool.[4]

Trustless because decentralized (2)

In June 2022 Cooperation among an anonymous group protected Bitcoin during failures of decentralization by Alyssa Blackburn et al showed that it had not been decentralized from the very start. The same month a DARPA-sponsored report entitled Are Blockchains Decentralized? by a large team from the Trail of Bits security company examined the economic and many other centralizing forces affecting a wide range of blockchain implementations and concluded that the answer to their question is "No".[5]

The same centralizing economic forces apply to Proof-of-Stake blockchains such as Ethereum. Grant's Memo to the bitcoiners explained the process last February.

Trustless because decentralized (3)

Another centralizing force drives pools like GHash. The network creates a new block and rewards the selected node about every ten minutes. Assuming they're all state-of-the-art, there are currently about 15M rigs mining Bitcoin[6]. Their economic life is around 18 months, so only about 0.5% of them will ever earn a reward. The owners of mining rigs pool their efforts, converting a small chance of a huge reward into a steady flow of smaller rewards. On average GHash was getting three rewards an hour.
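A back-of-the-envelope check of those figures, as a sketch using the round numbers above rather than measured data:

```python
# Rough arithmetic behind the "only ~0.5% of rigs ever earn a reward" and
# "~3 rewards an hour for a 51% pool" figures; all inputs are the round
# numbers quoted above, not measured data.
RIGS = 15_000_000          # state-of-the-art rigs currently mining
BLOCKS_PER_DAY = 6 * 24    # one block reward roughly every ten minutes
LIFETIME_DAYS = 18 * 30    # ~18-month economic life of a rig

rewards_in_lifetime = BLOCKS_PER_DAY * LIFETIME_DAYS   # ~77,760 rewards
share_ever_rewarded = rewards_in_lifetime / RIGS       # at most ~0.52% of rigs
ghash_rewards_per_hour = 0.51 * 6                      # ~3 rewards per hour

print(f"share of rigs that ever win a reward: ~{share_ever_rewarded:.2%}")
print(f"GHash at 51%: ~{ghash_rewards_per_hour:.1f} rewards/hour")
```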

A medium of exchange (1)

Quote from: Insti, July 17, 2010, 02:33:41 AM
How would a Bitcoin snack machine work?
  1. You want to walk up to the machine. Send it a bitcoin.
  2. ?
  3. Walk away eating your nice sugary snack. (Profit!)
You don’t want to have to wait an hour for your transaction to be confirmed.

The vending machine company doesn’t want to give away lots of free candy.

How does step 2 work?
I believe it’ll be possible for a payment processing company to provide as a service the rapid distribution of transactions with good-enough checking in something like 10 seconds or less.
Satoshi Nakamoto: 17th July 2010
Bitcoin's ten-minute block time is a problem for real-world buying and selling[7], but the problem is even worse. Network delays mean a transaction isn't final when you see it in a block. Assuming no-one controlled more than 10% of the hashing power, Nakamoto required another 5 blocks to have been added to the chain, so 99.9% finality would take an hour. With a more realistic 30%, the rule should have been 23 blocks, with finality taking 4 hours[8].
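The block counts come from the attacker-catch-up analysis in section 11 of the Bitcoin whitepaper; a minimal sketch of that calculation is below. The whitepaper's own table gives z = 5 for a 10% attacker and z = 24 for a 30% attacker, in line with the roughly 23-block, four-hour figure above.

```python
# Nakamoto's attacker-success probability (Bitcoin whitepaper, section 11):
# the chance an attacker with fraction q of the hash power catches up from
# z blocks behind, and the confirmations needed to push it below 0.1%.
import math

def attacker_success(q: float, z: int) -> float:
    p = 1.0 - q
    lam = z * q / p
    s = 1.0
    for k in range(z + 1):
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        s -= poisson * (1.0 - (q / p) ** (z - k))
    return s

def confirmations_needed(q: float, target: float = 0.001) -> int:
    z = 0
    while attacker_success(q, z) > target:
        z += 1
    return z

if __name__ == "__main__":
    for q in (0.10, 0.30):
        z = confirmations_needed(q)
        print(f"q = {q:.0%}: {z} further blocks (~{z * 10} minutes)")
```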

Nakamoto's 17th July 2010 exchange with Insti shows he understood that the Bitcoin network couldn't be used for ATMs, vending machines, buying drugs or other face-to-face transactions because he went on to describe how a payment processing service layered on top of it would work.

A medium of exchange (2)

assuming that the two sides are rational actors and the smart contract language is Turing-complete, there is no escrow smart contract that can facilitate this exchange without either relying on third parties or enabling at least one side to extort the other.

two-party escrow smart contracts are ... simply a game of who gets to declare their choice first and commit it on the blockchain sooner, hence forcing the other party to concur with their choice. The order of transactions on a blockchain is essentially decided by the miners. Thus, the party with better connectivity to the miners or who is willing to pay higher transaction fees, would be able to declare their choice to the smart contract first and extort the other party.
Amir Kafshdar Goharshady, Irrationality, Extortion, or Trusted Third-parties: Why it is Impossible to Buy and Sell Physical Goods Securely on the Blockchain
The situation is even worse when it comes to buying and selling real-world objects via programmable blockchains such as Ethereum[9]. In 2021 Amir Kafshdar Goharshady showed that[10]:
assuming that the two sides are rational actors and the smart contract language is Turing-complete, there is no escrow smart contract that can facilitate this exchange without either relying on third parties or enabling at least one side to extort the other.
Goharshady noted that:
on the Ethereum blockchain escrows with trusted third-parties are used more often than two-party escrows, presumably because they allow dispute resolution by a human.
And goes on to show that in practice trusted third-party escrow services are essential because two-party escrow smart contracts are:
simply a game of who gets to declare their choice first and commit it on the blockchain sooner, hence forcing the other party to concur with their choice. The order of transactions on a blockchain is essentially decided by the miners. Thus, the party with better connectivity to the miners or who is willing to pay higher transaction fees, would be able to declare their choice to the smart contract first and extort the other party.
The choice being whether or not the good had been delivered. Given the current enthusiasm for tokenization of physical goods the market for trusted escrow services looks bright.

Fast transactions

Actually the delay between submitting a transaction and finality is unpredictable and can be much longer than an hour. Transactions are validated by miners and added to the mempool of pending transactions, where they wait until either a miner includes them in a block or they are eventually dropped from the mempool.
Mempool count
This year the demand for transactions has been low, typically under 4 per second, so the backlog has been low, around 40K transactions or under three hours' worth. Last October it peaked at around 14 hours' worth.

The distribution of transaction wait times is highly skewed. The median wait is typically around a block time. The proportion of low-fee transactions means the average wait is normally around 10 times that. But when everyone wants to transact the ratio spikes to over 40 times.

Cheap transactions

Average fee/transaction
There are two ways miners can profit from including a transaction in a block: the block reward of newly created coins, and the transaction's fee.
The block size limit means there is a fixed supply of transaction slots, about 7 per second, but the demand for them varies, and thus so does the price. In normal times the auction for transaction fees means they are much smaller than the block reward. But when everyone wants to transact they suffer massive spikes.

Secured by Proof-of-Work (1)

In cryptocurrencies "secured" means that the cost of an attack exceeds the potential loot. The security provided by Proof-of-Work is linear in its cost, unlike techniques such as encryption, whose security is exponential in cost. It is generally believed that it is impractical to reverse a Bitcoin transaction after about an hour because the miners are wasting such immense sums on Proof-of-Work. Bitcoin pays these immense sums, but it doesn't get the decentralization they ostensibly pay for.

Monero, a privacy-focused blockchain network, has been undergoing an attempted 51% attack — an existential threat to any blockchain. In the case of a successful 51% attack, where a single entity becomes responsible for 51% or more of a blockchain's mining power, the controlling entity could reorganize blocks, attempt to double-spend, or censor transactions.

A company called Qubic has been waging the 51% attack by offering economic rewards for miners who join the Qubic mining pool. They claim to be "stress testing" Monero, though many in the Monero community have condemned Qubic for what they see as a malicious attack on the network or a marketing stunt.
Molly White: Monero faces 51% attack
The advent of "mining as a service" about 7 years ago made 51% attacks against smaller Proof-of-Work alt-coins such as Bitcoin Gold endemic. In August Molly White reported the attempted 51% attack on Monero quoted above.

In 2018's The Economic Limits Of Bitcoin And The Blockchain Eric Budish of the Booth School analyzed two versions of the 51% attack. I summarized his analysis of the classic multiple spend attack thus:
Note that only Bitcoin and Ethereum among cryptocurrencies with "market cap" over $100M would cost more than $100K to attack. The total "market cap" of these 8 currencies is $271.71B and the total cost to 51% attack them is $1.277M or 4.7E-6 of their market cap.
His key insight was that to ensure that 51% attacks were uneconomic, the reward for a block (implicitly a transaction tax) plus the fees had to be greater than the maximum value of the transactions in it. The total transaction cost (reward + fee) typically peaks around 1.8% of the value transacted, but is normally between 0.6% and 0.8%, around 150 times less than Budish's safety criterion requires (see the back-of-the-envelope below). The result is that a conspiracy between a few large pools could find it economic to mount a 51% attack.
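A rough restatement of that gap, using only the percentages quoted above (a sketch, not Budish's model):

```python
# Budish's criterion: per-block miner income (reward + fees) should exceed
# the value an attacker could double-spend in that block. If income is only
# ~0.6-0.8% of the value transacted per block (peaking near 1.8%), the
# criterion is missed by roughly two orders of magnitude.
for income_share in (0.006, 0.008, 0.018):
    shortfall = 1.0 / income_share  # value at risk relative to miner income
    print(f"miner income = {income_share:.1%} of block value "
          f"-> criterion missed by ~{shortfall:.0f}x")
```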

Secured by Proof-of-Work (2)

However, ∆attack is something of a “pick your poison” parameter. If ∆attack is small, then the system is vulnerable to the double-spending attack ... and the implicit transactions tax on economic activity using the blockchain has to be high. If ∆attack is large, then a short time period of access to a large amount of computing power can sabotage the blockchain.
Eric Budish: The Economic Limits Of Bitcoin And The Blockchain
But everyone assumes the pools won't do that. Budish further analyzed the effects of a multiple spend attack. It would be public, so it would in effect be sabotage, decreasing the Bitcoin price by a factor ∆attack. He concludes that if the decrease is small, then double-spending attacks are feasible and the per-block reward plus fee must be large, whereas if it is large then access to the hash power of a few large pools can quickly sabotage the currency.

The implication is that miners, motivated to keep fees manageable, believe ∆attack is large. Thus Bitcoin is secure because those who could kill the golden goose don't want to.

Secured by Proof-of-Work (3)

proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
Raphael Auer: Beyond the doomsday economics of “proof-of-work” in cryptocurrencies
The following year, in Beyond the doomsday economics of “proof-of-work” in cryptocurrencies, Raphael Auer of the Bank for International Settlements showed that the problem Budish identified was inevitable[12]:
proof-of-work can only achieve payment security if mining income is high, but the transaction market cannot generate an adequate level of income. ... the economic design of the transaction market fails to generate high enough fees.
In other words, the security of Bitcoin's blockchain depends upon inflating the currency with block rewards. This problem is exacerbated by Bitcoin's regular "halvenings" reducing the block reward. To maintain miners' current income after the next halvening, less than three years away, the "price" would need to be over $200K; security depends upon the "price" appreciating faster than 20%/year.

Once the block reward gets small, safety requires the fees in a block to be worth more than the value of the transactions in it. But everybody has decided to ignore Budish and Auer.

Secured by Proof-of-Work (4)

Farokhnia Table 1
In 2024 Soroush Farokhnia & Amir Kafshdar Goharshady's Options and Futures Imperil Bitcoin's Security:
showed that (i) a successful block-reverting attack does not necessarily require ... a majority of the hash power; (ii) obtaining a majority of the hash power ... costs roughly 6.77 billion ... and (iii) Bitcoin derivatives, i.e. options and futures, imperil Bitcoin’s security by creating an incentive for a block-reverting/majority attack.
They assume that an attacker would purchase enough state-of-the-art hardware for the attack. Given Bitmain's dominance in mining ASICs, such a purchase is unlikely to be feasible.

Secured by Proof-of-Work (5)

Ferreira Table 1
But it would not be necessary. Mining is a very competitive business, and power is the major cost[13]. Making a profit requires both cheap power and early access to the latest, most efficient chips. So it wasn't a surprise that Ferreira et al's Corporate capture of blockchain governance showed that:
As of March 2021, the pools in Table 1 collectively accounted for 86% of the total hash rate employed. All but one pool (Binance) have known links to Bitmain Technologies, the largest mining ASIC producer. [14]

Secured by Proof-of-Work (6)

Mining Pools 5/17/24
Bitmain, a Chinese company, exerts significant control of Bitcoin. China has firmly suppressed domestic use of cryptocurrencies, whereas the current administration seems intent on integrating them (and their inevitable grifts) into the US financial system. Except for Bitmain, no-one in China gets eggs from the golden goose. This asymmetry provides China with a way to disrupt the US financial system.

Mining Pools 4/30/25
It would be important to prevent the disruption being attributed to China. A necessary precursor would therefore be to obscure the extent of Bitmain-affiliated pools' mining power. This has been a significant trend in the past year; note the change in the "unknown" category in the graphs from 38 to 305. There could be other explanations, but, intentionally or not, this is creating a weapon.[15]

Secured by cryptography (1)

The dollars in your bank account are simply an entry in the bank's private ledger tagged with your name. You control this entry, but what you own is a claim on the bank[16]. Similarly, your cryptocurrency coins are effectively an entry in a public ledger tagged with the public half of a key pair. The two differences are that:
XKCD #538
The secret half of your key can leak via what Randall Munroe depicted as a "wrench attack", via phishing, social engineering, software supply chain attacks[18], and other forms of malware. Preventing these risks requires you to maintain an extraordinary level of operational security.

Secured by cryptography (2)

Even perfect opsec may not be enough. Bitcoin and most cryptocurrencies use two cryptographic algorithms, SHA256 for hashing and ECDSA for signatures.

Quote from: llama on July 01, 2010, 10:21:47 PM
Satoshi, That would indeed be a solution if SHA was broken (certainly the more likely meltdown), because we could still recognize valid money owners by their signature (their private key would still be secure).

However, if something happened and the signatures were compromised (perhaps integer factorization is solved, quantum computers?), then even agreeing upon the last valid block would be worthless.
True, if it happened suddenly. If it happens gradually, we can still transition to something stronger. When you run the upgraded software for the first time, it would re-sign all your money with the new stronger signature algorithm. (by creating a transaction sending the money to yourself with the stronger sig)
Satoshi Nakamoto: 10th July 2010
On 10th July 2010 Nakamoto addressed the issue of what would happen if either of these algorithms were compromised. There are three problems with his response: compromise is likely in the near future, when it happens Nakamoto's fix will be inadequate, and there is a huge incentive for it to happen suddenly:

Secured by cryptography (3)

Divesh Aggarwal et al's 2019 paper Quantum attacks on Bitcoin, and how to protect against them noted that:
the elliptic curve signature scheme used by Bitcoin is much more at risk, and could be completely broken by a quantum computer as early as 2027, by the most optimistic estimates.
Their "most optimistic estimates" are likely to be correct; PsiQuantum expects to have two 1M qubit computers operational in 2027[19]. Each should be capable of breaking an ECDSA key in under a week.

Bitcoin's transition to post-quantum cryptography faces a major problem because, to transfer coins from an ECDSA wallet to a post-quantum wallet, you need the key for the ECDSA wallet. Chainalysis estimates that:
about 20% of all Bitcoins have been "lost", or in other words are sitting in wallets whose keys are inaccessible
An example is the notorious hard disk in the garbage dump. A sufficiently powerful quantum computer could recover the lost keys.

The incentive for it to happen suddenly is that, even if Nakamoto's fix were in place, someone with access to the first sufficiently powerful quantum computer could transfer 20% of all Bitcoin, currently worth $460B, to post-quantum wallets they controlled. This would be a 230x return on the investment in PsiQuantum.
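The arithmetic behind that multiple, with the implied size of the PsiQuantum investment back-computed from the numbers in the text rather than taken from any funding disclosure:

lost_btc_value = 460e9      # ~20% of all Bitcoin, per the text
claimed_return = 230        # the 230x multiple quoted above
implied_investment = lost_btc_value / claimed_return
print(f"implied investment: ${implied_investment / 1e9:.0f}B")   # ~$2B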

Privacy-preserving

privacy can still be maintained by breaking the flow of information in another place: by keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to anyone.

As an additional firewall, a new key pair should be used for each transaction to keep them from being linked to a common owner.

Some linking is still unavoidable with multi-input transactions, which necessarily reveal that their inputs were owned by the same owner. The risk is that if the owner of a key is revealed, linking could reveal other transactions that belonged to the same owner.
Satoshi Nakamoto: Bitcoin: A Peer-to-Peer Electronic Cash System
Nakamoto addressed the concern that, because Bitcoin's blockchain was public (unlike DigiCash's), it wasn't anonymous:
privacy can still be maintained by breaking the flow of information in another place: by keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to anyone.
This is true but misleading. In practice, users need to use exchanges and other services that can tie them to a public key. There is a flourishing ecosystem of companies that deanonymize wallets by tracing the web of transactions. Nakamoto added:
As an additional firewall, a new key pair should be used for each transaction to keep them from being linked to a common owner.
This advice is just unrealistic. As Molly White wrote[20]:
funds in a wallet have to come from somewhere, and it’s not difficult to infer what might be happening when your known wallet address suddenly transfers money off to a new, empty wallet.
Nakamoto acknowledged:
Some linking is still unavoidable with multi-input transactions, which necessarily reveal that their inputs were owned by the same owner. The risk is that if the owner of a key is revealed, linking could reveal other transactions that belonged to the same owner.
For more than a decade Jameson Lopp has been tracking what happens when a wallet with significant value is deanonymized, and it is a serious risk to life and limb[21].

One more risk

I have steered clear of the financial risks of cryptocurrencies. It may appear that the endorsement of the current administration has effectively removed their financial risk. But the technical and operational risks remain, and I should note another technology-related risk.

Source
Equities are currently being inflated by the AI bubble. The AI platforms are running the drug-dealer's algorithm, "the first one's free", burning cash by offering their product free or massively under-priced. This cannot last; only 8% of their users would pay even the current price. OpenAI's August launch of GPT-5, which was about cost-cutting not better functionality, and Anthropic's cost increases were both panned by the customers who do pay. AI may deliver some value, but it doesn't come close to the cost of delivering it[22].

There is likely to be an epic AI equity bust. Analogies are being drawn to the telecom boom, but The Economist reckons[23]:
the potential AI bubble lags behind only the three gigantic railway busts of the 19th century.
Source
History shows a fairly strong and increasing correlation between equities and cryptocurrencies, so they will get dragged down too. The automatic liquidation of leveraged long positions in DeFi will start, causing a self-reinforcing downturn. Periods of heavy load such as this tend to reveal bugs in IT systems, and especially in "smart contracts", as their assumptions of adequate resources and timely responses are violated.

Source
Experience shows that Bitcoin's limited transaction rate and the fact that the Ethereum computer that runs all the "smart contracts" is 1000 times slower than a $50 Raspberry Pi 4[24] lead to major slow-downs and fee spikes during panic selling, exacerbated by the fact that the panic sales are public[25].

Conclusion

The fascinating thing about cryptocurrency technology is the number of ways people have developed, and how much they are willing to pay, to avoid actually using it. What other transformative technology has had people desperate not to use it?

The whole of TradFi has been erected on this much worse infrastructure, including exchanges, closed-end funds, ETFs, rehypothecation, and derivatives. Clearly, the only reason for doing so is to escape regulation and extract excess profits from what would otherwise be crimes.

Footnotes

  1. The cause was the video of a talk I gave at Stanford in 2022 entitled Can We Mitigate The Externalities Of Cryptocurrencies?. It was an updated version of a talk at the 2021 TTI/Vanguard conference. The talk conformed to Betteridge's Law of Headlines in that the answer was "no".
  2. Paper libraries form a model fault-tolerant system: highly replicated and decentralized. Libraries cooperate via inter-library loan and copying to deliver a service that is far more reliable than any individual library.
  3. The importance Satoshi Nakamoto attached to trustlessness can be seen from his release note for Bitcoin 0.1:
    The root problem with conventional currency is all the trust that's required to make it work. The central bank must be trusted not to debase the currency, but the history of fiat currencies is full of breaches of that trust. Banks must be trusted to hold our money and transfer it electronically, but they lend it out in waves of credit bubbles with barely a fraction in reserve. We have to trust them with our privacy, trust them not to let identity thieves drain our accounts. Their massive overhead costs make micropayments impossible.
    The problem with this ideology is that trust (but verify) is an incredibly effective optimization in almost any system. For example, Robert Putnam et al's Making Democracy Work: Civic Traditions in Modern Italy shows that the difference between the economies of Northern and Southern Italy is driven by the much higher level of trust in the North.

    Bitcoin's massive cost is a result of its lack of trust. Users pay this massive cost but they don't get a trustless system, they just get a system that makes the trust a bit harder to see.

    In response to Nakamoto's diatribe, note that:
    • "trusted not to debase the currency", but Bitcoin's security depends upon debasing the currency.
    • "waves of credit bubbles", is a pretty good description of the cryptocurrency market.
    • "not to let identity thieves drain our accounts", see Molly White's Web3 is Going Just Great.
    • "massive overhead costs". The current cost per transaction is around $100.
    I rest my case.
  4. The problem of trusting mining pools is actually much worse. There is nothing to stop pools coordinating. In 2017 Vitalik Buterin, co-founder of Ethereum, published The Meaning of Decentralization:
    In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently. If any one actor gets more than 1/3 of the mining power in a proof of work system, they can gain outsized profits by selfish-mining. However, can we really say that the uncoordinated choice model is realistic when 90% of the Bitcoin network’s mining power is well-coordinated enough to show up together at the same conference?
    See "Sufficiently Decentralized" for a review of evidence from a Protos article entitled New research suggests Bitcoin mining centralized around Bitmain that concludes:
    In all, it seems unlikely that up to nine major bitcoin mining pools use a shared custodian for coinbase rewards unless a single entity is behind all of their operations.
    The "single entity" is clearly Bitmain.
  5. Peter Ryan, a reformed Bitcoin enthusiast, noted another form of centralization in Money by Vile Means:
    Bitcoin is anything but decentralized: Its functionality is maintained by a small and privileged clique of software developers who are funded by a centralized cadre of institutions. If they wanted to change Bitcoin’s 21 million coin finite supply, they could do it with the click of a keyboard.
    His account of the politics behind the argument over raising the Bitcoin block size should dispel any idea of Bitcoin's decentralized nature. He also notes:
    By one estimate from Hashrate Index, Foundry USA and Singapore-based AntPool control more than 50 percent of computing power, and the top ten mining pools control over 90 percent. Bitcoin blogger 0xB10C, who analyzed mining data as of April 15, 2025, found that centralization has gone even further than this, “with only six pools mining more than 95 percent of the blocks.”
  6. The Bitmain S17 comes in 4 versions with hash rates from 67 to 76 TH/s. Let's assume 70 TH/s. As I write the Bitcoin hash rate is about 1 billion TH/s, so if they were all mid-range S17s there would be around 15M rigs mining. In an 18-month economic life there would be 77,760 block rewards (18 months × 30 days × 24 hours × 6 blocks/hour), so only about 0.5% of them would ever earn a reward.

    In December 2021 Alex de Vries and Christian Stoll estimated that:
    The average time to become unprofitable sums up to less than 1.29 years.
    It has been obvious since mining ASICs first hit the market that, apart from access to cheap or free electricity, there were two keys to profitable mining:
    1. Having close enough ties to Bitmain to get the latest chips early in their 18-month economic life.
    2. Having the scale to buy Bitmain chips in the large quantities that get you early access.
  7. See David Gerard's account of Steve Early's experiences accepting Bitcoin in his chain of pubs in Attack of the 50 Foot Blockchain Page 94.

    Chart 1
    U.S. Consumers’ Use of Cryptocurrency for Payments by Fumiko Hayashi and Aditi Routh of the Kansas City Fed reports that:
    The share of U.S. consumers who report using cryptocurrency for payments—purchases, money transfers, or both—has been very small and has declined slightly in recent years. The light blue line in Chart 1 shows that this share declined from nearly 3 percent in 2021 and 2022 to less than 2 percent in 2023 and 2024.
  8. User DeathAndTaxes on Stack Exchange explains the 6 block rule:
    p is the chance of attacker eventually getting longer chain and reversing a transaction (0.1% in this case). q is the % of the hashing power the attacker controls. z is the number of blocks to put the risk of a reversal below p (0.1%).

    So you can see if the attacker has a small % of the hashing power 6 blocks is sufficient. Remember 10% of the network at the time of writing is ~100GH/s. However if the attacker had greater % of hashing power it would take increasingly longer to be sure a transaction can't be reversed.

    If the attacker had significantly more hashpower say 25% of the network it would require 15 confirmation to be sure (99.9% probability) that an attacker can't reverse it.
    For example, last May Foundry USA had more than 30% of the hash power, so the rule should have been 24 not 6, and finality should have taken 4 hours.
  9. To be fair, Ethereum has introduced at least one genuine innovation, Flash Loans. In Flash loans, flash attacks, and the future of DeFi Aidan Saggers, Lukas Alemu and Irina Mnohoghitnei of the Bank of England provide an excellent overview of them. Back in 2021 Kaihua Qin, Liyi Zhou, Benjamin Livshits, and Arthur Gervais from Imperial College posted Attacking the defi ecosystem with flash loans for fun and profit, analyzing and optimizing two early flash loan attacks:
    We show quantitatively how transaction atomicity increases the arbitrage revenue. We moreover analyze two existing attacks with ROIs beyond 500k%. We formulate finding the attack parameters as an optimization problem over the state of the underlying Ethereum blockchain and the state of the DeFi ecosystem. We show how malicious adversaries can efficiently maximize an attack profit and hence damage the DeFi ecosystem further. Specifically, we present how two previously executed attacks can be “boosted” to result in a profit of 829.5k USD and 1.1M USD, respectively, which is a boost of 2.37× and 1.73×, respectively.
    They predicted an upsurge in attacks since "flash loans democratize the attack, opening this strategy to the masses". They were right, as you can see from Molly White's list of flash loan attacks.
  10. This is one of a whole series of Impossibilities, many imposed on Ethereum by fundamental results in computer science because it is a Turing-complete programming environment.
  11. For details of the story behind Miners' Extractable Value (MEV), see these posts:
    1. The Order Flow from November 2020.
    2. Ethereum Has Issues from April 2022.
    3. Miners' Extractable Value From September 2022.
    Source
    The first links to two must-read posts. The first is from Dan Robinson and Georgios Konstantopoulos, Ethereum is a Dark Forest:
    It’s no secret that the Ethereum blockchain is a highly adversarial environment. If a smart contract can be exploited for profit, it eventually will be. The frequency of new hacks indicates that some very smart people spend a lot of time examining contracts for vulnerabilities.

    But this unforgiving environment pales in comparison to the mempool (the set of pending, unconfirmed transactions). If the chain itself is a battleground, the mempool is something worse: a dark forest.
    The second is from Samczsun, Escaping the Dark Forest. It is an account of how:
    On September 15, 2020, a small group of people worked through the night to rescue over 9.6MM USD from a vulnerable smart contract.
    Note in particular that MEV poses a risk to the integrity of blockchains. In Extracting Godl [sic] from the Salt Mines: Ethereum Miners Extracting Value Julien Piet, Jaiden Fairoze and Nicholas Weaver examine the use of transactions that avoid the mempool, finding that:
    (i) 73% of private transactions hide trading activity or re-distribute miner rewards, and 87.6% of MEV collection is accomplished with privately submitted transactions, (ii) our algorithm finds more than $6M worth of MEV profit in a period of 12 days, two thirds of which go directly to miners, and (iii) MEV represents 9.2% of miners' profit from transaction fees.

    Furthermore, in those 12 days, we also identify four blocks that contain enough MEV profits to make time-bandit forking attacks economically viable for large miners, undermining the security and stability of Ethereum as a whole.
    When they say "large miners" they mean more than 10% of the power.
  12. Back in 2016 Arvind Narayanan's group at Princeton had published a related instability in Carlsten et al's On the instability of bitcoin without the block reward. Narayanan summarized the paper in a blog post:
    Our key insight is that with only transaction fees, the variance of the miner reward is very high due to the randomness of the block arrival time, and it becomes attractive to fork a “wealthy” block to “steal” the rewards therein.
  13. The leading source of data on which to base Bitcoin's carbon footprint is the Cambridge Bitcoin Energy Consumption Index. As I write their central estimate is that Bitcoin consumes 205TWh/year, or between Thailand and Vietnam.
  14. Ferreira et al write:
    AntPool and BTC.com are fully-owned subsidiaries of Bitmain. Bitmain is the largest investor in ViaBTC. Both F2Pool and BTC.TOP are partners of BitDeer, which is a Bitmain-sponsored cloud-mining service. The parent companies of Huobi.pool and OkExPool are strategic partners of Bitmain. Jihan Wu, Bitmain’s founder and chairman, is also an adviser of Huobi (one of the largest cryptocurrency exchanges in the world and the owner of Huobi.pool).
    This makes economic sense. Because mining rigs depreciate quickly, profit depends upon early access to the latest chips.
  15. See Who Is Mining Bitcoin? for more detail on the state of mining and its gradual obfuscation.
  16. In this context to say you "control" your entry in the bank's ledger is an oversimplification. You can instruct the bank to perform transactions against your entry (and no-one else's) but the bank can reject your instructions. For example if they would overdraw your account, or send money to a sanctioned account. The key point is that your ownership relationship with the bank comes with a dispute resolution system and the ability to reverse transactions. Your cryptocurrency wallet has neither.
  17. Web3 is Going Just Great is Molly White's list of things that went wrong. The cumulative losses she tracks currently stand at over $79B.
  18. Your secrets are especially at risk if anyone in your software supply chain uses a build system implemented using AI "vibe coding". David Gerard's Vibe-coded build system NX gets hacked, steals vibe-coders’ crypto details a truly beautiful example of the extraordinary level of incompetence this reveals.
  19. IBM's Heron, which HSBC recently used to grab headlines, has 156 qubits.
  20. Molly White's Abuse and harassment on the blockchain is an excellent overview of the privacy risks inherent to real-world transactions on public blockchain ledgers:
    Imagine if, when you Venmo-ed your Tinder date for your half of the meal, they could now see every other transaction you’d ever made—and not just on Venmo, but the ones you made with your credit card, bank transfer, or other apps, and with no option to set the visibility of the transfer to “private”. The split checks with all of your previous Tinder dates? That monthly transfer to your therapist? The debts you’re paying off (or not), the charities to which you’re donating (or not), the amount you’re putting in a retirement account (or not)? The location of that corner store right by your apartment where you so frequently go to grab a pint of ice cream at 10pm? Not only would this all be visible to that one-off Tinder date, but also to your ex-partners, your estranged family members, your prospective employers. An abusive partner could trivially see you siphoning funds to an account they can’t control as you prepare to leave them.
  21. In The Risks Of HODL-ing I go into the details of the attack on the parents of Veer Chetal, who had unwisely live-streamed the social engineering that stole $243M from a resident of DC.

    Anyone with significant cryptocurrency wallets needs to follow Jameson Lopp's Known Physical Bitcoin Attacks.
  22. Source
    Torsten Sløk's AI Has Moved From a Niche Sector to the Primary Driver of All VC Investment leads with this graph, one of the clearest signs that we're in a bubble.

    Whether AI delivers net value in most cases is debatable. "Vibe coding" is touted as the prime example of AI increasing productivity, but the experimental evidence is that it decreases productivity. Kate Niederhoffer et al's Harvard Business Review article AI-Generated "Workslop” Is Destroying Productivity explains one effect:
    Employees are using AI tools to create low-effort, passable looking work that ends up creating more work for their coworkers. On social media, which is increasingly clogged with low-quality AI-generated posts, this content is often referred to as “AI slop.” In the context of work, we refer to this phenomenon as “workslop.” We define workslop as AI generated work content that masquerades as good work, but lacks the substance to meaningfully advance a given task.

    Here’s how this happens. As AI tools become more accessible, workers are increasingly able to quickly produce polished output: well-formatted slides, long, structured reports, seemingly articulate summaries of academic papers by non-experts, and usable code. But while some employees are using this ability to polish good work, others use it to create content that is actually unhelpful, incomplete, or missing crucial context about the project at hand. The insidious effect of workslop is that it shifts the burden of the work downstream, requiring the receiver to interpret, correct, or redo the work. In other words, it transfers the effort from creator to receiver.
    David Gerard's Workslop: bad ‘study’, but an excellent word points out that:
    Unfortunately, this article pretends to be a writeup of a study — but it’s actually a promotional brochure for enterprise AI products. It’s an unlabeled advertising feature.
    And goes on to explain where the workslop comes from:
    Well, you know how you get workslop — it’s when your boss mandates you use AI. He can’t say what he wants you to use it for. But you’ve been told. You’ve got metrics on how much AI you use. They’re watching and they’re measuring.
    Belle Lin and Steven Rosenbush's Stop Worrying About AI’s Return on Investment describes goalposts being moved:
    Return on investment has evaded chief information officers since AI started moving from early experimentation to more mature implementations last year. But while AI is still rapidly evolving, CIOs are recognizing that traditional ways of recognizing gains from the technology aren’t cutting it.

    Tech leaders at the WSJ Leadership Institute’s Technology Council Summit on Tuesday said racking up a few minutes of efficiency here and there don’t add up to a meaningful way of measuring ROI.
    Given the hype and the massive sunk costs, admitting that there is no there there would be a career-limiting move.

    None of this takes account of the productivity externalities of AI, such as Librarians Are Being Asked to Find AI-Hallucinated Books, academic journals' reviewers' time wasted by AI slop papers, judges' time wasted with hallucinated citations, a flood of generated child sex abuse videos, the death of social media and a vast new cyberthreat landscape.
  23. The Economist writes in What if the AI stockmarket blows up?:
    we picked ten historical bubbles and assessed them on factors including spark, cumulative capex, capex durability and investor group. By our admittedly rough-and-ready reckoning, the potential AI bubble lags behind only the three gigantic railway busts of the 19th century.
    They note that:
    For now, the splurge looks fairly modest by historical standards. According to our most generous estimate, American AI firms have invested 3-4% of current American GDP over the past four years. British railway investment during the 1840s was around 15-20% of GDP. But if forecasts for data-centre construction are correct, that will change. What is more, an unusually large share of capital investment is being devoted to assets that depreciate quickly. Nvidia’s cutting-edge chips will look clunky in a few years’ time. We estimate that the average American tech firm’s assets have a shelf-life of just nine years, compared with 15 for telecoms assets in the 1990s.
    I think they are over-estimating the shelf-life. Like Bitcoin mining, power is a major part of AI opex. Thus the incentive to (a) retire older, less power-efficient hardware, and (b) adopt the latest data-center power technology, is overwhelming. Note that Nvidia is moving to a one-year product cadence, and even when they were on a two-year cadence Jensen claimed it wasn't worth running chips from the previous cycle. Note also that the current generation of AI systems is incompatible with the power infrastructure of older data centers, and this may well happen again in a future product generation. For example, Caiwei Chen reports in China built hundreds of AI data centers to catch the AI boom. Now many stand unused:
    The local Chinese outlets Jiazi Guangnian and 36Kr report that up to 80% of China’s newly built computing resources remain unused.
    Rogé Karma makes the same point as The Economist in Just How Bad Would an AI Bubble Be?:
    An AI-bubble crash could be different. AI-related investments have already surpassed the level that telecom hit at the peak of the dot-com boom as a share of the economy. In the first half of this year, business spending on AI added more to GDP growth than all consumer spending combined. Many experts believe that a major reason the U.S. economy has been able to weather tariffs and mass deportations without a recession is because all of this AI spending is acting, in the words of one economist, as a “massive private sector stimulus program.” An AI crash could lead broadly to less spending, fewer jobs, and slower growth, potentially dragging the economy into a recession.
  24. In 2021 Nicholas Weaver estimated that the Ethereum computer was 5000 times slower than a Raspberry Pi 4. Since then the gas limit has been raised making his current estimate only 1000 times slower.
  25. Prof. Hilary Allen writes in Fintech Dystopia that:
    if people do start dumping blockchain-based assets in fire sales, everyone will know immediately because the blockchain is publicly visible. This level of transparency will only add to the panic (at least, that’s what happened during the run on the Terra stablecoin in 2022).
    ...
    We also saw ... that assets on a blockchain can be pre-programmed to execute transactions without the intervention of any human being. In good times, this makes things more efficient – but the code will execute just as quickly in bad situations, even if everyone would be better off if it didn’t.
    She adds:
    When things are spiraling out of control like this, sometimes the best medicine is a pause. Lots of traditional financial markets close at the end of the day and on weekends, which provides a natural opportunity for a break (and if things are really bad, for emergency government intervention). But one of blockchain-based finance’s claims to greater efficiency is that operations continue 24/7. We may end up missing the pauses once they’re gone.
    In the 26th September Grant's, Joel Wallenberg notes that:
    Lucrative though they may be, the problem with stablecoin deposits is that exposure to the crypto-trading ecosystem makes them inherently correlated to it and subject to runs in a new “crypto winter,” like that of 2022–23. Indeed, since as much as 70% of gross stablecoin-transaction volume derives from automated arbitrage bots and high-speed trading algorithms, runs may be rapid and without human oversight. What may be worse, the insured banks that could feed a stablecoin boom are the very ones that are likely to require taxpayer support if liquidity dries up, and Trump-style regulation is likely to be light.
    So the loophole in the GENIUS act for banks is likely to cause contagion from cryptocurrencies via stablecoins to the US banking system.

Acknowledgments

This talk benefited greatly from critiques of drafts by Hilary Allen, David Gerard, Jon Reiter, Joel Wallenberg, and Nicholas Weaver.

by David. (noreply@blogger.com) at October 01, 2025 01:00 AM

September 30, 2025

Ed Summers

powerlinesky


September 30, 2025 04:00 AM

autumn sky parking lot


September 30, 2025 04:00 AM

September 29, 2025

Open Knowledge Foundation

Open Knowledge Foundation Priorities for the 2025 OGP

The Open Government Partnership (OGP) Global Summit is a key gathering for government representatives, civil society leaders, and open government advocates from around the world. It serves as a crucial platform for sharing ideas, forging partnerships, and setting the agenda for more transparent, participatory, and accountable governance. Our team and key allies will be on...

The post Open Knowledge Foundation Priorities for the 2025 OGP first appeared on Open Knowledge Blog.

by OKFN at September 29, 2025 05:51 PM

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-09-29: Summer Project as a Google Summer of Code (GSoC) Contributor

This summer (summer of 2025), I got the opportunity to be a Google Summer of Code (GSoC) contributor. As a GSoC contributor, I worked with the TV News Archive at the Internet Archive under the mentorship of Dr. Sawood Alam, Research Lead of the Wayback Machine. Over the 12-week coding period, our project focused on detecting social media content in TV news, specifically through logo and post screenshot detection.

Introduction 

Information diffusion in social media has been well studied. For example, social and political scientists have tracked how social movements like #MeToo spread on social media and estimated political leanings of social media users. But one area that has yet to be studied is how social media is referenced in conventional broadcast television news.


Our study aims to address this gap by analysing TV news broadcasts for references to social media. These references can be in many representations - on-screen text, verbal mentions by the hosts/speakers, or visual objects displayed on the screen (Figure 1). In this GSoC project, our focus was on the visual objects, specifically detecting social media logos and screenshots of user posts appearing in TV news.



Figure 1: Different representations of social media on TV news (On-screen text, verbal mentions by the hosts/speakers, and visual objects such as logos and screenshots)

Datasets

TV News Data

Internet Archive’s TV News Archive provides access to over 2.6 million U.S. news broadcasts dating back to 2009. Each broadcast at the TV News Archive is uniquely identified by an episode ID that encodes the channel, show name, and timestamp.

For example, the episode ID:


FOXNEWSW_20230711_040000_Fox_News_at_Night 


corresponds to the show link: 

https://archive.org/details/FOXNEWSW_20230711_040000_Fox_News_at_Night


A show within the TV News Archive is represented by 1-minute clips, each accessible via URL path arguments, where time is specified in seconds:


https://archive.org/details/FOXNEWSW_20230711_040000_Fox_News_at_Night/start/0/end/60 
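As a small illustration of this convention (the helper name below is ours, not part of any Archive API), the clip URLs for an episode can be generated programmatically:

def clip_url(episode_id, start_sec, length=60):
    """Build the archive.org URL for a 1-minute clip of a TV News Archive episode."""
    return (f"https://archive.org/details/{episode_id}"
            f"/start/{start_sec}/end/{start_sec + length}")

print(clip_url("FOXNEWSW_20230711_040000_Fox_News_at_Night", 0))
# https://archive.org/details/FOXNEWSW_20230711_040000_Fox_News_at_Night/start/0/end/60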


For our visual detection tasks, we used the full-resolution frames extracted every second throughout the entirety of the broadcast, provided by Dr. Kalev Leetaru of The GDELT Project.

Our Sample Data (Selected Episodes)

Between 2020 and 2024, we sampled one day per year during primetime hours (8–11pm) across three major cable news channels (Fox News, MSNBC, and CNN), resulting in 45 episodes.


After excluding 9 episodes consisting of documentaries or special programs (which fall outside the scope of regular prime-time news coverage), the final sample contained 36 news episodes. Of these, 15 were from Fox News, 12 from MSNBC, and 9 from CNN. Table 1 presents the full list of episodes, with the excluded episodes highlighted in red.

Gold Standard Dataset 

To create the gold standard dataset, we labeled every 1-second frame of each episode with the presence of a logo and/or screenshot, along with the social media platform mentioned. The labeling process was facilitated by a previously compiled dataset, which I had created through manual review of TV news broadcasts. In that dataset, each 60-second clip was annotated for any social media references, including text, host mentions, or visual elements such as logos and screenshots. By cross-referencing these annotations, we constructed the gold standard dataset. The gold standard dataset includes only those frames that contain at least one social media reference (either a logo or a screenshot), rather than every second of an episode.


Below is a snippet of the gold standard for CNN episodes (Figure 2). 


Figure 2: A snippet of the gold standard for CNN episodes


Each row represents a single frame and is described by five columns. The filename format is episode ID-seconds.


For example,

CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg


CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon is the episode ID and 000136.jpg indicates the frame taken at the 136th second of that 60-minute episode.
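A minimal sketch of how such a filename can be split back into its episode ID and second offset (the helper name is ours):

def parse_frame_filename(filename):
    """Split 'EPISODEID-SSSSSS.jpg' into (episode_id, second_offset)."""
    stem = filename.rsplit(".", 1)[0]          # drop the .jpg extension
    episode_id, seconds = stem.rsplit("-", 1)  # split on the final hyphen
    return episode_id, int(seconds)

print(parse_frame_filename("CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg"))
# ('CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon', 136)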


The Logo and Screenshot columns indicate their presence, while their Type columns specify the platform. 


For example, the entry CNNW_20200314_030000_CNN_Tonight_With_Don_Lemon-000136.jpg shows that the frame contains both a Twitter logo and a Twitter screenshot. 


The complete gold standard datasets for all three channels can be accessed via the following links:


CNN:  https://github.com/internetarchive/tvnews_socialmedia_mentions/blob/main/GoldStandardDataset/Labels_for_Images/gold_standard_images_cnn.csv


Fox News: https://github.com/internetarchive/tvnews_socialmedia_mentions/blob/main/GoldStandardDataset/Labels_for_Images/gold_standard_images_foxnews.csv


MSNBC:  https://github.com/internetarchive/tvnews_socialmedia_mentions/blob/main/GoldStandardDataset/Labels_for_Images/gold_standard_images_msnbc.csv

Logo and Screenshot Detection Process

We implemented a system to automatically detect social media logos and user post screenshots in television news images using the ChatGPT API (GPT-4o is a multimodal model capable of processing both text and image input). The workflow is summarized below. 

1. System Setup

We accessed the API of the GPT-4o model (using an access token provided by the Internet Archive) to process image frames and return structured text output.


Image-to-Text Pipeline:

2. Image Preprocessing

We extracted the full-resolution frames (per second) of each episode for processing. The raw frames were provided as a .tar file per episode. 


Since video content such as TV news broadcasts often contains long segments of visually static or near-identical scenes, processing every extracted frame independently can introduce significant redundancy and is computationally expensive. To address this, we applied perceptual hashing to detect and eliminate duplicate or near-duplicate frames.


We used the Python library ImageHash (with average hashing) to reduce the number of frames that need to be processed (code). To measure how close two frames were, we calculated the Hamming distance between their hashes. A low Hamming distance means the frames are almost the same, while a higher value means they are more different. By setting a threshold t (for example, treating frames with a distance of t ≤ 5 as duplicates), we were able to keep just one representative frame from a group of similar ones.


To identify duplicate groups, we defined a threshold parameter t, such that any two frames with a Hamming distance ≤ t were considered equivalent. Within each group of near-duplicates, only a single representative frame was retained. We evaluated multiple thresholds (t=5,4,3). We also explored whether keeping the middle frame or the last frame from a group of similar frames made any difference in the results. While these choices did not significantly impact our initial findings, it is an aspect that requires further investigation and will be considered as part of future work.


For the final configuration of the deduplication process, we used t=3 for a relatively strict threshold and to minimize the chance of discarding distinct, relevant content. Within each group, we retained the middle frame, guided by the intuition that the last frame of a group often coincides with transition boundaries (e.g., cuts, fades), whereas the middle frame is less likely to be affected.
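A minimal sketch of this deduplication step, assuming an episode's frames sit in one directory and sort into temporal order; it compares each frame to the previous one, which is a simplification of the grouping in the linked code:

from pathlib import Path
from PIL import Image
import imagehash

def deduplicate_frames(frame_dir, t=3):
    """Group runs of near-duplicate frames (average-hash Hamming distance <= t)
    and keep the middle frame of each run."""
    frames = sorted(Path(frame_dir).glob("*.jpg"))
    kept, group, prev_hash = [], [], None
    for frame in frames:
        h = imagehash.average_hash(Image.open(frame))
        if prev_hash is not None and (h - prev_hash) > t:  # '-' gives the Hamming distance
            kept.append(group[len(group) // 2])            # keep the middle frame of the run
            group = []
        group.append(frame)
        prev_hash = h
    if group:
        kept.append(group[len(group) // 2])
    return kept

representatives = deduplicate_frames("frames/FOXNEWSW_20230711_040000_Fox_News_at_Night", t=3)
print(len(representatives), "representative frames kept")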

3. Prompt Design and Iterations

To automatically detect social media logos and user post screenshots using the ChatGPT API, we designed a structured prompt. We iteratively refined the prompt over seven versions (link to commits) to ensure strict and reproducible detection of social media logos and user post screenshots. Each version introduced improvements, and the changes made in each version are documented in the commit descriptions.


A major change was made from prompt version 3 (link to v3) to prompt version 4 (link to v4). The update narrowed the task to focus strictly on logo and user post screenshot detection. Previous versions included additional attributes such as textual mentions of social media, post context, and profile mentions, but version 4 and subsequent versions disregarded these elements, emphasizing visual detection only.


After several iterations and refinements based on the results of earlier versions, the final prompt we used was version 7 (link to v7).


The final version of the prompt instructed the model to output the following fields:

  1. Social Media Logo (Yes/No)

  2. Logo Detection Confidence (0–1)

  3. Social Media Logo Type

  4. Social Media Post Screenshot (Yes/No)

  5. Screenshot Detection Confidence (0–1)

  6. Social Media Screenshot Type


The final prompt reflects the following considerations:

  1. Scope of platforms

Facebook, Instagram, Twitter (bird logo), X (X logo), Threads, TikTok, Truth Social, LinkedIn, Meta, Parler, Pinterest, Rumble, Snapchat, YouTube, and Discord. These are the platforms that appeared in the gold standard.

  2. Logo detection rules

  3. Post screenshot rules

  4. Confidence Scores

For both logos and screenshots,  we prompted the model to provide a confidence score from 0 to 1, indicating how certain it is about its detection. These scores were recorded but not yet used in the analysis; they will be considered in future work.
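To make the API snippet in the next section easier to follow, here is an illustrative approximation of a prompt with this structure, assigned to the analysis_prompt name used there. It is not the actual version 7 prompt (which is linked above), just a sketch of the fields and rules described in this section:

# Illustrative only -- an approximation of the prompt structure described above,
# not the actual v7 prompt used in the project.
analysis_prompt = """
You are analyzing a single frame from a TV news broadcast.
Only consider these platforms: Facebook, Instagram, Twitter (bird logo), X (X logo),
Threads, TikTok, Truth Social, LinkedIn, Meta, Parler, Pinterest, Rumble, Snapchat,
YouTube, Discord.
Do not count the X in the Xfinity logo, the FOX News logo, or other stray X letters,
and do not count platform names that appear only as text (OCR) as logos.
Answer in exactly this format:
Social Media Logo: Yes/No
Logo Detection Confidence: <number between 0 and 1>
Social Media Logo Type: <platform or N/A>
Social Media Post Screenshot: Yes/No
Screenshot Detection Confidence: <number between 0 and 1>
Social Media Screenshot Type: <platform or N/A>
"""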

4. API Interaction

Each request consisted of a single user message containing:
1. The analysis prompt (text instructions)
2. The image (base 64 encoded) as an inline image_url.

Figure 3 shows a snippet of the code used to encode images and send requests to the API (full code).

import base64

# Snippet from the full script: `client` is an OpenAI API client created earlier,
# `analysis_prompt` holds the detection prompt, and `image_path` is the frame
# being processed. The request is wrapped in a simple retry loop.
for attempt in range(3):
    try:
        # encode the frame as base64 so it can be sent inline with the prompt
        with open(image_path, "rb") as image_file:
            encoded_image = base64.b64encode(image_file.read()).decode('utf-8')

        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": analysis_prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"}}
                ]
            }
        ]

        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            max_tokens=1000,  # set a reasonable max_tokens for response length
            temperature=0.2
        )

        content = response.choices[0].message.content
        parsed_fields = parse_response(content)
        successful_request = True
        break  # if successful, exit the retry loop
    except Exception as error:
        print(f"Request failed (attempt {attempt + 1}): {error}")
Figure 3: A snippet of the code

5. Response Parsing and Output

After the API returns a response for each frame, we parse the model’s output into a CSV file for each episode containing all six fields as listed in the prompt design and iterations section. We used a flexible regex-based parser that extracts all fields reliably, even if the model’s formatting varies slightly (L93-L159 of code). 
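A rough sketch of such a parser, assuming the model answers with "Field: value" lines as instructed; the regex and field handling here are ours, not copied from the linked code:

import re

FIELDS = [
    "Social Media Logo",
    "Logo Detection Confidence",
    "Social Media Logo Type",
    "Social Media Post Screenshot",
    "Screenshot Detection Confidence",
    "Social Media Screenshot Type",
]

def parse_response(content):
    """Extract the six expected fields from the model's reply, tolerating
    minor formatting drift such as '**' around field names."""
    parsed = {}
    for field in FIELDS:
        m = re.search(rf"{re.escape(field)}\s*[:\-]\s*(.+)", content, re.IGNORECASE)
        parsed[field] = m.group(1).strip().strip("*") if m else "N/A"
    return parsed

print(parse_response("Social Media Logo: Yes\nSocial Media Logo Type: Twitter"))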


Next, we cleaned the ChatGPT output (code). The script standardizes file paths and normalizes binary columns (Social Media Logo and Social Media Post Screenshot) by converting variations of “Yes”, “No”, and “N/A” into a consistent format. It also normalizes platform names, replacing standalone “X” with “Twitter (X)” and updating Twitter bird logos to “Twitter”, to align with the labels in the gold standard dataset for evaluation.


After cleaning each episode’s results, we combined them into a single CSV file per channel (code). The script iterates through all individual CSV files for a given channel and merges them into one consolidated CSV.
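A condensed sketch of the cleaning and merging steps, assuming per-episode CSVs with the column names listed above; the file paths and exact label strings are illustrative, simplified from the linked scripts:

import glob
import pandas as pd

def clean_episode_csv(path):
    df = pd.read_csv(path)
    # normalize the Yes/No/N/A columns
    for col in ["Social Media Logo", "Social Media Post Screenshot"]:
        df[col] = (df[col].astype(str).str.strip().str.lower()
                   .map({"yes": "Yes", "no": "No"}).fillna("N/A"))
    # align platform names with the gold standard labels
    for col in ["Social Media Logo Type", "Social Media Screenshot Type"]:
        df[col] = df[col].replace({"X": "Twitter (X)", "Twitter (bird logo)": "Twitter"})
    return df

# combine all cleaned per-episode results for one channel into a single CSV
channel_files = glob.glob("results/cnn/*.csv")
combined = pd.concat([clean_episode_csv(f) for f in channel_files], ignore_index=True)
combined.to_csv("results/cnn_combined.csv", index=False)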

6. Evaluation

We started small, restricting our initial tests to a single news episode: Fox News at Night With Shannon Bream, March 13, 2020, 8-9 PM (results). This allowed us to experiment with different prompts before scaling to the full dataset. Across these runs, we varied both the prompt structure (Prompt v1–v4) and the decoding temperature (0.0 and 0.2). The decoding temperature controls randomness in LLM output: lower values (such as 0.0 and 0.2) are more deterministic, higher values more creative. At temperature 0.0, the output is essentially greedy; the same input will likely produce the same output. For the final version, we ended up using a temperature of 0.2 to allow some flexibility to interpret edge cases without introducing instability.
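For reference, the headline numbers in the tables below can be computed by joining the model output with the gold standard on filename and counting agreement. Here is a minimal sketch for logo detection, assuming the column names described earlier and treating frames absent from the gold standard as negatives (since the gold standard only lists frames with at least one reference):

import pandas as pd

def logo_metrics(pred_csv, gold_csv):
    """Accuracy, precision, recall, and F1 for logo detection."""
    pred = pd.read_csv(pred_csv)
    gold = pd.read_csv(gold_csv)
    merged = pred.merge(gold[["Filename", "Logo"]], on="Filename", how="left")
    y_true = merged["Logo"].fillna("No").eq("Yes")     # gold standard label
    y_pred = merged["Social Media Logo"].eq("Yes")     # model prediction
    tp = (y_true & y_pred).sum()
    fp = (~y_true & y_pred).sum()
    fn = (y_true & ~y_pred).sum()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (y_true == y_pred).mean(),
            "precision": precision, "recall": recall, "f1": f1}

print(logo_metrics("results/cnn_combined.csv", "gold_standard_images_cnn.csv"))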

Single Episode Evaluation Results

For logo detection, results from the single episode show a clear improvement across prompt versions. 


Prompt v1 (Runs 1–7): The baseline instruction set produced very high recall but extremely low precision, with many false positives (results). For example, in Run 4, the model achieved a recall of 0.9155 but precision of only 0.1167, yielding an overall F1-score of 0.2070.


Prompt v2 (Run 8): Refining the prompt substantially reduced false positives, increasing precision to 0.1700 while recall remained high at 0.9577 (results).


Prompt v3 (Run 9): Further improvements to the prompt yielded a significant improvement in balance: precision rose to 0.3571 while recall remained strong (0.9155), resulting in an F1-score of 0.5138 (results).


Prompt v4 (Run 10): Explicitly narrowing scope to only logos and screenshots without any questions related to additional context improved our results drastically (results). This change increased the precision (0.9315) while maintaining high recall (0.9577), producing near-perfect accuracy (0.9978) and an overall F1 score (0.9444).


Table 2 shows the key results for logo detection. Results show a clear trajectory of improvement across prompt versions (v1–v4).


For screenshot detection (Table 3), performance was consistently perfect for this episode. The model maintained 100% accuracy, precision, recall, and F1-score across all versions. This also  suggests that screenshot detection is a relatively straightforward task compared to logo detection, at least for this particular episode.



Version              Accuracy   Precision   Recall   F1-score
Run 4 (Prompt v1)    0.8640     0.1167      0.9155   0.2070
Run 8 (Prompt v2)    0.9085     0.1700      0.9577   0.2887
Run 9 (Prompt v3)    0.9664     0.3571      0.9155   0.5138
Run 10 (Prompt v4)   0.9978     0.9315      0.9577   0.9444

Table 2: Logo detection key results on a single episode (t=0.2)



Version              Accuracy   Precision   Recall   F1-score
Run 4 (Prompt v1)    1.0000     1.0000      1.0000   1.0000
Run 8 (Prompt v2)    1.0000     1.0000      1.0000   1.0000
Run 9 (Prompt v3)    1.0000     1.0000      1.0000   1.0000
Run 10 (Prompt v4)   1.0000     1.0000      1.0000   1.0000

Table 3: Screenshot detection key results on a single episode (t=0.2)


With appropriate constraints, the model could reliably perform logo detection, while screenshot detection required minimal intervention.

All Episodes Evaluation Results

After establishing stability with Prompt v4, we scaled to all 36 episodes across three channels  (results). Performance metrics for logo detection are provided in Table 4, and those for screenshot detection are shown in Table 5. 


Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9912     0.1705      0.9565   0.2895
FOX News    0.9903     0.5039      0.9324   0.6542
MSNBC       0.9931     0.5238      0.9649   0.6790

Table 4: Performance metrics for logo detection (Prompt version 4, all episodes)



Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9984     0.5405      0.8696   0.6667
FOX News    0.9922     0.1750      1.0000   0.2979
MSNBC       0.9986     0.8022      0.9605   0.8743

Table 5: Performance metrics for screenshot detection (Prompt version 4, all episodes)


The results showed:


For logo detection, MSNBC (F1 = 0.6790) and Fox News (F1 = 0.6542) performed comparably, while CNN (F1 = 0.2895) was prone to over-detection, with many false positives dragging down its precision.


For screenshot detection, MSNBC again outperformed, with an F1 = 0.8743. CNN (F1 = 0.667) and Fox News (F1 = 0.298) were more prone to over-detection.


The same model and prompt performed better on MSNBC content. This may be related to differences in on-screen visual style, such as clearer or less ambiguous logo and screenshot cues, but this remains speculative and warrants further study.


We made further refinements to the prompt to improve precision: 


Prompt v5 (changes): This version of the prompt sets a fixed list of valid platforms, adds confidence scores for detections, and tightens logo detection rules with stricter visual checks.


Prompt v6 (changes): Explicit X logo rules were introduced, which reduced false positives. We further clarified the confidence score instructions to ensure consistent numeric outputs for all detections, and refined the screenshot criteria to include only user posts, reducing mislabeling; this marked the first substantial prompt change for screenshots, as their performance had previously been consistently stable. From Prompt v5 to Prompt v6, CNN saw a slight drop in logo precision but improved screenshot F1, while MSNBC showed minor gains in screenshot detection with stable logo performance.


Prompt v7 (changes): This final configuration produced the most stable results across channels. It simplifies the X logo rules by removing exceptions for black, white, or inverted colors, while keeping strict guidance to avoid other confusing logos (X in the Xfinity logo, the FOX News logo, or other random X letters). It clarifies the confidence score questions to always require a numeric answer between 0 and 1, explicitly prohibiting “N/A” responses for consistency, and it explicitly states that OCR-detected platform names are not counted as logos.


Results from Prompt version 4 (Tables 4 and 5) to Prompt version 5 (Tables 6 and 7) show improved precision across all channels while maintaining high recall. Specifically, logo precision rose from 0.1705 to 0.3056 for CNN, from 0.5039 to 0.6595 for Fox News, and from 0.5238 to 0.6800 for MSNBC.

Screenshot detection remained stable. Overall, version 5 produced more balanced detections, reducing over-detection, particularly for logos (results).


Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9954     0.3056      0.9167   0.4583
FOX News    0.9941     0.6595      0.8138   0.7286
MSNBC       0.9966     0.6800      0.9239   0.7834

Table 6: Performance metrics for logo detection (Prompt version 5, all episodes)


Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9982     0.5263      0.8696   0.6557
FOX News    0.9926     0.1818      1.0000   0.3077
MSNBC       0.9986     0.7714      0.9474   0.8504

Table 7: Performance metrics for screenshot detection (Prompt version 5, all episodes)


Results from the final prompt version (Prompt version 7) are shown in Table 8 (for logos) and Table 9 (for screenshots). CNN shows an increase in logo F1 from 0.46 to 0.51 and screenshot F1 from 0.66 to 0.70. Fox News experiences a slight decrease in logo F1 (0.73 to 0.69) but an improvement in screenshot precision (0.30 to 0.39). MSNBC achieves a logo F1 of 0.89 (up from 0.78) and a screenshot F1 of 0.91 (up from 0.85). This version achieved the best balance of precision and recall, particularly for MSNBC (results). However, this also shows that no single prompt configuration is optimal for all channels; some adjustments may be required to maximize performance per channel.


Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9965     0.3529      0.8889   0.5053
FOX News    0.9926     0.5963      0.8298   0.6940
MSNBC       0.9980     0.8535      0.9306   0.8904

Table 8: Performance metrics for logo detection (Prompt version 7, all episodes)


Channel     Accuracy   Precision   Recall   F1-score
CNN         0.9986     0.5789      0.8800   0.6984
FOX News    0.9945     0.2456      1.0000   0.3944
MSNBC       0.9988     0.8496      0.9697   0.9057

Table 9: Performance metrics for screenshot detection (Prompt version 7, all episodes)


Overall, these results underscore the value of iterative prompt engineering, temperature tuning, and task-specific constraints in achieving high-quality, reproducible detection outcomes in multimedia content.

Future Work

Several directions remain for extending this work. 


Prompt Refinement and Channel-Specific Tuning: We will continue refining the analysis prompt to increase accuracy and consistency in detecting social media logos and user post screenshots. Early observations suggest that performance varies across channels (and programs), likely due to differences in how social media is visually presented on each channel or program. This indicates that channel- or program-specific prompt tuning could further enhance results.


Decoding Temperature Exploration: While our experiments primarily used low decoding temperatures (0.0 and 0.2), future work can explore a range of temperatures to evaluate whether controlled increases in randomness improve recall in edge cases without significantly raising false positives.


Frame Selection Strategies: We conducted preliminary observations using different Hamming distance thresholds (t=5,4,3) to group similar frames and experimented with selecting the first, middle, or last frame from each group. While these initial explorations provided some insights, they were not systematically analyzed. Future work will investigate the effects of different frame selection strategies to determine the optimal approach for reducing redundancy without losing relevant content.


Confidence Scores: The confidence scores for logos and screenshots (ranging from 0 to 1) were recorded but not yet utilized. Future work will explore integrating these scores into the analysis to weigh detections and potentially improve precision.


Dataset Expansion: Future work includes manually labeling additional episodes from multiple days of prime-time TV news to expand the gold standard dataset. This will uncover more instances of social media references. We will also be able to evaluate the performance of our logo and screenshot detection pipeline across diverse broadcast content.


Advertisement Filtering: With access to advertisement segments, we plan to exclude ad images before the evaluation step. This will improve our results, as currently, the pipeline includes ads, so ChatGPT may label social media references in ads that are not annotated in the gold standard. As a result, some apparent false positives are actually correct detections, highlighting the need to filter ads for accurate evaluation.


Complementary Detection Methods: In addition to logo and screenshot detection, future work will focus on other approaches such as analyzing OCR-extracted text from video frames and analysing closed-caption transcripts for social media references. 


Compare Against Other Multimodal Models: We aim to explore other vision-language APIs, such as Google’s Gemini Pro to compare detection performance across different Large Language Models (LLMs).

Acknowledgement

I sincerely thank the Internet Archive and the Google Summer of Code Program for providing this amazing opportunity. In particular, I would like to thank Sawood Alam, Research Lead, and Will Howes, Software Engineer, at the Internet Archive’s Wayback Machine for their guidance and mentorship. I also acknowledge Mark Graham, Director of the Wayback Machine at the Internet Archive, and Roger Macdonald, Founder of the Internet Archive’s TV News Archive, for their invaluable support. I am grateful to the TV News Archive team for welcoming me into their meetings, which allowed me to gain a deeper understanding of the archive and its work. I am especially grateful to Kalev Leetaru (Founder, the GDELT Project) for providing the necessary Internet Archive data, which were processed through the GDELT Project. Finally, I would like to thank my PhD advisors, Dr. Michele Weigle and Dr. Michael Nelson (Old Dominion University) and Dr. Alexander Nwala (William & Mary) for their continued guidance.



by Himarsha Jayanetti (noreply@blogger.com) at September 29, 2025 12:50 PM

Hugh Rundle

Did you mean..?

Let's solve the problem but let's not make it any worse by guessing.

Apollo 13 Flight Journal - Gene Kranz, Flight Director, Apollo 13

One of the key things aspiring librarians are taught is "the reference interview". The basic premise is that often when people ask for help or information, what they ask for isn't what they actually want. They ask for books on Spain, when they actually want to understand the origins of the global influenza pandemic of 1918-20. They ask if you have anything on men's fashion, when they want to know how to tie a cravat. They ask if you have anything by Charles Dickens, when they are looking for a primer on Charles Darwin's theory of evolution. The "reference interview" is a technique to gently ask clarifying questions so as to ensure that you help the library user connect to what they are really looking for, rather than what they originally asked.

Sometimes this vagueness is deliberate – perhaps they don't want the librarian to know which specific medical malady they are suffering from, or they're embarrassed about their hobby. Sometimes it's because people "don't want to bother" librarians, who they perceive have more important things to do than our literal job of connecting people to the information and cultural works they are interested in – so they'll ask a vague question hoping for a quick answer. Often it's simply that the less we know about something, the harder it is to formulate a clear question about it (like getting our Charles' mixed up). Some of us are merely over-confident that we will figure it out if someone points us in vaguely the right direction. But for many people, figuring out what it is we actually want to know, and whether it was even a good question, is the point.

I was thinking about this after reading Ernie Smith's recent post about Google AI summaries, which are at the centre of a legal case brought against Alphabet Inc by Penske Media. Ernie asks, rhetorically:

Does Google understand why people look up information?

I thought this was an interesting question, because – in the context of the rest of the post – the implication is that Google does not understand why people look up information, despite their gigantic horde of data about what people search for and how they behave after Google loads a page in response to that query. How could this be? Isn't "behavioural data" supposed to tell us about people's "revealed preferences"? Can analytics from billions of searches really be wrong? Maybe if we compare Google's approach to centuries of library science we might find out.

Two kinds of power

In Libraries and Large Language Models as Cultural Technologies and Two Kinds of Power Mita Williams introduces Patrick Wilson's Two kinds of power, a philosophical essay about bibliographic work:

In this work, Wilson described bibliographic work as two powers: Descriptive and Exploitative. The first is the evaluatively neutral description of books called Bibliographic control. The second is the appraisal of texts, which facilitates the exploitation of the texts by the reader.

Libraries and Large Language Models as Cultural Technologies and Two Kinds of Power - Mita Williams

Professor Hope Olson memorably called this "descriptive power" the power to name. The words we use to describe our reality have a material impact on shaping that reality, or at least our relationship with it. In August this year the Australian Stock Exchange accidentally wiped $400 million off the market value of a listed company because they mixed up the names of TPG Telecom Limited and TPG Capital Asia. Defamation law continues to exist because of the power of being described as a criminal or a liar. Description matters.

Web search too originally sought to describe, and on that basis to return likely results for your search.

Different search engines have approached the creation of indexes using their own strategies, but the purpose was always to map individual web pages both to concepts or keywords and – since Google – to each other. A key difference between web search engines and the systems created and used by libraries is that the latter make use of controlled metadata, whereas the former cannot and do not. Google in particular has made various attempts at outsourcing the work of creating more structured metadata and even controlled vocabularies. All have struggled to varying degrees on the basis that ordinary people creating websites aren't much interested in and don't know how to do this work (for free), and businesses can't see much, if any, profit in it. At least, their own profit.

Whilst there is a widespread feeling that Google search used to be much better, the fact that "Search Engine Optimisation" became a concept soon after the creation of web search engines points to the fundamental limitation of uncontrolled indexes. Librarians describe what creative works are about, thus connecting items to each other through their descriptions. Search engines approach the problem from the other direction: describing the connections between works, thus inferring what concepts they are most strongly associated with. Are either of these approaches really "evaluatively neutral"?

Long held up as a core tenet of librarianship, neutrality or objectivity has been fiercely debated in recent years. Mirroring similar criticisms of journalistic standards, many have pointed out that "objectivity" in social matters very often simply means upholding the status quo. We are social animals. Nothing we can say about ourselves is ever "neutral" or objective, because everything is contested and relational. Yet humans on the whole are anthropocentric in how we see the world. We are a species that frequently thinks it can see the literal face of god in a piece of toast. Everything we say about anything can be seen as being about us, whether it's the mating habits of penguins or the movement of celestial bodies.

So as soon as we recognise Descriptive Power, arguing over semantics is inevitable. In democratic systems an enormous amount of effort is expended attempting to move the Overton Window. Was Australia "settled" or "invaded" in 1788? Are certain countries "poor", "developing", "in the third world" or "from the global south"? Is a political movement "conservative", "alt-right" or "fascist"? Is it still "folk music" if it's played with an electric guitar? We argue about these descriptions because they also say something about how we see ourselves and want to be seen by others. When description has power, you can't really be "neutral" when wielding it. Much like judges or umpires, when using the power to name the librarian's task is to be fair and factual. We are members of the Reality Based Community.

Memory holes

Mita goes on to explore the idea that the library profession has generally seen our job as beginning and ending with descriptive power:

Library catalogues don’t tell you what is true or not. While libraries facilitate claims of authorship, we do not claim ownership of the works we hold. We don’t tell you if the work is good or not. It is up to authors to choose to cite in their bibliographies to connect their work with others and it is up to readers to follow the citation trails that best suit their aims..

Here is where "neutrality" comes in. We might describe a book as being about something, written by a particular person, or connected to other works and people. But we make no claims as to the veracity of the assertions you might read within it. Assessing the truth or artistry of a certain work is, as they say, "left as an exercise for the reader". This is where the (not always exercised) professional commitment to reader privacy comes in. If it's up to the reader to glean their own meaning from the works in our collections, then we can't know and should not assume their purpose or their conclusions. And if people are to be given the freedom to explore these meanings, they can't have the fear of persecution for reading "the wrong things" hanging over their heads.

We can see an almost perfect case study of this clash of approaches in the debacle of Clarivate's "Alma Research Assistant". Aaron Tay lays this out clearly in The AI powered Library Search That Refused to Search:

Imagine a first‑year student typing “Tulsa race riot” into the library search box and being greeted with zero results—or worse, an error suggesting the topic itself is off‑limits. That is exactly what Jay Singley reported in ACRLog when testing Summon Research Assistant, and it matches what I’ve found in my own tests with the functionally similar Primo Research Assistant (both are by Clarivate)...According to Ex Libris, the culprit is a content‑filtering layer imposed by Azure OpenAI, the service that underpins Summon Research Assistant and its Primo cousin.

Want to research something that might be "controversial"? I can't let you do that Dave, computer says no. What's worse in the case of Primo Research Assistant is that rather than declaring to the user that it won't search, the system is designed to simply memory hole any relevant results and claim that it cannot find anything.

A library is a provocation

Whilst they are often associated with a certain stodginess, every library is a provocation. With all these books, how could you restrict yourself to reading only one or two? Look how many different ideas there are. See how many ways there are to describe our worlds. The thing that differentiates a library from a random pile of books is that all these ideas, concepts and stories are organised. They have indexes, catalogues, and classification systems. They are arranged into subjects, genres, and sub-collections. The connections between them and the patterns they map are made legible.

The pattern of activity that digital networks, ranging from the internet to the web, encourage is building connections, the creation of more complex networks. The work of making connections both among websites and in a person’s own thinking is what AI chatbots are designed to replace.

A linkless internet - Collin Jennings

The most exciting developments in library science right now are exploring not how to provide "better answers" but rather how to provide richer opportunities to understand and map the connections between things. With linked data and modern catalogue interfaces we can overlay multiple ontologies onto the same collection, making different kinds of connections based on different ways of understanding the world.

LLMs are mid

Clarifying the connections between people and works. Disambiguating names. Mapping concepts to works, and re-mapping and organising them in line with different ontological understandings. All of this requires precision. Because they under-estimate the skill of manual classification and indexing, and over-estimate their own technologies, AI Bros thought they could combine Wilson's two powers. But description for the purpose of identifying information sources is an exact science. This is why we have invested so much energy and time in things like controlled vocabularies, authority files, and the ever-growing list of Persistent Identifiers (PIDs) like DOI, ISNI, and ORCID. It's why we've been slowly and carefully talking about Linked Open Data.

And then these arrogant little napoleons think they can just YOLO it.

The problem is not the core computing concepts behind machine learning but rather the implementation details and the claims made about the resulting systems. I agree with Aaron Tay's take on embedding vector search – this is an incredibly compelling idea that has been stretched far beyond what it is most useful for. Transformer models of any kind are ultimately intended to find the most probable match to an input. Not the best match. Not a perfect match. Not a complete list of matches. An arbitrary number of "result most likely". In contrast to the exactness of library science, these new approaches are merely averaging devices.

Answering machines

Let us now return to Ernie Smith's question: Does Google understand why people look up information?

Google thinks people look up information in order to find "the most likely answer". The company has to think like this, because the key purpose of Google's search tool is to sell audiences to advertisers via automated, hundred-millisecond-long auctions for the purpose of one-shot advertising. Everything else we observe about how it works stems from this key fact about their business. Endless listicles and slop generated by both humans and machines is the inevitable result. And since the purpose of the site is simply to display ads, why not make it "more efficient" by providing an "answer" immediately?

What I think Ernie is gesturing at is that when we search for information what we often want is to know what other people think. We want to explore ideas and expand our horizons. We want to know how things are connected. We want to understand our world. When you immediately answer the question you think someone asked instead of engaging with them to work out what they are actually looking for, it's likely to be unhelpful. Asking good questions is harder than giving great answers.

Chat bots and LLMs can't solve the problem. They're just guessing.


by Hugh Rundle at September 29, 2025 12:00 AM

Library Tech Talk (U of Michigan)

How 1000+ Users’ Feedback Has Transformed Library Search

Illustration of three people with thought or speech bubbles representing their survey feedback. Image created using Open Peeps by Pablo Stanley and Team. Licensed under CC0.

The second Library Search Benchmark Survey launched earlier this year and we gathered feedback from students, faculty, staff and library employees about finding and accessing materials through Library Search. We also measured changes in user satisfaction, ease of use, user challenges and wins from our first survey in 2022. Participation from users outside the library grew significantly and responses helped us identify clear and actionable insights we’re excited to share and act on in the coming year.

by Ben Howell at September 29, 2025 12:00 AM

September 28, 2025

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-09-28: Trip report: Dagstuhl Seminar 25382: Open Scholarly Information Systems: Status Quo, Challenges, Opportunities


During September 14-19, I was honored to organize and participate in the Dagstuhl Seminar titled Open Scholarly Information Systems: Status Quo, Challenges, Opportunities. The seminar was held at Schloss Dagstuhl, a small castle-like computer science center in the west of Germany. 


What is unique about Dagstuhl is that it is located in the woods, quite isolated from big cities. The nearest town, Wadern, is a 30-minute walk away, and there is no public transportation. After landing at the huge Frankfurt airport, one has to take a regional train for about two hours to Türkismühle and then a taxi for the final leg. Based on the recommendation of the Dagstuhl website, I had to reserve a taxi from Taxi Martin at least three days in advance. Of course, I had to reserve it again when I left at 4 am!

The campus consists of two parts: an old building, built about 260 years ago, and a new building, built in 2001; the two are connected by a bridge on the second floor. In addition to lecture rooms and living rooms, the old building has a dining room, a kitchen, a piano room, and a backyard that is very nice for afternoon tea and outdoor discussions. In addition to lecture and living rooms, the new building has a laundry room, a sauna, and a gym. Both buildings are covered by Wi-Fi, and coffee (including espresso drinks), sparkling water, and wine are available 24/7. The center has everything you need for research, except cars. You have to walk if you need to get out.

The core organization team consists of five people from around the world: Hannah Bast (chair, Germany), Marcel Ackermann (coordinator, Germany), Guillaume Cabanac (co-chair, France), Paolo Manghi (co-chair, Italy), and me (co-chair, United States). We started drafting the proposal back in March 2023. The original goal was to celebrate the 32nd anniversary of DBLP, a legendary digital library for computer and information sciences. Later on, the plan evolved to assemble about 40 scholars to discuss a broader topic: open scholarly information systems (OSIs). To carefully control the size of the seminar and guarantee attendance, the organization team sent invitations in at least three rounds. The invitees included PIs, core technical people, and directors of well-known digital libraries (e.g., Google Scholar, arXiv, CORE, OpenAIRE, OpenReview, NDLTD, and CiteSeerX), researchers in related domains (e.g., natural language processing, the semantic web, digital libraries, and information retrieval), and software companies working with open data (e.g., Digital Science).

Unlike typical computer science conferences and workshops, the seminar was organized around short talks (10–20 minutes) with plenty of time for plenary and small-group discussions as well as social activities. The final report is collaboratively edited. The activities of each day are outlined below.

Monday

Tuesday 

Wednesday 

Thursday 

Friday

The manifesto summarized the main topics, conclusions, and next steps discussed in working groups. The table of contents is shown below. 

  1. OpenReview and DBLP
  2. Barcelona Declaration, arguments for open scholarly infrastructure
  3. Barcelona Declaration, signatories
  4. Wikidata and QLever
  5. oAsIs - The role of open agentic scholarly information systems in the age of agentic AI
  6. Collaborative Metadata
  7. Author name disambiguation
  8. Harmonize Subject Area Classification
  9. CORE to Commons workflow
  10. Assessing overlaps between Dimensions and Wikidata
  11. Schloss Singapura on AI
  12. Modal µ-calculus in SPARQL
  13. Do we need rankings OR How to change the current perverse system?
  14. CS Ontology and DBLP
  15. ACL Anthology
  16. Fake conference metadata discussion
  17. Future of bibliographic metadata in Wikidata
  18. Working with Scholarly Metadata Dumps
  19. Perspectives on How to Achieve the Sustainability of Open Scholarly Infrastructure (OSI)
  20. Software tagging in DBLP
  21. Nanopublications to track acknowledgements of DBLP/ Dagstuhl
  22. Find Ghost #96 
Take one topic as an example. The manifesto contains the following. 

The last topic is "Find Ghost #96", which originated from a tradition of capturing the "Ghost" located everywhere in the Dagstuhl Castle! If you want to know the details, come and visit Dagstuhl! 

The seminar was highly rated by the participants, not only because of the free food, drink, and lodging but also because of the short talks and sufficient time for free-form discussions. Thanks to the chair, Hannah Bast! This format is more effective at letting scholars have in-depth discussions that go into detail and cover many aspects of concrete problems. This is in contrast with most computer and information science conferences, with their pre-scheduled 20-minute or 30-minute presentations, which appear to cover lots of material but in fact make it easy for attendees to get bored. Those presentations are usually followed by a very short amount of time for Q&A and discussion, and most of the time the session chair has to cut things off and ask people to take the discussion "offline", which rarely actually happens.

On the 32nd anniversary of DBLP, Dr. Michael Ley announced his retirement. Many people agreed that with the retirements of C. Lee Giles, Ed A. Fox, and Michael Ley, this marks the end of an era of digital libraries. DBLP will have a new director. 

The food at Dagstuhl was VERY good, featuring homemade dishes that were tasty, fresh, and healthy. The honey is produced by Dagstuhl's own beehives.

Finally, here is the featured picture of this seminar. How many people can you recognize? Some of them are well-known in their domains.



-- Jian Wu


by Jian Wu (noreply@blogger.com) at September 28, 2025 10:12 PM

September 26, 2025

Ed Summers

Information

Information is essential, and these days so readily available through the internet that we all suffer from a surfeit of it. But Information without selection and critique is like a relationship without love: everything you need is there, except the central element. And without it, nothing has real value. (Patterson, 2025, pp. 56–57).

Patterson, I. (2025). Books: A Manifesto. London: Weidenfeld & Nicolson.

September 26, 2025 04:00 AM

blurrymoon w/ pipecleaner trees

blurrymoon w/ pipecleaner trees

September 26, 2025 04:00 AM

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-09-25: ACM Symposium on Eye Tracking Research and Applications (ETRA) 2025 Trip Report

 


ACM Symposium on Eye Tracking Research & Applications (ETRA) 2025 was an in-person conference held at Miraikan, Tokyo, Japan. The conference took place from May 26 to May 29, 2025. 

ETRA 2025 focuses on all aspects of eye movement research across a wide range of disciplines and brings together computer scientists, engineers, and behavioral scientists to advance eye-tracking research and applications. 

🎉 ETRA 2025 is officially underway in Tokyo! 🇯🇵
Join researchers and practitioners from around the world as we advance eye tracking research together. Four days of cutting-edge presentations, workshops, and networking at @miraikan#ETRA2025 #EyeTracking #Tokyo pic.twitter.com/9Pym0BkSEd

— ETRA (@ETRA_conference) May 26, 2025

Keynote 1

Dr. Yukie Nagai, Project Professor at The University of Tokyo, delivered the first keynote of ETRA 2025, “How People See the World: An Embodied Predictive Processing Theory.” Dr. Nagai introduced a neuro-inspired theory of human visual perception based on embodied predictive processing. She explained how sensorimotor learning drives the development of visual perception and attention, providing new insights into the mechanisms underlying how people see and interpret the world.

🧠 An Embodied Predictive Processing Theory
Incredible keynote by Prof. Nagai from @UTokyo_News!
From computational neuroscience to understanding how neurodiverse individuals perceive the world differently - her research bridges robotics, cognition, and eye tracking beautifully. pic.twitter.com/W53XYftCjK

— ETRA (@ETRA_conference) May 28, 2025

@ETRA_conference Day 2. Keynote by Professor Yukie Nagai @UTokyo_News_en The University of Tokyo, ""How People See the World: An Embodied Predictive Processing Theory"" #etra2025 pic.twitter.com/LLtUol5RTD

— Sampath Jayarathna (@OpenMaze) May 27, 2025

Happening now at @ETRA_conference, Prof. Yukie Nagai giving the 1st keynote on "How People See the World: An Embodied Predictive Processing Theory" #ETRA2025 🇯🇵 pic.twitter.com/HjT0g4kl55

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 27, 2025

 

Tutorial Highlight

Dr. Andrew T. Duchowski delivered a tutorial “Gaze Analytics: A Data Science Perspective.” The session covered a wide range of topics, from PsychoPy setup to advanced transition entropy analysis. Participants gained hands-on experience with the complete pipeline, engaging directly with both methodology and stimulus generation. This tutorial gave attendees practical exposure to data science approaches in gaze analytics, reinforcing its value for eye-tracking research.

🎯 #ETRA2025 Tutorial Highlight: "Gaze Analytics: A Data Science Perspective" by Prof. Andrew T. Duchowski covered everything from PsychoPy setup to advanced transition entropy analysis! 📊 Participants got hands-on experience with the complete pipeline 👁️‍🗨️ #EyeTracking pic.twitter.com/K8ua4v0uAC

— ETRA (@ETRA_conference) May 26, 2025

Papers Session 1

The first paper session featured groundbreaking research covering diverse applications of eye tracking. Presentations included work on Large Language Model (LLM) alignment analysis, laser-based eye tracking for smart glasses, and methods for detecting expertise through gaze patterns. The session demonstrated how eye-tracking research continues to expand its scope, integrating with fields such as artificial intelligence, wearable technology, and cognitive modeling. The variety of topics sparked engaging discussions, highlighting the breadth and depth of current innovations in the community.

🔬 Capturing the moment: Paper Session 1 featured groundbreaking work spanning LLM alignment analysis, laser-based eye tracking for smart glasses, and expertise detection through gaze patterns. The diversity of applications demonstrates the expanding scope of our field. #ETRA2025 pic.twitter.com/I9knZDOFQK

— ETRA (@ETRA_conference) May 28, 2025


Dr. Sampath Jayarathna of NirdsLab from Old Dominion University, Dr. Yasith Jayawardana from Georgia Institute of Technology, and Dr. Gavindya Jayawardana from University of Texas at Austin attended the paper sessions, poster sessions, and keynotes of the conference in person.

Dr. Gavindya, a postdoctoral fellow at The University of Texas at Austin working with Dr. Jacek Gwizdka, presented her research on advancing real-time measures of visual attention through the ambient/focal coefficient K, “A Real-Time Approach to Capture Ambient and Focal Attention in Visual Search.” Their work introduced a robust parametrization and an alternative estimation method, along with two new real-time measures analogous to K.

Building on her broader research interests in Eye-Tracking, Human-Computer Interaction, Human-Information Interaction, Data Science, and Machine Learning, Gavindya demonstrated in her presentation how neuro-physiological evidence and cognitive load detection approaches can be integrated into applied systems.

@ETRA_conference, always a pleasure to see @NirdsLab @WebSciDL @oducs distinguished alumni. My students Gavindya @Gavindya2 and @yasithdev with Andrew Duchowski @atduchowski. Gavindya's paper "A Real-Time Approach to Capture Ambient and Focal Attention in Visual Search" #etra2025 pic.twitter.com/EQ0kFHpGDt

— Sampath Jayarathna (@OpenMaze) May 28, 2025


Johannes Meyer presented Ambient Light Robust Eye-Tracking for Smart Glasses Using Laser Feedback Interferometry Sensors with Elongated Laser Beams. This innovative approach focuses on developing eye-tracking methods that remain robust under varying light conditions, a crucial step toward making smart glasses more reliable in real-world environments. 

Johannes Meyer presenting his work titled "Ambient Light Robust Eye-Tracking for Smart Glasses Using Laser Feedback Interferometry Sensors with Elongated Laser Beams" at @ETRA_conference
This work is funded by VIVA project#ETRA2025 🇯🇵 pic.twitter.com/XwQLlRZbGb

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 27, 2025


Mengdi Wang presented Iris Style Transfer: Enhancing Iris Recognition with Style Features and Privacy Preservation through Neural Style Transfer. The work, carried out in collaboration with Dr. Efe Bozkir and Dr. Enkelejda Kasneci, explored how neural style transfer techniques can improve iris recognition systems while simultaneously addressing privacy concerns. This presentation highlighted the potential of combining computer vision methods with privacy-preserving mechanisms, making it a noteworthy contribution to the conference program.

Happening now at @ETRA_conference , Mengdi Wang presenting his work titled "Papers
Iris Style Transfer: Enhancing Iris Recognition with Style Features and Privacy Preservation through Neural Style Transfer" in collaboration with@efebozkir and @EnkelejdaKasne1 #ETRA2025 🇯🇵 pic.twitter.com/jtxjUrqzoH

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 28, 2025


Virmarie Maquiling presented Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images. This large-scale study, conducted in collaboration with Dr. Sean Anthony Byrne, Dr. Diederick C. Niehorster, Dr. Marco Carminati, and Dr. Enkelejda Kasneci, demonstrated the use of the Segment Anything Model (SAM 2) for robust pupil segmentation. By applying zero-shot learning methods to an extensive dataset of over 14 million images, the work highlighted both the scalability and efficiency of modern machine learning in eye-tracking applications.

Virmarie Maquiling presenting her work "Zero-Shot Pupil Segmentation with SAM 2: A Case Study of Over 14 Million Images" at @ETRA_conference in collaboration with Sean Anthony Byrne, Diederick C. Niehorster, Marco Carminati,@EnkelejdaKasne1 #ETRA2025 🇯🇵 pic.twitter.com/NLfuWnARTV

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 28, 2025


Sponsor Exhibitions

The sponsors showcase highlighted cutting-edge developments in eye-tracking technology. Exhibitors demonstrated a range of innovative solutions, including immersive VR setups, wearable eye-tracking devices, and advanced software platforms. Attendees had the chance to interact directly with the technology, exploring applications that bring the future of eye tracking to life. These exhibitions not only showcased sponsor contributions but also emphasized the vital role industry partnerships play in driving forward eye-tracking research and applications.

🔬 Exploring cutting-edge eye tracking solutions at our sponsor showcase! Huge thanks to all our amazing sponsors. Their technology demonstrations are bringing the future of eye tracking to life! Don't miss the amazing sponsor exhibitions at #ETRA2025! pic.twitter.com/epBPkIXWsx

— ETRA (@ETRA_conference) May 28, 2025


Keynote 2

Dr. Jean-Marc Odobez, a senior researcher at IDIAP and EPFL, and Head of the Perception and Activity Understanding Group, delivered the second keynote, “Looking Through Their Eyes: Decoding Gaze and Attention in Everyday Life.” Dr. Odobez, a leading expert in multimodal perception systems and co-founder of Eyeware SA, presented his team’s work in the areas of gaze analysis and visual focus of attention. His keynote showcased how gaze data can be decoded to better understand naturalistic interactions, emphasizing both the technical challenges and practical applications of attention modeling in real-world settings.

@ETRA_conference Day 2, excellent keynote by Jean-Marc Odobez @Idiap_ch "Looking Through Their Eyes: Decoding Gaze and Attention in Everyday Life", some of the work in the areas of gaze and Visual Focus of Attention. #etra2025 pic.twitter.com/nOcgJciOsj

— Sampath Jayarathna (@OpenMaze) May 28, 2025

🔬 Fascinating keynote by Prof. Odobez exploring the challenges of estimating visual attention in natural environments. His work demonstrates how we can decode not just where people look, but what they're truly attending to in complex scenes.#ETRA2025 #EyeTracking #HCI #Tokyo pic.twitter.com/wo0vpioEmv

— ETRA (@ETRA_conference) May 28, 2025

Poster Session

The poster session showcased 36 innovative research studies highlighting the role of data, machine learning, and AI in eye-tracking applications. The session featured diverse approaches, ranging from neural networks for gaze prediction to advanced analytics methods for interpreting eye-movement data. The room was filled with lively discussions as researchers exchanged ideas, explained methodologies, and received feedback from peers and experts.

📊 Poster Session 1 at #ETRA2025 featured outstanding research on Data, ML, and AI applications in eye tracking. Researchers presented 36 innovative studies, from neural networks for gaze prediction to advanced analytics methods. Great discussions throughout the session.#AI #HCI pic.twitter.com/UCZHabvSqW

— ETRA (@ETRA_conference) May 28, 2025


@ETRA_conference Poster session. #etra2025 pic.twitter.com/P9IblDbgul

— Sampath Jayarathna (@OpenMaze) May 28, 2025


Pahan Jayarathna presented his first publication at the premier ACM eye-tracking conference. His work, titled “Nocturnal Diabetic Hypoglycemia Detection Using Eye Tracking,” introduced a novel approach to monitoring diabetic hypoglycemia through eye-tracking techniques.

The study opened up a promising line of inquiry into how eye-movement data can serve as non-invasive indicators for health conditions, particularly in detecting hypoglycemia during nighttime. Pahan’s poster presentation drew attention from conference attendees, sparking discussions on the medical applications of eye tracking beyond traditional HCI contexts. His contribution marked an important step in extending the reach of eye-tracking research into healthcare and biomedical domains.

Excellent job presenting his first publication at the Premier ACM conference for eye tracking titled "Nocturnal Diabetic Hypoglycemia Detection Using Eye Tracking" , a new way of looking at diabetic hypoglycemia using eye tracking https://t.co/Jwh9akYHTw pic.twitter.com/N3l2nxi6RP

— Sampath Jayarathna (@OpenMaze) May 28, 2025

 

Paper Session 2

The second paper session emphasized the importance of methodological rigor in eye-tracking research. Presentations ranged from evaluating detection algorithms to applying LLMs for cognitive processing tasks. These studies highlighted how careful methodological design provides the analytical foundations for deriving meaningful insights from eye movement data. The session demonstrated both technical depth and practical relevance, reinforcing the role of rigorous analysis in advancing the reliability and impact of eye-tracking research.

📈 Paper Session 2 highlighted the importance of methodological rigor in eye tracking research. From evaluating detection algorithms to leveraging LLMs for cognitive processing, these studies provide the analytical foundations that enable insights from eye movement data.#ETRA2025 pic.twitter.com/6kQp9EgvaK

— ETRA (@ETRA_conference) May 28, 2025

 

Panel Discussion: Eye Tracking for Accessibility

The final day began with an engaging panel discussion on Eye Tracking for Accessibility. The session brought together experts including Dr. Pawel Kasprowski (Silesian University of Technology), Dr. Diako Mardanbegi (American University of Beirut), Dr. Krzysztof Krejtz (SWPS University of Social Sciences and Humanities), and Dr. Hironobu Takagi (IBM Research, Miraikan). The discussion was led by Dr. Mohamed Khamis (University of Glasgow). Panelists explored how eye-tracking technologies can be leveraged to improve accessibility in everyday contexts, addressing challenges and opportunities for making digital and physical environments more inclusive. Their insights underscored the transformative potential of eye tracking to enhance usability and accessibility for diverse populations.

Starting @ETRA_conference's last day with panel discussion on Accessibility with the panelists Pawel Kasprowski, Diako Mardanbegi , Krzysztof Krejtz , Hironobu Takagi, and lead by @MKhamisHCI #ETRA2025 🇯🇵 pic.twitter.com/wizEGYZtOY

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025


Workshop Session: GenAI Meets Eye Tracking

The GenAI Workshop focused on the intersection of generative AI and eye-tracking research. The session began with a keynote by Xi Wang on “Decoding Human Behavior Through Gaze Patterns”, which explored how gaze data can be harnessed to understand complex aspects of human behavior.

Following the keynote, the workshop featured contributions from Gjergji Kasneci, Enkelejda Kasneci, Aranxa Villanueva, and Yusuke Sugano. Together, the speakers emphasized how generative AI techniques can advance eye-tracking applications, including cognitive modeling, behavioral prediction, and new opportunities for human-computer interaction.

The workshop was well-attended and highly interactive, bringing together perspectives from academia and industry. It highlighted how AI-driven methods are shaping the future of gaze research and opened discussions about challenges in integrating these tools into practical applications.

Happening now @Gjergji_ , @EnkelejdaKasne1 , Aranxa Villanueva and @_ysugano starting #GenEAI workshop at @ETRA_conference #ETRA2025 🇯🇵 pic.twitter.com/Lv3uJHY71r

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025

Xi Wang giving the first keynote at #GenEAI workshop at @ETRA_conference on
Decoding Human Behavior Through Gaze Patterns#ETRA2025 🇯🇵 pic.twitter.com/ymxbHFUNgp

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025

Happening now, the first session of #GenEAI at @ETRA_conference with three papers on 1) enhancing the medical domain, 2) detecting learning disorders and finally, 3) enhancing eye tracking with LLM insights#ETRA2025 🇯🇵 pic.twitter.com/fefhn7GPR7

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025

Starting session 2 of the #GenEAI workshop at @ETRA_conference with a keynote by Xucong Zhang on
Generative Models for Gaze Estimation#ETRA2025 🇯🇵 pic.twitter.com/ZKS3bZQkxs

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025

Ending the #GenEAI workshop at @ETRA_conference with awards congratulations to Quoc-Toan Nguyen on their best paper award titled "Learning Disorder Detection Using Eye Tracking: Are Large Language Models Better Than Machine Learning?"#ETRA2025 🇯🇵 pic.twitter.com/dkh7kuPkc8

— Yasmeen Abdrabou, PhD (@AbdrabouYasmeen) May 29, 2025

 

Papers Session 3

The third paper session highlighted innovative Computer Vision approaches applied to eye-tracking research. Presentations explored a range of groundbreaking topics, including iris style transfer for privacy preservation and zero-shot pupil segmentation with SAM 2 applied to over 14 million images. These studies showcased how modern vision-based techniques can push the boundaries of both data-driven analysis and practical applications in eye tracking.

By combining advanced AI methods with large-scale datasets, the session emphasized the critical role of computer vision in addressing challenges of scalability, accuracy, and privacy within the field. Researchers also demonstrated how these developments could translate into real-world solutions, reinforcing the strong connection between technical innovation and human-centered applications in eye-tracking research.

💻 Paper Session 3 showcased innovative Computer Vision approaches in eye tracking research at #ETRA2025. The session featured groundbreaking work from iris style transfer for privacy preservation to zero-shot pupil segmentation using SAM 2 across 14 million images.#HCI #CV pic.twitter.com/LAgYukrQlp

— ETRA (@ETRA_conference) May 29, 2025


Social Events

ETRA 2025 wasn’t just about research and presentations—it also provided memorable opportunities for networking and community building. One of the highlights was the conference banquet, held aboard a dinner cruise in Tokyo Bay. Attendees enjoyed an evening of fine Japanese cuisine, sake, and vibrant discussions while taking in breathtaking views of the city skyline.

The social events gave researchers, practitioners, and students a chance to relax, connect, and exchange ideas in a more informal setting. These moments of shared meals and conversations strengthened the sense of community within the eye-tracking research field, ensuring that collaborations extend beyond the technical sessions into long-lasting professional relationships.

🚢✨ What an incredible evening at the #ETRA2025 conference banquet! Our dinner cruise in Tokyo Bay was the perfect setting to celebrate eye tracking research with colleagues from around the world. Amazing Japanese cuisine, sake, and unforgettable conversations! 🍶🗾#EyeTracking pic.twitter.com/V8JQR36VJS

— ETRA (@ETRA_conference) May 30, 2025
 

ETRA 2025 emphasized inclusivity and sustainability even during mealtimes. Attendees were served Halal-friendly, vegan, and regular bento options, ensuring that everyone had accessible and culturally sensitive food choices. The beautifully prepared bento boxes showcased a variety of Japanese flavors while catering to diverse dietary needs.

🍱 Lunch time at #ETRA2025!
We're serving HALAL-friendly, VEGAN, and regular bento options for everyone 🌱✨ Please enjoy your meal and help us stay sustainable by disposing of trash in designated areas afterward. Arigatou gozaimasu! 🙏#Tokyo #Sustainability pic.twitter.com/j7YekD6DcV

— ETRA (@ETRA_conference) May 26, 2025

Networking and coffee breaks, beyond the formal sessions, also fostered an environment for informal yet impactful exchanges. They provided a space where researchers, students, and industry professionals gathered to share ideas, sketch concepts on whiteboards, and even run quick demos on their phones. These animated discussions often sparked new collaborations and innovative research directions.

📸 Captured: The moment when brilliant minds meet over coffee ☕ These animated discussions during breaks are where the real breakthroughs happen! From whiteboard sketches to quick demos on phones 📱✏️#ETRA2025 #Research #Networking #EyeTracking #HCI pic.twitter.com/AILwE6lPy2

— ETRA (@ETRA_conference) May 26, 2025
In conclusion, ETRA 2025 in Tokyo was an inspiring event that showcased the latest advancements in eye-tracking research. The conference fostered meaningful discussions, collaborations, and the exchange of ideas, setting the stage for future developments. Looking ahead, ETRA 2026 in Marrakesh promises even more groundbreaking research and opportunities for growth. The experiences gained from this event will undoubtedly shape the future of eye-tracking technology and research.


About the Author:
Lawrence Obiuwevwi is a Ph.D. student in the Department of Computer Science, a graduate research assistant with The Center for Secure and Intelligent Critical Systems (CSICS), and a proud student member of The Web Science and Digital Libraries (WS-DL) Research Group and NirdsLab at Old Dominion University.


Lawrence Obiuwevwi
Graduate Research Assistant
Virginia Modeling, Analysis, & Simulation Center
Department of Computer Science
Old Dominion University, Norfolk, VA 23529
Email: lobiu001@odu.edu
Web : lawobiu.com

by Lawrence Obiuwevwi (noreply@blogger.com) at September 26, 2025 03:44 AM

September 24, 2025

Peter Sefton

LDaCA Technical Architecture Update

LDaCA Technical Architecture update 2025 :: PT Sefton, Moises Sacal Bonequi, Ben Foley ::

This presentation is an update on the Language Data Commons of Australia (LDaCA) technical architecture for the LDaCA Steering Committee meeting of 22 August 2025, written by members of the LDaCA team (me, Moises Sacal, and Ben Foley) and edited by Bridey Lea. This version has the slides we presented and our notes, edited for clarity. There's a more compact version of this over on the LDaCA site.

The architecture for LDaCA has not changed significantly for the last couple of years. We are still basing our design on the PILARS protocols.

What’s in this presentation? :: News :: Refresh memories of the distinction between Workspaces vs Archival Repositories :: Explore the architecture of our work on Archival Repositories :: Decentralised approach: multiple Data Stores under appropriate governance :: Standards and specifications: :: RO-Crate for describing data objects :: RO-Crate Metadata Profiles for data interchange within a discipline or domain (like language data) :: Open source tools ::

This presentation will report on some recent developments, mostly in behind-the-scenes improvements to our software stack. It will give a brief refresh of the principles behind the LDaCA approach, and talk about our decentralised approach to data management and how it fits with the metadata standards we have been developing for the last few years. We will also show how the open source tools used across LDaCA’s network of collaborators are starting to be harmonised and shared between services, reducing development and maintenance costs and improving sustainability.

News! :: John Ferlito (PARADISEC) has created a new version of the LDaCA portal using a simpler API that can be used for PARADISEC and LDaCA (and potentially Nyingarn and many other repositories) :: New API is “An RO-Crate API” - AROCAPI :: Generic API for collections of Objects/Items :: Objects are described using RO-Crates :: Working together on a new Oni-stack using the new API  :: New stack can be used for RAPID and other data portals ::

The big news is a new RO-Crate API (“An RO-Crate API”, or AROCAPI), which offers a standardised interface to PILARS-style storage where data is stored as RO-Crates, organized into "Collections" of "Objects" according to the Portland Common Data Model (PCDM) specification, which is built in to RO-Crate.
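
As a rough illustration (not the AROCAPI specification itself), a minimal RO-Crate metadata document for a PCDM-style collection with one member object might look like the following Python sketch; the identifiers and names are placeholders, not drawn from an actual LDaCA or PARADISEC crate.

# Hypothetical sketch of a minimal ro-crate-metadata.json for a PCDM-style
# collection with one member object. RO-Crate 1.1 maps RepositoryCollection,
# RepositoryObject, and hasMember to the corresponding PCDM terms.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.1/context",
    "@graph": [
        {
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {
            "@id": "./",
            "@type": ["Dataset", "RepositoryCollection"],
            "name": "Example language collection",
            "hasMember": [{"@id": "#object-001"}],
        },
        {
            "@id": "#object-001",
            "@type": ["Dataset", "RepositoryObject"],
            "name": "Example recording session",
        },
    ],
}

print(json.dumps(crate, indent=2))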

A concrete example is that PARADISEC will implement different authentication routes (using the existing “Nabu” catalog) than the LDaCA data portal, which uses CADRE (REMS: https://www.elixir-finland.org/en/aai-rems-2/).

More news :: 🤞💾 ::

Promising discussions are taking place with one of our partners about taking on LDaCA data long-term (instead of having to distribute the collections across partner institutions). This would give a consolidated basis for a Language Data repository and a broader Humanities data service.

collect & :: organise :: Language data is rarely organised or described in reusable ways, if it's described at all :: conserve :: A lot of language data is at risk of being lost forever  :: find :: It’s difficult to know what language data exists and where to find it :: Ad hoc tools, analysis and annotation methods are used, lacking reproducibility :: Shared tools can process, analyse, reuse, repurpose, annotate, visualise and enhance data at scale :: access :: Processes for granting permissions and getting access to data are either absent or aren’t easy to understand or apply :: analyse  :: Standards and tools are available and being applied by data stewards :: Good governance and standardised, distributed storage of data helps  :: preserve and return data :: Discovering and locating language data is easy via linked portals :: Access controls are in place and easy to use, so that data access can be given to the right people  :: LDaCA Execution Strategy Overview :: Strengthen the data management skills of language worker communities  :: Develop shared tools, standards and technical infrastructure to help data stewards care for  data for the long term :: Build data portals with useful search functions and lightweight technical structures :: Create guidance for data stewards to document and grant access and reuse rights :: Support language communities to gain greater control over their language data :: Develop tools for data and metadata conversion, processing, analysis, annotation, visualisation, and enrichment :: Develop and guide the implementationof local and national policy and governance toolkits :: Provide examples and training for research at scale :: guide :: Best practice advice and training for working with language data is available from a single source which is easy to find :: Guidance and training for collecting, handling, using and analysing data are scattered and hard to find :: Version: 2025-07-31 :: > analysis overview :: Starting state (2021) :: Desired state (2028) :: Activities ::

This slide shows the LDaCA execution strategy. All of the strands (Collect & organise, Conserve, Find, Access, Analyse, Guide) are relevant to the technical architecture.

 ::  :: Repositories: institutional, domain or both ::  ::  ::  ::  ::  ::  :: Find / Access services :: Research Data Management Plan :: Workspaces: ::  :: working storage :: domain specific tools :: domain specific services :: collect :: describe :: analyse :: Reusable, Interoperable  :: data objects :: deposit early :: deposit often :: Findable, Accessible, Reusable data objects :: reuse data objects :: V1.0 © Marco La Rosa, Peter Sefton 2021 https://creativecommons.org/licenses/by-sa/4.0/  ::  ::

From the very beginning of the project, the LDaCA architecture has been designed around the principle that to build a “Research Data Commons” we need to look after data above all else. We took an approach that considered long-term data management separately from current uses of the data.

This resulted in some design choices which are markedly different from those commonly seen in software development for research.

Effort was put into:

With this foundation, and the new interoperability we gain from our collaboration on the AROCAPI API, we are well placed to move into a phase of rapid expansion of the data assets and of building workspace services. For example:

In 2024, we released the Protocols for Implementing Long Term Archival Repositories (PILARS), described in this 2024 presentation at Open Repositories. The first principle of PILARS is that data should be portable, not locked in to a particular interface, service or mode of storage. Following the lead of PARADISEC two decades ago, the protocols call for storing data in commodity storage services such as file systems or (cloud) object storage services. This means that data is available independently of any specific software.

 ::  :: collect & :: organise :: Language data is rarely organised or described in reusable ways, if it's described at all :: conserve :: A lot of language data is at risk of being lost forever  :: find :: It’s difficult to know what language data exists and where to find it :: Ad hoc tools, analysis and annotation methods are used, lacking reproducibility :: Shared tools can process, analyse, reuse, repurpose, annotate, visualise and enhance data at scale :: access :: Processes for granting permissions and getting access to data are either absent or aren’t easy to understand or apply :: analyse  :: Standards and tools are available and being applied by data stewards :: Good governance and standardised, distributed storage of data helps  :: preserve and return data :: Discovering and locating language data is easy via linked portals :: Access controls are in place and easy to use, so that data access can be given to the right people  :: LDaCA Execution Strategy Overview :: Strengthen the data management skills of language worker communities  :: Develop shared tools, standards and technical infrastructure to help data stewards care for  data for the long term :: Build data portals with useful search functions and lightweight technical structures :: Create guidance for data stewards to document and grant access and reuse rights :: Support language communities to gain greater control over their language data :: Develop tools for data and metadata conversion, processing, analysis, annotation, visualisation, and enrichment :: Develop and guide the implementationof local and national policy and governance toolkits :: Provide examples and training for research at scale :: guide :: Best practice advice and training for working with language data is available from a single source which is easy to find :: Guidance and training for collecting, handling, using and analysing data are scattered and hard to find :: Version: 2025-07-31 :: Starting state (2021) :: Desired state (2028) :: Activities ::

For the rest of this presentation, we will focus on recent developments in the “Green zone” – the Archival Repository functions of the LDaCA architecture. We will not be talking about the analysis stream as that will be discussed in detail in the newly established Analytics Forum.

I (PT) wanted to throw in a personal story here. This is an unstaged picture of my (PT Sefton’s) garage this morning. The box of hard drives contains some old backups of mine just in case, and also my late father Ian Sefton’s physics education research data, stuff like student feedback from lab programs in the 80s trialling different approaches to teaching core physics concepts and extensive literature reviews. These HAVE been handed on to his younger colleagues but could easily have ended up only available here in this garage. I wanted to remind us all that this project is a once in a career opportunity to develop processes for organising data and putting it somewhere alongside other data in a Data Commons where (a) your descendants are not made responsible for it and put it in a box in the shed or chuck it in a skip; and (b) others can find it, use it (subject to the clear “data will” license permissions you left with the data to describe who should be allowed to do what with it), and build on your legacy.

Remember:

PILARS 1: Data is Portable: assets are not locked-in to a particular mode of storage, interface or service  ::

The first principle of PILARS is that data should be portable, not locked-in to a particular mode of storage, interface or service. Following the lead of PARADISEC two decades ago, the protocols call for storing data in commodity storage services such as file systems or (cloud) object storage services. This means that data is available independently of any specific software. This diagram is a sketch of how this approach allows for a wide range of architectures – data stored according to the protocols can be indexed and served over an API (with appropriate access controls). Over the next few slides, we will show some of the architectures that have emerged over the last couple of years at LDaCA.

One storage service ↔ One API ↔ One Portal ::  :: This pattern will be used for LDaCA, the Batchelor CALL Collection, RAPID (Hansard) the UTS Research Data  Repository and other major collections ::

The first example is the LDaCA data portal, which is a central access-controlled gateway to the data that we have been collecting.

NOTE: during the project it has been unclear how we would look after data at its conclusion. No single organisation had put its hand up to host data for the medium to long term, but as noted in the News section we have had some positive talks with one of our partner institutions indicating that they may have an appetite for hosting data that otherwise does not have a home, and/or providing some redundancy for at-risk collections where data custodians are comfortable with a copy residing at the university (we won’t say which one until negotiations are more advanced).

One data store ↔ One API ↔ 2 portals (demo) ::  :: DEMO ONLY ::  ::  ::

This slide shows a demo of two different portal designs accessing the same PARADISEC data, which has been accomplished using the new AROCAPI API. The API will speed development of new PILARS-compliant Research Data Commons deployments, using a variety of storage services and portals that can be adapted and "mixed and matched" via a common API.

Other deployment options :: Set up a stand-alone service for a specific archive (Batchelor CALL Collection work in progress) :: Automation of deployment of portals on demand for testing or show and tell :: Distributed regional archival repositories, local orgs share infrastructure, avoiding cloud services :: Put part (or all) of a collection on a tiny computer (Raspberry Pi) for distribution :: ↑Raspberry Pi containing a collection :: ←Access on mobile via wifi ::

Alongside the data portal, we have explored other ways of sharing data assets, including local distribution via portable computers such as Raspberry Pi with a local wireless network. We have also discussed establishing regional cooperative networks where communities reduce risk by holding data for each other.

Services, software, standards and guides ::

With our partners, we have developed and adapted a suite of other technical resources, including:

This diagram shows how the PILARS principles have been implemented by different organisations. Each example uses open source software, and accepted standards for metadata and storage, meaning that data is portable.

= ::

This slide shows one potential view of LDaCA’s architecture in 2026. There may be an opportunity to deepen the collaboration between the UQ LDaCA team and the PARADISEC team at Melbourne, sharing the development of more code.

For example, Nyingarn’s incomplete repository function could be provided by a stand-alone instance of the Oni portal or, as shown here, added to the LDaCA portal as a collection.

Likewise, the non-existent user-focussed data preparation functions of Nyingarn, where a user can describe an object and submit it, could be generalized for use in LDaCA.

Changes shown in this diagram:

To conclude, we have an opportunity now to consider how the distributed LDaCA technical team can collaborate on key pieces of re-deployable infrastructure. This work is having an impact in other Australian Research Data Commons (ARDC) co-investments.

September 24, 2025 12:00 AM

September 23, 2025

Open Knowledge Foundation

Our New Field Guide: ‘The Tech We Want | Read This Before You Build’

We at OKFN are questioning our own practices and learning how to do things anew. This new publication compiles some of these processes and invites you to put your community first.

The post Our New Field Guide: ‘The Tech We Want | Read This Before You Build’ first appeared on Open Knowledge Blog.

by OKFN at September 23, 2025 05:22 PM

September 20, 2025

Mita Williams

Libraries and Large Language Models as Cultural Technologies and Two Kinds of Power

On September 20th, 2025, I spoke on a panel at THE LEGACY OF CCH CANADIAN LTD. v. LAW SOCIETY OF UPPER CANADA AND FUTURE OF COPYRIGHT LAW CONFERENCE 2025. Here is my talk.

by Mita Williams at September 20, 2025 06:18 PM

September 19, 2025

Ed Summers

Some Trees

some trees

September 19, 2025 04:00 AM

September 18, 2025

Harvard Library Innovation Lab

Expanding Our Public Data Project to Include Smithsonian Collections Data

Smithsonian Institution building, from Wikimedia Commons

We are excited to announce today that the Library Innovation Lab has expanded our Public Data Project beyond datasets available through Data.gov to include 710 TB of data from the Smithsonian Institution — the complete open access portion of the Smithsonian’s collections. This marks an important step in our long-running mission to preserve large scale public collections both for our patrons and for posterity.

Scanned image of suffragette ribbon that reads, "Votes for Women — Brooklyn Woman Suffrage Association — 1869 — 'Failure Is Impossible' — S. B. A." From the National Museum of American History. Creative Commons 0 License

The Smithsonian has an incredible 157.5 million items and specimens, of which 18.4 million are searchable and 5.1 million are released under a public domain license, offering an extraordinary view of the American experience — everything from Thomas Jefferson’s own compilation of Bible verses to 3D images of the grand piano owned and used by Thelonious Monk, from Samuel Morse’s transcription of the first telegraph message sent in 1844 to the Women’s Suffragette Ribbon.

The Smithsonian has had the mission, since its founding in 1846, to pursue “the increase and diffusion of knowledge.” In the past, this could only be done by visiting Smithsonian museums in person. Now that its collections are also digital, we are grateful to be able to do our part in preserving and sharing our nation’s cultural heritage.

Our initial collection includes some 5.1 million collection items and 710 TB of data. As is always our practice, we have cryptographically signed these items to ensure provenance and are exploring resilient techniques to share access to them, which we plan to launch in the future.

From the National Museum of African American History and Culture. Creative Commons 0 License

September 18, 2025 04:00 PM

David Rosenthal

Hard Disk Unexpectedly Not Dead

As I read Zak Killian's Expect HDD, SSD shortages as AI rewrites the rules of storage hierarchy — multiple companies announce price hikes, too I realized I had forgotten to write this year's version of my annual post on the Library of Congress' Designing Storage Architectures meeting, which was back in March. So below the fold I discuss a few of the DSA talks, Killian's more recent post, and yet another development in DNA storage. The TL;DR is that the long-predicted death of hard disks is continuing to fail to materialize, and so is the equally long-predicted death of tape.

Killian's post starts:

The computing market is absolutely ablaze with AI-driven growth. Regardless of how sustainable it might be, companies are spending untold amounts of wealth on hardware, with most headlines revolving around GPUs. But the storage market is also under pressure, especially hard drive vendors who purportedly haven't done much to increase manufacturing capacity in a decade. TrendForce says lead times for high-capacity "nearline" hard drives have ballooned to over 52 weeks — more than a full year.
Source
Western Digital is:
warning of "unprecedented demand for every capacity in [its] portfolio," and stating that it is raising prices on all of its hard drives.
The unprecedented demand from AI farms is because:
You don't just need the data required to run inference. You also need the history of everything to prove to regulators that you're not laundering bias, to retrain when new data comes in, and to roll back to a previous checkpoint if your fine-tuned model goes feral and, say, starts referring to itself as MechaHitler. This stuff can't go to offline storage until you're certain it isn't needed in the short term. But it's too big to live in the primary storage of all but the beefiest servers. Thus, the need for nearline hard drives.
WD's projection
At the meeting, Western Digital's Dave Landsman's HDDs are here to stay made the same point with this graph using data from IDC and TrendFocus. They are projecting that both disk and enterprise SSD will grow in the low 20%/year range, so the vast bulk of data in data centers will remain on disk. Landsman claims that SSDs are and will remain 6 times as expensive per bit as hard disk and that 81% of installed data center capacity is on hard disk.

Keeping the data on hard disk might actually be a good idea. Sustainability in Datacenters by Shruti Sethi presented a joint Microsoft/Carnegie-Mellon study of the scope 2 (operational) and scope 3 (embedded) carbon emissions of compute, SSD and HDD racks in Azure's data centers. The study, A Call for Research on Storage Emissions by Sara MacAllister et al concluded that:
an SSD storage rack has approximately 4× the operational emissions per TB of an HDD storage rack. Storage devices (SSDs and HDDs) are the largest single contributor of operational emissions. For SSD racks, storage devices account for 39% of emissions, whereas for HDD racks they account for 48% of emissions. These numbers contradict the conventional wisdom that processing units dominate energy consumption: storage servers carry so many storage devices that they become the dominant energy consumers.
...
SSD racks emit approximately 10× the embodied emissions per TB as that of HDD storage racks. The storage devices themselves dominate embodied emissions, accounting for 81% and 55% of emissions in SSD and HDD racks, respectively.
Areal Density Trends
As usual, the authoritative word on the performance of the storage industry comes from IBM. Georg Lauhoff & Sassan Shahidi's Data Storage Trends: NAND, HDD and Tape Storage added another year's data points to their invaluable graphs and revealed that:
Coming from a tape supplier, this comment isn't surprising, but it is correct:
Despite the promise of alternative archive storage technologies, challenges persist. Enduring relevance of tape storage, which itself is rapidly evolving.
The main problem is that the huge investment and long time horizon needed to displace tape's 7% of the storage market can't generate the necessary return.

Product vs. Demo
One fascinating graph shows the difference between demonstrations and products for tape and disk. I keep pointing out the very long timescales in the storage industry. In January's Storage Roundup I noted that HAMR was just starting to be deployed 23 years after Seagate demonstrated it. Lauhoff & Shahidi's graph shows that the current tape density was demo-ed in 2006 and shipped in 2022, and that disk's current density was demo-ed in 2012.

Source
This graph reinforces that tape's roadmap is credible, but the good Dr. Pangloss noticed the optimism of the NAND and disk roadmaps. New technologies tend to scale faster at first, then slower as they age. So it is likely that the advent of HAMR will accelerate disk's areal density increase somewhat. And it is possible that the difficulty of moving from 3D NAND to 4D NAND will slow its increase.

Cost Ratio
Lauhoff & Shahidi's cost ratio graph shows that the relative costs of the different media were roughly stable. If Killian is right that the disk manufacturers are increasing prices and lengthening lead times because of demand from AI, this could be different in next year's graph. But Killian also notes that, despite the fact that QLC SSDs are at least "four times the cost per gigabyte":
Trendforce reports that memory suppliers are actively developing SSD products intended for deployment in nearline service. These should help bring costs down once they hit the market. But in the short term, we can expect the storage crunch to cause rising SSD prices as well, at least for enterprise drives.
Annual Bit Shipments
Lauhoff & Shahidi's bit shipment graph is interesting for two reasons:
Tape's predicted demise is just as delayed as disk's. Back in July Simon Sharwood posted And now for our annual ‘Tape is still not dead’ update:
Shipments of tape storage media increased again in 2024, according to HPE, IBM, and Quantum – the three companies that back the Linear Tape-Open (LTO) Format.

The three companies on Tuesday claimed they shipped 176.5 Exabytes worth of tape during 2024, a 15.4 percent increase on 2023’s 152.9 Exabytes.
DNA Storage
I have been writing skeptically about the medium-term prospects for DNA storage since 2012 and Lauhoff & Shahidi share my skepticism in their graph of the technology's progress in the lab. DNA can only compete in the archival storage market, so the relevant comparison is with LTO tape. Even if you believe Wang's estimate, DNA is more than ten million times too expensive.

Figure 1
Via Brandon Vigliarolo's Boffins invent DNA tape that could pack 375 petabytes into an LTO cart we find the latest idea for improving DNA storage in A compact cassette tape for DNA-based data storage by Jiankai Li et al from the Southern University of Science and Technology in Shenzhen. Their idea is to deposit DNA on "good old-fashioned polyester-nylon composite tape" that could reside in an LTO cartridge. Vigliarolo notes that:
DNA is a very dense storage medium and storage researchers have tried to use it for data storage, but without much success, because it’s hard to find info within DNA and read times are slow.

Jiang's team claims to have addressed that problem, establishing a sequence of data partitions on the tape and identifying each of these with a bar code.
The Shenzhen team's focus on reading reflects a misunderstanding of the fundamental requirements of the hyperscaler archive market, which are, in order:
  1. Write bandwidth. Kestutis Patiejunas pointed this out over a decade ago.
  2. Cost per byte. Archival storage is for data that can no longer earn its keep on lower-latency media, so it has to be very cheap.
Read latency and bandwidth are pretty much irrelevant because, as Kestutis Patiejunas said, the main reason the data would be read is a subpoena.

The paper claims they demonstrated:
a completely automated closed-loop operation involving addressing, recovery, removal, subsequent file deposition, and file recovery again, all accomplished within 50 min.
Vigliarolo notes that:
Jiang's team only wrote 156.6 kilobytes of data to a test tape for their experiment, consisting of four "puzzle pieces" depicting a Chinese lantern. If the data were damaged, it wouldn't assemble correctly. The researchers managed to successfully recover the lantern image without issue, but it took two and a half hours or not-quite one kilobyte per minute.
And the team admit that:
DNA synthesis costs are still very high
Because archival data is guaranteed to be written but is very likely never read, the cost of storing data in DNA is dominated by synthesis. The team effectively admits that they can't compete.

I summed up my skepticism in 2018's DNA's Niche in the Storage Market, posing this challenge to a hypothetical product team:
Engineers, your challenge is to increase the speed of synthesis by a factor of a quarter of a trillion, while reducing the cost by a factor of fifty trillion, in less than 10 years while spending no more than $24M/yr.

Finance team, your challenge is to persuade the company to spend $24M a year for the next 10 years for a product that can then earn about $216M a year for 10 years.
Write bandwidth and cost remain the core problems of DNA storage, and while progress has been made in other areas, both are still many orders of magnitude away from competing with hard disk, let alone LTO tape.

Nevertheless, as I concluded more than seven years ago:
That isn't to say that researching DNA storage technologies is a waste of resources. Eventually, I believe it will be feasible and economic. But eventually is many decades away. This should not be a surprise, time-scales in the storage industry are long. Disk is a 60-year-old technology, tape is at least 65 years old, CDs are 35 years old, flash is 30 years old and has yet to impact bulk data storage.

by David. (noreply@blogger.com) at September 18, 2025 03:00 PM

September 17, 2025

Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

2025-09-17: Classic Machine Learning Models and XAI Methods

  


Figure 1 in Dwivedi et al.

Traditional AI models are often viewed as “black boxes” whose decision-making processes are not revealed to humans, leading to a lack of trust and reliability. Explainable artificial intelligence (XAI) is a set of methods humans can use to understand the reasoning behind the decisions or predictions of models; typical methods include SHAP, LIME, and permutation importance (PI). We will start from a basic model and then talk about those XAI methods.


Logistic regression

Logistic regression (LR) is a classification model, particularly for problems with a binary outcome. It maps the output of linear regression to the interval [0, 1] through the sigmoid function, indicating the probability for a particular class. 

The logistic regression model is simple and highly interpretable. The coefficient of each feature intuitively reflects the impact of that feature, which is easy to understand and explain. Positive weights indicate that the feature is positively related to the positive class, and negative weights indicate a negative relationship. LR outputs the classification result as well as the probability of that result. All of this shows that LR is an inherently interpretable model.
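To make that concrete, here is a minimal sketch of fitting a logistic regression with scikit-learn and reading off its coefficients. The synthetic dataset and feature names are assumptions for illustration only, not data from any study mentioned in this post.

# Minimal sketch: fit a logistic regression and inspect its coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Hypothetical synthetic data standing in for a real binary-outcome dataset.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
feature_names = ["f1", "f2", "f3", "f4"]

model = LogisticRegression().fit(X, y)

# Each coefficient is the change in the log odds for a one-unit increase in
# the feature; the sign gives the direction of the relationship.
for name, coef in zip(feature_names, model.coef_[0]):
    print(f"{name}: coefficient = {coef:+.3f}")

# predict_proba returns the sigmoid-mapped class probabilities.
print(model.predict_proba(X[:1]))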

Logistic regression assumes that the features are linearly related to the log odds, ln(p/(1-p)), where p is the probability of an event occurring. This can be extended into a method for explainability: the odds ratio. Odds ratios are not formally considered part of the XAI toolkit since they only work for LR, but they are practical and widely used in medical research and other fields.

An odds ratio (OR) is calculated by dividing the odds of an event occurring in one group by the odds of the event occurring in another group. For example, if the odds ratio for developing lung cancer is 81 for smokers compared to non-smokers, it means smokers are 81 times more likely to develop lung cancer. The OR value is calculated by exponentiating the regression coefficient. For example, if the coefficient of a feature in the logistic regression model is 0.5, the OR value is exp(0.5) ≈ 1.6487.

It means that for every unit increase in the feature, the odds of the event occurring increase by approximately 64.87%.
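In code, the same arithmetic looks like this (continuing the hypothetical coefficient of 0.5 from the example above; applying np.exp to a fitted model's coef_ array gives the odds ratios for all features at once):

import numpy as np

coef = 0.5                    # hypothetical logistic regression coefficient
odds_ratio = np.exp(coef)     # OR = e^0.5
print(round(odds_ratio, 4))   # 1.6487
print(f"odds increase per unit: {(odds_ratio - 1) * 100:.2f}%")   # 64.87%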

The OR value can be used to find features with the highest influence on prediction results, and further used for feature selection or optimization. However, it only works for the LR model. In the rest of the blog, we will talk about other machine learning models and model-agnostic XAI methods.

We will discuss three machine learning models, each representing a distinct approach based on probability, tree models, and spatial distance, respectively.


Machine learning model 1: Naive Bayes

Naive Bayes is a classification model based on probability and Bayes' theorem. It assumes that features are independent of each other, which is not always true in reality. This "naive" assumption simplifies the problem but can potentially reduce the accuracy. 

Naive Bayes obtains the probability of each class, and then selects the class with the highest probability as the output. It calculates posterior probability with prior probability and conditional probability. For example, the probability of 'win' appearing in spam emails is 80%, and 10% in regular emails. Then we can calculate the probabilities of 'is spam email' and 'is regular email' through a series of calculations and pick the one with the higher probability.
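As a rough worked example of that calculation: the 80% and 10% figures come from the text above, while the prior probability of spam is an assumed number used only for illustration.

# Hedged worked example of Bayes' theorem for a single word, 'win'.
p_win_given_spam = 0.8   # from the text: 'win' appears in 80% of spam
p_win_given_ham = 0.1    # from the text: 'win' appears in 10% of regular email
p_spam = 0.3             # assumed prior, for illustration only
p_ham = 1 - p_spam

spam_score = p_win_given_spam * p_spam   # unnormalized posterior for spam
ham_score = p_win_given_ham * p_ham      # unnormalized posterior for regular email
p_spam_given_win = spam_score / (spam_score + ham_score)
print(f"P(spam | 'win') = {p_spam_given_win:.2f}")   # about 0.77, so classify as spam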

Naive Bayes is insensitive to missing data, so it can still work effectively when there are missing values or when features are incomplete. It has good performance in high-dimensional data due to the independence assumption. However, it also has disadvantages. It's sensitive to input data distribution. Performance may decrease if the data does not follow a Gaussian distribution. 


Machine learning model 2: random forest

A decision tree is a learning algorithm with a tree-like structure to make predictions. A random forest uses a bagging of decision trees to make predictions. It randomly draws samples from the training set to train each decision tree. When each decision tree node splits, it randomly selects features to make the best split. It repeats the above steps to build multiple decision trees and form a random forest.

By integrating multiple decision trees, a random forest achieves better performance than a single decision tree. It can reduce overfitting with random sampling and random feature selection. It is insensitive to missing values and outliers and can handle high-dimensional data. But compared with a single decision tree, the training time is longer. In addition, random forests rely on large amounts of data and may not perform well with small datasets.
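A minimal sketch of training a random forest with scikit-learn follows; the synthetic dataset and the choice of 200 trees are illustrative assumptions, not recommendations.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 trees, each trained on a bootstrap sample with random feature splits.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print("test accuracy:", rf.score(X_test, y_test))

# Impurity-based feature importances, one value per feature.
print(rf.feature_importances_)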


Machine learning model 3: SVM (support vector machine)

The core of SVM is to find the hyperplane that best separates data points into different classes and to maximize the boundary between classes. It can be used for both classification and regression tasks.

SVM has good performance with high-dimensional sparse data (such as text data), as well as nonlinear classification problems, so it's particularly suitable for text classification and image recognition. In addition, SVMs are relatively robust against overfitting. Overall, SVM is a good choice for high-dimensional data with a small number of samples, but for large-scale data sets, SVM training takes a long time and thus is not a good choice.
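A minimal SVM sketch with scikit-learn, again on assumed synthetic data; feature scaling and the RBF kernel are common choices rather than prescriptions.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scaling matters for SVMs; the RBF kernel handles nonlinear class boundaries.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
svm.fit(X_train, y_train)
print("test accuracy:", svm.score(X_test, y_test))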


After discussing the representative machine learning models, we will see the model-agnostic XAI methods, which can be applied to any machine learning model, including linear models, tree models, neural networks, etc.


XAI method 1: SHAP

SHAP (Shapley Additive Explanations) is a model interpretation method based on cooperative game theory. Shapley values are calculated to quantify the importance of each feature by evaluating its marginal contributions to the model. SHAP local explanations reveal how specific features contribute to individual predictions for each sample. SHAP global explanations describe a model's overall behavior across the entire dataset by aggregating the SHAP values for all individual samples. Figure 1 demonstrates the importance ranking of features in global explanation, where 'Elevation' ranks first among all features. We can further look at each feature in detail by the dependence plot as shown in Figure 2, which shows the relationship between the target and the feature. It could be linear, monotonic, or more complex relationships. In addition, there are more visualization methods in the SHAP toolkit based on your needs.


Figure 1: Ranking of influencing features (Fig. 10 in Zhang et al.)


Figure 2: SHAP dependence plot of annual average rainfall (Fig. 14 in Zhang et al.)
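For readers who want to try this, here is a minimal sketch of computing SHAP values and a global summary plot. The model, the synthetic data, and the use of a regression task (which keeps the output shapes simple) are my assumptions, not the setup used in Zhang et al.

import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=6, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer computes Shapley values efficiently for tree ensembles.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # one value per sample per feature

# Global explanation: aggregate per-sample SHAP values into a summary plot.
shap.summary_plot(shap_values, X)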


XAI method 2: Permutation importance

Permutation importance (PI) is a method of global analysis; it does not focus on local results as SHAP does. It assesses the importance of a feature by measuring the decrease or increase in model performance when that feature's values are randomly permuted while the other features are left unchanged. By comparing the permuted performance to the baseline performance, permutation importance provides insights into the relative importance of each feature in the model. The difference from baseline performance is the importance value, and it can be positive, negative, or zero. If the value is zero, the model performs the same whether the feature's values are shuffled or not, so the feature is of low importance. If the value is negative, it means it is better not to add this feature at all. Figure 3 shows one example of the ranking of features by permutation importance.


Figure 3: Ranking of features by permutation importance (from scikit-learn user guide)
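A minimal sketch using scikit-learn's permutation_importance; the model and synthetic data are assumptions, and any fitted estimator with a held-out set would do.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the change in score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for i, (mean, std) in enumerate(zip(result.importances_mean, result.importances_std)):
    print(f"feature {i}: importance = {mean:.4f} +/- {std:.4f}")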



XAI method 3: LIME

LIME is essentially a method for local analysis. It builds a simple interpretable model (such as a linear model) around the target sample and then the contribution of each feature to the prediction can be approximated by interpreting the simple model's coefficients. It consists of the following steps:

  1. Select a sample x to be explained.
  2. Generate perturbation samples x′ near x.
  3. Use the complex model to predict the perturbation sample x′ and get the predicted value f(x′).
  4. Use the perturbation samples x′ and the corresponding predictions f(x′) to train a simple interpretable model (such as logistic regression).
  5. Interpret the complex model using the coefficients of the simple model.

LIME is generally used in local cases. For example, a bank can use LIME to determine the major factors that contribute to a customer being identified as risky by a machine learning model.
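A minimal sketch of those steps with the lime package on assumed synthetic tabular data; the feature names, class names, and model are illustrative placeholders.

from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
feature_names = [f"f{i}" for i in range(5)]
model = RandomForestClassifier(random_state=0).fit(X, y)

explainer = LimeTabularExplainer(X, feature_names=feature_names,
                                 class_names=["negative", "positive"],
                                 mode="classification")

# Perturb around one sample, fit a local linear surrogate, and report each
# feature's contribution to the complex model's prediction for that sample.
explanation = explainer.explain_instance(X[0], model.predict_proba, num_features=5)
print(explanation.as_list())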


To sum up, we mainly have the following XAI methods: permutation importance, SHAP, LIME and odds ratio. Permutation importance and SHAP can give global explanations based on the whole dataset, while LIME can only provide local explanations based on a particular sample. Permutation importance measures how important a feature is for the model, and SHAP measures the marginal contribution of a feature. The first three methods are model-agnostic, while the odds ratio is only used for logistic regression and gives global explanations. We can choose to use one or more of the most suitable methods in real-life applications.


- Xin

by Xin (noreply@blogger.com) at September 17, 2025 08:14 PM

September 15, 2025

HangingTogether

Rising to the challenge: How the SHARES resource sharing community navigated a global disruption to international shipping

Image from pixabay.com

In late August 2025, interlibrary loan staff at libraries across the United States found themselves facing an unprecedented situation. Revocation of the De Minimis tariff exemption for packages worth less than $800, due to become effective on 29 August 2025, threw a blanket of uncertainty over global international shipping operations. More than a dozen countries abruptly paused all shipping to the US; document suppliers and book vendors announced that they, too, would stop shipping to the US until the practical impacts became known. ILL folks had reason to wonder if physical library materials in transit across borders would ever reach their destinations and if new shipments in either direction would be hit with tariffs, incurring unbudgeted and unpredictable expenses.

This global kerfuffle is now well into its third week. The SHARES community, a multinational resource sharing consortium whose members are being impacted in different ways depending on their local context, responded as resource sharing practitioners always do: by banding together, pooling uncertainties, sharing strategies and workarounds, and supporting each other with facts, encouragement, and good humor.

Daily challenges countered by sustained real-time ILL collaboration

The disruption first surfaced on the SHARES-L mailing list on 25 August, when libraries began reporting that major European shipping companies like DHL and Deutsche Post were pausing shipments to the US. The University of Tennessee shared that GEBAY, a major German document supplier, had already begun canceling US loan requests, citing the loss of the under-$800 exemption.

Libraries immediately began sharing their approaches and real-time results. The University of Waterloo in Canada reported experiencing occasional tariff issues on incoming items but planned to continue sharing with US partners. Pennsylvania State University established review processes for international requests and began using specific language on customs forms—”Any value stated is for insurance purposes only”—with some initial success. The University of Pennsylvania, a prolific borrower and supplier of library materials across borders, took a more cautious tack, temporarily pausing all international sharing after having an item stuck in Hong Kong customs, requiring $500 for its release. The University of Glasgow began changing their customs forms from “temporary export” to “personal, not for resale,” which seemed to help avoid additional shipping charges on packages shipped to the US. Yale University and the University of Michigan reported receiving direct notifications from additional European suppliers about temporary service suspensions.

As coordinator of the RLP SHARES community, I synthesized the threads each day and created a shared document where SHARES members could add updates. I also added the De Minimis exemption revocation to the agenda of an upcoming SHARES town hall.

On 26 August 2025, the day after the topic surfaced on SHARES-L, 32 participants attended SHARES Town Hall #264 to compare notes on the latest intelligence coming from shippers and overseas libraries and to share their current strategies. The University of Kansas suggested sending conditional responses to prospective overseas borrowers of physical items, asking for confirmation that they would be able to ship items back to the US once they receive them, and offering to scan tables of contents and indexes as a short-term alternative to physical loans. Recognizing that the complexity of the situation varied by carrier and region, the University of Pittsburgh commenced tracking the statuses of individual shipping companies and countries rather than implementing blanket restrictions. SHARES folks renewed their commitment to updating the shared document with all the latest developments.

The situation continued to evolve rapidly. Later that same week, Princeton University reported that several major international book vendors had informed them they would not be shipping new books to the US until customs procedures were clarified, indicating the impact extended far beyond interlibrary loans to also impact academic acquisitions. The CUNY Graduate Center added Brazil to the growing list of countries that have suspended all shipments to the US. This prompted a suggestion to integrate the evolving country-by-country shipping status into the existing International ILL Toolkit, a crowd-sourced tool used by libraries across the world, created by SHARES during a town hall in 2022.

By 4 September, practical advice from shipping companies began to emerge. The Getty Research Institute shared the following, which they’d just received from FedEx:

The traditional wording (“loan between libraries, no commercial value”) is no longer sufficient on its own. Going forward, they should:

1. Always include a numeric HTS code (4901.x for books; 9801.00.10 for U.S. goods returned).

2. Declare a nominal value rather than “no commercial value.”

3. Add clarifying language like “interlibrary loan – not for sale – temporary export/return.”

This ensures [domestic and foreign customs] process the shipments correctly as duty-free, non-commercial library loans.

Other libraries reported successfully receiving packages from Australia with a tariff of only $10.

By 9 September, 15 days after the topic first surfaced on SHARES-L, participants at SHARES Town Hall #266 reported feeling confident they can once again share physical items across most borders with, at worst, minimal disruption and modest fees. Later in the week, Brown University and the University of Pennsylvania each reported having to reimburse DHL $18.38 for paying duties on packages coming back to them in the US from Canada; Penn plans to dispute the charge retroactively, as these are shared library materials, not commercial imports. A few universities are still pausing their international sharing, but most are back at it, full speed ahead.

Two paths for collaboration

The community response emerged through two distinct but interconnected channels: the asynchronous SHARES-L mailing list and the SHARES town halls. The mailing list discussion centered on the immediate sharing and problem-solving, with institutions reporting their individual circumstances and strategies. This allowed all SHARES members a chance to participate at their convenience. The town halls provided a crucial real-time forum where a subset of SHARES practitioners could engage in dynamic discussion, ask questions, coordinate responses, and coalesce around a set of preferred practices, with the outcomes being cycled back to all SHARES participants for comment via the SHARES-L mailing list.

The power of community

The SHARES response to the recent disruption of international shipping exemplifies the extraordinary power of community. Through information sharing, collaborative problem-solving, and mutual support, the SHARES network transformed individual institutional confusion into collective wisdom. Time and again, connections to trusted peers have proven to be every bit as essential as all the other types of infrastructure we depend upon to do our jobs.

The post Rising to the challenge: How the SHARES resource sharing community navigated a global disruption to international shipping appeared first on Hanging Together.

by Dennis Massie at September 15, 2025 08:23 PM

Information Technology and Libraries

Letter from the Editors

by Kenneth J. Varnum, Joanna DiPasquale at September 15, 2025 07:00 AM

Free Online Graphic Design Software

Capturing patrons' attention with interesting flyers and advertisements is critical to library staff's work, so an easy-to-use graphic design program can help even the most novice designer elevate their designs to the next level. Two free online graphic design programs, Canva and Adobe Express, make it easy to tackle any creative project. While the two programs are fairly similar, the few differences between them may help decide why to choose one over the other.

by Jess Barth at September 15, 2025 07:00 AM

Uncovering the Works by Early Modern Women Printers

The contributions of women to the printing trade during the hand press era have long been under-documented, leaving significant historical gaps in our understanding of early print culture. This article presents a project that uses ChatGPT-4o, a generative artificial intelligence (AI) chatbot, to help bridge those gaps by identifying, analyzing, and contextualizing the work of women printers represented in the University of Notre Dame’s rare book collections.

by Tang (Cindy) Tian, Daniela Rovida at September 15, 2025 07:00 AM

MAPping out the Future from the Past

Mountain West Digital Library (MWDL) was founded in 2001 and offers a public search portal supporting discovery of over a million items from digitized historical collections throughout the US Mountain West. This aggregation work necessitates a metadata application profile (MAP) to ensure metadata consistency and interoperability from the regional member network of libraries, archives, and cultural heritage organizations. Unique issues arise in combining metadata from diverse local digital repository platforms and aggregation technology infrastructure introduces further constraints, challenges, and opportunities. Upstream aggregation of metadata in the Digital Public Library of America (DPLA) also influences local and regional metadata modeling decisions. This article traces the history of MWDL’s MAPs, comparing and contrasting five published standards to date. In particular, it will focus primarily on decisions and changes made in the most recent version, published in early 2020.

by Teresa K. Hebron at September 15, 2025 07:00 AM

From Weberian Rationalization to JavaScript Components

This paper considers modular approaches to building library software and situates these practices within the context of the rationalizing logics of modern programming. It briefly traces modularism through its elaboration in the social sciences and computer science, and ultimately describes it as it is deployed in contemporary academic libraries. Using the methodology of a case study, we consider some of the very tangible and pragmatic benefits of a modular approach, while also touching upon some of the broader implications. We find that modularism is deeply integrated into modern software practice, and that it can help support the work of academic librarians.

by Mark E. Eaton at September 15, 2025 07:00 AM

Finding Aids Unleashed

New York University Libraries recently completed a redesign for their finding aids publishing service to replace an outdated XSLT stylesheet publishing method. The primary design goals focused on accessibility and usability for patrons, including improving the presentation of digital archival objects. In this article, we focus on the iterative process devised by a team of designers, developers, and archivists. We discuss our process for creating a data model to map Encoded Archival Description files exported from ArchivesSpace into JSON structured data for use with Hugo, an open-source static site generator. We present our overall systems design for the suite of microservices used to automate and scale this process. The new solution is available for other institutions to leverage for their finding aids.

by Deb Verhoff, Joseph G. Pawletko, Donald R. Mennerich, Laura Henze at September 15, 2025 07:00 AM

The Invisible Default

This mixed-method study investigates the representation of race and ethnicity within the J. Willard Marriott Digital Library at the University of Utah. The digital collections analyzed in this study come from the Marriott Library’s Special Collections, which represent only a fraction of the library’s physical material (less than 1 percent), albeit those most public facing. Using a team-based approach with librarians from various disciplines and areas of expertise, this project yielded dynamic analysis and conversation combined with heavy contemplation. These investigations are informed by contemporary efforts in librarianship focused on inclusive cataloging, reparative metadata, and addressing archival silences. By employing a data-intensive approach, the authors sought methods of analyzing both the content and individuals represented in our collections. This article introduces a novel approach to metadata analysis—as well as a critique of the team’s initial experiments—that may guide future digital collection initiatives toward enhanced diversity and inclusion.

by Kaylee P. Alexander, Dorothy Terry, Jasmine Kirby, Rachel Jane Wittmann, Anna Neatrour at September 15, 2025 07:00 AM

Unlocking the Digitized Historical Newspaper Archive

This paper aims to utilize historical newspapers through the application of computer vision and machine/deep learning to extract the headlines and illustrations from newspapers for storytelling. This endeavor seeks to unlock the historical knowledge embedded within newspaper contents while simultaneously utilizing cutting-edge methodological paradigms for research in the digital humanities (DH) realm. We aimed to provide another facet beyond the traditional search or browse interfaces and incorporated those DH tools with place- and time-based visualizations. Experimental results showed that our proposed methodologies in OCR (optical character recognition) with scraping and deep learning object detection models can be used to extract the necessary textual and image content for more sophisticated analysis. Timeline and geodata visualization products were developed to facilitate a comprehensive exploration of our historical newspaper data. The timeline-based tool spanned the period from July 1942 to July 1945, enabling users to explore the evolving narratives through the lens of daily headlines. The interactive geographical tool enables users to identify geographic hotspots and patterns. Combining both products can enrich users’ understanding of the events and narratives unfolding across time and space.

by Vincent Wai-Yip Lum, Michael Kin-Fu Yip at September 15, 2025 07:00 AM

Using Springshare’s LibCal to Move Librarians off the Reference Desk

In 2022, a task group at the University of Victoria Libraries moved reference service off the desk and into an appointments model. We used Springshare’s LibCal to create a public web calendar and booking system, with librarians setting office hours and appointments conducted over Zoom, phone, or in person. LibCal allows us to send feedback follow-up requests after appointments and to keep statistics and assess usage. This article is a practical case study in implementing a service model change, with emphasis on how we adapted LibCal to make the new service work for librarians and students.

by Karen Munro at September 15, 2025 07:00 AM

September 14, 2025

Nick Ruest

Introducing ManoWhisper... again... and a little bit different this time 😅

Introduction

Over the last year, I’ve written a few times about what began as a small sabbatical project called ManoWhisper, or mano-whisper, or manowhisper 😅. Naming things is hard.

Anyway, today I’m happy to share its newest evolution, an interactive website!

This project started with a simple but intriguing question from my DigFemNet/SIGNAL collaborators and colleagues Shana MacDonald and Brianna Wiens: “Have you ever done anything with podcast transcripts?”

At the time, my answer was no. But, I was curious about experimenting with Whisper, and that curiosity quickly grew into something much larger.

Bash and Python Scripts

What began last fall as a loosely connected set of bash and Python scripts has now evolved into a fully interactive website. The growth was driven by the research needs and questions of Shana, Brianna, and more recently, Karmvir Padda.

The result is manowhisper.signalnetwork.org, a Flask-based website with:

Some transcripts are sourced from the Knowledge Fight Interactive Search Tool and flagged accordingly on their show and episode pages (more about fight.fudgie.org below). To complement the website, a companion command-line tool (manowhisper) generates classifications and statistics.

ManoWhisper Website

At the time of writing, ManoWhisper has:

The scale of this dataset/corpus will continue to grow as we’re able to resource it, hopefully offering collaborators and researchers new ways to examine the narratives and ideologies in the corpus.

Support, Collaboration, and Thank Yous!

This project would not be possible without the financial and in-kind support of:

I want to extend a special thank you to Erlend Simonsen for his generous work and support. If it’s not obvious, my work here was HEAVILY inspired by his amazing work with the Knowledge Fight Interactive Search Tool. I encourage everyone to explore that project. It is regularly updated with new shows, episodes, features, and insights.

Broader Research Context

Much like the Knowledge Fight Interactive Search Tool, ManoWhisper tracks and identifies podcasts connected to movements such as:

Working in close dialogue with colleagues in the Digital Feminist Network/SIGNAL Network, we’ve expanded this scope to include more Canadian context. This aligns with our broader research goals: to understand how digital gender discourse ecosystems, including the Manosphere, Femosphere, and incel groups, shape and influence local institutions and communities.

ManoWhisper’s journey from a “fun little question” to a research platform has been a bit of a surprise, and incredibly rewarding. What began as a collection of scripts is now a living, expanding infrastructure for studying how discourse circulates through podcasts.

September 14, 2025 04:00 AM

September 12, 2025

Open Knowledge Foundation

🇷🇼 Open Data Day 2025 in Kigali: Raising Awareness of the Gender Gap in Technology

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. On 7th March 2025, I organized and celebrated a community Open Data Day 2025 in Eastern Province, Southeast of Kigali, through a project entitled WIKI-SHE EVENT RWANDA under the theme “Promoting gender equity and increasing the...

The post 🇷🇼 Open Data Day 2025 in Kigali: Raising Awareness of the Gender Gap in Technology first appeared on Open Knowledge Blog.

by Rose Cyatukwire at September 12, 2025 07:27 PM

🇦🇼 Open Data Day 2025 in Oranjestad: The Challenges of Managing Open Data in Science

This text, part of the #ODDStories series, tells a story of Open Data Day‘s grassroots impact directly from the community’s voices. On the 4th of March 2025 the University of Aruba participated in Open Data Day 2025! About 20 researchers and students gathered to hear all about Open Data, Research Data Management and the FAIR...

The post 🇦🇼 Open Data Day 2025 in Oranjestad: The Challenges of Managing Open Data in Science first appeared on Open Knowledge Blog.

by Esther Plomp at September 12, 2025 06:58 PM