Creating synthesized instances with SMOTE (from Figure 3 in Wongvorachan et al.)
Jason Brownlee defined feature engineering as follows: "feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data". In other words, feature engineering is manually designing what the input Xs should be.
Feature engineering has many application scenarios, including fraud detection and prevention for loan applications, user behavior modeling in recommendation systems, and disease diagnosis or risk prediction. In a loan application fraud prevention program, data scientists can decide whether a user is reliable using features based on the user's basic information, credit history, and other data. A recommendation system can analyze a user's behavioral features, such as the materials clicked in the past few months, positive or negative reactions, and user type, to infer the topics the user is most interested in.
A feature is an attribute useful for the modeling task, but not all attributes can be used as features. In general, for most industrial modeling, expert knowledge is important for feature creation. For example, in a loan application fraud prevention program, experiences from the risk control department will be very helpful. There are potentially hundreds of features based on a user's basic information, credit report, and assets, but not all of them will be used in modeling. Expert knowledge can help data scientists quickly perform feature construction and screening.
1. Feature Describing
This step provides a general understanding of the dataset. We examine the max, min, mean, and standard deviation of each feature; understand its central tendency, dispersion, and distribution; and find missing values, outliers, and duplicate values. This step serves as the preparatory work for the steps that follow.
2. Feature Processing
The foundation of feature engineering is feature processing, which is time-consuming and directly associated with data quality. It includes operations such as data-cleaning, standardization, and resampling, and aims to transform raw data into a format suitable for model training.
2.1 Data-cleaning
Data-cleaning generally processes missing values, outliers, and inconsistencies to ensure the accuracy of data.
Some features may contain missing values because of the lack of observations. Missing values are typically processed in the following ways:
Drop directly. We can choose to drop the whole sample (row) or the feature (column) containing the missing value.
Fill with other values. We can fill the missing values with a constant such as 0, 9999, -9999, or -99.
data['feature'] = data['feature'].fillna(-99)  # fill missing entries with the sentinel value -99
Or we can fill the missing values with the mean, mode, previous value, or next value.
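For example, each of these strategies is a one-liner in pandas. This is a minimal sketch: data is assumed to be a DataFrame with a numeric column named 'feature', and each line is an alternative strategy rather than a sequence.
import pandas as pd
data['feature'] = data['feature'].fillna(data['feature'].mean())     # fill with the mean
data['feature'] = data['feature'].fillna(data['feature'].mode()[0])  # fill with the mode
data['feature'] = data['feature'].ffill()                            # fill with the previous value
data['feature'] = data['feature'].bfill()                            # fill with the next value
More sophisticated imputers, such as the KNN imputer below, estimate each missing entry from similar rows: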
from fancyimpute import KNN
# impute each missing entry from its 3 nearest-neighbor rows
# (.complete() in older fancyimpute releases; .fit_transform() in current ones)
dataset = KNN(k=3).fit_transform(dataset)
The most frequently used method is to drop directly or fill with mean values.
Outliers can be identified using the interquartile range or the mean and standard deviation. In addition, points whose distance from most other points exceeds a certain threshold are considered outliers. The main distance measures used are absolute (Manhattan) distance, Euclidean distance, and Mahalanobis distance.
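As an illustration, the interquartile-range rule can be written in a few lines. This is a sketch assuming a pandas Series; the 1.5 multiplier is the conventional default, not a value dictated by the data.
import pandas as pd

def iqr_outliers(series, k=1.5):
    # flag points outside [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

outlier_mask = iqr_outliers(data['feature'])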
We need to process outliers to reduce noise and improve data quality. Typical strategies include: deleting outliers directly when they have a significant impact on the analysis results, treating outliers as missing values and filling them in with the methods described above, or keeping the outliers when they are considered important.
Duplicate values refer to identical samples from different sources; they waste storage space and reduce data processing efficiency. The most common approach is to drop duplicates, completely or partially, based on experience.
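In pandas this is again a one-liner; the key columns in the second example are purely hypothetical.
data = data.drop_duplicates()                                  # drop rows that are exact duplicates
data = data.drop_duplicates(subset=['user_id', 'event_time'])  # or judge duplicates on key columns only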
2.2 Resampling
Class imbalance refers to the situation where the number of samples in different categories of the training set differs significantly. Machine learning methods generally assume that the numbers of positive and negative samples are close. However, in the real world we often observe class imbalance, sometimes in extreme forms: 2% of credit card accounts are fraudulent every year, online advertising conversion rates are in the range of 10^-3 to 10^-6, and so on. Class imbalance can bias a model's predictions towards the majority class and thus weaken its predictive power.
We can mitigate class imbalance by oversampling the minority class or undersampling the majority class. When the original dataset is huge, undersampling is a good choice: it randomly deletes samples from the majority class until the two classes contain equal numbers of samples. When the dataset is small, we prefer oversampling. One practice is to repeatedly resample the minority class until it matches the size of the majority class, but this carries a high risk of over-fitting. A better way is to use SMOTE (Synthetic Minority Over-sampling Technique), in which synthetic instances of the minority class are generated by interpolating feature vectors of neighboring instances, effectively increasing their representation in the training data. To be specific, SMOTE picks a sample point x in the minority class and randomly picks a point x' from its k nearest neighbors. The synthetic instance is then created by the formula x_new = x + (x' - x) * d, where d is in the range [0, 1]. Three figures from Wongvorachan et al. shown below demonstrate the three methods more intuitively.
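A minimal sketch with the imbalanced-learn library, assuming X and y are the training features and labels; k_neighbors is the k used to pick the neighbor x'.
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)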
Table 1 shows the operating principle, advantages, and drawbacks of each resampling technique. These methods are commonly used, but we also need to emphasize their disadvantages: random oversampling increases the likelihood of overfitting, random undersampling keeps only part of the information in the original dataset, and SMOTE can introduce noise into the dataset.
Table 1. The comparison of resampling techniques (Table 1 in Wongvorachan et al.)
A more straightforward way to mitigate class imbalance is class weights, which assigns a weight to each class in the training set: classes with many samples receive low weights, and classes with few samples receive high weights. There is no need to generate new samples with this method; we just adjust the weights in the loss function.
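With scikit-learn, for instance, most classifiers accept a class_weight argument, and the weights can also be computed explicitly to plug into a custom loss. This is a sketch assuming X_train and y_train already exist.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

clf = LogisticRegression(class_weight='balanced').fit(X_train, y_train)  # weights inversely proportional to class frequency
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)  # or compute them explicitly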
2.3 Feature Transformation
Different features have different scales and ranges. Eliminating scale differences between different features can put data on the same scale and make them numerically comparable.
StandardScaler transforms data into a distribution with a mean of 0 and a standard deviation of 1 by Z-score normalization. Similarly, MinMax scaling normalizes all features to be within 0 and 1. To be specific, StandardScaler obtains the mean and standard deviation of the training data, and then uses these statistics to make a Z-score normalization with the following formula:
z = (x - μ) / σ, where μ is the mean and σ is the standard deviation.
MinMaxScaler obtains the maximum and minimum values of the training data, and then transforms data with the following formula:
x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the training data.
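In scikit-learn both transformations follow the same fit-on-train, apply-everywhere pattern; this sketch assumes X_train and X_test are already split.
from sklearn.preprocessing import StandardScaler, MinMaxScaler

scaler = StandardScaler().fit(X_train)   # learns μ and σ from the training data
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuses the training statistics

minmax = MinMaxScaler().fit(X_train)     # learns x_min and x_max from the training data
X_train_mm = minmax.transform(X_train)
X_test_mm = minmax.transform(X_test)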
Feature transformation has different impacts on different models. It has a great impact on SVM (support vector machine) and NN (nearest neighbor) which are based on distances in a Euclidean space, but has little impact on tree models such as random forest or XGBoost.
With a broader definition of feature engineering, the generation of embeddings which represent latent features is also regarded as feature engineering. Latent features follow a different set of methodologies. In this article, we only focus on the narrow definition of feature engineering, where shallow features are selected based on expert knowledge, and data is processed with the methodology discussed above. In real-world practice, especially in industry, successful feature engineering is essential for the model's performance.
There will be fewer than usual posts to this blog for a while. I have to write another talk for an intimidating audience, similar to the audience for my 2021 Talk at TTI/Vanguard Conference. That one took a lot of work but a few months later it became my EE380 Talk. That in turn became by far my most-read post, having so far gained 522K views. The EE380 talk eventually led to the invitation for the upcoming talk. Thus I am motivated to focus on writing this talk for the next few weeks.
On my recent visit to London I was struck by how many of the advertisements in the Tube were selling AI. They fell into two groups, one aimed at CEOs and the other at marketing people. This is typical; the pitch for AI is impedance-matched to these targets:
The irresistible pitch to CEOs is that they can "do more with less", or in other words they can lay off all these troublesome employees without impacting their products and sales.
Marketing people value plausibility over correctness, which is precisely what LLMs are built to deliver. So the idea that a simple prompt will instantly generate reams of plausible collateral is similarly irresistible.
This is why Sam Altman et al. are so desperate to run the "drug-dealer's algorithm" (the first one's free) and get the world hooked on this drug so they can supply a world of addicts.
You can see how this works for the two targets. Once a CEO has addicted his company to AI by laying off most of the staff, there is no way he is going to go cold turkey by hiring them back even if the AI fails to meet his expectations. And once he has laid off most of the marketing department, the remaining marketeer must still generate the reams of collateral even if it lacks a certain something.
“I’ll go down this thread with [Chat]GPT or Grok and I’ll start to get to the edge of what’s known in quantum physics and then I’m doing the equivalent of vibe coding, except it’s vibe physics,” Kalanick explained. “And we’re approaching what’s known. And I’m trying to poke and see if there’s breakthroughs to be had. And I’ve gotten pretty damn close to some interesting breakthroughs just doing that.”
Then there are the programmers extolling "vibe coding" and how it increases their productivity. CEOs who buy this pitch are laying off staff left and right. For example, Jordan Novet reports in "Microsoft laying off about 9,000 employees in latest round of cuts":
Microsoft said Wednesday that it will lay off about 9,000 employees. The move will affect less than 4% of its global workforce across different teams, geographies and levels of experience, a person familiar with the matter told CNBC.
...
Microsoft has held several rounds of layoffs already this calendar year. In January, it cut less than 1% of headcount based on performance. The 50-year-old software company slashed more than 6,000 jobs in May and then at least 300 more in June.
Asked to explain the ABAB rhyming scheme, OpenAI's GPT-4o did so accurately, responding, "An ABAB scheme alternates rhymes: first and third lines rhyme, second and fourth rhyme."
Yet when asked to provide a blank word in a four-line poem using the ABAB rhyming scheme, the model responded with a word that didn't rhyme appropriately. In other words, the model correctly predicted the tokens to explain the ABAB rhyme scheme without the understanding it would have needed to reproduce it.
The problem with potemkins in AI models is that they invalidate benchmarks, the researchers argue. The purpose of benchmark tests for AI models is to suggest broader competence. But if the test only measures test performance and not the capacity to apply model training beyond the test scenario, it doesn't have much value.
16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%--AI tooling slowed developers down.
Even the devs who liked the AI found it was bad at large and complex code bases like these ones, and over half the AI suggestions were not usable. Even the suggestions they accepted needed a lot of fixing up.
Julia Liuson, president of the Microsoft division responsible for developer tools such as AI coding service GitHub Copilot, recently sent an email instructing managers to evaluate employee performance based on their use of internal AI tools like this.
"AI is now a fundamental part of how we work," Liuson wrote. "Just like collaboration, data-driven thinking, and effective communication, using AI is no longer optional — it's core to every role and every level."
Liuson told managers that AI "should be part of your holistic reflections on an individual's performance and impact."
If the tools were that good, people would use them without being threatened. If the tools were that good, people would pay for them. But Menlo Ventures found that only 3% of consumers pay anything. They are happy to use free toys but they have other spending priorities. Other surveys have found numbers up to 8%, but as Ted Gioia notes in The Force-Feeding of AI on an Unwilling Public:
Has there ever been a major innovation that helped society, but only 8% of the public would pay for it?
There’s another reason why huge tech companies do this—but they don’t like to talk about it. If they bundle AI into other products and services, they can hide the losses on their income statement.
That wouldn’t be possible if they charged for AI as a standalone product. That would make its profitability (or, more likely, loss) very easy to measure.
Shareholders would complain. Stock prices would drop. Companies would be forced to address customer concerns.
But if AI is bundled into existing businesses, Silicon Valley CEOs can pretend that AI is a moneymaker, even if the public is lukewarm or hostile.
Yesterday Salesforce announced that prices on a pile of their services are going up around 6% — because AI is just that cool.
Salesforce’s stated reason for the price rise is “the significant ongoing innovation and customer value delivered through our products.” But you know the actual reason is because f- you, that’s why. What are you gonna do, move to SAP? Yeah, didn’t think so.
One problem is that the technology Salesforce is charging its customers for doesn't work well in Salesforce's application space. Salesforce's own researchers developed a new benchmark suite called CRMArena-Pro:
CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance.
The agent bots had 58% success on tasks that can be done in one single step. That dropped to 35% success if they had to take multiple steps. The chatbot agents are also bad at confidentiality:
Agents demonstrate low confidentiality awareness, which, while improvable through targeted prompting, often negatively impacts task performance. These findings suggest a significant gap between current LLM capabilities and the multifaceted demands of real-world enterprise scenarios.
Despite the fact that most consumers won't pay the current prices, it is inevitable that once the customers are addicted, prices will go up spectacularly. But the wads of VC cash may not last long enough, and things can get awkward with the customers who are paying the current prices, as David Gerard reports:
You could buy 500 Cursor requests a month for $20 on the “Pro” plan. People bought a year in advance.
In mid-June, Cursor offered a new $200/month “Ultra” plan. But it also changed Pro from 500 requests to $20 of “compute” at cost price — the actual cost of whichever chatbot vendor you were using. That was a lot less than 500 requests.
You could stay on the old Pro plan! But users reported they kept hitting rate limits and Cursor was all but unusable.
The new plan Pro users are getting surprise bills, because the system doesn’t just stop when you’ve used up your $20. One guy ran up $71 in one day.
Anysphere has looked at the finances and stopped subsidising the app. Users suddenly have to pay what their requests are actually costing.
Anysphere says they put the prices up because “new models can spend more tokens per request on longer-horizon tasks” — that is, OpenAI and Anthropic are charging more.
The CEO who laid off the staff faces another set of "business risks". First, OpenAI is close to a monopoly; it has around 90% of the chatbot market. This makes it a single point of failure, and it does fail:
On June 9 at 11:36 PM PDT, a routine update to the host Operating System on our cloud-hosted GPU servers caused a significant number of GPU nodes to lose network connectivity. This led to a drop in available capacity for our services. As a result, ChatGPT users experienced elevated error rates reaching ~35% errors at peak, while API users experienced error rates peaking at ~25%. The highest impact occurred between June 10 2:00 AM PDT and June 10 8:00 AM PDT.
Second, the chatbots present an attractive attack surface. David Gerard reports on a talk at Black Hat USA 2024:
Zenity CEO Michael Bargury spoke at Black Hat USA 2024 on Thursday on how to exploit Copilot Studio:
Users are encouraged to link “public” inputs that an attacker may have control over.
An insider — malicious or just foolish — can feed their own files to the LLM.
If you train the bot on confidential communications, it may share them with the whole company.
63% of Copilot bots are discoverable online, out on the hostile internet. Bargury fuzzed these bots with malformed prompts and got them to spill confidential information.
Bargury demonstrated intercepting a bank transfer between a company and their client “just by sending an email to the person.”
So the technology being sold to the CEOs isn't likely to live up to expectations and it will cost many times the current price. But the way it is being sold means that none of this matters. By the time the CEO discovers these issues, the company will be addicted.
Early this spring, Ex Libris released the Summon “Research Assistant.” This search tool is Retrieval Augmented Generation, using an LLM tool (OpenAI’s GPT–4o mini at time of writing) to search and summarize metadata in their Summon/Primo Central Discovery Index.
We did a library-wide test mid-semester and decided that it’s not appropriate to turn it on now. We may do so when some bugs are worked out. Even then, it is not a tool we’d leave linked in the header, promote as-is, or teach without significant caveats (see Reflection).
Brief Overview of the Tool
This overview is for the Summon version, though I believe that the Primo version is pretty similar and it has some of the same limitations.
Query Conversion – The user’s question is sent to the LLM, where it is converted to a Boolean query that contains a number of variations of the query, connected with an OR. If the query is non-English, some of the variations will be in the query language, and the other variations will be in English.
Results Retrieval – The Boolean query is sent to CDI to retrieve the results.
Re-ranking – The top results (up to 30) are re-ranked using embeddings to identify five sources that best address the user’s query.
Overview Creation – The top five results are sent to the LLM with the instructions to create the overview with inline references, based on the abstracts.
Response Delivery – The overview and sources are returned to the user in the response.
There is one major caveat to the above, also in the documentation, which is the content scope. Once you get through all the exceptions,[1] only a slice of the CDI could make it into the top 5 results. Most notably, records from any of the following content providers are not included:
APA,
DataCite,
Elsevier,
JSTOR, and
Conde Nast.
These would be in the results you get when clicking through to “View related results,” but they could not make it into the “Top 5.”
Positive Findings
I would summarize the overall findings as: extremely mixed. As I said up front, we had enough concerns that we didn’t want to simply turn on the tool and encourage our wider base to try it out.
Sometimes, people got really interesting or useful results. When it worked well, we found the query generation could come up with search strings that we wouldn’t have thought of but got good results. I found some electronic resources about quilts that I didn’t know we had – which is saying something!
Some of the ways the tool rephrased research questions as suggested “related research questions” were also useful. A few people suggested that this could be used to help students think about the different ways one can design and phrase a search.
The summaries generally seemed accurate to the record abstracts. I appreciated that they were cited in a way that let me identify which item was the source of which assertion.[2]
We also had many concerns.
Massive Content Gaps (and Additions)
The content gaps are a dealbreaker all on their own. No JSTOR? No Elsevier? No APA? Whole disciplines are missing. While they do show up in the "View related results," those first 5 results matter a lot in a user's experience and shape expectations of what a further search would contain. If someone is in a field for which those are important databases, it would be irresponsible to send them to this tool.
The need for abstracts significantly limits which kinds of results get included. Many of our MARC records do not have abstracts. For others, one may infer the contents of the book from a table of contents note, but this requires levels of abstraction and inference which a human can perform but this tool doesn’t.
Then there’s the flip side of coverage. This is based on the Ex Libris CDI (minus the massive content gaps), which includes everything that we could potentially activate. At time of writing, it still doesn’t seem possible to scope to just our holdings (and include our own MARC). This means results include not only the good stuff we’d be happy to get for a patron via ILL but also whatever cruft has made its way into the billion+ item index. And that’s not a hypothetical problem. In one search we did during the session, so much potential content was in excluded JSTOR collections that a top 5 result on the RAG page was an apparently LLM-generated Arabic bookseller’s site.[3]
LLM Parsing / Phrasing
The next issue we encountered was that sometimes the LLM handled queries in unexpected[4] ways.
Unexpected Questions
First, the Research Assistant is built to only answer a specific type of question. While all search tools can be described that way, anyone who’s worked for more than 30 seconds with actual humans knows that they don’t always use things in the way we intend. That’s why we build things like “best bet” style canned responses to handle searches for library hours or materials with complicated access (like the Wall Street Journal).
It was not programmed to do anything with single-word searches. A search for “genetics,” for example, got the “We couldn’t generate an answer for your question” response. There wasn’t any kind of error-handling on the Ex Libris side to turn it into some kind of “I would like to know about [keyword],” even as a suggestion provided in the error message. For all my critiques of LLMs themselves, sometimes it’s just poor edge case handling.
Then there were the meta questions. Colleagues who staff our Ask-a-Librarian brought in a few that they’ve gotten: “Do you have The Atlantic?” or “What is on course reserve for XXXXX?” In both of those cases, the tool was not able to detect that this was not the kind of question it was programmed to answer. In both cases, it returned a few random materials and generated stochastic responses which were, of course, completely inaccurate.
LLM-Induced Biases
Then there were issues introduced by the nature of LLMs – how they tokenize and what kind of data they’re trained on:
A liaison librarian reported asking about notable authors from Mauritius and being given results for notable authors from Mauritania. I would guess this is a combination of stemming and lack of responses for Mauritius. But they are two very distinct countries, in completely different regions of a continent (or off the continent).
Another bias-triggering question related to Islamic law and abortion. The output used language specific to 20th/21st-century evangelical Christianity. Because the LLM’s output is non-deterministic, we could not replicate it, but instead got a variety of different phrasings of results of varying quality. This is a (not-unexpected) bias introduced by the data the LLM was trained on. Notably, it was not coming from the language of the abstracts.
Balancing Safety and Inquiry
Note: While I was finishing this blog post, the ACRLog published a post going into more detail about topics blocked by the “safeguards”. I brought this to our library-wide discussion but I’m going to refer readers to the above. Basically, if you ask about some topics, you won’t get a response. Even though some of these are the exact kind of thing we expect our students to be researching.[5]
When the Summon listserv was discussing this issue in the spring, I went and found the OpenAI Azure documentation for content filtering. They have a set of different areas that people can configure:
Hate and fairness
Sexual
Violence
Self-Harm
Protected material
Copyrighted text (actual text of articles, etc.) which can be output without proper citation
Code harvested from repositories and returned without citation
User prompt attacks
Indirect attacks
Groundedness (how closely it sticks to training data and how much it goes into statistically probable text output)
Configuration levels can be set at low, medium, and high for each. I shared the link and list of areas on the listserv and asked about which the Research Assistant uses but did not get an answer from Ex Libris.
Steps to Delivery
This next part relates to the idea of the Research Assistant itself, along with Ex Libris’s implementation.
Very, very few of our patrons need just a summary of materials (and, again, only of materials which happen to have an abstract, and only of the abstract, not the actual materials). Plenty of our patrons don’t need that at all. Unless they’re going to copy-paste the summary into their paper and call it a day, they actually need to get and use the materials.
So once they’ve found something interesting, what are their next steps?
Well, first you click the item.
Then you click Get It.
Then THAT opens a Quick Look view.
Then you click the title link on the item in the Quick Look View.
And oh look this was in the CDI but not in our holdings, so it’s sent me to an ILL page (this was not planned, just how it happened).
Maybe ExLibris missed the memo, but we’ve actually been working pretty hard to streamline pathways for our patrons. The fewer clicks the better. This is a massive step backward.
Reflection
I doubt this would be of any utility for grad students or faculty except as another way of constructing query strings. I do think it’s possible to teach with this tool, as with many other equally but differently broken tools. I would not recommend it at a survey course level. Is it better than other tools they’re probably already using? Perhaps, but the bar is in hell.
Optimal use requires:
Students to be in a discipline where there’s decent coverage.
Students to know that topical and coverage limitations exist.
Students to understand the summaries are the equivalent of reading 5 abstracts at once and that there may be very important materials in the piece itself.
Students to actually click through to the full list of results.
Ex Libris to let us search only our own index (due to the cruft issue).
Ex Libris to redesign the interface with a shorter path to materials.
Its greatest strength as a tool is probably the LLM-to-query translation and the recommendations for related searches. When it works. But with all those caveats?
I am not optimistic.
FWIW, I totally understand and support not including News content in this thing. First, our researchers are generally looking for scholarly resources of some kind. Second, bias city. ↩︎
These citations are to the abstract vs. the actual contents. This could cause issues if people try to shortcut by just copy-pasting, since we’re reliant on the abstract to reliably represent the contents (though there’s also no page # citation). ↩︎
A colleague who is fluent in Arabic hypothesized that was not a real bookstore because many small things about the language and site (when we clicked through) were wrong. ↩︎
Ben Zhao’s closing keynote for OpenRepositories goes into how these kinds of issues could be expected. So I’ll say “unexpected” from the user’s POV but also I cannot recommend his talk highly enough. Watch it. ↩︎
Whether ChatGPT can appropriately summarize 5 abstracts from materials related to the Tulsa Race Riots or the Nakba is a whole separate question. ↩︎
In 2020, the Emory Center for Digital Scholarship in Atlanta, Georgia reached out to the University of Michigan to contribute to the Sounding Spirit Digital Library (SSDL). They asked the Bentley Historical Library, the U-M Library, and the William L. Clements Library to contribute titles in our collections that would expand their digital collection. This post looks at the range of titles contributed, discusses the equipment used to digitize the titles, and analyzes the ways that SSDL and U-M Library align and vary in their digitization efforts.
I always feel like I have so much to write,
so much that I want to write.
So why is it that anytime I sit down to actually write my mind goes entirely blank?
Even if I have a list of intended topics right in front of me,
or a partial draft to work on,
I start feeling like I have nothing interesting to say.
The only thing I have much success in writing is
these long, boring, self indulgent walls of text
about how I feel about not being able to write.
I don’t know, maybe I should just publish this.
At least it will be some of my thoughts out in the world again.
Back in the early days of my blog it seemed to flow quite easily.
I was confident,
not really that what I had to say was correct,
or insightful,
or in any way important,
but that it was not going to get me into trouble
or draw criticism that I couldn’t handle.
I naively thought that I could say whatever I wanted without fear.
Since I am a white, well-educated, straight cis man,
I was largely right.
The only time I felt any discomfort was when I posted a paraphrased recipe from a book
and was threatened with legal action by the writer’s agent.
While I was certain that legally I was in the clear
(you can’t copyright a recipe),
and felt that since I was strongly recommending people buy the book it was OK morally too,
I deleted the post.
Of course I did.
I’ve spent my whole life learning over and over that conflict is bad,
it’s my responsibility to resolve it,
and it probably is me being unreasonable anyway.
Like a lot of ND kids,
I learned early on that all these intense feelings and sense impressions
(that I believed everyone felt)
were not to be acted upon because doing so only brought trouble.
So I think after that experience,
benign though it was,
I started to doubt whether I could speak freely after all.
At the same time I was awakening politically
and learning that some of the beliefs I thought were obvious
(e.g. that all humans had rich inner lives that affected how they thought/acted)
were in fact not that common,
and that stating them in particular ways could be seen as somehow controversial.
I was also succeeding in my career and starting to internalise what I was told
about what I said reflecting on my employer and colleagues.
However it happened,
I had lost my voice.
Yes, in the sense of being literally unable to express myself in certain ways,
but also in that I felt there was something I once had but had misplaced
and desperately wanted to find again without knowing where to look.
First it only affected my personal writing,
but as the years went by it crept into my professional work too.
Working in a large organisation is never not political;
what you say affects not only your own standing and influence
but also that of your team and department in the wider organisation,
which is a heck of a responsibility.
By the time I ended up in the public sector
I was being regularly reminded that it was my job to remain neutral,
impartial,
disinterested.
Eventually I stopped saying much at all,
except to a few trusted friends and colleagues.
This is not my fault, exactly,
but it is partly related to some very core parts of my personality.
I know that breaking rules feels bad,
feels dangerous.
So when I’m told implicitly that
regardless of the depth of my knowledge or experience
my word isn’t good enough
—that only things supported by concrete evidence are OK to say—
I play it safe.
I see that not everyone plays by those rules,
and that some don’t suffer any consequences for breaking them,
but I’m unable to discern to my own satisfaction why that is or how to emulate it.
Is it because they are brighter or more experienced than I?
Are they party to information that explains how they are not,
in fact, breaking rules?
Do they have better understanding of what rules can or cannot be broken?
Are they more mature and self-confident through experience?
Are they simply confident because they have the privilege never to have been challenged
and the power to carry it off through confidence alone?
Maybe some of these are people I don’t want to emulate after all…
Still,
there are glimmers of hope.
I’m starting to become more aware of contexts
where I feel less shackled and more able to express myself.
Unsurprisingly,
it’s usually when I’m under less pressure (external or internal)
to “deliver” some “output” that meets some vague criteria,
and when I’m working with people I know well
and trust not to judge me personally if we disagree.
It’s also easier when I retreat to that shrinking zone
where I still feel like I can speak with some authority.
I’m looking for ways to put myself in that context more often.
Right now I think that means identifying a small group of trusted colleagues at work
that I can bounce ideas around with,
and doing more writing in the various communities I find myself in.
There was a long while when I didn’t feel I had the authority to speak
even about my own lived experience.
Two years of therapy, a lot of introspection, and the love of friends and family
have brought me to a place where I no longer doubt my own experience of the world
(well, not so much as I did — it’s a work in progress),
which gives me a place of solid ground to build out from
as I re-establish my faith in my skills, experience and judgement in other areas.
Well, this wasn’t the thing I was expecting to write when I started,
but here we are.
I guess we’ll see how it goes!
On the first day of the conference, a series of workshops and tutorials were held on cutting-edge topics such as Generative AI, the Human-Centric Web, and Information Security. Tutorials included sessions on using the National Internet Observatory for collecting web data for research and exploring the Meta Content Library as a research tool. We had to choose one workshop or tutorial to attend.
Tutorial: Beyond APIs: Collecting Web Data for Research using the National Internet Observatory
The first workshop session was “Beyond APIs”, where members of the National Internet Observatory (NIO) at Northwestern University discussed many of the current issues in interfacing with the Web, collecting data, and ethical concerns of data usage. We at WS-DL often face many of these same challenges when working with APIs of various sites, such as the deprecation of the original Twitter API discussed in the workshop. In the NIO program, users opt into the study and can both voluntarily donate their data and utilize mobile apps and browser extensions which monitor their Web activity and allow researchers to find interesting patterns in user behavior and the interconnectedness of the Web.
At #WebSci2025? Join our "Beyond APIs: Collecting Web Data for Research using the National Internet Observatory" tutorial that addresses the critical challenges of web data collection in the post-API era. 📍Where: ABE 2400 (15 Seminary Place) ⏰When: Tue, May 20, 9-12.
Lydia Manikonda from Rensselaer Polytechnic Institute presented work on characterizing linguistic differences between human and LLM-generated text using Reddit data from r/explainlikeimfive. They prompted ChatGPT with the same questions as those posed on the subreddit, then compared the top-voted human responses with the AI-generated ones, asking whether readers could distinguish between them and identify the author.
Celia Chen and Alex Leitch from the University of Maryland discussed “Evaluating Machine Expertise,” focusing on how graduate students develop frameworks to assess GenAI content. They noted that LLM-generated content often appears authoritative even without domain expertise. Their research examines whether students build mental models to decide when and how to use LLMs and how these frameworks shift across disciplines. They found that students protect work central to their professional identity, are skeptical of academic LLM content, but trust machine outputs when they can be tested. International students often verify results across languages, such as checking first in English and then confirming in Chinese.
Alexander Bringsjord from Rensselaer Polytechnic Institute explored GenAI’s dual deception based on content and perceived intelligence, highlighting LLM hallucinations and how LLMs blend prior conversation into answers rather than accurately interpreting new documents.
Lydia Manikonda also spoke about the importance of privacy and ethical practices as more companies integrate AI into customer experiences.
Finally, Eni Mustafaraj’s reflections on the Semantic Web and the current state of AI, along with her work on Credbot, left me reflecting on how we might engage with the web and information in the future. The discussion about whether we will continue to visit web pages or shift to new modes of communication felt especially relevant and worth pondering.
How is GenAI reshaping the web and our behavior online?@maidylm, @oshaniws, and Rui Fan are leading a #WebSci25@WebSciConf workshop on Human–GenAI Interactions: exploring ethical, social, and technical impacts on the web and its users.
The conference kicked off on Wednesday with opening remarks from General Chair Matthew Weber of Rutgers University. He welcomed attendees to New Jersey and introduced the other chairs. He shared that this year there were 149 submissions from 519 authors across 29 countries, with 59 papers accepted, resulting in an acceptance rate of 39.6%.
Mor Naaman from Cornell Tech delivered the first keynote of the conference. His talk was titled “AI Everywhere all at Once: Revisiting AI-Mediated Communication”. He reflected on how, when the concept of AI-Mediated Communication (AIMC) was first introduced in 2019, it seemed mostly theoretical and academic. However, in just a few years, AI has become deeply embedded in nearly every aspect of human communication, from personal conversations to professional work and online communities. Mor revisited key studies from the AIMC area, highlighting findings such as how suspicion of AI can undermine trust in interpersonal exchanges, and how AI assistants can subtly influence not only the language and content of our communication but even our attitudes. Given the rapid growth of AI technologies like ChatGPT, he proposed an updated understanding of AIMC’s scope and shared future research directions, while emphasizing the complex challenges we face in this evolving landscape. His talk highlighted the profound and often subtle ways AI is transforming our communication, not just in what we say, but how we think and connect with one another. It made me wonder about the future of communication as AI becomes increasingly integrated into our daily interactions, raising important questions about how we can preserve authenticity and trust amid this rapid technological rise.
After lunch, there were two parallel sessions and we attended Session 2, which seemed more aligned with our interests. Jessica Costa started the session with “Characterizing YouTube’s Role in Online Gambling Promotion: A Case Study of Fortune Tiger in Brazil”, which examines how YouTube facilitates the promotion of online gambling, highlighting its societal impact and providing a robust methodology for analyzing similar platforms. Next, Aria Pessianzadeh presented “Exploring Stance on Affirmative Action Through Reddit Narratives”. This study analyzes narratives on Reddit to explore public opinions on affirmative action, revealing how users express support or opposition through personal stories and thematic framing. Ashwin Rajadesingan presented “How Personal Narratives Empower Politically Disinclined Individuals to Engage in Political Discussions”, a study showing how sharing personal stories can motivate people who typically avoid politics to join conversations on Reddit, as these stories resonate more with people and tend to receive more positive engagement than other types of comments. Wolf-Tilo Balke concluded the session with "Scientific Accountability: Detecting Salient Features of Retracted Articles". This study identifies key characteristics of retracted scientific articles, such as citation patterns, language features, and publication metadata, to better understand their impact and improve detection of problematic research. This work offers a new lens to think critically about the credibility of scientific literature, especially in an era of information overload.
Day 2 at #WebSci25! Session 2: Content Analysis & User Narratives
Jessica Costa from @UFOP is presenting “Characterizing YouTube’s Role in Online Gambling Promotion: A Case Study of Fortune Tiger in Brazil”
Dr. Lee Giles delivered an excellent keynote on the operation and infrastructure of Web crawlers as well as search engines, both general and those created by him. These included numerous *Seer-variant engines, such as ChemXSeer and CiteSeerX. Being a friend of the WS-DL research group, this talk was a nice treat as a current WS-DL student and an incredible resource for other conference participants interested in Web crawlers. Through discussions with other students there, many had attempted to work with or create Web crawlers in the past without realizing the complexity and challenging hurdles they needed to overcome in the process of navigating the modern Web.
The WebSci ‘25 Lightning Talks were brief presentations meant to advertise and attract audience members to the large selection of posters being presented. As with the session and keynote talks, there was no shortage of interesting work on display.
Great posters and drinks reception @WebSciConf 2025 @RutgersU. Spot two web science founders in this photo. It’s great to see the younger generation here picking up the challenge that we laid down 20 years ago and running with it. pic.twitter.com/Rk00D6EQGB
I, David Calano, presented the poster "GitHub Repository Complexity Leads to Diminished Web Archive Availability", which highlighted the limited availability of Web hosted (i.e., GitHub) software repositories archived to the Wayback Machine. We looked at the page damage of archived repository landing pages and the availability of the archived source files themselves to assess the viability of potentially rebuilding archived software projects.
All of the papers in this talk presented interesting information and findings. For example, in the case of Kai-Cheng Yang and Filippo Menczer’s paper, it is interesting to note the left-wing bias inherent in LLMs and what effect such biases might have. As many Web users, particularly those of younger generations, default to consulting an LLM chatbot for information and rarely conduct further searches or analysis of sources, what happens to an already polarized society? Likewise, Chau Tong’s paper explored the topic of polarization in search engine results. The DocNet paper by Zhu et al. also provided a good technical exploration of bias detection systems leveraging AI and Python.
Session 7: Online Safety & Policy
Deanna Zarrillo presented “Facilitating Gender Diverse Authorship: A Comparative Analysis of Academic Publishers’ Name Change Policies”, which examines the publicly available name change policies of nine academic journal publishers through thematic content analysis, providing insights into how publishers manage rebranding and transparency during transitions. Tessa Masis presented her work, “Multilingualism, Transnationality, and K-pop in the Online #StopAsianHate Movement”, which examines how the #StopAsianHate movement used multilingual posts and K-pop fan culture to build global solidarity and amplify messages against anti-Asian hate across different countries and communities online. I, Kritika Garg, had the pleasure of presenting our work, “Not Here, Go There: Analyzing Redirection Patterns on the Web”. Our research examined 11 million redirecting URIs to uncover patterns in web redirections and their implications for user experience and web performance. While half of these redirections successfully reached their intended targets, the other half led to various errors or inefficiencies, including some that exceeded recommended hop limits. Notably, the study revealed "sink" URIs, where multiple redirections converge, sometimes used for playful purposes such as Rickrolling. It also highlighted issues like "soft 404" error pages, which cause unnecessary resource consumption. The research provides valuable insights for web developers and archivists aiming to optimize website efficiency and preserve long-term content accessibility.
Mohammad Namvarpour presented the last presentation of the session, “The Evolving Landscape of Youth Online Safety: Insights from News Media Analysis”, which examines how news stories about keeping kids safe online have changed over the past 20 years, showing that recent coverage focuses more on tech companies and government rules. The authors studied news articles to understand how the conversation about youth online safety has evolved.
Session 9: Contemporary Issues in Social Media
The papers from Session 9 explored a wide range of topics across social media from war, news, and even mental health and safety. “A Call to Arms: Automated Methods for Identifying Weapons in Social Media Analysis of Conflict Zones” by Abedin et al. presented an interesting framework for analyzing and tracking weapons of war and ongoing conflicts in active regions of war through social media platforms, such as Telegram. Their work heavily utilized computer vision and open-source datasets and provides a window into the scale and lethality of ongoing conflicts. The paper by Saxena et al., “Understanding Narratives of Trauma in Social Media”, was incredibly valuable in discussing the effects of trauma and social media on mental health.
Web Science Panel
The Web Science panel consisted of Dame Wendy Hall, Dr. Jim Hendler, Dr. Matthew Weber, Dr. Wolf-Tilo Balke, Dr. Marlon Twyman II, and Dr. Oshani Seneviratne. While the panel ran a little over time and not many questions could be asked in session, many were asked at the reception afterwards. It was a treat to hear from some of the key founders of the field of Web Science and core creators of the World Wide Web at large. The panel topics and moderated questions covered a broad range of subjects across the spectrum of Web Science, and it was great to hear the thoughts of such key figures on issues related to social media, AI, political governance, the Semantic Web, and the broad applications of communication and social science to the World Wide Web. Dame Hall and Dr. Hendler also discussed the Web Science Trust, which seeks to advance the field of Web Science and bring together researchers from across the globe.
Web Science Panel responding to attendee questions on a wide range of Web Science topics
Friday, May 23, 2025
Session 10: Platform Governance & User Safety
Session 10 also had a decent variety in terms of content. Two of our favorite papers presented were “Decentralized Discourse: Interaction Dynamics on Mastodon” by Brauweiler et al. and “Is it safe? Analysis of Live Streams Targeted at Kids on Twitch.tv” by Silva et al. Many of the WS-DL members are fans of new, unique, experimental, and decentralized Web tools and social platforms. Some of our members are active in various Mastodon communities and have even run their own instances. It was exciting to hear that some researchers are utilizing Mastodon and other social platforms, and how they tackled many of the technical challenges present among them. Like the work of Saxena et al. from Session 9, the work by Silva et al. on child safety on the popular streaming platform Twitch is of great importance for the health and wellbeing of the younger Web population. They found that Twitch currently has only minimal options in place for marking and filtering adult content, and only for select forms of media; such channels are self-reported as being for an adult audience, not automatically tagged as such. Furthermore, even if content is not marked for an adult audience, or is explicitly marked for kids or a younger audience, there is no guarantee that the language used by the streamer or the topics discussed in chat will be suitable for younger audiences, except through voluntary moderation.
Day 4 at #WebSci25! Final session of the @WebSciConf, “Platform Governance & User Safety,” is happening now!
Dame Wendy Hall’s closing keynote was an excellent look through the history of Artificial Intelligence and its relation to the Web. It served as an excellent reminder that progress is not always constant and that we tend to alternate between periods of uncertainty and rapid progress that can often blind us to potential hazards. It was also a reminder of how much Artificial Intelligence relies on the World Wide Web, its users surfing the waves of hyperspace, and the information they share along the way. The collective information of the Web is what comprises AI; without the input of billions of people around the world, there would be no substance to it. Some other great points from the talk were on the dangers and politics surrounding AI research, development, and utilization, and importantly, how much power and control we allow AI to have in our global society, and global cooperation (or lack thereof) in regards to AI regulation. The points of this keynote were extremely relevant given the simultaneous release of Anthropic’s Claude 4 LLM model, which in testing was found to engage in blackmail, whistleblowing, and other interesting behaviors.
Despite the week’s rainy weather, the conference was well-organized, stimulating, and rewarding. For some, this was a return to a familiar community, while for us it was a valuable first in-person conference experience. The opportunity to exchange ideas with colleagues from industry and academia worldwide was truly worthwhile. The dinner at the Rutgers Club was a fitting conclusion, providing space to connect before departing. With the next conference scheduled for Germany, we look forward to continuing these conversations there. Many thanks to the organizers for putting together an excellent event.
Open Data Editor is evolving into a key companion tool designed to support organisations in the early and critical stages of their AI journey. Today, we are announcing a series of four online meetings to reflect together with our communities on the project's present and future.
People want technology for people. Last week, the Open Knowledge Foundation hosted The Tech People Want Summit, bringing together 34 speakers from 19 countries to rethink how technology supports our work in conversational sessions. This wasn’t a summit about coding, but focused on non-technical professionals – including data practitioners, communicators, project managers, and advocates. Once...
We wanted to add transcripts/captions to these videos, for accessibility to those who are hearing impaired, for searchability of video transcript content, and for general usability. We do not have the resources to do any manual transcription or even really Quality Assurance, but we decided that OpenAI whisper automated transcription software was of sufficient quality to be useful.
We have implemented whisper-produced transcriptions. We use them for on-screen text track captions; for an accompanying on-the-side transcript; and for indexing for searching in our collection.
Putting Scientific Information to Work (1967) (of interest to my librarian colleagues, an early marketing video for ISI (then the “Institute for Scientific Information”) cutting edge citation indexing database. (Go to 31:53 for a weird 60s style recap of important events of the 60s?)
Many of our library/museum/archives peers use the open source Whisper implementation, or a fork/variation of it, and at first I assumed I would do the same. However, we deploy largely on heroku, and I quickly determined that the RAM requirements (at least for medium and above models) and the disk space requirements (a pip install openai-whisper added tens of gigs) were somewhere in between inconvenient and infeasible on the heroku cedar platform, at least for our budget.
This is not exactly the same product as OpenAI whisper, and exactly how it differs is not public. The hosted whisper does not let (or require?) you to choose a model, it just uses whatever it uses. It has fewer options — and in the open source realm, there are forks or techniques with even more options and features, like diarization or attempting to segment multi-lingual recordings by language. With the hosted commercial implementation, you just get what you get.
But on the plus side, it’s of course convenient not to have to provision your own resources. It is priced at $0.006 per minute of source audio, so that’s only around $25 to transcribe our meager 70 hour corpus, no problem, and no problem if we keep adding 70-200 hours of video a year as currently anticipated. If we start adding substantially more, we can reconsider our implementation.
Details of whisper API usage implementation
Whisper hosted API has a maximum filesize of 25 MB. Some of our material is up to two hours in length, and audio tracks simply extracted from this material routinely exceeded this limit. But by using ffmpeg to transcode to the opus encoding in an ogg container, using the opus voip profile optimized for voice, at a 16k bitrate — even 2 hours of video is comfortably under 25MB. This particular encoding was found often recommended on forums, with reports that downsampling audio like this can even result in better whisper results; we did not experiment, but it did seem to perform adequately.
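For illustration, the transcode described above amounts to roughly the following ffmpeg invocation (shown here wrapped in Python, with hypothetical filenames):
import subprocess

subprocess.run([
    "ffmpeg", "-i", "source_video.mp4",  # hypothetical input file
    "-vn",                               # drop the video track
    "-c:a", "libopus",                   # opus codec in an ogg container
    "-b:a", "16k",                       # 16 kbps keeps ~2 hours of audio well under 25MB
    "-application", "voip",              # opus profile optimized for voice
    "audio_for_whisper.ogg",
], check=True)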
Whisper can take a single source language argument — we have metadata already in our system recording the language of source material, so if there is only one listed, we supply that. Whisper can’t really handle multi-lingual content. Almost all of our current video corpus is English only, but we do have one video that is mixed English and Korean, with fairly poor audio quality — the whisper API refused to transcribe that, returning an error message (after a wait). When I tried it with open source whisper just out of curiosity, it did transcribe it, very slowly — but all the Korean passages were transcribed as “hallucinated” English. So erroring out may actually be a favor to us.
You can give Whisper a "prompt" — it's not conversational instructions, but is perhaps treated more like a glossary of words used. We currently give it our existing metadata "description" field, and that resulted in successful transcription of a word that never caught on, "zeugmatography" (the inventor of MRI initially called it that), as well as correct spelling of "Eleuthère Irénée". If it's really just a glossary, we might do even better by taking all metadata fields and listing each unique word once (or even focusing on less common words). But for now the description as-is works well.
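To make that concrete, here is a hedged sketch of such an API call using OpenAI's Python SDK. Again, our real code is Ruby; the metadata value, file name, and choice of the "vtt" response format are assumptions for illustration.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical description field pulled from the catalog record for this item.
item_description = "Paul Lauterbur discusses zeugmatography, his early name for MRI."

with open("two_hour_lecture.ogg", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",         # the hosted Whisper model identifier
        file=audio_file,
        language="en",             # supplied only when exactly one language is cataloged
        prompt=item_description,   # treated more like a glossary than instructions
        response_format="vtt",     # ask for WebVTT output, ready to serve as captions
    )

print(transcript)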
I had at one point wanted to stream my audio, stored on S3, directly to an HTTP POST to the API, without having to download the whole thing to a local temporary copy first. But Ruby's lack of a clear contract/API/shape for a "stream" object strikes again, making interoperability painful. This fairly simple incompatibility was just the first of many I encountered; patching this one locally just led me to the next one, etc. One of my biggest annoyances in Ruby, honestly!
Results?
As others have found, the results of Whisper are quite good, better than any other automated tool our staff had experimented with, and we think the benefits to research and accessibility remain despite what errors do exist. There isn't much to say about all the things it gets right; by listing only the things it gets wrong you might get the wrong idea, but it really does work quite well.
As mentioned, it can't really handle multi-lingual content.
Errors and hallucinations were certainly noticed. In one case it accurately transcribed a musical passage as simply ♪, but oddly labelled it as “Dance of the Sugar Plum Fairies” (it was not). An audience clapping was transcribed as repeated utterances of “ok”. This example might be more troubling: some totally imaginary dialog replacing what is pretty unintelligible dialog in the original.
There are wide differences in how long the cues are, although cue length is consistent within a piece: some pieces are transcribed with long paragraph-sized cues, others just phrase by phrase. I am considering post-processing to join tiny phrase cues into sentences, up to a maximum number of words (see the sketch just after these notes).
It not infrequently, well into a video, starts losing timing synchronization, getting 5, 10, or even 15 seconds behind. This is weird and I haven't seen it commented upon before. The text is still as correct as ever, so it's mostly an inconvenience. See for instance 9:09 in Baseline: The Chemist, definitely annoying. By 10:23 it has caught up again, but it quickly falls behind again, etc.
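Here is the rough cue-joining post-processing mentioned above, as an illustrative Python sketch. It operates on simple (start, end, text) tuples rather than a real WebVTT parser, and the word budget is an arbitrary assumption.

def merge_cues(cues, max_words=40):
    # Join adjacent short cues until one ends in sentence punctuation
    # or the accumulated text hits the word budget.
    merged, buffer = [], None
    for start, end, text in cues:
        if buffer is None:
            buffer = [start, end, text]
        else:
            buffer[1] = end
            buffer[2] = f"{buffer[2]} {text}"
        ends_sentence = buffer[2].rstrip().endswith((".", "?", "!"))
        too_long = len(buffer[2].split()) >= max_words
        if ends_sentence or too_long:
            merged.append(tuple(buffer))
            buffer = None
    if buffer is not None:
        merged.append(tuple(buffer))
    return merged

cues = [
    ("00:00:01.000", "00:00:02.200", "So today we are going"),
    ("00:00:02.200", "00:00:03.900", "to talk about citation indexing."),
]
print(merge_cues(cues))
# [('00:00:01.000', '00:00:03.900', 'So today we are going to talk about citation indexing.')]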
We don't really have the resources to QA even our fairly small collection, so we are choosing to follow in the footsteps of WGBH and their American Archive of Public Broadcasting, and publish anyway, with a warning influenced by theirs.
I think in the post-pandemic zoom world, most users are used to automatically generated captions and all their errors, and understand the deal.
WGBH digitizes around 37K items a year, far more than we do. They also run an instance of FixIt+ for crowd-sourced transcription corrections contributed by the public. While I believe FixIt+ is open source (or at least a much older version of it is?), and some other institutions may run it, we don't think we'd get enough public attention given our small number of videos, and we can't really afford to stand up our own FixIt+ even if it is available. But it does seem like there is an unfilled need for someone to run a shared, crowd-hosted FixIt+ and charge a reasonable rate to institutions that will only need a handful of corrections a year.
We did implement an admin feature to allow upload of corrected WebVTT, which will be used in preference to the direct ASR (Automated Speech Recognition) ones. As we don't anticipate this being done in bulk, right now staff just download the ASR WebVTT, use the software of their choice to edit it, and then upload a corrected version. This can be done for egregious errors as noticed, or using whatever policy/workflow our archival team thinks appropriate. We also have an admin feature to disable transcription for material it does not work well for, such as multi-lingual or silent videos, or other problem cases.
Text Track Captions on Video
We were already using video.js for our video display. It provides APIs based on the HTML5 video APIs, in some cases polyfilling/ponyfilling, in some cases just delegating to the underlying APIs. It has good support for text tracks. At present, by default it uses 'native' text tracks instead of its own emulated implementation on Safari (and maybe only there?) — you can force emulated text tracks, but it seemed advisable to stick with the default native behavior. This does mean it's important to test on multiple browsers; there were some differences in Safari that required workarounds (more below).
So, for text tracks we simply provide a WebVTT file in a <track> element under the <video> element. Auto-generated captions (ASR, or "Automated Speech Recognition", compare to OCR) don't quite fit the existing categories of "captions" vs "subtitles" — we label them as kind captions and give them an English label "Auto-captions", which we think/hope is a common short name for these.
Safari adding extra “Untitled” track for untagged HLS
For the most part, this just works, but there was one idiosyncrasy that took me a while to diagnose and find an appropriate fix for. We deliver our video as HLS with an .m3u8 playlist. There is a newer metadata element in the .m3u8 playlist that can label the presence or absence of subtitles embedded in the HLS. But in the absence of this metadata, Safari (both macOS and iOS, I believe) insists on adding a text caption track called "Untitled", which in our case will be blank. This has been noticed by others, but with less discussion on the internet than I'd expect, to be honest!
One solution would be adding metadata saying no text track is present embedded in the HLS (since we want to deliver text tracks externally in a <track> element instead). Somewhat astoundingly, simply embedding an EXT-X-MEDIA tag with a fixed static value of CLOSED-CAPTIONS=NONE on AWS Elemental MediaConvert (which I use) seems to take you into the "Professional Tier", costing 60% more! I suppose you could manually post-process the .m3u8 manifests yourself… including my existing ones…
Instead, our solution is simply, when on Safari, to hook into events on the video element and remove any text track with an empty-string language and title, which is what characterizes these. I adapted this from a similar solution in ramp, which chose the same direction. They wrote theirs to apply to "mobile which is not android"; I found it was actually needed on Safari (both iOS and macOS), and indeed not on Android Chrome (or iOS Chrome!).
I lost at least a few days figuring out what was going on here and how to fix it, hopefully you, dear reader, won’t have to!
Synchronized Transcript on page next to video
In addition to the text track caption in the video player, I wanted to display a synchronized transcript on the page next to/near the video. It should let you scroll through the transcript independent of the video, and click on a timestamp to jump there.
Unsure how best to fit this on the screen, and with what UX, I decided to look at YouTube and base my design on what they did. (On YouTube, you need to expand the description and look for a "show transcript" button at the bottom of it — I did make my 'show transcript' button easier to find!)
It shows up next to the video, or on a narrow screen right below it, in a 'window in window' internal scrolling box. I used some CSS to try to make the video and the transcript fit wholly on the screen at any screen size — an inner scrolling window that's taller than the parent window is a UX nightmare I wanted to avoid!
Looking at YouTube, I realized the feature that highlights the current cue as the video plays was also one I wanted to copy. That was the trickiest thing to implement.
I ended up using the HTML5 media element API and the events emitted by it and its associated child objects, based on the text track with cues I had already loaded into my video.js-enhanced HTML5 video player. I can let the browser track cue changes and listen for events when they change, in order to highlight the current cue.
If a track is set to mode hidden, the user agent will still track the text cues and emit events when they change, even though they aren't displayed. Video.js (and probably native players) by default has UI that toggles between shown and disabled (which does not track cue changes), so I had to write a bit of custom code to switch non-selected text tracks to hidden instead of disabled.
(Some browsers and/or video.js polyfill code may have been emitting cueChange events even on disabled tracks, contrary to or not required by spec — important to test on all browsers!)
At first I didn't realize I could use the user agent's own cue-tracking code, and was trying to catch every timeupdate event and calculate which cues included that timestamp myself. In addition to being way more work than required (the HTML5 video API has this feature for you to use), Safari wasn't emitting timeupdate events unless the status bar with the current time was actually on screen!
In general, the media element API and events seemed to be an area with, for 2025, an unusual level of differences between browsers — or at least between more-native Safari and more-emulated video.js in other browsers. It definitely is important to do lots of cross-browser testing. While I use it rarely, when I do I couldn't do without BrowserStack and its free offerings for open source.
Improved Video Controls
The default video.js control bar has, to my eye, undesirably small buttons and text, and is just not quite right in several ways. And there don't seem to be very many alternative open source themes or skins (video.js seems to use both words for this), and the ones that do exist often push "interesting" aesthetics instead of being neutral/universal.
Adding the caption button was squeezing the default control bar tight, especially on small screens. With that and the increased attention to our videos that transcripts would bring, we decided to generally improve the UX of the controls, but in a neutral way that was still generic and non-branded. Again, I was guided by both youtube and the ramp player (here’s one ramp example), and also helped by ramp’s implementation (although beware some skin/theme elements are dispersed in other CSS too, not all in this file).
Before (default video.js theme)
After (locally tweaked)
Scrubber/progress bar extends all the way across the screen, above the control bar (à la YouTube and ramp)
Making sure captions stay above the now-higher controls was tricky. I think this approach using translateY works pretty well, but I hadn't seen it done before? It also required a bit of Safari-specific CSS for Safari's "native text tracks". And a slide up/down animation on control bar show/hide, matching YouTube, is a nice touch.
Buttons split between right and left, again like both YouTube and ramp. Volume on the right only because it was somewhat easier.
Current time showing as current / total instead of the default elapsed-only, now matching YouTube and what some of our users asked for. (Default video.js has some weird spacing that you have to trim down once you show both current and total.)
Newer CSS @container queries make buttons smaller and/or remove some buttons when the screen is smaller (though I had some weird problems with this actually crashing the video player in my actual markup).
While these are fairly minor changes, I think the result is a much better look and usability for a general-purpose neutral theme/skin than video.js ships with out of the box. While relatively simple, it still took me a week or so to work through.
If there's interest, I could find time to polish it up further and release it as a more easily re-usable open source product; let me know!
Google indexable transcripts
One of the most exciting things about adding transcripts for our videos is that the text is now searchable and discoverable in our own web app.
It would be awfully nice if Google would index it too, so people could find otherwise hidden mentions of things they might want in videos. In the past, I've had trouble getting Google to index other kinds of transcripts and item text like OCR. While hypothetically Google crawls with JavaScript enabled and can click on things like tabs or disclosure "show" buttons, conventional wisdom seems to be that Google doesn't like to index things that aren't on the initial page and require a click to see — which matches my experience, although others have had other experiences.
In an attempt to see if I could get Google to index it, I made a separate page with just transcript text — it links back to the main item page (with video player), and even offers clickable timecodes that link back to the player at that time. This transcript-only page is the href on the "Show Transcript" button, although a normal human user clicking that link will ordinarily get JS that shows the transcript on the same page instead; you can right-click "open in new tab" to get the standalone page if you want. These extra transcript pages are also listed in my sitemap.
That doesn't say anything about how much SEO juice they have, but the first step is getting them in the index, which I had trouble doing before with content that required a tab or 'show' click to be visible. So we'll keep an eye on it! Of course, another option is putting the transcript on the page right from the start without requiring a click to show it, but I'm not sure that really serves the user.
We also marked up our item pages with schema.org content for video, including tags around the transcript text (which is initially in the DOM, but requires a 'show transcript' click to be visible). I honestly would not expect this to do much for increasing indexing of transcripts… I think according to Google this is intended to give you a "rich snippet" for video (but not to change indexing)… but some people think Google doesn't do too much of that anyway, and to have any chance I'd probably have to provide a persistent link to the video as a contentUrl, which I don't really do. Or maybe it could make my content show up in Google "Video" tab results… but no luck there yet either. Honestly I don't think this is going to do much of anything, but it shouldn't hurt.
Acknowledgements
Thanks to colleagues in Code4Lib and Samvera community slack chats, for sharing their prior experiences with whisper and with video transcripts — and releasing open source code that can be used as a reference — so I didn’t have to spend my time rediscovering what they already had!
Especially generous were Mason Ballengee and Dananji Withana who work on the ramp project. And much thanks to Ryan “Harpo” Harbert for two sequential years of Code4Lib conference presentations on whisper use at WGBH (2024 video, 2025 video), and also Emily Lynema for a 2025 whisper talk.
I hope I have helped pass on a portion of their generosity by trying to share all this stuff above to keep others from having to re-discover it!
Fig. 1 from Rawte et al.: Taxonomy for Hallucination in Large Foundation Models
The rise of large language models (LLMs) has brought about accelerated advances in natural language processing (NLP), enabling powerful results in text generation, comprehension, and reasoning. However, alongside these advancements comes a persistent and critical issue: hallucination. Defined as the generation of content that deviates from factual accuracy or the provided input, hallucination presents a multifaceted challenge with implications across various domains, from journalism to healthcare. This blog post presents insights from three recent comprehensive surveys on hallucination in natural language generation (NLG) and foundation models to provide an understanding of the problem, its causes, and ongoing mitigation efforts.
"Survey of Hallucination in Natural Language Generation" by Ji et al. (2022) provides a foundational exploration of hallucination in various NLG tasks, including abstractive summarization, dialogue generation, and machine translation. It defines key terminologies, identifies contributors to hallucination, and outlines metrics and mitigation strategies, making it an essential reference for understanding the broad implications of hallucination across NLP applications.
"A Survey of Hallucination in 'Large' Foundation Models" by Rawte et al. (2023) extends the discussion of hallucination into the realm of large foundation models (LFMs), such as GPT-3 and Stable Diffusion. This paper categorizes hallucination phenomena across modalities (text, image, video, audio), evaluates current detection and mitigation efforts, and highlights the challenges posed by the inherent complexity and scale of LFMs.
Finally, in "A Survey on Hallucination in Large Language Models", Huang et al. (2024) offer an in-depth examination of hallucination specific to LLMs, introducing a nuanced taxonomy that includes factuality and faithfulness hallucinations. This work delves into the causes of hallucination during data preparation, training, and inference, while also addressing retrieval-augmented generation (RAG) and future directions for research.
Hallucination in Language Models
WSDL member Hussam Hallak recently wrote a three-part blog post about LLM hallucinations in the Quran which does an excellent job of illustrating ways that LLMs can introduce problematic, erroneous information into their responses. Part 1 shows how Google Gemini hallucinates information. Part 2 compares Google Gemini's hallucinations with ChatGPT's. Finally, part 3 reviews DeepSeek and how it performs on the same prompts.
Hallucination in LLMs can be categorized into two primary types:
Intrinsic Hallucination: This occurs when the generated content contradicts the source input. For instance, in machine translation, a model might produce translations that directly conflict with the original text.
Extrinsic Hallucination: This involves the generation of unverifiable or fabricated content not grounded in the input. While such outputs might sometimes provide helpful additional context, they frequently introduce inaccuracies, posing risks in applications demanding high factual fidelity.
Table 1 from Ji et al., showing various types of hallucination in generated text relative to the source data
Fig. 2 from Ji et al.: Examples of hallucination in image captioning
Figure 2 from Ji et al.'s survey provides both intrinsic and extrinsic examples of object hallucination in image captioning, showcasing the challenges in multimodal tasks. For instance, when describing an image, a model might generate captions referencing objects that are not present in the visual content. This serves as a concrete illustration of how hallucination manifests beyond textual outputs, underscoring the need for robust grounding techniques in multimodal systems.
Recent research extends these definitions to include distinctions between factuality hallucinations (deviation from real-world facts) and faithfulness hallucinations (deviation from user instructions or input context).
Causes of Hallucination
Hallucination arises from multiple interconnected factors throughout the lifecycle of LLMs, encompassing data, training, and inference processes.
Data-Related Causes:
Source-Reference Divergence: According to Ji et al., in tasks like abstractive summarization, the reference text might include information absent from the source, leading models to generate outputs based on incomplete grounding.
Misinformation and Bias: Both Rawte et al. and Huang et al. explain that training datasets often contain erroneous or biased information, which models inadvertently learn and reproduce.
Training-Related Causes:
Pretraining Limitations: During pretraining, models optimize for next-token prediction, potentially leading to overgeneralization or memorization of inaccuracies.
Fine-Tuning Misalignment: Misaligned training objectives during supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF) can exacerbate hallucination tendencies.
Inference-Related Causes:
Imperfect Decoding Strategies: Techniques such as beam search may prioritize fluency over factuality, amplifying hallucination.
Overconfidence: Models often assign high probabilities to fabricated outputs, further complicating detection and mitigation.
Metrics for Evaluating Hallucination
Evaluating hallucination requires a nuanced approach:
Statistical Metrics: These include precision and recall for factuality and alignment.
Model-Based Metrics: Leveraging pretrained models to assess the veracity of generated content.
Human Evaluation: Employing expert annotators to identify and classify hallucinatory outputs, though this is resource-intensive.
Recent efforts also emphasize the development of task-specific benchmarks, such as TruthfulQA and HaluEval, which test models against curated datasets designed to probe their factual consistency and faithfulness. These benchmarks are instrumental in driving standardized comparisons across models.
Mitigation Strategies
Efforts to mitigate hallucination span data curation, training modifications, and inference adjustments:
Data-Centric Approaches:
Data Filtering: Removing noisy or biased samples from training datasets.
Augmentation with External Knowledge: Incorporating structured knowledge bases or retrieval-augmented generation (RAG) to ground outputs in verifiable facts.
One notable example of augmentation is the use of retrieval mechanisms that fetch relevant external documents during inference, providing models with updated and accurate information to support generation.
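As a rough illustration of that retrieval step (not code from any of the surveyed papers), the sketch below retrieves the documents most similar to a question and prepends them to the prompt. Here embed() and llm_generate() are hypothetical stand-ins for an embedding model and an LLM call, and corpus is a toy in-memory document store.

import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question, corpus, embed, k=3):
    # Rank documents by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]

def answer(question, corpus, embed, llm_generate):
    # Ground the model's answer in the retrieved context.
    context = "\n\n".join(retrieve(question, corpus, embed))
    prompt = (
        "Answer using only the context below. If the context does not "
        "contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm_generate(prompt)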
Training Enhancements:
Faithfulness Objectives: Adjusting training loss functions to prioritize adherence to source content.
Contrastive Learning: Encouraging models to distinguish between grounded and ungrounded content.
Knowledge Injection: Embedding domain-specific or real-time data updates into the model during training to reduce reliance on potentially outdated pretraining data.
Inference Techniques:
Iterative Prompting: Refining outputs through step-by-step guidance.
Factuality-Enhanced Decoding: Modifying decoding algorithms to prioritize accuracy over fluency.
Self-Verification Mechanisms: Employing models to cross-check their outputs against trusted sources or re-evaluate their answers iteratively.
Applications and Implications
The implications of hallucination extend far beyond theoretical concerns and carry real-world consequences. One such case happened when Air Canada's customer support chatbot hallucinated a policy that doesn't exist. When a passenger asked Air Canada's website chatbot about bereavement fares, the bot invented a rule saying he could buy the ticket at full price and claim the discount retroactively within 90 days. After the trip, the airline refused the refund, only to have British Columbia's Civil Resolution Tribunal order it to pay C$650 (plus interest and fees) and reject Air Canada's claim that the chatbot was "a separate legal entity." One stray hallucination turned into real legal and financial liability, illustrating why faithful generation matters.
Another case arose in Mata v. Avianca, Inc., a personal injury suit in the U.S. District Court for the Southern District of New York. Two attorneys used ChatGPT to draft part of a motion, and it hallucinated six precedent cases that never existed. U.S. District Judge P. Kevin Castel spotted the fabrications, dismissed the citations as "gibberish," and fined the attorneys and their firm Levidow, Levidow & Oberman $5,000 for "conscious avoidance and false statements." The sanction shows how un-vetted AI hallucinations can turn into real ethical and financial consequences.
These real-world examples illustrate how, in domains like healthcare, hallucination can lead to life-threatening misinterpretations of medical advice. In journalism, it risks eroding public trust by disseminating misinformation. Similarly, legal contexts demand utmost accuracy, where any hallucinatory output could have grave consequences.
Fig. 3 from Ji et al.: Examples of hallucination in visual question answering. The bold text is the output generated by the model; the part before it is the input prompt.
Figure 3 further elaborates on this by depicting scenarios where hallucinated captions create misleading narratives about visual data. These examples highlight the significant risks in real-world applications like autonomous systems or assistive technologies, emphasizing the urgency for improved evaluation and mitigation strategies.
Fig. 5 from Rawte et al.: A video featuring three captions generated by various captioning models, with factual errors highlighted in red italics.
Figure 5 from “A Survey of Hallucination in “Large” Foundation Models” adds another layer to this discussion by presenting video captioning outputs with factual errors highlighted in red italics. These examples reveal the complexities of grounding outputs in video data, where temporal and contextual nuances often exacerbate hallucination issues. They underscore the necessity for domain-specific training and evaluation techniques in applications like autonomous systems or assistive technologies.
However, hallucination is not universally detrimental. In creative applications like storytelling, art generation, or even protein discovery, "controlled hallucination" may enhance innovation by producing imaginative and novel content. Balancing these use cases with robust mitigation strategies is critical.
Future Directions
Dynamic Evaluation: Developing adaptive evaluation frameworks that account for diverse tasks and domains.
Contextual Understanding: Enhancing models’ ability to reconcile disparate contexts and avoid contradictory outputs.
Knowledge Boundaries: Establishing clear delineations of models’ knowledge limitations to improve trustworthiness.
Explainability and Transparency: Understanding why hallucinations occur and providing users with insights into the reasoning behind model outputs.
Cross-Modal Hallucination: Exploring hallucination across modalities, such as text-image generation systems, to develop unified mitigation strategies.
Conclusions
Addressing hallucination in LLMs is paramount as these models become integral to decision-making systems in high-stakes domains. By tackling the root causes through better data practices, refined training paradigms, and innovative inference strategies, the NLP community can enhance the reliability and trustworthiness of LLMs. Collaborative, interdisciplinary efforts will be essential to navigate the complexities of hallucination and unlock the full potential of generative AI.
Moreover, as LLMs continue to evolve, fostering ethical and responsible AI practices will be crucial. Researchers and developers must work together to ensure that these powerful tools serve as reliable partners in human endeavors, minimizing risks while amplifying benefits for society at large.
Jim
Works Cited
Ji, Ziwei; Lee, Nayeon; Frieske, Rita; Yu, Tiezheng; Su, Dan; Xu, Yan; Ishii, Etsuko; Bang, Yejin; Chen, Delong; Dai, Wenliang; Shu Chan, Ho; Madotto, Andrea; Fung, Pascale. "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, vol. 1, no. 1, Feb. 2022. DOI: https://doi.org/10.48550/arXiv.2202.03629.
Rawte, Vipula; Sheth, Amit; Das, Amitava. "A Survey of Hallucination in 'Large' Foundation Models." arXiv preprint, Sept. 2023. DOI: https://doi.org/10.48550/arXiv.2309.05922.
Huang, Lei; Yu, Weijiang; Ma, Weitao; Zhong, Weihong; Feng, Zhangyin; Wang, Haotian; Chen, Qianglong; Peng, Weihua; Feng, Xiaocheng; Qin, Bing; Liu, Ting. "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." ACM Transactions on Information Systems, Jan. 2024. DOI: https://doi.org/10.48550/arXiv.2311.05232.
The infrastructure researcher and CSO at Nym joins us for the seventeenth #OKFN100, a series of conversations with over 100 people about the challenges and opportunities facing the open movement.
#ODDStories – USTP YouthMappers conducted a workshop seminar to celebrate Open Data Day, focused on QGIS as a spatial analysis and visualisation tool for disaster preparedness.
#ODDStories – In their Open Data Day event, Spatial Girls Network empowered university students with geoskills and knowledge in open data and open-source data collection tools.
TinyCat District is here! Libraries with multiple locations or branches can now manage their accounts in one place. We created new tools and features for TinyCat District that will simplify your setup, billing, and reporting.
TinyCat District offers:
Synced collections. Copy your catalog across branches with a few clicks.
Shared branding. Customize your brand across branches.
Group billing. Simplify your payment process.
Grouped reports. Review information about your district and individual branches.
Each TinyCat District “superadmin” can manage the source collections, branch memberships, grouped reports, shared branding, and billing across the district.
You can choose your billing setup. With regular “branch billing,” each library branch pays for their own account. If you choose “group billing,” your district’s superadmin can submit a single payment for all branches. Accounts with group billing may be eligible for a discounted rate!
Currently, TinyCat District is offered to libraries with five or more branches. If your library is close to five branches but not quite there, reach out anyway to see if you’re eligible!
Why is it that these two worlds have similar values but barely seem to connect? While both aim to promote free access to information, they seem to do so through different types of people associated with the separate movements.
Since the public debut of ChatGPT in November 2022, the calls for librarians to adopt and promote generative AI (GenAI) technologies and to teach “AI literacy” have become part of everyday work life. For instruction librarians with reservations about encouraging widespread GenAI use, these calls have become harder to sidestep as GenAI technologies are rapidly integrated into search tools of all types, including those that libraries pay to access. In this article, I explore the dissonance between, on the one hand, instruction librarians’ pedagogical goals and professional values and, on the other, the capacities, limitations, and costs of GenAI tools. Examining discourse on GenAI and AI literacy, I pay particular attention to messages we hear about the appropriate ways to think and feel about GenAI. These “feeling rules” often stand in the way of honest and constructive dialogue and collective decision making. Ultimately, I consider work from within and outside librarianship that offers another view: that we can slow down, look honestly at GenAI capacities and harms, take seriously the choice some librarians may make to limit their GenAI use, and collectively explore the kinds of futures we want for our libraries, our students, fellow educators, and ourselves.
At the April 2025 Association of College & Research Libraries Conference, academic library workers gathered in person and online to explore the theme “Democratizing Knowledge + Access + Opportunity.” Before sessions about how to integrate generative AI (GenAI) tools into essential public services like teaching and research services, sociologist and professor of African American Studies Ruha Benjamin offered the opening keynote. Articulating the resonance of the conference theme for her, Benjamin reflected, “One way to understand the stakes of this conference, … why it’s so vital that we work in earnest to democratize knowledge, access, and opportunity at a moment when powerful forces are working overtime to monopolize, control, and ration these social goods, is that this is a battle over who gets to own the future, which is also a battle over who gets to think their own thoughts, who gets to speak and express themselves freely, and ultimately who gets to create” (Benjamin, 2025). Noting that technologies are never neutral but rather reflect “the values or lack thereof of their creators,” Benjamin drew a connection between current attacks on libraries and higher education and a category of technology that was prominent throughout the conference program: artificial intelligence. “[I]t should give us pause,” she asserted, “that some of the same people hyping AI as the solution to all of our problems are often the ones causing those problems to begin with.” Applause followed.
Though Benjamin did not name the prominence of AI across conference sessions, I was probably not the only person to notice the contrast between Benjamin’s critique of AI hype and the prevalence of conference sessions about promoting AI technologies and AI literacy in libraries.
As Benjamin continued, she turned to the chilling words of JD Vance at the 2025 Paris AI Summit: “Our schools will teach students how to manage, how to supervise, and how to interact with AI-enabled tools as they become more and more a part of our everyday lives.” As I listened, I thought these words could easily be mistaken as part of a talk on AI literacy by an academic librarian or educator with better intentions. I wondered how many others were thinking the same thing. Benjamin then reminded the audience of Vance’s ideological perspective, as she observed that in January 2021 Vance gave a speech at the National Conservatism Conference entitled, “The Universities are the Enemy,” in which he argued that universities must be aggressively attacked to accomplish his and his audience’s goals for the country (Vance, 2021).
It’s worth taking a brief step away from Benjamin’s keynote to point out that a couple of weeks after her talk, on April 23, President Donald Trump issued an executive order to promote AI literacy through a new White House Task Force on AI Education that will “establish public-private partnerships to provide resources for K-12 AI education, both to enhance AI-related education but also to better utilize AI tools in education generally.” The executive order’s Fact Sheet states that “AI is rapidly transforming the modern world, driving innovation, enhancing productivity, and reshaping how we live and work.” Thus, “[e]arly training in AI will demystify this technology and prepare America’s students to be confident participants in the AI-assisted workforce, propelling our nation to new heights of scientific and economic achievement” (The White House, 2025). This laudatory language about AI is perhaps unsurprising for an administration that established the Department of Government Efficiency (DOGE). DOGE purportedly aims to reduce “government waste, fraud, and abuse,” largely through eliminating government jobs and replacing workers with a combination of automation and tech workers who have been directed to violate digital privacy rights and regulations (Klein, 2025; Salvaggio, 2025).
What is perhaps more striking is the similarity between the White House’s rhetoric and that of many educators in universities and academic libraries. Can the difficulty of distinguishing between the dominant AI rhetoric in higher education and that from political leaders who have explicitly named universities as the enemy be a wake-up call for people in higher education and in libraries, a message that we need to give more weight to the ethical concerns surrounding GenAI technologies?[1]
Benjamin did not dwell long in her ACRL keynote on Vance’s vision for GenAI and AI literacy. Instead, she devoted most of her time to exploring imagination as a powerful means through which to envision the kinds of worlds we want to live in and to begin building. As she noted, imagination can envision dystopian futures, but it can also open more hopeful possibilities for the futures we want. “What if we took imagination seriously?” she asked. “Not as flights of fancy, but imagination as a resource, a capacity, a muscle? How might the powers of our collective imagination begin to transform the world around us?” (Benjamin, 2025). Here Benjamin articulated what I believe many in the academic library community have been thinking and feeling in the last few years, as pressure to integrate GenAI tools into library systems and library work has intensified, often accompanied by brief and perfunctory acknowledgements of GenAI’s present and potential harms that are then set aside.
Benjamin was inviting us to imagine alternatives to the narrative that GenAI technologies are the inevitable future of nearly all intellectual work. As I will explore, this process of imagining can include critically examining discourses about GenAI and AI literacy, as well as being curious about and attentive to our own affective experiences in response to GenAI technologies and discourses about them. If we accept this invitation to imagine, we might (re)discover what becomes out of view when so much of our attention is focused on a particular vision of the future of GenAI proliferation. We might widen our ideas of what is possible and nurture some sense of collective agency to work for the kinds of futures we want.
Of course, our individual imaginings and feelings don’t always match what a majority (real or imagined) appear to share. My own conceptions of, approaches to, and feelings about GenAI and AI literacy usually seem out of sync with the dominant discourse in higher education and librarianship (though with time I learned I have some company). Like many others, I am deeply concerned about the real and present costs of GenAI technologies that are rapidly being integrated into search and library tools. I am also unsettled by widespread overconfidence in these technologies’ abilities to generate mostly reliable information and to support research and learning. Both as a librarian and more recently as a professor of practice, I have struggled with how to understand and respond to the enthusiastic calls in higher education and academic librarianship for teaching a version of AI literacy which requires that educators and students use these tools, while giving limited attention to ethical questions surrounding GenAI. So often calls to teach AI literacy contribute to AI hype by misrepresenting GenAI’s capacities, minimizing acknowledgement of its harms, and implying that critique of GenAI stands in the way of human progress. We frequently hear that GenAI technologies are the inevitable future of the world and of libraries and that the only viable option is to embrace them, and quickly, before we fall behind. This message of urgency fits into an older narrative that libraries must embrace technological change or otherwise become obsolete (Birdsall, 2001; Glassman, 2017; Espinel and Tewell, 2023).
Through this article, I hope to further encourage what Benjamin recommends: rather than rushing to adopt and promote new technologies whose ethical implications raise major questions, we might slow down and claim more time and space for considering the present and potential implications of GenAI adoption and use. That time and space is necessary for the more expansive collective imagining that Benjamin proposes, imagining that takes into consideration the power and social structures that currently exist and those that we want to exist.
Of course, what we imagine to be desirable or possible is heavily shaped by our environments, social relationships and interactions, and the ideas and messages we encounter every day. Making space for imagination therefore also means making space for individual and collective inquiry that includes establishing agreed-upon facts about GenAI, rhetorical analysis of GenAI discourses, and critical reflection on our own thoughts and feelings about GenAI technologies and discourses. Being inclusive and expansive in imagining the futures we want also requires investigating the social expectations and pressures that often influence what we do and do not say in various professional circles.
With these things in mind, in this article I consider the dissonances between what we know about the limitations and harms of GenAI technologies and the imperatives we hear to adopt them. Can we reconcile the tensions between, on the one hand, the harms of GenAI technologies and, on the other, professional values like those articulated in the ALA Core Values of Librarianship, which include equity, intellectual freedom and privacy, public good, and sustainability (American Library Association, 2024)? And if we could magically resolve many of those tensions through a radically transformed AI infrastructure that is environmentally sustainable and does not depend on the exploitation of human labor, what might we lose when we offload cognitive tasks like searching for, selecting, reading, or synthesizing sources to GenAI technologies? What do we value about education and information literacy practices that need to be preserved with foresight and intention? Because my work as a librarian and as an educator centers on teaching and learning, I am especially interested in how we conceptualize and approach teaching what is often called AI literacy.
A necessary step in this process of imagining is investigating the messages embedded in much of academic and library discourse about GenAI technologies and the appropriate ways to think and feel about them (what sociologist Arlie Hochschild might call “feeling rules”). For instruction librarians, this process includes examining conceptions and framings of AI literacy and its role in information literacy education. A critical analysis of this discourse can help us open conversations about what we want information literacy instruction and library search tools to look like and do. This inquiry can also help us identify ways we have choice and agency in our own use of and teaching about GenAI tools. After an initial consideration of the feeling rules of GenAI and dominant discourse on AI literacy, I finally consider alternative ways to think about GenAI and to respond to calls for widespread adoption. Looking to work from within and outside librarianship, I consider another view: that we can slow down; take time to look honestly and critically at what we know, think, and feel about GenAI and its impacts; and consider ways to work toward the kinds of futures that align with our professional values. Part of this process is allowing space for more critical and skeptical perspectives on and feelings about GenAI, including nuanced arguments for AI refusal, a term I unpack in more detail later.
Feeling Rules
An impetus for my writing is that I see in much of our professional discourse and interactions a tendency to dismiss or minimize critiques of GenAI technologies, and sometimes even a shaming of those “Luddites” who do not embrace the technological changes of the day.[2] As I have argued elsewhere and further consider in this article, an especially powerful strategy for shutting down critique of GenAI is the construction and imposition of “feeling rules”: social expectations about the appropriate ways to feel and display emotion in a given context (Hochschild, 1979, 1983; Baer, 2025).
Feeling rules, as first described by sociologist Arlie Hochschild, are social norms that prescribe what feelings are and are not appropriate to have and express (Hochschild, 1979). Though feeling rules are not confined to the workplace, they are a powerful part of the emotional labor we do in our places of employment (Hochschild, 1983).[3] Feeling rules are typically discussed in the context of specific moments of social interaction among individuals, while in this article I apply them to our social relationships on both a micro- and a macro-level – that is, as evident not only in discrete individual social interactions but also in discourse about GenAI technologies that is informed by social relationships.
While feeling rules are usually established by those in positions of power, they are often internalized by those for whom the feeling rules are intended (Hochschild, 1979, 1983). In the case of GenAI, messages that librarians should be enthusiastic and optimistic about technological changes, which are frequently described as inevitable, often imply or outright assert that those who question or resist certain technological developments are simply overwhelmed by irrational fear and anxiety that they need to overcome. Giving too much attention to those unpleasant emotions or their underlying thoughts, the discourse often goes, risks making the profession obsolete.
Many of the feeling rules we observe in librarianship or higher education, of course, are influenced by social conditions and norms that extend beyond them. Search for the term “AI anxiety” and you will find articles explaining it as a psychological condition to be overcome by integrating AI technologies into your everyday life (Comer, 2023; Cox, 2023; Okamoto, 2023). The antidote to AI anxiety, according to its experts: accept and embrace the technology. For example, in the BBC article “AI Anxiety: The Workers who Fear Losing their Jobs to AI,” PricewaterhouseCoopers (PwC) Global AI and Innovation Technology Leader, Scott Likens, explains, “In order to feel less anxious about the rapid adoption of AI, employees must lean into the technology. … Instead of shying away from AI, employees should plan to embrace and educate” (Cox, 2023).
But what if emotional responses like “AI anxiety” are in large part deeply intelligent, a recognition of the unsettling facts about how most GenAI tools currently are built and deployed and what harmful impacts they already have? What if the cognitive dissonance that many of us experience when reading articles about AI anxiety or the necessity of AI adoption is worth our attention and curiosity? There is a stark mismatch between, on the one hand, imperatives to rapidly adopt and promote GenAI technologies and, on the other, the extensive documentation of the unethical labor practices upon which GenAI is built, as well as GenAI’s detrimental impacts on the environment, local communities, and society more broadly (Bender et al., 2021; Crawford, 2024; Electronic Privacy Information Center, 2023; Nguyen & Mateescu, 2024; Shelby et al., 2023). Despite this, librarians who are reluctant to adopt GenAI are frequently described as regressive and even harmful to a profession that must adapt to remain relevant. This shaming closes off open dialogue and critical thought.
For many librarians who teach, the calls to adopt GenAI, promote its use, and teach a kind of AI literacy that encourages others to do the same add to this dissonance. We repeatedly hear that GenAI is the future of work and the university and that we must therefore embrace it in our own work and teaching, regardless of our own views. Projects and initiatives at our places of employment and in our professional associations urge us to use these tools robustly, partly so we can help students, faculty, and community members keep up and succeed in our ever-changing world. Library vendors and the technology companies that libraries and universities pay for services and subscriptions continue to integrate GenAI tools into their platforms, usually offering people little to no choice in whether they use these extractive tools (though perhaps it’s time that libraries demand more choice from our vendors). The apparent lack of criticality toward these vendors and companies is further perpetuated by the refrain that librarians must teach the AI literacy skills that students and researchers will inevitably need. When we do hear about the problems with GenAI technologies, like the persistent inaccuracies in information generated from large language models (LLMs), or the extensive list of GenAI’s environmental and societal harms, reservations are usually a short footnote, followed by a call for ethical GenAI use that sidesteps the fact that using GenAI technologies in their current forms inevitably means adding to their harmful impact.
While some AI technologies may be beneficial in specific domains and justified for narrow use cases – for example, machine learning in some instances of medical diagnosis and drug discovery (Ahmad et al., 2021; Suthar et al., 2022) – they are now being integrated widely and indiscriminately across domains, including in areas where they often hinder human thought more than they support it. As Shah and Bender argue, the LLMs being integrated into library search systems that supposedly save people precious time may actually prevent the exploration, discovery, and information literacy development that these resources have long been meant to enable (Shah & Bender, 2024). Their argument is further supported by accumulating research on the detrimental effects of the cognitive offloading of tasks to GenAI (Gerlich, 2025; Shukla et al., 2025).
I see detrimental impacts of GenAI reliance directly in my own work teaching academic research and information literacy. Increasingly, a large portion of students have turned to GenAI to do nearly all the cognitive work that previously would have taken them so much time and effort. In the process, many if not most of these students are not developing the critical thinking and writing skills that have long been considered foundational to higher education. I also see a smaller group of students who are deeply concerned about the costs of GenAI and who are choosing the more labor-intensive path of developing and articulating their own thinking, rather than immediately turning to chatbots. The latter group is learning far more, and is far better prepared for the workplace and meaningful participation in society more broadly. The contrasting perspectives and behaviors of my students reflect that students’ views and uses of GenAI are, like ours, not monolithic. And also like us, students hear many of the same simplistic messages: that GenAI is an amazing technology that will make work faster and easier and that the only way to be prepared for the workplace and relevant in the world is to embrace GenAI.
In academic libraries, those who want to take a slower and more cautious approach to GenAI are frequently criticized as holding the profession back, resisting the inevitability of technological change, inhibiting progress, neglecting to prepare students for the future, and denying reality. Such criticisms have a silencing effect, discouraging people from expressing their legitimate concerns about a technology that in the widest circulating discussions is surrounded by more hype than critical investigation.
But when we can free ourselves of shaming rhetoric, we are better positioned to both support one another as respected colleagues and to think critically, and imaginatively, about how we want to engage with and teach about GenAI technologies. Given the prevalence of hype and misunderstandings surrounding GenAI, unpacking discourse on GenAI and AI literacy is a powerful and necessary part of this work.
Rhetorics of AI Literacy
Calls for embracing GenAI in higher education and academic librarianship are frequently accompanied by declarations that AI literacy is one of the most essential skills that students must now develop to be prepared for the workforce and for the future in general. Definitions of AI literacy and related competencies regularly add to the AI hype that Benjamin cautions against, as they repeatedly misrepresent GenAI’s abilities, mandate GenAI adoption, and reinforce the message that GenAI is the inevitable future which we must therefore embrace through adoption and active use. Like GenAI discourse more broadly, AI literacy rhetoric often includes brief asides to consider the potential risks of AI technologies to ensure they are used ethically and responsibly. Like a perfunctory checklist, these acknowledgements rarely offer a meaningful examination of the extensive harms of GenAI, nor do they confront the reality that more ethical use will only be possible with radical changes to GenAI technologies and their infrastructures. With the emphasis on adoption and use, this discourse leaves little to no room for considering the possibility of non-use or critical examination of use cases that might not warrant AI use.
Consider, for example, the AI Literacy Framework developed by academic and technology teams at Barnard College. Based on Bloom’s taxonomy, it is composed of four levels: 1) Understand AI, 2) Use and Apply AI, 3) Analyze and Evaluate AI, and 4) Create AI. Here, using AI precedes considering critical perspectives on AI, such as ethical concerns. After students have engaged with level 3, where they “Analyze ethical considerations in the development and deployment of AI,” the next level (4) mandates creating more of these technologies (Hibbert et al., 2024). Stanford University Teaching Commons’ AI literacy framework, which emphasizes “human-centered values,” similarly begins with developing a basic understanding of AI tools, in part through AI use (“functional literacy”). Following functional literacy is “ethical AI literacy,” which involves “understanding ethical issues related to AI and practices for the responsible and ethical use of AI tools.” Again, non-use is not presented as an option. Instead, the framework authors explain, “You and your students can identify and adopt practices that promote individual ethical behavior and establish structures that promote collective ethical behavior” (Teaching Commons, Stanford University, n.d.).[4] As these AI literacy frameworks suggest, much of the literature on AI literacy reflects a strange mixture of the AI inevitability narrative, superficial acknowledgement of ethical concerns, and AI hype that frames GenAI as a transformative force that will better society.
AI literacy frameworks created within librarianship frequently share these characteristics. ACRL President Leo Lo’s 2025 “AI Literacy: A Guide for Academic Libraries” is one such influential document. It is described as “a guide to AI literacy that addresses technical, ethical, critical, and societal dimensions of AI, preparing learners to thrive in an AI-embedded world.” In this new world, librarians can “become key players in advancing AI literacy as technology shapes the future” (Lo, 2025, p. 120). What that future looks like, or what we want it to look like, is not discussed.
Like other AI literacy frameworks, Lo’s guide predicates AI literacy on AI use, as the document defines AI literacy as “the ability to understand, use, and think critically about AI technologies and their impact on society, ethics, and everyday life” [my emphasis] (Lo, 2025, p. 120). As with the previously mentioned AI literacy frameworks, this document presents AI as pervasive and socially beneficial, while omitting a meaningful examination of the material conditions on which creating and using these technologies currently rests. At various points, the guide briefly notes the need to consider the limitations and ethics of GenAI tools, statements that are quickly followed by an emphasis on AI adoption and promotion that supports the common good, social justice, and empowerment. Consider, for example, the section on the societal impact of AI on the environment and sustainability:
While AI remains resource-intensive with a notable environmental footprint, discussions on sustainability should encompass more than just reducing consumption. The real potential lies in using AI to drive systemic changes that promote social and environmental well-being. For example, AI can optimize energy management in cities, creating smarter, more sustainable urban environments. It also has the capacity to revolutionize agricultural supply chains, increasing efficiency, reducing waste, and supporting sustainable practices across production and distribution. By integrating sustainability into the societal dimension of AI literacy, we can better understand AI’s role not just as a technological advancement, but as a force capable of reshaping our economic, social, and environmental landscapes for the better. [my emphasis] (Lo, 2025, p. 122)
Here, a minimization of the costs of AI coexists with an idealization of a future made possible by AI. No references are made to the water-thirsty and energy-hungry data centers rapidly being built to power GenAI, or how these data centers disproportionately harm economically disadvantaged communities and areas that are especially prone to drought (Barringer, 2025). If such harms seem like a distant problem that does not affect most of us, we are likely to be proven wrong. For example, in my current home of Austin, Texas, which is prone to both drought and power grid failures, data centers are big business (Buchele, 2024).
The influential role of Lo’s AI Literacy Guide is further reflected in another key ACRL effort to promote the integration of AI in academic libraries: the ACRL AI Competencies for Academic Library Workers (“AI Competencies”) (ACRL AI Competencies for Library Workers Task Force, 2025). The first draft, published online this past March, builds on Lo’s AI Literacy Guide. Like Lo’s AI Literacy Framework, AI Competencies does not consider whether GenAI tools are the optimal technologies for information literacy education, library research, or critical inquiry.
While Lo’s aforementioned AI Literacy Guide is apparently designed for library instruction, the AI Competencies document concentrates on the abilities that library workers should possess. Despite this different focus, the task force also associates their work with information literacy and notes early in the document that while developing the competencies, they “recognized significant parallels between responsible AI use and the principles of critical information literacy, as outlined in documents like the ACRL Framework for Information Literacy for Higher Education” (p. 1). This suggests the potential relevance of the document to librarians’ instructional work.
Before engaging in a closer examination of the AI Competencies first draft, I should stress that upon releasing the document the authors solicited feedback from the library community to inform future revisions. At the Generative AI in Libraries (GAIL) Conference this past June, the task force co-chairs shared the feedback they received and the kinds of revisions they plan to make (Jeffery and Coleman, 2025). Much of that feedback mirrors my own concerns about common conceptions of AI literacy that I have discussed thus far, conceptions that are reflected in the AI Competencies first draft as well. A considerable number of responses challenged the implications that library workers must use AI, that AI literacy necessitates AI use, and that responsible GenAI use is possible. Some also commented that the document did not adequately acknowledge GenAI technologies’ harms and that the description of AI dispositions (which I discuss in more detail momentarily) was not appropriate for a competencies document. The task force’s receptiveness to this input – which contrasts with the professional discourse about GenAI that I previously observed – suggests that many in our profession may be eager and now better positioned for more open and honest conversations about GenAI technologies than in the earlier days of learning about them.
Regardless of how the final draft of the AI Competencies document develops, the dispositions outlined in the first draft are worth closer attention because of the feeling rules about GenAI that they imply (for example, the expectation that competent library workers will embrace GenAI technologies and feel positively about them).[5] As the AI Competencies task force explains, the document’s dispositions “highlight the importance of curiosity, adaptability, and a willingness to experiment with AI tools” (pp. 2-3). Library workers who demonstrate the appropriate AI literacy dispositions: “Are open to the potential of responsible human-AI collaboration to unlock a future of greater equity and inclusion,” “Seek uses of AI that center and enhance human agency rather than displace and inhibit it,” and “Pursue continuous professional reflection and growth, especially concerning ethical and environmental responsibilities” (p. 3). Implicit within these dispositions is the belief that use of AI tools in their current form can lead to greater equity and can enhance human agency rather than displacing it. The document does not discuss actions or responses one might take in light of the harmful impacts of GenAI technologies. Instead, questioning whether AI tools should be used appears antithetical to the AI competencies articulated in the document. Like many other AI literacy frameworks and guides, this document implies that reflection is sufficient for demonstrating the correct AI competency dispositions. Such rhetoric, not unique to this document, obfuscates the reality that people have limited control over or insight into what the AI companies that own most AI tools do to build and maintain them.
When AI literacy documents assume GenAI use and come to dominate conversations about GenAI in academic libraries and higher education, or even become codified through formal adoption by institutions or organizations, how does this position library workers and educators who disagree with the assumptions embedded within those documents? Should these individuals be considered “AI illiterate,” in need of developing proper GenAI practices, attitudes, and dispositions? Through the lens of these documents, resisting rapid adoption of GenAI tools or questioning their value might be considered incompetence, regardless of how well informed or thoughtful someone’s perspective on GenAI is.
The AI Competencies first draft provides a window into many of the feeling rules about GenAI currently circulating in academic librarianship. Fortunately, they may ultimately not be codified in the final version. The task force’s honesty and critical reflection about the academic library community’s feedback, including questions about the appropriateness of including AI dispositions, is evidence that feeling rules and the narratives that help to drive them are never fully solidified and are rarely universally accepted. Feeling rules are often sites of contestation. Moreover, they can shift and change as we learn more and as we engage in critical reflection and dialogue.
New Imaginings for Responding to GenAI
As the critical feedback on the AI Competencies suggests, alternatives to the dominant AI literacy discourse and its implied feeling rules exist, even when those different viewpoints are harder to find. As some educators demonstrate, when we challenge the feeling rules embedded in much of the higher education and library GenAI discourse, we can open new possibilities for thinking about and responding to calls for GenAI adoption and AI literacy instruction that promote this adoption. We can begin to imagine ways of acting that might be out of view when we are mired in a particular set of feeling rules about GenAI (rules that have largely been constructed by the tech companies that stand to profit from the continued use and growth of their data-extracting products).
Charles Logan is among the educators going against the grain of AI enthusiasm and inviting us to think differently about common conceptions of AI literacy. Building on Nichols et al.’s (2022) work on the limits of digital literacy, Logan interrogates the extent to which AI literacy is even possible, given GenAI’s opaqueness and the hegemonic systems on which these technologies are built (Logan, 2024; Nichols et al., 2022). Noting the assumption of AI use in AI literacy discussions, Logan cautions, “An AI literacy devoid of power analysis and civic action risks becoming a talking point for Big Tech, and … a means for corporations like OpenAI and Google to set the terms of how educators and students think about and use their chatbots” (Logan, 2024, p. 363). Instead, Logan proposes a “more heterogeneous approach to generative AI” that allows room for non-use and critical inquiry into GenAI. One pedagogical response is “mapping ecologies of GenAI” that illuminate “its social, technical, and political-economic relations” (Logan, 2024, p. 362). For example, Logan describes a classroom mapping activity developed by Pasek (2023), in which students locate a nearby data center and investigate questions such as, “What potential land use, energy, or water conflicts might exist because of the data center?” and “Who benefits from the data center being here? Who loses?” (Pasek, 2023, cited in Logan, 2024, p. 366).
Drawing from the work of educators and scholars like Logan, librarian Joel Blechinger pays particular attention to dominant framings of AI literacy, which are connected to a longer tradition of presenting literacy as an antidote to intractable social issues and structural problems. Reiterating the question of whether AI literacy is possible, Blechinger asks librarians, “to what extent are efforts to theorize—and proclaim a new era of—AI Literacy premature? Do these efforts instead reflect our own professional investment in the transcendent power of literacy—what Graff & Duffy (2014) have termed ‘the literacy myth’—more than the applicability of literacy to GenAI?” Similar to Logan, Blechinger proposes that one alternative pedagogical approach could be to draw from a politics of refusal, rather than assuming AI use (Blechinger, 2024).
While some may have a knee-jerk negative response to the term refusal, the concept is more nuanced than one might first think. Writing and rhetoric scholars and teachers Jennifer Sano-Franchini, Megan McIntyre, and Maggie Fernandes, who authored “Refusing GenAI in Writing Studies: A Quickstart Guide,” describe GenAI refusal as encompassing “the range of ways that individuals and/or groups consciously and intentionally choose to refuse GenAI use, when and where we are able to do so.” Such refusal, they write, “is not monolithic,” nor does it “imply a head-in-the-sand approach to these emergent and evolving technologies.” Moreover, “refusal does not necessarily imply the implementation of prohibitive class policies that ban the use of GenAI among students” (Sano-Franchini et al., 2024).
This conception of GenAI refusal is aligned with the work of scholars like Carole McGranahan, who explains in “Theorizing Refusal: An Introduction” (2016) that “[t]o refuse can be generative and strategic, a deliberate move toward one thing, belief, practice, or community and away from another. Refusals illuminate limits and possibilities, especially but not only of the state and other institutions.” Such a politics of refusal, embedded in the fields of critical and feminist data studies, can be a source for imagining new possibilities, while being informed about the material conditions that underlie and shape technologies and technological use (D’Ignazio, 2022; Garcia et al., 2022; Zong & Matias, 2024).
Sano-Franchini, McIntyre, and Fernandes’s act of refusal, supported by an extended analysis of GenAI’s material impacts on society in general and on writing studies and higher education more specifically, can also be understood as a refusal to accept the feeling rules implied in so much of the discourse on AI literacy. The authors present ten premises on which they ground refusal as a reasoned disciplinary response to GenAI technologies. The first of these – “Writing studies teacher-scholars understand the relationship between language, power, and persuasion”– is especially relevant to considering the feeling rules that drive much of generative AI discourse in higher education and in libraries. The authors observe that the metaphors often applied to these technologies obscure the human labor that goes into GenAI training and ascribe human abilities to these technologies in ways “designed to cultivate trust in corporate, exploitative, and extractive technologies.” I would add that the messages we hear from our employers and other educators positioned as experts in GenAI and AI literacy further encourage us to trust these technologies over our reservations about them. Instead, Sano-Franchini, McIntyre, and Fernandes write, “We must be critical of the ways that these metaphors and affective associations are used to exaggerate the abilities of these products in ways that strengthen the marketing efforts of Big Tech corporations like OpenAI.” With this criticality, writing studies scholars can “use language that most accurately—and transparently—reflects the actual technology and/or … [highlight] the discursive limitations of the language … [they] commonly use to describe these products.” The authors draw attention to the economics behind GenAI and the ways it is promoted and marketed. Asking us to examine who truly benefits from the increased use of GenAI in higher education, they note that people in the EdTech industry have largely shaped this discourse (for example, articles in InsideHigherEd and The Chronicle of Higher Education written by individuals with close ties to the EdTech industry).
Such examinations of the language used to discuss GenAI in higher education help to illuminate what usually goes unspoken in that discourse. Sano-Franchini, McIntyre, and Fernandes’s critical examination of GenAI can be seen not just as a refusal to adopt GenAI technologies into their teaching. It is also a refusal to follow the feeling rules behind much of GenAI discourse. It is refusing to be shamed or to doubt oneself for having concerns about the value, ethics, and potential impacts of the GenAI technologies being so heavily promoted at our institutions. The authors choose critical thought over compliance with mandates that stifle critical inquiry and dialogue. Regardless of whether an individual or a group adopts a stance of GenAI refusal (a position that the authors stress looks different in practice for each individual in their context), examining and questioning the feeling rules implicit in much of GenAI discourse better enables us to make more intentional and informed choices about how we do or do not use these technologies and how we teach about them.
Examples of librarians challenging the feeling rules of dominant GenAI discourse exist, even when they are outliers. ACRL’s invitation to Ruha Benjamin to give the 2025 conference keynote is just one example of an interest within our profession in hearing more critical perspectives. Library workers’ feedback on the ACRL AI Competencies for Academic Library Workers is another. Some librarians are also vocalizing a need for slower and more critical investigations into GenAI tools, even when doing so risks social ostracism.
In the April 2025 issue of College & Research Libraries News, Ruth Monnier, Matthew Noe, and Ella Gibson candidly discuss their concerns about the GenAI tools that are increasingly being used and promoted in their organizations. Drawing attention to both the hype and the many ethical questions surrounding GenAI, they note the unpopularity of expressing reservations about adopting GenAI into libraries. Noe reflects, “the hype cycle is real and here it often feels like the choices are to get on board or lay down on the tracks to become part of a philosophy joke.” Monnier concurs: “I agree it is weird how fast universities, corporations, and individuals have pushed for the adoption and usage of generative AI, especially in the context of the rhetoric about ‘how bad’ social media and cellphones are within a K-12 environment. What makes this technology so unique or special that we as a society feel the immediate need to use and adopt it compared to other previous technologies?” (Monnier et al., 2025). The scope of this article does not allow for a close examination of such work, but additional resources in which library workers challenge the dominant feeling rules of GenAI include Joel Blechinger’s “Insist on Sources: Wikipedia, Large Language Models, and the Limits of Information Literacy Instruction” (2024), Violet Fox’s zine “A Librarian Against AI” (2024), and Matthew Pierce’s “Academic Librarians, Information Literacy, and ChatGPT: Sounding the Alarm on a New Type of Misinformation” (2025). Such work does not deny the fact that GenAI tools exist, nor does it suggest we can or should ignore these tools’ existence. It does open space for thinking more critically about the actual capacities and impacts of GenAI and making more intentional and informed choices about how we (dis)engage with GenAI technologies.
Many in our profession likely will not agree with much of what I have said here, but regardless of our individual views of GenAI technologies, I hope we can all agree that we value critical inquiry, and that an essential part of that process is making space for a consideration of varied perspectives and experiences. Critical inquiry and dialogue become possible and richer when we investigate the feeling rules that may be shaping, and sometimes limiting, professional discourse and practice. As we expand critical conversations about GenAI, we have more power to imagine the futures we want to build and cultivate, as Ruha Benjamin invites us to do.
In the spirit of collective imagining, I close with some questions I would like to collectively explore in and beyond libraries. I have organized these questions into two main areas: professional conversations and interactions and our teaching practices.
Professional conversations:
How can we be more inclusive of varied perspectives in our conversations about GenAI and related work, as we acknowledge the challenge of speaking honestly when one disagrees with dominant framings of GenAI and AI literacy?
How can we more critically examine our discourses and dialogues about GenAI, as we identify areas that may be unclear, inaccurate, or based on assumptions that need further investigation?
How do we practice a culture of care in these dialogic spaces and engage in constructive critique of ideas, not a critique of individuals?
How do we align our discourse about GenAI and related work with our professional and personal values, including those articulated in the ALA Core Values of Librarianship and the ALA Ethics of Librarianship?
How do we preserve time and energy for valuable work that may not be centered on GenAI, and that has been deprioritized because of the presently dominant focus on GenAI?
Teaching practices:
Historically, what have we valued about librarianship and information literacy education that still remains vital to us? How do we continue our engagement with those dimensions of our work?
What agency do students, faculty, and library workers have in whether/how they choose to use GenAI tools? What might it look like for teaching about GenAI technologies to allow for choice in whether and when to use GenAI tools? How can opting for non-use be respected as a choice that may be well-informed and even strategic?
What skills, understandings, and practices are prioritized or deprioritized in our teaching? What might be gained and what might be lost through our different prioritizations of pedagogical content and learning experiences? What guides our decisions about what to teach and how?
Many of the resources referenced in this article’s section on alternative imaginings can be springboards for further dialogue and for imagining the futures we want to have and to help build.
In closing, I return now to the end of Ruha Benjamin’s 2025 ACRL keynote. Ultimately, Benjamin revised her opening question “Who owns the future?” to “Who shares the future?” This reframing invites us to imagine collectively. That imagining will inevitably include differing views and beliefs, and it will not always be comfortable. But it can be more generative (in the human sense) and more inclusive when we consider questions like those above, and when we remember that most of us want a future in which people and communities can pose and explore their own questions, find sources of information worth their trust, and work together to actively make informed choices that support the common good. Most of us will hopefully also agree that this collective work is worth the discomfort of looking honestly at the feeling rules embedded in much of GenAI discourse and librarianship. We may be better able to discover and work toward the futures we want when we break those rules in ways that are kind and affirmative of everyone’s humanity, and that prioritize human thought and action over automation.
Acknowledgements
Though this work lists one author, the reality is that many people helped shape it.
My sincere thanks to external reviewer Joel Blechinger and Lead Pipe internal reviewers Ryan Randall and Pamella Lach for the time, thought, and care they gave to providing constructive feedback on the various stages of this article. Thank you also to Pamella, as Publishing Editor, for facilitating all steps of the publishing process, and to all members of the Lead Pipe Editorial Board for their attention to this article, the opportunity to publish it here, and all the work that goes into sustaining this volunteer-driven, open access publishing venue. I also want to express my appreciation to Melissa Wong, who provided writing feedback on a separate article on dominant narratives about generative AI in librarianship and encouraged me to further develop that article’s discussion of GenAI and feeling rules.
Ahmad, Z., Rahim, S., Zubair, M., & Abdul-Ghafar, J. (2021). Artificial intelligence (AI) in medicine, current applications and future role with special emphasis on its potential and promise in pathology: Present and future impact, obstacles including costs and acceptance among pathologists, practical and philosophical considerations. A comprehensive review. Diagnostic Pathology, 16(1), 24. https://doi.org/10.1186/s13000-021-01085-4
Baer, A. (2025). Unpacking predominant narratives about generative AI and education: A starting point for teaching critical AI literacy and imagining better futures. Library Trends, 73(3), 141-159. https://doi.org/10.1353/lib.2025.a961189
Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922
Benjamin, R. (2025, April 2). Opening keynote. ACRL 2025, Minneapolis, MN.
Blechinger, J. (2024, June 7). Insist on sources: Wikipedia, Large Language Models, and the limits of information literacy instruction. CAPAL 2024 (Canadian Association of Professional Academic Libraries), Online. https://doi.org/10.60770/36Y6-3562
Espinel, R., & Tewell, E. (2023). Working conditions are learning conditions: Understanding information literacy instruction through neoliberal capitalism. Communications in Information Literacy, 17(2), 573–590. https://doi.org/10.15760/comminfolit.2023.17.2.13
Garcia, P., Sutherland, T., Salehi, N., Cifor, M., & Singh, A. (2022). No! Re-imagining data practices through the lens of critical refusal. Proceedings of the ACM on Human-Computer Interaction, 6 (CSCW2, Article no. 315), 1–20. https://doi.org/10.1145/3557997
Gerlich, M. (2025). AI tools in society: Impacts on cognitive offloading and the future of critical thinking. Societies, 15(1), 6. https://doi.org/10.3390/soc15010006
Graff, H. J., & Duffy, J. (2014). Literacy myths. In B. V. Street & S. May (Eds.), Literacies and Language Education. Encyclopedia of Language and Education. Springer. https://doi.org/10.1007/978-3-319-02321-2_4-1
Hochschild, A. R. (1979). Emotion work, feeling rules, and social structure. American Journal of Sociology, 85(3), 551–575. https://doi.org/10.1086/227049
Jeffery, K. & Coleman, J. (2025, June 16). ACRL AI competencies for library workers. Generative AI in Libraries (GAIL) Conference. Online. https://www.youtube.com/watch?v=PLvf_OhaWZg
Lo, L. (2025). AI literacy: A guide for academic libraries. College & Research Libraries News, 86(3), 120-122. https://doi.org/10.5860/crln.86.3.120
Logan, C. (2024). Learning about and against generative AI through mapping Generative AI’s ecologies and developing a Luddite praxis. ICLS 2024 Proceedings (International Society of the Learning Sciences). https://repository.isls.org//handle/1/11112
Merchant, B. (2023). Blood in the machine: The origins of the rebellion against Big Tech. Little, Brown and Company.
Monnier, R., Noe, M., & Gibson, E. (2025). AI in academic libraries, part one: Concerns and commodification. College & Research Libraries News, 86(4), Article 4. https://doi.org/10.5860/crln.86.4.173
Nichols, T. P., Smith, A., Bulfin, S., & Stornaiuolo, A. (2022). Critical literacy, digital platforms, and datafication. In R. A. Pandya, J. H. Mora, N. A. Alford, & R. S. de R. Golden (Eds.), The Handbook of Critical Literacies (pp. 345–353). Routledge. https://doi.org/10.4324/9781003023425-40
Pierce, M. (2025). Academic librarians, information literacy, and ChatGPT: Sounding the alarm on a new type of misinformation. College & Research Libraries News, 86(2), Article 2. https://doi.org/10.5860/crln.86.2.68
Sano-Franchini, J., McIntyre, M., & Fernandes, M. (2024). Refusing GenAI in writing studies: A quickstart guide. Refusing GenAI in Writing Studies. https://refusinggenai.wordpress.com
Selber, S. A. (2004). Multiliteracies for a digital age. Southern Illinois University Press.
Shah, C., & Bender, E. M. (2024). Envisioning information access systems: What makes for good tools and a healthy web? ACM Transactions on the Web, 18(3), 33:1-33:24. https://doi.org/10.1145/3649468
Shelby, R., Rismani, S., Henne, K., Moon, Aj., Rostamzadeh, N., Nicholas, P., Yilla-Akbari, N., Gallegos, J., Smart, A., Garcia, E., & Virk, G. (2023). Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction. Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, 723–741. https://doi.org/10.1145/3600211.3604673
Shukla, P., Bui, P., Levy, S. S., Kowalski, M., Baigelenov, A., & Parsons, P. (2025, April 25). De-skilling, cognitive offloading, and misplaced responsibilities: Potential ironies of AI-assisted design. CHI EA ’25: Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (Article No.: 171), 1-7. https://doi.org/10.1145/3706599.3719931
Shuler, S., & Morgan, N. (2013). Emotional labor in the academic library: When being friendly feels like work. The Reference Librarian, 54(2), 118–133. https://doi.org/10.1080/02763877.2013.756684
Sloniowski, L. (2016). Affective labor, resistance, and the academic librarian. Library Trends, 64(4), 645–666. https://doi.org/10.1353/lib.2016.0013
Suthar, A. C., Joshi, V., & Prajapati, R. (2022). A review of generative adversarial-based networks of machine learning/artificial intelligence in healthcare. In S. Suryanarayan Iyer, A. Jain, & J. Wang (Eds.), Handbook of Research on Lifestyle Sustainability and Management Solutions Using AI, Big Data Analytics, and Visualization. IGI Global Scientific Publishing. https://doi.org/10.4018/978-1-7998-8786-7.ch003
Zong, J., & Matias, J. N. (2024). Data refusal from below: A framework for understanding, evaluating, and envisioning refusal as design. ACM Journal on Responsible Computing, 1(1), 1–23. https://doi.org/10.1145/3630107
[1] For those who would argue we should not conflate the extreme views of a few politicians with those of the AI industry, it is worth noting statements by tech leaders who have argued AI can replace education (Ivanova, 2025; Sam Altman [@sama], 2022).
[2] For an exploration of who the Luddites actually were and why the term’s pejorative use is misplaced, see Brian Merchant’s book Blood in the Machine (2023).
[3] Hochschild’s initial research on emotional labor focused on the experiences of flight attendants and debt collectors (Hochschild, 1983). Subsequent research by others building on Hochschild’s work examined the emotional labor of numerous caring professions, including librarianship, where workers are often expected to consistently display a friendly and cheerful demeanor (Evans & Sobel, 2021; Shuler & Morgan, 2013; Sloniowski, 2016; Sobel & Evans, 2020).
[4] The Stanford University Teaching Commons AI Literacy Framework is based partly on Selber’s 2004 multiliteracy framework, which includes three main dimensions of literacy: functional literacy, critical literacy (related to social and ethical issues), and rhetorical literacy.
[5] The choice to include dispositions in the AI Competencies for Academic Library Workers was likely inspired by the ACRL Framework for Information Literacy, which lists dispositions for each of its six conceptual frames.
I recently spent some time looking up metadata for DOIs in the Crossref API at work. Here are a few takeaways in case you ever need to do it too:
It looks like the most you can look up at a time is 49 using the https://api.crossref.org/works/?filter=doi1,doi2,… endpoint.
When looking up 40 at a time, the responses take on average 8.08 seconds to return (std dev 5.97).
If any of your DOIs fail to match this regex, the entire request will fail with an HTTP 400 error: r"^doi:10.)/.+$"
The server can return spurious HTTP 500 errors, which succeed just fine when retried.
Apparently you can do up to 50 requests per second to the API?! I only did one at a time, and put a one second sleep in between, since I wasn't in a particular rush and wanted to look up 500,000 DOIs without pissing anyone off.
Your mileage will vary, as these are just observations at a point in time, and the state of the service is guaranteed to change over time.
Of course you could choose to bypass the REST API altogether and work with the bulk download. But in the situation we have at work we want to be able to update our data on a weekly basis, so the API makes more sense.
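For what it's worth, the batched lookup described above can be sketched in a few lines of Python. This is a rough sketch only, assuming the requests library and Crossref's doi: filter on the /works endpoint; the batch size, one second sleep, and retry-on-500 handling mirror the observations above, but the details are illustrative rather than a drop-in tool.

import time
import requests

CROSSREF_WORKS = "https://api.crossref.org/works"

def lookup_dois(dois, batch_size=40, pause=1.0, retries=3):
    # Look up DOI metadata in batches, sleeping between requests
    # and retrying the occasional spurious HTTP 500.
    found = []
    for i in range(0, len(dois), batch_size):
        batch = dois[i:i + batch_size]
        params = {
            "filter": ",".join("doi:" + d for d in batch),
            "rows": batch_size,
        }
        for attempt in range(retries):
            resp = requests.get(CROSSREF_WORKS, params=params)
            if resp.status_code >= 500:
                time.sleep(pause)  # spurious 500s tend to succeed on retry
                continue
            resp.raise_for_status()
            found.extend(resp.json()["message"]["items"])
            break
        time.sleep(pause)  # stay well under any rate limit
    return found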
The goal was to show participants how to upload a dataset, check for mistakes in the data, and add helpful information (metadata) that explains the dataset clearly.
The final CNI membership meeting of Cliff’s tenure, held April 7–8, 2025, in Milwaukee, was to include a surprise presentation of the Festschrift’s table of contents. Though Cliff’s health prevented him from attending in person, he participated virtually and heard readings of excerpts from each contribution. Clifford Lynch passed away shortly after, on April 10, 2025. Authors completed their essays before his passing, and the original text remains unchanged.
Below the fold is a brief snippet of each of the invited contributions and some comments.
The festschrift's contents are:
Don Waters's contribution provides a comprehensive overview of Cliff's achievements. He stresses the continuity of Cliff's major themes:
In his opening address as executive director at the October membership meeting, Clifford promised to continue many of CNI's ongoing initiatives. However, he urged the membership to recast these activities and add others under three broad themes: (1) developing networked information content; (2) transforming organizations, professions, and individuals; and (3) building technology, standards, and infrastructure. In CNI's program plan, he suggested that these themes captured "the essential foundations of the vision of advancing scholarship and intellectual productivity." In other words, the scholarly work of research and teaching in the emerging networked environment depended on—and would surely collapse if sufficient attention was not given to—key cultural (content) and social (organization and skill) factors as well as the underlying technology.
Dan Cohen discusses Cliff's influence on the Google Books project and, more broadly, the impact of access to multiple forms of data online:
Cliff was saying something radical: networked digital collections transcend the benefits of mere remote access, the ability to read at a distance. They will inevitably lead to utterly new approaches and uses. Books would, of course, continue to serve as a rich narrative form of knowledge. But by becoming digital, books could mingle with other forms of information and data to enable research we had not previously considered or been able to do.
For instance, Cliff noted in his forum talk, the value of quotidian humanities reference works, such as gazetteers and ontologies, would grow enormously in the future as aids to search the large corpora that were coming online. These highly structured works could be fruitfully joined with less structured text, unlocking previously hidden knowledge and enabling new forms of analysis. At the same time, free-form text corpora could generate or supplement new kinds of reference works.
Kevin Guthrie and Roger Schonfeld describe how Cliff's focus on infrastructure influenced the development of such important platforms as JSTOR:
He saw long before others that infrastructure would be needed to enable the development of new forms of scholarship, research, and learning at the intersection of technology and education. For example, at his first meeting as CNI executive director in the spring of 1997, he highlighted the need for authorization and authentication infrastructure in his meeting road-map. He wrote that "authentication and authorization is emerging as a central issue, and we need your participation in developing these issues." This technique was characteristic of Clifford's approach; he would highlight an important need and appeal to those in the community to address it.
...
Clifford and CNI provided the venue for the development of many of the services that emerged to facilitate access, such as Shibboleth and OpenAthens. Over the course of his long tenure, Clifford and CNI have had similar impact on the development of many components of scholarly communications infrastructure, including model licenses, digital object identifiers, standards for reporting usage data, and institutional repositories, among others.
Christine Borgman describes four areas to which Cliff and CNI made major contributions, but where the fundamental problems remain unsolved:
A defining characteristic of infrastructure is that it is invisible until it breaks down. Invisibility is a sign of success, in that libraries were serving their communities so effectively that their existence was receding into the background. Invisibility is also a sign of failure, in that the substantial expertise, labor, and resources necessary to provide those services goes unrecognized. In times of shrinking budgets, invisibility is especially dangerous.
The loss of libraries' position as an intermediary in the access pipeline:
Libraries catalog materials they acquire to support the information needs of their communities. They have far less control over how remotely accessed content is organized. Library catalogs increasingly are subsumed under content management systems, and users find their way into many of these resources via search engines external to library control.
The lack of coherent economic and policy models for preserving and sustaining access to digital collections remains a massive problem for libraries—past, present, and future.
Lack of clarity as to the roles of different institutions in the access pipeline over time:
Digitization has blurred the boundaries between types of artifacts, such as books, records, and objects, and between institutions. Determining who collects what kinds of content and who sustains access for future generations becomes even more complex when publishers lease access to content, and as research data repositories are hosted by funding agencies or as independent entities.
Borgman also discusses the related problem of open data:
Making research data FAIR (findable, accessible, interoperable, and reusable) remains complex, labor-intensive, and expensive in most disciplines. Making research data sustainable for the long term involves an array of economic, social, institutional, and infrastructural challenges. The lack of sufficient data management workforce comprised of data librarians and archivists is central to these challenges.
Michael Buckland describes Cliff's pioneering work at Berkeley on library automation, which was truly ahead of its time:
From 1979 until he moved to CNI in 1997, Clifford was responsible for developing and implementing library infrastructure for the multicampus University of California system, including MELVYL, a highly innovative, user-oriented online replacement for card catalogs and its extension to provide access to medical and other bibliographical resources. To support it and other applications, he and others built an intercampus network that evolved into the university's Internet node.
Marjory Blumenthal describes Cliff's contributions to the Computer Science and Telecommunications Board (CSTB) at the National Academies:
Cliff contributed to multiple CSTB activities. He spoke at events, wrote material for CSTB reports, and participated in committee deliberations on such topics as information infrastructure, literacy (both information and information-technology literacy), and intellectual property rights. More often than not, those topics intersected, and Cliff was perceptive in acknowledging and assessing their interactions. This integrative outlook was (and is) a differentiator; so many experts focus on their primary expertise, whether as technologists, social scientists, lawyers, or something else.
Judy Ruttenberg covers Cliff's contributions to the Association of Research Libraries:
Many people have remarked over the years that Cliff is wonderful at making sense of the moment and synthesizing information in real time—and he is regularly asked to play that summation role at the end of conferences. What has been more remarkable for me is Cliff's ability to identify early signals and drivers of change across the research, policymaking, and cultural heritage sectors. Cliff specializes in directing our collective attention to significant trends that will affect the impact, stewardship, and trustworthiness of networked scholarly information, and the capacity of institutions to fulfill that mission.
Herbert Van de Sompel and Michael L. Nelson's contribution is a condensed version of their blog post. You should definitely read that in preference to the version in the festschrift. Herbert and Michael tell the fascinating story of a 1999 meeting in Santa Fe that featured a demonstration of their prototype Universal Preprint Service, which became influential in the development of OAI-PMH and OpenURL. The meeting started out badly but ended well thanks to Cliff. Herbert describes a typical Cliff intervention:
Prior to the start of the second day, he vented his frustration about the lack of progress to Cliff, who was about to start moderating the first session. Cliff was nice enough to let him ramble on a bit, and, in a manner that exemplified one of Cliff’s many unparalleled capabilities, he went on to open the meeting by providing two discussion topics regarding interoperability that he somehow had been able to synthesize from the first day’s discussions, which most had experienced as enjoyable yet lacking in any sense of concrete direction. One was whether archive functions, such as data collection and maintenance, should be decoupled from user functions, such as search. The other was about the choice between distributed searching across repositories and harvesting from them to build cross-repository search engines.
This is a great example of Cliff's ability to isolate the signal from the noise, amplify it, and persuade the community to focus on it. Most people who have been in discussions with Cliff have a similar story.
In 1986, I was invited to visit the University of California, Berkeley, to review their campus network. One of the numerous people there who needed to be interviewed was a young librarian who was said to have a good sense of the future. My prepared questions were oriented toward the future networking requirements for libraries. The interviewee, Cliff Lynch, took the conversations to several other places, however, including topics of privacy, access control, and security. Our conversation wandered this landscape far longer than the time set aside, leaving the rest of my site visit schedule in shambles. At the end, Cliff and I agreed that we would find some way to continue the "interview." For almost 40 years, we have.
Howard Besser's list of Cliff's contributions to the stewardship of digital images starts by "noting that Cliff is responsible for popularizing the term stewardship to reflect management across the life cycle of a digital work". The first item shows how far ahead of his time Cliff was:
In 1986, the Office of Information Systems and Technology at the University of California, Berkeley began work on a project to deliver high-quality digital images from its art museum, architecture slide library, and geography department. The developers believe that this software (eventually called ImageQuery) was the first deployed multiuser networked digital image database system. The software was first shown publicly in June 1987 at the conferences of the American Association of Museums (now the American Alliance of Museums) and the American Library Association. For most attendees, this was their first time viewing a high-resolution image on a computer screen.
ImageQuery was an X-Windows-based system with several features that were relatively new for the time: a graphical user interface, point-and-click searching, thumbnail images to permit browsing and sorting, tools for annotation of images, and the linking of images to locations on maps. In addition, ImageQuery was designed for networked accessibility, had client-server features, and permitted Boolean searches.
Vicky's and my contribution is the penultimate one, and again I suggest you read Lots Of Cliff Keeps Stuff Safe in the draft we posted back in April.
This text reports on the impact of a training course offered by a multiplier trainer from the School of Data network. A one-day workshop on Ensuring FAIR and High Quality Data using Open Data Editor (ODE) was hosted at Department of Library and Information Science, Swami Vivekanand Subharti University (SVSU), Meerut, India, on 28 June...
This text reports on the impact of a training course offered by a multiplier trainer from the School of Data network. With support from the Open Knowledge Foundation, the Technology for Inspiration Initiative led by Nuela Ada Ononiwu organised a two-day training workshop on the Open Data Editor (ODE) developed under the School of Data...
Someone in a Slack I’m in asked for advice on how to understand classes in programming, and I wrote a mini-novel about how I understand them, so I figured it might as well be a blog post! Here we go —
A class is a noun. That is to say:
it is a Specific Thing
which I can name.
And I make a class when I have a noun that needs to know things about itself.
What do I mean by “needs to know things about itself”? I mean that this Specific Thing has associated behavior and/or data. For instance, if I’m writing a circulation system for a library, an important noun might be a Checkout.
Checkouts have data they need to know about themselves: for instance, what time was the object checked out, what object is it, when is it due back, what user ID has it. Every Checkout needs to know this kind of data about itself, but the specific values of the data will differ for each Checkout.
Similarly, Checkouts have behavior — that is, verbs that happen to this noun! For instance, a Checkout might know how to resolve itself: add a checkin timestamp, remove the user ID for privacy purposes, et cetera.
And now the sweet part: Checkouts need to know the mechanics of how to resolve themselves, but code outside of Checkouts only needs to know that Checkouts can be resolved. This makes code a lot more readable — if you give the methods on Checkout good names, then you see something like checkout.resolve() and you understand what’s happening; you don’t need to get bogged down in the details of how it happens. You can even update the mechanics of how Checkout resolution works later, and the rest of the code doesn’t need to change. Only the Checkout class itself needs to keep track of the specifics. This principle of encapsulation is fundamental to software design and makes it easier to maintain code over time.
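To make that concrete, here is a minimal sketch of what such a Checkout might look like in Python. The field names and loan period are made up for illustration; they are not from any particular circulation system.

from datetime import datetime, timedelta

class Checkout:
    """One Specific Thing: a single item checked out to a single user."""

    def __init__(self, item_id, user_id, loan_days=21):
        # data every Checkout knows about itself
        self.item_id = item_id
        self.user_id = user_id
        self.checked_out_at = datetime.now()
        self.due_back = self.checked_out_at + timedelta(days=loan_days)
        self.checked_in_at = None

    def resolve(self):
        # behavior: add a checkin timestamp and drop the user ID for privacy
        self.checked_in_at = datetime.now()
        self.user_id = None

Code elsewhere can create one with Checkout(item_id="b123", user_id="u456") and later call checkout.resolve(), without ever caring how resolution works inside.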
OK, so how do I know when to write a class? I mean, at first, I usually don’t! I write some functions to handle behavior, and then over time I realize one or more of the following:
There’s a bunch of data that tends to travel together. Maybe it wants to be different properties of one thing.
I keep seeing functions that need more than 4 arguments. Maybe several of these arguments want to be properties of a common thing. (This is the most common way that I notice that I have a bunch of data traveling together; see the sketch after this list.)
I realize that there’s an important object or idea in my problem domain, and a big part of the cognitive work I’m doing is modeling its data and behavior, and my life will be easier if I just slap a name on that and treat it as a Specific Thing.
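As a small, made-up illustration of the first two signals: a pile of arguments that keeps traveling together can be promoted to one Specific Thing.

# before: the same values get passed around everywhere
def send_overdue_notice(item_id, user_id, due_back, email):
    print(f"Item {item_id} (user {user_id}) was due {due_back}; emailing {email}")

# after: they become properties of one thing, and functions take that thing
class Loan:
    def __init__(self, item_id, user_id, due_back, email):
        self.item_id = item_id
        self.user_id = user_id
        self.due_back = due_back
        self.email = email

def send_overdue_notice(loan):
    print(f"Item {loan.item_id} (user {loan.user_id}) was due {loan.due_back}; emailing {loan.email}")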
All of this, of course, was a learning journey I had to go on as a baby programmer. It’s easiest to get started by just writing functions that Do The Thing! That’s definitely where I got started, and it took me a while to build an actionable mental model of classes. For me, I think the first thing that got me over the cognitive hurdle was when Django introduced class-based views. I’d been working in Django but it had only had function-based views, and then I had to adapt to this new way of thinking. It was a challenge at first, but Django did such a phenomenal job of writing class-based views as repeating patterns — so the same sorts of methods tended to be exposed in different ones, and you could just grab and overwrite what you needed — that I started to understand their utility. Also, of course, the documentation was spectacular.
I also found it really helpful to watch a whole bunch of Sandi Metz’s Smalltalk-inflected Ruby conference talks. She has incredibly lucid things to say about object-oriented programming; in particular, the way she talks about message-passing really influenced how I think about writing tests, hiding vs showing information, and different objects’ scopes of responsibility. I don’t think you can go wrong with any of them, but the ones that have stuck with me are Nothing is Something, All the Little Things, and Polly Want a Message.
LibraryThing is pleased to sit down this month with bestselling author Susan Wiggs, whose prolific body of work—including more than fifty novels—has been translated into more than twenty languages, and is available in more than thirty countries. Known for such popular series as The Lakeshore Chronicles and the Bella Vista Chronicles, and for stand-alone bestsellers like The Oysterville Sewing Circle and Family Tree, she has been described by the Salem Statesman Journal as a writer who is “one of our best observers of stories of the heart [who] knows how to capture emotion on virtually every page of every book.” A former teacher, a Harvard graduate, and an avid hiker, she lives on an island in Washington’s Puget Sound. Wiggs sat down with Abigail this month to discuss her new novel, Wayward Girls—due out from William Morrow later this month, it is currently on offer through the Early Reviewer program—a tale of six teenage girls confined to a Catholic institution in the 1960s for being “gay, pregnant or unruly.”
Set in Buffalo, NY in 1968, Wayward Girls is described by the publisher as being based on a true story. Tell us about that story—how did you discover it, and what made you want to write about it?
I grew up in a small town in western New York, not far from Buffalo, but we moved overseas when I was a child. I never went back until 2021, when my big brother and I embarked on a journey to revisit our childhood haunts. Jon was facing a terminal diagnosis, and this bittersweet, nostalgic trip was an item on his bucket list.
When we visited the church of our youth, vivid memories of Jon as an altar boy flooded back—especially the time his sleeve caught fire from the incense thurible. You might notice a dramatized version of this incident early in the novel! (There’s a photo of Jon and me at St Mary’s Catholic Church in Olean, NY: photo here)
This moment sparked a deeper exploration into the impact of the Catholic Church in the 60s and 70s. My research led me to a forbidding stone complex at 485 Best Street in Buffalo that had once been a Magdalene Laundry—a place where “wayward girls” were sent to be “reformed” by strict nuns. Teenage girls were forced into slave labor and some delivered babies without proper medical care–babies that were sometimes stolen from them and placed for adoption. Though vaguely aware of the “laundries” in Ireland, I was shocked to learn they existed throughout the U.S. as well.
As a child, I remember more than one babysitter who “went away,” a euphemism for girls sent into hiding when they became pregnant. The more I learned, the more deeply I felt the helpless pain and rage of these young women. Their stories ignited my imagination, and Wayward Girls became one of my most personal and involving novels to date. I hope my passion for this topic touches readers’ hearts and inspires important conversations about our past treatment of young women, and–as Jodi Picoult points out–is a cautionary tale for today.
What kind of research did you need to do, while writing? Were there things you learned that surprised you, or that you found particularly disturbing or noteworthy?
Well, you won’t be surprised to know that I started at the library. The public library, of course, and I also had help from the librarian of the Buffalo History Museum. Every book I write begins with a visit to the library, and that has never varied in my 35+ years of writing fiction.
The former Good Shepherd facility in Buffalo still exists, although it’s no longer a Magdalene laundry. The atrocities committed there have been verified by accounts in scholarly and court documents, and by anecdotal evidence from former inmates. Currently there are multiple lawsuits involving the Good Shepherd, brought by individuals who suffered harm at the hands of the Catholic organizations responsible for operating them.
But as my editor pointed out in her letter to early readers, the novel is about the irrepressible spirit of women, and it’s not all doom and gloom. And in order to bring the story to life for readers, it’s a vivid snapshot of the world in 1968: the war in Vietnam, protests around diversity and women’s rights…eerily not so different from the world today. The race riot in Buffalo that was quelled in part by Jackie Robinson actually did occur. And Niagara Falls was actually “shut off” as depicted in the novel. The nuns characterized it as a “miracle,” although the real explanation is more prosaic and scientific.
Although Magdalen asylums or laundries operated throughout the Anglophone world, revelations regarding the abuses perpetrated in these institutions were particularly explosive in Ireland from the 1990s through the 2010s. Did this history inform your story, set in the states?
Like many readers, I was aware of (and horrified by) the Magdalen asylums in Ireland, thanks to news reports, books like Small Things Like These and films like The Magdalene Sisters and Philomena. There’s even a song called “The Magdalene Laundries” by Joni Mitchell. Probably the most moving and disturbing account I read during my research was Girl in the Tunnel by Maureen Sullivan.
I learned that in the United States, there were at least 38 such institutions. Women and girls, most from poor homes, were regularly sentenced to religious-run, but state-sanctioned prison systems of slave labor and abuse.
How do you approach disturbing topics in general, when writing a book? Is there anything in particular you hope readers will take away from Wayward Girls?
I’ve never shied away from dealing with controversial subjects in my books. I believe fiction can be a safe space to explore difficult realities that many people face. My approach is always to ask whether including disturbing content serves the story and characters in a meaningful way. I hope readers come away with insights about the enduring resilience of the human spirit rather than just feeling shocked.
For me as an author, the most gratifying feedback from a reader is to hear that not only were they transported and entertained, but that they gained something of lasting value from reading my book. Just last week, I received this moving note from a reader who is looking forward to Wayward Girls:
I have a personal history and I am still uncomfortable at times “coming out”, so to speak. While I am not in the book it was my experience in 197* ….At 16 I was sent to the Zoar Home for Unwed Mothers through the Catholic Diocese of Steubenville Ohio.
It is a VERY emotional continued lifelong journey – but healing to read, expose and work through all the trauma that comes back to the surface when faced with others’ stories or historical revelations.
I look forward to your beautiful writing portraying this story and adding continued society enlightenment of the traumatic experiences and shame those of us suffered, as we are still bearing the pain while continuing to navigate this life and memories….
Thank you Susan
Well. When an author gets a note like this from a reader, she has no higher calling. I only hope this reader will feel seen by Wayward Girls.
Tell us a little bit about your writing process. Do you have a particular place or time you like to write, or a specific routine you follow? Do you have any advice for aspiring writers?
Thank you for this question. I love talking shop! I’ve written 50+ novels, and no matter how the world (and technology) changes, my process is always the same. I write my first draft in longhand, using a fountain pen with peacock blue ink, in a Clairefontaine or Leuchtturm 1917 grid-ruled notebook. People who are left-handed know why—it helps me avoid dragging my sleeve through the wet ink, because the fountain pen ink dries instantly.
This process keeps me away from the computer, which I find to be a delightful distraction. Writing a novel of 130,000 words requires enormous focus. Eventually, I type it up (these days by dictation) in proper manuscript format and revise it about eleven hundred seventy-seven times, and send it to my literary agent and editor for notes. Then I revise it about eight hundred more times until I’m convinced that I’ve done my very best for my readers.
Instead of “aspiring writers,” I like the term “emerging writers.” They are already writing. They are just learning and polishing their craft as their stories emerge. I wish I could say there’s a magic formula to vault you into success, but there isn’t. The first job is to define what success looks like for you, personally. A traditional commercial publisher like HarperCollins? An indie-published book? Or just the sense of accomplishment that you’ve put your heart on paper?
Once you know what you want as a writer, find the journey that will take you there. It probably won’t be easy, but what goal worth having is? For traditional publishing, you need to find a literary agent (never ever ever pay a fee to an agent!) who will place your book with a publisher.
I don’t even know what to say about AI. A novel is meant to be a personal artistic expression–the author’s unique perspective, experiences, and voice. Having AI generate text defeats the purpose of creative writing. If you want to write a novel, the struggle and growth that comes from doing it yourself is actually the point.
In terms of WHAT to write, please write the story you’re dying to tell, not the thing that’s hot in bookstores and on Booktok right now. Those books were conceived and published long ago, so by the time you jump on the bandwagon, it has already left. Make yourself happy with your writing. Reach out to a writers’ group in your community. The local library is a good place to connect. Take a class. Write together, trade chapters, talk shop.
Writing fiction is like being the ultimate master of your own personal universe. There’s something deeply satisfying about finally having complete control over something, even if it’s just whether your protagonist gets coffee or gets hit by a bus.
It’s also the socially acceptable way to have elaborate conversations with imaginary people. You can kill off that annoying character who’s clearly based on your ex, give yourself superpowers through a thinly veiled alter ego, and resolve conflicts in ways that would never work, or win every argument with a perfectly timed witty comeback.
Plus, fiction lets you experience the rare joy of creating problems on purpose just so you can solve them. It’s like being a chaos agent and a benevolent fixer all at once. Where else can you ruin someone’s entire life in chapter three and then feel genuinely proud of yourself for it?
But I’ve strayed from the question! The answer is, READ. Read new books hot off the press. Read beloved older titles. Read the classics, the ones you thought were so boring when you were a kid in school. Because chances are, these books mean something to you now.
And at the end of the day, the very long writing day, that’s all an author can hope for—that readers were willing to spend their time reading a book filled with the deepest secrets of her heart.
Don’t shy away from your writing dreams. Tell your family/partner/friends that you have two sacred hours every day you’re going to devote to writing. And then write. WRITE.
What’s next for you? Do you have any books in the offing that you can share with us?
This is my second-favorite place to be in the writing journey. I have a blank page in front of me and I get to start something fresh!
In the meantime, there will be lots of editions of my books coming out—new paperback versions, new audiobooks, interesting new formats to explore.
Tell us about your library. What’s on your own shelves?
My library grows and changes over time. While books come and go, there are a few permanent fixtures: the first book I read (The Carrot Seed by Ruth Krauss), the first book I bought (Yertle the Turtle by Dr Seuss), the first book I took from the library (You Were Princess Last Time by Laura Fisher), and the first long book I read in one sitting and immediately reread (Diary of Anne Frank).
I do love my keepers, but I tend to give books away after I’ve read them. I’m always sending books I’ve loved to my reader friends and family. But I always keep the books signed by the author, because that signature makes me feel like I’m a member of an exclusive club. Although it’s bittersweet, I am especially fond of my books signed by authors who aren’t with us anymore—Madeleine L’Engle, Anne Rice, Sir Roger Bannister, Ray Bradbury, Crosby Bonsall.
What have you been reading lately, and what would you recommend to other readers?
I am on a mission to read all the books I can get my hands on by the people who read the early draft of Wayward Girls, because I know what a time commitment it is, how busy we all are, and what fantastic writers they are. So I’ve been reading books by Jodi Picoult, Adriana Trigiani, Patti Callahan Henry, Robert Dugoni, Kristina McMorris, and Shana Abé. All of these authors remind me of why I decided to write in the first place—to transport, entertain, surprise, and delight the reader.
If I could choose one poem of mine to explain my stance, or my artistic position, it would be “The One Thing in Life,” which appears in Lucky Life. In this poem I stake out a place for myself, so to speak, that was overlooked or ignored or disdained, a place no one else wanted. I mean this in a psychological and metaphorical and philosophical sense. When I think about the place “no one else wanted,” I think of an abandoned or despised area. I think of weeds, a ruin, a desert, but I think of these things not as remote in time or place from that which is familiar and cherished and valuable—our civilization—but as things that lie just under the surface and just out of eyesight. (Bradish, pp. 1486-1487).
I wonder, "What does Plato have to say about knowledge, and to what degree is knowledge an absolute?"
Size and Scope of the Corpus
To address this question, I amassed twenty-six English-language items authored or attributed to Plato. All of these items were garnered from Project Gutenberg. (See the cache, the rudimentary bibliography, and the computed summary.) Because the word "knowledge" occurs as one of the more frequent computed keywords in the corpus, I can assume, at least to some degree, that the concept of knowledge is discussed:
Word cloud illustrating the frequency of computed keywords in the Plato corpus
Traditional Close Reading Process
I can then search the collection for items where "knowledge" is a computed keyword. There are thirteen, and the traditional way to address my question would be to apply close reading to the following items:
But call me lazy. I want to address my question. I want to get answers, not pointers.
Knowledge Is Using Grammar
One of the most rudimentary ways to extract definitions is to concordance for something like "knowledge is". Here are some of the more meaningful results of such a process when applied to the corpus:
qualities? socrates: why, you know that knowledge is the first qualification of any teacher?
wledge of what we do not know? besides, knowledge is an abstraction only, and will not infor
each. we seem to find that the ideal of knowledge is irreconcilable with experience. in huma
ignored, and the certainty of objective knowledge is transferred to the subject; while absol
lace, they are abiding. and this is why knowledge is more honourable and excellent than true
wledge is power? protagoras agrees that knowledge is certainly a governing power. this, howe
s the food of the soul? surely, i said, knowledge is the food of the soul; and we must take
?' 'i think we must admit that absolute knowledge is the most exact knowledge, which we must
edge, one of two things follows--either knowledge is not to be attained at all, or, if at al
your favorite doctrine, socrates, that knowledge is simply recollection, if true, also nece
her incline to think that the method of knowledge is inseparable from actual knowledge, and
e measure of all things,' and that 'all knowledge is perception.' this was the subjective wh
ion; but i will venture to assert, that knowledge is true opinion: let this then be my reply
the doves, and say that the chase after knowledge is of two kinds? one kind is prior to poss
e fancied to be a perfect definition of knowledge is a dream only. but perhaps we had better
nce or of anything! and so, theaetetus, knowledge is neither sensation nor true opinion, nor
n the other hand, we are conscious that knowledge is independent of time, that truth is not
combine, and he is of opinion that such knowledge is granted to the gods only. to have seen
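For readers who want to reproduce this step, it amounts to a simple keyword-in-context (KWIC) search. Below is a minimal sketch in Python; the file name plato.txt, standing in for the concatenated corpus, is my assumption and not part of the original study:

import re

def kwic(text, phrase="knowledge is", width=40):
    # Return every occurrence of the phrase with `width` characters of context on each side.
    results = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        results.append(" ".join(text[start:end].split()))
    return results

with open("plato.txt", encoding="utf-8") as handle:  # hypothetical concatenated corpus
    corpus = handle.read()

for line in kwic(corpus):
    print(line)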
A similar way to extract such sentences is to find all sentences with the form subject-predicate-object, and then retain only the ones where the word "knowledge" is the subject and the predicate is a form of the verb "to be". Below is a sample of the results of such a process:
Why, you know that knowledge is the first qualification of any teacher
Not power but knowledge is the characteristic of a king or royal person.
Besides, knowledge is an abstraction only, and will not inform us of any particular subject, such as medicine, building, and the like.
Still at the bottom of the arguments there remains a truth, that knowledge is something more than sensible perception;--this alone would not distinguish man from a tadpole.
But the thoughts of men deepened, and soon they began to be aware that knowledge was neither sense, nor yet opinion--with or without explanation; nor the expression of thought, nor the enumeration of parts, nor the addition of characteristic marks.
And what we fancied to be a perfect definition of knowledge is a dream only.
And so, Theaetetus, knowledge is neither sensation nor true opinion, nor yet definition and explanation accompanying and added to true opinion?
Surely, I said, knowledge is the food of the soul; and we must take care, my friend, that the Sophist does not deceive us when he praises what he sells, like the dealers wholesale or retail who sell the food of the body; for they praise indiscriminately all their goods, without knowing what are really beneficial or hurtful: neither do their customers know, with the exception of any trainer or physician who may happen to buy of them.
Now the rest of the world are of opinion that knowledge is a principle not of strength, or of rule, or of command: their notion is that a man may have knowledge, and yet that the knowledge which is in him may be overmastered by anger, or pleasure, or pain, or love, or perhaps by fear,--just as if knowledge were a slave, and might be dragged about anyhow.
And what did you think, he said, of that part of the argument in which we said that knowledge was recollection, and hence inferred that the soul must have previously existed somewhere else before she was enclosed in the body?
'My answer is, that knowledge is perception.'
But if knowledge is perception, how can we distinguish between the true and the false in such cases?
Theaetetus offers a definition which he has heard: Knowledge is true opinion accompanied by definition or explanation.
And so, Theaetetus, knowledge is neither perception nor true opinion, nor yet definition accompanying true opinion.
But if, said Socrates, you are still incredulous, Simmias, I would ask you whether you may not agree with me when you look at the matter in another way;--I mean, if you are still incredulous as to whether knowledge is recollection.
And so, Theodorus, we have got rid of your friend without assenting to his doctrine, that every man is the measure of all things--a wise man only is a measure; neither can we allow that knowledge is perception, certainly not on the hypothesis of a perpetual flux, unless perchance our friend Theaetetus is able to convince us that it is.
I cannot say, Socrates, that all opinion is knowledge, because there may be a false opinion; but I will venture to assert, that knowledge is true opinion: let this then be my reply; and if this is hereafter disproved, I must try to find another.
Apparently, according to Plato, knowledge has something to do with perception and true opinion.
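As an aside, the subject-predicate-object filter described above can be approximated with a dependency parser. The sketch below uses spaCy; the corpus file name (plato.txt) and the choice of the small English model are my assumptions, not necessarily the tools used here:

import spacy

nlp = spacy.load("en_core_web_sm")  # a small English pipeline that includes a dependency parser

with open("plato.txt", encoding="utf-8") as handle:  # hypothetical concatenated corpus
    paragraphs = [p for p in handle.read().split("\n\n") if p.strip()]

for doc in nlp.pipe(paragraphs):
    for sentence in doc.sents:
        for token in sentence:
            # Keep sentences whose grammatical subject is "knowledge"
            # and whose governing verb is a form of "to be".
            if (token.dep_ == "nsubj"
                    and token.lemma_.lower() == "knowledge"
                    and token.head.lemma_ == "be"):
                print(" ".join(sentence.text.split()))
                break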
Knowledge Is Using Semantic Indexing
Yet another way to identify sentences alluding to knowledge is to apply semantic indexing against them and then query for the word "knowledge". I did just this, and the results were returned in the form of one long paragraph:
But even if knowledge can know itself, how does the knowledge of what we know imply the knowledge of what we do not know? But of what is this knowledge? And that knowledge which is nearest of all, I said, is the knowledge of what? But what is this knowledge then, and of what? LACHES: And the reason of this is that they have knowledge? Would not knowledge?--a knowledge of measuring, when the question is one of excess and defect, and a knowledge of number, when the question is of odd and even? 'Then you learn what you know.' What knowledge is there which has such a nature? And knowing is having knowledge at the time? And not knowing is not having knowledge at the time? But then what is this knowledge, and what are we to do with it? Certainly, of the knowledge which I have. Knowledge is prior to any particular knowledge, and exists not in the previous state of the individual, but of the race. And what is the nature of this knowledge or recollection? And we have not got the idea of knowledge? We cannot define knowledge until the nature of definition has been ascertained. And so we must ask again, What is knowledge? What then is knowledge? question--"What is knowledge?" 'My answer is, that knowledge is perception.' But if so, knowledge is not perception. What then is knowledge? All this time we have been repeating the words 'know,' 'understand,' yet we do not know what knowledge is. Is there some other form of knowledge which distinguishes them? What then is knowledge? But when the word 'knowledge' was found how was it to be explained or defined? It may be regarded as a higher degree of knowledge when we not only know but know that we know. Herein lies the difficulty which I can never solve to my satisfaction--What is knowledge? And so, when the question is asked, What is knowledge? That which is known is affected by knowledge, and therefore is in motion. Knowledge, like the other, is one; and yet the various parts of knowledge have each of them their own particular name, and hence there are many arts and kinds of knowledge. Not power but knowledge is the characteristic of a king or royal person.
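One way to approximate this semantic indexing step is to build a small latent semantic index from TF-IDF vectors and query it for "knowledge". The sketch below is only illustrative; the file name, the number of dimensions, and the number of hits returned are my assumptions:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

with open("plato.txt", encoding="utf-8") as handle:  # hypothetical concatenated corpus
    corpus = handle.read()

# Naively split the corpus into sentences.
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", corpus) if s.strip()]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(sentences)
lsi = TruncatedSVD(n_components=100, random_state=0).fit(tfidf)  # the latent semantic index
sentence_vectors = lsi.transform(tfidf)
query_vector = lsi.transform(vectorizer.transform(["knowledge"]))

scores = cosine_similarity(query_vector, sentence_vectors)[0]
top_hits = sorted(scores.argsort()[::-1][:50])  # fifty best hits, kept in document order
print(" ".join(sentences[i] for i in top_hits))  # one long paragraph of "knowledge" sentences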
The paragraph can be transformed into a set of smaller paragraphs by comparing each sentence to the sentence that follows it; if the two are significantly different, then a new paragraph is begun. Here is the result of such a process:
But even if knowledge can know itself, how does the knowledge of what we know imply the knowledge of what we do not know? But of what is this knowledge? And that knowledge which is nearest of all, I said, is the knowledge of what?
But what is this knowledge then, and of what? LACHES: And the reason of this is that they have knowledge? Would not knowledge?--a knowledge of measuring, when the question is one of excess and defect, and a knowledge of number, when the question is of odd and even?
'Then you learn what you know.' What knowledge is there which has such a nature? And knowing is having knowledge at the time? And not knowing is not having knowledge at the time? But then what is this knowledge, and what are we to do with it?
Certainly, of the knowledge which I have. Knowledge is prior to any particular knowledge, and exists not in the previous state of the individual, but of the race. And what is the nature of this knowledge or recollection? And we have not got the idea of knowledge? We cannot define knowledge until the nature of definition has been ascertained. And so we must ask again, What is knowledge? What then is knowledge? question--"What is knowledge?" 'My answer is, that knowledge is perception.' But if so, knowledge is not perception. What then is knowledge? All this time we have been repeating the words 'know,' 'understand,' yet we do not know what knowledge is. Is there some other form of knowledge which distinguishes them? What then is knowledge? But when the word 'knowledge' was found how was it to be explained or defined? It may be regarded as a higher degree of knowledge when we not only know but know that we know. Herein lies the difficulty which I can never solve to my satisfaction--What is knowledge? And so, when the question is asked, What is knowledge? That which is known is affected by knowledge, and therefore is in motion. Knowledge, like the other, is one; and yet the various parts of knowledge have each of them their own particular name, and hence there are many arts and kinds of knowledge. Not power but knowledge is the characteristic of a king or royal person.
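The splitting heuristic just illustrated can be sketched in a few lines: vectorize the sentences, measure the similarity between each sentence and the next, and begin a new paragraph whenever the similarity falls below a threshold. The threshold value here is an arbitrary assumption of mine:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def split_into_paragraphs(sentences, threshold=0.1):
    # Begin a new paragraph whenever adjacent sentences are sufficiently dissimilar.
    vectors = TfidfVectorizer().fit_transform(sentences)
    paragraphs, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(vectors[i - 1], vectors[i])[0, 0]
        if similarity < threshold:
            paragraphs.append(" ".join(current))
            current = []
        current.append(sentences[i])
    paragraphs.append(" ".join(current))
    return paragraphs

# A tiny demonstration using sentences taken from the paragraph above.
print(split_into_paragraphs([
    "But what is this knowledge then, and of what?",
    "And the reason of this is that they have knowledge?",
    "Then you learn what you know.",
]))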
Many people will now say, "Can you cite from whence these sentences came?" And my answer is, "Yes, see the following list of items and the number of identified sentences coming from each":
Notice how this list echoes the list identified via the keyword search, above. Moreover, if I were to apply close reading to my question, then I'd spend my time against Theaetetus.
Knowledge Is Supplemented with Large-Language Models
The long paragraph of extracted sentences can then be used as input to a large-language model for summarization. Such is what I did:
In this context, the question "What is knowledge?" is posed multiple times, with various responses and discussions that revolve around the nature of knowledge. Here's a brief summary of the main points:
1. Knowledge is prior to any particular knowledge, and exists in the race rather than the individual.
2. The nature of knowledge or recollection is not fully understood, and so the question "What is knowledge?" remains difficult to define.
3. Knowledge is perception, but this definition raises further questions about what constitutes knowledge.
4. Knowledge is a higher degree of understanding when one not only knows but also recognizes that they know.
5. The concept of knowledge is complex and multifaceted, encompassing various arts and kinds of knowledge.
6. Power is not the defining characteristic of a king or royal person, but rather knowledge.
Overall, the conversation highlights the challenges of defining knowledge and the various ways in which it can be understood and approached.
Somewhat helpful.
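For the record, this summarization step can be scripted against any OpenAI-compatible chat endpoint. In the sketch below, the endpoint (a local Ollama server), the model name, and the file holding the extracted sentences are all my assumptions, not necessarily what produced the summary above:

from openai import OpenAI

with open("extracted-sentences.txt", encoding="utf-8") as handle:  # hypothetical file holding the long paragraph
    long_paragraph = handle.read()

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")  # assumed local endpoint

response = client.chat.completions.create(
    model="llama3",  # hypothetical local model
    messages=[
        {"role": "system", "content": "Summarize the main points of the following text."},
        {"role": "user", "content": long_paragraph},
    ],
)
print(response.choices[0].message.content)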
The same long paragraph can be used as input into a large-language model and queried. This process is called "retrieval-augmented generation" (RAG). I "ragged" my paragraph and asked, "What is knowledge?". Here is the complete response:
As a college professor with a limited capacity for words, I must concisely address the question "what is knowledge?" Knowledge, in essence, refers to the understanding or awareness of something, whether it be a fact, concept, or experience. However, the nature of this understanding is not straightforward, as it implies an acknowledgment of what one does not know just as much as what one does know.
The closest knowledge we have is the knowledge of what? The answer lies in the realization that knowledge is prior to any particular knowledge and exists within the collective race rather than the individual. This collective knowledge is a recollection of experiences and perceptions that transcend individual understanding.
The question remains, "What is knowledge?" Is it perception? No, for knowledge encompasses more than just perceiving something. It is a higher degree of awareness, acknowledging both what we know and what we do not know. The difficulty lies in defining this concept, as it eludes our understanding despite repeated attempts.
In essence, knowledge is a complex and multifaceted construct that cannot be reduced to a single definition. It encompasses various forms of understanding, including power, which distinguishes a king or royal person from others. Ultimately, the nature of knowledge remains a mystery, and we are left with more questions than answers.
Again, somewhat helpful.
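A bare-bones version of this "ragging" step looks like the sketch below: retrieve the sentences most similar to the question and hand only those to the model as context. Again, the endpoint, the model name, the file name, and the number of retrieved sentences are my assumptions:

import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from openai import OpenAI

question = "What is knowledge?"

with open("extracted-sentences.txt", encoding="utf-8") as handle:  # hypothetical file of extracted sentences
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", handle.read()) if s.strip()]

# Retrieval: rank the extracted sentences against the question.
vectorizer = TfidfVectorizer(stop_words="english")
matrix = vectorizer.fit_transform(sentences)
scores = cosine_similarity(vectorizer.transform([question]), matrix)[0]
context = " ".join(sentences[i] for i in scores.argsort()[::-1][:25])

# Generation: ask the model to answer from the retrieved context only.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")  # assumed local endpoint
answer = client.chat.completions.create(
    model="llama3",  # hypothetical local model
    messages=[
        {"role": "system", "content": "Answer the question using only the supplied context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)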
Conclusion
So, what is knowledge according to Plato? Quite frankly, I am unable to answer the question definitively, but I am going away with a better and more nuanced understanding of the topic. And I am now equipped with a vocabulary for discussion. Yet, a sophomoric answer to the question might be, "Knowledge, according to Plato, has something to do with perception, 'true opinion', and recollection, but then again, the question may be moot or unanswerable."
For extra credit, I might try to understand the definition of "true opinion". Concordance snippets begin to address the question:
true opinion is as good a guide to correct action as knowledge
true opinion is no matter of conjecture with me
true opinion is proved by the rhetoric of the law courts, which cannot give knowledge
true opinion is surely unerring, and the results which follow from it are all noble and good
For extra extra credit, I might apply this same process against the works of some other philosopher and then compare and contrast. Hmmm...
Interesting fact: It took me longer to write up what I learned and format it for the Web than it did to do the actual work. I believe that says something about the process. After all, as a librarian, one of my goals is to "Save the time of the reader."
Epilogue
This missive and all of the supporting data ought to be available as a Distant Reader study carrel ("data set") at the following URL:
The UPS Prototype was a proof-of-concept web portal built in preparation for the Universal Preprint Service Meeting held in October 1999 in Santa Fe, New Mexico. The portal provided search functionality for a set of metadata records that had been aggregated from a range of repositories that hosted preprints, working papers, and technical reports. Every search result was overlaid with a dynamically generated SFX-menu that provided a selection of value-adding links for the described scholarly work. The meeting outcome eventually led to the Open Archives Initiative and its Protocol for Metadata Harvesting (OAI-PMH), which remains widely used in scholarly communication, cultural heritage, and beyond. The SFX-menu approach became standardized as the NISO OpenURL Framework for Context-Sensitive Services (NISO OpenURL), and compliant linking servers remain operational in academic and research libraries worldwide. Both OAI-PMH and NISO OpenURL, as well as associated systems and services, have been so widely deployed that they can safely be considered an integral part of the scholarly information infrastructure. The authors, who were deeply involved in devising the UPS Prototype and played core roles in the OAI-PMH and NISO OpenURL specification efforts, take the reader behind the scenes of the development of these technologies and reveal Clifford Lynch as the Invisible Influencer in the establishment of scholarly information infrastructure.
Introduction
The book “Music with Roots in the Aether” bundles interviews and essays about seven contemporary American composers, including Philip Glass, Alvin Lucier, and Pauline Oliveros [1]. It was published in 2000 and is based on a 1975-1976 project by Robert Ashley – himself a contemporary American composer – for which he created 14 one-hour videotapes focusing on the creative genius of the selected composers [2]. Ashley was asked to write a foreword for the book and spends his first paragraphs emphasizing the challenge involved in reflecting on a project he did 25 years earlier:
But the Foreword turned out to be hard, even for me. I couldn’t remember who I was when the project was conceived. I couldn't remember any of the energies of the ideas that went into the project. Purposely I have not been good at remembering old ideas. I burn bridges. It keeps the path clear. [1]
Ashley’s sentiment resonates with us, because, for this 2025 essay, we impulsively chose to illustrate Clifford Lynch’s impact on the development of infrastructure for research and education by means of the Universal Preprint Service (UPS) project that we jointly initiated and executed in 1999. Some memories have remained strong, others have faded and become uncertain, and, undoubtedly a lot has just evaporated into the fabric of time. Fortunately, there are external memories that can serve as fall-backs when ours fail. Many aspects of the project and its context were documented in research papers. These papers reference documents with details about underlying discussions that are long gone from the organizational websites on which they were published, but fortunately were saved for posterity by the indispensable Internet Archive. The petites-histoires featuring people involved in our effort have not been publicly documented because the turn of the century was still social-media free. But personal touches often aptly illustrate the spirit of a project and the zeitgeist of the era in which it took place. Therefore, we feel compelled to also include a few anecdotes that remain glittering sharp in our memory. All in all, despite the fog of time, we are quite confident that the story we tell is an accurate reflection of events that were crucial to the eventual broad adoption of metadata harvesting using the Open Archives protocol (OAI-PMH) and open linking using OpenURL, and, especially, of the crucial role Cliff played in making that happen.
Two PhD Candidates, in for a Surprise
The middle to late 1990s were exciting times for those into computers, networks, and information. Times that seemed to hold an unlimited potential, rather abruptly brought about by the combination of the HTTP/HTML Web, the mainstreaming of the Internet, affordable personal computing, and increased digitization capabilities. Like many others, we were excited about how these technologies could bring about a better world and consequently devoured Wired, a magazine that abounds with “techno-utopianism and hippie-idealism” [3]. We had jobs that presented challenges in which this powerful combination of technologies could be leveraged to imagine and implement innovative solutions.
Herbert became systems librarian at the Ghent University Library in 1981, after completing an administrative automation project there to obtain a degree in Informatics. He didn’t exactly hit the ground running as he was trying to figure out what automation in academic libraries was all about. Most libraries were focusing their efforts on the catalogue but, given his science education, that didn’t seem to tick all the boxes. Eventually, it was the science librarian who turned on the light by putting the automation challenge in terms of the “consultation chain”: first searching secondary sources to find journal articles, then searching catalogues to determine where the journals were, and then obtaining the articles. And so it was that, as soon as CD-ROMs became available, Herbert started providing public access to Abstract & Indexing (A&I) databases, initially on stand-alone PCs, later on PCs in Local Area Networks (LAN), and eventually on PCs across the university’s Wide Area Network. He initiated an effort to create a Belgian Union Catalogue on CD-ROM and hooked it up to the network too (Figure 1). Access dramatically improved but constraints remained: consultation was restricted to Windows PCs, the LANs had to run the Banyan Vines operating system, and networking a large collection of CD-ROMs published by a variety of vendors was a dark art. It all amounted to access being restricted to dedicated library PCs operated in departmental libraries, which was better than what most other European academic libraries had to offer but not good enough for Herbert. That is why he experienced the interoperability fabric introduced by the Web as the chains coming off regarding ways to deliver scholarly information to researchers and students. That enthusiasm resulted in the 1997 release of the Library’s Executive Lounge, a menu-driven environment that provided web-based access to all information that had previously only been available on library PCs, with the addition of some electronic journal collections for good measure. But something was still missing: the Web had links and the Executive Lounge didn’t. Herbert put it as follows:
When using a library solution, the expectations of a net-traveler are inspired by his hyperlinked Web-experiences. To such a user, it is not comprehensible that secondary sources, catalogues and primary sources, that are logically related, are not functionally linked. [4]
Figure 1 - September 1989, Ghent University Library: Herbert showing off a CD-ROM
The frustration expressed in this quote led to a collaboration with SilverPlatter and Ex Libris to implement dynamic links from journal article descriptions in A&I databases to journal holding information in the library catalogue. And it also provided fertile ground for PhD research on how to empower libraries to create links across their electronic collections by means of an open linking framework.
Michael began his professional career at the NASA Langley Research Center (LaRC) in 1991, originally working in the Analysis and Computation Division of the supercomputer center. Early experiences with Usenet and anonymous FTP began to divert his attention from supercomputing and cluster computing (now known as cloud computing) to information networks and libraries. In 1993, he set up an anonymous FTP server, the Langley Technical Report Server (LTRS), for technical memorandums and technical papers published by LaRC. It effectively brought the culture of sharing and accessing technical reports via FTP, which already existed in computer science, to NASA. Later in 1993, he added a web interface to LTRS, providing a much-needed boost in usability. Browsing functionality improved, and abstracts were indexed and became searchable using the Z39.50-based Wide Area Information Server (WAIS), which was pretty much the only free search software at the time (for example, MySQL was not released until 1995). Around the same time, the Center for AeroSpace Information (CASI) brought their own WAIS server online; it provided abstracts for all publicly available, NASA-authored reports and articles. Other centers and projects were inspired by this activity and wanted to set up their own "report server". It became clear that a website - the term "digital library" was not yet widely adopted - allowing simultaneous WAIS search of all the NASA and NASA-affiliated report servers was needed. A bit of Perl hacking later, by Michael and his colleagues, and the NASA Technical Report Server (NTRS) was released in 1994 (Figure 2).
Figure 2: Summer 1999, NASA Langley Research Center: Michael among the desktop machines running LTRS and NTRS
The development of LTRS and NTRS assumed a 1:1 relationship from a metadata record to the URL of the associated full text document. But with the progression from ".ps.Z" to ".pdf" files, the usefulness of that assumption started to break down. It became unworkable by 1998, when Michael created a separate digital library for the scanned documents of the National Advisory Committee for Aeronautics (NACA), the 1915–1958 predecessor of NASA. Obviously, none of these documents were born digital, and a single NACA report presented on the web was composed of TIFF images, large and thumbnail JPEGs, and a PDF of the entire report. Based on the experience of managing and presenting these collections of files as a single web object, Michael's dissertation [5] evolved in the direction of creating buckets, the smart web objects in the Smart Objects, Dumb Archives (SODA) model [6]. The basic premises of SODA were that individual reports are more important than the repositories that hold them, and that it should be possible for multiple digital libraries to simultaneously make them discoverable. This 1997 insight is now commonplace, but went against the conventional wisdom of the time. It precedes, yet aligns with, the perspective of the W3C Architecture of the World Wide Web that individual resources are more important than the web servers that host them [7]. As a matter of fact, the Architecture of the World Wide Web only mentions resources, not web servers.
As Herbert and Michael embarked on their respective PhD explorations on different sides of the Atlantic, they didn’t realize they were about to meet, to collaborate on the UPS project, and to present their results at a meeting that would be moderated brilliantly by Cliff Lynch, a man they both admired but had never met in person.
The UPS Prototype
By early 1999, Herbert’s ideas to give libraries a say regarding links across their electronic collections had taken shape [8]. He had also conducted an experiment illustrating the components of the open linking framework he envisioned. A linking server operated by the library would feature a knowledge base detailing its collection as well as a rule engine that would dynamically decide which links to provide for which type of collection item. A user interested in links for a specific item would click the associated Special Effects link (SFX) that was targeted at the linking server and contained keys that allowed the server to collect sufficient metadata about the item to evaluate the rules and return item-specific links [9]. But inserting SFX links required control of the systems that provided access to the collection and, as such, the experiment only used sources operated locally by the Ghent University Library. Demonstrating the general feasibility of the approach required an experiment without such constraints.
When Rick Luce, director of the Los Alamos National Laboratory Research Library, visited the Ghent Library to check out the linking approach, it became clear that his groundbreaking Library Without Walls project [10] would provide the ideal setting: its collection combined locally and remotely controlled sources, including locally operated full-text, and it maintained close relationships with various parties in the scholarly information industry. So, Herbert packed up in February 1999 for a six-month stint in Los Alamos and successfully conducted an elaborate experiment that demonstrated the feasibility of the approach with sources under both local and remote control, including full-text collections from Wiley and the American Physical Society, and involved linking servers at Los Alamos and Ghent [11].
Figure 3 - Summer 1999, Donna Bergmark’s home: Rick Luce, Herbert, and Paul Ginsparg celebrating the call to action
But Los Alamos was also where the famous physics preprint server - then known as xxx.lanl.gov (now known as arXiv) - ran under Paul Ginsparg’s desk [12]. Having witnessed many years of fierce discussions at Ghent University about subscriptions to journals and their ever-increasing price tag, Herbert very much understood the appeal of the new communication paradigm it entailed and had brought his video camera along to Los Alamos, hoping he might get a chance to interview the much-revered Ginsparg. He shouldn’t have bothered. It turned out that Rick and Paul were already exploring whether the Library Without Walls, which ran a mirror of the preprint server, could become its institutional host, and Herbert started taking part in those conversations.
One brainstorm led to another and by the time Herbert got ready to return to Ghent, the trio published a call to action (Figure 3) for “the further promotion of author self-archived solutions” in which they announced a meeting with 25 invited experts to be held in Santa Fe, NM, in October 1999, to kick things off [13]. The stated goals were “to reach an agreement regarding an approach to build a promotional prototype multidisciplinary digital library service for the main existing e-print archives” and “to create a forum that will continue to address the interoperability of self-archiving solutions, as a means to promote their global adoption” [14]. To this day, Herbert vividly remembers the thrilling moment when he pushed the Send button on his Toshiba laptop to distribute the final version of the call to action to various listservs, while sitting in his tiny apartment in Santa Fe that was sparsely outfitted with rented furniture (Figure 4).
Figure 4: July 27 1999, Calle Mejia: Herbert sends out the invitation for the Santa Fe Meeting
Over time, Herbert had come to understand and embrace the “seeing is believing” power of prototypes. He had decided that a concrete strawman to illustrate services across e-print repositories would be needed to fuel discussions; but he would need collaborators to pull that off. When reaching out to various e-print repositories to obtain metadata dumps, Thomas Krichel, a major force behind the Research Papers in Economics (RePEc) [15] effort, enthusiastically came on board. Rick Luce identified just the other person who was needed. Via the New Mexico Library Alliance, he knew Michael’s supervisor Mike Little; together, they engineered a meeting in Washington, DC, anticipating that their Young Turks would resonate. During a four-day meeting in April 1999 it became clear they very much did, although their initial meeting didn’t get off to a good start due to Michael getting hopelessly lost driving around Dupont Circle, in those pre-GPS days. They drew up technical plans for a prototype and even managed to get a meeting with Deanna Marcum at the Council on Library and Information Resources (CLIR) and Donald (Don) Waters at the associated Digital Library Federation (DLF), securing support and funding for the meeting and the prototype.
Figure 5 – October 1999: UPS Prototype - A bucket for a preprint shows ReDIF metadata and two SFX links
Together, Herbert, Michael, and Thomas started working on the UPS Prototype to be presented at the very outset of the planned Santa Fe Meeting. And, although the prototype was intended “not to make statements about the architectural directions that UPS should take, but rather to facilitate discussions,” [11] its design did entail some significant technical choices. Metadata would be collected from various e-print repositories using static dumps; metadata would be normalized to the ReDIF format [16] used in the RePEc initiative; the SODA model would be used to manage/present individual e-prints as buckets; search across the aggregated metadata would be realized using the NCSTRL+ extension of Dienst that supported buckets; each e-print-specific bucket would provide SFX linking capabilities (Figure 5). In order to realize this all in a six-month period, the prototype trio brought more help on board. And they met twice in person: once in Ghent, where Thomas showed up totally drenched on Herbert’s doorstep after having biked through heavy rain from Ostend; once in Los Alamos, prior to the Santa Fe Meeting, when Thomas arrived hours late having biked (Figure 6) through heavy snow from Albuquerque and spent the night in a drainage pipe (colleagues arriving late was a recurring theme for Herbert). Michael mostly remembers it being bitterly cold, since his earlier visit to Albuquerque in the Summer of that same year had not taught him to pack a sweater for Santa Fe in Fall. And, despite hiccups that plague every project, the well-documented UPS Prototype [17] was finished on time, ready to be presented to the meeting participants.
The Santa Fe Meeting for ...?
In order to further optimize the chances of success for the meeting, the collaboration of Cliff Lynch and Don Waters as moderators had been secured and turned out to be fundamentally important. In the Acknowledgments section of his PhD thesis, Herbert put Cliff’s impact on the direction of the meeting and on his own thinking as follows:
When starting to work on this thesis, I went back reading several of his early papers and could not feel other than intimidated by the far forward-looking vision expressed therein. At several occasions, I heard Cliff address large audiences, discussing complicated digital library matters with an amazing clarity. Cliff's work has always been a great inspiration to me. I met Cliff for the first time in person at the Open Archives meeting in Santa Fe, for which he had enthusiastically accepted my invitation to serve as a moderator. His involvement was crucial to the successful conclusion of the meeting. [18]
Figure 6 - October 21 1999, Fort Marcy: Thomas Krichel’s bike made it to the Santa Fe Meeting
The meeting started off in a very concrete manner, with the presentation of the UPS Prototype, some exposes on repository interoperability, and reflections about institutional versus discipline-oriented archive initiatives. But, as the first day progressed, the discussions got increasingly distracted by back-and-forth arguments about the necessity (or not) of peer-review. The Stevan Harnad “self-archiving” camp (archiving the peer-reviewed version of a contribution on a personal or institutional server) insisted it is essential to keep scholarly communication trustworthy, whereas the Paul Ginsparg “preprint” camp (publishing unreviewed contributions on a discipline-oriented or institutional server) stated that knowledgeable readers can assess quality without external review and that novice readers should wait until a peer-reviewed version becomes available. Michael also remembers Paul saying something to the effect that the meeting would be a lot more productive if everyone just learned how to program in Perl and then do something instead of just talking about it. The peer-review tension had already been present prior to the meeting and is even reflected in the evolution of the title of its announcement: an unpublished version dated April 1999 was entitled “Call for participation aimed at the further promotion of the preprint concept”, the version published in July 1999 was entitled “Call for your participation in the UPS initiative aimed at the further promotion of author self-archived solutions”, whereas post-meeting the title was modified to become “The Open Archives initiative aimed at the further promotion of author self-archived solutions.” The choice of the term “archives” didn’t go down well with professional archivists [19], but it did neutralize the disagreement regarding peer-review. By the end of the first day, when participants mingled at the Santa Fe Institute (Figure 7), Herbert was frustrated despite a successful demonstration of the prototype. His bad mood must have been tangible because Ed Fox, whom Herbert had met for the first time at the meeting, volunteered one of his patented neck massages.
Figure 7 – October 21 1999, Santa Fe Institute: Herbert and Michael at the end of the first meeting day
That night, sleep would not come and Herbert, jetlagged and sleep-deprived, had incessant incoherent thoughts on how to get the meeting back on track. Prior to the start of the second day, he vented his frustration about the lack of progress to Cliff, who was about to start moderating the first session. Cliff was nice enough to let him ramble on a bit, and, in a manner that exemplified one of Cliff’s many unparalleled capabilities, he went on to open the meeting by providing two discussion topics regarding interoperability that he somehow had been able to synthesize from the first day’s discussions, which most had experienced as enjoyable yet lacking in any sense of concrete direction. One was whether archive functions, such as data collection and maintenance, should be decoupled from user functions, such as search. The other was about the choice between distributed searching across repositories and harvesting from them to build cross-repository search engines. This is what the meeting report has to say about the outcome of discussion regarding the first topic:
Although archive initiatives can implement their own end-user services, it is essential that the archives remain "open" in order to allow others to equally create such services. This concept was formalized in the distinction between providers of data (the archive initiatives) and implementers of data services (the initiatives that want to create end-user services for archive initiatives). [20]
The outcome of discussions of the second topic in favor of a harvesting solution is somewhat remarkable because distributed search using WAIS/Z39.50 was quite in vogue in libraries and digital libraries in those days. Cliff himself had a significant track record in Z39.50 and its standardization [21, 22], but he had also identified harvesting approaches as a topic for further research [23]. Motivated by complexity and scalability concerns, he gently nudged discussions in favor of harvesting. In a paper in which he clarifies the complementary nature of Z39.50 and OAI-PMH, Cliff credits the meeting participants for the decision that was considered controversial by some in the community:
The Santa Fe group wanted a very simple, low-barrier-to-entry interface, and to shift implementation complexity and operational processing load away from the repositories and to the developers of federated search services, repository redistribution services, and the like. They also wanted to minimize the interdependency between the quality of applications services as viewed by the user and the behavior of repositories that supplied data to the applications services. [24]
By the end of the meeting, there was a general sense that the UPS Prototype had been helpful to illustrate the potential of cross-repository services and, hence, to emphasize the need for cross-repository interoperability. A paper that provides a rich summary of the Santa Fe Meeting describes it as follows:
There was general agreement among the participants at the meeting that the Prototype was an extremely useful demonstration of potential. There was also agreement, however, that trying to reach consensus on the full functionality of the prototype was "aiming too high" and that a more modest first step was in order. [25]
Towards OAI-PMH and OpenURL
By turning the focus of the meeting on these two topics, Cliff fundamentally changed its course. By thoughtfully guiding the discussions towards these concrete outcomes, he set the stage for work on what would become the Open Archives Initiative Protocol for Metadata Harvesting [26], of which both Herbert and Michael became editors. Undoubtedly, he had plenty of technical skills that would have allowed him to make significant contributions to the actual specification effort. But in a manner that characterizes Cliff, he silently took a step back and let the community decide its direction while expressing continued support for the work, on many occasions, and at venues around the world. There can be no doubt that his endorsement played a crucial role in the global adoption of OAI-PMH, which has been an integral part of the scholarly and cultural heritage infrastructure for over two decades.
The focus on interoperability to realize just a single aspect demonstrated by the prototype - cross-repository discovery - also meant that discussions about its other technical ingredients, including SFX linking, would have to be postponed. But both Cliff and Don were very much aware of the problem it addressed and about the nature of the proposed solution. They were both part of the NISO Reference Linking Working Group [27] that investigated how to tackle the so-called “appropriate copy problem”, which, simplifying the charge to the Group, can be summarized as follows: “how to resolve a reference link to a paper in such a manner that it ends up at one of potentially many distributed copies of that paper to which a user, covered by an institutional subscription, has access?”
The Working Group resulted from a meeting in February 1999 [28], in which various models for a link localization solution had been explored [29, 30]. Don Waters invited Herbert to present his linking work at a second meeting in June 1999 [31]. And, although the meeting report has nice things to say about SFX linking [32], including its ability to address link localization challenges beyond the appropriate copy problem, he remembers profusely apologizing to Don about a presentation not done well. Still, after the demonstration at the Santa Fe Meeting, Cliff extended an invitation for a presentation at the Spring 2000 meeting of the Coalition for Networked Information [33]. The room was packed with representatives from libraries, the scholarly publishing industry, and library system vendors, and the talk became a veritable breakthrough moment for SFX linking. But significant tasks remained, including standardizing the SFX link syntax and demonstrating the ability of the approach to integrate with the emerging DOI-based reference linking approach pursued by journal publishers and instantiated by CrossRef [34].
The standardization’s history is well documented [35]; it started in December 2000 when the original SFX URL specification [36] - by then renamed OpenURL - was submitted to NISO and concluded five years later with the release of The OpenURL Framework for Context-Sensitive Services [37]. The DOI integration was explored by means of a limited prototype [38] that was demonstrated and discussed at the July 2000 NISO/DLF/CrossRef meeting [39]. As the meeting seemed to reach a consensus in favor of the proposed model with an institutional localization component powered by OpenURL - essentially the SFX open linking approach - a question was brought forward as to whether the model with a centralized localization component that had been identified in the first meeting of the Working Group should also be further discussed. At that point, Cliff decidedly stepped in stating “No. We have a solution!” In doing so, he paved the way for the endorsement of the OpenURL linking framework by the Working Group, the rigorous testing of its feasibility in an extended prototype [40], and its eventual acceptance in the US scholarly communication community and beyond. Afterwards, Cliff continued to express support for the approach at numerous venues and gave it his strongest possible endorsement by becoming a member of Herbert’s PhD jury.
Thank you, Cliff
By means of the UPS Prototype effort, this essay has illustrated Cliff’s fundamental impact on the direction infrastructure for research, education, and cultural heritage has taken in the past decades. Two technologies, OpenURL that was used in the Prototype and OAI-PMH that resulted from the Prototype, became an integral part of that infrastructure. Hopefully, the essay has adequately shown that Cliff had a significant part in making that happen, not as an author of specifications, a writer of code, or a builder of tools, but rather as an identifier of problems to come and as a perceptive influencer, gently nudging forward the solutions he believed in and strongly supporting the community efforts that realized them. We have witnessed the same impact in other efforts we have been involved in since the UPS Prototype and can safely assume that others have experienced it in their projects aimed at improving the status quo of scholarly information infrastructure.
We do want to emphasize that, as we dreamt up the outlines of the UPS Prototype, we were early career researchers with a visible, yet modest track record. Cliff (CNI), along with Paul Ginsparg (LANL), Rick Luce (LANL), Deanna Marcum (CLIR), and Don Waters (DLF) strongly and publicly endorsed our effort, shone the spotlight on us, and in doing so had a major impact on our career trajectories. We vividly remember receiving that support and the experience has led us to similarly support the young researchers we have mentored since.
Figure 8 – December 12 2017, Washington, DC: Cliff and Herbert at the Fall 2017 CNI Membership Meeting
As we were selected to write a contribution for this Festschrift, on behalf of all infrastructure plumbers, we want to profoundly thank Cliff. Scholarly infrastructure would not have progressed the way it did without him. We don’t envy the person who will step into his shoes once he has retired. The work ahead is enormous, with needs for new infrastructure and existing infrastructure crumbling. Indeed, OAI-PMH is being supplanted due to its reliance on XML, a technology that has become arcane in a JSON world. And the OpenURL Framework is under attack by the centralized Get Full Text Research [41] effort, launched by the major commercial publishers, which mutes the capabilities of libraries to influence the nature of links across their electronic collections. While 25 years of OAI-PMH and OpenURL do not put those technologies in the same IT infrastructure league as - say - UNIX, it is a substantial period considering that the lifetime of many digital library phenomena can typically be measured in terms of months or years, not decades. Cliff’s influence is directly traceable in the global penetration and longevity of these two technologies that go all the way back to the 1999 UPS Prototype.
[2] Robert Ashley, David Behrman, Philip Glass, Alvin Lucier, Gordon Mumma, Pauline Oliveros, Terry Riley. “Music with roots in the aether”. June 1, 1977-June 18, 1977. The Kitchen, New York, New York. https://thekitchen.org/on-file/music-with-roots-in-the-aether/
[4] Herbert Van de Sompel, Patrick Hochstenbach, and Tobias De Pessemier, “The hybrid information environment and our Intranet solution to access it“, Ghent University Academic Bibliography, 1997, accessed on January 27, 2025, https://hdl.handle.net/1854/LU-1056689
[5] Michael L. Nelson, "Buckets: smart objects for digital libraries,", PhD dissertation, Old Dominion University Digital Commons, 2000, accessed on January 28, 2025, https://doi.org/10.25777/gbh6-7d07
[6] Michael L. Nelson et al. “SODA: Smart Objects, Dumb Archives,” Lecture Notes in Computer Science 1696, (1999): 453-464, accessed on January 27, 2025, https://doi.org/10.1007/3-540-48155-9_28
[7] Ian Jacobs and Norman Walsh, “Architecture of the World Wide Web, Volume One,” W3C Recommendation, 15 December, 2004, accessed on January 27, 2025, https://www.w3.org/TR/webarch/
[8] Herbert Van de Sompel and Patrick Hochstenbach. “Reference Linking in a Hybrid Library Environment. Part 1: Frameworks for Linking,” D-Lib Magazine 5, no 4 (1999), accessed on January 27, 2025, https://doi.org/10.1045/april99-van_de_sompel-pt1
[9] Herbert Van de Sompel and Patrick Hochstenbach. “Reference Linking in a Hybrid Library Environment. Part 2: SFX, a Generic Linking Solution,” D-Lib Magazine 5, no 4 (1999), accessed on January 27, 2025, https://doi.org/10.1045/april99-van_de_sompel-pt2
[10] “Library Without Walls Welcome Page,” 1999, archived at the Wayback Machine, April 28, 1999,
[11] Herbert Van de Sompel and Patrick Hochstenbach. “Reference Linking in a Hybrid Library Environment. Part 3: Generalizing the SFX Solution in the SFX@Ghent & SFX@LANL experiment,” D-Lib Magazine 5, no 10 (1999), accessed on January 27, 2025, https://doi.org/10.1045/october99-van_de_sompel
[18] Herbert Van de Sompel. “Dynamic and context-sensitive linking of scholarly information,“ Ghent University Academic Bibliography, 2000, accessed on January 27, 2025, https://hdl.handle.net/1854/LU-522209
[19] Michael L. Nelson, "To the Editor: Response to Peter Hirtle's April 2001 editorial, OAI and OAIS: What's in a Name?," D-Lib Magazine 7, no 5 (2001), accessed on February 12, 2025, https://doi.org/10.1045/may2001-letters
[21] Clifford A. Lynch. “RFC1729: Using the Z39.50 Information Retrieval Protocol in the Internet Environment,” December, 1994, accessed on January 27, 2025, https://datatracker.ietf.org/doc/rfc1729/
[22] Clifford A. Lynch. “The Z39.50 Information Retrieval Standard - Part I: A Strategic View of Its Past, Present and Future,” D-Lib Magazine 3, no 4 (1997), accessed on January 27, 2025, https://hdl.handle.net/cnri.dlib/april97-lynch
[25] Herbert Van de Sompel and Carl Lagoze. “The Santa Fe Convention of the Open Archives Initiative,” D-Lib Magazine 6, no 2 (2000), accessed on January 27, 2025, https://doi.org/10.1045/february2000-vandesompel-oai
[26] Carl Lagoze, Herbert Van de Sompel, Michael L. Nelson, and Simeon Warner. "The Open Archives Initiative Protocol for Metadata Harvesting," June 14, 2002, accessed on January 28, 2025, https://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
[30] Priscilla Caplan and William Y. Arms. “Reference Linking for Journal Articles,” D-Lib Magazine 5, no 7/8 (1999), accessed on January 27, 2025, https://doi.org/10.1045/july99-caplan
[31] National Information Standards Organization. “Second Workshop on linkage from citations to electronic journal literature,” June 1999, archived at the Wayback Machine, July 7, 2000,
[34] Helen Atkins et al. “Reference Linking with DOIs: A Case Study,” D-Lib Magazine 6, no 2 (2000), accessed on January 27, 2025, https://doi.org/10.1045/february2000-risher
[38] Herbert Van de Sompel and Oren Beit-Arie. “Open Linking in the Scholarly Information Environment Using the OpenURL Framework,” D-Lib Magazine 7, no 3 (2001), accessed on January 27, 2025, https://doi.org/10.1045/march2001-vandesompel
[39] "Meeting Report of the NISO/DLF/CrossRef Workshop on Localization in Reference Linking,” July, 2000, archived at the Wayback Machine, December 6, 2000,
Witt's account of the business side of the early history is much less detailed and some of the details don't match what I remember.
But as regards the technical aspects of this early history, it appears that neither author really understood the reasons for the two kinds of innovation we made: the imaging model and the I/O architecture. Witt writes (Page 31):
The first time I asked Priem about the architecture of the NV1, he spoke uninterrupted for twenty-seven minutes.
Below the fold, I try to explain what Curtis was talking about for those 27 minutes. It will take me quite a long post.
The opportunity we saw when we started Nvidia was that the PC was transitioning from the PC/AT bus to version 1 of the PCI bus. The PC/AT bus' bandwidth was completely inadequate for 3D games, but the PCI bus had considerably more. Whether it was enough was an open question. We clearly needed to make the best possible use of the limited bandwidth we could get.
We had two basic ways of making "the best possible use of the limited bandwidth":
Reduce the amount of data we needed to ship across the bus for a given image.
Increase the amount of data shipped in each cycle of the bus.
Imaging Model
A triangle is the simplest possible description of a surface. Thus almost the entire history of 3D computer graphics has modeled the surfaces of 3D objects using triangles. But there is a technique, dating back at least to Robert Mahl's 1972 paper Visible Surface Algorithms for Quadric Patches, for modeling curved surfaces directly. It takes a lot more data to describe a quadric patch than a triangle. But to achieve equivalent realism you need so many fewer patches that the amount of data for each frame is reduced by a significant factor.
Virtua Fighter on NV1
As far as I know at the time only Sega in the video game industry used quadric patches. When we launched NV1 at Comdex we were able to show Sega arcade games such as Virtua Fighter running on a PC at full frame rate, a first for the industry. The reason was that NV1 used quadric patches and thus made better use of the limited PCI bus bandwidth.
At Sun, James Gosling and I built the extremely sophisticated and forward-looking but proprietary NeWS window system. At the same time, I also worked with engineers at competitors such as Digital Equipment to build the X Window System. One of my many learning experiences at Sun came early in the long history of the X Window System. It rapidly became obvious to me that there was no way NeWS could compete with the much simpler, open-source X. I argued for Sun to open-source NeWS and failed. I argued for Sun to drop NeWS and adopt X, since that was what application developers wanted. Sun wasted precious time being unable to decide what to do, finally deciding not to decide and wasting a lot of resource merging NeWS and X into a kludge that was a worse NeWS and a worse X than its predecessors. This was just one of a number of fights at Sun I lost (this discusses another).
Once Microsoft announced Direct X it was obvious to me that Nvidia was doomed if the next chip did quadric patches, because the developers would have to work with Direct X's triangles. But, like Sun, Nvidia seemed unable to decide to abandon its cherished technology. Time for a decision to be effective was slipping away. I quit, hoping to shake things up so as to enable a decision to do triangles. It must have worked. The books recount how close Nvidia was to bankruptcy when RIVA 128 shipped. The rest is history for which I was just an observer.
I/O Architecture
In contrast the I/O architecture was, over time, the huge success we planned. Kim writes (Page 95):
Early on, Curtis Priem had invented a "virtualized objects" architecture that would be incorporated in all of Nvidia's chips. It became an even bigger advantage for the company once Nvidia adopted the faster cadence of chip releases. Priem's design had a software based "resource manager", essentially a miniature operating system that sat on top of the hardware itself. The resource manager allowed Nvidia's engineers to emulate certain hardware features that normally needed to be physically printed onto chip circuits. This involved a performance cost but accelerated the pace of innovation, because Nvidia's engineers could take more risks. If the new feature wasn't ready to work in the hardware, Nvidia could emulate it in software. At the same time, engineers could take hardware features out when there was enough leftover computing power, saving chip area.
For most of Nvidia's rivals, if a hardware feature on a chip wasn't ready, it would mean a schedule slip. Not, though, at Nvidia, thanks to Priem's innovation. "This was the most brilliant thing on the planet," said Michael Hara. "It was our secret sauce. If we missed a feature or a feature was broken, we could put it in the resource manager and it would work." Jeff Fisher, Nvidia's head of sales, agreed: "Priem's architecture was critical in enabling Nvidia to design and make new products faster."
Context
Nvidia is just one of the many, many startups that Sun Microsystems spawned. But at the time what made Nvidia unique among the competing graphics startups was the early engineers from the team at Sun that built the GX series of graphics chips. We went through an intensive education in the techniques needed to implement graphics effectively in Unix, a multi-process, virtual memory operating system. The competitors all came from a Windows background, at the time a single-process, non-virtual memory system. We understood that, in the foreseeable future, Windows would have to evolve multi-processing and virtual memory. Thus the pitch to the VCs was that we would design a "future-proof" architecture, and deliver a Unix graphics chip for the PC's future operating system.
The GX team also learned from the difficulty of shipping peripherals at Sun, where the software and hardware schedules were inextricable because the OS driver and apps needed detailed knowledge of the physical hardware. This led to "launch pad chicken", as each side tried to blame schedule slippage on the other.
Write-mostly
Here is how we explained the problem in US5918050A: Apparatus accessed at a physical I/O address for address and data translation and for context switching of I/O devices in response to commands from application programs (inventors David S. H. Rosenthal and Curtis Priem), using the shorthand "PDP11 architecture" for systems whose I/O registers were mapped into the same address space as system memory:
Not only do input/output operations have to be carried out by operating system software, the design of computers utilizing the PDP11 architecture usually requires that registers at each of the input/output devices be read by the central processing unit in order to accomplish any input/output operation. As central processing units have become faster in order to speed up PDP11 type systems, it has been necessary to buffer write operations on the input/output bus because the bus cannot keep up with the speed of the central processing unit. Thus, each write operation is transferred by the central processing unit to a buffer where it is queued until it can be handled; other buffers in the line between the central processing unit and an input/output device function similarly. Before a read operation may occur, all of these write buffers must be flushed by performing their queued operations in serial order so that the correct sequence of operations is maintained. Thus, a central processing unit wishing to read data in a register at an input/output device must wait until all of the write buffers have been flushed before it can gain access to the bus to complete the read operation. Typical systems average eight write operations in their queues when a read operation occurs, and all of these write operations must be processed before the read operation may be processed. This has made read operations much slower than write operations. Since many of the operations required of the central processing unit with respect to graphics require reading very large numbers of pixels in the frame buffer, then translating those pixels, and finally rewriting them to new positions, graphics operations have become inordinately slow. In fact, modern graphics operations were the first operations to disclose this Achilles heel of the PDP11 architecture.
We took two approaches to avoiding blocking the CPU. First, we implemented a queue in the device, a FIFO (First In First Out), that was quite long, and we allowed the CPU to read from the FIFO the number of free slots, the number of writes it could do and be guaranteed not to block. When the CPU wanted to write to NV1 it would ask the FIFO how many writes it could do. If the answer were N, it would do N writes before asking again. NV1 would acknowledge each of those writes immediately, allowing the CPU to proceed to compute the data for the next write. This was the subject of US5805930A: System for FIFO informing the availability of stages to store commands which include data and virtual address sent directly from application programs (inventors David S. H. Rosenthal and Curtis Priem), the continuation of an application we filed 15th May 1995. Note that this meant the application didn't need to know the size of the device's FIFO. If a future chip had a bigger or smaller FIFO, the unchanged application would use it correctly.
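To make the protocol concrete, here is a minimal sketch in Python. It is purely illustrative: the real interface was a hardware register read over the bus, and all class and method names below are hypothetical.

from collections import deque

class DeviceFifo:
    def __init__(self, depth):
        self.depth = depth          # size of the hardware FIFO
        self.queue = deque()        # commands waiting for the device

    def free_slots(self):
        # The CPU reads this count once, then may do that many writes
        # without any risk of blocking.
        return self.depth - len(self.queue)

    def write(self, command):
        assert len(self.queue) < self.depth, "caller violated the protocol"
        self.queue.append(command)  # acknowledged immediately

    def drain(self, n=1):
        # The device consumes commands at its own pace.
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()

def submit(fifo, commands):
    # The application never needs to know the FIFO's depth: it just
    # alternates "ask how many free slots" and "do that many writes".
    pending = deque(commands)
    while pending:
        n = fifo.free_slots()
        if n == 0:
            fifo.drain()            # in reality: wait for the device to catch up
            continue
        for _ in range(min(n, len(pending))):
            fifo.write(pending.popleft())

The point of the design is visible in submit(): the application adapts automatically to whatever FIFO depth the current chip happens to have.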
Second, we tried as far as possible not to use the CPU to transfer data to and from NV1. Instead, whenever we could we used Direct Memory Access, in which the I/O device reads and writes system memory independently of the CPU. In most cases, the CPU instructed NV1 to do something with one, or a few writes, and then got on with its program. The instruction typically said "here in memory is a block of quadric patches for you to render". If the CPU needed an answer, it would tell NV1 where in system memory to put it and, at intervals, check to see if it had arrived.
Remember that we were creating this architecture for a virtual memory system in which applications had direct access to the I/O device. The applications addressed system memory in virtual addresses. The system's Memory Management Unit (MMU) translated these into the physical addresses that the bus used. When an application told the device the address of the block of patches, it could only send the device one of its virtual addresses. To fetch the patches from system memory, the DMA engine on the device needed to translate the virtual address into a physical address on the bus in the same way that the CPU's MMU did.
So NV1 didn't just have a DMA engine, it had an IOMMU as well. We patented this IOMMU as US5758182A: DMA controller translates virtual I/O device address received directly from application program command to physical i/o device address of I/O device on device bus (inventors David S. H. Rosenthal and Curtis Priem). In 2014's Hardware I/O Virtualization I explained how Amazon ended up building network interfaces with IOMMUs for the servers in AWS data centers so that multiple virtual machines could have direct access to the network hardware and thus eliminate operating system overhead.
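As a rough illustration of what any IOMMU does (this is not NV1's actual page-table format; the 4 KB page size and the names are assumptions), the translation step looks something like this:

PAGE_SIZE = 4096

def translate(io_page_table, virtual_address):
    # Map a virtual page number to a physical frame number, mirroring
    # the entries the driver copied from the CPU's MMU.
    vpn = virtual_address // PAGE_SIZE          # virtual page number
    offset = virtual_address % PAGE_SIZE        # offset within the page
    try:
        pfn = io_page_table[vpn]                # physical frame number
    except KeyError:
        raise RuntimeError("IOMMU fault: page not mapped for DMA")
    return pfn * PAGE_SIZE + offset

# Example: once one page-table entry is in place, the DMA engine can
# fetch the block of patches the application named by its virtual address.
io_page_table = {0x12345: 0x00042}
physical = translate(io_page_table, 0x12345678)   # -> 0x42678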
Context switching
The fundamental problem for graphics support in a multi-process operating system such as Unix (and later Linux, Windows, MacOS, ...) is that of providing multiple processes the illusion that each has exclusive access to the single graphics device. I started fighting this problem in 1983 at Carnegie-Mellon. James Gosling and I built the Andrew Window System, which allowed multiple processes to share access to the screen, each in its own window. But they didn't have access to the real hardware. There was a single server process that accessed the real hardware. Applications made remote procedure calls (RPCs) to this server, which actually drew the requested graphics. Four decades later the X Window System still works this way.
RPCs imposed a performance penalty that made 3D games unusable. To allow, for example, a game to run in one window while a mail program ran in another, we needed the currently active process to have direct access to the hardware, and if the operating system context-switched to a different graphics process, give that process direct access to the hardware. The operating system would need to save the first process' state from the graphics hardware, and restore the second process' state.
Our work on this problem at Sun led to a patent filed in 1989, US5127098A: Method and apparatus for the context switching of devices (inventors David S. H. Rosenthal, Robert Rocchetti, Curtis Priem, and Chris Malachowsky). The idea was to have the device mapped into each process' memory but to use the system's memory management unit (MMU) to ensure that at any one time all but one of the mappings was invalid. A process' access to an invalid mapping would interrupt into the system's page fault handler, which would invoke the device's driver to save the old process' context and restore the new process' context. The general problem with this idea is that, because the interrupt ends up in the page fault handler, it requires device-dependent code in the page fault handler. This is precisely the kind of connection between software and hardware that caused schedule problems at Sun.
There were two specific Nvidia problems with this idea. First that Windows wasn't a virtual memory operating system so you couldn't do any of this. And second that even once Windows had evolved into a virtual memory operating system, Microsoft was unlikely to let us mess with the page fault handler.
As you can see in Figure 6 of the '930 patent, the I/O architecture consisted of an interface between the PCI bus and an internal bus that could implement a number of different I/O devices. The interface provided a number of capabilities:
It implemented the FIFO, sharing it among all the devices on the internal bus.
It implemented the DMA engine and its IOMMU, sharing it among all the devices on the internal bus.
Using a translation table, it allowed applications to connect to a specific device on the internal bus via the interface using a virtual name.
It ensured that only one application at a time could access the interface.
The difference between the PCI and PC/AT buses wasn't just that the data path grew from 16 to 32 bits, but also that the address bus grew from 24 to 32 bits. The address space was 256 times bigger, thus Nvidia's devices could occupy much more of it. We could implement many virtual FIFOs, so that each application could have a valid mapping to one of them. The device, not the operating system, would ensure that only one of the virtual FIFOs was mapped to the single physical FIFO. A process accessing a virtual FIFO that wasn't mapped to the physical FIFO would cause an interrupt, but this time the interrupt would go to the device's driver, not the page fault handler. The driver could perform the context switch, and re-assign the physical FIFO to the new virtual FIFO. It would also have to copy page table entries from the CPU's MMU into the IOMMU to reflect the placement of the new process' pages in physical memory. There would be no page fault so no knowledge of the device in the operating system's page fault handler. As we wrote in the '050 patent:
the use of many identically-sized input/output device address spaces each assigned for use only by one application program allows the input/output addresses to be utilized to determine which application program has initiated any particular input/output write operation.
Because applications each saw their own virtual FIFO, future chips could implement multiple physical FIFOs, allowing the virtual FIFO of more than one process to be assigned a physical FIFO, which would reduce the need for context switching.
One of the great things about NeWS was that it was programmed in PostScript. We had figured out how to make PostScript object-oriented, homomorphic to SmallTalk. We organized objects in the window system in a class hierarchy with inheritance. This, for example, allowed Don Hopkins to implement pie menus for NeWS in such a way that any user could replace the traditional rectangular menus with pie menus. This was such fun that Owen Densmore and I used the same technique to implement object-oriented programming for the Unix shell.
At a time when PC memory maxed out at 640 megabytes, the fact that the PCI bus could address 4 gigabytes meant that quite a few of its address bits were surplus. So we decided to increase the amount of data shipped in each bus cycle by using some of them as data. IIRC NV1 used 23 address bits, occupying 1/512th of the total space. 7 of the 23 selected one of the 128 virtual FIFOs, allowing 128 different processes to share access to the hardware. We figured 128 processes was plenty.
The remaining 16 address bits could be used as data. In theory the FIFO could be 48 bits wide, 32 from the data lines on the bus and 16 from the address lines, a 50% increase in bits per bus cycle. NV1 ignored the byte part of the address so the FIFO was only 46 bits wide.
So we organized the objects in our I/O architecture in a class hierarchy, rooted at class CLASS. The first thing an application did was to invoke the enumerate() method on the object representing class CLASS. This returned a list of the names of all the instances of class CLASS, i.e. all the object types this instance of the architecture implemented. In this way the capabilities of the device weren't wired into the application. The application asked the device what its capabilities were. In turn, the application could invoke enumerate() on each of the instances of class CLASS in the list, which would get the application a list of the names of each of the instances of each class, perhaps LINE-DRAWER. Thus the application would find out, rather than know a priori, the names of all the resources (virtual objects) of all the different types that the device supported.
The application could then create objects, instances of these classes, by invoking the instantiate() method on the class object with a 32-bit name for the newly created object. The interface was thus limited to 4B objects for each application. The application could then select() the named object, causing an interrupt if there was no entry for it in the translation table so the resource manager could create one. The 64Kbyte address space of each FIFO was divided into 8 8K "sub-areas". The application could select() an object in each, so it could operate on 8 objects at a time. Subsequent writes to each sub-area were interpreted as method invocations on the selected object, the word offset from the base of each sub-area within the 8Kbyte space specifying the method and the data being the argument to the method. The interface thus supported 2048 different methods per object.
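Putting the numbers above together, a write address decodes into a virtual FIFO, a sub-area and a method roughly as follows. The exact bit positions are my reconstruction from the description, not taken from any hardware documentation.

def decode(address):
    # 23 address bits in use: 7 select one of 128 virtual FIFOs, the
    # remaining 16 address the 64 KB space of that FIFO.
    assert 0 <= address < (1 << 23)
    fifo = (address >> 16) & 0x7F          # one of 128 virtual FIFOs
    offset = address & 0xFFFF              # 64 KB space per FIFO
    subarea = offset >> 13                 # 8 sub-areas of 8 KB each
    method = (offset & 0x1FFF) >> 2        # word offset: 2048 methods
    return fifo, subarea, method

# A write of a 32-bit argument to this address invokes method 3 on the
# object currently select()ed in sub-area 5 of virtual FIFO 17.
print(decode((17 << 16) | (5 << 13) | (3 << 2)))   # -> (17, 5, 3)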
In this way we ensured that all knowledge of the physical resources of the device was contained in the resource manager. It was the resource manager that implemented class CLASS and its instances. Thus it was that the resource manager controlled which instances of class CLASS (types of virtual object) were implemented in hardware, and which were implemented by software in the resource manager. It was possible to store the resource manager's code in read-only memory on the device's PCI card, inextricably linking the device and its resource manager. The only thing the driver for the board needed to be able to do was to route the device's interrupts to the resource manager.
The importance of the fact that all an application could do was to invoke methods on virtual objects was that the application could not know whether the object was implemented in hardware or in the resource manager's software. The flexibility to make this decision at any time was a huge advantage. As Kim quotes Michael Hara as saying:
"This was the most brilliant thing on the planet. It was our secret sauce. If we missed a feature or a feature was broken, we could put it in the resource manager and it would work."
Conclusion
As you can see, NV1 was very far from the "minimum viable product" beloved of today's VCs. Their idea is to get something into users' hands as soon as possible, then iterate rapidly based on their feedback. But what Nvidia's VCs did by giving us the time to develop a real chip architecture was to enable Nvidia, after the failure of the first product, to iterate rapidly based on the second. Iterating rapidly on graphics chips requires that applications not know the details of successive chips' hardware.
I have been privileged in my career to work with extraordinarily skilled engineers. Curtis Priem was one, others included James Gosling, the late Bill Shannon, Steve Kleiman, and Jim Gettys. This search returns 2 Sun patents and 19 Nvidia patents for which both Curtis Priem and I are named inventors. Of the Nvidia patents, Curtis is the lead inventor on 9 and I am the lead inventor on the rest. Most describe parts of the Nvidia architecture, combining Curtis' exceptional understanding of hardware with my understanding of operating systems to redefine how I/O should work. I rate this architecture as my career best engineering. It was certainly the most impactful. Thank you, Curtis!
How to tell if someone's bullshitting: watch for them to give a deadline that they repeatedly push back.
This was apropos of Donald Trump's approach to tariffs and Ukraine, but below the fold I apply the criterion to Elon Musk basing Tesla's future on its robotaxi service.
Jonathan V. Last's "A Song of “Full Self-Driving”: Elon Isn't Tony Stark. He's Michael Scott." shows that Musk's bullshitting started almost a decade ago:
For years, Elon Musk has been promising that Teslas will operate completely autonomously in “Full Self Driving” (FSD) mode. And when I say years, I mean years:
December 2015: “We’re going to end up with complete autonomy, and I think we will have complete autonomy in approximately two years.”
January 2016: “In ~2 years, summon should work anywhere connected by land & not blocked by borders, eg you’re in LA and the car is in NY.”
June 2016: “I really would consider autonomous driving to be basically a solved problem. . . . I think we’re basically less than two years away from complete autonomy, complete—safer than a human. However regulators will take at least another year.”
October 2016: By the end of 2017 Tesla will demonstrate a fully autonomous drive from “a home in L.A., to Times Square . . . without the need for a single touch, including the charging.”
March 2018: “I think probably by end of next year [end of 2019] self-driving will encompass essentially all modes of driving”
February 2019: “I think we will be feature complete—full self-driving—this year. Meaning the car will be able to find you in a parking lot, pick you up, take you all the way to your destination without an intervention, this year."
@motherfrunker" tracks this BS, and the most recent entry is:
January 2022: I will be shocked if we don't achieve FSD safer than a human this year
But finally, on June 22nd, Tesla's robotaxi revolution arrived. Never one to miss an opportunity to pump the stock with bullshit, Musk:
envisions a future fleet, including a new “Cybercab” and “Robovan” with no steering wheels or pedals, that could boost Tesla’s market value by an astonishing $5 trillion to $10 trillion. On June 20, Tesla was worth $1.04 trillion
“My view is the golden age of autonomous vehicles starting on Sunday in Austin for Tesla,” said Wedbush analyst Dan Ives. “I believe it’s a trillion dollar valuation opportunity for Tesla.”
Dan Ives obviously only sipped 10-20% of Musk's Kool-Aid. Others drank deeper:
Investor Cathie Wood’s ARK Invest predicts robotaxis could account for 90% of Tesla’s profits by 2029. If they are right, this weekend’s launch was existential.
Tesla's net income from the trailing 12 months is around $6.1B and falling. Assuming, optimistically, that they can continue to sell cars at the current rate, Cathie Wood is assuming that robotaxi profits would be around $60B. Tesla's net margin is around 6%, so this implies revenue of almost $1T in 2029. Tesla charges $4.20/ride (ha! ha!), so this implies that they are delivering 231B rides/year, or around 23,000 times the rate of the entire robotaxi industry currently. Wood is projecting that in four years' time Tesla's robotaxi business will have almost as much revenue as Amazon ($638B), Microsoft ($245B) and Nvidia ($130B) combined.
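A minimal sketch of that back-of-the-envelope arithmetic, using only the figures above; the rounding here differs slightly from the post's rounded ~$60B, ~$1T and 231B figures, but the chain of reasoning is the same.

other_profit = 6.1e9              # trailing-12-month net income, non-robotaxi
robotaxi_share = 0.90             # ARK: 90% of profits from robotaxis by 2029
robotaxi_profit = other_profit * robotaxi_share / (1 - robotaxi_share)
                                  # ~$55B, rounded up to "around $60B" in the text
net_margin = 0.06                 # Tesla's current net margin
revenue = robotaxi_profit / net_margin          # ~$0.9T, "almost $1T"
price_per_ride = 4.20
rides_per_year = revenue / price_per_ride       # a couple of hundred billion rides
print(f"profit ${robotaxi_profit/1e9:.0f}B, revenue ${revenue/1e12:.2f}T, "
      f"rides {rides_per_year/1e9:.0f}B/year")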
"On generous assumptions, Tesla’s core EV business, generating 75% of gross profit but with falling sales, might be worth roughly $50 per share, only 15% of the current price. Much of the remainder relates to expectations around self driving. RBC Capital, for example, ascribes 59% of its price target, or $181 per share, to robotaxis and a further $53 to monetizing Full Self Driving technology. Combined, that is a cool $815 billion based on double-digit multiples ascribed to modeled revenue — not earnings — 10 to 15 years from now because, after all, it relates to businesses that barely make money today."
Tesla’s much-anticipated June 22 “no one in the vehicle” “unsupervised” Robotaxi launch in Austin is not ready. Instead, Tesla is operating a limited service with Tesla employees on board the vehicle to maintain safety.
...
Having an employee who can intervene on board, commonly called a safety driver, is the approach that every robocar company has used for testing, including testing of passenger operations. Most companies spend many years (Waymo spent a decade) testing with safety drivers, and once they are ready to take passengers, there are typically some number of years testing in that mode, though the path to removing the safety driver depends primarily on evaluation of the safety case for the vehicle, and less on the presence of passengers.
In addition to Musk’s statements about the vehicle being unsupervised, with nobody inside, in general the removal of the safety driver is the biggest milestone in development of a true robotaxi, not an incremental step that can be ignored. As such, Tesla has yet to meet its goals.
Seven-and-a-half years after Musk's deadline for "complete autonomy" the best Tesla can do is a small robotaxi service for invited guests in a geofenced area of Austin with a safety driver in daylight. Waymo has 100 robotaxis in service in Austin. Three months ago Brad Templeton reported that:
Waymo, the self-driving unit of Alphabet, announced recently that they are now providing 200,000 self-driving taxi rides every week with no safety driver in the car, only passengers.
...
In China, though, several companies are giving rides with no safety driver. The dominant player is Baidu Apollo, which reports they did 1.1 million rides last quarter, which is 84,000 per week, and they now are all no-safety-driver. Pony.AI claims 26,000 per week, but it is not clear if all are with no safety driver. AutoX does not report numbers, but says it has 1,000 cars in operation. WeRide also does not report numbers.
US auto safety regulators are looking into incidents where Tesla Inc.’s self-driving robotaxis appeared to violate traffic laws during the company’s first day offering paid rides in Austin.
...
In one video taken by investor Rob Maurer, who used to host a Tesla podcast, a Model Y he’s riding in enters an Austin intersection in a left-turn-only lane. The Tesla hesitates to make the turn, swerves right and proceeds into an unoccupied lane meant for traffic moving in the opposite direction.
A honking horn can be heard as the Tesla re-enters the correct lane over a double-yellow line, which drivers aren’t supposed to cross.
In two other posts on X, initial riders in driverless Model Ys shared footage of Teslas speeding. A vehicle carrying Sawyer Merritt, a Tesla investor, reached 35 miles per hour shortly after passing a 30 miles per hour speed limit sign, a video he posted shows.
But immediately after that rollout, Tesla drivers started racking up fines for violating the law. Many roads in China are watched by CCTV cameras, and fines are automatically handed out to drivers who break the law.
It’s clear that the system still needs more knowledge about Chinese roads in general, because it kept mistaking bike lanes for right turn lanes, etc. One driver racked up 7 tickets within the span of a single drive after driving through bike lanes and crossing over solid lines. If a driver gets enough points on their license, they could even have their license suspended.
Why did Tesla roll out their $8K "Intelligent Assisted Driving" in China? It might have something to do with this:
There are already many competing robotaxi services in China. For example:
Baidu is already operating robotaxi services in multiple cities in China. It provided close to 900,000 rides in the second quarter of the year, up 26 per cent year-on-year, according to its latest earnings call. More than 7 million robotaxi rides in total had been operated as of late July.
That was a year ago. It isn't just Waymo that is in a whole different robotaxi league than Tesla. And let's not talk about the fact that BYD, Xiaomi and others outsell Tesla in China because their products are better and cheaper. Tesla's response? Getting the White House to put a 25% tariff on imported cars.
I should have done more research for this post, then I would have found out that, unlike Tesla's CyberCab, Waymo's custom robotaxi is real. This morning I drove alongside a test vehicle in Palo Alto. Wikipedia has the details:
In July 2024, Waymo began testing its sixth-generation robotaxis which are based on electric vehicles by Chinese automobile company Zeekr, developed in a partnership first announced in 2021.
The "Waymo Driver Integration Plant," a 239,000 square foot facility outside of Phoenix, will assemble more than 2,000 Jaguar I-PACE robotaxis, the Alphabet company said in a statement. Waymo will add those self-driving vehicles to its existing fleet that already includes around 1,500 robotaxis.
The plant will be "capable of building tens of thousands of fully autonomous Waymo vehicles per year," when it is fully built out, Waymo said. The company also said it plans to build its more advanced Geely Zeekr RT robotaxis that feature its "6th-generation Waymo Driver" technology later this year at the plant.
It seems likely that Waymo will be offering unsupervised rides in a custom robotaxi before Tesla can do it in a Model Y.
The rise of the technology industry over the last few decades has been powered by its very strong economies of scale. Once you have invested in developing and deploying a technology, the benefit of adding each additional customer greatly exceeds the additional cost of doing so. This led to the concept of "blitzscaling", that it makes sense to delay actually making a profit and devote these benefits to adding more customers. That way you follow the example of Amazon and Uber on the path to a monopoly laid out by Brian Arthur's Increasing Returns and Path Dependence in the Economy. Eventually you can extract monopoly rents and make excess profits, but in the meantime blitzscale believers will pump your stock price.
This is what the VCs behind OpenAI and Anthropic are doing, and what Google, Microsoft and Oracle are trying to emulate. Is it going to work? Below the fold I report on some back-of-the-envelope calculations, which I did without using AI.
David Gerard notes that:
Microsoft is forecast to spend $80 billion on AI in 2025.
Let's try to figure out the return on this investment. We will assume that the $80B is split two ways: $40B to Nvidia for hardware and $40B on building data centers to put it in. Depreciating the $40B of hardware over five years is very optimistic; it is likely to be uneconomic to run after 2-3 years. But that's what we'll do. So that is minus $8B/year on the bottom line over the next five years. Similarly, depreciating the data centers over 20 years is likely optimistic, given the rate at which AI power demand is increasing. But that's what we'll do, giving another minus $2B/year on the bottom line.
Microsoft could argue that some of the $80B is the cost of training the models. But since the models will depreciate even faster than the hardware used to train them, this doesn't make things look better.
Microsoft's gross margin for cloud services is about 70%, so they will be expecting this $10B/year cost to generate $33B/year in revenue, or about 13% of Microsoft's total. Of course, there will be some ramp up in the revenue, but Microsoft is planning to keep investing, so next year's investment will need to generate a return too. We will thus ignore the ramp.
Microsoft is today promoting the pay-as-you-go pricing model of Copilot Studio as the preferred sales motion. The list price of one message is $0.01. While enterprise clients may get discounts, there’s also the chance of prepaid message capacity being unused, so things may even out. With this price point, Copilot Studio usage generates $2.5M revenue per month, and $30M per year.
So Microsoft is processing about 3B messages/year. It needs adoption to be so fast that next year's revenue will be around 1,100 times its current rate. They will need next year's customers to generate about 3.3T messages/year.
160k organizations using Copilot, this translates to around 1.5K messages per org per month. Or 52 messages per day. Now, we have to remember that one action in a Copilot Studio agent often consumes more than one message. ...
If those 52 messages were only about regular GenAI usage without any business process logic, that would mean 26 responses from Copilot Studio agents per day. If they were to include things like agent actions (meaning, AI does something more than just chatting back at you) or AI tools, we’re quickly at a point where the average Copilot Studio customer organization does a couple of agent runs per day.
This is shockingly low. It is plain and obvious that most customers are merely experimenting with trying to build agents. Hardly anyone is running it in production yet. Which wouldn’t be that bad if this was a new 2025 product. But Copilot Studio has been out since November 2023.
The back of my envelope says that Microsoft's AI business needs to grow customers like no business (even OpenAI) has ever grown customers if it is not to be a huge drag on the bottom line.
If this were a traditional technology business with very strong economies of scale growing customers incredibly fast would be good, because the incremental revenue from each new customer vastly outweighs the incremental cost of supporting them. This is where Microsoft's 70% gross margin comes from.
OpenAI lost $5B on $4B in revenue, so each $1 of revenue cost them $2.25. Ed Zitron had a more detailed estimate:
To be abundantly clear, as it stands, OpenAI currently spends $2.35 to make $1.
As conversations with experts and AI companies made clear, inference, not training, represents an increasing majority of AI’s energy demands and will continue to do so in the near future. It’s now estimated that 80–90% of computing power for AI is used for inference.
If we assume unrealistically that training is a one-time cost and they don't need to retrain for next year, training cost them say 15% of $45M, or about $6.75M, and answering the 3B messages cost them $38.25M. Scaling up by a factor of 1,100 means answering the messages would cost them $42B plus the $10B depreciation, so $52B. But it would only generate $33B in revenue, so each $1 of revenue would cost about $1.58. Scaling up would make the losses worse.
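For anyone who wants to check the arithmetic, here is the whole envelope in a few lines of Python. The $45M serving cost and the 15% training share are the assumptions stated above, not reported figures.

capex_hw, capex_dc = 40e9, 40e9
depreciation = capex_hw / 5 + capex_dc / 20         # $10B/year

gross_margin = 0.70
required_revenue = depreciation / (1 - gross_margin)    # ~$33B/year

price_per_message = 0.01
current_revenue = 30e6                               # Copilot Studio, $30M/year
current_messages = current_revenue / price_per_message   # ~3B messages/year
scale_factor = required_revenue / current_revenue        # ~1,100x
needed_messages = current_messages * scale_factor        # ~3.3T messages/year

current_cost = 45e6                # assumed cost of serving today's load
training = 0.15 * current_cost               # ~$6.75M
inference = current_cost - training          # ~$38.25M
scaled_cost = inference * scale_factor + depreciation    # ~$52B
cost_per_dollar = scaled_cost / required_revenue         # ~$1.58

print(f"needed revenue ${required_revenue/1e9:.0f}B/yr, "
      f"{needed_messages/1e12:.1f}T messages/yr, "
      f"${cost_per_dollar:.2f} per $1 of revenue at scale")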
There are only two possibilities. Either inference gets at least an order of magnitude cheaper than training instead of around 6 times more expensive, or the price of using AI goes up by at least an order of magnitude. Now you see why Sam Altman et al are so desperate to run the "drug-dealer's algorithm" (the first one's free) and get the world hooked on this drug so they can supply a world of addicts.
In this series, we wanted to take a few moments to celebrate some great tools that have been developed for and by librarians (and other library pros) to solve a particular problem they’ve encountered in their work. You can read all about cataloguing tools in the first part of the series. In this part two, [...]
The Civil Association for Equality and Justice (ACIJ) managed to guarantee the quality of decades of housing survey data and consolidate the results even with changes in the collection methodology.
ODE in action: A new cohort of pilot organisations will start testing ODE with new responsible AI-powered features on issues as varied as data journalism, academic research, cultural heritage, and public sector data.
While working on the Saving Ads project, we identified problems with replaying ads (technical report: “Archiving and Replaying Current Web Advertisements: Challenges and Opportunities”) that used JavaScript code to dynamically generate URLs. These URLs included random values that differed during crawl time and replay time, resulting in failed requests upon replay. Figure 2 shows an example ad iframe URL that failed to replay, because a dynamically generated random value was used in the subdomain. URL matching approaches like fuzzy matching could resolve these problems by matching the dynamically generated URL with the URL that was crawled.
Goel, Zhu, Netravali, and Madhyastha’s "Jawa: Web Archival in the Era of JavaScript" involved identifying sources of non-determinism that cause replay problems when dynamically loading resources, removing some of the non-deterministic JavaScript code, and applying URL matching algorithms to reduce the number of failed requests that occur during replay.
Video 1: Presentation video
Sources of Non-Determinism
Non-determinism can cause variance in dynamically generated URLs (e.g., the same resource referenced by multiple URLs with different query string values, such as https://www.example.com/?rnd=4734 and https://www.example.com/?rnd=7765). This variance can result in failed requests (like the example shown in Figure 2) if the replay system does not have an approach for matching the requested URL with one that was successfully crawled. The sources of non-determinism that cause problems with replaying archived web pages are server-side state, client-side state, client characteristics, and JavaScript's Date, Random, and Performance (DRP) APIs. When replaying web pages client browsers do not maintain server-side and client-side state. The other sources of non-determinism (client characteristics and DRP APIs) are present during replay and impact JavaScript execution.
When a web page’s functionality requires dynamically constructed server responses (e.g., posting comments, push notifications, and login), the functionality can be impacted if the web page requires communication with a website’s origin servers. When an archived web page is loaded, the functionality would also be impacted if more resources were requested that were not archived during the crawling session. For client characteristics and DRP APIs, the authors ensured that all APIs would return the same value during replay time as they did during crawl time. For DRP APIs, they also used server-side matching of requested URLs to crawled URLs.
Reducing Storage When Archiving Web Pages
Goel et al. created a web crawler named Jawa (JavaScript-aware web archive) that removes non-deterministic JavaScript code so that the replay of an archived web page does not change if different users replay it. Since Jawa removes some third party scripts, the preservation style is in-between Archival Caricaturization and the Wayback style. Archival Caricaturization is a term created by Berlin et al. (“To Re-experience the Web: A Framework for the Transformation and Replay of Archived Web Pages”) to describe a type of preservation that does not preserve the web page as it originally was during crawl time. Archive.today is an example of Archival Caricaturization where all the original JavaScript is removed during replay. In contrast, archives that use the Wayback style archive and replay all of the resources of a web page and only make minimal rewrites during replay.
Jawa reduced the storage necessary for their corpus of 1 million archived web pages by 41% when compared to techniques that were used by Internet Archive during 2020 (Figure 3). This storage savings occurred because they discarded 84% of JavaScript bytes (Figure 4). During their presentation (https://youtu.be/WdxWpGJ-gUs?t=877), the authors mentioned that the 41% reduction in storage also includes other resources (e.g., HTML, CSS, and images) that would have been loaded by the excluded JavaScript code. Jawa saves storage by not archiving non-functional JavaScript code and removing unreachable code. When removing JavaScript code, they ensured that the removed code does not affect the execution of the rest of the code.
Brunelle et al.’s “Archival Crawlers and JavaScript: Discover More Stuff but Crawl More Slowly” also involved measuring JavaScript’s impact on storage when archiving web pages. They found that using a browser-based crawler that executes JavaScript during the crawling session resulted in 11.3 times more storage for all 303 web pages in their collection and 5.12 times more storage per URI (approximately 413.2 KB/URI). If we take this KB per URI measurement and multiply it with the number of URIs in Jawa’s corpus of 1 million web pages, it is expected for browser-based crawlers to require approximately 413.2 GB of storage to archive all web pages. When Goel et al. used techniques similar to the Internet Archive (which the authors referred to as IA*), it required 535 GB to archive the web pages in the corpus, while 314 GB of storage was required for Jawa. Since the amount of JavaScript (and the resources dynamically loaded by this code) has increased (Figure 5), the IA* approach required more storage than previously expected by Brunelle et al. Even though the amount of storage required to archive web pages has increased, Jawa achieved enough storage savings to go below the previously expected storage for browser-based crawlers.
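The quoted figures are easy to sanity-check with a few lines of Python; this just reruns the numbers stated above (decimal KB-to-GB conversion assumed).

uris = 1_000_000
kb_per_uri = 413.2                       # Brunelle et al., browser-based crawling
expected_browser_gb = uris * kb_per_uri / 1e6    # KB -> GB: ~413 GB for the corpus

ia_star_gb = 535                         # IA*-style techniques on Jawa's corpus
jawa_gb = 314                            # Jawa
reduction = 1 - jawa_gb / ia_star_gb     # ~0.41, the 41% storage reduction
print(round(expected_browser_gb), round(reduction, 2), jawa_gb < expected_browser_gb)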
Brunelle et al. also compared the crawl throughput and showed that using browser-based crawlers significantly increased the amount of time (38.9 times longer than using Heritrix) it takes to crawl web pages when compared to traditional web archive crawlers that do not execute the JavaScript during crawl time. Since Jawa can reduce the amount of JavaScript archived, it was able to improve the crawling throughput by 39% when it archived web pages from Goel et al.’s corpus (Figure 6).
Removing Non-Functional Code From JavaScript Files
Their approach for removing non-functional code is based on two observations about JavaScript code that will not work on archived web pages because it relies on interacting with origin servers:
Most non-functional JavaScript code is compartmentalized in a few files and is not included in all JavaScript files.
The execution of third-party scripts will not work when replaying archived web pages.
To identify non-functional JavaScript, they created filter lists, instead of using complex code analysis. Their filter lists contain rules that were created based on manual analysis of the scripts from their corpus.
Every rule was based on domain, file name, or URL token:
For domain rules, they removed some URLs associated with a third-party service.
For file name rules, they would identify files like “jquery.cookie.js” (which is used for cookie management) from any domain and not archive them.
For URL token rules, if a keyword such as “recaptcha” was found in the URL, they would not archive the resource.
The filter lists can be used to exclude URLs during crawl time so that JavaScript files that are not needed are never archived. They removed third-party scripts whose removal would not prevent post-load interactions from working during replay time. They also removed scripts that were on EasyList, which is an ad-blocking filter list.
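A toy sketch of how such filter rules might be applied at crawl time; the file-name and token values are the examples from the text, the domain rule is a hypothetical EasyList-style entry, and none of this is Jawa's actual filter-list code.

from urllib.parse import urlparse

DOMAIN_RULES = {"google-analytics.com"}          # hypothetical third-party service
FILENAME_RULES = {"jquery.cookie.js"}            # cookie management, any domain
TOKEN_RULES = {"recaptcha"}                      # keyword anywhere in the URL

def should_archive(url):
    parsed = urlparse(url)
    host = parsed.hostname or ""
    filename = parsed.path.rsplit("/", 1)[-1]
    if any(host == d or host.endswith("." + d) for d in DOMAIN_RULES):
        return False                             # domain rule
    if filename in FILENAME_RULES:
        return False                             # file name rule
    if any(token in url.lower() for token in TOKEN_RULES):
        return False                             # URL token rule
    return True

print(should_archive("https://cdn.example.com/js/jquery.cookie.js"))  # False
print(should_archive("https://www.example.com/article.html"))         # True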
They checked whether the code removed by Jawa visually or functionally impacted the replay of archived web pages. For visual impact, they compared screenshots of each web page archived by Jawa with and without filtering. They then viewed the web pages whose screenshots differed and found only insignificant differences, such as different timestamp information on the page and different animations due to JavaScript’s Date, Random, and Performance (DRP) APIs.
For functional impact, they checked if the post-load interactions will work on the archived web page. They found that removing the files that matched their filter lists did not negatively impact the navigational and informational interactions.
Garg et al. (“Caching HTTP 404 Responses Eliminates Unnecessary Archival Replay Requests”) identified cases where archived web pages that require regular updates (e.g., news updates, new tweets, and sports score updates) would repeatedly make failed requests to an origin server during replay time, resulting in unnecessary traffic. Jawa could resolve this problem because it removes code that communicates with an origin server, which should reduce the number of failed requests that occur during replay for these types of archived web pages.
Removing Unreachable Code
With this approach, they removed unreachable code, that is, code that can never be executed during replay. This unreachable code is associated with sources of non-determinism that are absent during replay time and with non-determinism caused by asynchronous execution and APIs for client characteristics.
The code that is executed when an event handler is invoked can differ depending on the order the user interacts with elements on the page (Figure 7), the inputs the user provides to the events, and the values returned by the browser’s APIs:
Order of user interaction: They focused on read-write dependency and found that the event handlers would not impact the replay since these events were used for user analytics that would track user interaction.
User input: None of the event handlers that work at replay time would read inputs that impact which code gets executed.
Browser APIs: Jawa removes APIs for client characteristics so only DRP APIs would be executed during replay time. When they checked the web pages in their corpus, the DRP APIs did not impact the reachable code for any event handler.
For the resources not filtered out by their filter lists, Jawa injects code (Figure 8) to identify which code was executed and in what order and then triggers every registered event handler using default input values and identifies which code was executed. The code that was executed gets stored. They then ensure that the browser will follow the same execution schedule and use the same client characteristics.
Utilizing URL Matching Algorithms to Handle Non-Determinism
When non-determinism causes variance in web resources’ URLs it results in failed requests, which prevents the web resources from loading. To mitigate this replay problem, Jawa uses approximate URL matching at the backend and intercepts calls to DRP APIs, supplying the return value of the API that was previously saved during crawl time. Goel et al. used two URL matching algorithms, querystrip and fuzzy matching, to match the requested URL with a crawled URL.
Querystrip removes the query string from a URL before initiating a match. This approach can help with cases where the query string is updated for a resource based on the server-side state. Figures 9 and 10 show an example where querystrip would be useful. We identified a replay problem (that resulted in a failed request for most replay systems except for ReplayWeb.page) with Amazon ad iframes that used a random value in the query string. If the query string is removed from this URL and a search is performed for the base URL in the WACZ file, then we could match the URL that was dynamically generated during replay with the URL that was crawled.
Figure 9: Example URI for an Amazon ad iframe. The rnd parameter in the query string contains a random value that is dynamically generated when loading an ad.
Figure 10: When replaying an Amazon ad iframe, the rnd parameter is not the same as the original value that is in the URI-R. Even though an incorrect URI-M is generated, ReplayWeb.page is able to load the ad. WACZ | URI-R: https://aax-us-east.amazon-adsystem.com/e/dtb/admi?b=...
Goel et al.’s fuzzy matching approach used Levenshtein distance to find the best match for a URL. An example of fuzzy matching for a replay system is pywb’s rules.yaml and pywb’s fuzzymatcher.py script that uses these rules. According to their presentation (https://youtu.be/WdxWpGJ-gUs?t=931), Jawa eliminated failed network fetches on around 95% of the pages from their corpus of 3,000 web pages (Figure 11). Their paper reported eliminating failed network fetches on 99% of pages, but the figure is listed as 95% in the more recent slides.
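A from-scratch sketch of the two matching strategies (not Jawa's or pywb's actual code): querystrip drops the query string and looks for an exact match, and fuzzy matching falls back to the crawled URL with the smallest Levenshtein distance.

from urllib.parse import urlsplit, urlunsplit

def querystrip(url):
    scheme, netloc, path, _query, _frag = urlsplit(url)
    return urlunsplit((scheme, netloc, path, "", ""))

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def match(requested, crawled_urls):
    # 1. Exact match after stripping the query string.
    stripped = querystrip(requested)
    for c in crawled_urls:
        if querystrip(c) == stripped:
            return c
    # 2. Fall back to the crawled URL with the smallest edit distance.
    return min(crawled_urls, key=lambda c: levenshtein(requested, c))

crawled = ["https://www.example.com/?rnd=4734"]
print(match("https://www.example.com/?rnd=7765", crawled))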
This group’s continued work (“Detecting and Diagnosing Errors in Replaying Archived Web Pages”) involves identifying URL rewriting problems caused by JavaScript that impacts the quality of the archived web page during replay. Their goal is to create a new approach for verifying the quality of an archived web page that is better than comparing screenshots and viewing failed requests. Their approach involves capturing (during crawl and replay time) each visible element in the DOM tree, the location and dimensions of the elements, and the JavaScript that produces visible effects. Their approach was able to reduce false positives while detecting low fidelity during replay when compared to using only screenshots and runtime and fetch errors.
Summary
Goel et al. created a web crawler named Jawa (JavaScript-aware web archive) that removes some non-deterministic JavaScript code so that the replay of an archived web page does not change if different users replay it. Their crawler also reduces the amount of storage needed when archiving web pages by removing non-functional and unreachable JavaScript code.
The sources of non-determinism that cause problems with replaying archived web pages are server-side state, client-side state, client characteristics, and JavaScript's Date, Random, and Performance APIs. When non-determinism caused variance in a dynamically generated URL during replay, they used two URL matching algorithms, querystrip and fuzzy matching, to match the requested URL with a crawled URL. These URL matching algorithms can reduce the number of failed requests and could resolve replay problems associated with random values in dynamically generated URLs, a problem we encountered during the Saving Ads project while replaying ads.
Win free books from the July 2025 batch of Early Reviewer titles! We’ve got 191 books this month, and a grand total of 3,477 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.
The deadline to request a copy is Friday, July 25th at 6PM EDT.
Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, the UK, Canada, Germany, Australia, Greece, Cyprus, Czechia, Spain, Denmark and more. Make sure to check the message on each book to see if it can be sent to your country.
Thanks to all the publishers participating this month!
Happy July, DLF Community! We hope you’re having a pleasant summer so far and that you’re staying cool however you can. Between vacations and time to relax, we hope to catch you at a DLF Working Group meeting sometime this month. And, because the fall will be here before we know it, we hope you’re making plans to come to the Forum and Learn@DLF this November – registration is open and all programs have been released for what’s sure to be a wonderful week in colorful Colorado.
See you soon!
— Aliya from Team DLF
This month’s news:
Forum program announced, registration open: The program for the 2025 DLF Forum is now available. Register at the earlybird rate to join us in Denver in November.
Opportunity: H-NET Spaces invites applications for its Spaces Cohort Program, which supports early-stage projects and/or scholars in need of support and hands-on training in DH methods. Applications due July 1.
Call for climate-conscious bookworms: DLF’s Climate Justice Working Group summer book group is meeting Tuesday, July 29 at 1pm ET. They’ll be discussing chapters 5&6 of After Disruption: A Future for Cultural Memory by Trevor Owens, which is available open access. All are welcome to join, even if you’re not a regular participant in the working group and/or missed the first discussions. Register here to join.
Office closure: CLIR offices will be closed July 3-4 in observance of Independence Day.
This month’s open DLF group meetings:
For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.
DLF Born-Digital Access Working Group (BDAWG): Tuesday, 7/1, 2pm ET / 11am PT
DLF Digital Accessibility Working Group (DAWG): Tuesday, 7/1, 2pm ET / 11am PT
DLF AIG Cultural Assessment Working Group: Monday, 7/14, 1:00pm ET /10am PT
DLF AIG Metadata Assessment Working Group: Thursday, 7/17, 1:15pm ET / 10:15am PT
DLF AIG User Experience Working Group: Friday, 7/18, 11am ET / 8am PT
DLF Digital Accessibility Policy & Workflows Subgroup: Friday, 7/25, 1pm ET / 10am PT
DLF Digital Accessibility Working Group IT & Development (DAWG-IT) Subgroup: Monday, 7/28, 1:15pm ET / 10:15am PT
Libraries play a crucial role in ensuring equitable access to information, yet many collected materials remain inaccessible to patrons with disabilities. How can resource-sharing practitioners leverage their expertise and systems to bridge this gap? This question was at the heart of a recent OCLC Research Library Partnership (RLP) Works in Progress Webinar, Increasing the accessibility of library materials—Roles for ILL, where experts from three large academic libraries shared their practices and insights for improving collection accessibility for users who need accommodations.
Accessibility has long been a topic of interest and action for the SHARES resource sharing consortium, and all three speakers were from SHARES institutions. As the consortium’s coordinator, I introduced the session by highlighting some of the group’s previous work on accessibility, which included surveying members about current practices, challenges, and aspirations around accessibility, creating a resource document on accessibility and ILL, and drafting accessibility provisions that were incorporated into the latest revision of the US Interlibrary Loan (ILL) code in 2023.
Next our three distinguished presenters took to the virtual stage:
Clara Fehrenbach, Document Delivery Services Librarian at the University of Chicago Library
Ronald Figueroa, Resource Sharing and Facility Manager at Syracuse University Libraries
Brynne Norton, Head of Resource Sharing & Reserves at the University of Maryland Libraries
Key takeaways
The session highlighted various models and strategies for enhancing accessibility in library collections. Here are key insights shared by the presenters:
Making procurement seamless for qualified patrons: Clara Fehrenbach discussed the University of Chicago’s partnership with Student Disability Services (SDS). This collaboration allows students to request alternate formats directly through the library catalog, ensuring privacy and streamlined access to necessary materials. The library scans materials that SDS can’t source elsewhere, while SDS evaluates and authorizes patron eligibility and does the actual document remediation.
Providing PDFs of hard-to-get material still in copyright: Brynne Norton outlined the Accessible Library Text Retrieval Program (ALTR) at the University of Maryland. This program provides text-searchable PDFs of in-copyright library materials for students with visual impairments and other disabilities, serving as a last resort when other accessible formats are unavailable. Accessibility and Disability Service (ADS) staff determine who qualifies for this service as an accommodation.
Offering multiple levels of remediation: Ronald Figueroa outlined Alternate Format Services (AFS) at Syracuse University, which provides alternate formats for items owned, licensed, or obtained via ILL for qualified patrons. Service eligibility is determined by the Center for Disability Resources for students and by the ADA Coordinator for faculty and staff. AFS remediates for magnification, text-to-speech, or screen readers, according to need, and outsources jobs over 200 pages.
Practical tips for ILL practitioners
The presenters shared practical advice for libraries looking to start offering accessibility services or enhance an existing service:
Start small: Begin with basic services and gradually expand capabilities based on available resources.
Understand patron needs: Tailor services to meet the specific needs of patrons, whether it’s OCR documents, accessible PDFs, or other formats.
Leverage partnerships: Collaborate with Disability Services Offices (DSOs) on campus to determine eligibility and streamline the provision of accessible materials.
Maintain communication: Keep in close touch with partners to ensure ongoing support and address any changes in staff or procedures.
They also offered guiding principles for those who might be feeling overwhelmed by the prospect of starting up a new accessibility service:
Let those who are already good at it do it: ILL = scanning; Student Disability Services = eligibility.
Don’t overpromise: Understand what you actually have the bandwidth to offer before partnering.
Don’t be shy: Know that Student Disability Services folks are eager to partner.
Tap into your ILL community: Ask your peers for help.
Looking ahead
The webinar also looked to the horizon for upcoming developments in accessibility, including the integration of Optical Character Recognition (OCR) into OCLC’s Article Exchange document delivery application and burgeoning efforts by the ALA RUSA STARS Codes and Guidelines Committee to establish scanning standards, with a focus on improving scanning for accessibility. These initiatives are crucial for ensuring that all patrons have equitable access to library resources.
This webinar provided valuable insights and practical strategies for improving accessibility in library collections. By leveraging collaborative efforts, specialized programs, and efficient workflows, libraries can make significant strides in ensuring that all patrons, regardless of their abilities, have access to the information they need. We invite you to learn more by watching the recorded webinar, and exploring the wealth of resources shared on the recording webpage.
The training, offered by Public Lab Mongolia, attracted the interest of Mongolian open data enthusiasts and practitioners from various fields, including sociology students and practitioners, data analysts, health researchers, university lecturers, civil society, and private sector professionals.
In our 2023 blog post, “Machine Learning and WorldCat,” we described for the first time how we use machine learning to detect and merge duplicate records in WorldCat.
Removing duplicate records has always been central to the quality of WorldCat. It makes cataloging more efficient and improves overall quality. With bibliographic data flowing in faster than ever, we have to keep records accurate, connected, and accessible, and we have to do it quickly.
AI lets us scale deduplication quickly and efficiently, but human knowledge and experience remain essential to its success. At OCLC we have invested in a hybrid approach: we use AI to process enormous volumes of data, while catalogers and OCLC experts continue to make the key decisions.
From paper slips to machine learning
Before I joined OCLC, I already worked on improving bibliographic data. Back then, merging duplicate records was done entirely by hand. Libraries sent us paper slips listing possible duplicates, often with an explanation from a cataloger.
We sorted thousands of slips into filing cabinets: green slips for books, blue for non-book materials, and pink for serials. The volume of slips was so large that we even had to repurpose office furniture to store them; in the end you couldn’t find a pen or notepad anywhere.
This image was generated with AI to give an impression of the cluttered aisles where we kept the duplicate slips. It looks much tidier here than it actually was.
Looking back, I can see how forward-thinking that collective effort was. The work was slow and methodical, but it showed how carefully we operated at the time. Every slip represented a decision, a piece of human judgment that determined whether records in our system were merged or kept separate. Despite our hard work, we could never quite keep up. The pile of duplicates kept growing, and we were always behind.
The difference today is enormous. Since I started working on AI-driven deduplication at OCLC, I have come to appreciate how much more efficiently we can tackle this now. What used to take years we now do in weeks, with greater accuracy and across more languages, scripts, and material types than ever before. Yet the core of the work remains the same: human expertise is indispensable. AI is not a magic fix; it learns from our cataloging standards, our professional judgment, and our corrections.
By using a hybrid approach, in which machine learning does the heavy lifting while human review guides and refines the process, we can strike a balance between speed and precision and build on the best of both worlds.
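To make this hybrid approach concrete, below is a minimal sketch, in Python, of confidence-based routing: pairs a model is very sure about are merged automatically, doubtful pairs go to a cataloger, and the rest are left alone. The thresholds, field names, and the stand-in similarity function are illustrative assumptions, not OCLC’s actual pipeline.

import difflib

# Hypothetical sketch of confidence-based routing in a hybrid deduplication
# workflow. Thresholds, field names, and the stand-in similarity score are
# illustrative assumptions only.

AUTO_MERGE_THRESHOLD = 0.98   # very confident: merge automatically
REVIEW_THRESHOLD = 0.80       # uncertain: queue for a cataloger

def score_similarity(rec_a: dict, rec_b: dict) -> float:
    """Stand-in for a learned model that scores how likely two bibliographic
    records describe the same resource (0.0 to 1.0). Here: a simple title
    comparison, purely for illustration."""
    return difflib.SequenceMatcher(
        None, rec_a["title"].casefold(), rec_b["title"].casefold()).ratio()

def route_candidate_pair(rec_a: dict, rec_b: dict) -> str:
    """Decide what happens to a candidate duplicate pair."""
    score = score_similarity(rec_a, rec_b)
    if score >= AUTO_MERGE_THRESHOLD:
        return "auto-merge"          # the machine does the heavy lifting
    if score >= REVIEW_THRESHOLD:
        return "human-review-queue"  # a cataloger makes the final call
    return "keep-separate"

# Example: an identical title is merged; a doubtful pair goes to review.
print(route_candidate_pair({"title": "War and Peace"},
                           {"title": "War and peace"}))   # auto-merge
print(route_candidate_pair({"title": "War and Peace"},
                           {"title": "War & Peace"}))     # human-review-queue

The design point is simply that automation handles the clear-cut cases, while human judgment stays in the loop for anything uncertain.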
Balancing innovation and care in WorldCat
For decades, catalogers, metadata managers, and OCLC teams have worked together to safeguard the reliability of WorldCat, keeping it a high-quality, trusted resource for libraries and researchers. Removing duplicate records is an important part of that effort: it keeps everything better organized, easier to search, and easier to exchange between systems.
AI lets us approach duplicate records in a new way, allowing us to detect and merge far more duplicates than ever before. The main challenge is to apply AI responsibly and transparently, so that it aligns with professional cataloging standards.
This scalable approach is a natural extension of our long-standing role as stewards of shared bibliographic data. AI gives us the opportunity to strengthen human expertise, not replace it.
A new take on deduplication
Until now, we relied on fixed algorithms and a great deal of manual work to merge duplicate records. That worked, but it had clear limits.
With OCLC’s AI-driven deduplication methods, we can now accomplish much more:
More languages and scripts: Our machine learning algorithm efficiently handles non-Latin scripts and records in a wide range of languages, which lets us find duplicates in global collections faster.
More types of records: AI recognizes duplicates across a broader range of bibliographic data and also helps with material types that were previously difficult to deduplicate.
Protection for rare and special collections: We do not apply AI to rare and unique materials, so distinctive items in archives and special collections remain well protected.
These improvements let us keep raising the quality of WorldCat’s metadata across more materials and languages, and to do so responsibly.
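As a rough illustration of the shift away from fixed rules, the sketch below contrasts an exact-match rule with graded pair features that a trained model could weigh, plus a guard that keeps rare and special-collection records out of the AI pipeline. All field names and helpers here are hypothetical; real WorldCat records and OCLC’s actual features are far richer.

import difflib
import unicodedata

def normalize(text: str) -> str:
    """Unicode-normalize, lowercase, and trim a field for comparison."""
    return unicodedata.normalize("NFKC", text).casefold().strip()

def deterministic_match(rec_a: dict, rec_b: dict) -> bool:
    """Old-style fixed rule: a match only when normalized title and author
    are exactly equal. Variant forms and typos slip through."""
    return (normalize(rec_a["title"]) == normalize(rec_b["title"])
            and normalize(rec_a["author"]) == normalize(rec_b["author"]))

def eligible_for_ai_dedup(record: dict) -> bool:
    """Safeguard from the list above: rare or special-collection items stay
    out of the AI pipeline. The flag name is a hypothetical field."""
    return not record.get("is_rare_material", False)

def pair_features(rec_a: dict, rec_b: dict) -> list[float]:
    """Graded similarities a trained model could weigh instead of one hard
    rule; small differences lower the score but do not veto a match."""
    def sim(field: str) -> float:
        return difflib.SequenceMatcher(
            None, normalize(rec_a.get(field, "")),
            normalize(rec_b.get(field, ""))).ratio()
    same_year = float(rec_a.get("year") == rec_b.get("year"))
    return [sim("title"), sim("author"), sim("publisher"), same_year]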
What “responsible AI” means in practice
The term “AI” is broad and meets with some skepticism, understandably so. Different AI applications raise questions about bias, accuracy, and reliability.
Our approach is built on a few key principles:
AI complements human expertise: AI is meant to support people, not replace them. We have built in human review and data labeling, so that our AI models learn according to cataloging best practices.
Efficiency without compromising quality: Our AI is designed to use computing capacity wisely, without sacrificing the accuracy and quality of the records.
Sustainability: We make sure our systems use no more computing power than necessary, so results stay high-quality without waste. By deploying AI thoughtfully, deduplication remains affordable and future-proof as we continue to grow.
The goal is not to replace people but to make better use of their knowledge and time. Catalogers can focus on work that truly adds value for their users, rather than endlessly cleaning up duplicate records.
In addition, catalogers and our experienced OCLC staff play an active role in this process. By labeling data and providing feedback, they help the AI get steadily better at recognizing and handling duplicates.
AI as a collaborative effort, and the road ahead
I don’t miss the stacks of paper slips or the quarterly purges of filing cabinets, but I do appreciate what they stood for: care and dedication. AI does not replace that care; it builds on it and takes it further.
As the tools keep evolving, our principles stay the same. OCLC has long used technology to help libraries manage their catalogs and collections. We are now applying that same approach to AI: purposeful, effective, and rooted in our shared commitment to metadata quality.
This way of innovating enables libraries to meet changing needs and keep delivering value to their users.
Join OCLC’s data labeling initiative and help refine AI’s role in deduplication.
AI-driven deduplication is a collaborative effort that is continually refined through community input and professional oversight. Your contribution has a direct impact on the quality and efficiency of WorldCat and benefits the entire library community.