Planet Code4Lib

Reflection: The second half of my sixth year at GitLab and learning Strategy & Operations / Cynthia Ng

Hard to believe it’s been 6 years, and I certainly didn’t imagine working outside of Product and Engineering at any point until it happened. Check out the last reflection post if you’re wondering what happened in the first half of the sixth year, namely how I wrapped up my work in Support, and moved to … Continue reading "Reflection: The second half of my sixth year at GitLab and learning Strategy & Operations"

Sort order of results in OCLC AssignFAST/fastsuggest API / Jonathan Rochkind

We have long used the free OCLC AssignFAST API to power an auto-suggest/auto-complete in some of our staff metadata entry forms. (Note: While OCLC calls this service “AssignFAST” in docs, the base URL instead includes the term fastsuggest as in http://fast.oclc.org/searchfast/fastsuggest?, which may be confusing for some!)

Recently our staff reported they thought the sort order had changed in the results returned — before they think they could enter, say, “Philadelphia” and get “Philadelphia–Pennsylvania” as a suggested response — which was the one they wanted. But now, that result wasn’t even in the first 15 results, and thus wasn’t included in our drop-down auto-suggest menu. It looks like the current result order is strictly alphabetical, so includes a lot of obscure or unlikely to be useful hits from a query matching any part of a term.

We contacted OCLC support at this form we found for “Contact the FAST team”. They got back to us relatively quickly (hooray!), but unfortunately could not neither confirm nor deny that any changes had happened to the order of results from that API any time recently, or any time at all.

But they did, eventually, tell us that we could in fact control the sort order of the API response, and make it return results in order of usage, by adding a query param to the API request — &sort=usage+desc. I would guess “usage” means frequency of use in the OCLC catalog record database.

Eg, https://fast.oclc.org/searchfast/fastsuggest?&query=philadelphia&query&queryReturn=suggestall%2Cidroot%2Cauth%2Ctype&suggest=autoSubject&rows=20&sort=usage+desc

This works, and seems to return results more like what we’re expecting/used to?

It also seems to return results more like what their own web page at https://fast.oclc.org/searchfast/ returns?

It’s also entirely undocumented on the AssignFAST API documentation, and they don’t seem inclined to update the docs? I do not know if there might be other useful values selectable as sort fields, in addition to usage.

I make this post in part for Google-ability for anyone else looking to solve this mystery.

The AssignFAST doc page also includes lots of broken links to examples. In general, I think we probably should be grateful this free API still exists at all; I think OCLC is putting only minimal if any resources to even keeping it alive, let alone enhancing it or responding to current user need; I’d expect it to disappear at some point.

I remember a time when I first entered the profession where I imagined we would develop all sorts of innovative digital services and APIs from which a new generation of library technological ecosystem would be built, and OCLC, our community-owned non-profit cooperative, would be at the center of that doing innovative things that they were well-placed to do with their access to metadata ecosystems. Of that fantasized future, what remains in 2024 is the leftover unsupported frozen-on-the-vine initial experiments from that era of excitement and innovation, and a shrunken ecosystem of vendors trying to figure out how to wring every last drop of revenue from their customers.

Matt Levine Explains Cryptocurrency Markets / David Rosenthal

Kanav Kariya
Matt Levine's superpower is his ability to describe financial issues in wonderfully simple terms, and in a section of Monday's column entitled Crypto is for fun he is on top form:
In many cases, the essential attribute of a crypto token is liquidity: What you want, often, is a token that trades a lot, because your goal for the token is to trade it a lot. Real-world utility, a sensible business model, acceptance in real transactions, etc., are all less important than just trading if you think of crypto as a toy market for traders to play with. If a token trades a lot at a high price, that in itself justifies the price, because that is all that is asked of a token: It doesn’t need to have a good underlying business or cash flows; it just has to trade a lot at a high price.
Below the fold I discuss the astonishing story behind this explanation of why wash trading is so rife in cryptocurrencies.

Levine is commenting on Leo Schwartz' The rise and fall of a crypto whiz kid: How 25-year-old Kanav Kariya went from president of Jump Crypto to pleading the Fifth. This tells how Jump Crypto got started as a way to evaluate interns, such as Kanav Kariya, as potential hires:
The firm needed to test the mettle of its would-be staffers—whether they could parse the nuances in financial markets and translate them into algorithmic trading models. But it couldn’t give the temporary hires the keys to the kingdom, with its proprietary strategies and billions of dollars in capital.

Crypto offered a solution. The sector had its own tradable assets, exchanges, and quirks, but it was separated enough from Jump’s world of stocks and bonds that it wouldn’t pose a threat.
Levine comments:
It feels like a useful mental model for crypto: Crypto is what you get when you take the smart ambitious interns at traditional financial firms and put them in charge of their own play market. Only for real money.
Jump Crypto made markets, and for many shitcoins they were the dominant market maker, and thus the reason why the shitcoins traded a lot, and thus the source of their value. This was because they made deals with the shitcoin projects:
Token projects will lend market makers a large supply of tokens so they can kickstart trading. Some firms also negotiate a call option, which gives the market makers the right to buy a chunk of the tokens for a steep discount if the project goes well. Selig says that the inverted structure in crypto—where market makers work with token projects, rather than exchanges—makes some sense, given projects’ need to spur trading activity. It also creates dynamics that would never be allowed in TradFi. While crypto market makers still make money off the trading spreads, the massive windfalls often come from those lucrative call options.
If I'm a shitcoin promoter, I can sell Jump Crypto cheap call options for a bunch of my new shitcoin. They will pump the price by wash trading it. We'll, both get rich; me from selling some of my inflated tokens, and Jump Crypto by exercising the options and selling. Who are we selling to? The retail speculators sucked in by the spectacle of a new token with a rapidly rising "price". This is another version of the VC's List And Dump Schemes described by Fais Khan in "You Don't Own Web3": A Coinbase Curse and How VCs Sell Crypto to Retail.

Schwartz writes:
For a firm like Jump, becoming the market maker for a token project meant unlimited upside with no real financial risk. “If you’re at Jump, you decide which one is going to win,” one crypto exchange founder tells Fortune, speaking on the condition of anonymity to discuss industry dynamics.
And Levine comments:
Man, what a crazy time the crypto boom was. It really did teach a generation of young financial traders that they could build perpetual motion machines: You make a token, you trade it, that makes it go up, the value comes from you trading it, you can do no wrong, you get rich, “unlimited upside with no real financial risk.”
Like most financial schemes, this all worked great at first. Two of the shitcoins for which Jump Crypto made markets were Do Kwon's Terra/Luna pair. Schwartz' article starts by recounting a May 2021 crisis:
[Jump Crypto] had become a kind of silent partner for one of the most high-profile projects in crypto, an algorithmic stablecoin called TerraUSD that was meant to maintain a $1 peg through a complex mechanism tied to a related cryptocurrency called Luna—a careful dance that Jump helped coordinate on the backend by fulfilling trades. But despite the bluster of Terra's swaggering founder, Do Kwon, the stablecoin was failing. It had lost its peg.
Jump Crypto and Do Kwon averted the crisis:
Over the next week, Jump secretly bought up huge tranches of TerraUSD to create the appearance of demand and restore the coin's value to $1, according to court documents. Meanwhile, Kwon "vested" Jump, meaning he agreed to deliver 65 million tokens of Luna to Jump at just $0.40, even though the coin would trade, at times, at more than $90 on exchanges.

Jump ultimately made $1 billion from that agreement alone
Terra loses peg
Like most financial schemes it could not last, and the longer it lasted the bigger the crash:
These dubious heroics succeeded only in staving off the inevitable: When TerraUSD lost its peg again a year later, there was nothing Jump could do. By May 2022, the cryptocurrency had grown in popularity, and its failure was catastrophic. Some $40 billion in investors' money evaporated into thin air in a matter of days.
The details of Jump Crypto's involvement with Terra/Luna were revealed because the SEC's case against Terraform Labs and Kwon was based partly on a whistleblower from Jump Crypto. The testimony will probably feature in the criminal case against Kwon, currently awaiting extradition from Montenegro.

Black Boxes / Ed Summers

One of the primary reasons I’ve resisted using machine learning models in my professional work is that they always appear to me in the guise of a black box.

A black box is a metaphor used to talk about some computational process which has known inputs and outputs, but whose internal workings are not known. It’s an idea that goes back to the beginnings of cybernetics and modern computing.

Computer professionals are trained to distinguish between the interpretability and explainability of machine learning models. But at the end of the day our ability to understand and adequately communicate how and why these models work are active areas of research.

Of course, our ability to understand is constrained because the details of some models are guarded as business secrets. But more and more we are seeing documentation efforts increase transparency about how “open source” models are built. We’ve seen significant efforts to regulate how they are used and by who. But it’s important to recognize that even these “Glass Boxes” don’t explain why some of the models work.

And that’s the glass half full version.

The glass half empty version is that we do understand how these models work, if you define “we” as someone, or some set of people, somewhere. The past isn’t evenly distributed too.

For a given model someone knows what software was used to build it–because someone installed it right? Someone made choices about what algorithms and data structures were used. Someone knows what data was used to train it because they collected it and fed it in. Someone knows what data was not fed in, and (maybe) how that biases the model. Someone knows how much money was spent training and using the model because the electricity bill got paid.

But these systems are very complex that these someones are almost certainly not the same people. They may not work in the same organizations. They may not be willing, or even able, to communicate these things to you, even if you could find them.

This is why Nick Seaver says that algorithmic systems are in fact culture, which require ethnographic methods to understand (Seaver, 2017).

I got my start as a professional programmer when the World Wide Web was being built in the mid-1990s. I didn’t have a computer science degree, but I was fortunate to be able to lean on the modest programming experience I got in high school (BASIC, Pascal and Fortran) to learn a new language Perl and apply it. I didn’t become a .com millionaire, but it helped pay the rent. I felt lucky, and in truth I was. I was already a participant in a network of privilege.

Perl had lots of features for working with text, which leant itself to be useful in web applications. It had an open source culture that encouraged the sharing of “modules” on CPAN that extended the language to do things like talk to databases, or work with specific data formats, or thousands of other things. I loved the attention to detail in the Camel Book about how the language was built and evolved. I knew that the real programmers had written the Perl interpreter in C. I didn’t really know C that well, but I knew other people did, and understood that (in theory) I could too. It was all a simple matter of programming, as they say.

But the reality was, I didn’t know the details of some of the open source Perl modules I installed and used. I trusted that the authors did. I didn’t know how some of the Perl language primitives were coded in C, but I learned from experience using them, while in conversation with other programmers, and knew that someone else somewhere did understand. I participated in the culture of programming in Perl.

The scary thing for me about generative AI is that I don’t have any meaningful trust relationships with the people who know parts of the answers to why/how it works. At a fundamental level I don’t know if it is even theoretically possible to know. Not only that, I have an active mistrust of some of the organizations that are trying to convince professionals that this is The New Way, because they stand to gain so much, while workers and consumers stand to gain so little, and the environment stands to lose so much more.

But, if I’m being generous, perhaps it appears that way because it’s not my culture?

It’s important to recognize that all of us experience some aspect of computing as a black box. Whether it is using a model from HuggingFace, an ATM, an app on our phone, or ChatGPT. I think the important question to ask is how do we participate in its the creation and use of this technology?

Can we open the black box? Can we open the other black boxes we find inside? Do we know other people who can? What do they say? What is the culture of this black box? Do I want to be a part of it? Why? Why not? What choices do I have? What about others?

I no longer program in Perl, but I still marvel at how Larry Wall practiced computing as a culture.

References

Seaver, N. (2017). Algorithms as culture: Some tactics for the ethnography of algorithmic systems. Big Data & Society, 4(2). https://doi.org/10.1177/2053951717738104

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 23 July 2024 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by a team of OCLC contributors.

The hands of a young white person against a black background form the first half of the sign for “friendship” in American Sign Language.Photo by Nic Rosenau on Unsplash

Discussing mental illness in the library profession

Author Morgan Rondinelli shares her own experience with mental illness and how that impacts her interactions with patrons in “What’s Missing in Conversations about Libraries and Mental Illness” (posted 19 June 2024 in the open access journal In the Library with the Lead Pipe). Rondinelli, who works at a public library, writes, “I interact with patrons, who like me, are very good at masking or otherwise hiding symptoms. I also interact with patrons who display more socially obvious signs of mental illness.” The variety of mental health conditions and ability of those affected to function in society has often been misunderstood, even with the increased mental health awareness caused by the pandemic. Rondinelli notes the limited library literature focused on librarians with mental illnesses.

Discussion of librarians with disabilities, including those with mental health issues that may not be apparent to co-workers, is an important part of IDEA. Based on my personal experience, I do think literature and programs on this topic have increased in the last few years. ALA’s webpage for Mental Health Resources in Libraries provides resources for supporting patrons, staff, and both. Many of the staff-focused resources are about stress and anxiety, which could be part of a mental illness or a temporary situational response. When I attended the RBMS Conference in June 2024, I was pleased to see self-care topics on the agenda, including short stretching and meditation sessions throughout the conference and the discussion group, “Nurturing Resilience: Strategies for Librarian Well-being,” lead by Rebecca Davis, a recent graduate of Syracuse University’s Library and Information Science program. Disclosing an emotional response like “I’m stressed because of work” is different than disclosing a mental illness, and the stigmatization of persons will mental illness is a very real concern. I admire Rondinelli for the courage and confidence to be open about her mental illness. Bold articles like hers, combined with librarians’ mental health as regular conference program topic, are helping to open doors for more discussion and acceptance as we become more comfortable with saying, “The librarians are not okay.” Contributed by Kate James.

Disability Pride Month resources with TIE and the Herrick District Library

In July 1990, United States President George H.W. Bush signed into law the Americans with Disabilities Act. To commemorate the landmark legislation, Disability Pride Month is marked each July in the U.S.  Toward Inclusive Excellence (TIE), the blog from ALA’s Association of College and Research Libraries (ACRL), brings together diverse resources to keep libraries, especially but not only those that serve higher education, informed about important ideas and initiatives. “Commemorating Disability Pride Month with TIE and Choice Content” collects book reviews, webinars, and other resources about disability rights, history, and inclusivity. Holland, Michigan’s Herrick District Library (OCLC Symbol:  EGH) has also compiled useful lists of more than fifty titles for children and young adults as Disability Pride Month recommendations that can equally serve as collection development suggestions.

Between the Herrick District Library lists and the TIE Blog, a full swath of resources from kids’ picture books all the way through scholarly communications are brought to our attention for Disability Pride Month. Although the celebration originated in the U.S., it has spread to many places throughout the world. The resources cited likewise represent disability inclusiveness internationally, consider accessibility both inside and outside libraries, and include a wide range of disabilities. Contributed by Jay Weitz.

National Indian Boarding School Digital Archive

Reporting from WBUR’s Here and Now focused on the new National Indian Boarding School Digital Archive, which launched in May. Deepa Fernandes interviewed Fallon Carey, interim digital archives manager for the National Native American Boarding School Healing Coalition (NABS). NABS was founded in 2012, as a means of responding to the trauma inflicted on American Indian and Alaska Native Nations due to the US federal Boarding School Policy of 1869. There were over 500 federally funded “boarding schools” that Native children who were forcibly removed from their families were forced to attend. In the interview, Carey characterizes this removal as akin to kidnapping and discusses the effort as an attempt to disenfranchise Native people from their land. The removal resulted in loss of connection with family, language, heritage, and culture. The establishment of the digital archive which helps to document the stories of children who were in the boarding schools will help to serve as a site of reckoning and truth telling to support healing for descendants.

In Canada, a Truth and Reconciliation Commission was established in 2007, as a means of addressing the legacy of Canada’s Indian Residential School system, which was parallel to the US boarding schools. In that time, Canada has made strides towards reckoning with its own legacy of racism and harm. It is striking that all Canadian librarians I have had contact with are not only aware of the TRC but are engaged in local efforts towards a national process that helps to move forward specific recommendations. The National Centre for Truth and Reconciliation Archives exists but is principally comprised of documented related to the TRC. It is impactful to see how archival materials contributed from a range of cultural heritage institutions can be instrumental in both truth telling and healing for impacted individuals and communities.  Contributed by Merrilee Proffitt.

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 23 July 2024 appeared first on Hanging Together.

Restoring Obsidian Knowledgebase from MacOS Time Machine at the Command Line / Peter Murray

While on vacation, I was catching up on some personal knowledge management maintenance I had been putting off. At one task—adding a page for a new employee at the company I work for—I noticed that the page for my company was gone. Odd, that page has been in my knowledgebase for years…let’s go look in the People, Places, and Organizations folder…6 pages?!? There should be at least 60! And look at that…all of my templates are gone, too. What else was missing?!? Here are my project notes for rebuilding my knowledgebase from backups.

I’m using Obsidian as my personal knowledge management tool. Obsidian creates a wiki-like experience for a directory of Markdown files on the local computer. (See an earlier post on DLTJ about my use of Obsidian.) I was using iCloud Drive to synchronize that directory of Markdown files across my laptop, phone, and tablet. Based on some web searching, I’m guessing that a recent change by Apple on how iCloud Drive synchronizes between machines caused files that hadn’t been accessed in a while to disappear—some sort of selective sync function.

My initial reaction—to move my knowledgebase folder out of iCloud Drive and onto the local hard drive—may have hampered the recovery…the files might have been marked as in the cloud and could have been downloaded. But moving my knowledgebase folder and purchasing an Obsidian Sync subscription was what I did. (I have no regrets…I’ll keep the Obsidian Sync subscription. Obsidian has a good, independent development team, and I’m happy to support them.) But how to get the missing files back?

My laptop has two independent backups: a USB-connected hard disk on my office desk using the built-in MacOS Time Machine, and ArqBackup uploading to an encrypted AWS S3 bucket. I was away from my desk for the week, so I first tried restoring from ArqBackup. That didn’t work; more on that below. When I got home, I attempted a recovery from the Time Machine drive.

Recovering an Obsidian Vault from a series of Time Machine backups

Because I didn’t know what files when missing when, I thought the best approach was to successively overlay restores of the knowledgebase directory onto each other. Doing so would mean that there would be files that I had intentionally deleted that would be in the new knowledgebase directory. Still, that is only a dozen or so out of the potentially hundreds of missing files—a much better option than having an unknown number of missing files.

First step: restore the knowledgebase directory for each backup set into a separate directory.

for i in $(tmutil listbackups)                                                                                                                                                            do
echo $i && tmutil restore $i/Macintosh\ HD\ -\ Data/Users/peter/Library/Mobile\ Documents/iCloud~md~obsidian/Documents/Knowledgebase $(basename $i)-Knowledgebase
done

Each iteration through the backup sets showed the number of restored files:

Total copied: 28.50 MB (29888454 bytes)
Items copied: 2916

As I scanned through the items copied for each day, there were no points where there was a sudden drop, so I think the file loss was gradual over some time. The results of this first step were a bunch of directories that looked like “2023-12-26-032655.backup-Knowledgebase”

Next: copy the contents of each backup set, in the order in which the backup sets were created, on top of an empty Knowledgebase directory.

mkdir restored-Knowledgebase
for i in $(ls -d 2*)
do
echo $i && cp -rp $i* restored-Knowledgebase
done

Now lets run some commands to see what we’re dealing with. First, the number of files in the restored Knowledgebase directory: 3,418.

find restored-Knowledgebase -type f -print | wc -l
3418

Next, use the diff command to see the differences between the active knowledgebase in my home directory and the restored knowledgebase, and look for the string Only in restored-Knowledgebase: 480 files restored that aren’t in the active knowledgebase.

diff  --brief --recursive ~/Knowledgebase restored-Knowledgebase | grep "Only in restored-Knowledgebase/" | wc -l                                                                                                480

Now let’s review a list of files that are only in the active knowledgebase. This is a reality check to make sure we’re on the right path, and indeed this lists only the files that were created since the last backup to the Time Machine drive.

diff  --brief --recursive ~/Knowledgebase Knowledgebase | grep "Only in /Users/" | less

One last command to see files that exist in both the active knowledgebase and the restored knowledgebase but don’t have the same contents:

diff  --brief --recursive ~/Knowledgebase Knowledgebase | grep "^Files"

There was only one file with minor differences…a file that I changed in between backups. Happy that everything seems in order, I just copy the contents of the restored knowledgebase on top of the active knowledgebase.

cp -rp restored-Knowledgebase/* ~/Knowledgebase

ArqBackup problems

As I mentioned above, my first attempt at restoring was using ArqBackup. ArqBackup behaved in a way that I didn’t expect…I could use the user interface to restore a file at any point in time or a directory at the point of the latest backup set. What I couldn’t do was what I did with Time Machine: restore a directory at a specific point in time. This seems to be a function of how ArqBackup stores its data. What this means, though, is that ArqBackup is less a solution for restoring directories (or a whole system) at a point in time and better as a disaster recovery when the laptop and the Time Machine drive are unavailable.

One Year of Learning 2023 / Peter Murray

Inspired by Tom Whitwell’s 52 things I learned in 2022, I started my own list of things I learned in 2023. I got well into 2024 before I realized I hadn’t published it! So, in no particular order:

  1. In the summer of 2011, a lab technician at Los Alamos National Laboratory carefully laid 8 rods of plutonium on a bench to take a picture. The rods were almost close enough to cause an uncontrolled fission event. AAAS Science
  2. Runways are named by their magnetic compass heading value divided by 10 (e.g. a runway heading due east—90°—is named “runway 9”). Variations in the Earth’s magnetic fields means that sometimes runways have to be renamed. National Centers for Environmental Information
  3. The impetus behind Google Image Search came from people searching for pictures of Jennifer Lopez in her blue/green dress at the 2000 Grammy Awards. CNN
  4. The eucalypts tree drops oily leaves that eventually burn down the forest, but that kills off surface vegetation and is needed to open the eucalypt seed pods. Cory Doctorow’s Pluralistic
  5. The origins of the nautical terms “port” side (because it was the side of the ship used for loading/unloading) and “starboard” side (from Old English for “steer” and “side of the boat”). NOAA National Ocean Service

Not much in this first year, but I’ve already started a running list for 2024.

The ILS without patron data: a thought experiment realized with FOLIO / Peter Murray

In the previous blog post, I outlined the concept of a library system with no personally identifiable information as a way to safeguard a patron’s right to privacy. Library systems commonly retain traces of a patron’s library activity, and the librarian ethos protects a patron’s privacy as they conduct their research and borrow items from the library. Suppose our library systems decoupled patrons’ personal information from their library activity. In that case, the risk of leaked information from the library systems is significantly reduced.

In this blog post, I examine how a modern library service platform could be modified to handle this minimal personal knowledge system. As you may recall, this proposed system uses pairwise-subject-identifiers (“pairwise-id”) from an organization’s identity provider (“IdP”) to identify people. Our service provider (“SP”) uses that identifier internally and calls external services that can find out who the pairwise-id is when necessary. I’m using the library services platform with which I’m most familiar: FOLIO. As an open-source library services platform, FOLIO offers a relatively straightforward path for such customizations. In the following sections, I’ll examine what our library system SP needs to do when encountering a new pairwise-id for the first time, how to send patron notices and bill patrons, and changes to the hold-request subsystem. I’ll also discuss some changes that are needed to FOLIO itself. For the sake of brevity, I’m calling this FOLIO version “FILP” — the FOLIO Identity Limited Platform.

A New Pairwise-ID is seen at FOLIO login

Screen capture of the SSO settings pane. It contains four fields: Identity Provider URL, SAML binding, SAML attribute, and User property. There is also a button labeled 'Download metadata' The FOLIO Settings → Tenant → SSO settings pane

FOLIO includes a SAML SP endpoint that assumes user records have already been loaded into the system. Configuring this endpoint requires naming the SAML attribute that will contain the person’s unique identifier and which field of the FOLIO user record has that identifier. In this example, the FOLIO SP is looking for the user identifier in the uid SAML attribute from the IdP and will search the contents of the External System ID field in the user record.

In our FILP version, we could use the SAML module unmodified; we would need to pre-load user records with the pairwise-id in the External System ID field. FOLIO user records have four required fields: patron group, active/inactive status, email address, and last name. The pre-loaded data would include the patron group appropriate for the pairwise-id patron and an “active” status. The pairwise-id is also copied into the email address field; I’ll describe later in this post how the pairwise-id is used in the FILP version of the email module. In the last name field, we can put the three-random-word phrase that will be used for hold-pickup notices. (More on this in the holds section below.)

Our FILP SAML login module can also create user records on-the-fly when a new pairwise-id is seen. The IdP sends attributes (such as “student” or “faculty”) to the FILP SP that are needed to determine the appropriate patron group; the settings for the SAML module would contain a table that maps those attributes to patron groups. The pairwise-id is copied to the email address field, and a random last name will also be recorded in the new user record.

New Email Delivery Module

FOLIO has a built-in email module with a simple API for outbound email. Other FOLIO modules send a POST to the /email endpoint with a JSON body that contains the email details, including the to address and the body of the message. The built-in email module has configuration settings for the SMTP server, and it takes responsibility for sending the message.

Our FILP version of the email module has the same API signature as the built-in module: it listens for POST requests to the /email endpoint and accepts an identical JSON body. It is a drop-end replacement; the other modules in the FOLIO system don’t know that they are communicating with a FILP-enabled email module.

Remember from the previous post that the IT group running the IdP will need new services that act on behalf of our library system in cases where a patron’s identity must be known. One such service sends an email to a pairwise-id (say, the “IdP Pairwise Email Service”). This service takes the pairwise-id and looks up the actual email address. Also remember that we copied the pairwise-id to the email address in the user record. Our FILP email module reads the JSON body to get the pairwise-id in the ‘to’ field, then sends it and the message contents to the IdP Pairwise Email Service. The IdP Pairwise Email Service returns a success or failure message, which our FILP email module records in its database.

New Fee-Fine Module

Like the FOLIO email notification module, there is a single point that FILP will need to override to send fee/fine information to an external agent. Also, similar to the email module, the IT group running the IdP will need an IdP Pairwise Billing Service. When that service is given a pairwise-id, a charge/credit amount, and a message, it will post a transaction against the patron’s organization account. FOLIO’s existing fee-fines module has a POST method to create a new fee and a PUT method to modify an existing fee. The FILP version of the fee-fine module is a drop-in replacement for those /feefines and /feefines/{feefineId} API endpoints, and it accepts the same JSON bodies as those endpoints. The ownerId field in the JSON body is the FOLIO user record identifier, and our FILP feefine module uses that identifier to look up the pairwise-id in the user record to forward the data to the IdP Pairwise Billing Service.

No changes to the Requests module

The third example from the previous blog post of the impact of our FILP minimal-personal-knowledge library system was item request pickup slips. For context, the typical hold-paging-request workflow is for the library to print a paging slip that contains the title, author, and shelving location of the requested item along with the patron’s name and contact information. The pickup slip is attached to the book and placed on a hold shelf for the patron to pick up. In this typical workflow, the patron’s name is intimately tied to the requested material.

Instead of printing the patron’s name, we use a random three-word phrase stored in the FOLIO user record’s last name field when the record was created. That random phrase is printed on the pull slip. When FOLIO sends a hold pickup notice to the patron, the {{user.lastName}} replacement token is available to insert in the body of the message:

The item you requested, {{item.title}}, is now ready for pickup at the main library hold shelves. Items on the pickup shelves are sorted alphabetically using a three-word phrase. Your three-word phrase is {{user.lastName}}.

Changes Required within FOLIO

An important point in this description of how the pairwise-id is used in FOLIO is that the patron is the one logging into FOLIO to perform these actions. Currently, FOLIO performs circulation operations like a typical integrated library system: to check out an item to a patron, a staff member logs into FOLIO with privileges to perform the checkout function. That checkout function allows staff members (with the required permissions) to check out any item to any user record.

~~In our FILP FOLIO, though, the staff member won’t be able to scan a patron’s barcode to identify the patron…the patron will need to log in through the IdP single sign-on system so the pairwise-id is transmitted to FOLIO.~~ There is a correction in the next blog post about how it is possible to use the existing Check-out app. Since it is the patron that is logged into FOLIO at this point, we will need a new API endpoint for a function that checks out an item only to the logged-in user record (rather than any user record).

FOLIO differs from previous library systems in that patrons are “first class” users. The only thing that differentiates a library staff member’s account is the permissions on their user record. As described above, an access service staff member will have permission to use the Checkout app to register a loaned item on any user record. A patron user will need a permissions set that allows access only to their user record. Several other endpoints will need similar modifications: an endpoint that records a hold request for the logged-in user, an endpoint that allows someone to set notification and pickup preferences for themself, an endpoint that requests a renewal for a checked-out item, and so forth.

Conclusion and the Way Forward

FILP, as described above, still has some potential ways to correlate library activity to a specific patron and possibly de-anonymize that person. This blog post is already nearly 2,000 words, so I put that discussion plus a few other open questions in the next post.

FOLIO’s architecture is excellent because it is almost possible to build the FOLIO Identity Limited Platform—FILP—today. Replace a few back-end modules and add API endpoints where capabilities are scoped to an individual user record, and we’re pretty much there. This article’s subtitle is “a thought experiment realized in FOLIO”. It is almost enough for a statement of work.

I’ll add a plug for the company that I work for here in the last paragraph. If your library would like to do this with FOLIO, Index Data specializes in this type of software development. Few things would please me more than having the chance to build this into FOLIO. Contact me if you want to discuss this further or enter into a development agreement to add this capability to the FOLIO open source codebase.

The ILS without patron data: open questions / Peter Murray

In my prior two posts, I outlined a strategy to minimize personally identifiable information in library automation systems (idea overview, impact on FOLIO). This approach uses a unique single-service identifier (the “pairwise-id”) recognized exclusively by the identity provider (IdP) and the library’s service provider (SP), effectively preventing any cross-system correlation of an individual’s activities. The only personal information the library system stores is the pairwise-id, meaning that there are no exposed names, addresses, phone numbers, or other demographic details in the event of a system breach. When the library system needs to notify the user or post a charge to the user’s account, it invokes the “IdP Pairwise Email Service” and the “IdP Pairwise Billing Service.”

You might ask why we’re going to these lengths. Why put in the work to create these extra email and billing services? The goal is simple: to make potential attacks less fruitful. By limiting the storage of personal information and narrowing the APIs that access it, we create fewer avenues for potential exposure. This approach frees resources to focus on the remaining sections, like the IdP, Pairwise Email Service, and Pairwise Billing Service, which access personal information.

This approach also strengthens the privacy of the remaining library workflows. For example, access services staff members only see the pairwise-ids, not the patrons’ actual names or other personal details, as they check-in items or process hold requests. Of course, there may be circumstances when library staff need access to a patron’s personal information. To accommodate such cases, we could add a new FOLIO app that retrieves these details for authorized personnel. Any such access would be recorded and subject to auditing to prevent misuse.

In this post, I’m finishing up this series (for the time being?) with a collection of additional details and open questions - starting with a correction.

Correction: Library Staff Check-out app

I added a correction to the previous post about implementing a patron-data-minimizing library service in FOLIO. In the section about checking out a book, I mentioned that the only way for a user to check out an item was for the user to log into the library system via the IdP. Also, I said that existing functions could not be utilized, such as when a library staff member with appropriate permissions uses the Check-out app to register an item loaned to a patron by scanning the patron’s ID card barcode.

My colleague Mike Taylor pointed out that this was incorrect. In my own mind, I had taken the minimal patron record one step too far. We can indeed use the barcode field in the user record; this barcode could either be from a pre-loaded patron record or supplied by the IdP as an attribute when the patron logs in for the first time. Once the barcode is in place, the existing Check-out app can function as it currently does. Nevertheless, libraries must be mindful of potential risks as barcodes are visually accessible and not as easily changed as passwords.

Securing Circulation Station

Related the Check-out app, we need a strategy to control where check-outs can occur. If a patron is logging into FOLIO to use the Check-out app, we’ll ideally want this process confined to the library building. A potential solution might involve using client HTTPS certificates; with this method, FOLIO would only provide access to the Check-out app if the user’s browser presents a client certificate installed exclusively on the circulation stations. Keycloak could be beneficial in this regard. In EBSCO’s presentations about Keycloak replacing the original authentication mechanism, location-based login was highlighted as an advantage.

Deanonymization

While these modifications have minimized personal data in the library system, we haven’t completely eliminated it. A patron’s activity itself — the stream of topics browsed, articles downloaded, and items borrowed — can act as a fingerprint of their interests. The elements of this fingerprint can be quite distinctive when considering their content, time-of-day, and location. With sufficient data, an intruder could potentially link the activity back to the individual behind the pairwise-id.

There are strategies to mitigate the risk of accumulating patron activity. For example, the IdP could generate a fresh pairwise-id for each login by the patron. In this scenario, the IdP would need to maintain a record of all pairwise-ids, and would likely want to implement automatic user provisioning (where the library system SP automatically generates a new user record for every new pairwise-id).

This approach presents new challenges, such patron blocks that rely on the maximum number of checked-out items or the maximum amount of fees levied on a patron. Since the patron’s activities are now scattered across multiple user records in FOLIO, we need to introduce a “Pairwise Block Check Service.” This service would take a pairwise-id, track down all other pairwise-ids tied to the same patron, and tally their total loans and library fees. It would return a yes or no answer on whether the circulation transaction can proceed.

Deanonymization is a topic where a lot of research is ongoing. We would want to engage these researchers to make sure our approach of limiting the correlation of patron activity is sound.

Discovery integration

FOLIO doesn’t come with a built-in discovery layer. This was an intentional design decision, aimed at defining clear boundaries that allow for the integration of a library’s preferred discovery layer using well-defined,and versioned APIs. As it stands, all known discovery layer integrations connect to FOLIO using a central account with permissions to access all users’ circulation records. These records are fetched using a patron identifier, such as the pairwise-id. However, this method makes the discovery layer’s FOLIO user account as a potential security vulnerability.

Ideally, we would want each patron to log into FOLIO using their own account. Doing so would naturally restrict each user’s visibility to their personal record. At the moment, I’m uncertain whether such an indirect (transitive) login setup is feasible. In other words, can a patron log into their chosen discovery layer via the IdP, and could the discovery layer then use this authentication to log into FOLIO?

All Done?

So, I think that is it…I’ve gotten all the parts of this idea rolling around in my head out into the world. Thanks for the discussions on Mastodon and elsewhere about the specifics, and I’m looking forward to hearing more thoughts and, if necessary, integrating them into a fourth blog post.

I feel compelled to express gratitude for having a system like FOLIO to explore this idea in a tangible way. FOLIO’s primary emphasis on an API-first approach makes this concept feel more feasible. When I say API-first, I mean there are no hidden APIs within FOLIO: for every task that can be performed in the user interface, a well-defined, versioned API exists to facilitate the same function. Beyond the user interface, the modules within FOLIO are compartmentalized by function and communicate with each other using the same well-defined, versioned APIs. As a result, replacing a module to adapt FOLIO for unique uses is entirely viable.

The ILS without patron data: a thought experiment / Peter Murray

Library systems hold significant information about patrons, including their search and reading histories. For librarians, ensuring the privacy and confidentiality of this data is an essential component of professional ethics. In the United States, for example, the third point in the American Library Association Code of Ethics is “We protect each library user’s right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.”

To understand this better, consider how the Video Privacy Protection Act of 1988 arose in the U.S. after the controversy surrounding the publication of Robert Bork’s video rental history. A year earlier, Robert Bork was nominated to the U.S. Supreme Court. In the course of his confirmation hearing, a reporter published Bork’s video rental history. Although this list of videos were not a factor in his rejected nomination, that the list was published was found to be outrageous enough spur Congress to pass the law. Similarly, if your library records were made public, it could well be embarrassing and intrusive. (Side note: While there is no federal protection for personal library records like those for video rental records, state laws offer a patchwork of protections.)

Library systems, like the video rental systems of old, tie personally identifiable details with patron activity. So, what if we could separate these details? Before we delve into this, let’s define some terms related to Federated Identity systems. Skip these sections if you know about Federated Identity systems.

Federated Identity Systems: Identity Providers and Service Providers

In our complex world, library services often come from multiple providers. Rather than have the hassle of separate logins and passwords, it is common for these providers to call back to a central service where a people can prove they are who they say they are. The place where people log in called an Identity Provider (IdP). The place where people want to go is called a Service Provider (SP). A Federated Identity System is a trust relationship and a set of agreements/technologies that enable the sharing of identity information and authorizations across systems. It allows people to access resources and services across different systems using a single set of credentials, typically managed by their Identity Provider. (IdPs are sometimes called Assertion Parties because these are the software systems in the trust relationship that assertions about who a user is; SPs are sometimes called Relying Parties because they are trusting the IdP’s assertions.) Federated Identity systems exchange attributes about someone. Those attributes can be specific to a person, like “name” and “email address”, or general categories, like “student” or “community-member”. Attributes can also have special meanings to the IdP and SP, like Pairwise-Subject-ID.

Pairwise Subject Identifier

An identifier that is specific to a user is called a “subject identifier”. These typically look somewhat like an email address with parts specific to both the user and the organization. For example: murraype@dltj.orgmurraype is specific to me and dltj.org gives the identifier context to my organization. In a Federated Identity system, the same subject identifier is given to every SP that asks for it.

However, if we don’t want multiple SPs correlating a user’s activities, we can use a “pairwise-subject-identifier”. Within this workflow, the IdP sends different identifiers to different SPs for the same person, making the identifiers unique to each IdP-SP pair. More formally, pairwise-subject-identifier (“pairwise-id”) is defined this way:

This is a long-lived, non-reassignable, uni-directional identifier suitable for use as a unique external key specific to a particular relying party. Its value for a given subject depends upon the relying party to whom it is given, thus preventing unrelated systems from using it as a basis for correlation.

Typically opaque, these identifiers don’t offer additional information to the SPs trying to correlate activities between users. For instance, the pairwise-id between IdP-SP#1 is uGDJVRxK48E@dltj.org and the pairwise-id between IdP-SP#2 is T6vNM9v5tUna@dltj.org. Not only are the two Service Providers (SPs) unable to determine if the identifiers belong to the same person, but the identifiers themselves also lack any inherent information that would allow the SPs to ascertain the individual’s identity.

Pairwise-ID as THE library system ID

In our ideal library system aiming to minimize personal data collection, the pairwise-id becomes the unique identifier in the library system. (There are some drawbacks to using the pairwise-id as the unique identifier…see the discussion in the third post of this series.) The first time the library system’s SP gets a new pairwise-id, it creates a new user record in the system. The system uses other attributes from the IdP to determine privileges for this new record - for instance, a “student” status gets a normal loan period, a “faculty” status gets an extended loan period, and a “conference visitor” status gets blocked from borrowing.

The library SP is trusting the attributes received by the IdP—see the discussion above about the trust relationship for the assertions—so it does not need prior knowledge about the patron. So other than knowing that the person is a specific individual with a recognized status in the organization, the library system knows nothing about the patron. If the patron’s borrowing and search history are leaked from the library system, the system’s leaked records has nothing else to offer to tie those to a person. (Again, there are de-anonymizing nuances, but for later discussion.)

…but I need to send overdue notices to the patron

Let’s consider some operational aspects that usually require personal data: sending overdue notices, applying fees to a patron, and handling patron requests. The library system knows enough about its patron community to check out books to authorized users—people with attributes coming from the IdP that we trust and use to set how long the loan needs to be. What if a user keeps a book too long…we need a way to send a notice to a person to return the book and to bill them when they don’t return it. But the only thing the library system has is an opaque identifier that only has meaning at the IdP.

Library systems are typically self-contained: they send their own email messages and have their own billing systems for keeping track of patron charges. In a library system without patron data, though, we need to rely on others with more information about the person to handle those tasks.

Let’s take the example of sending notices to the patron. Rather than the library system doing sending the notice itself, our system tells another system to do it. The group that runs the IdP has a service that, when given a pairwise-id and the content of a message, will send that message to the patron for us.

Another example: billing the patron when they say they’ve lost the item or the library declares it missing. The IdP group has another service that takes in the pairwise-id, a currency amount, and a description then adds that information to the person’s central account. The library keeps track of the fact that a pairwise-id has been billed, but it never knows the person behind that identifier. If the item turns up again, our library system reverses the charge: it sends the pairwise-id, a credit amount, and a credit description.

Library patrons also request items be held for them; what do we do in this case? When someone requests an item, the library system prints a “paging slip” that is used to get the item from the shelf. The paging slip has information about the item—its title, author, and shelving location—as well as information about the person who requested it. The paging slip usually turns into the hold pick-up slip; it is taped to the outside of the book and shelved alphabetically by the patron’s last name. There is a serious privacy downside to this workflow, though: everyone from the staff member pulling the item to the other users browsing the hold-pickup shelf can see the name of the person who asked for it. Instead, our library-system-with-no-names prints a random three-word phrase to stand in for the name of the person who asked for the item. This same three-word phrase is sent in the hold-pickup message to the library patron so they can find the item on the hold-pickup shelf.

But could we build it?

While this thought experiment is theoretical, could a real-world library system actually function this way? In the next post, we’ll explore possible adaptations for the FOLIO Library Services Platform to turn theory into practice.

On Open Library Services: Reflections from the GIL User Group Meeting / Peter Murray

In May 2023 I was asked to join the opening session at Georgia’s GIL User Group Meeting. Along with Chris Sharp and Emily Gore, we reflected on the conference theme: The Future is Open. GALILEO has an exciting time ahed of it…their libraries are adopting FOLIO and a new resource sharing system. Below is a lightly edited version of my remarks during the panel, and a recording of the keynote panel is available on YouTube.

Tell us a little bit about your experience working with “open” library services.

In my experience, “open” is built into the ethos of libraries. I mean…even if we look at just the last century, we have the Library of Congress starting the National Union Catalog project in 1901—that was about sharing the contents of cards in the catalog—and ALA establishing a code of practice for interlibrary loan in 1917.

My career in libraries has always been about the open; I started in 1991 at the same time OhioLINK was forming, and I remember many trips to Columbus, Ohio, to work out processes and share tips-and-tricks with each other. I was even giving away code and adapting code from others before the phrase “open source” came into common use. Over the course of my career, I’ve worked on or with several open source projects: FEDORA, Islandora, ArchivesSpace, CollectionSpace, FOLIO and ReShare.

Standards are also an important part of “open” — in order to ease the process of us working together and our systems working together, it helps to have a common starting point to build on. I’ve been working on NISO projects and committees for most of my career, and it warms my professional heart to see better services come about for patrons because there is agreement on how the pieces should be put together.

What are the biggest advantages and risks in working with open services?

The biggest advantage is having a seat at the table as decisions are made. Working in the open means bringing the best of your experience and the needs of your patrons to bear as products, services, and software are designed. It is so much easier to have that input at the front end rather than trying to retrofit a system to your needs after the fact. With many voices and perspectives in the creation process, it also reduces the chances that something important will be missed.

The biggest risk is time and patience. Having many people involved in the design process means that it takes time to listen to those perspectives and effort to synthesize the way forward for the group. There will be misunderstanding and there will be compromise. There may even be paths that you want to pursue, but the group isn’t willing to follow. And of course the is the risk that the path the group follows may not be fruitful.

How have you seen library attitudes to “open” change in past years?

There seem to be more variations of open now. Last month’s article from OCLC Research had a catalog of openness: open access, open data, open educational resources, open science, and open source. So in those you cover publishing, research activity and outputs, educational materials, and software systems.

What should USG libraries be thinking about as we begin migrating to FOLIO and ReShare?

You are at a crossroads. There is a lot of new stuff coming at you, and the temptation will be to make the new work like the old. I was involved in the early design process for FOLIO and I’ve watched how those apps evolved, and then how the ReShare apps came about. What I said earlier about the librarians and library technologists pouring their experience into their design is true, and it continues today. So I think you should take a risk to open yourselves up to new and hopefully more efficient and effective workflows. I’m pretty sure those already in the community will be welcoming and help you with the process. And then, once you got your feet under you, see where you can bring your experience and perspective to the ongoing development work.

What other “open” initiatives are you excited about?

The progress of the open access movement is fascinating. There is the phrase that progress happens one retirement at a time — I used to chuckle at that, but after 30 years in the profession that phrase is less funny and more stinging. The slow but steady progress seems to be real, though. It has reached the stage where government mandates are making it happen. See, for instance, last year’s memo from the White House Office of Science and Technology Policy giving guidance to federal departments to make the results of taxpayer-supported research immediately available to the American public at no cost. Also the announcement earlier this month from the EU on a policy to require pubicly-funded research to be made available at no cost to readers and at no cost to authors.

How can libraries and GALILEO better support “open” initiatives?

In one important way, GALILEO is in a privileged position right now with FOLIO and ReShare. Many of us have been involved in the projects for a long time, and we’ve lived through the process that got those platforms where they are today. We can’t see them clearly from the outside anymore. If there is any room left in your implementation plans, I encourage you to note where you struggle to find and understand what you need to know. Those are the places where feedback can help us improve the process for the next libraries that come into the project. Even if you don’t have time to make the improvements now—and I expect you won’t—just jotting those ideas down in a notebook and coming back to them after your implementation will help.

Ghost Newsletter Software Findings: Got Past the Mailgun Problem, but Got Stuck On Ugly HTML / Peter Murray

This was going to be only a post about how I got the Ghost newsletter software to use Amazon Simple Email Service (AWS SES) instead of the built-in Mailgun support, but it turned into that plus why I can’t use Ghost for the DLTJ Newsletter.

Ghost’s bulk email delivery problem

One of Ghost’s downsides is that it only supports the Mailgun service for delivering newsletter issues. Ghost can use any email delivery agent for what it calls “transactional” email: email verification on new accounts, password resets, using email to log in, etc. Of course, the point of email newsletter software is to send issues as email, so limiting a core feature to one mechanism is rather unfortunate.

There are many threads on the Ghost support forum about using bulk email services other than Mailgun, but this post from September 2022 has a reply from a Ghost staff member about why this is so hard: they tie email analytics (reports of who opens email and who follows which links) to the each user. That functionality needs deeper integration with the bulk email service than just sending email. I don’t care about that — in fact, I make sure that DLTJ and its newsletter don’t gather reader details — but I get that this is important to some people.

The problem with Mailgun is that it has become quite expensive for self-hosted hobbyists to use. They used to have a “hidden” pay-as-you-go service that was pretty inexpensive, but earlier this year they eliminated that. Now there are reports of Ghost users having to pay $35/month to deliver just a couple hundred emails. (AWS is expensive, but not that expensive!)

In that same Ghost support forum thread mentioned earlier, this is one line about someone who solved this problem pointing to a Spanish-language post. That post, in turn, points to the ghosler software from ItzNotABug on GitHub. That software uses a webhook that Ghost fires whenever an issue is published. Ghosler reads the subscriber database from Ghost and then sends email to any SMTP endpoint…precisely what I need!

Using Ghosler with AWS SES in a Docker Compose stack

One requirement that I have is to run this software in Docker containers for ease of management and coexistence with other software. There are several examples of running Ghost in Docker; my way is certainly not the only way to do it. Another of my requirements is to put anything that isn’t public-facing on my Tailscale network. So you’ll see that mentioned in this Docker Compose file as well. There are two Compose files: one for Mariadb, phpMyAdmin, and Ghost and another Compose file that builds Ghosler.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
## File: ghost/docker-compose.yml
services:
  mariadb:
    image: mariadb:11.4.2
    container_name: mariadb
    command: --default-authentication-plugin=mysql_native_password
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: ...root-password...
    volumes:
      - mariadb_data:/var/lib/mysql
    ports:
      - 3306:3306

  phpmyadmin:
    image: phpmyadmin:5.2.1
    container_name: phpmyadmin
    network_mode: service:ts-phpmyadmin
    depends_on:
      - mariadb
      - ts-phpmyadmin
    environment:
      PMA_HOST: mariadb
      PMA_PORT: 3306
      PMA_ARBITRARY: 1
      APACHE_PORT: 8080
    restart: unless-stopped
  ts-phpmyadmin:
    image: tailscale/tailscale:latest
    hostname: aws-phpmyadmin
    environment:
      - TS_AUTHKEY=...tailscale-auth-secret...?ephemeral=false
      - TS_STATE_DIR=/var/lib/tailscale
      - "TS_EXTRA_ARGS=--advertise-tags=tag:container"
      - TS_SERVE_CONFIG=/config/ts-serve-config-phpmyadmin.json
    ports:
      - "8080:8080"
    volumes:
      - ts-data-phpmyadmin:/var/lib/tailscale
      - ts-config:/config:ro
      - /dev/net/tun:/dev/net/tun
    cap_add:
      - net_admin
      - sys_module
    restart: unless-stopped

  ghost:
    image: ghost:5-alpine
    container_name: ghost
    restart: unless-stopped
    ports:
      - 8081:2368
    networks:
      - caddy_net
      - default
    environment:
      # see https://ghost.org/docs/config/#configuration-options
      enableDeveloperExperiments: true
      database__client: mysql
      database__connection__host: mariadb
      database__connection__user: ghost
      database__connection__password: ...db-user-password...
      database__connection__database: ghost
      # Configure SMTP server for Ghost
      mail__from: newsletter@newsletter.dltj.org
      mail__transport: SMTP
      mail__options__host: email-smtp.us-east-1.amazonaws.com
      mail__options__port: 465
      mail__options__auth__user: ...AWS-access-key...
      mail__options__auth__pass: ...AWS-secret-key...
      mail__options__secure_connection: true
      mail__options__service: SES
      mail__from: "'DLTJ Newsletter' <newsletter@dltj.org>"
      url: https://newsletter.dltj.org
      # contrary to the default mentioned in the linked documentation, this image defaults to NODE_ENV=production (so development mode needs to be explicitly specified if desired)
      #NODE_ENV: development
    volumes:
      - ghost_data:/var/lib/ghost/content

  caddy_reverse_proxy:
    # Use the caddy:latest image from Docker Hub
    image: caddy:latest
    restart: unless-stopped
    container_name: caddy_proxy
    ports:
      - 80:80
      - 443:443
    volumes:
      # Mount the host Caddyfile
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config
    networks:
      - caddy_net

networks:
  caddy_net:
  default:

volumes:
  mariadb_data:
  ghost_data:
  caddy_data:
  caddy_config:
  ts-config:
    external: true
  ts-data-phpmyadmin:
    driver: local

## Contents of `ts-serve-config-phpmyadmin.json`
## {
##   "TCP": { "443": { "HTTPS": true } },
##   "Web": { "${TS_CERT_DOMAIN}:443": { "Handlers": { "/": { "Proxy": "http://127.0.0.1:8080" } } } },
##  "AllowFunnel": { "${TS_CERT_DOMAIN}:443": false }
## }
##
##
## Contents of `Caddyfile`
## {
##     email jester@dltj.org 
## }
## newsletter.dltj.org {
##     reverse_proxy ghost:2368
## }

The parts prefaced with ts- are for the Tailscale Docker container (documentation). ts-config is a Docker volume where I store Tailscale configuration files, of which ts-serve-config-phpmyadmin.json is one.

Inside the Ghost directory, I cloned the Ghosler software: git clone https://github.com/ItzNotABug/ghosler.git. Then, I added this Docker Compose file to that directory.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
## File: ghost/ghosler/docker-compose.yml
services:
  ghosler:
    container_name: ghosler
    build: .
    network_mode: service:ts-ghosler
    restart: unless-stopped
    depends_on:
      - ts-ghosler
    tty: true
    volumes:
      - ghosler_data:/usr/src/app
      - ./configuration/config.production.json:/usr/src/app/configuration/config.production.json:rw

  ts-ghosler:
    image: tailscale/tailscale:latest
    hostname: ghosler.internal
    environment:
      - TS_AUTHKEY=...tailscale-auth-key...?ephemeral=false
      - TS_STATE_DIR=/var/lib/tailscale
      - "TS_EXTRA_ARGS=--advertise-tags=tag:container"
      - TS_SERVE_CONFIG=/config/ts-serve-config-ghosler.json
    ports:
      - "2369:2369"
    volumes:
      - ts-data-ghosler:/var/lib/tailscale
      - ts-config:/config:ro
      - /dev/net/tun:/dev/net/tun
    cap_add:
      - net_admin
      - sys_module
    restart: unless-stopped
    networks:
      - ghost_default

volumes:
  ghosler_data:
    driver: local
  ts-data-ghosler:
    driver: local
  ts-config:
    external: true

networks:
  ghost_default:
    external: true

An interesting bit here is network_mode: service:ts-ghosler. Documentation about this is hard to come by (as noted in the Docker forum), but what this does is put the ghosler and ts-ghosler containers in the same network namespace. To the outside, it looks like one machine.

When you follow the directions for setting up Ghosler’s webhook in Ghost, you’ll need to go into the Ghost configuration and change the URL of the webhook so that it is http://ghosler.internal:2369/published — the Docker hostname and port. I found that Ghosler didn’t know enough about itself to set this automatically.

The problem with Ghost

So, having set up Ghost and its side-buddy Ghosly in Docker and confirmed that I could deliver email newsletters, I set about importing my past newsletter issues into Ghost. And here is where I got stuck.

My blog has gone through two phases: the Wordpress phase from its origin in 2008 to mid-2015, and the Jekyll static site generator phase from 2015 to the present. My posting history is a mixture of HTML exports from Wordpress and Markdown files with some moderately elaborate include macros. Even back in the Wordpress days, the posts were constructed in HTML first rather than using the WYSIWYG editor. The way content is laid out on the page is moderately important to me using good—or at least increasingly better—semantic HTML. (As with all things, learning improved semantic HTML is an ongoing process.)

I decided to use Ghost’s “Universal Import” format to migrate content. This is a JSON file that Ghost uses to transport a newsletter from one installation to another, so it seemed to promise the highest fidelity. In fact, I found that if I took a Ghost export JSON file and replaced the posts array with entries that looked like this, the import would go fine:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
                post_entry = {
                    "id": str(post_id),
                    "uuid": str(post_uuid),
                    "title": title,
                    "slug": slug,
                    "mobiledoc": None,
                    "lexical": None,
                    "html": post_html,
                    "comment_id": str(post_id),
                    "plaintext": None,
                    "feature_image": None,
                    "featured": 0,
                    "type": "post",
                    "status": "published",
                    "locale": None,
                    "visibility": "public",
                    "email_recipient_filter": "all",
                    "created_at": pub_date,
                    "updated_at": pub_date,
                    "published_at": pub_date,
                    "custom_excerpt": None,
                    "codeinjection_head": None,
                    "codeinjection_foot": None,
                    "custom_template": None,
                    "canonical_url": None,
                    "newsletter_id": None,
                    "show_title_and_feature_image": 0,
                }

Note that there is a field for lexical that I’m leaving empty and a field for html that I’m setting to the post’s HTML. Ghost—being a JavaScript application—uses Lexical as its native internal rich text format. And it helpfully converts HTML to Lexical on import. The problem is that Lexical is a (very) lossy format, so HTML that used to look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
<h2 id="p25973-card-based-qa-sessions">Index Card-based Question and Answer Sessions</h2>
<blockquote>
  <p>Here is the formula:</p>
  <ol>
    <li>Throw away the audience microphones.</li>
    <li>Buy a pack of index cards.</li>
    <li>Hand out the cards to the audience before or during your talk.</li>
    <li>Ask people to write their questions on the cards and pass them to the end of the row.</li>
    <li>Collect the cards at the end of the talk.</li>
    <li>Flip through the cards and answer only good (or funny) questions.</li>
    <li>Optional: have an accomplice collect and screen the questions for you during the talk.</li>
  </ol>
  <p>Better yet, if you are a conference organizer, buy enough index cards for every one of your talks and tell your
    speakers and volunteers to use them.</p>
</blockquote>
<div style="text-align: right; width: 100%;"><cite>- <a
      href="http://blog.valerieaurora.org/2015/06/23/ban-boring-mike-based-qa-sessions-and-use-index-cards-instead/"
      title="Ban boring mike-based Q&amp;A sessions and use index cards instead | Valerie Aurora">Ban boring mike-based
      Q&amp;A sessions and use index cards instead</a>, by Valerie Aurora, 23-Jun-2015</cite></div>

…comes out of Ghost looking like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
<h2 id="index-card-based-question-and-answer-sessions">Index Card-based Question and Answer Sessions</h2>
<blockquote>Here is the formula:</blockquote>
<ol>
  <li>Throw away the audience microphones.</li>
  <li>Buy a pack of index cards.</li>
  <li>Hand out the cards to the audience before or during your talk.</li>
  <li>Ask people to write their questions on the cards and pass them to the end of the row.</li>
  <li>Collect the cards at the end of the talk.</li>
  <li>Flip through the cards and answer only good (or funny) questions.</li>
  <li>Optional: have an accomplice collect and screen the questions for you during the talk.</li>
</ol>
<blockquote>Better yet, if you are a conference organizer, buy enough index cards for every one of your talks and tell
  your speakers and volunteers to use them.</blockquote>
<p>- <a
    href="http://blog.valerieaurora.org/2015/06/23/ban-boring-mike-based-qa-sessions-and-use-index-cards-instead/">Ban
    boring mike-based Q&amp;A sessions and use index cards instead</a>, by Valerie Aurora, 23-Jun-2015</p>

Problems that I immediately spotted:

  1. The <h2> fragment id was replaced, which makes the old links that used it worthless.
  2. The ordered list got put outside the blockquote. In fact, I noticed this in other imported issues…multi-paragraph blockquotes got put into individual paragraph blockquotes, and that rendered weirdly.
  3. The styling and <cite> tag were discarded.
Side-by-side comparison of Ghost (left) with original.

I use all three of these things extensively in almost all of the DLTJ Thursday Thread newsletter issues. So, yeah, stuck. I’m unsure if this is a problem with Ghost’s implementation of Lexical or with Lexical itself, but I don’t know enough JavaScript to find out. So, I’m abandoning my Ghost effort. For completeness in these notes, here is a web archive of this same newsletter issue (link to original post).

<noscript><img alt="Archived version of a sample issue using the Ghost software." src="https://dltj.org/assets/images/2023/2023-02-04-tweet-1585816108908662788.png" /> </noscript>

By the way, check out my ReplayWeb for Embedding Social Media Posts (Twitter, Mastodon) in Web Pages article to see how that web page archive is embedded in this blog post.

Processing WOLFcon Conference Recordings with FFMPEG / Peter Murray

WOLFcon—the World Open Library Foundation Conference—was held last month, and all of the meetings were recorded using Zoom. Almost all of the sessions were presentations and knowledge-sharing, so giving the recordings a wider audience on YouTube make sense. With nearly 50 sessions, though, manually processing the recordings would make the process quite challenging. I created a pipeline of ffmpeg commands that does most of the grunt work and learned a lot about ffmpeg command graphs along the way. Here are the steps:

  1. Clip the videos from the Zoom recordings.
  2. Rescale the recordings to 1920x1080, if necessary.
  3. Create a title card to add to the front of the recording.
  4. Merge the title card, session recording, and outro video into a single video.
  5. Upload the videos to YouTube using a script.

Some hard-learned lessons along the way:

  • Ffmpeg’s subtitles filter does not play nicely with filtergraphs that have more than one video input. I needed to create a separate step for burning the title card “subtitles” into the intro video and then concat that video with the session recording and outro videos.
  • Ffmpeg’s xfade filter does not like it when its source video is trimmed. No matter what variations of filters I used, it was always a hard cut between the intro video and the session recording. To solve this, I made a separate step for clipping the longer meeting recording to just the session content. I used a lossless Constant Rate Factor (-crf) to not lose too much detail with the multiple encoding steps.

I’m documenting the steps here in case they are helpful to someone else…perhaps I’ll need this pipeline again someday.

Clip sessions from recordings

Each room of the conference was assigned a Zoom meeting. These Zoom meetings allowed remote participants to join the session, and the meetings were set to record. This meant, though, that several minutes in the recording at the start and end of the session were not useful content. (Sometimes the Zoom meeting/recording for the same room would just continue from one session to the next, so multiple sessions would end up on one recording.) The valuable part of each recording would need to be clipped from the larger whole.

1
2
3
4
5
ffmpeg -y \
  -i FOLIO\ Roadmap.mkv \
  -ss 0:01:22 -to 0:50:26 \
  -c:v libx264 -crf 0 \
  FOLIO\ Roadmap.trimmed.mp4

Each option means:

  1. Ffmpeg command with -y to overwrite any existing file without prompting
  2. Recording file from Zoom
  3. Scan-start (-ss) and end (-to) points. Could have also used -d for duration. Note that the placement of -ss here relative to the -i input filename means ffmpeg will perform a frame-accurate. This avoids the problem of blank or mostly blank frames until the next keyframe is found in the file. See How to Cut Video Using FFmpeg in 3 Easy Ways (Extract/Trim) for a discussion.
  4. Re-encode with the x264 codec with a lossless Constant Rate Factor.
  5. Output file name

Rescale recordings

Most of the recordings from Zoom output to Full HD (1920x1080) resolution, but some were recorded to quite squirrely dimensions. (1920 by 1008, 1920 by 1030 … 1760 by 900, really?) To find the resolution of each recording file, I used the ffprobe command:

1
2
3
4
5
6
7
find . -type f -name '*.mkv' -exec sh -c '
  for file do
    printf "%s:" "$file"
    args=( -v error -select_streams v:0 -show_entries stream=width,height -of csv=s=x:p=0 "$file")
    ffprobe "${args[@]}"
  done
' exec-sh {} + 
  1. Run a find command to get a list of all .mkv files and pipe the list into a sub-shell.
  2. For each line of the list of files…
  3. …print out the filename (without a newline at the end)…
  4. …then build an argument array for the ffprobe command to output width and heigth…
  5. …and execute the command
  6. (end of sub-shell)
  7. (end of find)

Line 4 is necessary because some filenames have spaces, and spaces in filename for ffmpeg in bash can be a little challenging.

I moved the recordings that weren’t 1980x1080 to a separate directory and ran an ffmpeg command to add letterboxing/rescaling as needed to get the output to Full HD resolution. The -ss and -to options can also be used to clip the video to the correct length at the same time.

1
2
3
4
5
6
7
ffmpeg -y \
  -i 'zoom-recording-tuesday-2pm-room-701.mp4' \
  -ss 17868 -to 20104 \
  -vf 'scale=(iw*sar)*min(1920/(iw*sar)\,1080/ih):ih*min(1920/(iw*sar)\,1080/ih),
       pad=1920:1080:(1920-iw*min(1920/iw\,1080/ih))/2:(1080-ih*min(1920/iw\,1080/ih))/2' \
  -crf 0 
  'FOLIO Migrations.trimmed.mp4
  1. Ffmpeg command
  2. Input video file
  3. Start and stop points for the session recording
  4. Use a string of video filters, and in this first filter: scale the recording so the longest dimension is either 1920 pixels wide or 1080 pixels tall
  5. For the second filter of the chain, pad the frame to 1920x1080 and put the source video in the center/middle of the output frame.
  6. Constant Frame Rate of lossless
  7. Output file name

Create Title Card snippet

Each session recording has a 15-second title card with the session’s name. The 15 second video itself is just a PowerPoint animation of the conference logo sliding to the right half of the frame and a red box fading in on the left side of the frame. Each animation element was assigned a timing, and the resulting “presentation” was exported from PowerPoint to a video file. The music comes from Ecrett, so I have high hopes that it will pass the music copyright bar. The audio track was added to the video using—you guessed it—ffmpeg:

1
2
3
4
5
6
ffmpeg \
  -i WOLFcon\ 2023\ Intro\ Title\ Card.mov \
  -i WOLFcon\ 2023\ Intro\ audio.mp3 \
  -c:v copy 
  -map 0:v:0 -map 1:a:0 
  WOLFcon\ 2023\ Intro\ Title\ Card\ with\ audio.mov
  1. Ffmpeg command
  2. First stream input is the recording of the PowerPoint animation
  3. Second stream input is the sound file
  4. Using the ‘copy’ codec
  5. Mapping the first input’s zero-th stream (video) to the output
  6. Mapping the second input’s one-th stream (audio) to the output
  7. Writing the combined file

So with the blank title card video done, the next step is to burn/overlay the text of the session title into the video. I messed with ffmpeg’s drawtext filter for a while because the alternative—the subtitles filter—seemed too complicated. One thing that subtitles does nicely, though, is wrap the text to a given area on the video frame…sometimes complexity is a good thing. The open source Aegisub Advanced Subtitle Editor was immensely useful in creating the subtitle definition file. I can simply replace the text of the session title in the last line of the subtitle definition file, then feed it into ffmpeg.

The subtitle definition (so-called “.ass”) file generated by Aegisub is a text file, and it looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
[Script Info]
; Script generated by Aegisub 3.2.2
; http://www.aegisub.org/
Title: Default Aegisub file
ScriptType: v4.00+
WrapStyle: 0
ScaledBorderAndShadow: yes
YCbCr Matrix: None
PlayResX: 1920
PlayResY: 1080

[V4+ Styles]
Format: Name, Fontname, Fontsize, PrimaryColour, SecondaryColour, OutlineColour, BackColour, Bold, Italic, Underline, StrikeOut, ScaleX, ScaleY, Spacing, Angle, BorderStyle, Outline, Shadow, Alignment, MarginL, MarginR, MarginV, Encoding
Style: Left side middle,Helvetica Neue,72,&H00FFFFFF,&H000000FF,&H00000000,&H00000000,0,0,0,0,100,100,0,0,1,0.5,0,4,50,920,10,1

[Events]
Format: Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text
Dialogue: 0,0:00:04.00,0:00:13.00,Left side middle,,0,0,0,,FOLIO Roadmap Update

Just the last line needs to change for each session title. Another ffmpeg command overlays the subtitles onto the title card video:

1
2
3
4
ffmpeg -y \
  -i ../WOLFcon\ 2023\ Intro\ Title\ Card\ with\ audio.mov \
  -vf "subtitles=FOLIO Roadmap.ass" \
  FOLIO\ Roadmap.intro.mp4
  1. Ffmpeg command
  2. Input file is the title card with audio
  3. Add the subtitles video filter with the session-specific ASS file
  4. Output file

Merge sources to the final file

Now we have all of the pieces to make the final recording

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
ffmpeg -y \
  -i FOLIO\ Roadmap.intro.mp4 \
  -i FOLIO\ Roadmap.trimmed.mp4 \
  -i ../WOLFcon\ 2023\ Outro\ Title\ Card\ with\ audio.mov \
  -filter_complex "[0:v]fps=30/1,setpts=PTS-STARTPTS[v0];
    [1:v]fps=30/1,settb=AVTB,format=yuva420p,fade=in:st=0:d=1:alpha=1,setpts=PTS-STARTPTS+((14)/TB)[v1];
    [2:v]fps=30/1,settb=AVTB,format=yuva420p,fade=in:st=0:d=1:alpha=1,setpts=PTS-STARTPTS+((2959)/TB)[v2];
    [v0][v1]overlay,format=yuv420p[vfade1];
    [vfade1][v2]overlay,format=yuv420p[fv];
    [0:a]asetpts=PTS-STARTPTS[a0];
    [1:a]asettb=AVTB,asetpts=PTS-STARTPTS+((14)/TB),compand=.3:1:-90/-60|-60/-40|-40/-30|-20/-20:6:0:-90:0.2[a1];
    [2:a]asetpts=PTS-STARTPTS+((2959)/TB)[a2];
    [a0][a1]acrossfade=d=1[afade1];
    [afade1][a2]acrossfade=d=1[fa];" \
  -map "[fv]" -map "[fa]" \
  -crf 0 -ac 2 \
  FOLIO\ Roadmap.complete.mp4

There are some filter commands here to cross-fade the video and audio between video segments that are butting up next to each other. There is an excellent description of the ffmpeg cross-fade options, and I’m using the “traditional” method.

  1. The ffmpeg command…one more time!
  2. The ‘0’ input file is the title card video with the subtitles
  3. The ‘1’ input file is the trimmed recording from Zoom
  4. The ‘2’ input file is the outro file (a 6-second file with some end-bumper music)
  5. Start of the filtergraph, tagging the video stream of the first input file as [v0], set the video as 30 frames-per-second, and anchor the “presentation timestamp” at the 0-th frame.
  6. The video stream of the second input file is set to [v1]. The format filter sets an alpha channel to make the fade work, the fade filter makes cross-fade with a d-duration of 1 second, and the setpts filter offsets the start of the video to 14 seconds after the 0-th frame. (The title card video is 15 seconds, so making the recording fade in at the 14 second mark gives us that 1 second of overlap.)
  7. The third video stream is [v2]. The parameters are nearly identical to the previous line with just the starting time difference. (That number varies by the length of the session recording.)
  8. This filter overlays the [v0] and [v1] videos. This works because of the alpha channel and offset start of the second input. The output is tagged as [vfade1].
  9. Same as the previous line with the intro-plus-recording and the outro clip. The output is tagged as [fv].
  10. Tags the audio stream of the ‘0’ input file.
  11. Tags the audio stream of the ‘1’ input file and offsets its start (similar to the video stream). I’m also applying the audio compression-and-expansion filter to help raise the volume of quiet parts of the Zoom recording.
  12. Tags the audio stream of the ‘2’ input file and offsets its start.
  13. Crossfades the first and second audio streams across 1 second.
  14. Crossfades the second and third audio streams across 1 second and tags the output as [fa].
  15. Maps the ends of the video pipeline ([fv]) and audio pipeline ([fa]) to the output.
  16. Sets the codec to lossless and audio to stereo.
  17. The final output file.

Upload to YouTube

With the files ready, it is time to upload them to YouTube. The youtube-upload script is useful as a tool for batch uploading the videos. There are a couple of caveats to be aware of:

  1. The script uses an old authentication method against the YouTube API. There is a comment on issue 352 that has good advice on how to get around that.
  2. Without additional work, YouTube will automatically lock-as-private any videos uploaded using the API. Details and instructions are in a comment on issue 306, but it involves getting your YouTube API Project validated.

The command looks like this:

1
2
3
4
5
6
7
8
9
10
youtube-upload \
  --title="FOLIO Project Roadmap" \
  --description="Multi-line description goes here." \
  --category="Education" \
  --default-language="en" --default-audio-language="en" \
  --client-secrets=./client_secret.json \
  --credentials-file=./credentials_file.json \
  --playlist="WOLFcon 2023" \
  --embeddable=True --privacy=public \
  'FOLIO\ Roadmap.complete.mp4'

That should all be self-explanatory. One thing to be aware of is the authentication step. The client_secret.json file is downloaded from the Google API Console when the YouTube API project is created; that API project will need to be set up and this credentials file saved before running this script. Also, the credentials_file.json won’t exist when this command is first run, and you’ll be prompted to go to a specific URL to authorize the YouTube API project. After that, the credentials file will exist and you won’t be prompted again.

And since I already had the session metadata in a spreadsheet, it was easy to write a formula that put all of the pieces together:

=”youtube-upload –title=’“&B2&”’ –description=’“&C2&”’ –category=’Education’ –default-language=’en’ –default-audio-language=’en’ –client-secrets=./client_secret.json –credentials-file=./credentials_file.json –playlist=’WOLFcon 2023’ –embeddable=True –privacy=public ‘“&A2&”’”

Then it is just a matter of copying and pasting the calculated command lines into the terminal.

Digital versus Digitized: On the Hachette v. Internet Archive Appeal Oral Argument / Peter Murray

One thing that would dramatically clarify the controlled digital lending concept in general and the Hachette v. Internet Archive lawsuit in particular is having distinct terms for types of ebooks. I propose that we refer to them as digital and digitized. A digital book is one that is born digital, where the publisher has the original “source code”. Alternatively, a digitized book originates as a physical copy, which is then converted into a sequence of printed page images. Given the differences in the way they are created by the publisher and the capabilities offered to the reader, distinguishing the two types of books is appropriate. I’m not a marketer, but I suspect digital and digitized might be too similar for an average person to notice the subtle differences. As technical descriptors, these terms help clarify some of the misunderstandings (or even willful obfuscations?) I heard during the circuit court oral arguments.

On June 28, 2024, the oral arguments in the Hachette Book Group, Inc. v. Internet Archive case took place. Shortly after, a recording was made readily available on the Internet Archive. I created an unofficial transcript of this recording, which I posted on my media site with Hypothesis enabled. I made 30 notable annotations in the transcript, comprising key points, personal comments, and several references to external materials. (Feel free to annotate specific points alongside me on Hypothesis, if you wish.) This post consolidates those ideas and remarks into a coherent form.

Disclosures: Despite not being a lawyer, I find the intersection of copyright, fair use, library services, and societal welfare intriguing, and often reflect and write about them professionally. This is not legal advice. I’m currently employed by a software company that’s developing a controlled digital lending system. In addition to my professional ties, I believe controlled digital lending is a tremendous benefit for library patrons, libraries, and society at large.

Background

Hachette v. Internet Archive is a lawsuit filed on June 1, 2020, during the peak of the Covid-19 pandemic, in response to the National Emergency Library | Internet Archives blog (NEL) program. The NEL program, initiated on March 24, 2020, removed the restrictions on the number of patrons allowed simultaneous access to digitized books on IA’s Open Library collection. Before this pandemic-induced change, libraries could partner with Open Library to provide access to digitized books for their patrons. IA employed a system called Controlled Digital Lending (CDL), assuring that digitized copies weren’t distributed to the public unconditionally. CDL is a blend of digital rights management (DRM), library operations software, and library protocols, ensuring that a single physical copy is not loaned more than once in any form, whether physical or digital. NEL removed this “never lent more than once” CDL restriction based on the premise that all the nation’s public libraries were closed and no one could access the physical materials. NEL concluded on June 16, 2020, and IA’s regular CDL program resumed. For a more detailed explanation of CDL, refer to an article I authored called Controlled Digital Lending…What’s the Fuss? derived from my talk at Code4Lib in 2023, and Issue 94 of my intermittent newsletter.

The federal court in the Southern District of New York ruled in favor of Hachette on August 11, 2024, but the judgement was stayed pending an appeal to the intermediate court. The oral arguments last week were part of that process, and now we wait for that ruling. From accounts I’ve read, it seems like both parties are poised to take this to the U.S. Supreme Court no matter who wins.

Because I started trying to draw a distinction between various terms, I’m going to carefully chose my words in this post:

  • Digital book: A digital book, born from electronic files. ePub, Daisy, and the Kindle format are common file formats for these books, which use digital typesetting to arrange words on a screen.
  • Digitized book: A book with pages that are images of a physical item. Originally digitally typeset on a printed page, these pages are then scanned and sequenced into a file.
  • E-book: An umbrella term encompassing both Digital or Digitized books as defined above.
  • Physical book: As simple as it sounds: a book that exists in a physical, human-readable format.
  • Book: If I refer to a “book” without any of the modifiers above, it applies to any type of book: digital, digitized, or physical.

Digital versus Digitized Marketplaces

I led this blog post with a discussion of digital books versus digitized books, and it wasn’t long into the Internet Archive’s presentation that we get to the matter.

[15:25] under [fair use] factor 4 you say that actually there’s one reason there’s still be a market for e-books is because e-books are more attractive than digitized versions of physical books. Right? Because they have features and they’re more user friendly or whatever. So what that kind of means is what you’re saying is that your digital copies are more convenient or more attractive, I guess more convenient than physical books, but less convenient than e-books.

The judge is asking about the three kinds of books: “physical books”, “digitized versions of physical books”, and “ebooks”. So, we are already recognizing that digitized books and digital “ebooks” are different and that digitized books are “more convenient than physical books but less convenient than ebooks.”

There seems to be a legal concept here about “markets”, and specifically whether digitized books (from libraries through CDL) and digital books (from publishers through Overdrive and other programs) are in the same “market”. It seems undisputed that the market for physical books and the market for something digital/digitized are different…even though they hold the same basic content. It does seem disputed whether digitized books (which are facsimiles of the content in physical books) and digital books (with capabilities only possible in the digital realm) are in the same market or are different markets. To my eye, a digitized book is mostly akin to a physical book with the exception that the digitized book can be more easily distributed via electronic devices. The real difference lies in the capabilities of the born-digital book on an electronic device. But I don’t know how the law defines “markets” in this case.

In the “ebook” marketplace, publishers will license “digital” books to libraries using a service like Overdrive. A publisher is unlikely to sell or license a “digitized” book to libraries. (I last heard of a publisher doing this early digital days of the 1990s.) Also in the “ebook” marketplace, a library will use CDL to lend a “digitized” book to patrons. It is conceivable that a library could lend a “digital” book if it has purchased the rights to do so, but that is unusual at this stage.

Sidenote: when digital and digitized have identical packaging, a la PDFs

Let’s get into the weeds for a moment and talk about how the PDF file format muddies the distinction between digital and digitized. There are two main types of PDF files: those created from digital typesetting and those created from scanned images of printed pages. Digitally typeset PDFs—digital books—are created using digital typesetting software such as Adobe InDesign, Microsoft Word, or LaTeX. The content is created and formatted within the software and then exported or saved as a PDF file. Digitally typeset PDF files can be made accessible to assistive technologies by properly tagging the structure and content of the document; this allows screen readers to interpret and navigate the PDF file. Digitally typeset PDF files are usually smaller in size compared to scanned PDFs because they contain vector-based text and graphics.

On the other hand, PDF files based on scanned images of printed pages—digitized books—are created by scanning or photographing physical printed pages using; each page of the book is captured as an image. Scanned PDF files are not inherently accessible to assistive technologies like screen readers because the text is not selectable or readable by these tools. Scanned PDF files also tend to have larger file sizes compared to digitally typeset PDFs because they contain high-resolution images of each page. The text might be selectable to be copied out of the document, but only if a layer of Optical Character Recognition (OCR) has been applied to the file. OCR software analyzes the images and attempts to recognize and convert the characters into selectable and searchable text.

The same file format—PDF—can be used for both a digital book and a digitized book. You might initially only be able to tell the different by the file size, but capabilities will be apparent to the reader soon after it is opened. Some publishers use PDF as a delivery format, and these are most likely the digitally typeset files. The reader doesn’t get the digital book advantages of reflowing text or changing the font family or size with this kind of PDF. CDL can also use PDF as a delivery mechanism, and this is most likely the sequence of images PDF.

Fortunately, the court doesn’t get into this level of technical detail. Unfortunately, I think a lot of sides talking past each other come from muddy technical aspects of licensing versus lending a “PDF”.

The rights with first-sale

Some examples from the oral arguments where not having clear definitions causes problems:

[18:02] When they buy those books, they buy the physical copies to lend to their patrons one at a time or through an interlibrary change. They also buy e-books to make those available to their patrons. We’re focused here on e-books and impacting e-licensing. I have a hard time reconciling those two, specifically as to e-licensing. Why would libraries ever pay for an e-license if they could have internet archives, scan all the books, hard copies they buy and make them available on an unlimited basis?

Internet Archives’ counsel points out here that in licensing e-books, the libraries are not adding to their permanent collection. Libraries haven’t bought the book, and don’t have the first-sale rights to do what they want with the book. And most publishers—notably the major commercial/trade publishers that are a part of this lawsuit—do not want to sell e-books with first sale rights for a library to add to its permanent collection. (How a library is supposed to fulfill the part of its mission to preserve cultural artifacts is beyond the scope of this post, but you can see the obvious problems of saving e-books.)

The origins of CDL, or the point at late in the arguments where I exclaim WHAT?!?

Late in the publisher’s arguments comes this bit:

[1:08:29] Control digital lending is a contrived construct that was put together at the behest of Internet Archive back in 2018 when they confronted the fact that libraries didn’t want to deal with them. Libraries didn’t want to give copies of their works to be digitized because they were concerned about copyright arguments. So they got in a room together with various people and contrived this principle of control digital lending to rationalize what they were doing.

CDL is much older than 2018. IA’s version of CDL predates the first discussion of it by over half a decade (see the origins of Controlled Digital Lending in issue 94), and there are earlier implementations. I don’t remember seeing this claim from the publishers in their district court complaint, so I hope there is evidence for this statement on the record in the evidence presented to the lower court.

Publishers will stop publishing

Starting with the publisher’s lawyer:

[1:15:20] That IA’s brief and amici try to create the impression that the public interest is on their side. And it is not. The protection of copyright is in the US Constitution and federal law because it creates an incentive for writers and artists to create new works to benefit our broader society. Internet Archives’ control digital lending is in direct conflict with that basic principle. And as I previously…

You don’t really think people are going to stop writing books because of the control digital lending to you?

Well, I think publishers are going to go down the tubes if they do not have the revenues.

You think that that’s really…

I do, Your Honor. There’s no question. I mean, and the standard here is not, will this eliminate…

No, I understand. It’s just a part. But this question about balancing the incentive to create a work with the larger distribution of it, that is the question to be decided in this case.

If this gets to the U.S. Supreme Court, do we get to go with the originalist’s thinking from the Copyright Act of 1790 | Wikipedia that copyright extends for 14 years with the right to renew for a additional 14-year term should the copyright holder still be alive?

The market becomes only as large as the number of people who simultaneously want to read a work

A judge make a good point about the potential market effect of library’s CDL:

[1:18:32] But you’re reducing the market from the number of people who might want to read… Let’s look at even the paper books. They’ll pretend like take out the digital market for a second. The number of people who might want to read it ever, down to the number of people who might want to read it simultaneously. And if you put digital books into the mix, it’s the same idea, right?

At the extreme, the workflow efficiencies that come with CDL (or, a “reduction in friction” as I think it is referred to in the oral arguments) could mean that there is only a market as big as those who want physical books for their personal collection and libraries collectively purchasing a number of physical items to fulfill digitized book needs. (There is still a market for digital books that publishers won’t sell to libraries.) There is some nuance here, but the point is interesting.

And it is here that I think we see the first substantive discussion of digital versus digitized:

[1:19:48] that efficiency may or may not have an effect on either the number of copies that get sold or on the market for the Overdrive service, which has a variety of different sort of different aspects and benefits over and above CDL. I mean, CDL is largely sort of scanned images of pages of paper books because it’s the paper book. The Overdrive service has a lot of benefits. You can flow the text. You can do different features and that is one reason why…that is one explanation for the data that you see—that there is no reduction in demand for Overdrive.

My informed but not expert opinion

I agree with those that are saying that the line of questioning from the circuit court judges shows a more thoughtful approach to the nuances of copyright than was seen in the district court decision. The judges and lawyers seem to recognize that digitized books and digital books are different, with digitized books being more convenient than physical books but less functional than digital books. However, there appears to be a dispute over whether digitized books and digital books are in the same “market” or different markets, which is a key factor in determining fair use.

The concept of first-sale rights is central to the discussion of library lending and ebook licensing. Libraries purchase physical books with the right to lend them to patrons, but when licensing ebooks, they do not have the same ownership rights. This distinction is crucial in understanding the limitations libraries face in providing access to books (physical or ebooks — to say nothing about preserving books, too). The argument that CDL threatens the incentive for writers and artists to create new works, as publishers may “go down the tubes” without sufficient revenues, is a significant point of contention. The balancing of copyright protection and the broader distribution of works is a central question in this case.

This case highlights the complexity of the issues surrounding digital and digitized books, copyright, fair use, and library lending. The distinction between digital and digitized books is crucial in understanding the nuances of the case and the potential implications for the future of ebooks and library services.

Other articles and opinions

Learnings from the British Library Cybersecurity Report / Peter Murray

The British Library suffered a major cyber attack in October 2023 that encrypted and destroyed servers, exfiltrated 600GB of data, and has had an ongoing disruption of library services after four months. Yesterday, the Library published an 18-page report on the lessons they are learning. (There are also some community annotations on the report on Hypothes.is.)

Their investigation found the attackers likely gained access through compromised credentials on a remote access server and had been monitoring the network for days prior to the destructive activity. The attack was a typical ransomware job: get in, search for personal data and other sensitive records to copy out, and encrypt the remainder while destroying your tracks. The Library did not pay the ransom and has started the long process of recovering its systems.

The report describes in some detail how the Library recognized that its conglomeration of disparate systems over the years left them vulnerable to service outages and even cybersecurity attacks. They had started a modernization effort to address these problems, but the attack dramatically exposed these vulnerabilities and accelerated their plans to replace infrastructure and strengthen processes and procedures.

The report concludes with lessons learned for the library and other institutions to enhance cyber defenses, response capabilities, and digital modernization efforts. The library profession should be grateful to the British Library for their openness in the report, and we should take their lessons to heart.

Note! Simon Bowie has some great insights on the LSE Impact blog, including about how the hack can be seen as a call for libraries to invest more in controlling their own destinies.

The Attack

The report admits that some information needed to determine the attackers’ exact path is likely lost. Their best-effort estimate is that a set of compromised credentials was used on a Microsoft Terminal Services server (now called Remote Desktop Services). Multi-factor authentication (MFA, sometimes called 2FA) was used in some areas of the network, but connections to this server were not covered. The attackers tripped at least one security alarm, but the sysadmin released the hold on the account after running malware scans.

Starting in the overnight hours from Friday to Saturday, the attackers copied 600GB of data off the network. This seems to be mostly personnel files and personal files that Library staff stored on the servers. The network provider could see this traffic looking back at network flows, but it is unclear whether this tripped any alarms itself. Although their Integrated Library System (an Aleph 500 system according to Marshall Breeding’s Library Technology Guides site) was affected, the report does not make clear whether patron demographic or circulation activity was taken.

Recovery—Rebuild and Renew

Reading between the lines a little bit, it sounds like the Library had a relatively flat network with few boundaries between systems: “our historically complex network topology … allowed the attackers wider access to our network than would have been possible in a more modern network design, allowing them to compromise more systems and services.” Elevated privileges on one system lead to elevated privileges on many systems, which allowed the attacker to move freely across the network. Systems are not structured like that today—now tending to follow the model of “least privileges”—and it seems like the Library is moving away from the flat structure towards a segmented structure.

As the report notes, recovery isn’t just a matter of restoring backups to new hardware. The system can’t go back to the vulnerable state it was in. It also seems like some software systems themselves are not recoverable due to age. The British Library’s program is one of “Rebuild and Renew” — rebuilding with fresh infrastructure and replacing older systems with modern equivalents. In the never-let-a-good-crisis-go-to-waste category, “the substantial disruption of the attack creates an opportunity to implement a significant number of changes to policy, processes, and technology that will address structural issues in ways that would previously have been too disruptive to countenance.”

The report notes “a risk that the desire to return to ‘business as usual’ as fast as possible will compromise the changes”, and this point is well taken. Somewhere I read that the definition of “personal character” is the ability to see an action through after the emotion of the commitment to action has passed. The British Library was a successful institution, and it will want to return to that position of being seen as a thriving institution as quickly as possible. This will need to be a continuous process. What is cutting edge today will become legacy tomorrow. As our layers of technology get stacked higher, the bottom layers get squeezed and compressed into thin slivers that we tend to assume will always exist. We must maintain visibility in those layers and invest in their maintenance and robustness.

Backups

They also found “viable sources of backups … that were unaffected by the cyber-attack and from which the Library’s digital and digitised collections, collection metadata and other corporate data could be recovered.” That is fortunate—even if the older systems have to be replaced, they have the data to refill them.

They describe their new model as “a robust and resilient backup service, providing immutable and air-gapped copies, offsite copies, and hot copies of data with multiple restoration points on a 4/3/2/1 model.” I’m familiar with the 3/2/1 strategy for backups (three copies of your data on two distinct media with one stored off-site), but I hadn’t heard of the 4/3/2/1 strategy. Judging from this article from Backblaze, the additional layer accounts for a fully air-gapped or unavailable-online copy. An example is the AWS S3 “Object Lock” service, a cloud version of Write-Once-Read-Many (WORM) storage. Although the backed-up object is online and can be read (“Read-Many”), there are technical controls that prevent its modification until a set period of time elapses (“Write-Once”). Presumably, the time period is long enough to find and extricate anyone who has compromised the systems before the object lock expires.

Improved Processes

The lessons include the need for better network monitoring, external security expertise retention, multi-factor authentication, and intrusion response processes. The need for comprehensive multi-factor authentication is clear. (Dear reader: if you don’t have a comprehensive plan to manage credentials—including enforcement of MFA—then this is an essential takeaway from this report.)

Another outcome of the recovery is better processes for refreshing hardware and software systems as they age. Digital technology is not static. (And certainly not as static as putting a printed book on a climate-controlled shelf.) It is difficult (at least for me) to envision the kind of comprehensive change management that will be required to build a culture of adaptability and resilience to reduce the risk of this happening again.

Some open questions…

I admire the British Library’s willingness to publish this report that describes in a frank manner their vulnerabilities, the impacts of the attack, and what they are doing to address the problems. I hope they continue to share their findings and plans with the library community. Here are some things I hope to learn:

  • To what extent was the patron data (demographic and circulation activity) in the integrated library system sought and copied out?
  • How will they prioritize, plan, and create replacement software systems that cannot be recovered or are deemed too insecure to put back on the network?
  • Describe in greater detail their changes to data backup plans and recovery tests. What can be taught to other cultural heritage institutions with similar data?
  • This is about as close to “green-field” development as you can get in an organization with many existing commitments and requirements. What change management exercises and policies helped the staff (and public) through these changes?

Cyber security is a group effort. It would be easy to pin this chaos on the tech who removed a block on the account that may have been the beachhead for this attack. As this report shows, the organization allowed this environment to flourish, culminating in that one bit-flip that brought the organization down.

I’ve never been in that position, but I am mindful that I could someday be in a similar position looking back at what my actions or inactions allowed to happen. I’ll probably be at risk of being in that position until the day I retire and destroy my production work credentials. I hope the British Library staff and all involved in the recovery are treating themselves well. Those of us on the outside are watching and cheering them on.

Digital Library Federation Working Group Receives Strategic Growth Grant from the Society of American Archivists Foundation / Digital Library Federation

See the original post on the Council on Library and Information Science News.

[Alexandria, VA, July 10, 2024] — The Council on Library and Information Resources’ Digital Library Federation is pleased to announce that its Born-Digital Access Working Group (BDAWG) has been awarded a Strategic Growth Grant from the Society of American Archivists Foundation. This prestigious grant will support the efforts of BDAWG, specifically its Visioning Access Systems subgroup, in the project titled “AI is for Access: An Investigation of AI Adoption.”

The grant is administered by CLIR, which acts as the fiscal sponsor for this important initiative.

The goal of “AI for Access” is to investigate how members of the U.S. archival community are utilizing artificial intelligence (AI) and machine learning (ML) tools to enhance access to digital archival materials. The study will delve into the adoption and implementation of these advanced technologies within archival practices, ultimately seeking to improve the accessibility and usability of digital archives.

In recent years, the exponential growth of digital content has inundated archival institutions with an unprecedented volume of materials, posing significant challenges in organization and accessibility. Traditional manual methods, once sufficient, now struggle to cope with the sheer scale and complexity of these digital archives.

Recognizing this pressing issue, the project aims to explore the transformative potential of AI and ML tools within the archival community. By delving into how these advanced technologies are being utilized, the study seeks to uncover innovative approaches to enhance access to digital archival materials. Through the integration of AI and ML tools, archival institutions can improve their archival management strategies, making their materials more discoverable, searchable, and comprehensible to a diverse audience.  At its core, the project seeks to bridge the gap between the vast repositories of digital content and the individuals and communities who seek to engage with them. By leveraging the power of AI and ML, archival institutions can ensure the long-term preservation and accessibility of cultural heritage, safeguarding invaluable materials for generations to come.

As a profession, archival workers are working in an exciting time, at the intersection of two technological developments that are largely changing the ways in which we work. The increase in born digital collections and the emergence of AI technologies create a space for archival workers to be innovative and forward thinking.

The project group recognizes this unique moment and the fact that accessing born-digital collections remains for many a difficult task (see DLF VAS report). The group is interested in how archival workers are approaching AI, or if not, to identify the barriers to its use. Through this research project, “AI is for Access,” we hope that archives practitioners will be able to advocate for the institutional adoption of AI tools in archival settings and to discuss where there are areas of concern. The group also hopes to demystify AI usage within archives, addressing concerns related to its adoption, and increasing awareness of available resources. Additionally, the group hopes to contribute valuable insights to SAA’s advocacy efforts, particularly in crafting AI protections for archival work and labor, says Steven Gentry, on behalf of the project team.

A critical component of the project is an archives community survey which is scheduled to be announced and circulated in September of 2024. The survey investigates how US archival workers are currently using artificial intelligence to facilitate access to born-digital archival material. The goal is to better understand the extent to which archival workers are adopting artificial intelligence as well as to identify the reasons why they may not be engaging with AI. Insight from survey results will contribute to our collective understanding of how archival workers are utilizing artificial intelligence (AI) in providing access to born-digital collections and facilitate future research efforts by identifying potential gaps and opportunities in the field as it relates to artificial intelligence.

The Society of American Archivists represents over 6,200 professional archivists across various sectors, including government, academia, business, libraries, and historical organizations nationwide. The SAA Foundation is dedicated to supporting archivists and archives by providing essential resources to the archival community. Each year, the SAA Foundation Board awards grants that align with the mission and strategic priorities of the Society of American Archivists.

As the leading nonprofit funding source dedicated to the interests of archives and archivists in the United States, the SAA Foundation’s Strategic Growth Fund supports numerous initiatives, including professional and public education, publications, research, the SAA Awards Program, and the general operations of the SAA Foundation.

The group encourages all US archival workers to look out for the survey announcement in September and invites you to participate, whether you are actively using AI or not. For more information, please visit the News space on the DLF website for project updates.

 

Contact: 

Jennifer Ferretti

Director, Digital Library Federation

jferretti@clir.org

The post Digital Library Federation Working Group Receives Strategic Growth Grant from the Society of American Archivists Foundation appeared first on DLF.

The European Union must keep funding free software / Open Knowledge Foundation

Open Letter to the European Commission

Since 2020, Next Generation Internet (NGI) programmes, part of European Commission’s Horizon programme, fund free software in Europe using a cascade funding mechanism (see for example NLnet’s calls). This year, according to the Horizon Europe working draft detailing funding programmes for 2025, we notice that Next Generation Internet is not mentioned any more as part of Cluster 4.

NGI programmes have shown their strength and importance to supporting the European software infrastructure, as a generic funding instrument to fund digital commons and ensure their long-term sustainability. We find this transformation incomprehensible, moreover when NGI has proven efficient and economical to support free software as a whole, from the smallest to the most established initiatives. This ecosystem diversity backs the strength of European technological innovation, and maintaining the NGI initiative to provide structural support to software projects at the heart of worldwide innovation is key to enforce the sovereignty of a European infrastructure.
Contrary to common perception, technical innovations often originate from European rather than North American programming communities, and are mostly initiated by small-scaled organizations.

Previous Cluster 4 allocated 27 million euros to:

  • “Human centric Internet aligned with values and principles commonly shared in Europe”;
  • “A flourishing internet, based on common building blocks created within NGI, that enables better control of our digital life”;
  • “A structured ecosystem of talented contributors driving the creation of new internet commons and the evolution of existing internet commons”.

In the name of these challenges, more than 500 projects received NGI funding in the first 5 years, backed by 18 organisations managing these European funding consortia.

NGI contributes to a vast ecosystem, as most of its budget is allocated to fund third parties by the means of open calls, to structure commons that cover the whole Internet scope – from hardware to application, operating systems, digital identities or data traffic supervision. This third-party funding is not renewed in the current program, leaving many projects short on resources for research and innovation in Europe.

Moreover, NGI allows exchanges and collaborations across all the Euro zone countries as well as “widening countries”1:, currently both a success and an ongoing progress, likewise the Erasmus programme before us. NGI also contributes to opening and supporting longer relationships than strict project funding does. It encourages implementing projects funded as pilots, backing collaboration, identification and reuse of common elements across projects, interoperability in identification systems and beyond, and setting up development models that mix diverse scales and types of European funding schemes.

While the USA, China or Russia deploy huge public and private resources to develop software and infrastructure that massively capture private consumer data, the EU can’t afford this renunciation.
Free and open source software, as supported by NGI since 2020, is by design the opposite of potential vectors for foreign interference. It lets us keep our data local and favors a community-wide economy and know-how, while allowing an international collaboration.

This is all the more essential in the current geopolitical context: the challenge of technological sovereignty is central, and free software allows addressing it while acting for peace and sovereignty in the digital world as a whole.

Sign the letter: https://pad.public.cat/lettre-NCP-NGI#fn1


1
As defined by Horizon Europe, widening Member States are Bulgaria, Croatia, Cyprus, the Czech Republic, Estonia, Greece, Hungary, Latvia, Lituania, Malta, Poland, Portugal, Romania, Slovakia and Slovenia. Widening associated countries (under condition of an association agreement) include Albania, Armenia, Bosnia, Feroe Islands, Georgia, Kosovo, Moldavia, Montenegro, Morocco, North Macedonia, Serbia, Tunisia, Turkey and Ukraine. Widening overseas regions are : Guadeloupe, French Guyana, Martinique, Reunion Island, Mayotte, Saint-Martin, The Azores, Madeira, the Canary Islands. ⤴

Accelerated Computing / David Rosenthal

Source
One big problem of the economies of the US and the UK is the cult of the CEO, and the resulting flood of CEO hagiographies that appear after a surge in their company's stock price. These aren't harmless fluff pieces, they contribute to a CEO mindset that is profoundly destructive — see Elon Musk for one example. Will Hutton writes:
But decades of being congratulated and indulged for the relentless pursuit of their own self-interest has turned the heads of too many of our successful rich. They really believe that they are different: that they owe little to the society from which they have sprung and in which they trade, that taxes are for little people. We are lucky to have them, and, if anything, owe them a favour.
Below the fold I continue the "Old man yells at cloud" theme of recent posts by trying to clarify an aspect of the current Jensen Huang hagiography.

Ed Zitron's must-read The Shareholder Supremacy traces the idea of the super-hero CEO revealed by a soaring stock price to:
the famous Dodge vs. Ford Motor Co. case that would define — and ultimately doom — modern capitalism, and in many ways birth the growth-at-all-costs Rot Economy.

The Michigan Supreme Court found that "a business corporation is organized and carried on primarily for the profit of the stockholders [and that] the powers of the directors are to be employed for that end," and intimated that cash surpluses should not be saved to invest in upcoming projects, but distributed to shareholders, because Ford had shown that it was good at making money. Ford was directly forbidden from lowering prices and raising employee salaries, and forced to issue a dividend.

To be clear, the statement around corporations’ duty toward shareholders was made “obiter dicta.” This means it was not actually legally binding, despite over a hundred years of people acting as if it was.
Zitron goes on to detail how "Neutron" Jack Welch destroyed General Electric while being lionized as a super-star CEO. I can testify to his malign influence because, during my time at Sun Microsystems, Scott McNealy was seduced by it.

Now on to the latest CEO whose soaring stock price has caught the attention of the CEO hagiographers, Jensen Huang. Don't get me wrong. I was there when he was a really good startup CEO, and unlike many others he has grown with the company in a very impressive way.

In The Envy of Everyone, M. G. Siegler remarks on:
this incredibly prescient profile of co-founder and CEO Jensen Huang back in 2017 by Andrew Nusca (in naming him Fortune's 2017 'Businessperson of the Year').
At the time Siegler tweeted:
“Video games was our killer app — a flywheel to reach large markets funding huge R&D to solve massive computational problems.”

Genius foresight. Sounds obvious now. Was not then.
Siegler was commenting on Andrew Nusca's 2017 profile entitled This Man Is Leading an AI Revolution in Silicon Valley—And He’s Just Getting Started. As Huang hagiography goes, it was pretty good. The quote in the tweet is from Jensen Huang talking about the early days:
“We believed this model of computing could solve problems that general-purpose computing fundamentally couldn’t,” Huang says. “We also observed that video games were simultaneously one of the most computationally challenging problems and would have incredibly high sales volume. Those two conditions don’t happen very often. Video games was our killer app—a flywheel to reach large markets funding huge R&D to solve massive computational problems.”
I don't disagree with what Huang said. Despite the need to focus on gaming, we did have a vague idea that in the future there would be other application areas in which custom accelerators could make an impact. And it is true that Nvidia's VCs, Sutter Hill and Sequoia, gave us the time to develop a multi-generation architecture rather than rushing out a "minimum viable product". I do quibble with the idea that this was "genius foresight".

90s System Diagram
Even back when I was working with Curtis Priem and Chris Malachowsky on Sun's GX graphics chips the problem we were solving was that the CPU could not write pixels into the framebuffer fast enough for any kind of 3D application. So a chip, then called a graphics chip but now called a GPU, was needed between the CPU and the framebuffer capable of converting a small number of writes from the CPU into a large number of writes to the framebuffer. Designs faced three major performance constraints:
  • The limited bandwidth of the bus carrying instructions from the CPU to the graphics chip. I wrote about how we addressed this constraint in Engineering For The Long Term.
  • The limited write bandwidth of the framebuffer.
  • The limited transistor budget imposed by the chip's cost target.
NV1-based Diamond Edge
Swaaye, CC-By-SA 3.0
Six months after we started Nvidia, we knew of over 30 other companies all trying to build 3D graphics chips for the PC. They fell into two groups. Nvidia and a few others were making fixed-function accelerators, but most were trying to get faster time-to-market by making programmable accelerators in which the functions were implemented in software running on a CPU in the graphics chip.

One problem for the programmable companies was that the transistors needed to implement the graphic chip's CPU's fetch, decode, execute system were not available to implement the actual graphics functions, such as matrix multiply. A second problem was that the CPU in the graphics chip needed to fetch its instructions forming the program that defined the functions from some memory:
  • It could fetch them from the system memory that the CPU got its data and instructions from. That wasn't a good idea since (a) it required implementing a DMA engine in the graphics chip, and (b) the DMA engine's program fetches would compete for the limited bandwidth of the bus.
  • It could fetch them from a separate program memory private to the graphic chip. This wasn't a good idea since it added significant cost, both for the separate memory itself and also for the extra pins and complexity of the graphics chip's extra memory port.
  • It could fetch them from the framebuffer memory. This wasn't a good idea since the program fetches would compete with the limited bandwidth of the framebuffer RAM's random access port.
The result was that for many product generations from NV1 the winning graphics chips were all fixed-function. At the time Nvidia was started, a person having ordinary skill in the art should have understood that fixed-function accelerator hardware was the way to go. We were not geniuses.

Over the years, Moore's Law gradually relaxed the constraints, forcing the design choices to be re-evaluated. As we had expected, the one that relaxed fastest was the transistor budget. The obvious way for the accelerator to exploit the extra transistors was to perform functions in parallel. For parallel applications such as graphics this increased the accelerator advantage over the CPU. But equally important in the competitive market for graphics chips, it sped up the design by making the chip an assembly of many identical units.

The constraints continued to relax. In 2001 Nvidia had enough transistors and bandwidth to release the GeForce 3 series with programmable pixel and vertex shaders. Eventually, various non-graphics communities figured out how to use the increasing programmability of Nvidia's GPUs to accelerate their parallel applications, and Nvidia seized the opportunity to create a software platform with the 2007 release of CUDA. By greatly simplifying the development of massively parallel applications for Nvidia GPUs, CUDA drove their adoption in many fields, AI being just the latest.

Source
The Huang hagiography's focus on Nvidia's current stock price is misplaced and ahistorical. It is notoriously volatile. Look at the log plot of the stock price since the IPO. I count eight drops of 45% or more in 25 years, that's average of about one every three years. One of the questions I asked when interviewing Chris Malachowsky for the 50th Asilomar Microcomputer Workshop was approximately "how do you manage a company with this much volatility?" His answer was, in effect, "you learn to ignore it".

Source
David Cahn of Sequoia, one of Nvidia's two VCs, writes in AI’s $600B Question:
In September 2023, I published AI’s $200B Question. The goal of the piece was to ask the question: “Where is all the revenue?”

At that time, I noticed a big gap between the revenue expectations implied by the AI infrastructure build-out, and actual revenue growth in the AI ecosystem, which is also a proxy for end-user value. I described this as a “$125B hole that needs to be filled for each year of CapEx at today’s levels.”
...
The $125B hole is now a $500B hole: In the last analysis, I generously assumed that each of Google, Microsoft, Apple and Meta will be able to generate $10B annually from new AI-related revenue. I also assumed $5B in new AI revenue for each of Oracle, ByteDance, Alibaba, Tencent, X, and Tesla. Even if this remains true and we add a few more companies to the list, the $125B hole is now going to become a $500B hole.
Goldman Sachs' provides furtehr skepticism in Gen AI: Too Much Spend, Too Little Benefit?:
Tech giants and beyond are set to spend over $1tn on AI capex in coming years, with so far little to show for it. So, will this large spend ever pay off? MIT’s Daron Acemoglu and GS’ Jim Covello are skeptical, with Acemoglu seeing only limited US economic upside from AI over the next decade and Covello arguing that the technology isn’t designed to solve the complex problems that would justify the costs, which may not decline as many expect.
Another huge drop in the stock price is sure to be in the offing. How great will the hagiographers think Huang is then?

CompanyMarketQuarterlyEmployeesMktCap perIncome per
 CapIncome EmployeeEmployee
NVDA$3T$26B30K$100M$867K
GOOG$2.3T$25B185K$17M$140K
AAPL$3.2T$28B160K$20M$170K
MSFT$3.3T$22B221K$15M$100K

Mr. Market will do what Mr. Market does, the stock price isn't under Huang's control. The things that are under Huang's control are the operating profit margin (53%), revenues ($26B/quarter), and the company's incredible efficiency. Nvidia's peers in the $2-3T market cap range have between 5 and 7 times as many employees. As Huang says, Nvidia is the smallest big company.

‘There are many data gaps in Latin America and we lack collection methods’ / Open Knowledge Foundation

This is the eleventh conversation of the 100+ Conversations to Inspire Our New Direction (#OKFN100) project.

Since 2023, we are meeting with more than 100 people to discuss the future of open knowledge, shaped by a diverse set of visions from artists, activists, scholars, archivists, thinkers, policymakers, data scientists, educators, and community leaders from everywhere.

The Open Knowledge Foundation team wants to identify and discuss issues sensitive to our movement and use this effort to constantly shape our actions and business strategies to deliver best what the community expects of us and our network, a pioneering organisation that has been defining the standards of the open movement for two decades.

Another goal is to include the perspectives of people of diverse backgrounds, especially those from marginalised communities, dissident identities, and whose geographic location is outside of the world’s major financial powers.

How openness can accelerate and strengthen the struggles against the complex challenges of our time? This is the key question behind conversations like the one you can read below.

*

This time we did something different, a collective conversation. A few weeks ago we had the chance to bring together several members of the Open Knowledge Network to talk about the current context, opportunities and challenges for open knowledge in Latin America.

The conversation took place online in Spanish on 18 June 2024, with the participation of Haydée Svab (Brazil), Fernanda Carles (Paraguay), Omar Luna (El Salvador), Andrés Vázquez Flexes and Julieta Millan (Argentina), moderated by Lucas Pretti, OKFN’s Communications & Advocacy Director. Sara Petti, International Network Lead and Project Manager at OKFN, also joined the conversation.

One of the important contexts of this conversation is precisely Julieta’s incorporation as regional coordinator of the Network’s Latin America Hub. With this piece of content, we also aim to facilitate regional integration and find common points of collaboration for shared work within the Network. That’s why we started by asking him to introduce herself in her own words. 

We hope you enjoy reading it.

*

Julieta Millan: Hi, I’m Julieta from Argentina. I live in La Plata, near Buenos Aires. I’m a biologist and I studied at the National University of La Plata. I was doing my PhD in neuroscience at the UBA in Buenos Aires, but I left recently due to my dissatisfaction with the scientific system in general.

I moved to the private sector, working as a data scientist, applying knowledge from my PhD. During this time, I learned to program in Python. Later, I was part of the ARPHAI project, an initiative that came out of COVID-19 and focused on epidemiological control, including diseases such as dengue. We worked on mathematical models of epidemics using data from electronic medical records in Argentina. This project was my first approach to data science and open science, a field I am passionate about.

Currently, I work as a data scientist in the private sector while continuing to participate in projects related to open science. I am also involved in the organisation of csv,conf, where I had the opportunity to meet inspiring people over the last years in Argentina and Mexico.

I am very happy to be here as a regional coordinator. I am new to the open science environment (my experience in ARPHAI was my first approach to open knowledge) and it is always very difficult to enter into environments that are already set up and even more so in one that is as niche as open knowledge. So I’m super grateful for the opportunity and I’m looking forward to doing a lot of things. I would love to hear your ideas and opinions about what we need here in Latin America.

Andrés Vázquez Flexes: I have a question. First, let me quickly introduce myself. I am a member of the core team of the Open Knowledge Foundation. I am a software developer with a technical profile, although in the past I have also been active in the open data and transparency community, especially in Córdoba, Argentina.

Julieta, first I want to say that I am very happy that we have a regional coordinator for Latin America. It seems to me something that we lacked in the past and I am very happy to have it.

If you have had time to evaluate, I would like to know what you see so far: what is our baseline, where are we starting from in Latin America and where do you think we are going? It’s a general question, but I think it’s the most pertinent one to start with.

Julieta Millan: I think it is clear that in Latin America we have a lot of desire to work and a lot of knowledge. We have a spectacular education, and there are many organisations that are developing tools and educational materials of great quality.

However, I think it would be very beneficial if the different organisations could get to know each other better. There are many individual initiatives that do not connect. My current goal is to get to know as many initiatives as possible in Latin America that work with open knowledge. This is quite difficult, as there are many and it is difficult to find them unless someone passes you the information or they are already in your network of contacts.

It is fundamental to build links and bridges between organisations to create something consistent that works for us, beyond following the guidelines generally established by the United States, Europe and other regions.

I don’t know if this answers any specific questions, but it’s a general idea I wanted to share.

Sara Petti: I think Andrés’ question is very interesting and it might be interesting to extend the question to all those present. Does anyone else have a particular idea about that?

Fernanda Carles: I am going to give an answer maybe a little bit different than expected. I worked for a long time in the area of open data, but right now I am a data scientist and I am working on environmental issues. I am collaborating with the Mozilla Foundation on a project to create a system to predict air quality in the city.

My current work consists of finding relevant data to develop these prediction systems. I have had a lot of trouble finding data from field measurements. We hear a lot about satellite measurements from northern technologies, but these are not sufficient. In Paraguay, and in other smaller countries in Latin America, there are many gaps in terms of ground measurement data, which are key for prediction and monitoring of environmental parameters. For example, climate and pollution data that can be easily found in other Latin American countries are not available here.

To get the data I need, I rely on various communities, such as open source communities that do air quality measurements, and some state actors that have certain data and make it available in real-time. I am trying to get this data for free. I think that in the environmental issue and in open source data collection, there are many gaps in Latin America and Africa, compared to other continents where there is better data quality and more state initiatives. This is especially worrying as our regions have the highest projected impacts due to climate change.

I would love to keep talking to people in the region to see how we can fill these gaps. Both from open source and advocacy initiatives, it is important to push for these initiatives to come from the state side as well.

Julieta, I would like to know your opinion from Argentina, considering your experience in biology and science. How do you see the availability of data in the region?

Julieta Millan: Concerning the scientific system, particularly, or in general? 

Fernanda Carles: I think in general. Well, for me in particular, I’m also very interested in sensors, measurements of environmental issues, etc.

Julieta Millan: What I can answer from my particular point of view, based on my experience, is that in general in the scientific system here in Argentina, and much of Latin America, open science is not really present. Many people who do science don’t understand how it works and what it is for. The idea of publishing your data seems absurd to them. They think: “How are you going to make your data available to others? What if they steal my ideas? What if they find a mistake in my work?”

This is a big problem within the scientific system, the typical problem of lack of reproducibility. It is terribly difficult to find data. It is often said that data is available on request, but then it never is.

Although there are open science initiatives in Latin America, what I see is that, in general, it is not common or accessible. Accessibility is very low. I would like to know if anyone else has seen something different.

Fernanda Carles: I think it is very interesting to talk about these issues because they have a philosophical aspect related to what our educational institutions value. I work with data and my main source of information is the academic sector. I have a lot of problems because there is a lot of resistance to sharing data with the general public.

This really hinders a lot of collaboration and growth processes that could be very interesting. It’s a big difference from how things are done elsewhere.

Haydée Svab: I had one thing to say, but now I have five (laughs). Well, I’m going to introduce myself first. I am the director of Open Knowledge Brazil. I am a civil engineer and my master’s degree was in transport engineering and planning. During my graduation, I got close to the open source movement and during my master’s degree, I got close to the open data and data science movement.

My master’s work focused on the intersection of transport and gender, analysing mobility patterns from a gender perspective. I found many data gaps. In Brazil, more than 50% of the population is non-white, but no origin-destination survey collects this information. Therefore, analyses from a racial perspective, as well as from a gender perspective, are very limited.

I agree with Fernanda that we have many challenges. When I teach or talk about data, I always mention that we have Big Data on top of the same data. There are a lot of data gaps and lack of collection methods that ensure accountability and quality.

In Brazil, I identify three major challenges that we face on a daily basis.

The first is the preservation of privacy. There is a fallacy that says that respecting data privacy means closing all open data. Currently, the privacy of personal data is used as an excuse for not publishing open data of public interest. We are at a time when it is crucial to open data to promote transparency, integrity, and social control. We participate in working groups with the government to achieve this.

The second thing is artificial intelligence. This is a very present and worrying issue. The Brazilian government has introduced a bill on the use of artificial intelligence, including facial recognition for public security. This is of great concern to us, especially in a country with many prisoners without a final conviction. We are working to remove these dangerous provisions from the bill.

And the third thing is climate change. Climate problems are serious in Brazil. We have recently had rain disasters in the south of the country. This weekend, we launched the locally adapted Open Data Index for Cities. We could not do it in person in Porto Alegre due to extreme weather conditions. Many people were unable to participate in the online edition because they had no electricity or internet. It is crucial to recognise the problem of climate crises and to act.

These are three key issues we are working on. I think we can exchange ideas and set up joint projects here, especially on climate issues that are clearly transnational.

Sorry for being long, but these issues are very important.

Julieta Millan: I understand everything perfectly. I completely agree with the issues, especially in our context with so many vulnerable populations. Focusing on any climate change issues and also making people and researchers understand why it is so important to share data is key.

Does anyone else want to continue or hasn’t spoken yet?

Omar Luna: It’s a pleasure to meet you all. I am currently at SocialTIC, I am in charge of communications for the Escuela de Datos project in Latam and we are in a process of reflection to collectively build the school that we all want.

From the basic question generated by Andrés, four key points come to mind. Last year, at AbreLatam in Uruguay, Natalia Carfi from Open Data Charter mentioned that in these 10 years of the open data agenda, each agenda has its own rhythm and movement. For example, the feminist movement has strongly driven the visibility of open data, as has the climate data community, which has highlighted the need in that field.

In addition, emerging issues such as systems of care and the care economy are gaining prominence and require more inter-institutional attention and support. On the other hand, we face challenges such as open science and open government, where some areas still do not fully understand their importance or face access restrictions for national security or privacy reasons. This is something Patricio Del Boca commented on at csv,conf,v8, when he talked about “messy transitions” of governance. We moved from ecosystem building to data restriction and opacity.

Thirdly, there is the growing issue of artificial intelligence and its application in open data, which raises questions about transparency and the macho, binary and completely lacking intersectionality bias of these models in Latin American contexts.

Finally, we have seen an evolution in community needs towards more specific and complex data, such as those related to climate change, which require data collection methods adapted to changing conditions.

In short, the challenge for the Escuela de Datos is not only to address these current issues but also to accompany and attract new generations that consume information differently and demand more dynamic agendas that are adaptable to regional and global challenges.

Julieta Millan: Thank you for the conversation, and actually, I was interested in commenting on the Escuela de Datos. In our recent conversations, especially in the exercise we are doing with the form for those working on open data and research in Latin America, we have identified three recurring challenges: the general lack of knowledge about open data, which leads to a lack of interest and, consequently, low participation. This, in turn, affects the ability to obtain funding, forming a vicious cycle that hinders the reach and dissemination of the information generated. If nobody pays attention, if nobody uses what we generate, we won’t get the funding.

You mentioned the importance of new formats for sharing information. Although I am not a communicator, I think it is essential to consider how to adapt to these changes to improve the reach and relevance of our messages.

Lucas Pretti: I am trying to find a common overview, but it is difficult. I would like to comment on two points that are resonating with me as I listen to you. First, there is the intergenerational issue. I think our generation experienced the democratic values of an internet that was not corporatised or massively policed. Two decades ago, there were other possibilities. Today, that discussion seems less relevant for young people.

So the question is whether there is a generational challenge here – are we the oldest in the room in this sense?

Fernanda Carles: I have a comment on that. For me, I think it is not necessarily a generational problem. In my area, I see that the lack of knowledge covers different aspects, such as business models. For example, in my search for data to validate certain things, I found three types of initiatives in Paraguay that monitor the type of data I need.

First there are the state initiatives, which have their own way of working. Then there are the private ones, where many people buy sensors, install them in the city and sell that data. And finally, there are the open source initiatives. I work especially with one of these initiatives that emerged as a direct response to private services that sell environmental data and consider their business model immoral.

I think the main problem is a lack of knowledge. There are many possibilities with data, and privatisation is the prevailing model that most people know about. That is, selling the data to make some kind of profit. But there are other models. I think it’s a question of ignorance more than anything else, I don’t know if it’s so much generational.

It’s an interesting issue to convince certain actors that opening up data can be beneficial, even in economic terms, which, well, let’s be honest, is what really drives things in our world.

Lucas Pretti: It’s good that you mentioned that because my second point was precisely the dependence on the state. Maybe this is a Brazilian view and it doesn’t happen in the same way in other places. That’s why I bring up the dependence on who is in government, whether at the national or provincial level; everything changes in terms of funding and special policies, especially with political polarisation.

We have a kind of oscillation in this sense. It’s not just having or not having money, but rather having more or less depending on who is in power. I wanted to put this on the table: to what extent are we still dependent on state funding? I don’t see many foundations or philanthropists in Latin America as you see, for example, in Europe or the United States, where there is more funding for large-scale projects.

I’m not just talking about small grants and donations, but significant funding. In Latin America, the situation is more state-oriented, and I think it is important to discuss this question of funding as well.

Haydée Svab: I want to comment on the previous point, about generations. I feel that there is a generational layer in how issues are addressed. What moved us before was open data, public transparency and so on. I don’t think that narrative moves the newer generations in the same way.

But artificial intelligence does appeal to children and teenagers, and people around 18, 20 years old. They want to address, study and discuss for example algorithmic transparency. This demand to know the data that algorithms train is the way I have used to talk about the importance of open data as an input for society.

It is necessary to change the narrative we used to use before. The new generations don’t seem to care as much about privacy as we do. In discussions I have participated in, it seems that privacy is an “old issue”, for them it is almost a given that they will not have privacy. However, if we talk about algorithmic influence in everyday life, such as recommendation systems or facial recognition affecting for example young black men, then we are touching on issues that really matter to them.

Julieta Millan: I think you’re right. I’ve just never had a conversation with someone younger who tells me if they care about privacy. It’s clear to them that their data is already on the internet and everybody has it. There is no such thing as privacy for them, it’s very real and very shocking. In that sense, we grew up very differently as a generation, challenged in a different way by technology. I had never thought of it that way. Thank you, the data is very good.

Fernanda Carles: It’s interesting how there is a lack of knowledge of the digital body as part of oneself, and how we surrender to the reality that all our data is out there. And at the same time, we don’t advocate for opening up data that could be helpful in generating solutions and making problems in our society visible.

Recently, I was having a conversation with friends about how we think differently about data issues from the individual or from the collective. These conversations and activism in this field are very interesting and could really help our area to work in a more agile way.

I wonder, Julieta, if you have seen any interesting work or ideas in the field of education on these issues. What are your plans and what can we collaborate on?

Julieta Millan: Let’s see. Plans? I don’t know… 

Lucas Pretti: Let’s take advantage of the fact that Escuela de Datos is in the room (laughs).

Julieta Millan: Yes, exactly (laughs). They can tell us. The thing is that any data we collect will have the bias of coming from people who already understand and know about open data. People who don’t know about it yet will not be in our environment or answer our questions, because they are simply not here.

It is crucial to reach out to this group and sow a seed of knowledge about open data, showing its benefits. It is important to reach out to those who are not yet part of this niche.

Andrés Vázquez Flexes: I would like to jump in and invite you to look for ways to reach out to the places where young people are. I have given talks in secondary schools about open data and I think that educating young people on this topic is crucial. Not as an abstract idea, but something concrete.

A few years ago, my son, in a fight with the school over air conditioning, asked to see the school budget. The principal called me and I had to explain about open data. This is a good example of how young people can understand and use this information.

In another school, senior students collected money, searched for prices and bid for suppliers to make an end-of-year campaign. This is an exercise in participation and transparency.

The civil servants of 10 years from now are young people who today have not heard of open data and participation. Even if we don’t solve it today, I think there are opportunities to talk to young people and educate them.

I love a phrase a friend of mine says when he gets data that is not open through scraping techniques or other things: “The data was already open, I just gave it a nudge”. I have a friend with a climate data start-up who, five years ago, started collecting data through scraping techniques. He got a lot of data that was not published in a portal or a nice csv. There are techniques that our activism can help us get data without doing anything outside the rules and actively publish it ourselves in an accessible way.

Fernanda Carles: I love it. Thank you, Andrés, for your comment. I’ll write to you later to learn a bit about your friend’s techniques.

Omar Luna: From Escuela de Datos we want to offer you spaces to disseminate and socialise knowledge about open data in Latin America. We have open spaces for contributions, such as tutorials, advocacy on tools or initiatives that you want to share with the community. We are based on open knowledge.

If you have experiences, reflections on open science, challenges or possibilities of climate data, such as Andrés’ experience with his entrepreneurial friend, we are also interested. We have an experience systematisation section on escueladedatos.online, where we document successful cases in the use of open data in areas such as electoral, climate and diversity data.

We are open to co-creating and formulating collaborative learning spaces. We want to expand this knowledge not only in the current community that is strong and mature but also for future generations.

Thank you all for the space. We keep in touch.

Julieta Millan: Yes, thank you all very much for participating and giving your time. It’s great that we have these conversations. I hope they keep happening.

Haydée Svab: Thank you very much. I am very happy to have leadership in Latin America. Omar, we are also here with the Brazilian Escola de Dados. We can exchange cards, do bilingual courses, exchange tools and everything else.

Julieta Millan: I will surely be in contact with you soon. You are good examples in Latin America of things that work really well, thank you very much!

Policy Brief: Governing Digital Public Infrastructure as a Commons / Open Knowledge Foundation

Policy Brief submitted to T20, Engagement Group of G20 under Task Force T05 – Inclusive digital transformation

Authors

  • Renata Avila, Open Knowledge Foundation (UK-Guatemala)
  • Ramya Chandrasekhar, Center for Internet and Society, CNRS (France-India)
  • Melanie Dulong de Rosnay, Center for Internet and Society, CNRS (France)
  • Andrew Rens, Research ICT Africa (South Africa)

Abstract and Keywords

Building upon the G20 New Delhi Leaders’ Declaration commitment to improving access to digital services through digital public infrastructure (DPI), this policy brief discusses the potential of governing DPI as a commons, drawing from the governance of data/digital ecosystems and practices of commoning and co-creation from the open movement.

A commons approach to governing DPI can help scale and localise DPI exchanges, increase transparency and accountability, accelerate their impact, reduce governance complexities, data and localisation frictions, and secure community engagement beyond the governments and companies involved. Important public goods as diverse as Wikipedia and Linux constituted and governed as digital commons, offer valuable lessons in digital governance. Rigorous research on Knowledge Commons, building on the Nobel prize-winning work of Elinor Ostrom, provides conceptual and practical resources to ensure that DPI increases equality as it addresses marginalisation.

The G20 Summit in Brazil promises to address transversal issues related to citizen participation, democratic governance, and urgent action to reduce poverty and tackle the climate crisis. The Commons-based governance structure proposed for DPI, enables cooperation by multiple actors, including the public sector, commercial providers and civil society, even across borders. The brief recommends developing and adopting a digital commons governance model for DPI to accelerate its adoption and increase the public benefits of technology deployed at scale while safeguarding the rights and digital sovereignty of countries and communities within them.

Keywords: digital governance, digital public goods, digital commons, cooperation, digital public infrastructure, data governance, democratic participation.

Beyond HTTP APIs: the case for database dumps in Cultural Heritage / Raffaele Messuti

In the realm of cultural heritage, we're not just developing websites; we're creating data platforms. One of the primary missions of cultural institutions is to make data (both metadata and digital content) freely available on the web. This data should come with appropriate usage licenses and in suitable formats to facilitate interoperability and content sharing.

However, there's no universal approach to this task, and it's far from simple. We face several challenges:

  1. Licensing issues: Licenses are often missing, incomplete, or inadequate (I won't discuss licenses in this post).
  2. Data format standardization: We continue to struggle with finding universally accepted standards, formats, and protocols.

The current landscape of data distribution methods is varied:

  • Data dumps in various formats (CSV, XML, JSON, XLS 💀, TXT)
  • HTTP APIs (XML, JSON, RESTful, GraphQL, etc.)
  • Data packaging solutions (Bagit, OCFL, Data Package)
  • Linked Open Data (LOD)

While we often advocate for the adoption of Linked Open Data, which theoretically should be the ultimate solution, my experience suggests that the effort required to maintain, understand, and use LOD often renders it impractical and unusable (Wikidata is an exception).

Tweet by Rob Sanderson about LOD

Seeking Simpler Solutions

Can we approach data distribution with simpler solutions? As always, it depends: if there is a need for real-time access to authoritative data, HTTP APIs are necessary (or streaming data solutions like Kafka and similar technologies). But if we can work with a copy of the data, not necessarily updated in real time, then using self-contained and self-describing formats is a viable solution.

I've always thought that the Internet Speculative Fiction Database (ISFDB) publishing of MySQL dumps was a smart move. Here's why:

  • An SQL dump includes implicit schema definitions along with the data content.
  • While not directly usable, loading it into a working MySQL server is straightforward. You can easily set one up on your laptop.
  • Once loaded, you can work with the data offline, eliminating the need for thousands of HTTP calls (and the associated challenges with authentication, caching, etc.)
  • SQL is an extremely powerful language and is often one of the first things every programmer should learn.

The Rise of SQLite, DuckDB and Parquet

In recent years, there has been a surge of interest in SQLite, an embedded database that has long been used in many applications. SQLite offers several advantages:

  • It's contained in a single file
  • It doesn't require a server
  • It's fast and efficient

While SQLite isn't suitable for scenarios requiring network access or massive multiple writes, it's perfect for distributing self-contained, ready-to-use data.

It's worth noting the significant contributions of Simon Willison in this area, particularly his suite of tools in the Datasette project. Other notable figures in this field include:

An extremely powerful new tool, similar to SQLite, that recently reached its 1.0 milestone is DuckDB. This column-oriented database focuses on OLAP applications, making it less suitable for generic use in web applications. However, DuckDB has a standout feature: it can read and write Parquet files, a highly efficient data storage format.

I was particularly inspired by Ben Schmidt's blog post, Sharing texts better. It introduced me to Parquet, a smart technology that can provide a robust foundation for data distribution, analysis, and manipulation in the cultural heritage sector. DuckDB makes working with Parquet easy, but there are plenty of libraries (and ETL platforms) to wire Parquet data into any application.

Practical Examples

Let's explore some real-world applications of these technologies in the cultural heritage domain:

1. Digipres Practice Index

The Digital Preservation Practice Index is an experiment by Andy Jackson that collects sources of information about digital preservation practices. It's already distributed as a SQLite file and can be browsed using Datasette Lite, requiring only a web browser and no additional software installation.

In this repository, I demonstrate the simple steps needed to convert the SQLite file to Parquet format using DuckDB. The resulting file can be queried directly in the browser using DuckDB Shell (note: it reads Parquet with HTTP Range Requests, so the file is partially downloaded over the network).

2. ABI Anagrafe Biblioteche Italiane

Italian institutions publish a lot of open data, but everything is deeply fragmented: a lot of ontologies, many APIs (official and unofficial), and many dumps with undocumented schemas.

For example, the Register of Italian Libraries provides a zip file containing some CSVs and XML files, and you need to struggle a bit to link pieces together.
In this repository https://github.com/atomotic/abi, I created some scripts to transform all those sources into a Parquet file. XMLs are converted to JSON, which is still not the perfect approach, but a single file of 5MB contains all that information. And you can play with it online using the DuckDB shell.

3. BNI Bibliografia Nazionale Italiana

The Italian National Bibliography (BNI) is the official repertory of publications published in Italy and received by the National Central Library of Florence in accordance with legal deposit regulations. The dumps consist of several XML files, each containing a list of records in Unimarc XML format.

In this repository https://github.com/atomotic/bni, I have created some scripts to scrape the content and convert it to Parquet. The XML sources were ~275MB, while the resulting Parquet file is just 70MB.

The Case for Self-Contained Data Formats

When APIs are not strictly necessary, distributing data in self-contained formats can significantly enhance usability and efficiency. The advantages:

  • Users can work offline and faster
  • Reduced server maintenance: with less reliance on real-time data serving, institutions can allocate fewer resources to maintaining complex API infrastructures.
  • Scalability: Self-contained data formats are inherently more scalable, as they don't suffer from traffic spikes that can overwhelm API servers.

In these crazy times of rapid advancement in Large Language Models (LLMs) and AI technologies, it's increasingly likely that your content will be scraped by external entities (setting aside ethical or political considerations for now). By providing data dumps, you can mitigate potential issues for your infrastructure.

You are invited to OPEN GOES COP’s 2nd Webinar / Open Knowledge Foundation

If you’re working at the intersection of openness and climate change, we encourage you to join us. In case you missed our inaugural webinar, we’ve started a movement! The “we” currently includes: Wiki Green Initiatives, Open Knowledge Network, Open Climate Campaign, and Open Data Charter.

We are a coalition of organisations and individuals aiming to advocate for openness in the context of the UN Climate Change Conferences (COP). 

We hope to reinvigorate discussions on the role of ‘openness’ as a necessary condition for addressing the climate crisis and to build the capacity of open movement activists and stakeholders from civil society and academia to influence high-level decisions on related issues.

Webinar title: What does openness mean for COP? with the Open Data Charter and Open Knowledge Brazil

In OPEN GOES COP’s next webinar, we’re going to highlight the work of coalition members, the Open Data Charter (ODC) and Diários do Clima (Climate Diaries) by Open Knowledge Brazil

The ODC works with governments and civil society organisations to open up data purposefully, while protecting the data rights of communities and people. ODC’s Executive Director, Natalia Carfi, will be discussing how openness in research and data availability can add value to our cause. They will discuss a recent project Uruguay 2100, which seeks to tell a story of how sea level rise will affect the Uruguayan coast by the year 2100. To do this, the team created maps showing the areas at risk of flooding in Montevideo by collecting and analysing data that was made publicly available.

Diários do Clima is a platform capable of aggregating environmental policy data to help search for acts published by Brazilian municipalities. Using technological automation, they monitor official gazettes to identify the most relevant documents for those following the issue. Each act is categorised and organised so that users can filter and receive alerts on topics and locations of interest.

Speakers:

  • Maxwell Beganim – OPEN GOES COP spokesperson
  • Nati Carfi – Open Data Charter
  • Haydée Svab – Open Knowledge Brazil
  • Moderator: Monica Granados – Open Climate Campaign

Event details:

  • 🗓 17 July 2024 
  • 🕒 3 pm UTC/GMT
  • 📍 Online (Zoom)

Lexicon Enhancers / Distant Reader Blog

This posting describes a number of Python scripts used to enhance a lexicon, where a lexicon is defined as a list of desirable, meaningful words. Compare a lexicon to a stop word list where stop word lists contain words of little use or interest, and a lexicon is a set of words of great significance.

The first script -- keywords2lexicon.py - takes the name of a Distant Reader study carrel and an integer (N) as input. It then computes the N most frequent keywords and outputs them to the carrel's etc/lexicon.txt file. This is a decent way to jumpstart a lexicon. Alternatively, create a list of words by hand.

Once you have a lexicon, you may want to enhance it, and there are three supported methods:

  1. lexicon2variants.py - given a study carrel, this script will find the lemmas of each word in the lexicon, identify the associated tokens ("words") with those lemmas, and send the result to standard output. This is a good way to identify variations in spellings but only the variations that exist in the carrel's corpus.
  2. lexicon2related.py - given a study carrel and a number (N), this script will loop through the lexicon to identify semantically related words. The number of related words can be equal to the top N similarities or the similarities whose value is greater than N, where N is a floating point number. The script will output the related words. All of this can only be done by first semantically indexing the carrel, and it only really becomes useful if the carrel's size can be measured in millions of words. This technique is a root of the current generative-AI trend.
  3. lexicon2synonyms.py - given a study carrel, loop through the carrel's lexicon and use WordNet to identify synonyms. This technique ought to be seen as complementary to lexicon2related.py as it will introduce words outside the carrel's corpus.

Used in different orders, with different parameters, and compounded between themselves, this tiny system of scripts will generate lists of words that may be of interest to the student, researcher, or scholar.

Given a refined lexicon, it is possible to create sophisticated full-text database queries, map the lexicon's words in a networked space, feed the words to a concordance for quick reading, etc. One might even calculate weights of individual documents based on the occurances of lexicon words. Hmmm...

As an example, here is a tiny lexicon generated from the list of computed keywords in Homer's Iliad and Odyssey:

father; great; jove; man

Here is a list of these words and their variants found in the corpus:

father; fathers; great; greater; greatest; jove; man; manned; men

Here is a list of these words and their semantically related words:

aside; father; great; jove; loud; man; mighty; murderous; redoubtable; sarpedon; shook; thereon

Here is a list of these words and their synonyms:

Church Father; Father; Father-God; Father of the Church; Fatherhood; Isle of Man; Jove; Jupiter; Man; Padre; adult male; bang-up; beget; begetter; beginner; big; bring forth; bully; capital; corking; cracking; dandy; don; enceinte; engender; expectant; father; forefather; founder; founding father; generate; gentleman; gentleman's gentleman; get; gravid; great; groovy; heavy; homo; human; human being; human beings; human race; humanity; humankind; humans; jove; keen; large; majuscule; male parent; man; mankind; military man; military personnel; mother; neat; nifty; not bad; outstanding; peachy; piece; serviceman; sire; slap-up; smashing; swell; valet; valet de chambre; with child; world

Finally, and very importantly, the output of these scripts are not intended to be taken whole cloth. Instead one is expected to get the output, peruse it, and then season it to your own taste. Computers are stupid. You are not.

Aus GLAMR / Hugh Rundle

This week was VALA 2024 and I was lucky enough to attend and present a poster. The point of the post was essentially to act as a sneaky advertising billboard for Aus GLAMR, an aggregator and news site for blogs, newsletters, events and groups by and for GLAMR workers in Australia and Aotearoa.

I haven't blogged about this yet because I was waiting for VALA, but I actually deployed it a few months ago. Aus GLAMR is the latest iteration of an application I've been running for around 8 years: previously known as Aus GLAM Blogs. It started life as a simple JavaScript file triggered by a cron job, to tweet out new blog posts from Australian librarians. Over time I've re-written it three times, with Aus GLAMR being the latest, bringing a considerable expansion of both scope and functionality.

There were two triggers for this rewrite - firstly the previous version had some dependencies that were proving troublesome and would have been a big effort to replace. Secondly, Twitter had reached its final death throes, and Elon Musk decided to start charging for use of the Twitter API.

Using Aus GLAMR to reach a bigger audience for your thing

For authors and organisers, Aus GLAMR helps you to reach GLAMR workers so they can find out about your latest blog posts or newsletter editions, your event, or your online group. You can register your thing:

  • Blog: If it's a website that often or sometimes has Strong GLAMR Themes, with an RSS or Atom feed, you can register it here.
  • Newsletter: Whether it uses Ghost, Write.as, Mailchimp or something else, if it's an email newsletter with some kind of GLAMR focus you can register it here.
  • Group: A "group" could be an email list, a subreddit, Mastodon server, etc. If it's a many-to-many electronic medium, it's probably a "group".
  • Event: Conference, convention, seminar, workshop, talk, meet-up...

Whilst the previous iterations all let you register and get updates about blog posts, Aus GLAMR introduces newsletters, groups and events. Basically I considered all the things GLAM Twitter was good for finding out about, and tried to include them here.

Subscribing to find out what's happening in GLAMR

On the other side of the equation, you can subscribe to get updates on the latest things being registered or captured. You have a few options depending on your preference:

  • get a weekly email update
  • follow the Mastodon bot via an ActivityPub application.
  • subscribe via RSS:
    • blog posts
    • events and calls for proposals
    • newsletters
    • everything

Using Aus GLAMR to find what you want to know about

One of the things that was important to me when migrating from the Aus GLAM Blogs to Aus GLAMR was to retain the references to old blog posts. This turned out to be a bit harder than I anticipated, as I was migrating an encrustation of three different versions of MongoDB data into Postresql. Depending on when the blog had been added, there was more or less information, and in some cases the blog was listed with an HTTP protocol whilst later posts were HTTPS. It was a bit of a mess, and there was some duplication. Ultimately I used the existing data and ran a custom Python script that pulled more information from each RSS feed or website using Beautiful Soup, to create a Django fixture that could then initiate the new database.

You can now reap the benefits of this because with a slightly enlarged and normalised set of data, you can search the collection by keyword, or browse by type, category, or blog/newsletter. If you're interested in past Australian library blogs on a particular topic, for example, you can use Aus GLAMR to find them.

What's next?

I'd love for the GLAMR community in Australia and Aotearoa to embrace the Aus GLAMR app to share and find out about what's happening in the professions. I'm hoping the weekly email update will enable wider dissemination than the previous requirement to use Twitter, Mastodon or RSS. If you work in GLAMR, please tell people about it!

As a last little aside – the only reason the core functionality of Aus GLAMR is possible is because of the RSS and Atom standards. These are free and open standards anyone can use. And they do use them, even though most people don't realise it. Every blogging software and most newsletter platforms include RSS running quietly in the background. This is how Aus GLAMR knows there is a new blog post or newsletter edition to add to the database. Podcasting runs on RSS too. Open standards are always bigger and better than corporate APIs. Don't let anyone tell you otherwise.


July 4 and the power of words / John Mark Ockerbloom

The history of online books is intertwined with the history of Independence Day. It was on July 4, 1971 that Michael Hart, the founder of Project Gutenberg, was inspired to enter into a computer a copy of the Declaration of Independence. Here are some of the words he typed:

We hold these truths to be self-evident, that all men are created equal, that they are endowed by their Creator with certain unalienable Rights, that among these are Life, Liberty, and the pursuit of Happiness. That to secure these rights, Governments are instituted among Men, deriving their just powers from the consent of the governed, That whenever any Form of Government becomes destructive of these ends, it is the Right of the People to alter or to abolish it, and to institute new Government, laying its foundation on such principles and organizing its powers in such form, as to them shall seem most likely to effect their Safety and Happiness.

195 years after the first publication of these words heralded the birth of a new nation, Michael Hart quoted them in what’s often considered the birth of the ebook. These words have staying power not just because the United States still exists as one of the world’s most powerful nations, and not just because they inspired people to fight for independence in the 18th century. They have staying power because they inspired people to establish and protect human dignity, equality, and empowerment in many other times and places. Consider, for instance, the Seneca Falls Declaration, echoing and reworking these words 72 years later to proclaim “all men and women are created equal”, and to launch efforts to secure American women’s rights to vote that would take another 72 years to win. Consider also the many declarations around the world that, as David Armitage notes, have drawn on this Declaration’s words for inspiration.

It’s often easy for July 4 to be an occasion for Americans to glibly congratulate ourselves. But even if the Declaration proclaims the truths above “self-evident”, the fulfillment of their promise has been anything but self-evident. Those who wrote and signed the Declaration knew they would be at risk for losing “our Lives, our Fortunes and our sacred Honor” for years afterwards, until those who fought and supported the American Revolution made Great Britain agree to a treaty recognizing the new nation. The writers of the Declaration themselves also failed to consistently live up to the ideals stated in it. Many of them, including the Declaration’s main drafter, continued to enslave other people for all of their lives. Indeed, the conflict between the principle that “all men are created equal” and the desire to keep some human beings subjugated to others eventually led to a war much bloodier than the American Revolution. During that war, Lincoln’s Gettysburg Address again invoked the Declaration’s founding principle of equality, and called the Civil War a test of whether “any nation so conceived and so dedicated can long endure.”

Our nation has endured many such tests in our 248 years, some of them resulting in victories, some of them resulting in setbacks. Sometimes the power of words alone has not been enough to guarantee people’s “unalienable Rights”. In the wake of the Civil War, three constitutional amendments were passed that seemed to guarantee the right to vote regardless of race, the equal protection of the laws, and the end of slavery. But less than a decade later, a bitterly fought close election led to a compromise that undid those guarantees. African Americans soon effectively lost their right to vote in much of the United States, until the 1965 Voting Rights Act. Exploitation of convict labor in tandem with a dramatic rise in imprisonment, often under flimsy pretexts, led to what Douglas A. Blackmon and others have called “slavery by another name“. In Plessy v. Ferguson, John Marshall Harlan insisted in his dissent that the Fourteenth Amendment “placed our free institutions upon the broad and sure foundation of the equality of all men before the law”. But he could not override the ruling of seven other Supreme Court justices in that case, who accepted a claim that enforced racial segregation could in practice be “separate but equal”.

These setbacks and others remind us that while “the People” can institute governments that respect our rights, “the people” can also let us down. We are people, after all: human beings who have unalienable rights and dignity, but who also have flaws, vices, weaknesses, and vulnerabilities. Any given institution, from a neighborhood association to the Supreme Court, can let us down too. They are also made of people, after all, ultimately doing what the people involved in them decide to do.

I know a number of people who are discouraged this July 4 in particular, as we’ve seen our Supreme Court once again let us down. They have issued rulings whose implications extend to “altering fundamentally the Forms of our Governments”, as the Declaration writers complained of King George. Particularly alarming among them is the ruling of six of the justices giving presidents immunity from prosecution so broad as to enable the kind of “despotism” the Declaration writers denounced in George III. Justices Sotomayor and Jackson noted this in their dissents, and other writers like Mark Joseph Stern have amply argued the case as well. The Supreme Court by itself cannot kill American democracy, but figuratively speaking, it’s placed a gun on the mantelpiece that a leading presidential candidate has demonstrated a willingness to fire upon regaining power. If we thought we could rely on the Supreme Court or any other institution unquestioningly, or assume that saying the right words to it would make it honor those words as we thought they should, we’ve been gravely disillusioned.

What our words can do, though, is help us keep working for what we know is right, until we overcome the setbacks with victories. In our libraries, we can bring these words together, and ensure that people continue to have access to them, for as long as that takes. We have all kinds of words that are needed for those struggles. We have words that inspire, words that persuade, and words that tell the stories of what has been, and what could be. We have words that tell us how democracies have formed, how they’ve failed, and how they’ve been restored. We have the words of all kinds of people that help us more easily see and treat those people as our equals, and deserving of the same unalienable rights that we have ourselves. We have words that illuminate the grievances behind long-standing conflicts, expose the atrocities of those conflicts, and show us how some of those conflicts have been resolved justly, even when few could imagine that outcome. We have words that help us pursue our happiness, and enable others to pursue their happiness as well.

Brought together, and freely shared, words have a lot of power. Authoritarians know this– that’s why so many of them try to ban books or coercively manipulate the way people communicate with each other. It’s why librarians raise the alarm when they see book bans on the rise, on whatever pretext. And it’s also a reason for us to hope that as we communicate with each other, organize with each other, and take action with each other, we can make real the promises of the words we treasure.

That’s why on this July 4, I celebrate the words of the Declaration of Independence that Michael Hart typed out and that I repeated at the top of this post. I’ll continue to organize and seek out more words that can be used to protect life, promote liberty, and pursue happiness. I resolve to do what I can, particularly between now and November, to work towards a society and government that honors the fundamental equality of all human beings, and the free consent of the governed. And I remember those who have engaged in similar struggles before me, including those Lincoln honored for fighting in another July long past, so that “government of the people, by the people, for the people, shall not perish from the earth.”

More On The Halvening / David Rosenthal

At the end of May I wrote One Heck Of A Halvening about the aftermath of the halving of Bitcoin's block reward on April 19th. Six weeks later it is time for a quick update, so follow me below the fold.

Source
All the graphs in this post are 7-day moving averages, to filter out some of the noise. The first shows what the halvening did to the hash rate. Before the halvening it was around 620EH/s. Except for a spike up to 657EH/s on 26th May, after the BTC "price" hit nearly $71.5K, it has hovered around 580EH/s. Since power is the biggest cost, the miners' costs have dropped up to 6.5% since the halvening.

Source
The second graph is much more dramatic. It shows miners' revenue dropping from around $70M/day before the the halvening to around $30M/day after, a drop of 57%. So with costs down 6.5% and revenue down 57% their margins have taken a huge hit.

Back in May I wrote:
Lets start back in October when the Bitcoin "price" was in the high $20Ks. This was a problem, because only the most efficient miners could make a profit at that "price".
Mysteriously, that was when Tether's money printer started running at full speed, and the "price" got pumped up to around $70K. If I'm right that only the most efficient miners could make money with the "price" in the high $20Ks, and if their (income - costs) has dropped 50%, the most efficient miners need the "price" to be in the mid-$50Ks. Mysteriously, the mid-$50Ks is where the "price" has dropped to, down 21% since its last peak on June 5th.

Source
That the hash rate was broadly flat after the halvening is interesting. Before the pre-halvening pump started, with the BTC "price" roughly flat, there was a slow but steady increase in the hash rate reflecting Bitmain's shipments of new, efficient chips displacing a smaller contribution from older chips being turned off. Bitmain's shipments continued, but once the block reward was halved many more of the less efficient chips should have been turned off, and the hash rate reduced. This didn't happen, and I can see two possible explanations:
  • It may be that mining power is highly skewed, with the vast majority coming from the latest chips running on low-cost power which are still profitable after the halvening. Then the shutdown of a huge number of older chips sucked in by the pre-halvening pump would have had only a small reduction in the total hash rate.
  • It may be that miners are burning their excess profit from the pre-halvening pump by staying in the market rather than pivoting to AI, each hoping to be among the survivors from the eventual cull.
Source
Another feature of the halvening was a massive spike in transaction fees to over $40/transaction. The third graph shows this spike, and a smaller one around the peak in the BTIC "price". But it also shows that right now no-one wants to transact, because the average fee is now $1.72. So the miners are once again almost entirely dependent upon inflating the currency with block rewards.

Source
The last graph shows the total cost of a transaction. It shows that the fee spike around the "price" peak made little difference, The current average cost per transaction is $48.57, of which $1.72 is the fee. So the system is 96.5% supported by inflation; there is currently no risk of the dangers associated with Fee-Only Bitcoin.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 9 July 2024 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by a team of OCLC contributors.

Several books stacked on top of each other.Photo by Kimberly Farmer on Unsplash

Discussion on confronting book bans

On Wednesday, 31 July 2024, the University of Chicago Library (OCLC Symbol:  CGU); the University of Chicago Divinity School; the Center for the Study of Race, Politics, and Culture at the University of Chicago; and the School of Information Sciences at University of Illinois Urbana-Champaign (OCLC Symbol:  UIU) will present a free virtual webinar, “Confronting Book Bans:  A Panel Discussion.”  In addition to the live presentation 10:00-11:30 a.m. Central Time on July 31, an online recording will be made available in August.  University Librarian and Dean of the University of Chicago Library Torsten Reimer will moderate the discussion between American Library Association 2023-2024 President Emily Drabinski, who is Associate Professor at the Graduate School of Library and Information Studies at Queens College (OCLC Symbol:  Q7L) and Emily J.M. Knox, who is Associate Professor in the iSchool at Illinois.  Currently the Critical Pedagogy Librarian at the City University of New York Graduate Center (OCLC Symbol: ZGM), Drabinski has been a regular contributor to Truthout, the independent nonprofit news organization dedicated to social justice issues.  In 2023, Knox’s book Foundations of Intellectual Freedom was awarded the ALA Intellectual Freedom Round Table Eli M. Oboler Prize for best published work in the area of intellectual freedom. 

Drabinski and Knox are two of the library’s world’s most prominent and knowledgeable scholars fighting book bans in the United States. This discussion promises to offer historical background, analysis of the current struggles, and advice about dealing with censorship.  Contributed by Jay Weitz.

Task group addresses metadata related to Indigenous Peoples of the Americas 

In Descriptive Notes, the blog of the Description Section of the Society of American Archivists, Katherine Witzig (Choctaw Nation of Oklahoma) gives an update on the Program for Cooperative Cataloging’s Task Group for Metadata Related to Indigenous Peoples of the Americas (Taking on the Challenge: PCC’s Metadata Justice Work for Indigenous Communities). The group has written a preliminary report, but Witzig acknowledges there is more work to do including creating statements to contextualize the harm of inappropriate metadata and to encourage work in the field. Other groups are working on recommendations for subject headings and the Library of Congress Classification system. A survey is available to gather input from all who work with Indigenous metadata, and will be open until the Association for Tribal Libraries, Archives, and Museums (ATALM) 2024 conference in November.  

I am so impressed by Witzig and their co-chair Brandon Castle who are not only new professionals but started doing their work as task group leaders as graduate students (Castle has recently started a position as Native American & Indigenous Studies Librarian at the University of Massachusetts Amherst). The scope of the work is daunting, but the group’s energy and enthusiasm are inspiring. Contributed by Merrilee Proffitt

BookNet Canada produces blog series on the impact of book bans 

BookNet Canada, a non-profit organization that develops technology, standards, and education to serve the Canadian book industry, produced a three-blog series written by researcher Aline Zara on book banning and censorship in Canada. The first blog, “Tracking banned books in Canada,” was published on 16 November 2023 and demonstrated how Canadian book sellers began tracking banned books in its distribution metadata. The second blog, “Buying banned books in Canada,” was published on 22 February 2024. Based on the research in the first blog, Zara identified the top 10 individual books and top 10 book series that had been identified as banned or challenged. Altogether, sales of these banned books decreased 23% from July 2020 to June 2023. The final blog, “Borrowing banned books in Canada,” was published on 27 June 2024. Looking at the same 20 books and book series, Zara found that loans of these banned books steadily increased 690% from July 2020 to June 2023 and library holds for these banned books increased 32% over the same period. While Zara does not draw any conclusions from her blog posts, it appears that while book bans affect book sales negatively, they increase interest in library patrons, making it even more crucial for libraries to keep these books available to users. 

“Every reader their book. Every book its reader.” Ranganathan’s second and third laws of library science continue to resonate throughout this period of increasing book bans and challenges. Readers continue to rely on libraries to provide materials that allow them to gain new perspectives and grow as citizens. Contributed by Morris Levy.   

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 9 July 2024 appeared first on Hanging Together.

How can ‘openness’ complement climate change? / Open Knowledge Foundation

Text originally published at Eco Media

Climate activists and open knowledge experts in the Global South and Global North have begun a partnership to seek more ‘openness’ in climate change negotiations.

The coalition OPEN GOES COP includes organisations and individuals advocating for openness in the context of the UN Climate Change Conferences (COP).

Screenshots taken during the Inaugural Webinar of OPEN GOES COP on June 5th 2024, World Environment Day

There is a knowledge gap when it comes to climate change and access to information, such as research. Most climate change research is not accessible, with Curtin Open Knowledge Initiative estimating that 47% of climate change research is not accessible.

According to Monica Granados of the Open Climate Campaign, through bibliometric research, they estimated that only about 49% of climate change publications are openly available.

The problem is not a lack of research, but rather the availability of this research to inform decisions. If research is not available, it will mean that much of the knowledge on climate change research will not be accessible without a paid subscription or pay-per-use fee, a cost that will be a burden to individuals, communities, governments, or institutions in low and middle-income countries.

The case for openness in climate action

In 2015, all countries – developed and developing – signed an agreement known as the Paris Agreement that requires them to make significant commitments to address climate change.

Article 13 of the agreement requires all parties to build mutual trust and confidence and to promote effective implementation, an enhanced transparency framework for action and support, with built-in flexibility which considers the Parties’ different capacities and builds upon collective experience.

The transparency requirements can be achieved only if there is available data on existing climate change research. Data is a basic element of any openness, providing an understanding of what is being worked on.

However, there is still a vast margin of work needed when it comes to open data sharing between governments and international agencies or between researchers within the UN Climate Change Conferences and in the larger sustainability space.

“We would love to see open data being discussed as a way of implementing the enhanced Article 13 of the Paris Agreement and that national delegations to COP engage with their open data teams back home to bring proposals on how to implement that collaboration,” said Natalia Carfi, of the Open Data Charter.

“Out of the research we conducted, we know that in some cases, the governmental teams in charge of reporting to the UNFCCC are not connected at all with the open data team and we believe there is public value in the data being created that is locked without that interaction,” she added.

Among the many challenges within the climate change and larger sustainability space is the replication of projects, due to the lack of public digital infrastructure on climate change research, which draws home to no impact.

“With Open Goes COP, we want to create a space where existing projects are showcased so they can serve as best practice. We want to build bridges among different projects with a climate focus on the open movement out there. The idea here is not to compete, but to cooperate instead,” explained Sara Petti of the Open Knowledge Foundation.

Openness in the African Climate Context

In the African climate context, the continent is responsible for less than 10% of global greenhouse gas emissions. However, it faces more effects of climate change than continents with larger emissions.

In 2022, weather, climate, and water-related hazards directly affected over 110 million people in Africa, causing over $8.5 billion in economic damages, according to the Emergency Event Database. There were a reported 5,000 fatalities, with 48% associated with drought and 43% with flooding.

The continent faces two crises in the context of climate change: a lack of finance for adaptation and mitigation innovations and data challenges to inform decision-making and policies on the planetary crisis.

The data challenge is due to the lack of digital public infrastructures that facilitate research sharing.

Another challenge is internet accessibility, as most research and data registries are online. For example, Africa’s Internet penetration rate is much lower at 36% compared to Asia’s 67.4% and the European Union’s 89%.

“It is important to note that openness alone is not a panacea, and its benefits can only be realized if accompanied by complementary policies, investments, and capacity-building efforts. Careful implementation and the leveraging of openness in alignment with the specific needs and contexts of African and other Global South countries are crucial for maximizing the positive impacts,” suggested Otuo-Akyampong Boakye, of the Wiki Green Initiative.

There are hundreds of open knowledge initiatives existing, many in European nations and the United States. To achieve openness in climate change, all parties need to agree on the creation framework that would serve the purpose.

What will be the essence of a research work if it doesn’t benefit its intended audience and bring about change?

Engineering For The Long Term / David Rosenthal

Content Warning: this post contains blatant self-promotion.

Contributions to engineering fields can only reasonably be assessed in hindsight, by looking at how they survived exposure to the real world over the long term. Four of my contributions to various systems have stood the test of time. Below the fold, I blow my own horn four times.

Four Decades

X11R1 on a Sun/1
Wikipedia has a pretty good history of the early days of the X Window System. In The X Window System At 40 I detailed my contributions to the early development of X. To my amazement 40 years after Bob Scheifler's initial release it is still valiantly resisting obsolescence. I contributed to the design, implementation, testing, release engineering and documentation of X11 starting a bit over 39 years ago. At least my design for how X handles keyboards is still the way it works.

All this while I was also working on a competitor, Sun's NeWS — which didn't survive the test of time.

Nearly Three-and-a-Half Decades

One of the things I really enjoyed about working on NeWS was that the PostScript environment it implemented was object-oriented, a legacy of PostScript's origins at Xerox PARC. Owen Densmore and I developed A User‐Interface Toolkit in Object‐Oriented PostScript that made developing NeWS applications very easy, provided you were comfortable with an object-oriented programming paradigm.

I think it was sometime in 1988 while working on the SunOS 4.0 kernel that I realized that the BSD Vnode interface was in a loose sense object-oriented. It defines the interface between the file system and the rest of the kernel. An instance of BSD's type vnode consisted of some instance data and a pointer to an "ops vector" that defined its class via an array of methods (function pointers). But it wasn't object-oriented enough to, for example, implement inheritance properly.

This flaw had led to some inelegancies as the interface had evolved through time, but what interested me more was the potential applications that would be unleashed if the interface could be made properly object-oriented. Instead of being implemented from scratch, file systems could be implemented by sub-classing other file systems. For example, a read-only file system such as a CD-ROM could be made writable by "stacking" a cache file system on top, as shown in Figure 11. I immediately saw the possibility of significant improvements in system administration that could flow from stacking file systems.

Evolving the Vnode Interface: Fig. 11
I started building a prototype by performing major surgery on a copy of the code that would become SunOS 4.1. By late 1989 it worked well enough to demonstrate the potential of the idea, so I published 1990's Evolving the Vnode Interface. The paper describes a number of Vnode modules that can be stacked together to implement interesting functions. Among them was cache-fs, which layered a writable local file system above a local or remote read-only file system:
This simple module can use any file system as a file-level cache for any other (read-only) file system. It has no knowledge of the file systems it is using; it sees them only via their opaque vnodes. Figure 11 shows it using a local writable ufs file system to cache a remote read-only NFS file system, thereby reducing the load on the server. Another possible configuration would be to use a local writable ufs file system to cache a CD-ROM, obscuring the speed penalty of CD.
Over the next quarter-century the idea of stacking vnodes and the related idea of "union mounts" from Rob Pike and Plan 9 churned around until, in October 2014, Linus Torvalds added overlayfs to the 3.18 kernel. I covered the details of this history in 2015's It takes longer than it takes. In it I quoted from Valerie Aurora's excellent series of articles about the architectural and implementation difficulties involved in adding union mounts to the Linux kernel. I concurred with her statement that:
The consensus at the 2009 Linux file systems workshop was that stackable file systems are conceptually elegant, but difficult or impossible to implement in a maintainable manner with the current VFS structure. My own experience writing a stacked file system (an in-kernel chunkfs prototype) leads me to agree with these criticisms.
I wrote:
Note that my original paper was only incidentally about union mounts, it was a critique of the then-current VFS structure, and a suggestion that stackable vnodes might be a better way to go. It was such a seductive suggestion that it took nearly two decades to refute it!
Nevertheless, the example I used in Evolving the Vnode Interface of a use for stacking vnodes was what persisted. It took a while for the fact that overlayfs was an official part of the Linux kernel to percolate through the ecosystem, but after six years I was able to write Blatant Self-Promotion about the transformation it wrought on Linux's packaging and software distribution, inspired by Liam Proven's NixOS and the changing face of Linux operating systems. He writes about less radical ideas than NixOS:
So, instead of re-architecting the way distros are built, vendors are reimplementing similar functionality using simpler tools inherited from the server world: containers, squashfs filesystems inside single files, and, for distros that have them, copy-on-write filesystems to provide rollback functionality.

The goal is to build operating systems as robust as mobile OSes: periodically, the vendor ships a thoroughly tested and integrated image which end users can't change and don't need to. In normal use, the root filesystem is mounted read-only, and there's no package manager.
Since then this model has become universal. Distros ship as a bootable ISO image, which uses overlayfs to mount a writable temporary file system on top. This is precisely how my 1989 prototype was intended to ship SunOS 4.1. The technology has spread to individual applications with systems such as Snaps and Flatpak.

Three Decades

The opportunity we saw when we started Nvidia was that the PC was transitioning from the ISA bus to version 1 of the PCI bus. The ISA bus' bandwidth was completely inadequate for 3D games, but the PCI bus had considerably more. Whether it was enough was an open question. We clearly needed to make the best possible use of the limited bandwidth we could get.

Nvidia's first chip had three key innovations:
  1. Rendering objects with quadric patches not triangles. A realistic model using quadric patches needed perehaps a fifth of the data for an equivalent triangle model.
  2. I/O virtualization with applications using a write-mostly, object-oriented interface. Read operations are neccessarily synchronous, whereas write operations are asynchronous. Thus the more writes per read across sthe bus, the better the utilization of the available bus bandwidth.
  3. A modular internal architecture based on an on-chip token-ring network. Thie goal was that each functional unit be simple enough to be designed and tested by a three-person team.
SEGA's Virtua Fighter on NV1
The first two of these enabled us to get Sega's arcade games running at full frame rate on a PC. Curtis Priem and I designed the second of these, and it is the one that has lasted:
  • I/O virtualization allowed multiple processes direct access to the graphics hardware, with no need to pass operations through the operating system. I explained the importance of this a decade ago in Hardware I/O Virtualization, using the example of Amazon building their own network interface cards. Tne first chip appeared on the bus as having 128 wide FIFOs. The operating system could map one of them into each process wanting access to the chip, allowing applications direct access to the hardware but under the control of the operating system.
  • The interface was write-mostly because the application could read from the FIFO the number of free slots, that is the number of writes before the bus would stall.
  • The interface was object-oriented because the data and the offset in the FIFO formed an invocation of a method on an instance of a (virtual) class. Some classes were implemented in hardware, others trapped into the kernel and were implemented by the driver, but the application just created and used instances of the available classes without knowing which was which. The classes were arranged in a hierarchy starting with class CLASS. Enumerating the instances of class CLASS told the application which classes it could use. Enumerating the instances of each of those classes told the application how many of each type of resource it could use.
The importance of the last of these was that it decoupled the hardware and software release schedules. Drivers could emulate classes that had yet to appear in hardware, the applications would use the hardware once it was available. Old software would run on newer hardware, it would just see some classes it didn't know how to use. One of our frustrations with Sun was the way software and hardware release schedules were inextricably interlinked.

Two-and-a-Half Decades

Last October I celebrated the LOCKSS Program Turns 25. Vicky Reich explained to me how libraries preserved paper academic journals and how their move to the Web was changing libraries role from purchasing a copy of the journal to renting access to the publisher's copy, and I came up with the overall peer-to-peer archtecture (and the acronym). With help from Mark Seiden I built the prototype, and after using it to demonstrate the feasibility of the concept, also used it to show vulnerabilities in the initial protocol. In 2003 I was part of the team that solved these problems, for which we were awarded the Best Paper award at the Symposium on Operating System Principles for Preserving peer replicas by rate-limited sampled voting.

2024 Virtual DLF Forum Program Available & Registration Now Open / Digital Library Federation

The Council on Library and Information Resources is pleased to announce that the program is available and registration is now open for the virtual Digital Library Federation’s (DLF) Forum happening online this fall, October 22-23, 2024.

Browse the program

Register here to secure your spot

The DLF Forum welcomes digital library, archives, and museum practitioners from member institutions and beyond—for whom it serves as a meeting place, marketplace, and congress. Here, the DLF community celebrates successes, learns from mistakes, sets grassroots agendas, and organizes for action. Learn more about the event.

 

Subscribe to our newsletter to be sure to hear all the Forum news first. 

The post 2024 Virtual DLF Forum Program Available & Registration Now Open appeared first on DLF.

All the streets in Montclair / Eric Hellman

 (I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

At the end of 2020, Strava told me I had run 1362 miles over 12 months.  "I hope I never do that again!" I told a running friend. It seemed appropriate that my very last running song from shuffle was Fountains of Wayne's "Stacy's Mom"; founding member Adam Schlesinger had died of Covid. For months of that pandemic year, there wasn't much to do except work on my computer and run. It was boring, but at the same time I loved it. In retrospect,  the parks needn't have closed (or later in the year, required masks while running through). Remember how we veered around other people just enjoying fresh air?  In that year, running was one thing that made sense. But never have I celebrated the new year as joyously I did on the eve of 2021. Vaccines were on the way, the guy who suggested drinking bleach was heading to Florida, and I had a map of Montclair to fill in.

I've called Montclair, the New Jersey town where I live, "a running resort". It has beautiful parks, long, flat tree-lined streets without much traffic, short steep streets for hill work, well maintained tracks, a wonderful running store,  and at least 3 running clubs. During pandemic, everyone seemed to be out running. Even my wife, who for many years would tell me "I don't understand how you can run so much", started running so much. At Christmas our son gave us both  street maps of Montclair to put on the fridge so we could record our running wanderings.

So, come 2021 the three of us said goodbye to the boring routine of running favorite routes. Montclair has 363 streets, and a couple of named alleys so we could have done a street a day for a year if we had wanted to. But it was more fun to construct routes that crossed off several streets ata time. While I was at it, I could make strava art or spell words. Most of my running masterpieces were ex post cursus pareidolia.  Occasionally I spelled out words. Here's "love" (in memory of a running friend's partner). 

Starting on New Year's Day with the Resolution run up "Snake Hill",  I methodically crossed off streets. I passed Yogi Berra's "Fork in the Road"  I finished the complete set of Montclair streets on Jun 13 .

The neighboring town of Glen Ridge came quickly on July 18, as I had done well over half on the way to Montclair streets. Near me, Glen Ridge is only 2 and a half blocks wide! I took a peek at the Frank Lloyd Wright house on a street I'd not been on before! 

With five months left I started on Bloomfield, the next town east. Bloomfield is cut in half by the Garden State Parkway, the source of the "which exit?" joke about New Jersey, and I focused on the half near to me. I got to know Clark's Pond. My streets running helped me set my half marathon PR, in the lovely town of Corning, New York. 

I know of other streets running completists - it seems there's even an app to help you do it. Author Laura Carney wrote about it in her book "My Father's List"  My friend Chris has continued to add towns and cities to his list and has only 9 streets left to finish ALL OF ESSEX COUNTY. Update: He finished! and was written up by nj.com!

To finish the year I spelled out 2021.

2021: 1,268.3 miles, 223 hours 36minutes, 40,653 ft vertical. I ran to 1,700 different songs. Last running song of the year (on shuffle): Joy Division's "No Love Lost":

Wishing that this day won't last

To never see you show your age

To watch until the beauty fades


Posts in this series:


Don't bring me solutions, bring me problems / Hugh Rundle

A couple of weeks ago Emilia Bell published The problem with solutions, an interesting take on some of the problems with the management mantra I encountered a lot in my earlier years in the profession: Don't bring me problems, bring me solutions. Emilia explains fairly succinctly why this is a dumb thing to say as a manager, but it reminded me that this is one of the reasons that at my work people do something they think is helpful, but is actually not: they bring me solutions, when what I want is problems.

In my current role at MPOW, I manage a team of technical specialists. They have expertise in web development, library discovery systems, open education, and digital file management. We manage the library discovery systems, configuration for the library service platform, and the library's websites, amongst other things. One of my enduring frustrations is that despite this pool of expertise, when colleagues come to us for assistance they usually articulate a specific task they wish for us to perform rather than asking us for advice on how to achieve a desired outcome. That is: they come to us with solutions, rather than problems.

There is always a well-known solution to every human problem—neat, plausible, and wrong.

H. L. Mencken

I'm not sure exactly why this happens, but I have a few theories. Partially, it is probably due to the experience some people like me have had with managers telling us to come with solutions, not problems. Partially it is a lack of understanding of–or perhaps respect for–the expertise of the team. And partially I suspect it is an unfortunate side-effect of too many keynotes telling librarians we're geniuses who know everything.

What exactly is expertise? I think perhaps a misunderstanding about the answer to this question is at the heart of the problem I experience. Intuitively we assume that "expertise" means knowing lots of solutions. Engineers know how to build bridges. Comedians know how to tell jokes. Computer programmers know how to write high quality code. Librarians know how to organise and find information. But I think this is a mistake. What expertise brings, more than knowing the solutions, is understanding the problems.

Civil engineers know how to build bridges better than you because they know a large number of things that can make bridges collapse. Experienced comedians can more reliably make people laugh than the average person because they have a lot of experience telling jokes that fell flat. Good computer programmers know a lot of ways that code will make your computer hang, or expose you to security problems, or work, but very inefficiently. Anyone can think of a possible solution, but it takes domain expertise to understand all of the possible problems.

This is why the powerful professions of every society always overestimate their ability to understand problems and provide solutions. Medieval Europe had it's Church Fathers. Soviet Russia had its technocrats. Our current moment has Silicon Valley tech bros. These are people who articulate solutions without understanding the problems outside their own domain expertise.

The best results come when people responsible for providing solutions have a clear articulation of the problem, consult widely about their proposed solution, and receive clear information about it based on experience and knowledge, rather than merely opinion. There is a good reason that the software industry often bases feature development around "user stories" in the formula "As a [type of user], I want to [do something] so that [desired outcome]". But for those of us working in maintenance rather than development, we need a different model. Perhaps an expansion, like: "As a [type of user], I want to [do something] so that [desired outcome], but [experience] makes this difficult because [reason]".

A request like this would allow my team to develop a potential solution and go back to the team requesting a fix to check whether this actually resolves the problem they have identified. We understand most of the possible problems within our domain (mostly technical). They understand most of the possible problems in their domain (often social, or technical in a different domain). It is then a matter of finding a solution that resolves or avoids both types of problems. But it is only by including all the relevant domain expertise that we will find a suitable solution: only considering the technical, or the social, or some other single type of expertise, will result in a "solution" that is at best sub-optimal. We need and want our colleagues to identify issues with the way we're doing our work. Those issues are likely problems identified from outside our own domain expertise – that's good! But the solution to that problem may well come from outside the domain expertise of the person who identified it. In other words:

Don't bring me solutions, bring me problems.


X Window System At 40 / David Rosenthal

X11R1 on Sun
Techfury90 CC0
I apologize that this post is a little late. On 19th June the X Window System celebrated its 40th birthday. Wikipedia has a comprehensive history of the system including the e-mail Bob Scheifler sent announcing the first release:
From: rws@mit-bold (Robert W. Scheifler)
To: window@athena
Subject: window system X
Date: 19 June 1984 0907-EDT (Tuesday)

I've spent the last couple weeks writing a window
system for the VS100. I stole a fair amount of code
from W, surrounded it with an asynchronous rather
than a synchronous interface, and called it X. Overall
performance appears to be about twice that of W. The
code seems fairly solid at this point, although there are
still some deficiencies to be fixed up.

We at LCS have stopped using W, and are now
actively building applications on X. Anyone else using
W should seriously consider switching. This is not the
ultimate window system, but I believe it is a good
starting point for experimentation. Right at the moment
there is a CLU (and an Argus) interface to X; a C
interface is in the works. The three existing
applications are a text editor (TED), an Argus I/O
interface, and a primitive window manager. There is
no documentation yet; anyone crazy enough to
volunteer? I may get around to it eventually.

Anyone interested in seeing a demo can drop by
NE43-531, although you may want to call 3-1945
first. Anyone who wants the code can come by with a
tape. Anyone interested in hacking deficiencies, feel
free to get in touch.
Scheifler was right that it was a "good starting point for experimentation", but it wasn't really a usable window system until version 11 was released on 15th September 1987. I was part of the team that burned the midnight oil at MIT to get that release out, but my involvement started in late 1985.

Below the fold are some reflections on my contributions, some thoughts on the astonishing fact that the code is still widely deployed after 40 years, and some ideas on why it has been so hard to replace.

Involvement

I arrived at Sun in September 1985 to work with James Gosling on a window system for Sun's workstations. We had worked together developing the window system for Carnegie-Mellon's Andrew project. We both realized that, if Unix and Unix workstations like Sun's were to succeed, they had to have a widely adopted window system. Ideally, we thought, it would have three key attributes:
  • An imaging model at a higher level than simply a grid of pixels, so that applications didn't have to adapt to varying screen resolutions. The clear favorite was Adobe PostScript's, which had appeared on the Apple LaserWriter in January 1985.
  • Networking, so that applications could run wherever it made sense with their user interface where the user was. In the mid-80s both workstations and networks were slow, running a window system was a big load on a Motorola 68000. Having two CPUs working on an application, the client-server model, helped.
  • Programmability, because it was hard to provide a good user experience if there was a slow network round-trip required for each response to a user action. If the window system were programmable, the code that responded to the user's actions could run on the user's workstation, eliminating the network round-trip.
Gosling had left C-MU some months earlier. He was adamant that it was possible to implement PostScript on a 68000 fast enough to be a window system. Adding networking, mouse and keyboard extensions to PostScript would satisfy all three of our requirements. Because Adobe's PostScript in the 68000-based LaserWriter was notoriously slow I was skeptical, but I should have known better than to doubt his coding skills. A couple of months later when Gosling showed me a prototype PostScript interpreter running incredibly fast on a Sun/1, I was convinced to join him on what became the NeWS project.

Once at Sun I realized that it was more important for the company that the Unix world standardized on a single window system than that the standard be Sun's NeWS system. At C-MU I had already looked into X as an alternative to the Andrew window system, so I knew it was the obvious alternative to NeWS. Although most of my time was spent developing NeWS, I rapidly ported X version 10 to the Sun/1, likely the second port to non-DEC hardware. It worked, but I had to kludge several areas that depended on DEC-specific hardware. The worst was the completely DEC-specific keyboard support.

Because it was clear that a major redesign of X was needed to make it portable and in particular to make it work well on Sun hardware, Gosling and I worked with the teams at DEC SRC and WRL on the design of X version 11. Gosling provided significant input on the imaging model, and I designed the keyboard support. As the implementation evolved I maintained the Sun port and did a lot of testing and bug fixing. All of which led to my trip to Boston to pull all-nighters at MIT finalizing the release.

My involvement continued after the first release. I was the editor and main author of the X Inter-Client Communications Conventions Manual (ICCCM) that forms Part III of Robert Scheifler and Jim Gettys' X Window System. A user's X environment consists of a set of client applications, an X server managing displays and input devices, and a window manager application implementing a user interface that allows the user to allocate the server's resources such as windows to the clients. The window manager needs to communicate the resource allocations to the clients, and the clients need to communicate between themselves to support, for example, cut-and-paste. The ICCCM explains its scope thus:
Being a good citizen in the X Version 11 world involves adhering to conventions that govern inter-client communications in the following ares:
  • Selection mechanism
  • Cut buffers
  • Window manager
  • Session manager
  • Manipulation of shared resources
  • Device color characterization
This part of the book proposes suitable conventions without attempting to enforce any particular user interface.
xkcd 2730
There is a certain justice in The UNIX-HATERS Handbook's description of my efforts:
one of the most amazing pieces of literature to come out of the X Consortium is the “Inter Client Communication Conventions Manual,” ... It describes protocols that X clients must use to communicate with each other via the X server, including diverse topics like window management, selections, keyboard and colormap focus, and session management. In short, it tries to cover everything the X designers forgot and tries to fix everything they got wrong. But it was too late—by the time ICCCM was published, people were already writing window managers and toolkits, so each new version of the ICCCM was forced to bend over backwards to be backward compatible with the mistakes of the past.
Jim Morris, who ran the Andrew project at C-MU, was very wise about what we were getting into: "We're either going to have a disaster, or a success disaster". He often added: "And I know which I prefer!". Jim was exaggerating to make two points:
  • Being a success didn't mean the problems were over, it meant a different set of problems.
  • Ignoring the problems of success was a good way of preventing it.
X11 was definitely a "success disaster".

Why X not NeWS?

NeWS Xardox
Public Domain
NeWS was amazing technology, providing each of our requirements with usable performance on a 68000. Its interpreter was a clean-room implementation of PostScript from the "BlueRed Book", with extensions that provided cooperative multi-threading, networking, arbitrary-shaped translucent windows, and a fully object-oriented toolkit. Steve Jobs agreed with us that PostScript was the right imaging model, although the Display PostScript interpreter he licensed from Adobe for the NextSTEP window system was far more restricted. The overwhelming success of JavaScript has validated our argument for programmability. So why did X11 take over the world? My view is that there are three main arguments, any one of which would have been decisive:
  • We and Jobs were wrong about the imaging model, for at least two reasons. First, early on pixels were in short supply and applications needed to make the best use of the few they were assigned. They didn't want to delegate control to the PostScript interpreter. Second, later on came GPUs with 3D imaging models. The idea of a one-size-fits-all model became obsolete. The reason that Wayland should replace X11 is that it is agnostic to the application's choice of imaging model.
  • Although we were right about the need for programmability, PostScript was the wrong language. Even back in 1985 Gosling and I were veteran programmers, with experience in many languages including not just C and Assembler, but also LISP, Prolog and Smalltalk. We didn't find PostScript's stack-based language intimidating. But already by that time Darwinian evolution had focused CS education on C-like languages, so the majority of application programmers were intimidated. And NeWS wasn't just stack-oriented, it was also object-oriented and multi-threaded, and none of these were optional, making it even more intimidating. In a sense NeWS was an echo of the success of Sun in the early days — a bunch of really good engineers building the system they wanted to use, not one for the mass market.
  • A major reason for Sun's early success was that they in effect open-sourced the Network File System. X11 was open source under the MIT license. I, and some of the other Sun engineers, understood that NeWS could not displace X11 as the Unix standard window system without being equally open source. But Sun's management looked at NeWS and saw superior technology, an extension of the PostScript that Adobe was selling, and couldn't bring themselves to give it away. But they also couldn't ignore the fact that X11 was popular with their customer base, who could run the MIT X11 environment any time they wanted. The result was the ghastly kludge called the Xnews server, a monument to the inability of Sun to follow McNealy's "all the wood behind one arrow" dictum. Wikipedia correctly notes that:
    This seriously degraded the NeWS interpreter performance and was not considered a very good X11 server either.

Longevity

The anniversary bought forth several commentaries on its extraordinary longevity, including:
  • Richard Speed's The X Window System is still hanging on at 40 is aptly subtitled "Never underestimate the stickiness of legacy technology" and concludes:
    In 2022, we wondered if "Wayland has what it takes to replace X." Two years later, the question is still open, although the direction of travel is clear. Yet the stickiness of "it just works" is not to be underestimated, and we would not be surprised if the 50th anniversary rolls around and there is still someone clinging to X11 for that one old app that won't run properly on anything else.
    This is very likely. In order to displace X11, Wayland had to be compatible with it via the XWayland server. So there is little motivation to rewrite existing applications to use Wayland directly.
  • Kevin Purdy's 40 years later, X Window System is far more relevant than anyone could guess starts:
    Often times, when I am researching something about computers or coding that has been around a very long while, I will come across a document on a university website that tells me more about that thing than any Wikipedia page or archive ever could.

    It's usually a PDF, though sometimes a plaintext file, on a .edu subdirectory that starts with a username preceded by a tilde (~) character. This is typically a document that a professor, faced with the same questions semester after semester, has put together to save the most time possible and get back to their work. I recently found such a document inside Princeton University's astrophysics department: "An Introduction to the X Window System," written by Robert Lupton.

    X Window System, which turned 40 years old earlier this week, was something you had to know how to use to work with space-facing instruments back in the early 1980s, when VT100s, VAX-11/750s, and Sun Microsystems boxes would share space at college computer labs. As the member of the AstroPhysical Sciences Department at Princeton who knew the most about computers back then, it fell to Lupton to fix things and take questions.
    I really like Lupton's assessment:
    "It worked, at least relative to the other options we had," Lupton said. He noted that Princeton's systems were not "heavily networked in those days," such that the network traffic issues some had with X weren't an issue then. "People weren't expecting a lot of GUIs, either; they were expecting command lines, maybe a few buttons... it was the most portable version of a window system, running on both a VAX and the Suns at the time... it wasn't bad."
    Purdy quotes Evan Jenkins from thirteen years ago:
    X is the oldest lady at the dance, and she insists on dancing with everyone. X has millions of lines of source, but most of it was written long ago, when there were no GPUs, and no specialized transistors to do programmable shading or rotation and translation of vertexes. The hardware had no notion of oversampling and interpolation to reduce aliasing, nor was it capable of producing extremely precise color spaces. The time has come for the old lady to take a chair.
There hasn't been an official release of X11 for twelve years, but it is proving very hard to kill off.

DLF Digest: July 2024 / Digital Library Federation

DLF Digest logo: DLF logo at top center "Digest" is centered. Beneath is "Monthly" aligned left and "Updates" aligned right.

A monthly round-up of news, upcoming
working group meetings and events, and CLIR program updates from the Digital Library Federation. See all past Digests here.
Hello, DLF Community! It’s hard to believe that the in-person DLF Forum will take place at Michigan State University at the end of this month. We are so excited to head to East Lansing! If you are interested in attending in person, we have a handful of spots left – registration will remain open until we sell out or until July 15. We’re also looking ahead to our virtual event in the fall, for which we’ll release the program and open registration later this month. And in the meantime, though summer is a bit quieter for many of our Working Groups, folks are still busy and meeting, so be sure to check out the Community Calendar to join them. We look forward to seeing you around soon!

– Team DLF

This month’s news:

This month’s DLF group events:

Technology Strategy for Archives Q3 Working Group Call with Natalie Milbrodt

Monday, July 22, 2024, 2:00pm ET / 11:00am PT; Register: https://clirdlf.zoom.us/meeting/register/tZEqfu6pqjkqGdVPxm-CMyB1f7YVprd8xroH#/registration

Join the Technology Strategy for Archives Working Group for a session about approaches to engaging stakeholders in tech strategy collaborations. Featuring a talk by Natalie Milbrodt, University Archivist, City University of New York, about the Cultivating Archives and Institutional Memory Project.

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.

  • Born-Digital Access Working Group (BDAWG): Tuesday, 7/2, 2pm ET / 11am PT.
  • Digital Accessibility Working Group (DAWG): Wednesday, 7/3, 2pm ET / 11am PT.
  • AIG Cultural Assessment Working Group: Monday, 7/8, 2pm ET / 11am PT.
  • AIG Cost Assessment Working Group: Monday, 7/8, 3pm ET / 12pm PT.
  • Assessment Interest Group (AIG) Metadata Working Group: Thursday, 7/18, 1:15pm ET / 10:15am PT. 
  • AIG User Experience Working Group: Friday, 7/19, 11am ET / 8am PT.
  • Technology Strategy for Archives Q3 meeting with Natalie Milbrodt: Monday, 7/22, 2pm ET / 11am PT
  • Committee for Equity and Inclusion: Monday, 7/22, 3pm ET / 12pm PT.
  • Digital Accessibility Working Group: Policy and Workflows subgroup: Friday, 7/26, 1pm ET / 10am PT. 
  • Digital Accessibility Working Group: IT subgroup (DAWG-IT): Monday, 7/29, 1:15pm ET / 10:15am PT. (Rescheduled to August 5, 1pm ET / 10am PT)
  • Climate Justice Working Group: Wednesday, 7/31, 12pm ET / 9am PT.  

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member organization. Learn more about our working groups on our website. Interested in scheduling an upcoming working group call or reviving a past group? Check out the DLF Organizer’s Toolkit. As always, feel free to get in touch at info@diglib.org

Get Involved / Connect with Us

Below are some ways to stay connected with us and the digital library community: 

The post DLF Digest: July 2024 appeared first on DLF.

Autonomous Vehicles: Trough of Disillusionment / David Rosenthal

Jeremykemp CC BY-SA 3.0, Link
This is the famous Gartner hype cycle and it certianly appears that autonomous vehicles are currently in the Trough of Disillusionment. Whether they will eventually soar up to the Plateau of Productivity is unknown, but for now it is clear even to practitioners that the hype bubble they have been riding for years has burst.

Below the fold I try to catch up with the flood of reporting on the autonomous vehicle winter triggered by the bursting of the bubble.

Cruise

The most obvious symptom is the implosion of General Motors' Cruise robotaxi division. Brad Templeton worked on autonomy at Waymo and is an enthusiast for the technology. But in Robocar 2023 In Review: The Fall Of Cruise he acknowledges that things aren't going well:
Looking at 2023 in review for self-driving cars, one story stands out as the clear top story of the year, namely the fall of General Motors’ “Cruise” robotaxi division, which led to a pause of operations for Cruise over the whole USA, a ban on operations in California, the resignation of the founder/CEO and much more. That was actually the most prominent component of the other big story of the year, namely the battle between San Francisco and both Cruise and Waymo.

Even this serious mistake should not have and perhaps would not have driven Cruise to its fate. The bigger mistake was they did not want to talk about it. They were eager to show the car’s recorded video of events to the press, and to officials, but only over Zoom. They did not mention the dragging, even when asked about it. I did not ask to see the part of the video after the crash, but others who did see it say it ended before the dragging. The DMV states they were not shown the dragging, and Cruise did not tell them about it, including in a letter they sent shortly after the events. Cruise insists they did show the full video to the DMV, and the DMV insists otherwise, but there is little doubt that they were not open about what was obviously the most important part of the chain of events when it came to understanding the role of the robotaxi in the calamity. Cruise was very eager to show that the initial crash was not their fault, but didn’t want to talk at all about their own serious mistake.
Laura Dobberstein reports on part of the aftermath in GM's Cruise sheds nine execs in the name of safety and integrity:
GM’s self-driving taxi outfit, Cruise, has dismissed nine execs – including its chief operating officer – after staff withheld information regarding an incident in which a woman was injured by one of the firm's robotaxis.

"Today, following an initial analysis of the October 2 incident and Cruise's response to it, nine individuals departed Cruise. These include key leaders from Legal, Government Affairs, and Commercial Operations, as well as Safety and Systems," a Cruise spokesperson told The Register.

"As a company, we are committed to full transparency and are focused on rebuilding trust and operating with the highest standards when it comes to safety, integrity, and accountability and believe that new leadership is necessary to achieve these goals," the spokesperson added.
It isn't just the executives who are feeling the pain, as Hayden Field and Michael Wayland report in GM's Cruise laying off 900 employees, or 24% of its workforce: Read the memo here:
Cruise on Thursday announced internally that it will lay off 900 employees, or 24% of its workforce, the company confirmed to CNBC.

The layoffs, which primarily affected commercial operations and related corporate functions, are the latest turmoil for the robotaxi startup and come one day after Cruise dismissed nine “key leaders” for the company’s response to an Oct. 2 accident in which a pedestrian was dragged 20 feet by a Cruise self-driving car after being struck by another vehicle.

The company had 3,800 employees before Thursday’s cuts, which also follow a round of contractor layoffs at Cruise last month.
Templeton's main argument is that Cruise mishandled the PR problem resulting from their vehicle dragging an injured pedestrian. They certainly did, in particular by apparently concealing information from the Dept. of Motor Vehicles. But the PR problem they faced is generic to autonomous vehicles. They are marketed as being safer than human drivers and, statistically, this may even be true. Waymo certainly makes a plausible claim. But although "safer" implies it does not mean "safe", so serious crashes will inevitably occur.

When they do the company's PR department is in a no-win situation. The reporters who call asking for the company's reaction know that the company has a vast amount of detailed information about what happened, logs and video. But the PR people don't, and even if they did they don't have the skills to interpret it. Given the nature of the AI driving the vehicle, it will take considerable time for the engineers to find the bug that was the root cause.

The honest thing for the PR people to say is "we don't have the details, we'll get back to you when we do". Reporters are likely to hear this as buying time for a cover-up.

The wrong thing for them to do is to give the reporters what little they know, and spin it in ways that minimize the company's fault. Later, when the full details emerge, they will have been shown to have been covering up the worst. Next time, even if they are honest, they won't be believed.

The PR problem is even worse because it is fundamentally asymmetric. The marketing pitch is that, statistically, autonomous vehicles cause less accidents than humans. But the public will never know about accidents that would have happened were a human driving, but were averted by the AI driver. They will know about accidents that did happen because the AI driver did something that a human would not have, as with the Uber and Cruise incidents. When set against the expectation of airline-level safety, this is an insuperable problem.

In Driverless cars were the future but now the truth is out: they’re on the road to nowhere, Christian Wolmar who has a book about the autonomous winter entitled Driverless Cars: On a Road to Nowhere? points out that Cruise isn't the first robotaxi company felled by an accident. That would be Uber:
Right from the start, the hype far outpaced the technological advances. In 2010, at the Shanghai Expo, General Motors had produced a video showing a driverless car taking a pregnant woman to hospital at breakneck speed and, as the commentary assured the viewers, safely. It was precisely the promise of greater safety, cutting the terrible worldwide annual roads death toll of 1.25m, that the sponsors of driverless vehicles dangled in front of the public.

And that is now proving their undoing. First to go was Uber after an accident in which one of its self-driving cars killed Elaine Herzberg in Phoenix, Arizona. The car was in autonomous mode, and its “operator” was accused of watching a TV show, meaning they did not notice when the car hit Herzberg, who had confused its computers by stepping on to the highway pushing a bike carrying bags on its handlebars. Fatally, the computer could not interpret this confusing array of objects.

Until then, Uber’s business model had been predicated on the idea that within a few years it would dispense with drivers and provide a fleet of robotaxis. That plan died with Herzberg, and Uber soon pulled out of all its driverless taxi trials.
In her review of Wolmar's book Yves Smith makes two good points:
Wolmar describes how this Brave New World has stalled out. The big reason is that the world is too complicated. or to put it in Taleb-like terms, there are way too many tail events to get them into training sets for AI in cars to learn about them. The other issue, which Wolmar does not make explicit, is that the public does not appear willing to accept the sort of slip-shod tech standards of buggy consumer software. The airline industry, which is very heavy regulated, has an impeccable safety record, and citizens appear to expect something closer to that…particularly citizens who don’t own or have investments in self-driving cars and never consented to their risks.
It isn't just Cruise that is figuring out that Robotaxi Economics don't work. Rita Liao's Facing roadblocks, China’s robotaxi darlings apply the brakes reports that they don't work in China either:
Despite years of hype and progress in self-driving technologies, the widespread availability of robotaxis remains a distant reality. That’s due to a confluence of challenges, including safety, regulations and costs.

The last factor, in particular, is what has pushed China’s robotaxi pioneers toward more opportunistic endeavors. To become profitable, robotaxis need to eventually remove human operators. Though China recently clarified rules around the need for human supervision, taxis without a driver behind the wheel are allowed only in restricted areas at present. To attract customers, robotaxi services offer deep discounts on their paid rides.

Once the subsidies are gone and initial user curiosity wanes, who’s willing to pay the same amount as taxi fares for a few fixed routes?

Struggling to address that question, China’s robotaxi startups have woken up to the money-burning reality of their business.
So they are pivoting to a viable product:
One logical path to monetize self-driving technology is to sell a less robust version of the technology, namely, advanced driver assistance systems (ADAS) that still require human intervention.

Deeproute, which is backed by Alibaba, significantly scaled back its robotaxi operations this year and plunged right into supplying ADAS to automakers. Its production-ready solution, which includes its smart driving software and lidar-powered hardware, is sold competitively at $2,000. Similarly, Baidu is “downgrading the tech stacks” to find paying customers on its way up what it calls the “Mount Everest of self-driving.”

“The experience and insight gleaned from deploying our solutions in [mass-produced] vehicles is being fed into our self-driving technology, giving us a unique moat around security and data,” a Baidu spokesperson said.
Not a good year for the robotaxi concept, the thing that was supposed to distinguish Tesla's cars from everyone else's because they would earn their owners money while they slept.

Tesla

As usual, when it comes to self-driving Tesla's story is worse than almost everyone else's. Elon Musk famously claimed that Tesla is worth zero without Full Self Driving. But although this is typical Musk BS, but unlike some other utterances it contains a kernel of truth. Tesla is valued as a technology company not a car company. Thus it is critical for Telsa that its technology be viewed as better than those of other car companies; anything that suggests it is limited or inadequate is a big problem not just for the company but also for Musk's personal wealth.

I believe this is why Tesla hasn't implemented effective driver monitoring nor geo-fenced their systems. And why members of the Musk cult are so keen on defeating the driver monitoring system with weights on the steering wheel and smiley-face stickers on the camera. Depending upon what you believe, the technology is either groundbreaking autonomy or a below average Level-2 driver assistance system (Mercedes has fielded a Level 3 system). The congnitive dissonance between the pronouncements of the cult leader and the reality of continually being "nagged" about the limits of the technology is too hard for the cult members to take.

The Washington Post team of Trisha Thadani, Rachel Lerman, Imogen Piper, Faiz Siddiqui and Irfan Uraizee has been delivering outstanding reporting on Tesla's technology. They started on Octoober 6th with The final 11 seconds of a fatal Tesla Autopilot crash, in which a Tesla driver enabled Autopilot in conditions for which it was not designed, and set the speed to 14mph above the limit. Then on December 10th they followed with Tesla drivers run Autopilot where it’s not intended — with deadly consequences and Why Tesla Autopilot shouldn’t be used in as many places as you think.

They recount a 2019 crash in Florida:
A Tesla driving on Autopilot crashed through a T intersection at about 70 mph and flung the young couple into the air, killing Benavides Leon and gravely injuring Angulo. In police body-camera footage obtained by The Washington Post, the shaken driver says he was “driving on cruise” and took his eyes off the road when he dropped his phone.

But the 2019 crash reveals a problem deeper than driver inattention. It occurred on a rural road where Tesla’s Autopilot technology was not designed to be used. Dash-cam footage captured by the Tesla and obtained exclusively by The Post shows the car blowing through a stop sign, a blinking light and five yellow signs warning that the road ends and drivers must turn left or right.
Note that, just like the repeated crashes into emergency vehicles, the victims did not volunteer to debug Tesla's software. And also that the Autopilot system was driving 15mph above the speed limit on a road it wasn't designed for, just like the 2019 Banner crash into a semi-trailer. As I wrote about that crash:
It is typical of Tesla's disdain for the law that, although their cars have GPS and can therefore know the speed limit, they didn't bother to program Autopilot to obey the law.
...
Again, Tesla's disdain for the safety of their customers, not to mention other road users, meant that despite the car knowing which road it was on and thus whether it was a road that Autopilot should not be activated on, it allowed Banner to enable it.
Federal regulators have known there was a problem for more than seven years, but they haven't taken effective action:
Nor have federal regulators taken action. After the 2016 crash, which killed Tesla driver Joshua Brown, the National Transportation Safety Board (NTSB) called for limits on where driver-assistance technology could be activated. But as a purely investigative agency, the NTSB has no regulatory power over Tesla. Its peer agency, the National Highway Traffic Safety Administration (NHTSA), which is part of the Department of Transportation, has the authority to establish enforceable auto safety standards — but its failure to act has given rise to an unusual and increasingly tense rift between the two agencies.
The reason may be that car manufacturers "self-certify" conformance with safety standards:
The string of Autopilot crashes reveals the consequences of allowing a rapidly evolving technology to operate on the nation’s roadways without significant government oversight, experts say. While NHTSA has several ongoing investigations into the company and specific crashes, critics argue the agency’s approach is too reactive and has allowed a flawed technology to put Tesla drivers — and those around them — at risk.

The approach contrasts with federal regulation of planes and railroads, where crashes involving new technology or equipment — such as recurring issues with Boeing’s 737 Max — have resulted in sweeping action by agencies or Congress to ground planes or mandate new safety systems. Unlike planes, which are certified for airworthiness through a process called “type certification,” passenger car models are not prescreened, but are subject to a set of regulations called Federal Motor Vehicle Safety Standards, which manufacturers face the burden to meet.
And Tesla's self-certification is self-serving.

Self-certification would work well if the penalty for false certification was severe, but the NTHSA has declined to impose any penalty for Tesla's manifestly inadequate system. It seems that the team's reporting finally drove the NHTSA to do something about the long-standing problems of Autopilot. Reuters reported that Tesla recalls more than 2m vehicles in US over Autopilot system:
Tesla is recalling just over 2m vehicles in the United States fitted with its Autopilot advanced driver-assistance system to install new safeguards, after a safety regulator said the system was open to “foreseeable misuse”.

The National Highway Traffic Safety Administration (NHTSA) has been investigating the electric automaker led by the billionaire Elon Musk for more than two years over whether Tesla vehicles adequately ensure that drivers pay attention when using the driver assistance system.

Tesla said in the recall filing that Autopilot’s software system controls “may not be sufficient to prevent driver misuse” and could increase the risk of a crash.
...
Separately, since 2016, NHTSA has opened more than three dozen Tesla special crash investigations in cases where driver systems such as Autopilot were suspected of being used, with 23 crash deaths reported to date.

NHTSA said there might be an increased risk of a crash in situations when the system is engaged but the driver does not maintain responsibility for vehicle operation and is unprepared to intervene or fails to recognize when it is canceled or not.
The obfuscation is extraordinary — "forseeable misuse", "may not be sufficient" and "could increase". There is no "foreseeable", "may" nor "could"; multiple people have already died because the system was abused. When Tesla is sued about these deaths, their defense is that the system was abused! I believe the system is specifically designed to allow abuse, because preventing abuse would puncture the hype bubble.

Fortunately from the NTHSA's point of view the recall is pure kabuki, posing no risk from Musk's attack-dog lawyers and cult members because the over-the-air update is cheap and doesn't actually fix the problem. The Washington Post's headline writers didn't understand this whan they captioned the team's timeline How Tesla Autopilot got grounded:
Now, more than 2 million Tesla vehicles are receiving a software update to address “insufficient” controls to combat driver inattention while in Autopilot mode. Here’s how the recall unfolded, according to documents from Tesla, safety officials and reporting by The Washington Post.
The team did understand that Autopilot hadn't been "grounded". In Recalling almost every Tesla in America won’t fix safety issues, experts say they lay it out:
Tesla this week agreed to issue a remote update to 2 million cars aimed at improving driver attention while Autopilot is engaged, especially on surface roads with cross traffic and other hazards the driver-assistance technology is not designed to detect.

But the recall — the largest in Tesla’s 20-year history — quickly drew condemnation from experts and lawmakers, who said new warnings and alerts are unlikely to solve Autopilot’s fundamental flaw: that Tesla fails to limit where drivers can turn it on in the first place.
...
Tesla has repeatedly acknowledged in user manuals, legal documents and communications with federal regulators that Autosteer is “intended for use on controlled-access highways” with “a center divider, clear lane markings, and no cross traffic.”
The Washington Post's Geoffrey A. Fowler checked on the result of the kabuki, and wrote Testing Tesla’s Autopilot recall, I don’t feel much safer — and neither should you:
Last weekend, my Tesla Model Y received an over-the-air update to make its driver-assistance software safer. In my first test drive of the updated Tesla, it blew through two stop signs without even slowing down.

In December, Tesla issued its largest-ever recall, affecting almost all of its 2 million cars. It is like the software updates you get on your phone, except this was supposed to prevent drivers from misusing Tesla’s Autopilot software.

After testing my Tesla update, I don’t feel much safer — and neither should you, knowing that this technology is on the same roads you use.

During my drive, the updated Tesla steered itself on urban San Francisco streets Autopilot wasn’t designed for. (I was careful to let the tech do its thing only when my hands were hovering by the wheel and I was paying attention.) The recall was supposed to force drivers to pay more attention while using Autopilot by sensing hands on the steering wheel and checking for eyes on the road. Yet my car drove through the city with my hands off the wheel for stretches of a minute or more. I could even activate Autopilot after I placed a sticker over the car’s interior camera used to track my attention.
Fowler concludes "I found we have every reason to be skeptical this recall does much of anything". Good job, Tesla!

As a final note Rishi Sunak, the UK's tech-bro Prime Minister is naturally determined to make the UK a leader in autonomous vehicles with his new legislation on the subject. But, being a tech-bro, he has no idea of the fundamental problem they pose. Wolmar does understand it, writing:
In the UK, Tesla will fall foul of the legislation introduced into parliament last month, which prevents companies from misleading the public about the capability of their vehicles. Tesla’s troubles have been compounded by the revelations from ex-employee Lukasz Krupski who claims the self-drive capabilities of Teslas pose a risk to the public. Manufacturers will be forced to specify precisely which functions of the car – steering, brakes, acceleration – have been automated. Tesla will have to change its marketing approach in order to comply. So, while the bill has been promoted as enabling the more rapid introduction of driverless cars, meeting its restrictive terms may prove to be an insuperable obstacle for their developers.
Tesla's stock market valuation depends upon "misleading the public about the capability of their vehicles".

Update 12th February 2024

Source
Reinforcement for the last point above comes from Esha Dey's Tesla’s Slide Has Investors Wondering If It’s Still Magnificent and in particular from this chart comparing the history of the price-earnings ratio of the "Magnificent Seven" stocks, Alphabet, Amazon, Apple, Meta, Microsoft,Nvidia, Tesla. Dey writes:
After doubling last year, Tesla’s stock price is down 22% to start 2024. Compare that to Nvidia Corp.’s 46% surge or Meta Platforms Inc.’s 32% gain since the beginning of the year and it’s easy to see where the questions are coming from. Indeed, it’s by far the worst performer in the Magnificent Seven Index this year.

The problem for the EV maker is six of those seven companies are benefiting from the enthusiasm surrounding burgeoning artificial intelligence technology. The group hit a record 29.5% weighting in the S&P 500 last week even with Tesla’s decline, according to data compiled by Bloomberg. But despite Musk’s efforts to position his company as an AI investment, the reality is Tesla faces a unique set of challenges.

“Although Elon Musk would probably disagree, investors don’t see Tesla as an AI play like most of the other Magnificent Seven stocks,” said Matthew Maley, chief market strategist at Miller Tabak + Co. “We have a much different backdrop for Tesla and the others in the Mag Seven — the demand trend for Tesla products is fading, while it’s exploding higher for those companies that are more associated with AI.”
The problem for Tesla the car company is that increasing investment in launching the Cybertruck and a lower-cost sedan is meeting slowing demand for EVs in general:
“During the year, others in the Mag Seven were able to show how AI was driving real, profitable business growth,” Brian Johnson, former auto analyst with Barclays and founder of Metonic Advisors, said in an interview. “Tesla investors just got some random Optimus videos, Musk’s admission Dojo was a moon shot and yet another full-self-driving release that may be an improvement but still a long ways from robotaxi capability.”
Even if it actually were an AI company not a car manufacturer, Tesla's PE is out of line with other, real AI companies. Hence the need for Musk's relentless hype about autonomy and, not incidentally his demands that Tesla double his stake by diluting the sharholders and re-instate the $55B pay package Delaware court invalidated. He needs these decisions made while Tesla's PE is nearly double that of the next highest AI company, not once it is valued like a car company.

Competition-proofing / David Rosenthal

Source
Apart from getting started in the midst of one of Silicon Valley's regular downturns, another great thing about the beginnings of Nvidia was that instead of insisting on the "minimum viable product" our VCs, Sutter Hill and Sequoia, gave us the time to develop a real architecture for a family of chips. It enabled us to get an amazing amount of functionality into a half-micron gate array; I/O virtualization, a DMA engine, a graphics processor that rendered curved surfaces directly, not by approximating them with triangles, a sound engine and support for game controllers. As I write, after a three decade-long history of bringing innovations to the market, Nvidia is America's third most valuable company.

I've written several times about how in pursuit of a quicker buck, VCs have largely discarded the slow process of building an IPO-ready company like Nvidia in favor of building one that will be acquired by one of the dominant monopolists. These VCs don't support innovation. Even if their acquisition-bound companies do innovate in their short lives, their innovations are rarely tested in the market after the acuisition.

Below the fold I discuss a new paper that presents a highly detailed look at the mechanisms the dominant companies use to neutralize the threats startups could pose to their dominance.

In Coopting Disruption law professors Mark Lemley (Stanford) and Matthew Wansley (Cardozo) ask a good question:
Our economy is dominated by five aging tech giants—Alphabet, Amazon, Apple, Meta, and Microsoft. Each of these firms was founded more than twenty years ago: Apple and Microsoft in the 1970s, Google and Amazon in the 1990s, and Facebook in 2004. Each of them grew by successfully commercializing a disruptive technology—personal computers (Apple), operating systems (Microsoft), online shopping (Amazon), search engines (Google), and social networks (Facebook). Each of them displaced the incumbents that came before them. But in the last twenty years, no company has commercialized a new technology in a way that threatens the tech giants. Why?
The TL;DR of Lemley and Wansley's answer to their question is:
While there are many reasons for the tech giants’ continued dominance, we think an important and overlooked one is that they have learned how to coopt disruption. They identify potentially disruptive technologies, use their money to influence the startups developing them, strategically dole out access to the resources the startups need to grow, and seek regulation that will make it harder for the startups to compete. When a threat emerges, they buy it off. And after they acquire a startup, they redirect its people and assets to their own innovation needs.
They observe that:
a company that is started with the goal of being swallowed by a tech giant probably isn’t contributing much to society.

Introduction

They start by identifying the advantages and disadvantages the incumbents possess in their efforts to monetize innovations. Their list of advantages is:
  • "large incumbents can take advantage of economies of scale" not just in manufacturing, but also in marketing an distribution by exploiting their existing customer relationships.
  • "Large incumbents can also take advantage of economies of scope. Innovation creates “involuntary spillovers”—new knowledge that has economic value beyond the specific product that the firm was developing."
  • "large incumbents can access capital at a lower cost" for example from retained earnings from their cash cows.
  • "large incumbents may have another potential advantage—a longer investment time horizon" even more so now with the compression of VC time horizons.
Their list of incumbents disadvantages in innovation is more interesting:
  1. "their success will cannibalize their own market share" or "More generally, a monopolist has diminished incentives to introduce new products, improve product quality, or lower prices because any new sales generated replace its existing sales." Economists call this "Arrow's replacement effect"; more specifically: "The general lesson is, all else equal, the larger a firm’s market share and the less it is threatened by competition, the weaker its incentives to innovate. So we should expect large incumbents to not innovate much. And if they can dispense with the competitors rather than have to compete with them, they will do that."
  2. "their managers prefer to deliver incremental innovations to their existing customers". Unlike Arrow's theory, "Christensen’s theory of disruptive innovation, ... focuses on the career incentives of middle managers ... Incumbent managers have an incentive to deliver sustaining innovations—incremental improvements in quality to the firm’s existing products that will please its existing customers. But they have substantial disincentives to pursue projects that upset the apple cart, even if doing so would bring new customers to the firm" The fundamental problem is that "Housing an innovation project inside a firm with diverse lines of business creates conflict with those other businesses. Some firm assets—cash, cloud computing, equipment, facilities, and engineers’ time—are rivalrous and finite, so executives must be willing to fight internal constituencies to devote those resources to innovation." Ingenuity, NASA's wildly successful Mars helicopter is a good example, as Eric Berger reports in Before Ingenuity ever landed on Mars, scientists almost managed to kill it. It was competing for cost, weight and risk with Perseverance's primary mission.
  3. "their single veto point decision-making structure encourages risk-aversion" More specifically: "Inside a large incumbent, decisions about whether to fund an innovative project must pass through one veto point. In the venture capital market, many competing investors independently decide whether to finance an innovative idea. Inside a firm, an employee with an innovative idea must pitch an idea to managers who ultimately report to one executive gate-keeper. In the venture capital market, if a would-be startup founder pitches an idea to ten VC firms, and nine of them are not persuaded, the idea gets funded." The advantage of market-based finance over internal finance applies not just to the initiation but also the continuation of an innovation project. Inside a firm, an executive who has soured on a project can terminate it. In the venture capital market, when a startup’s initial investors grow skeptical, the company can still pitch outsiders on infusing more cash." The authors make this important point (my emphasis): "And while economists often describe markets as efficient, there is no reason to believe individual corporate executives make efficient (or even rational) decisions. Just ask Twitter. Markets work not because private executives make good decisions but because the ones who make bad decisions get driven out. But that dynamic only works with competition."
  4. "they cannot appropriately compensate employees working on innovation projects." The reason they cannot is that: "Startups solve this problem by giving employees stock options. Every employee with significant equity knows that if the startup successfully exits, they will be rewarded. Stock in a large, diversified public company does not create similar incentives. The incentives are diluted because the value of the stock will be affected by too many variables unrelated to the success of the specific innovation project." And that: "large firms do not recognize internal “property rights” to innovations that employees develop. If they did, employees might become reluctant to share information. But not protecting internal property rights gives innovative employees incentive to leave. If employees at a large firm found their own startup and raise venture capital to fund it, they will earn a much greater share of the profits of the innovation."
The authors go on to describe five techniques incumbents use to neutralize the threat of disruption that innovative startups might pose; network effects, self-preferencing, paying for defaults, cloning, and coopting the disruptor. They claim other have described the first four, but they don't amount to an adequate explanation for why the tech giants haven't been disrupted. I will summarize each of the four in turn..

Network effects

Nearly three decades ago W. Brian Arthur, in Increasing Returns and Path Dependence in the Economy explained how increasing returns to scale, or network effects, of technology markets typically led to them being dominated by one player. Consider a new market opened up by a technologcal development. Several startups enter, for random reasons one gets bigger then the others, network effects amplify its advantage in a feedback loop.

This effect is more important now, as the the giants' business models have evolved to become platforms:
The tech giants’ core businesses are built on platforms. A platform is an intermediary in a two-sided market. It connects users on one side of the market with users on the other side for transactions or interactions.
...
Platforms tend to exhibit network effects—the addition of a new user increases the value of a platform to existing users and attracts new users.
This is precisely the mechanism Brian Arthur described, but applied to a business model that has since been enabled by the Internet.

Self-preferencing

Self-preferencing happens when a platform isn't just a two-sided market, but one in which the platform itself is a vendor:
Amazon, for example, both invites third party vendors to sell their products in its online marketplace and sells its own house brands that compete with those vendors. Amazon has a powerful advantage in that competition. It has access to data on all of its competitors—who their customers are, which products are selling well, and which prices work best. And it controls which ads consumers see when they search for a specific product. Assuming Amazon uses that information to prefer its own products to those of its competitors (either by pricing strategically or by promoting its own products in search results) – something alleged but not yet proven in a pending antitrust case -- the result is to bias competition. Vendors cannot realistically protest Amazon’s self-preferencing (or just go elsewhere) because Amazon has such a dominant share in the online retail market.

Paying for defaults

The value of the default position is notorious because:
Alphabet pays Apple a reported $18 billion (with a b) each year for Google to be the default search engine on iOS devices. Android and iOS together account for 99% of the U.S. mobile operating system market. Consequently, almost everyone who uses a smartphone in America is accustomed to Google search. Alphabet claims that “competition is just a click away.” But research and experience have shown that defaults can be somewhat sticky. So controlling the default position can give Alphabet (or whoever wins the Apple bid) an advantage. That said, someone has to be the default, and it might be better for consumers if the default is the search engine most users already prefer. The real problem might be the idea of paying for placement, whoever wins the bidding war.

Cloning

There are many examples of a tech giant tryng to neutralize the threat from a startup by using the threat of cloning their product to force the startup to sell itself, or of actually cloning the product and using their market power to swamp the startup. Meta's addition of Reels to Instagram in response to Tik Tok is an obvious example. But the authors make two good points: First:
Cloning is only objectionable if the tech giant wins out not by competition on the merits, but by exclusionary conduct.
Second, that cloning often fails:
Google+, Google’s effort to build a social media service that combined the best of Facebook and Twitter was an abject failure. Apple’s effort to control the music world’s move to streaming by offering its own alternative to Spotify hasn’t prevented Spotify from dominating music streaming and eclipsing the once-vibrant (and Apple-dominated) market for music downloads. Meta’s effort to copy Snap, then TikTok, by introducing Stories and Reels has not proven terribly successful, and certainly has not prevented those companies from building their markets.
The fact that the giants can clone a startup's product leads the authors to ask:
If the product is cloneable, then why would you buy the company and burn cash paying off its VCs?
There are two possible answers. It may be faster and easier, though likely not cheaper, to "acquihire" the startup's talent than to recruit equivalent talent in the open market. Or it may be faster and easier, though likely not cheaper, to acquire the company and its product rather than cloning it.

Inadequate Explanation

The authors use the example of Microsoft:
Microsoft enjoyed strong network effects in the 1990s as the dominant maker of operating system software – far more dominant than it is today. It cloned internet browser technology from upstarts like Netscape, and it engaged in anticompetitive conduct designed to ensure that it, not Netscape, became the browser of choice.82 But Microsoft’s victory over Netscape was short-lived. New startups – Mozilla and then Google – came out of nowhere and took the market away from it. Microsoft still benefits from network effects, and it still uses cloning and self-preferencing to send users to its Edge browser. But it doesn’t work. Microsoft employed all the tools of a dominant firm in a network market, but it still faced disruption.
So these four techniques aren't an explanation for the recent dearth of disruption.

Coopting disruption

The authors imagine themselves as a tech giant, asking what else they would do to prevent disruption, and coming up with four new techniques:
  • "First, you would learn as much as you can about which companies had the capability to develop disruptive innovations and try to steer them away from competing with you – perhaps by partnering with them, or perhaps by investing in them."
  • "Second, you would make sure that those companies could not access the critical resources they would need to transform their innovation into a disruptive product."
  • "Third, you would tell your government relations team to seek regulation that would build a competitive moat around your position and keep disruption out."
  • "Fourth, if one of the companies you were tracking nevertheless did start to develop a disruptive product, you would want extract that innovation—and choke off the potential competition—in an acquisition."
These are the techniques they call "coopting disruption", pointing out that the tech giants have:
  • "built a powerful reconnaissance network covering emerging competitive threats by investing in startups as corporate VCs and by cultivating relationships with financial VCs."
  • "accumulated massive quantities of data that are essential for many software and AI innovations, and they dole out access to this data and to their networks selectively."
  • "asked legislators to regulate the tech industry—in a way that will buttress incumbents."
  • "repeatedly bought potentially competitive startups in a way that has flown—until a few years ago—below the antitrust radar."
The authors detail many examples of each of these techniques, for example Facebook conditioning access to user data on the purchase of advertising, and Google's purchase of DoubleClick and YouTube. Interestingly, they contrast the recent purchasing of the tech giants with Cisco's famously successful purchases in the 90s:
The Cisco story exemplifies how the venture capital market, as a market, is better at exploring a series of risky ideas than a firm with a single risk-averse gatekeeper. It also illustrates how the advantages of a large incumbent—in this case access to markets and existing customer relationships—can sometimes extract more market value out of a technology than a new entrant.
The rapid evoluution of networking technology at the time meant that even Cisco, the largest company in the market, didn't have the R&D resources to explore all the opportunities. They depended upon VCs to fund the initial explorations, rewarding them by paying over the odds for the successes. Their market power then got the successes deployed much faster than a startup could.

Why Is Cooption Bad?

The authors explain the harms of cooption:
Our claim here is that the same dynamics that inhibit disruptive innovation by longstanding employees of large incumbents inhibit disruptive innovation by new employees from acquired startups.
...
The tech giants win from coopting disruption even though it destroys social value. In fact, they benefit in two ways. They make faster incremental progress on the sustaining innovations that they want. They get the new code, the valuable intellectual property, and the fresh ideas of the startup. And, critically, they also kill off a competitor. They no longer have to worry about the startup actually developing the more disruptive innovation and leapfrogging them or other tech giants acquiring the startup and using its assets to compete with them.
And, by making the innovators from the startup rich, the acquirer greatly reduces their incentives for future innovation. Andy Bechtolsheim is an outlier.

Remedies?

Lemley and Wansley, who seem to think in fours, make a set of four proposals for how these harms might be reduced:
  • Unlocking Directorates — under the Clayton Act "interlocking officers and directors between companies that compete, even in part, are illegal per se – without any inquiry into whether the companies in fact restrained competition because of their overlapping interests or whether the conduct offered procompetitive benefits." Companies with less than $4.1M in revenue are exampt, which excludes most startups; this should be revised.
  • Limiting Leveraging of Data and Networks — "we would impose on incumbent tech monopolists a presumptive duty of nondiscrimination in access where the defendant (1) provides or sells data or network access to at least some unaffiliated companies and (2) refuses to provide or sell the same data or network access to the plaintiff company on comparable terms, but (3) the plaintiff does not operate a competing network or otherwise compete with the defendant in the market from which it collected the relevant data."
  • Regulating Regulation — "Done right, regulation of technology can be beneficial and even necessary to the development of that technology, minimizing the risk of harm to third parties and ensuring that the world views the technology as safe and trustworthy. But all too often regulation has become a way to insulate incumbents from competition, with predictable results." The authors' suggestions exemplify this difficulty, being rather vague and aspirational.
  • Blocking Cooptive Acquisitions — this is the most complex of the four proposals, and builds on Nascent Competitors by C. Scott Hemphill & Tim Wu, who write:
    We favor an enforcement policy that prohibits anticompetitive conduct that is reasonably capable of contributing significantly to the maintenance of the incumbent s market power. That approach implies enforcement even where the competitive significance of the nascent competitor is uncertain.
    Justifying blocking mergers because of a nascent threat that might never materialize is problematic. But it is only more so than the current way anti-trust works, by projecting likely harm to consumer welfare, which also might never materialize (although it almost always does). Lemley and Wansley explain the dilemma:
    antitrust enforcers need a strategy for blocking cooptive acquisitions that works within existing case law (or plausible improvements to that law) and is surgical enough to avoid chilling investment.
    Some cases are obvious:
    For cooptive acquisitions like Facebook/Instagram deal, we think Hemphill and Wu’s strategy makes sense. Zuckerberg’s email arguing for acquiring startups like Instagram because “they could be very disruptive to us” is a smoking gun of anticompetitive intent.
    But Lemley and Wansley go further, arguing for blocking megers based the startup's ability to innovate distruptive technology:
    Of course, an approach to policing startup acquisitions based on innovation capabilities need limits. Many startups have some innovation capabilities that could have a significant effect on competition. We can cabin enforcement in three ways—by focusing on specific technologies and specific firms and by looking at the cumulative effects of multiple acquisitions.
    Their examples of technologies include generative AI and virtual and augmented reality, both cases where it is already too late. The companies they identify "Alphabet, Amazon, Apple, Microsoft, and Meta" are all veterans of multiple acquisitions in these areas. But they argue that committing to challenge fuure mergers:
    would create socially desirable incentives for startups. A startup developing one of the listed technologies would gain stronger incentives to turn its innovations into the products that its management team believed would garner the highest value on the open market—rather than the one most valuable to the tech giants. They would also gain stronger incentives to build a truly independent business and go public since an acquisition by the tech giants would be a less likely exit.
I think these would all be worthwhile steps, and I'm all in favor of updating anti-trust law and, even better, actually enforcing the laws on the books. But I am skeptical that the government can spot potentially disruptive technologies before the tech giants spot and acquire them. Especially since the government can't be embedded in the VC industry the way the tech giants are. Note that many of the harms Lemley and Wansley identify happen shortly after the acquisition. Would forcing Meta to divest Instagram at this late date restore the innovations the acquisition killed off?

The 50th Asilomar Microcomputer Workshop / David Rosenthal

Source
Last week I attended the 50th Asilomar Microcomputer Workshop. For a crew of volunteers to keep a small, invitation-only, off-the-record workshop going for half a century is an amazing achievement. A lot of the credit goes to the late John H. Wharton, who chaired it from 1985 to 2017 with one missing year. He was responsible for the current format, and the eclecticism of the program's topics.

Brian Berg has written a short history of the workshop for the IEEE entitled The Asilomar Microcomputer Workshop: Its Origin Story, and Beyond. It was started by "Three Freds and a Ted" and one of the Freds, Fred Coury has also written about it in here.Six years ago David Laws wrote The Asilomar Microcomputer Workshop and the Billion Dollar Toilet Seat for the Computer History Museum.

I have attended almost all of them since 1987. I have been part of the volunteer crew for many, including this one, and have served on the board of the 501C3 behind the workshop for some years.

This year's program featured a keynote from Yale Patt, and a session from four of his ex-students, Michael Shebanow, Wen-mei Hwu, Onur Mutlu and Wen-Ti Liu. Other talks came from Alvy Ray Smith based on his book A Biography of the Pixel, Mary Lou Jepsen on OpenWater, her attempt to cost-reduce diagnosis and treatment, and Brandon Holland and Jaden Cohen, two high-school students on applying AI to the Prisoner's Dilemma. I interviewed Chris Malachowsky about the history of NVIDIA. And, as always, the RATS (Rich Asilomar Tradition Session) in which almost everyone gives a 10-minute talk lasted past midnight.

The workshop is strictly off-the-record unless the speaker publishes it elsewhere, so I can't discuss the content of the talks.

WARC-GPT “on tour”: Talk transcript and slide decks / Harvard Library Innovation Lab

Over the past couple of months, my colleague Kristi Mukk and I had the opportunity to talk about WARC-GPT and the concept of Librarianship of AI to the greater GLAM community:

We are grateful for the interest the community showed for our work, and for all the great questions, suggestions and pieces of feedback we’ve received along the way. With this blog post, we would like to share a version of our slide deck and talk outline, as a way to keep the conversation going.

You can view WARC-GPT’s source code and case study at https://github.com/harvard-lil/warc-gpt.

This project is part of our ongoing series exploring how artificial intelligence changes our relationship to knowledge.


Video

Here’s an excerpt from the talk we gave at an event organized by the Web Archiving section of the Society of American Archivists on June 14th, 2024:


Slide Deck


Transcript

Matteo: My name is Matteo, I am a senior software engineer at the Harvard Library Innovation Lab. Today with my colleague Kristi Mukk we’re going to talk about WARC-GPT, an open-source tool for exploring web archives using AI.

The Library Innovation Lab is part of the Harvard Law School Library, and part of our mission is to bring library principles to technological frontiers.

Applied to web archiving, this “mission” of ours led us to create Perma.cc - which many of you already know - but also Perma Tools, a series of open-source web archiving tools.

It is also under that framework that we approach this “AI moment”; and by “AI moment” I mean the “AI ALL the things” phenomenon we’ve observed since the launch of ChatGPT in late 2022.

And this all may feel, at the same time OVERWHELMING:

  • First because there are so many models to choose from. On HuggingFace alone, which is a platform for sharing open source models, there are over half a million models available
  • But even if I just focus on text-generation models, the models that behave a little bit like ChatGPT, there are over 100 000 models to choose from. I don’t know which one is the best overall, or even for my specific use case.
  • Keeping up with progress in AI research is also a challenge, as it’s become the focus of so many research groups around the world.

At the same time, this all feels very much underwhelming:

  • Mainly because veracity is a major concern. We now use generative AI in mission critical scenarios, and AI “hallucinations”, which is a misnomer to describe the fact that sometimes AI models make up stuff, has become its own field of research.
  • But if veracity is a concern, so is accuracy, and impressive AI output doesn’t always stand the test of scrutiny, as you might be able to see in the example I generated and pasted here.

But overall, this moment feels disempowering:

  • We’re being told that AI will do everything and its opposite, that it’s going to take all the jobs and create all the jobs, create an economic boom and economic collapse … but rarely are we told that we have a role to play in shaping what AI can and will do for us.
  • We’re constantly exposed to ethereal imagery of superhuman intelligence, showing AI as a sort of unstoppable force. Most news articles I read about AI feature an illustration like one of the three I generated for this slide, and I don’t think this helps me feel empowered to experiment with AI.

What we’re trying to do is to take a step back and wonder: What can we do with AI now, and why does it matter for knowledge institutions? We think that, in part, this boils down to: If I can ask a question to ChatGPT instead of Google or my librarian, then LLMs are “a new way of knowing”.

AI models accidentally “know” things, they were trained on vast amounts of data, retained some of that information, and are able to restitute an even smaller subset of that.

They also show promising capabilities in summarization and sense-making: can that help improve access to knowledge and understand collections more deeply?

Moreover, as a lab, how can we lower the barrier of entry for experimentation with AI.

These questions led us to explore different “flavors” or RAG, which is the acronym for RETRIEVAL AUGMENTED GENERATION.

RAG is a series of techniques that allows for connecting a large language model to a knowledge base in order to augment its responses.

That knowledge base can be anything: a database, an API or even a single document. The way these two elements are connected together is through a prompt, which is a series of textual instructions that the model responds to.

In RAG, the focus is on that prompt, which is used to pass structured information to the model in order to elicit more grounded, precise or factual responses.

RAG can also be used to take advantage of the summarization and sense-making capabilities of LLMs, to make sense of long and messy documents, which is a feature we’re particularly interested in the context of WARC-GPT, which I am going to talk a little bit more about now.

So WARC-GPT is an open-source RAG pipeline for exploring web archives. The question we asked ourselves was whether RAG could be used to extract hard-to-find information out of web archive collections, but also if we could build something that could help the web archiving community engage with these questions.

So we’ve built and released WARC-GPT as an open source chatbot that you can download and run your machine. Its main feature is that it lets you ask questions to an LLM of your choosing against a given web archives collection.

There is a focus on interpretability, and the UI “cites its sources”, showing you what excerpts from the web archive collection you provided it used to help generate a response.

WARC-GPT is also highly-configurable, and every single part of the pipeline can be inspected, tweaked and replaced. Our goal here was not to build a production-ready chatbot, but instead a research boilerplate for folks to explore the intersection of RAG and web archives.

In that spirit, WARG-GPT lets you communicate with both open-source and proprietary LLMs, and switch between models as the conversation goes.

It also comes with a REST API that helps build experiments “on top” of WARC-GPT.

So that’s what it does, but how does it work?

The “RAG” flavor we used here is what I would call “vanilla”, in the sense that it is the most common way of implementing RAG. A more technical term for it would be “vector-based asymmetric semantic search”: let’s break that down a little bit.

The knowledge base WARC-GPT uses comes from text it parsed out of WARC files the user provided.

These text excerpts, extracted from HTML and PDF records, are processed using a “text similarity” model before being saved into a vector store, which is a specialized type of database.

What a text-similarity model does is that it encodes the “meaning” of a text excerpt into a vector, which is a fixed-length series of numbers.

Doing this for an entire collection allows for performing search by mathematically grouping related vectors together based on their shared “meaning”, for example using cosine similarity.

The “asymmetric” part here is that the text used for questions is much shorter and different in nature than the text excerpts. The text similarity model is trained to match these elements together, because their meaning is connected.

This is what you can see here on this very simplified 2D plot which represents a vector store generated by WARC-GPT. The blue dots are text excerpts coming from WARCs, the red ones questions we asked about the collection. The way this works is that the blue dots that are the closest to the red ones are going to be used to help answer the questions.

By the way WARC-GPT comes with a feature that lets you generate that kind of plot so you can see how well it ingested your collection.

It also lets you configure deeply how this RAG pipeline works, as there is no one size fits all.

Kristi: We put WARC-GPT to the test with a small experiment to see if it behaved as designed. We were particularly interested in seeing how WARC-GPT might be able to highlight the utility and value of web archives by allowing users to access web archives in a new way, potentially offering a different starting point for scholarly inquiry. How might WARC-GPT help reveal connections that may have been hidden or hard to find, or help you find relevant resources more quickly?

So we put together a small thematic collection of 78 URLs about the lunar landing missions of India and Russia in 2023, and we chose this topic because the model we were using for this experiment was likely unaware of recent developments in these missions as the model was trained and released before the missions were launched and completed. For our experiment, we used WARC-GPT’s default configuration using Mistral 7B as our model at temperature 0.0 on a set of 10 questions in a zero-shot prompting scenario, comparing both with and without RAG.

As Matteo and I collaborated, we found that our unique perspectives as engineer and librarian allowed for the collaborative problem-solving AI requires. Librarians have a long history of helping patrons navigate emerging technologies, access the knowledge they need, and evaluate information, and they can harness this expertise to meet patrons where they are and guide them. We think that LAM professionals should be collaborating with engineers in the building, management, and evaluation of AI tools.

Some of the things librarians are particularly well-equipped to help with:

  • Navigating the overwhelming landscape of AI resources and tools.
  • Identifying a suitable collection as a knowledge base for AI applications or vetting AI datasets.
  • With their knowledge of user needs, librarians can identify specific problems or use cases AI is well-suited to address, and when alternative solutions might be more appropriate.
  • Problem formulation, brainstorming questions a user might ask, creating and refining prompts, and knowing what context to provide to the model.
  • Critically evaluating AI output and assessing ethical implications of AI.

A question at the forefront of our minds as we conducted this experiment was: “how might we leverage tools like WARC-GPT not only for accessing web archives, but as a tool for AI literacy?” AI literacy is part of information literacy, so it’s essential to think about how this can start a larger conversation about library principles and reaffirming those library principles as users explore these AI tools. This looks like helping users think through questions like: “how did this AI tool arrive at its search results?,” and thinking about how we can help users go from thinking about generative AI as merely a tool, to thinking about AI as a subject worthy of investigation in its own right. And more broadly, thinking about how AI might change how we communicate, learn, and make sense of the world.

It’s important to note that this new mode of access comes with important trade-offs when compared with “traditional” search. Variability is not a bug, but a feature of LLMs. LLMs sit at an interesting intersection, being neither a database we can query facts from, nor a person we can ask to perform judgments. We see them instead as imperfect simulations of what a person could answer to a question, providing statistically-reasonable responses to prompts. Variability plays a big role in RAG pipelines: decisions like your choice of embedding model, prompt, temperature, or other settings can yield different results.

Understanding the potential and the underlying limitations is key for AI literacy and trust calibration, because users may place too much trust in AI without critically evaluating its output. The evaluation criteria we used to analyze our experiment’s output was inspired by the work of Gienapp et al. We analyzed the following criteria for our experiment:

  • Coherence: Is the response structured well in terms of logic and style?
  • Coverage: Is the response pertinent to the question’s information need in terms of breadth and depth? Were the embeddings pulled relevant?
  • Consistency: Is the response free of contradictions and accurately reflects the source information the system was provided as context? Was that context used appropriately in its response?
  • Correctness: Is the response factually correct and reliable, free of factual errors?
  • Clarity: Is there clarity of language and content?

Let’s take a look at one of the embeddings. This is one of the text excerpts retrieved in response to the question “Identify the cause of Luna 25’s crash”. The red text are the words used in the prompt itself, the highlighted yellow text are the portions of the embedding that the model pulled in to formulate its response. We observed in most cases that the model ignored irrelevant pieces of information or noise in the embeddings. In the response, you can see that it copied near verbatim from the source text.

As all of you in the audience experiment with AI, we encourage you to not only talk to a librarian, but also think like a librarian yourself. AI experimentation should follow this continued cycle of “reflect, refine, repeat.” Metacognition plays a crucial role in AI literacy, and in order to effectively calibrate trust, you must form a correct mental model of AI’s error boundaries, and engage in critical reflection about your own thinking and learning throughout the process.

We’ve coined the term Librarianship of AI to describe this emerging practice and guiding framework as we build tools like WARC-GPT to lower the barrier to entry for librarians and others to understand AI in their particular domain, and empower them to form their own opinions and frameworks for thinking about AI and new ways to access knowledge for their communities. We define Librarianship of AI as “the study of models, their implementation, usage and behavior as a way of helping users make informed decisions and empowering them to use AI responsibly.”

So what’s next for WARC-GPT? While still an experimental tool, WARC-GPT is a step toward collectively understanding the potential and limitations of using AI to explore web archives, and getting at this core question of how we can encourage users to explore the troves of information stored in web archives. We welcome any feedback and contributions to WARC-GPT. The code is open source, and we’d encourage anyone interested to not only use it, but build off of it. We have a few ideas about topics that we see as compelling looking forward, which we’ll share now to either be undertaken by us or others.

First off, we think it would be worthwhile to explore some of the potential and limitations more deeply. This could look like testing WARC-GPT with a larger collection to see if our initial findings actually scale up, conducting automated benchmarking to figure out how much a model “knows” about a given collection, or thinking about description work and what it might mean for your team’s process if you could interact with a chatbot that had insight into what you’ve captured from the web.

We could also envision a future pipeline with multimodal embeddings that would enable you to explore images in web archive collections.

The current setup of WARC-GPT is not well suited for time-based search and comparison, so it could be valuable to explore how WARC-GPT might help us understand how a given website has changed over time.

Web archive playback within WARC-GPT would also allow for easier source verification, fact-checking, and information evaluation.

Lastly, with the rapid pace of AI development, it’s likely that we will need to adapt to new capabilities. Some of the challenges we encounter with RAG today such as context window length could improve in the future.

Have a question or idea, want to collaborate? Get in touch with us at lil@law.harvard.edu. Thank you!

Generative-AI Summarization / Distant Reader Blog

Ann Blair's book Too Much To Know overflows with techniques of how pre-early modern scholars dealt with information overload. [1] One of the more oft-used techniques is summarization. With the advent of generative-AI, it is almost trivial to create more-than-plausible summaries of documents.

The linked Python script is an example. Given the path to a plain text file, the script will load a configured large-language model, vectorize the given plain text file, compare the two, and output a three-sentence summary. I enhanced the script to work in batch, and thus I have used the technique to summarize collections of items:

For any given document there are zero 100% correct summaries; everybody will summarize a document differently. That said, the results of this automated process look pretty good to me. Moreover, each list of summaries addresses difficult to answer questions such as:

  • how can Jane Austen's works be characterized?
  • what is rheumatoid arthritis and what are some of its treatments?
  • how is climate change being manifested across the globe?
  • how has the practice of cataloging changed over time?

The lists of summaries may be deemed as information overload in-and-of themselves, and one might consider summarizing the summaries. Such is an exercise left up to the reader.

I believe libraries and librarians ought to learn how to exploit generative-AI for summarization purposes. Just as the migration of printed cards to MARC transformed how libraries hosted catalogs, migrating from hand-crafted summaries to computed summaries will transform how information overload is managed.

[1] Blair, Ann. 2010. Too Much to Know : Managing Scholarly Information Before the Modern Age. New Haven Conn: Yale University Press.

We'll run 'til we drop / Eric Hellman

(I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

 It wasn't the 10 seconds that made me into a runner.

Eric running across a bridge

I started running races again 20 years ago, in 2004. It was a 10K sponsored by my town's YMCA.  I had run an occasional race in grad school to join my housemates; and I continued to run a couple of miles pretty regularly to add some exercise to my mostly sitting-at-a-computer lifestyle. I gradually added 10Ks - the local "turkey-trot"  because the course went almost by my house - and then a "cherry-blossom" run, through beautiful Branch Brook Park. But I was not yet a real runner - tennis was my main sport.

In 2016, things changed. My wife was traveling a lot for work, and one son was away at college, and I found myself needing more social interaction. I saw that my local Y was offering a training program for their annual 10K, and I thought I would try it out. I had never trained for a race, ever. The closest thing to training I had ever done was the soccer team in high school. But there was a HUGE sacrifice involved - the class started at 8AM on Saturdays, and I was notorious for sleeping past noon on Saturdays! Surprise, surprise, I loved it. It was fun to have people to run with. I'm on the silent side, and it was a pleasure to be with people who were comfortable with the  somewhat taciturn real me.

I trained really hard with that group. I did longer runs than I'd ever done, and it felt great. So by race day, I felt sure that I would smash my PR (not counting the races in my 20's!). I was counting on cutting a couple of minutes off my time. And I did it! But only by a measly 10 seconds. I was so disappointed.

But somehow I had become a runner! It was running with a group that made me a runner. I began to seek out running groups and became somewhat of a running social butterfly.

Fast-forward to five weeks ago, when I was doing a 10-miler with a group of running friends (A 10 miler for me, they were doing longer runs in training for a marathon). I had told them of my decision to do New York this fall, and they were soooo supportive. I  signed up for a half marathon to be held on April 27th  - many of my friends were training for the associated full marathon. The last 2 miles were really rough for me (maybe because my shoes were newish??) and I staggered home. That afternoon I could hardly walk and I realized I had strained my right knee. Running was suddenly excruciatingly painful.

By the next day I could get down the stairs and walk with a limp, but running was impossible. The next weekend, I was able to do a slow jog with some pain, so I decided to stick to walking, which was mostly pain-free. I saw a PT who advised me to build up slowly and get plenty of rest. It was working until the next weekend, when I was hurrying to catch a train and unthinkingly took a double step in Penn Station and re-sprained the knee. It was worse than before and I had only 3 weeks until the half marathon!

The past three weeks have been the hardest thing I've had to deal with in my running "career". I've had a calf strain, T-band strains, back strains, sore quads, inter-tarsal neuromas and COVID get in the way of running, but this was the worst. Because of my impatience.

Run-walk (and my running buddies) were what saved me. I slowly worked my way from 2 miles at a 0.05-to-0.25 mile run-to-walk ratio up to 4 miles at 0.2-to-0.05 mile run-to-walk, with 2 days of rest between each session. I started my half marathon with a plan to run 2 mimutes and walk 30 seconds until the knee told me to stop the running bits. I was hoping for a 3 hour half.

The knee never complained (the rest of the body complained, but I'm used to that!!) I finished with the very respectable time of 2:31:28, faster than 2 of my previous 11 half marathons. One of my friends took a video of me staggering over the finish. 


 I'm very sure I don't look like that in real life.

Here's our group picture, marathoners and half-marathoners. Together, we're real runners.

After this weekend, my biggest half marathon challenge to date, I have more confidence than ever that I'll be able to do the New York Marathon in November - in one piece - with Team Amref. (And with your contributions towards my fund-raising goal, as well.)

We're gonna get to that place where we really wanna go and we'll walk in the sun

Jim Thorpe Half Marathon 2024 results. 

My half on Strava.

This series of posts:

Running Song of the Day / Eric Hellman

(I'm blogging my journey to the 2024 New York Marathon. You can help me get there.)

Steve Jobs gave me back my music. Thanks Steve!

I got my first iPod a bit more than 20 years ago. It was a 3rd generation iPod, the first version with an all-touch control. I loved that I could play my Bruce, my Courtney, my Heads and my Alanis at an appropriate volume without bothering any of my classical-music-only family. Looking back on it, there was a period of about five years when I didn't regularly listen to music. I had stopped commuting to work by car, and though commuting was no fun, it had kept me in touch with my music. No wonder those 5 years were such a difficult period of my life!

Today, my running and my music are entwined. My latest (and last 😢) iPod already has some retro cred. It's a 6th generation iPod Nano. I listen to to my music on 90% of my runs and 90% of my listening is on my runs. I use shuffle mode so that over the course of a year of running, I'll listen to 2/3 of my ~2500 song library. In 2023, I listened to 1,723 songs. That's a lot of running!

Yes, I keep track. I have a system to maintain a 150 song playlist for running. I periodically replace all the songs I've heard in the most recent 2 months (unless I've listened to the song less than 5 times - you need at least that many plays to become acquainted with a song!) This is one of the ways I channel certain of my quirkier programmerish tendencies so that I project as a relatively normal person. Or at least I try.

Last November, I decided to do something new (for me). I made a running playlist! Carefully selected to have the right cadence and to inspire the run! It was ordered to have to have particular songs play at appropriate points of the Ashenfelter 8K  on Thanksgiving morning. It started with "Born to Run" and ended with either "Save it for Later", "Breathless" or "It's The End Of The World As We Know It", depending on my finishing time. It worked OK. I finished with Exene. I had never run with a playlist before.

1. "Born to Run". 2. "American Land". The first part of the race is uphill, so an immigrant song seemed appropriate. 3. "Wake Up" - Arcade Fire. Can't get complacent. 4. "Twist & Crawl - The Beat. The up-tempo pushed me to the fastest part of the race. 5. "Night". Up and over the hill. "you run sad and free until all you can see is the night".  6. "Rock Lobster" - B-52s. The perfect beats per minute.  7. "Shake It Up" - Taylor Swift. A bit of focused anger helps my energy level. 8. "Roulette". Recommended by the Nuts, and yes it was good. Shouting a short lyric helps me run faster. 9. "Workin' on the Highway". The 4th mile of 5 is the hardest, so "all day long I don't stop". 10. "Your Sister Can't Twist" - Elton John. A short nasty hill. 11. "Save it for Later" - The Beat. I could run all day to this, but "sooner or later your legs give way, you hit the ground." 12. "Breathless" - X. If I had hit my goal of 45 minutes, I would have crossed the finish as this started, but I was very happy with 46:12. and a 9:14 pace. 13. "It's The End Of The World As We Know It" - R.E.M. 48 minutes would not have been the end of the world, but I'd feel fine.

Last year, I started to extract a line from the music I had listened to during my run to use as the Strava title for the run. Through September 3, I would choose a line from a Springsteen song (he had to take a health timeout after that). For my New Year's resolution, I promised to credit the song and the artist in my run descriptions as well.

I find now that with many songs, they remind me of the place where I was running when I listened to them. And running in certain places now reminds me of particular songs. I'm training the neural network in my head. I prefer to think of it as creating a web of connections, invisible strings, you might say, that enrich my experience of life. In other words, I'm creating art. And if you follow my Strava, the connections you make to my runs and my songs become part of this little collective art project. Thanks!


Reminder: I'm earning my way into the NYC Marathon by raising money for Amref. 

This series of posts:


Data Package version 2.0 is out! / Open Knowledge Foundation

Text originally published at Frictionless Data blog

We are very excited to announce the release of version 2.0 of the Data Package standard(previously known as Frictionless Specs). Thanks to the generous support of NLnet, starting from November last year, we were able to focus on reviewing Data Package in order to include features that were often requested throughout the years and improve extensibility for domain-specific implementations.

Data Package is a standard for data containerisation, which consists of a set of simple yet extensible specifications to describe datasets, data files and tabular data. It is a data definition language (DDL) and data API that enhances data FAIRness (findability, accessibility, interoperability, and reusability). Since its initial release in 2007, the community has suggested many features that could improve or extend the standard for use cases that weren’t initially envisioned. Those were sometimes adopted, but there wasn’t a versioning or governance process in place to truly evolve the standard.

We started with the issues that had accumulated in the GitHub repository to build our Roadmap for v2. Many of the requested features are now adopted, making Data Package the answer for even more use cases.

In parallel we assembled an outstanding Data Package Working Group composed of experts from the community. We carefully selected a diverse group of people who brought different use-cases, formats, and data types that we would need the Standard to support. Together with them, we crafted a governance model that is explicit, in order to create an environment that adequately supports new contributions and ensures project sustainability.

We would like to thank each one of them for their remarkable contribution and for the incredibly insightful conversations we had during these months. Thank you to my colleague Evgeny Karev, Peter Desmet from the Research Institute for Nature and Forest (INBO), Phil Schumm from CTDS – University of Chicago, Kyle Husmann from the PennState University, Keith Hughitt from the National Institutes of Health, Jakob Voß from the Verbundzentrale des GBV (VZG), Ethan Welty from the World Glacier Monitoring Service, Paul Walsh from Link Digital, Pieter Huybrechts from the Research Institute for Nature and Forest (INBO), Martin Durant from Anaconda, inc., Adam Kariv from The Public Knowledge Workshop, Johan Richer from Multi, and Stephen Diggs from the University of California Digital Library.

If you are curious about the conversations we had during the Standard review, they are all captured (and recorded) in the blog summaries of the community calls. Alternatively you can also check out the closed issues on GitHub.

So what is new in version 2?

During these months we have been working on the core specifications that compose the Standard, namely: Data Package – a simple container format for describing a coherent collection of data in a single ‘package’, Data Resource to describe and package a single data resource, Table Dialect to describe how tabular data is stored in a file, and Table Schema to declare a schema for tabular data.

During the update process we tried to be as little disruptive as possible, avoiding breaking changes when possible.

We put a lot of effort into removing ambiguity, cutting or clarifying under-defined features, and promoting some well-oiled recipes into the Standard itself. An example of a recipe (or pattern, as they were called in v1) that has been promoted to the Standard is the Missing values per field. We also added a versioning mechanism, support for categorical data, and changes that make it easier to extend the Standard.

If you would like to know the details about what has changed, see the Changelog we published.

To increase and facilitate adoption, we published a metadata mapper written in Python. We have also worked on Data Package integrations for the most notable open data portals out there. Many people from the community use Zenodo, so we definitely wanted to target that. They have recently migrated their infrastructure to Invenio RDM and we proposed a Data Package serializer for better integration with the Standard (more info on this integration will be announced in an upcoming blog!). We also created a pull request that exposes datapackage.json as a metadata export target in the Open Science Framework system, and built an extension that adds a datapackage.json endpoint to every dataset in CKAN.

If you want to know more about how to coordinate a standard update, we shared our main takeaways at FOSDEM 2024. The presentation was recorded, and you can watch it here.

And what happens now?

While the work on Data Package 2.0 is done (for now!), we will keep working on the Data Package website and documentation together with the Working Group, to make it as clear and straightforward as possible for newcomers. In parallel, we will also start integrating the version 2 changes in the software implementations).

Would you like to contribute? We always welcome new people to the project! Go and have a look at our Contribution page to understand the general guideline. Please get in touch with us by joining our community chat on Slack (also accessible via Matrix), or feel free to jump in any of the discussions on GitHub.

Funding

This project was funded through NGI0 Entrust, a fund established by NLnet with financial support from the European Commission’s Next Generation Internet program. Learn more at the NLnet project page.

Building state-of-the-art library services through collaboration / HangingTogether

The following post is part of an ongoing series about the OCLC-LIBER “Building for the future” program.

Two hands making a fist bumpPhoto by Markus Spiske on Unsplash

Libraries are navigating a complex landscape of emerging technologies and ever-growing quantities of data, impacting library operations and transforming service offerings. In response, the OCLC Research Library Partnership (RLP) and LIBER (Association of European Research Libraries) jointly convened the five-part Building for the future series between October 2023 and June 2024 to examine the opportunities and responsibilities libraries face in providing state-of-the-art services, as described in LIBER’s 2023-2027 strategy. Overall, the series engaged 266 participants from 28 countries on four continents.

The series concluded on 6 June 2024, with a plenary that combined a synthesis of the lessons learned in the previous small group discussions with insights from a panel of library leaders from LIBER and RLP institutions:

  • Prof. Dr. Ana Petrus, Professor of Data Management, University of Applied Sciences of the Grisons (Switzerland)​
  • Amanda Rinehart, Life Sciences Librarian, The Ohio State University​ (United States)
  • Dr. Peter Verhaar, Assistant Professor, Leiden University Centre for the Arts in Society​ (Netherlands)

Rachel Frick, Executive Director, Research Partnerships and Engagement, OCLC, moderated the panel. Hilde van Wijngaarden, Director of the University Library, Vrije Universiteit Amsterdam, shared welcoming and closing remarks on behalf of LIBER. The event recording is available, and I encourage you to watch it at your convenience. The remainder of this post offers reflections on the overall Building for the Future series, including insights from the three closing plenary panelists.

Through all the sessions in the Building for the Future series, three key themes dominated:

  • Collaboration
  • Adaptability
  • Responsibility

Collaboration

In order to provide state-of-the-art services, libraries must collaborate. The facilitated discussion exploring the challenges and opportunities of research data management (RDM) surfaced the imperative for collaboration across two different axes:

  • Cross-campus collaboration or “social interoperability” with other institutional stakeholders, including research administrators and information technology (IT) professionals
  • Multi-institutional collaboration to scale capacity and expertise.

In the opening plenary, Courtney Mumma, Deputy Director of the Texas Digital Library, described how the Texas Digital Library consortium scales expertise, capacity, and technology in its provision of the Texas Data Repository. ​OCLC Research provides an in-depth case study of this multi-institutional collaboration in its recent report, Building Research Data Management Capacity: Case Studies in Strategic Library Collaboration.

Collaboration is also needed to support data-driven decision making, again at multi-institutional and local levels. For example, collective collections analyses demand significant investment and commitment from a wide variety of stakeholders across many institutions and library units, and local analyses often require close cooperation with other campus stakeholders.

Adaptability

We also repeatedly heard how libraries must adapt in order to provide state-of-the-art services. Saskia Scheltjens, Head of Research Services at the Rijksmuseum, spoke in the opening plenary about the need for libraries to adapt in a data-intensive environment, which can mean reimagining the role of the library within its parent institution. At the Rijksmuseum, this has resulted in the library being embedded within a larger research services department. Through this adaptation, the library has been able to extend its expertise with information and data beyond library collections, to influence and help steward the broader Rijksmuseum collections. ​

In many cases, adapting means upskilling, both individually and in teams. In the discussion on data-driven decision making, participants described the many ways they are acquiring new skills, such as forming library teams or working groups to support development of data analysis and visualization skills and creating local communities of practice. In the session on AI and machine learning, participants described how they were often acting independently to develop their own AI knowledge, through experimentation with an array of tools. At the most basic level, librarians need access to tools and the time to practice and experiment. Closing plenary panelist Amanda Rinehart described this as a significant obstacle for emerging fields of librarianship; she called upon libraries to acknowledge this challenge and document it transparently in position descriptions, building in time for librarians to independently gain mastery of relevant knowledge.

Library adaptation may also mean new roles. Throughout the conversations we heard from participants that upskilling alone will be insufficient in many libraries. Many research libraries will also need to advocate for the inclusion of workers with new skills, particularly data scientists.

LIBER working groups are collaborating to support skills development and agility by developing the Digital scholarship and data science essentials for library professional (DS Essentials) resource. This growing resource offers guidance to library professionals seeking to learn more about data science topics, including AI and machine learning, collections as data, and much more.

Responsibility

Responsibility was the third main theme heard throughout the series. In the opening plenary, Thomas Padilla, Deputy Director, Archiving and Data Services, Internet Archive, spoke about the brave new world of AI and machine learning and its impact on libraries. He emphasized the need for libraries to think and act responsibly in a rapidly changing technological environment by examining practices to ensure alignment with ethical principles and social justice values, particularly related to privacy, diversity and inclusion, and sustainability. You can explore these ideas further in Responsible Operations: Data Science, Machine Learning, and AI in Libraries, the OCLC Research position paper Thomas authored.

Event participants expressed a variety of perspectives on AI and machine learning, ranging from curiosity to skepticism to apprehension. And while there are many possible use cases for AI in libraries, it’s becoming apparent that one of the most important roles for research libraries will be in AI literacy. Closing plenary panelist Peter Verhaar observed: “Librarians may not necessarily be engaged in fundamental research on AI themselves, but they can help. . . to bridge the gap between people who develop the technology and the people who want to use the technology.”

Better together

Over the course of this partnership, OCLC and LIBER have learned from each other, and better yet, by combining our networks, we’ve been able to support discussions at a global scale, connecting librarians with peers outside of their network. All three panelists in the closing plenary remarked on the importance of the transcontinental conversations, highlighting that despite differences in our geographic and institutional contexts, there are many similarities among the challenges we face. Panelist Ana Petrus found value in the series because it surfaced “. . .the similarities, the feeling of not being alone, that other people are having similar ideas and. . . struggling with similar things.” In short, it helped us to all to find commonality in a global setting.

OCLC and LIBER look forward to continuing our partnership, and we are excited to share our upcoming plans later this year.

The post Building state-of-the-art library services through collaboration appeared first on Hanging Together.