Planet Code4Lib

Evergreen releases 3.17.2 and 3.16.8 are available / Evergreen ILS

The Evergreen release team is pleased to announce that point releases for 3.17.2 and 3.16.8 are available.

These releases contain updates for Evergreen’s RESTful API suite, search fixes, and Angular interface updates.

Files and release notes are available on the Downloads page: https://evergreen-ils.org/egdownloads/

Thanks to the June release team: Galen Charlton (Equinox), Gina Monti (Bibliomation), Andrea Buntz Neiman (Equinox) and Chris Sharp (PINES); as well as everyone who contributed fixes and testing to this release.

I hate compilers / Xe Iaso

Anubis is about to get WebAssembly-based proof of work checks so that administrators can use a non-SHA256 proof of work method to protect their websites. Part of the implementation goals of this work is that the check logic is defined in one place on both client and server. The client and server will then hook into the WebAssembly in order to make sure they're running in lockstep.

However, one small problem comes up. What do you do when the client has WebAssembly disabled? I really don't want to de-facto lock people out of websites. Anubis exists in an impossible balance of user experience, administrator experience, and developer experience and any change to any of these factors disrupts the balance for other factors.

To work around this and also fulfill the goal of having check logic defined once, I decided to take inspiration from the legendary talk The Birth and Death of JavaScript and just recompile the WebAssembly to JavaScript. Sure, the resulting JavaScript will be slower than the equivalent WebAssembly (even more so because disabling WASM usually disables the JavaScript JIT, the thing that makes JavaScript fast), but it will finish eventually. Hopefully it will be more efficient than the existing JavaScript is on lower end hardware, but research is required.

Luckily enough, the tool I need (wasm2js from the binaryen project) is packaged in Linux distributions. The bad news is that distributions ship ancient versions of it that don't produce the same output as the Homebrew copy on my development machine.

In order to really make sure that the output of this is deterministic (essential for reproducible builds), I need to bundle a copy of wasm2js. So I did that by building a version of wasm2js compiled to WebAssembly with wasi-sdk. The rest of the article is the tale of reproducibility woe that led to the implementation I ended up with. Buckle up and enjoy the ride!

Reproducible builds are surprisingly hard

Aoi is wut
Aoi

Back up a sec, this doesn't make sense to me. If you have the same bytes of input to a compiler, you should get the same bytes of output assuming that the compiler flags, target, and other platform details are controlled for right? A compiler is just a deterministic function of input source code becomes output bytecode, right?

Numa is smug
Numa

lol you'd think, but no, it's not. In theory it is (and for small scale compilers it definitely is), but in practice compilers are strange and complicated beasts containing multitudes that no mere mortal can fully comprehend on their own.

There are a shocking number of ways to accidentally create nondeterministic output when doing C/C++ development. One of the easiest is to use the builtin __DATE__ and __TIME__ macros to stamp a build with the time the compiler was executed at:

// hello.cpp

#include <iostream>

int main() {
    std::cout << __DATE__ << " " << __TIME__ << std::endl;
    return 0;
}

Building and running it once gets me this:

$ make clean && make hello.wasm && wasmtime run -W exceptions=y ./hello.wasm
rm -f hello.o hello.wasm
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false  -c hello.cpp -o hello.o
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false  -fwasm-exceptions -lunwind --no-wasm-opt hello.o -o hello.wasm
Jun 18 2026 00:00:59

Another time it gets me this:

$ make clean && make hello.wasm && wasmtime run -W exceptions=y ./hello.wasm
rm -f hello.o hello.wasm
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false  -c hello.cpp -o hello.o
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false  -fwasm-exceptions -lunwind --no-wasm-opt hello.o -o hello.wasm
Jun 18 2026 00:01:11

Even though the source code had the same bytes, the output of the compiler was wildly different.
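A quick way to catch this class of problem before it reaches a build is to scan the sources for the timestamp macros. The following is a minimal sketch and not part of the Anubis build; the function name find_timestamp_macros is made up for illustration.

```shell
#!/bin/sh
# find_timestamp_macros FILE...
# Prints every line that uses __DATE__, __TIME__, or __TIMESTAMP__,
# prefixed with its line number.
# Exits 0 if at least one use was found and 1 if none were (grep's status).
find_timestamp_macros() {
  grep -nE '__(DATE|TIME|TIMESTAMP)__' "$@"
}
```

Running `find_timestamp_macros hello.cpp` on the example above would flag the `std::cout` line. Some compilers also honor the `SOURCE_DATE_EPOCH` environment variable to pin these macros to a fixed time, though whether a given toolchain does so needs checking.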

In order for users and packagers to trust the binaries of wasm2js I'm committing to the Anubis repo, I need to make sure that you can build the same version I built, down to the same bytes. For an added bonus, you should be able to build this on your machine and get the same bytes I got.

Numa is smug
Numa

That sure does sound like a great ideal, it would be horrible if something unforeseen came up to ruin it!

Clang silently runs wasm-opt from $PATH behind your back

Besides wasm2js, binaryen ships other useful tools such as wasm-opt. wasm-opt optimizes WebAssembly compiler output to let you eke out more performance. This doesn't work in every circumstance, but when it does work it makes a huge difference. As such, clang shells out to wasm-opt when doing builds.

This normally makes sense, but in this case it caused builds to fail on my DGX Spark because its version of wasm-opt is too old:

$ uname -m && which wasm-opt && wasm-opt --version
aarch64
/usr/bin/wasm-opt
wasm-opt version 108

Compared to my workstation which installs wasm-opt from Homebrew:

$ uname -m && which wasm-opt && wasm-opt --version
x86_64
/home/linuxbrew/.linuxbrew/bin/wasm-opt
wasm-opt version 130

Turns out that wasi-sdk and binaryen rely on the WebAssembly Exceptions extension. This is a reasonable thing to assume given that wasi-sdk mostly assumes you're building things for web browsers and 93.86% of browser users have a browser engine new enough to support it. C++ is also one of the main places where exceptions are used, so I guess WebAssembly-native exception handling removes a lot of boilerplate here.

Both wasmtime and wazero require you to opt into exception support. This is fine; we can just pass -W exceptions=y to wasmtime and use a custom runner harness for wazero. The annoying part is what happens when my arm machine's anemic build of wasm-opt sees exception handling instructions: it exits with an error, which made the build fail.

The solution was to pass --no-wasm-opt at the linking step. This removed one angle of irreproducibility.
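For builds that do want wasm-opt, one option is to check what is on $PATH before clang picks it up. This is a rough sketch that assumes the `wasm-opt version N` output format shown above; both function names are hypothetical, and the minimum version is a parameter because the article does not say which release is new enough.

```shell
#!/bin/sh
# wasm_opt_version_number TEXT
# Extracts the number from a line such as "wasm-opt version 130".
wasm_opt_version_number() {
  printf '%s\n' "$1" | awk '$2 == "version" { print $3; exit }'
}

# wasm_opt_is_at_least MINIMUM [TEXT]
# Succeeds when the reported version is >= MINIMUM.
# TEXT defaults to the output of whichever wasm-opt is first on $PATH;
# a missing wasm-opt yields empty text and therefore a failure.
wasm_opt_is_at_least() {
  minimum=$1
  text=${2-$(wasm-opt --version 2>/dev/null)}
  found=$(wasm_opt_version_number "$text")
  [ -n "$found" ] && [ "$found" -ge "$minimum" ]
}
```

A build script could then run something like `wasm_opt_is_at_least 120 || echo "wasm-opt on PATH is too old"`, where 120 is only a placeholder threshold.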

Mara is hacker
Mara

I guess in the future we could make it use the version of wasm-opt it just built to optimize the output, but that may be a premature optimization for now.

Clang relies on address layout for ordering things

The version of clang that I use to compile wasm2js has some address-sensitive code generation hiding in its exception handling path. Raw pointer values leak into the order a handful of try_table blocks come out in. This surfaces as every build differing from the next by about 29 bytes:

-002a9af0: 2802 0441 0647 0d00 1f40 0103 0820 0241  (..A.G...@... .A
-002a9b00: 206a 2103 2002 4138 6a20 0141 086a 10b5   j!. .A8j .A.j..
-002a9b10: 8881 8000 2104 0b1f 4001 0304 2003 2004  ....!...@... . .
+002a9af0: 2802 0441 0647 0d00 1f40 0103 041f 4001  (..A.G...@....@.
+002a9b00: 0309 2002 4120 6a21 0320 0241 386a 2001  .. .A j!. .A8j .
+002a9b10: 4108 6a10 b588 8180 0021 040b 2003 2004  A.j......!.. . .

To make this easier to spot, here's a partial disassembly:

  i32.load  offset=4            ;; 28 02 04
  i32.const 6                   ;; 41 06
  i32.ne                        ;; 47
  br_if     0                   ;; 0d 00
- try_table (catch_all_ref 8)   ;; 1f 40 01 03 08
+ try_table (catch_all_ref 4)   ;; 1f 40 01 03 04
+ try_table (catch_all_ref 9)   ;; 1f 40 01 03 09
    local.get 2                 ;; 20 02
    i32.const 32                ;; 41 20
    i32.add                     ;; 6a
    local.set 3                 ;; 21 03
    local.get 2                 ;; 20 02
    i32.const 56                ;; 41 38
    i32.add                     ;; 6a
    local.get 1                 ;; 20 01
    i32.const 8                 ;; 41 08
    i32.add                     ;; 6a
    call 17461                  ;; 10 b5 88 81 80 00
    local.set 4                 ;; 21 04
  end                           ;; 0b
- try_table (catch_all_ref 4)   ;; 1f 40 01 03 04
    local.get 3                 ;; 20 03
    local.get 4                 ;; 20 04

The computation is nearly identical, but the byte order is just different enough to also make the catch references differ. This also fires when you build this pinned version of wasm2js on arm64 machines because its pointer iteration order is different from what it is on my workstation.
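Differences like this are easy to miss without a byte-level comparison of two builds. Here is a small sketch using cmp; the function names are invented for this example, and it is not the tooling that produced the dump above.

```shell
#!/bin/sh
# diff_byte_count A B
# Prints how many byte positions differ between files A and B.
# `cmp -l` emits one line per differing byte; awk counts those lines.
diff_byte_count() {
  cmp -l "$1" "$2" 2>/dev/null | awk 'END { print NR }'
}

# first_diff_offset A B
# Prints the 1-based offset of the first differing byte, or nothing
# when the compared bytes are all equal.
first_diff_offset() {
  cmp -l "$1" "$2" 2>/dev/null | awk 'NR == 1 { print $1 }'
}
```

Two caveats: cmp only compares the common prefix when the files have different lengths, and an inserted or reordered instruction shifts the bytes after it, so the count shows how much of the file moved rather than how many instructions changed.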

To work around this, I took two steps:

  1. Disable address-space randomization for this build using setarch --addr-no-randomize.
  2. Create known good sha256 checksums for both x86_64 and arm64 via building this program on machines I trust.
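The two steps could be scripted roughly like this. It is a sketch under assumptions: the helper names are hypothetical, the arm64 suffix is inferred from the checksum file names used in the CI job, and the build is expected to run on a machine you trust.

```shell
#!/bin/sh
# arch_label MACHINE
# Maps `uname -m` output to the suffix used by the checksum files.
arch_label() {
  case "$1" in
    x86_64|amd64) echo x86_64 ;;
    aarch64|arm64) echo arm64 ;;
    *) echo "$1" ;;
  esac
}

# build_without_aslr COMMAND...
# Runs COMMAND with address-space randomization disabled (util-linux setarch).
build_without_aslr() {
  setarch "$(uname -m)" --addr-no-randomize "$@"
}

# record_checksums OUTFILE FILE...
# Writes sha256 checksums of FILEs to OUTFILE in the format `sha256sum -c` reads.
record_checksums() {
  out=$1
  shift
  sha256sum "$@" > "$out"
}
```

On a trusted machine this might be used as `build_without_aslr ./build.sh && record_checksums "shasums.$(arch_label "$(uname -m)")" wasm-opt_130.wasm wasm2js_130.wasm`. Some container sandboxes block the personality change that setarch needs, so the first helper may fail there.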

I also made a CI job to ensure this:

- name: Ensure reproducibility
  run: |
    cd ./utils/wasm/wasm2js
    ./build.sh
    if sha256sum -c --status shasums.x86_64; then
      echo "OK: rebuilt modules match the recorded x86_64 checksums"
    elif sha256sum -c --status shasums.arm64; then
      echo "OK: rebuilt modules match the recorded arm64 checksums"
    else
      echo "::error::rebuilt wasm2js/wasm-opt match neither recorded checksum set on ${{ matrix.runner }}" >&2
      sha256sum wasm-opt_130.wasm wasm2js_130.wasm
      exit 1
    fi

To be extra sure, we have this job run on both x86_64 and arm64 hosts. I'd really love to have this be reproducible across hosts, but that's an upstream LLVM bug that I am not powerful enough to tackle. If you work on LLVM and are reading this, it would be nice to set a seed of some kind to ensure that this iteration order is fixed across architectures.

At the very least builds are deterministic within architectures. This may have to be good enough for now.

120 blocks, one story: The collective creation of the OCLC Quilt / HangingTogether

Collaboration. Serendipity. Diversity. These are the qualities that come to mind when I think about this year’s OCLC quilt and the community that created it.

The OCLC Quilters, a group of current and retired OCLC employees, have spent months creating a quilt of 120 cross-cut blocks to donate to the silent auction held during the ALA 2026 Annual Conference. The ALA BiblioQuilters annually host this auction as a fundraiser for the Christopher Hoy Scholarship, which awards a $5,000 scholarship each year to a U.S./Canadian citizen or permanent resident who is pursuing an MLS in an ALA-accredited program.

This is the fourth year in a row that the OCLC Quilters have donated a quilt to the silent auction. Their work inspired me to take up sewing about a year ago, and I’m proud to move from admirer to participant, contributing to the OCLC quilt for the first time. Although left-handed people are about 10% of the population, three of the 13 people contributing to the OCLC quilt, including myself, are left-handed. While that doesn’t affect the result, it requires a few adjustments in technique and having the appropriate scissors. Sharing advice on adapting equipment and shopping for left-handed supplies is one of the ways we support each other.

Nine people in a bright lobby pose with a large, colorful patchwork quilt featuring a grid of multicolored fabric squares; four stand behind holding the quilt. Four people sit on a blue bench in front and five stand behind the bench.

Nine of the 13 contributors to the OCLC Quilt for the ALA 2026 Annual Conference

Like all handicrafts, quilting is an activity with its own nomenclature. As a quilter and cataloger, I found myself wondering: “What controlled vocabulary terms could I use to describe the OCLC quilt?” There are several from vocabularies such as Library of Congress Subject Headings (LCSH) and Getty Art and Architecture Thesaurus (AAT). These are listed at the end of this blog.

A quilt is created from many elements that may not be individually significant but form a meaningful whole, just like a WorldCat bibliographic record. The blocks of the quilt function like data elements in a WorldCat record, with contributions from multiple individuals creating the larger work.

Assembling the quilt

OCLC quilters sewed 120 blocks, which are the fabric squares comprising the quilt’s front. The blocks are a cross-cut design—a pattern chosen because it is accessible for novice sewists and makes good use of small fabric pieces. Quilters often save these leftover pieces, called “scraps,” from other projects for future use. Reusing scraps makes quilting a sustainable craft, and quilters often share them with one another. An experienced OCLC quilter, who keeps her scrap collection organized in true librarian fashion, donated most of the fabric pieces used for the blocks.

Experienced quilters arranged and sewed the blocks together and cut the batting (soft material used between the front and back sides of the quilt). The next step, in which three layers are sewn together with a decorative stitch, is quilting. This is the strict definition of the term “quilting,” although it is often used to refer to the entire process of creating a quilt. The pattern used for the quilt stitching is called “modern ties,” and it looks a bit like tied shoelace loops.

Close-up of a colorful patchwork quilt made of bright, patterned fabric squares and cross-shaped strips; a green strip includes the white OCLC logo.

An OCLC logo is incorporated into one of the quilt blocks

The final step is to sew a long strip of binding fabric around the edges of the quilt, which will keep the ends from fraying as well as being decorative. Two labels were sewn into the binding: “Made in OH” and “Is it perfect? No.” Both of these labels are accurate descriptions of this quilt, but unlike in bibliographic descriptions, a certain amount of imperfection is not only tolerated but may be considered part of the quilt’s charm.

A quilting tradition at ALA Annual

The OCLC quilt will be one of many available at the ALA BiblioQuilters silent auction during ALA Annual in Chicago, Illinois. The BiblioQuilters were founded at the 1998 ALA Annual Conference in Washington, D.C. Since 2000, the BiblioQuilters have had a silent auction of quilts every year except 2020 and 2021 (because of the pandemic). The quilts are usually available to view and bid on near the registration area. If you are attending ALA in Chicago, I highly recommend you visit the auction table to view them. After ALA, you may be inspired to browse the shelves of your local public library for 746.46, the Dewey Decimal number for quilting.

Subject vocabulary terms

For those readers who appreciate quilting and metadata, the following controlled vocabulary terms reflect concepts discussed in this blog. You might even find it fun to match the concepts to the natural language descriptions!

Getty Art and Architecture Thesaurus terms

batting

binding (textile material)

blocks (quilt components)

fabric scissors

quilting

Library of Congress Subject Heading terms

Quilting

Sewing—Left-handed techniques

Textile fabrics

The post 120 blocks, one story: The collective creation of the OCLC Quilt appeared first on Hanging Together.

Author Interview: Cynthia Pelayo / LibraryThing (Thingology)

Cynthia Pelayo

LibraryThing is pleased to sit down this month with groundbreaking author and poet Cynthia Pelayo, who in 2022 became the first Puerto Rican and first Latina to win a Bram Stoker Award after her Crime Scene took the prize in the Poetry Collection category. Her Into The Forest And All The Way Through was a 2020 nominee, also in the Poetry Collection category, while her Children of Chicago was a 2021 nominee in the Novel category. Pelayo earned a BA in Journalism from Columbia College Chicago, an MS from Roosevelt University, and an MFA in Writing from the School of the Art Institute of Chicago. She is currently pursuing a PhD in English. Her MFA writing thesis, Lotería, was republished in 2023, winning an International Latino Book Award Silver Medal in the Best Collection of Short Stories category. A co-publisher of Burial Day Books, which focuses on horror writing, she is the author of numerous other books, stories and poems, including novels such as The Shoemaker’s Magician (2023), Forgotten Sisters (2024), and Vanishing Daughters (2025). Her new novel, It Came from Neverland, a work of horror inspired by the classic Peter Pan, was published by Crooked Lane Books earlier this month. Pelayo sat down with Abigail this month to discuss her new book.

Tell us a little bit about It Came from Neverland. How did the idea for the story first come to you?

Like many people, I grew up watching the Disney version of Peter Pan, and then I remember watching Hook with Robin Williams and being captivated, seeing Peter Pan as an adult who had to remember who he was. There was an older Wendy in that film, and that always stayed with me because I really wanted to know what Wendy’s story looked like aged into young adulthood.

When I went back to J.M. Barrie’s Peter and Wendy and the original play, I worked out that Wendy would be in her early twenties at the start of the First World War. And then I learned that many young men at that time lied about their age to enlist; some of them were barely more than boys. It all felt like a perfect juxtaposition, these boys going off to war and Wendy’s trauma around caring for the Lost Boys from Neverland.

A woman of that time period would certainly not be believed if she tried to tell the truth about what she experienced as a child in Neverland, and so that certainly played into her experience. Then I thought well, Wendy at this age would certainly be in a position that reflected her character, and schoolteacher fit perfectly. The story wrote itself, because I knew that Wendy would do all that she could to protect those children and I also knew Peter Pan would surely return to whisk more children away to Neverland. So this story is that tale, what does she do to stop Peter Pan.

What drew you to Peter Pan, and what made you feel it was horror?

Peter Pan without Wendy Darling is just a boy screaming into the dark. The story only works because Wendy agrees to go with him to Neverland, and she goes because she is sweet and kind and she believes him. Wendy’s only failure here was that she had a good heart, which is just so sad because her being nice is why she was taken advantage of. She trusted someone who was manipulating her. Peter told her she was special, but what he really meant was that she was useful. She was given the role of mother to the Lost Boys, not because she was truly loved or valued, but because someone needed to do the mending. This is not fantasy. This is a domestic horror story dressed in fairy dust.

In Peter and Wendy we’re also essentially told that growing up is a curse, but I push back on that. Growing up is the adventure, becoming yourself, and gaining autonomy is the gift.

Peter’s entire pitch was stay here, never change, never leave me. Shrink. Lose yourself to praise me. That’s not love. That’s control.

The horror was always there. I just removed the glitter.

What is it about fairy tales that speaks to you?

Fairy tales were the very first stories I was told as a child. “Little Red Riding Hood,” “Hansel and Gretel,” “Cinderella,” more. I hold all of them dear, even though many of them have a thread of terror, but I suppose that’s why I’m the writer I am today.

What I’ve come to understand is that fairy tales are an early societal warning system, in a way. They prepared us for danger, and that’s why they still work. Little Red Riding Hood is everyone who has been told not to talk to strangers on their way home. Bluebeard is everyone who has been warned to be cautious with suitors. Snow White is every person who has fallen victim to the cruelty of jealousy. These stories survived centuries because beneath them there is some truth that can be applied to many of our experiences today. They encode the things that we should say out loud, but don’t because of all of those strange polite society rules, things like – don’t trust the stranger who flatters you, the beautiful thing is likely the trap, or even, the person who promises you forever can very well be the one who seeks to destroy you.

When looking at all of these through a horror lens, they echo to the horror writer what is our job – and that is to tell the truth. Horror is the genre of truth, to highlight the danger, to be a witness to survival, more. So much of what fairy tales do speaks to this.

Are there other classic works you’re interested in transforming?

Yes, and the one upcoming is a Frankenstein retelling titled Everina from Union Square & Co. I have more, but I can’t mention those quite yet.

Tell us a little bit about your writing process.

I generally write in the early morning. In the evenings that’s when I tend to answer email or work on lectures for any workshops I’m teaching or work on any of my own homework.

In terms of my actual writing sessions, I read before I write, generally I will read some poetry and then start writing. If you’re asking about the big questions regarding how do I create something, I think I’m a pretty methodical writer in that yes, I allow discovery to happen, but as of recent, there is a lot of researching, planning, and pre-writing that goes into my actual writing. Then there is editing, and that is an entirely different process. I tell writers to think of these processes as three separate demands, research and preparation requires one aspect of your brain, editing requires a different aspect of your brain and the actual writing, which is a completely different process. When you are writing you are creating, allow yourself to have fun and explore when you’re actually writing.

What comes next for you?

Something Followed Us Home: Tales of Latiné Horror comes out September 29 from Simon & Schuster. It’s an anthology I edited that features Mariana Enríquez, Agustina Bazterrica, Mónica Ojeda, Isabel Cañas, Daniel José Older, Zoraida Córdova, and others, with a foreword by Brenda Lozano. Latiné horror isn’t a subgenre, it’s a tradition that I’m grateful to have had the opportunity to share with readers.

After that, Everina, which is my Frankenstein retelling.

Tell us about your library. What’s on your own shelves?

I read a lot of poetry and classics. My bookshelves reflect that, so we have Lorca, Vallejo, Pizarnik, Anne Sexton, Gwendolyn Brooks, Adrienne Rich, Dickens, Dostoevsky, Homer, Steinbeck, Tolstoy, more. Yes, there’s horror as well, Shirley Jackson, Angela Carter, Mariana Enríquez, Carmen Maria Machado, Daphne du Maurier. And, of course fairy tales, so many fairy tales and non-fiction texts analyzing fairy tales.

What have you been reading lately, and what would you recommend to other readers?

I am reading Piranesi by Susanna Clarke. I love world building, and here, we have a world that operates on its own logic, the architecture of the space, and the feelings it evokes. Clarke gives us infinite halls that become both prison and sanctuary, and it’s this tension that I’m drawn to as both a reader and writer.

Chatbots vs. Ozone / David Rosenthal

Back in February I posted The Kessler Syndrome, which also included a brief section mentioning the impacts of the proposed megaconstellations on the environment, specifically global warming from CO2 and black carbon, and depletion of the ozone layer. Three months earlier Anton Petrov had examined the last of these in Risk of Ozone Layer Destruction from Internet Satellite Swarms and Rocket Fuel. He has now followed up with SpaceX Is Conducting a Giant Chemical Experiment on Our Atmosphere Without Realizing. Below the fold I survey the papers Petrov cited and a few others.
The papers involved are, in date order, as follows together with extracts from their abstracts:

Impact of Rocket Launch and Space Debris Air Pollutant Emissions on Stratospheric Ozone and Global Climate by Robert Ryan et al (9th June 2022):
Rockets, unlike other anthropogenic pollution sources, emit gaseous and solid chemicals directly into the upper atmosphere. We compile inventories of these chemicals from rocket launches in 2019 and projections of future growth and speculative space tourism activity. We incorporate these in a 3D atmospheric chemistry model to simulate the impact on climate and the protective stratospheric ozone layer. We find that loss of ozone due to current rockets is small, but that routine space tourism launches may undermine progress made by the Montreal Protocol in reversing ozone depletion in the Arctic springtime upper stratosphere. The BC (or soot) particles from rockets are also of great concern, as these are almost five hundred times more efficient at warming the atmosphere than all other sources of soot combined.
Note that even four years ago it was already clear that the space industry was both depleting ozone and aggravating global warming. But this was before the scale of the proposed mega constellations was evident.

Metals from spacecraft reentry in stratospheric aerosol particles by Daniel Murphy et al (7th September 2023):
So far, models of spacecraft reentry have focused on understanding the hazard presented by objects that survive to the surface rather than on the fate of the metals that vaporize. Here, we show that metals that vaporized during spacecraft reentries can be clearly measured in stratospheric sulfuric acid particles. Over 20 elements from reentry were detected and were present in ratios consistent with alloys used in spacecraft. The mass of lithium, aluminum, copper, and lead from the reentry of spacecraft was found to exceed the cosmic dust influx of those metals. About 10% of stratospheric sulfuric acid particles larger than 120 nm in diameter contain aluminum and other elements from spacecraft reentry. Planned increases in the number of low earth orbit satellites within the next few decades could cause up to half of stratospheric sulfuric acid particles to contain metals from reentry.
Much of the reentry burn happens above the stratosphere, and it takes time for the aluminum nanoparticles to drift down to the levels where they were collected. So the 10% number represents pollution from an earlier period with fewer reentries than the 2020s. Murphy notes that:
Most of the meteoric mass is deposited at altitudes between 75 and 110 km by a very large number of sub-millimeter meteoroids. Reentering spacecraft, which are larger and moving more slowly, ablate between 40 and 70 km over a ~300 km long footprint
His samples were collected at 19km altitude.

Potential Ozone Depletion From Satellite Demise During Atmospheric Reentry in the Era of Mega-Constellations by José P. Ferreira et al (11th June 2024):
This paper investigates the oxidation process of the satellite's aluminum content during atmospheric reentry utilizing atomic-scale molecular dynamics simulations. We find that the population of reentering satellites in 2022 caused a 29.5% increase of aluminum in the atmosphere above the natural level, resulting in around 17 metric tons of aluminum oxides injected into the mesosphere. The byproducts generated by the reentry of satellites in a future scenario where mega-constellations come to fruition can reach over 360 metric tons per year. As aluminum oxide nanoparticles may remain in the atmosphere for decades, they can cause significant ozone depletion.
Ferreira et al confirm the potentially long delay between reentry and the nanoparticles reaching the ozone layer and depleting it:
we find that these reentry byproducts may take up to 30 years to settle from the top of the mesosphere into the stratospheric ozone layer. Upon reaching an altitude of about 40 km, aluminum oxides catalyze chlorine activation which promotes ozone depletion. This suggests that concentrations of aluminum oxide compounds may start increasing in the mesosphere well before reaching the stratospheric ozone layer. This would introduce a noticeable delay between the beginning of the injection process when orbiting bodies are decommissioned and the eventual ozone-depletion consequences in the stratosphere.

Investigating the Potential Atmospheric Accumulation and Radiative Impact of the Coming Increase in Satellite Reentry Frequency by Christopher Maloney et al (21st March 2025):
A lack of observations and validated models of reentry demise limits our ability to simulate the complex aerosols associated with reentry, which makes estimating the climate impacts difficult. Aluminum is a primary satellite component and will likely be emitted during reentry vaporization in the form of alumina. Unmodified alumina is a useful approximation for metallic reentry aerosol. In this study, we simulate a potential yearly emission of 10,000 metric tons of alumina from reentering space debris. We investigate how the location of atmospheric accumulation, aerosol size distribution, and radiative properties of reentry alumina impacts the middle atmosphere. We find that 20,000–40,000 metric tons of alumina accumulates at high latitudes between 10 and 30 km in both hemispheres. Small changes in mesospheric heating rates lead to 1.5-K temperature anomalies in the middle atmosphere at high latitudes. These temperature anomalies are accompanied by changes in wind speed in the polar vortex.
So there are thermal effects on the climate as well as the effects on the ozone layer.

Near-future rocket launches could slow ozone recovery by Laura Revell et al (9th June 2025):
To understand if significant ozone losses could occur as the launch industry grows, we examine two scenarios. Our ‘ambitious’ scenario (2040 launches/year) yields a −0.29% depletion in annual-mean, near-global total column ozone in 2030. Antarctic springtime ozone decreases by 3.9%. Our ‘conservative’ scenario (884 launches/year) yields −0.17% annual, near-global depletion; current licensing rates suggest this scenario may be exceeded before 2030. Ozone losses are driven by the chlorine produced from solid rocket motor propellant, and black carbon which is emitted from most propellants. The ozone layer is slowly healing from the effects of CFCs, yet global-mean ozone abundances are still 2% lower than measured prior to the onset of CFC-induced ozone depletion. Our results demonstrate that ongoing and frequent rocket launches could delay ozone recovery. Action is needed now to ensure that future growth of the launch industry and ozone protection are mutually sustainable.
Note that this paper addresses only the ozone depletion from launches, not from reentry. But their 'ambitious' scenario of 5.6 launches/day is far short of Musk's ambitions, let alone the other planned megaconstellations. My understanding is that the 2040 launches/year in their scenario are of Falcon 9 class vehicles but "only 4.4% of launches are using vehicles designed for re-entry", which is implausible. But the mega-constellations can't be built or maintained with Falcon 9s.

Will Lockett is, as one should be, skeptical of Musk's claims. In Musk’s Orbital Data Centre Idea Is Getting More Stupid By The Day he analyzes the claimed "million satellite data center" assuming it is built, as Musk claims, with Starship but over 15 years, a longer timescale than Musk's:
To achieve that, they would need to launch 120,000 satellites per year. Over the 15 years, they would launch 1.8 million satellites, but 800,000 of them would fail (as part of our 9% failure rate), leaving a total operational fleet of one million satellites. This equates to 3,158 Starship launches per year, or nearly nine launches per day. For some context, the current launch rate for Starship is just five per year.
...
In order to keep a million satellites in the constellation, it needs to be maintained. So, each year, SpaceX would have to launch 90,000 AI Sat Minis to replace the roughly 9% of the constellation that failed. That equates to 2,368 Starship launches per year, or 6.4 per day.
That's 9 launches/day for 15 years, then 6.4 launches/day indefinitely, of a rocket that is vastly bigger than Falcon 9 and is completely re-usable.
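The daily rates here are just the annual figures divided by 365. A minimal Python sketch of that arithmetic, using only numbers quoted in this post (the labels and the function name are mine, for illustration):

```python
DAYS_PER_YEAR = 365


def launches_per_day(launches_per_year: float) -> float:
    """Convert an annual launch count into an average daily rate."""
    return launches_per_year / DAYS_PER_YEAR


# Annual launch counts quoted above; the labels are only for display.
scenarios = {
    "Revell et al 'ambitious' (Falcon 9 class)": 2040,
    "Lockett build-out (Starship)": 3158,
    "Lockett maintenance (Starship)": 2368,
}

for label, per_year in scenarios.items():
    print(f"{label}: {launches_per_day(per_year):.2f} launches/day")

# Satellites per Starship implied by Lockett's build-out numbers
# (120,000 satellites/year spread over 3,158 launches/year).
sats_per_launch = 120_000 / 3158
print(f"Implied satellites per launch: {sats_per_launch:.0f}")
```

On these inputs the maintenance rate comes out slightly under 6.5 per day, so Lockett's 6.4 looks like truncation rather than rounding. Either way it exceeds the 'ambitious' Falcon 9 class scenario.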

Of course, these claims are ridiculous - neither logistically nor economically feasible. But assuming Starship or a competitor such as Blue Origin does manage to create a reliable, reusable, 100 ton to LEO launch vehicle, there will be a lot more mass in LEO and a lot more of it reentering.

Measurement of a lithium plume from the uncontrolled re-entry of a Falcon 9 rocket by Robin Wing et al (19th February 2026):
A 10-fold enhancement of lithium atoms was detected at 96 km altitude by a resonance lidar at Kühlungsborn, Germany, approximately 20 hours after the uncontrolled re-entry of a Falcon 9 upper stage. The upper-atmospheric extension of the ICON general circulation model, nudged to ECMWF, was used to calculate winds. Backwards trajectories, including wind variability as measured by radar, traced air masses to the Falcon 9 re-entry path at 100 km altitude, west of Ireland. This study presents the first measurement of upper-atmospheric pollution resulting from space debris re-entry and the first observational evidence that the ablation of space debris can be detected by ground-based lidar. The analysis of geomagnetic conditions, atmospheric dynamics, and ionospheric measurements supports the claim that the enhancement was not of natural origin. Our findings demonstrate that identifying pollutants and tracing them to their sources is achievable, with significant implications for monitoring and mitigating space emissions in the atmosphere.
The effect of lithium and other spacecraft ingredients on the ozone layer doesn't appear to have been studied as much as that of aluminum. To be fair, there will be a lot more aluminum.

Radiative Forcing and Ozone Depletion of a Decade of Satellite Megaconstellation Missions by Connor Barker et al (14th May 2026):
We use a global inventory of launch and re-entry emissions covering the onset of the megaconstellation era (2020–2022), and project these to 2029 based on 2020–2022 growth rates. We implement this inventory into a 3D atmospheric chemistry model to determine the impacts of megaconstellations on the ozone layer and climate. We find that global stratospheric ozone depletion from all mission types is relatively small compared to surface sources and megaconstellation missions only account for about one-tenth of this depletion. This is because rockets launching megaconstellations almost all use kerosene, a large source of black carbon or soot particles, but not of chemicals such as chlorine that directly destroy ozone. Soot from rockets absorbs sunlight, warming the upper layers of the atmosphere and decreasing the amount of sunlight reaching Earth's lower atmosphere, causing it to cool. Megaconstellation missions are responsible for about half of this climate effect. In this regard, rockets launching megaconstellations and other missions are like small-scale stratospheric aerosol injection experiments without forethought for potential unintended consequences.
Again, this paper addresses only atmospheric impacts from launches, not from reentries. And the launch rate for 2020-2022 is far lower, and the rockets much smaller, than the proposed "million satellite data center" and its competitors would need.

An Open Hardware TPU on Your Desk / Harvard Library Innovation Lab

The open-source movement emphasizes the power of freely modifiable, flexible code to support transparency, collaboration, and building outside vendor lock-in. Open hardware extends that logic to the physical layer: chips you can read, modify, and build on. All software runs on hardware, and over the past few years, the ground under the hardware industry has been shifting. Since 2022, the United States, the Netherlands, and Japan have progressively tightened export controls on advanced chips and the equipment used to manufacture them; China has responded with a state-backed effort to reproduce every layer of that supply chain at home. Policy analysts now routinely describe the trajectory as a “fragmentation” or “decoupling” of the global semiconductor market into separate technology spheres. Hardware costs are climbing across the board, so accessing open hardware feels all the more relevant for a group building open-source software.

Open hardware doesn’t make the chips any cheaper. What it changes is what a chip you already own is allowed to become or what kinds of application-specific chips you can create. A single FPGA (Field-Programmable Gate Array) on a shelf can be a video codec today, a custom search accelerator tomorrow, and a faithful copy of a decommissioned architecture the day after that. The unit cost is what it is; the value you can extract from that unit is no longer fixed by a vendor. For institutions whose time horizons stretch over decades, like libraries, archives, and public-interest research groups, that reconfigurability is the part of the open-hardware argument that compounds.

There are no known open-source hardware implementations of Google’s TPU, the in-house chip family Google designed to accelerate neural network math. The silicon is proprietary and not directly purchasable by consumers except in edge TPU form (e.g., Google Coral). This field note reports a small experiment porting the OpenTPU project, a Python simulation of Google’s TPU published by UCSB’s ArchLab, to an inexpensive FPGA board to explore the pros and cons of using more open hardware. To keep the experiment lightweight, most of the code was written by AI coding agents, with a human directing the work.

Why FPGAs, For A Lab Like This One

An FPGA (Field-Programmable Gate Array) is a chip whose internal logic can be reconfigured. You describe the circuit in a hardware description language such as SystemVerilog, compile it to a bitstream, and load it onto the board. A CPU runs your program. An FPGA becomes your program. FPGAs are the obvious place to start, because they’re the one piece of reconfigurable silicon a small lab can actually buy and program today.

The practical difference lies in the shape of the problems they solve. A CPU is a generalist reading a manual one step at a time. A GPU is a factory floor of thousands executing the exact same standard math in unison. An FPGA is a machine whose gears are configured to fit exactly one algorithm. They excel at problems with strange shapes. If your workload involves multiplying massive, uniform matrices to train a language model, you want a GPU. But if your work requires parsing millions of irregular text strings as they stream, finding exact bit-level matches across a sprawling archive, or piping data through a custom hash without ever pausing to fetch instructions from memory, an FPGA is arguably the more elegant approach. A dataset is sometimes only as legible as the software that reads it, and that software is only as runnable as the hardware underneath. An open FPGA design can preserve a faithful copy of obsolete circuitry, thereby preserving the means to read data, not just the data itself.

Diagram illustrating FPGA architecture. Source: Wevolver

For an institution like the lab, that distinction matters. Libraries, archives, public-interest research groups, and smaller labs deal with workloads and questions that are awkwardly sized: too large for a laptop, too specialized to deserve a recurring cloud bill, too long-running for any one grant cycle. For a well-resourced lab, the answer is cloud GPUs. For everyone else, the answer has historically been to wait or to scale down the question. That is why open hardware matters here: it gives smaller institutions a path to specialized computing without waiting for ideal market conditions.

A $300 board on a desk can run a custom-designed circuit, tailored to one workload, indefinitely, without anyone’s permission and without a metered bill. A physical circuit sheds the overhead of an operating system and the constant fetch-and-execute cycle, and can draw a fraction of the power a standard CPU would need for the same task. For a public-interest lab, this efficiency makes running some workloads both financially and environmentally sustainable.

TPU-style architectures are particularly well-suited to this kind of work because of the systolic array at their core. A systolic array is a grid of small multiply-add units that pass partial results to their neighbors on every clock tick, making it efficient for matrix math because data flows through the grid rather than being fetched repeatedly from memory. That structure is designed for exactly the dense-matrix operations that underpin embedding generation, similarity search, and the neural network inference behind modern document analysis. This is the kind of regular, spatial structure FPGAs are built to host. A grid of identical small units, wired to their neighbors, all ticking together, is close to a literal description of what an FPGA’s fabric already is.
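To make the "data flows through the grid" idea concrete, here is a toy, tick-by-tick Python simulation of one common arrangement, often called weight-stationary: each cell holds one weight, activations move right, and partial sums move down. This is an illustrative sketch only. It is not OpenTPU's code or the SystemVerilog from this port, and the function name and register layout are assumptions made for the example.

```python
def systolic_matmul(X, W):
    """Compute X (V x K) times W (K x N) on a simulated K x N grid.

    Cell (r, c) permanently holds W[r][c]. On every tick each cell reads
    only its left neighbor's activation and its upper neighbor's partial
    sum from the previous tick, so nothing is re-fetched from memory.
    """
    V, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activations travelling right
    p_reg = [[0] * N for _ in range(K)]  # partial sums travelling down
    Y = [[0] * N for _ in range(V)]
    ticks = V + K + N - 2  # time for the last vector to drain out
    for T in range(ticks):
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for r in range(K):
            for c in range(N):
                if c == 0:
                    t = T - r  # skewed feed: row r starts r ticks late
                    a_in = X[t][r] if 0 <= t < V else 0
                else:
                    a_in = a_reg[r][c - 1]
                p_in = p_reg[r - 1][c] if r > 0 else 0
                new_a[r][c] = a_in
                new_p[r][c] = p_in + a_in * W[r][c]  # multiply-add
        a_reg, p_reg = new_a, new_p
        # The bottom row emits one finished output lane per column.
        for c in range(N):
            t = T - c - (K - 1)
            if 0 <= t < V:
                Y[t][c] = p_reg[K - 1][c]
    return Y, ticks


if __name__ == "__main__":
    result, ticks = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(result, "in", ticks, "ticks")
```

The nested loops over cells stand in for hardware that updates every cell at once. On an FPGA the whole grid advances in a single clock tick, which is why the structure maps so directly onto the fabric.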

Designs published openly in SystemVerilog are reusable across institutions, as open code is. What has kept this out of reach was never the silicon; it was the labor: months of vendor-tool learning, a small group of knowledgeable practitioners, and debugging cycles unique to physical systems. That cost is what AI coding agents have started to chip away at, making the open-hardware case more practical.

Porting OpenTPU and the Silicon Boundary

OpenTPU is a published academic re-implementation of Google’s first-generation TPU, written as Python code that models the hardware’s behavior, not software meant to run on it. The hardware target of this experiment was the Alchitry Pt v2, a roughly $300 board built on a Xilinx Artix-7 chip, with an add-on that exposes USB 3.0 to a host PC. I worked with AI coding agents to describe goals in plain English, review edits, and iterate. A sandbox repository came first: blink an LED, echo bytes, talk to the USB chip. That groundwork paid off within hours of starting the real port. The OpenTPU translation itself (the systolic matrix unit, the weight memory, the instruction decoder, the activation logic, and the host interface) went through seven planned phases. A few days of wall-clock time later, the SystemVerilog testbench ran the same matrix-multiply program as the original Python simulator and produced the same answer.

Photo of the Alchitry Pt v2 FPGA board used in the experiment. Source: Jenevieve Haggard, HLS LIL

Translating Python that describes hardware into SystemVerilog that synthesizes hardware turns out to be something AI agents are somewhat capable of. The source is unambiguous, and the testbenches give instant feedback. The hard part was always the silicon boundary: a physical board, a vendor toolchain with subtle caching behavior, a USB chip with several gotchas. What worked was a scaffold around the agent that provided deterministic simulation tests for everything testable in software, plus on-board LEDs wired to specific finite-state-machine states, so a human eye could see what the testbench could not. A small notes tool called “cq” (by Mozilla) recorded what each session learned and made future sessions read those notes first. By the end, the store held 48 entries, almost all painfully, expensively earned.

Performance Realities and the USB Latency Bottleneck

By the end, the FPGA produced output that matched a NumPy reference lane-for-lane on a 200-vector benchmark. End-to-end, it was about 2x faster than the original Python simulator; on compute alone, about 4x. It was also comfortably slower than a CPU running optimized BLAS on small inputs. The point of the result is not that the FPGA wins everywhere, but that it shows where open hardware can be useful.
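A lane-for-lane check of this kind can be sketched in a few lines. The version below is a pure-Python stand-in for the NumPy reference mentioned above: the function names, the 8-bit signed value range, and the 8-lane width are all assumptions made for illustration, and the "device output" is a placeholder, not data read back from the board.

```python
import random


def make_matrix(rows, cols, seed):
    """Deterministic pseudo-random integer matrix (assumed 8-bit signed range)."""
    rng = random.Random(seed)
    return [[rng.randint(-128, 127) for _ in range(cols)] for _ in range(rows)]


def reference_matmul(X, W):
    """Plain software reference for X (V x K) times W (K x N)."""
    return [
        [sum(x[k] * W[k][j] for k in range(len(W))) for j in range(len(W[0]))]
        for x in X
    ]


def first_mismatch(expected, actual):
    """Return (vector, lane, expected, actual) for the first differing lane, else None."""
    if len(expected) != len(actual):
        raise ValueError("different number of vectors")
    for v, (e_row, a_row) in enumerate(zip(expected, actual)):
        if len(e_row) != len(a_row):
            raise ValueError(f"vector {v} has a different number of lanes")
        for lane, (e, a) in enumerate(zip(e_row, a_row)):
            if e != a:
                return (v, lane, e, a)
    return None


if __name__ == "__main__":
    X = make_matrix(200, 8, seed=42)  # 200 input vectors, as in the benchmark
    W = make_matrix(8, 8, seed=7)
    expected = reference_matmul(X, W)
    device_output = [row[:] for row in expected]  # placeholder for board results
    print(first_mismatch(expected, device_output))
```

Seeding the generator keeps the benchmark identical between runs, so a mismatch points at the hardware or the transport instead of at changing inputs.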

At this workload size, USB round-trip latency dominates the FPGA’s time budget. FPGAs win when data is moved to the device once, stays there, and is processed many times. Tiny inputs in, tiny outputs out, on every call, is the worst case for this design. We are running the worst case since we’re just at the “make it work” stage. PCIe-attached FPGAs sit much closer to the CPU and avoid this bottleneck entirely; with batched workloads on that kind of board, the compute-side 4x advantage we already see should carry through to the end-to-end number. They’re the natural next step for any workload where the dev-board numbers are encouraging enough to justify the cost.
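The shape of that bottleneck shows up in a very simple cost model: every call pays one round trip, and every vector pays its compute time. The Python sketch below uses made-up numbers purely to illustrate the trend. The latency and compute figures are hypothetical, not measurements from this board.

```python
def end_to_end_seconds(n_vectors, batch_size, round_trip_s, compute_per_vector_s):
    """Total time when vectors are sent to the device in batches.

    Each batch costs one round trip; each vector costs a fixed compute time.
    """
    n_calls = -(-n_vectors // batch_size)  # ceiling division
    return n_calls * round_trip_s + n_vectors * compute_per_vector_s


# Hypothetical values, chosen only to show how batching changes the balance.
N_VECTORS = 200
ROUND_TRIP_S = 1e-3  # assumed cost of one host-to-device round trip
COMPUTE_S = 1e-5     # assumed on-device compute time per vector

if __name__ == "__main__":
    for batch in (1, 10, 200):
        total = end_to_end_seconds(N_VECTORS, batch, ROUND_TRIP_S, COMPUTE_S)
        latency_share = (total - N_VECTORS * COMPUTE_S) / total
        print(
            f"batch={batch:>3}: {total * 1e3:7.2f} ms total, "
            f"{latency_share:.0%} spent on round trips"
        )
```

With one vector per call the round trips swamp the compute; sending everything in one batch pays the latency once, which is the "move the data once and keep it there" case described above.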

Open Hardware and the Lowered Cost of Specialization

The lab occasionally runs computations large enough to be uncomfortable: searches over case law corpora and analyses across millions of documents. A pattern for designing a small piece of custom hardware that does those tasks well is a real option for problems that do not fit on a laptop and do not deserve a cloud bill.

More than preservation, open hardware offers freedom of access and the power to control your own collections, your own work, and your own computing. That opens up questions we are only beginning to ask. As archival media like silica and DNA-based storage mature, could open hardware lower the cost or complexity of building the readers required for those formats? As AI model architectures keep shifting, could a reconfigurable board keep pace in a way fixed silicon can’t? FPGAs already take on inference tasks at places like CERN (see their FPGA Developers Forum), adapting to software advances with matching hardware; could FPGAs do the same for our tasks?

AI coding agents are reducing the cost of specialized work that used to require a dedicated team. Custom hardware has historically been among the most specialized. If a small library research lab can produce a working SystemVerilog port of a real architecture in a few weeks, the question of what else has quietly come into reach is suddenly much broader.

From Inherited Systems to Strategic Decisions / Information Technology and Libraries

The author examines the migration of Indiana University Libraries’ interlibrary loan platform, ILLiad, from a locally-hosted server to OCLC hosting through the perspective of a new department head inheriting this critical technology decision. He explores how staffing changes, lost institutional knowledge, recurring system instability, and limited technical capacity prompted a reassessment of long-standing local practices. The piece outlines research, consortium consultation, approval processes, implementation challenges, authentication and workflow issues, and post-migration tradeoffs. Ultimately, the author offers practical guidance for new leaders tasked with managing inherited systems, vendor relationships, imperfect information, and strategic change in complex academic library environments.

Locked Out of the Library / Information Technology and Libraries

While incarcerated students face many challenges when commencing higher education, a lack of access to the internet is a considerable barrier. This technological exclusion has implications for the delivery of course materials, most of which are offered only electronically. A project team from Curtin University Library sought to understand and address the challenges faced by incarcerated students in accessing library services, particularly ebooks and audiovisual content. It was found that restrictions related to contract terms, digital rights management, and copyright contribute to a reactive and uncertain situation for library services. This article outlines the state of the problem and offers possible pathways academic libraries can take to improve the state of information access for incarcerated students.

Improving Database Discovery and Understandability by Identifying and Reducing A–Z List Jargon / Information Technology and Libraries

Countless research questions arise when investigating connections between library resource discovery and student success. Existing literature explores best practices of database description language and style, the usability of database A–Z lists, and library resource jargon. Academic libraries continue to grapple with these challenges in resource discovery, even as online searching behavior evolves and new research tools emerge. A research team at the University of Arizona Libraries builds on the literature by examining these topics with a focus on the impact of a user’s academic discipline, university affiliation (faculty, staff, or student), and research experience on their understanding of database terminology, resource content and applications, and A–Z list type filters. The authors conducted an environmental scan of library websites along with several usability tests to identify and reduce library and disciplinary jargon on their A–Z list to make databases more understandable and approachable to all users. This article presents the results of these assessments as a case study for exploring external and internal factors that impact users’ understanding and discovery of databases.

Improving Accessibility of Electronic Course Reserve PDFs to Users with Disabilities at Hunter College Library / Information Technology and Libraries

By April 2027 and 2028, institutions covered by Title II of the Americans with Disabilities Act are expected to be legally required to ensure that digital content created or used at the institution is accessible as defined by Web Content Accessibility Guidelines (WCAG) 2.1 Level AA. The new law strongly emphasizes accessibility of course materials—including PDFs. This case study demonstrates how an R2 academic library staff can enhance the accessibility of PDF course materials by improving the accessibility of electronic reserves (e-reserves) PDFs at Hunter College Library (HCL).

Processes described here can be adapted by other libraries. Supporting campuses’ work to make course readings accessible may be a natural role for academic libraries. Locating or procuring the best quality version of a text available to the institution is a critical task for which libraries are optimally equipped. Furthermore, when readings are available only in print format, libraries can create higher-quality scans than those typically produced when the task is left to individual faculty members.

HCL began improving the accessibility of e-reserves PDFs in 2020. This article shares the knowledge acquired, established processes, limitations, and future directions. The workflow comprises checking each e-reserves reading. For those deemed poor, we locate an HCL collection or open access copy, purchase a digital copy, or remediate. Remediation involves optical character recognition (OCR), fixing errors therein, correcting reading order, removing repetitive headers and footers, and tagging. Literature the authors found on libraries proactively correcting OCR and tagging PDFs—that is, preceding a user’s request—was sparse, with the exceptions of the University of Toronto and the University of Michigan. Literature about proactively doing so for e-reserves was even narrower. This case study is intended to help fill the gap.

Generative AI Meets Cataloging Practice / Information Technology and Libraries

This study evaluates the performance of four generative AI models—ChatGPT, DeepSeek, Gemini, and Copilot—in generating descriptive metadata for bibliographic resources. Models were tested on a small, diverse set of resources using four prompt types: a basic prompt, a basic prompt with an example, a detailed prompt referencing Resource Description and Access (RDA) guidelines, and a detailed prompt with an example. Results show that both detailed RDA guidance and the inclusion of sample outputs improved metadata quality, particularly in formatting and field structure. While DeepSeek and ChatGPT showed better performance on the tasks, all models displayed limitations in parsing and following the prompts, using descriptive metadata fields, analyzing subject headings, and assigning URIs. These findings suggest that while generative AI holds potential to assist in metadata creation, its current capabilities fall short of meeting cataloging standards without human review.

Case Study of the Implementation of AI Primo Research Assistant (Beta Version) in Academic Libraries in Poland / Information Technology and Libraries

One of the generative artificial intelligence tools developed for use in libraries, including academic libraries, is the AI Primo Research Assistant. Of the 65 academic libraries in Poland, only 19 have access to software that supports this tool. In practice, only 9 libraries have implemented it (data from March 2025). For the purposes of this study, original research was conducted to assess the implementation status of the Primo Assistant in academic libraries in Poland. Two anonymous surveys were developed for this purpose and sent to libraries that had implemented the feature, as well as to those with the capability to run the Primo Assistant (i.e., the Primo VE Discovery admin role), in order to gather information on why they had chosen not to implement it. The analysis revealed several positive aspects, mainly a reduction in the workload of staff tasked with preparing publication lists on topics requested by library users. Some concerns were also raised by library employees, mainly regarding the reliability of the metadata provided and the accuracy of the recommended publications. The study also revealed a general lack of awareness and a need for further implementation. This paper presents the first scientific study focused on the implementation of the AI Primo Research Assistant in Polish academic libraries.

Enhancing Information Technology Governance at the University of Riau Library / Information Technology and Libraries

Effective information technology (IT) governance is essential for the University of Riau (UNRI) Library to achieve its research and educational objectives. This paper presents a qualitative pilot study investigating the library’s current IT governance processes, focusing on two COBIT 5 processes—DSS01 (Manage Operations) and DSS05 (Manage Security Services). These processes were selected in consultation with library and IT leadership due to their direct relevance to ensuring operational reliability and safeguarding the library’s information assets. COBIT 5 principles and capability models guide the assessment, emphasizing regulatory compliance, performance monitoring, and stakeholder collaboration. Using a detailed questionnaire and capability model, the study evaluates base practices and work products for DSS01 and DSS05. Results indicate varying proficiency levels, with DSS01 at level 0 and DSS05 at level 1, highlighting significant gaps between current and desired capability levels. Recommendations include implementing standard operating procedures, enhancing security measures, and optimizing resource management. In conclusion, the findings underscore the need for standardized processes, continuous monitoring, and alignment with established frameworks like COBIT 5. By addressing identified gaps and implementing recommended improvements, the UNRI Library can strengthen its IT governance, enhance operational efficiency, and better support its academic mission.

Access Reframed / Information Technology and Libraries

This study critically explores the transformative potential of human-computer interaction (HCI) in reimagining African public libraries as dynamic, user-centered, and culturally grounded spaces. Based on a literature review and comparative analysis of libraries across several African countries, the research investigates how HCI principles can enhance user engagement, usability, and inclusivity, particularly in multilingual, resource-constrained, and postcolonial contexts. The paper situates libraries as sociotechnical infrastructures that mediate between technology, local knowledge systems, and community needs, and argues for the importance of participatory and culturally responsive design approaches in library digitization efforts. The findings highlight significant gaps in current implementations of HCI within library services, including the lack of localized interfaces and limited user involvement in design processes. The study concludes by offering practical recommendations for integrating HCI into library development strategies and advocating for the co-creation of digital public spaces that reflect and empower Africa’s diverse knowledge ecologies. In doing so, the paper contributes to the growing discourse on decolonial approaches to technology and the future of public libraries in the digital age.

The Kids Are All Right / Dan Cohen

A banner that says "2026"

Writing has been light around here recently for a wonderful reason: our twins graduated from their respective colleges over the past month, and we have been in nearly nonstop revelry (and packing, and schlepping…). We are so fortunate to have two great kids; I’m super proud of them.

Speakers at our kids’ commencements, thankfully and remarkably, said little about artificial intelligence, but they did talk a lot about the complex circumstances and especially the psychology of this rising generation, and offered advice on how the graduating seniors should move forward in life given significant headwinds. I suppose it’s tempting to describe and analyze the troubles facing each graduating class, and provide sage guidance in response to the historical moment, but I’m not sure that my kids, their friends, and their generation overall are so very different from any other, or that any distinct advice is needed.

The Great Class of 2026 is, I’m afraid, just like every graduating class: happy and sad, confused and hopeful about the future, striving and procrastinating. Young adults, in other words. Sure, they seem to be impacted by new technology and our dreadful national politics and nerve-racking global challenges, but hasn’t it always been so? My college class graduated into a recession, the rise of the internet, the fall of the Berlin Wall, the chaotic end of the Soviet Union, and a messy war in the Middle East — all of these dominoes falling after a childhood in which we were fairly sure we would perish at any moment in a nuclear war. That was a lot to absorb! Back then, commencement speakers picked up on our anxiety, which had apparently morphed into excessive irony and a general lack of motivation, epitomized by the title and content of a Richard Linklater film: Slacker.

It may have taken some time, but we muddled through. So did the generation another turn of the clock back from ours (Vietnam, stagflation, etc.) and the generations before that (pick your World War and/or the Great Depression, etc.). History is, unfortunately, a procession of horrible developments, but also a showcase of astonishing resilience and creativity. Is it so Pollyannaish to simply say that Gen Z will also find a way forward, and frankly might be better off without pithy advice from the olds? Must we unconsciously mimic the opening of Woody Allen’s fictional commencement address, raising the graduating class’s blood pressure by declaring, “More than at any other time in history, mankind faces a crossroads. One path leads to despair and utter hopelessness. The other, to total extinction. Let us pray we have the wisdom to choose correctly”?

Instead, I saw hope in every joyful row of begowned seniors, students who, despite all of the radical changes and stressful tensions around them, had nevertheless maintained their curiosity and maybe even cultivated a passion during college. Students who found their special niche in music, writing, art, or science, who felt compelled to listen to it all, read it all, see it all, or experiment late into the night, regardless of the requirements of the classroom. I have a feeling that this kind of deep and abiding engagement, born not from careerism but from genuine profound interest, will serve these graduates well in the years ahead. As it always has.


Books I Have Not Written

The class-action lawsuit of authors against Anthropic and its subsequent settlement have helpfully informed me of the many, many other writers named Daniel Cohen, because the settlement administrators, in their quest to match authors and texts, have sent emails and letters asking if I am the Dan Cohen who wrote this or that book. There are too many volumes by The Daniel Cohens to list in full here, but as a public service to a handful of special fellow Dans, I hereby declare:

I am not the Daniel Cohen who wrote The Monsters of Star Trek, but I would wager 100 quatloos on Triskelion that I would greatly enjoy meeting that Dan Cohen.

I am #$%@# mad I am not the Daniel Cohen who penned Famous Curses, because my family is on a mission to bring back the useful exclamation “Gordon Bennett!”

I did not write Southern Fried Rat and Other Gruesome Tales, but, based on the delightful cover of this not-me Daniel Cohen book, I probably read it at camp the year it was published.

My final confession: The settlement administrators believe there is a Daniel Cohen who authored a book titled Final Confession, but, alas, I am not the one.

My conscience is now clear.


Tree 2 / Ed Summers

Tree 2

Same as Tree 1 but after playing with some filters on my Android phone before uploading.

Tree 1 / Ed Summers

Tree 1

A reflection of a tree in the Northwest Branch river, cropped and turned upside down.

Bookmarks - data, design, vis, book / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 When Bits Rot - with C McKean, L Talboom, A Page-Mitchell

English Edition: floppy disks, hard drives, CDs, DVDs, SSD drives - no matter what you choose to store your data on - ultimately they all decay. With my guests Callum McKean, Leontien Talboom and Adrian Page-Mitchell, we’re going to talk about what kinds of data we find on old drives, why we want to get them in the first place, and what can go wrong with the storage media. To all of you who love all things retro - we’ll be talking about floppy disks a bit.

🔖 series: learning-rust (10 parts)

  • Learn Rust Basics By Building a Brainfuck Interpreter
  • Learn Rust Ownership and Borrowing By Building Mini Grep
  • Build a JSON Parser in Rust from Scratch
  • Learn Error Handling in Rust By Building a TOML Config Parser
  • Learn Rust Generics and Traits By Building a Mini Blackjack Game
  • Learn Rust Lifetimes by Building a Generic LRU Cache
  • Learn Rust HashMap and Iterators by Building a Git Object Store Reader
  • Learn Rust Closures By Building a Tiny Rule-Based Linter
  • Learn Rust Smart Pointers and Interior Mutability by Building Git Commit Graph Viewer
  • Learn Rust Concurrency By Building a Thread Pool

🔖 Porting our Django backend to Rust improved the infra usage by 90%

We went from using 220 CPUs and 800 GB of ram to just 24 CPUs and 64 GB. Thus, way less money, less things to maintain.

The number of open DB connections at any point in time have improved quite a bit, from the thousands to hundreds (about ~3-5x reduction).

The good news is that we haven’t even added caching to the Rust backend yet, and query timings are already 5-10 times faster.

🔖 How to Build an Agentic RAG with RubyLLM and Rails

I run a RAG application for Italian pension and tax consultants. Users ask questions about INPS, professional pension funds, laws and regulations, and the app answers using a knowledge base of uploaded documents.

For a long time the app used the classic single-shot RAG pipeline: take the question, search the database, stuff the results into a system prompt, ask the model. It works, but it has a hard limit: the retrieval happens once, before the model has any chance to reason about the question. If the first search misses, the answer is bad and there is nothing the model can do about it.

So I rebuilt the pipeline as an agent. Now the model drives the retrieval itself: it decides what to search, reads the results, searches again with different terms, follows cross references between documents, and only then writes the answer. All in plain Ruby, with RubyLLM and Rails. No LangChain, no Python sidecar.

In this article I will show you exactly how it works, with the real code from my application. One note before we start: since the app serves Italian consultants, all the prompts, tool descriptions and user-facing strings are in Italian in the real codebase. I translated them to English here so you can follow along, but the structure is identical.

🔖 Zooming Out: Can We Integrate IIIF and Wikimedia?

Wikimedia and GLAM institutions share a challenge. How do we make cultural heritage collections accessible at scale without sacrificing quality, provenance, sustainability, or community control? The International Image Interoperability Framework, IIIF, is now used by thousands of institutions to serve high-resolution media through open standards. Wikimedia does not currently integrate IIIF in its core architecture. Should it?

🔖 5 things to know about the Eastern Silver Spring Communities Plan

Since 2023, Montgomery Planning staff have been working on the Eastern Silver Spring Communities Plan, drafting recommendations on zoning and land use, transportation, housing, parks and the environment, economic development and urban design. The plan is expected to set a vision for the area’s future development for decades to come. The plan is bordered by Colesville Road, University Boulevard and New Hampshire Avenue and will include three future Purple Line stations, the Piney Branch Road, Long Branch and Manchester Place

🔖 What Design Can’t Do

Design is broken. Young and not-so-young designers are becoming increasingly aware of this. Many feel impotent: they were told they had the tools to make the world a better place, but instead the world takes its toll on them. Beyond a haze of hype and bold claims lies a barren land of self-doubt and impostor syndrome. Although these ‘feels’ might be the Millennial norm, design culture reinforces them. In conferences we learn that “with great power comes great responsibility” but, when it comes to real-life clients, all they ask is to “make the logo bigger.”

🔖 Gemini 3 Solves Handwriting Recognition and it’s a Bitter Lesson

On our strictest tests, Gemini 3 achieved a CER of 1.67% and a WER of 4.42%. On these tests, any difference between the ground truth and test texts counts as an error. WER is thus almost always a bit more than double the CER because if a single character in a word is wrong, including leading or trailing punctuation like commas, single quotes vs double quotes, etc, the whole word is marked as an error. On this measure, Gemini 3 performs nearly 50% better than the best, fine-tuned specialized models and achieved performance comparable to an early career, professional human typist.

🔖 FacilMap

FacilMap is a privacy-friendly, open-source versatile online map that combines different services based on OpenStreetMap. FacilMap offers the following features:

  • Show different map styles, for example maps optimized for driving, cycling, hiking or showing the topography or public transportation networks.
  • Search for places
  • Show amenities and POIs
  • Calculate a route, optionally showing the elevation profile.
  • Find out what is at a particular point on the map
  • Open geographic files, for example GPX, KML or GeoJSON files
  • Show your location on the map
  • Share a link to a particular view of the map.
  • Add FacilMap as an app to your device.
  • Change the language settings in the user preferences.
  • FacilMap is privacy-friendly and does not track you

🔖 Vector Search, Visualised

SQL makes sense. But when it breaks, you reach for EXPLAIN. Vector search offers no such comfort. Multi-thousand-dimension embeddings, approximate nearest-neighbour indexes, and quantisation tradeoffs make it hard to know what your system is doing, and harder still to diagnose when results quietly degrade. Through interactive visualisations, Simon Hearne shows what embeddings look like in high-dimensional space, what quantisation does to your recall, and how to catch retrieval failures before your agents do. You’ll leave with a sharper mental model and a diagnostic toolkit for the production problems hardest to see.

🔖 Using the Screen Capture API to record a browser window

Once again I am reminded that modern web tech is amazing, and web browsers are incredibly capable. There’s a Screen Capture API to record the screen. You can select a tab, a window, or the entire screen. The feature has limited browser support so I don’t think I’d use it in a big web app, but it’s fine for a one-off screen recording. (I wonder how browser-based video conference apps like Google Meet do screen sharing? Do they use this API, or do they use something with wider support?)

TASCAM recovery / Ed Summers

TL;DR if you have a TASCAM 788 backup and don’t know how to get the audio out of it this script might help. Also: AI tools work best when paired with expertise.


I needed to take a very personal excursion into digital preservation recently as I attempted to listen to some audio recordings my brother John had made about 20 years ago. John died recently, and is sorely missed by his friends and family.

John was a continuous source of inspiration for me, because of his many varied interests and projects. One thing he did consistently since he was a teenager was perform music as a singer-songwriter.

As my family and I went through the very difficult process of emptying his apartment, we discovered a set of recordings he had made on CD-R. Three of these CDs were clearly conceived of as albums, and easily mounted as CDDA when I popped them in my CD player.

However he also left a binder of CD-Rs, where each CD was neatly labeled with a song title and a year. All in all there are 108 of them, from the 2003-2008 time period. There is a lot of material on these CDs that is not present on the three albums. However, when I popped these in my CD player all I saw was a macOS error dialog box saying:

The disk you attached was not readable by this computer.

John’s binder of CD-Rs

At first I thought they might be damaged or corrupted. But it seemed unlikely that so many of them would be. After some asking around I got pointed to two excellent guides to working with CDs:

These guides were great, and did help me extract the raw data from the CD-R with cdrdao, but ultimately I was unable to determine what format the data was in using tools like file, Siegfried and DROID.

In a fit of desperation I spent some time in Claude Code trying to see if it could help me identify what format the data was in. Despite several forays, it kept going round in circles, burning tokens.

One of those forays led me on a wild goose chase installing an old version of macOS in order to see if an old version of Retrospect might be able to read the CDs (it didn’t).

During this time I got some excellent advice over in the Fediverse at digipres.club. One of those messages was from Ross Spencer who took a look at a sample raw CD image. He was able to spot some markers that pointed to it possibly being a backup from a TASCAM DAW, specifically a TASCAM 788 (I believe Ross was using either strings or a hex editor to look for these clues).

TASCAM 788

Unfortunately, after poking around in various user forums, I discovered that there were not really any tools for working with TASCAM 788 backups. Everyone seemed to be recommending the purchase of a TASCAM 788 and its CD Burner, since the data was in a proprietary format, and there were no emulators.

Before dropping some money on Ebay I decided to roll the dice with Claude Code again, but this time with the more specific guidance that this was likely a TASCAM 788 backup, and asking about options for recovery. If you are interested you can read the transcript for this session. The key part of the back and forth for me was:

The 2488 stores audio as raw 16-bit or 24-bit PCM at 44.1kHz in a proprietary block structure. Once you identify the byte offset where audio data starts, you can use Audacity’s “Import Raw Data” with 24-bit signed big-endian PCM, 44.1kHz, to listen and verify.

I prompted it to try to identify the offset, so I could attempt the import in Audacity. It did some work writing Python snippets and executing them for a few minutes, and then output a likely offset. The first time I read it in I only heard white noise. But after twiddling some of the import options in Audacity I saw some promising waveforms appear in the Audacity display. And when I pressed play ✨✨✨✨ instead of white noise I heard John’s guitar and voice!

Audacity screenshot of imported raw data

What appeared to be a single track turned out to be multiple tracks created with the TASCAM, that were joined together. The final segment was the completed mix.

I continued to work with Claude on a program that would identify the offset in the raw CD data, then extract a WAV file, and then extract the separate tracks, as well as the complete track. It did this by looking for gaps inside the audio. I put the program here:

https://github.com/edsu/jas-discs/blob/main/extract_tracks.py
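The gap-finding idea described above can be sketched roughly as follows. This is not the code from extract_tracks.py; it is a minimal illustration that assumes the audio has already been decoded into a list of signed integer samples. The function name, the threshold, and the minimum gap length are all hypothetical values chosen for the example.

```python
def split_on_silence(samples, threshold=100, min_gap=4):
    """Return (start, end) index pairs (end exclusive) for non-silent segments.

    A run of at least `min_gap` consecutive samples whose absolute value is
    at or below `threshold` is treated as a gap between tracks.
    Both defaults are illustrative assumptions, not values from the real script.
    """
    segments = []
    start = None  # index where the current segment began, if any
    quiet = 0     # length of the current run of quiet samples
    for i, s in enumerate(samples):
        if abs(s) <= threshold:
            quiet += 1
            # Close the open segment once the quiet run is long enough.
            if start is not None and quiet == min_gap:
                segments.append((start, i - min_gap + 1))
                start = None
        else:
            if start is None:
                start = i
            quiet = 0
    if start is not None:
        # Drop trailing quiet samples that never reached min_gap.
        segments.append((start, len(samples) - quiet))
    return segments


audio = [0, 0, 900, -800, 700, 0, 0, 0, 0, 0, 500, -600, 0]
print(split_on_silence(audio))  # [(2, 5), (10, 12)]
```

On real recordings the threshold and gap length would need tuning by ear, since room noise and tape hiss rarely sit at exactly zero.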

Here is the guitar / vocal first track (there are a few seconds of silence at the beginning):

And here is the mix including percussion and keyboards:

These recordings are Copyright John Summers CC-BY-NC

I have since been able to find John’s TASCAM 788 at my brother Matt’s house–although it doesn’t have the SCSI external CD burner anymore. So there’s no way to read the CDs with it.

These CDs and songs are important enough to me that I want to see if the actual hardware can do a better job of preserving John’s work. So I’ve got a bid on one of the external CD-Recorder devices I found on Ebay.

John clearly spent a lot of time and care taking a snapshot of these songs he used to perform in coffee shops around Bucks County Pennsylvania. I plan to release some of them on his Bandcamp, with some of his artworks as album covers. I want to share them with people who knew him, and put these songs out into the world in a way that respects his memory and creative work, while also being something that he just wasn’t focused on as an artist. For John it was the creative process itself that mattered most.

None of this will bring John back of course. He’s gone now, and at peace. But he will always be remembered by those who loved him. Look for more posts here after I’ve been able to extract these songs in total.

Why are cached input tokens cheaper with AI services? / Xe Iaso

When you see AI model pricing pages, you usually see things broken down like this:

Model | Context Length | Max CoT Tokens | Max Output Tokens | Input Price (Cache Hit) | Input Price (Cache Miss) | Output Price
deepseek-chat | 64K | - | 8K | $0.07 / 1M tokens | $0.27 / 1M tokens | $1.10 / 1M tokens
deepseek-reasoner | 64K | 32K | 8K | $0.14 / 1M tokens | $0.55 / 1M tokens | $2.19 / 1M tokens

Source: DeepSeek API Docs

If you manage to have most of your input tokens be cached, you save a huge amount, in this case $0.20 per million tokens. What does this mean though? What does caching do that makes you save so much, in some cases upwards of tens of kilodollars?
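To make the arithmetic concrete, here is a small sketch using the deepseek-chat input prices from the table above. The token volume and the 90% cache-hit fraction are made-up numbers for illustration, not figures from any real bill.

```python
HIT_PRICE = 0.07   # USD per 1M input tokens on a cache hit (from the table)
MISS_PRICE = 0.27  # USD per 1M input tokens on a cache miss (from the table)


def input_cost(total_tokens, cached_fraction):
    """Input-token cost in USD for a given share of cache hits."""
    cached = total_tokens * cached_fraction
    uncached = total_tokens - cached
    return (cached * HIT_PRICE + uncached * MISS_PRICE) / 1_000_000


tokens = 50_000_000_000  # hypothetical: 50 billion input tokens

no_cache = input_cost(tokens, 0.0)
mostly_cached = input_cost(tokens, 0.9)  # hypothetical 90% hit rate

print(f"no caching:  ${no_cache:,.2f}")
print(f"90% cached:  ${mostly_cached:,.2f}")
print(f"difference:  ${no_cache - mostly_cached:,.2f}")
```

Under those assumptions the difference comes to roughly $9,000, which is how a $0.20 per million tokens discount turns into kilodollars at high volume.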

Someone explain the cached vs not thing to me for how this is $10,000 worth of savings lol

Chimney Sweepers Local 420 FKA yburyug (@bobbby.online), June 12, 2026 at 12:39 AM

Warning

I'm gonna be totally honest, I barely understand the basic outline of the math involved here. Where possible I aim to not be completely wrong here, but I'm not going to emit something 1:1 accurate with the mathematical truth of large language models' inner workings. Bear with me.

When you call a large language model service, the API request looks like the following:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    }
  ]
}'

That messages element is the key bit. Every time you accumulate messages from the initial system prompt, initial user request, AI responses and any tool use requests/responses, you add to that array and make it grow bigger and bigger.

A good way to think about this is that sending a conversation to a large language model is like having a pair of people share a roll of paper on two different typewriters. Every time you finish your message, you send the roll of paper back to the AI model and it has to re-read through the entire conversation in order to start typing on the end with its response. As the conversation gets longer, this gets more and more expensive because the model has to recalculate its internal state all over again for every additional message.

However, large language model inference is complicated but deterministic. Given the same input tokens, the model always computes the same internal state. This means that you can use a technique called key-value caching (KV caching) in order to save that intermediate state and use it for next time. Most of the time this cache is a prefix cache because that allows you to just add on more messages to the end of the request pretty easily and be fine.

Imagine something like this:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {
      "role": "user",
      "content": "why is the sky blue?"
    },
    {
      "role": "assistant",
      "content": "The sky is blue because of a phenomenon..."
    },
    {
      "role": "user",
      "content": "But I am looking outside right now and it is orange!"
    }
  ]
}'

If the model has already processed the question about the sky being blue and generated the response about Rayleigh scattering, it doesn't need to process both of those messages again to answer the user's question about sunsets. In production AI model deployments you would put that generated intermediate state into the KV cache so that the model doesn't need to run twice for the same data. This saves time and effort on the side of the AI model provider, and currently model providers decide to pass that savings onto API users in the form of cheaper inference costs for cached lookups.
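The prefix-matching idea can be illustrated with a toy model. This is only a sketch: real inference servers cache per-token key/value tensors rather than whole messages, and the class and method names here are invented for the example. It just counts how much of a conversation could be reused versus how much has to be processed fresh.

```python
class ToyPrefixCache:
    """Toy stand-in for a prefix cache, keyed on whole messages."""

    def __init__(self):
        # Maps a tuple of messages (a conversation prefix) to a placeholder
        # for the model state that would have been saved at that point.
        self._states = {}

    def process(self, messages):
        """Return (reused, computed) message counts for this request."""
        messages = tuple(messages)
        # Find the longest prefix of this conversation seen before.
        reused = 0
        for n in range(len(messages), 0, -1):
            if messages[:n] in self._states:
                reused = n
                break
        # "Process" the rest, saving state after each new message.
        for n in range(reused + 1, len(messages) + 1):
            self._states[messages[:n]] = f"state after {n} messages"
        return reused, len(messages) - reused


cache = ToyPrefixCache()
turn1 = ["user: why is the sky blue?"]
turn2 = turn1 + ["assistant: The sky is blue because...",
                 "user: But it is orange right now!"]
edited = ["user: why is the sky green?"] + turn2[1:]

print(cache.process(turn1))   # (0, 1): nothing cached yet
print(cache.process(turn2))   # (1, 2): first message reused
print(cache.process(edited))  # (0, 3): changed prefix, nothing reused
```

The last call shows why editing an earlier message is costly in this model: once the start of the conversation changes, none of the saved prefixes match.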

As you develop an application with AI in it, try to avoid changing any inference settings or previous messages between prompts. This makes your application's queries much more likely to read from the cache, making it faster, reducing the environmental impact, and saving you(r users) money.

Reading the room: What global library leadership conversations teach us / HangingTogether

This is the first installment of a three-part series on global library leadership engagement, contributed by Ellen Hartman, OCLC Leaders Council Manager. We’re grateful to Ellen for sharing her perspectives on this topic.

Proof we engaged face to face

At a recent gathering of the OCLC Leaders Council, something happened that I always hope for but never take for granted. Connections were being made, there was laughter, sidebar conversations over lunch and dinner, and a willingness to challenge each other’s ideas, honesty about what people were struggling with, and genuine curiosity about what others are doing. All of this was built on a foundation of trust that made these in-depth conversations possible.

These moments don’t happen automatically. In my experience, they take time and, often, the opportunity to meet in person. Meeting online can be very efficient, but it can feel rushed and impersonal; it’s hard to truly get to know each other through a screen. Being in the room together over the course of a few days, in a small enough group that you actually get to speak to everyone, creates a solid foundation for future opportunities to meet again, online or in person, to build on the connections, themes, and conversations that started there.

What made this gathering particularly significant was its global dimension. Library leaders do come together regularly, but often within their own region, or among peers from the same library type. Academic and public library leaders, for instance, don’t always get the opportunity to meet for in-depth conversation, even though there is much they can learn from each other. Conversations organized by library type or region have real value, of course, but there is something additional that comes with a broader perspective that is still rooted in the library ecosystem while extending beyond your usual network. Every perspective in the room adds something, regardless of what an institution has or hasn’t yet achieved. The value of these conversations comes from the range of experiences present.

OCLC Research has published work on building relationships across unit boundaries within institutions (social interoperability), as well as creating and sustaining successful multi-institutional collaborative partnerships. But what I’d like to talk about here is more fundamental—the foundation for building successful partnerships: global library leaders from a variety of backgrounds and experiences engaging with one another in the same room. Prompted by the recent Leaders Council meeting, here are some reflections on the practical realities of these conversations, intended to deepen understanding and maximize their effectiveness.

Same words, different realities

Across international leadership spaces, a remarkably consistent vocabulary tends to surface. Terms recur across sessions, regions, and formats, and their repetition signals that we are all on the same page: a reassurance that participants are engaged with the same broad challenges and moving in a broadly similar direction.

The problem is that shared language doesn’t necessarily mean a shared understanding or a shared reality. One of the things that becomes apparent, watching these conversations unfold, is how often the same word lands differently depending on who is in the room.

Take efficiency, a term that surfaces regularly in conversations about how libraries operate and plan strategically. In some contexts, efficiency encompasses decisions about workforce size and structure. In others, those decisions are shaped by employment frameworks that lead to a very different kind of conversation, shifting the focus instead toward technology, software, or finding different ways of working within existing structures. The word is the same. The need it describes, and the range of solutions available, are not. This is why you need a deeper understanding of each other’s context to find out where you are using the same words but aren’t speaking the same language.

Glimpses, not full pictures

Even with that understanding in place, international leadership conversations can only ever offer glimpses of each other’s reality rather than the full picture. You see enough of someone else’s context to recognize the challenge, but rarely enough to understand all the constraints behind it.

This matters because those constraints are often what make the difference. Take something many library leaders struggle with: making the case for their library’s value to the broader institution or community they serve (for more on this topic, see OCLC Research’s latest report!). Some leaders have, through long-term effort and considerable perseverance, managed to position the library as visibly central to their institution’s priorities and a key part of its success. For others, making that same case remains difficult. The reasons could be structural or personal: the physical or organizational distance between the library and the part of the institution that makes key decisions, the data available to demonstrate the library’s impact, or the library leader’s own position, voice, and access to the right conversations at the right time.

In international settings, what tends to surface is the success story. What is harder to showcase is the full path to that success. The years of lobbying, the hundreds of stakeholder conversations, the incremental steps that made this outcome possible. A leader who has achieved that recognition may share what they did in good faith and genuinely want to help others reach the same goal. But because the conditions that made their success possible are often invisible in how the story gets told, it can be hard for them to understand why the same challenge feels insurmountable to a peer.

The value outside of the program

International leadership meetings are often evaluated by what happens in the formal program. But some of the most valuable exchanges happen elsewhere. Recognizing that is part of understanding how these spaces work in practice.

In smaller gatherings, it’s the time outside the formal agenda where a lot of the magic happens. When a group of library leaders meet for the first time, they are still in the process of getting to know one another. This is why you can’t expect them to immediately share their biggest challenges or most acute pain points. There is a measure of trust building that happens as a gathering takes place, especially over multiple days. It’s often after the official program ends, and there is room for leaders to relax and reflect together (for example, during dinner or at the bar) that the more personal and complex topics get discussed.

That kind of conversation requires enough prior exchanges that people feel safe being a little vulnerable. Admitting that your library is struggling to secure its position, or that you haven’t found a way to make your value proposition tangible enough to institutional leadership or other stakeholders that control funding, is not something most people are willing to do in a room full of peers they’ve just met. It becomes possible when the group has had time to become something more than a collection of strangers.

This is one of the reasons smaller, sustained gatherings tend to produce a different quality of exchange than large conferences. It is also why the informal spaces within those gatherings deserve to be nurtured rather than left entirely to chance.

No neat resolutions needed

One expectation worth setting aside is that international leadership conversations should resolve into clear conclusions. They rarely do, and that is not a failure.

Conversations like these do not need to end in consensus or a neat step-by-step path forward. It’s often the process of sharing and reflecting on both differences and commonalities that provides the greatest benefit. It might be an idea you hear and want to incorporate in your own library. A perspective that’s truly new to you and makes you see a topic in a different way. Or simply the opportunity to take a subject that was discussed at surface level and deepen the conversation in future gatherings.

That is why continued engagement matters more than resolution. Understanding accumulates across multiple conversations, multiple gatherings, and sometimes multiple years. It cannot be compressed into a single meeting, however well designed. The friction and the moments of genuine surprise are part of the value. Smoothing those moments away or rushing toward consensus risks losing exactly what makes international exchange worthwhile.

Conclusion

International leadership spaces are often judged by the ideas they surface or the alignment they appear to produce. But their deeper value lies in the glimpses they offer into realities that are different from our own. Those glimpses don’t tell the full story of what other library leaders are experiencing, but taken together, they help form a better understanding of what experiences are out there.

When designed well and when opportunities for informal interactions are cultivated, global library leadership spaces create the conditions for the kinds of conversations that go deepest. Those conversations rarely happen on the agenda, but rather emerge when enough trust has been built that people are willing to be open and candid with one another. That is not something that happens automatically: it requires continued investment in bringing people together, and repeated exposure to each other’s contexts, experiences, and points of view over time. Trust is not built overnight.

The next post in this series takes a closer look at what global engagement actually involves beyond the conversation itself and why showing up, in every sense of the phrase, costs more for some than others.

The post Reading the room: What global library leadership conversations teach us appeared first on Hanging Together.

Inside Out / Ed Summers

I have a problem with RSS. Not RSS itself, RSS is great!

The problem is that I subscribe to more feeds than I can possibly read, so the unread count in FreshRSS climbs faster than I can bring it down. Some days I skim titles, declare bankruptcy, and mark everything as read. Other days I let it pile up and feel guilty.

I’ve tried using newer tools like Current, which was definitely an improvement but still didn’t quite do it. My friend Dan has been working on a new RSS tool that works a bit like a personal newspaper; it seems like it could be extremely helpful, and I’m keeping my eye on it. But meanwhile the list of unread posts grows…

Now, I’ve been very reluctant and slow to introduce LLMs into my daily work. But even from under my rock, in a cave, down by the river, I’ve heard that LLMs are good at text summarization.

I thought maybe, just maybe, I could try using one to summarize my unread posts? It seemed like a good fit for an experiment since the impact of getting things wrong is basically zero (in theory).

I wanted to try routing my unread RSS posts through an LLM to get a daily digest. From under my rock I’d also heard about Model Context Protocol (MCP), and how it is going to change everything. So I thought it would be a good exercise in seeing how that works in practice with a tool like Claude Code. I’d use Claude Code’s MCP support to connect directly to FreshRSS and ask Claude to summarize what I’d missed. Yeah, that’s the ticket.

This is the Way?

The first thing I tried was ChrisLAS’s freshrss-mcp server, which wraps the FreshRSS GReader API and exposes it as a set of MCP tools. The idea is that you drop it into your Claude configuration and Claude can then call those tools to fetch and read your articles.

I gave it a try, and it worked! But the results were… mixed. Claude would usually fetch articles. But then it would produce a lot of diagnostic chatter alongside the actual summary: narrating its own tool calls, noting what it was about to do, explaining why it was skipping certain things, asking for permission for this and that.

And more frustratingly, it would sometimes take strange detours: executing inline Python code, and Unix tools to do things it could have done by calling the MCP tools more directly, wandering into unnecessary computation. The experience felt noisy and unpredictable, and (frankly) just a bit scary.

I started by creating some “skills” and some scripts for those skills thinking it would make things a bit more deterministic. It kinda did?

I thought maybe my problem was that the skills weren’t bundled together, so I built my own plugin: freshrss-claude. This version bundled the MCP server as a Claude Code plugin with a set of “skills”, the structured prompts to guide Claude through fetching and summarizing in a more controlled way.

It seemed better? Not needing to start the MCP server was definitely better. But ultimately it wasn’t as big an improvement as I’d hoped for. Claude still exhibited strange behaviors: writing and executing Python scripts unnecessarily, going off-script in ways that were hard to anticipate. The summaries themselves were fine when they arrived, but the path to getting them there was erratic and unpredictable.

The last straw for me was the idea of running this Rube Goldberg machine from a cron job to generate the summary for me automatically. To run it unattended I needed to grant it all kinds of permissions to ensure it ran through. This scared the shit out of me, given that I was giving it permission to run arbitrary Python programs, reach out to the web, and interact with the filesystem. Running it once or twice manually was ok. But sticking it in my crontab and forgetting about it? Forget about it. I experimented briefly with putting things in a Docker container, and Claude Cowork’s sandboxing, but then…

Turning it inside out

I stepped back and rethought the problem. The thing I’d been trying to do, have an LLM orchestrate a set of tools to accomplish a task, is one (seemingly popular) way to use an LLM. But it turns out to be kinda demented. You’re asking the model to plan, to sequence, to decide. You are asking it to be An Agent. Sure models can do this, but they are not reliable in the way a simple program is. They wander. They improvise. They sometimes decide to take a detour. Do I really benefit from this runtime model in this little RSS digest app? Nah, not really.

So the alternative, and this is the inversion that made things click for me, is to write a deterministic program that calls the LLM as a component, rather than letting the LLM drive the program as an Agent. My code fetches the articles. My code shapes the prompt. My code writes the output to a file. The LLM does exactly one thing: it reads the content I hand it and produces a summary.
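A minimal sketch of that inversion might look like the following. It is not the code from any real tool: the article data is hard-coded, the function names are invented, and the LLM is passed in as a plain callable so a fake one can stand in for a real model client.

```python
import tempfile
from pathlib import Path
from typing import Callable


def fetch_articles() -> list[dict]:
    """Stand-in for the feed-reader API call; returns hard-coded samples."""
    return [
        {"title": "Post one", "text": "First article body."},
        {"title": "Post two", "text": "Second article body."},
    ]


def build_prompt(articles: list[dict]) -> str:
    """Deterministically shape the prompt from the fetched articles."""
    parts = ["Summarize the following articles as a short digest.", ""]
    for a in articles:
        parts.append(f"## {a['title']}\n{a['text']}\n")
    return "\n".join(parts)


def make_digest(llm: Callable[[str], str], out_path: Path) -> str:
    """Our code drives every step; the LLM is called exactly once."""
    articles = fetch_articles()
    prompt = build_prompt(articles)
    digest = llm(prompt)  # the only probabilistic step
    out_path.write_text(digest, encoding="utf-8")
    return digest


def fake_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs without network access."""
    return f"Digest of {prompt.count('## ')} articles."


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "digest.md"
    result = make_digest(fake_llm, out)
    saved = out.read_text(encoding="utf-8")

print(result)  # Digest of 2 articles.
```

Because the model is just a function argument, the fetching, prompt shaping, and file writing can all be tested without ever calling a model, and no step needs permission to run arbitrary code.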

Take Two (or Three, or Four?)

I threw it all on the fire and started over by writing rss-digest instead. Well, truth be told, Claude and I wrote it. Ok, ok, mostly Claude.

It’s a small Python CLI that connects to any GReader API-compatible RSS reader (FreshRSS, Miniflux, Tiny Tiny RSS, The Old Reader), fetches your recent unread articles, and asks an LLM to produce a digest. Because it uses LiteLLM under the hood, you can point it at any compatible model: OpenAI, a local model running in LM Studio, whatever you prefer.

The output is a Markdown file (or HTML with --html). I have a cron job run it in the morning and drop a file on my desktop for me to read. Here’s an example of what it looks like.

For smaller batches (≤25 articles) it gives you a structured list. For larger ones it produces a curated prose summary grouped by theme. You can pass a custom system prompt file if you want to tune the style or grouping. You can pass --mark-read if you want it to mark everything as read afterward.

The tool is on PyPI and the code is on GitHub. I’ve just started using it, so it quite possibly has problems. The prompt that is used for doing the summarization is configurable. If you have a different take on the prompt or want to extend it, please send me a pull request so I can add it as an alternative.

So…

What I keep coming back to is the design lesson underneath all of this.

There’s real value in being thoughtful about which part of your system is deterministic and which part is probabilistic. There’s no doubt that LLMs are magical things, but an LLM is not a reliable program. It shouldn’t always be the thing making decisions about what to fetch, when to stop, or how to structure output. Hand it a well-formed input, ask it a clear question, and (hopefully) it will return something useful. Everything else (the plumbing, the sequencing, the file I/O) stays in your code, which you can look at, test, and run directly.

I’m not saying all programs using LLMs need to take this approach. I’m just saying maybe you don’t need MCP, Agentic AI, etc, etc all the time. Experiment with it, but don’t forget to turn it inside out when you need to.

Library of Congress Storage Architecture Meeting 2026 / David Rosenthal

Once again I attended most of the Library of Congress's Designing Storage Architectures workshop remotely. I apologize for the delay in posting this; domestic duties have kept me very busy recently. Below the fold are notes on the talks that caught my attention, based on my now somewhat faded memory and the slide decks for the talks from the Library of Congress website.

Data Storage Trends

As usual, IBM's Georg Lauhoff provided an invaluable overview of the storage industry as of late 2025, co-authored with Sassan Shahidi. They make an important point that I have been making since at least 2018's Archival Media: Not a Good Business:
Challenges of Alternative Archival Technologies
• Alternative archival technologies face technical and economic hurdles.
This justifies their focus on flash, hard disk and tape. Their "exabytes shipped" graph shows that indeed Hard Disk Unexpectedly Not Dead; the dramatic decline in HDD's share since 2008 reversed in 2024.
The key metric for technological progress in traditional storage media is areal density:
  • Lauhoff and Shahidi's graph shows that tape, which has the easiest path because of the relatively large size of the bits, has continued its steady growth, although one could argue both that their 24% annual growth exaggerates the period since 2017, and that INSIC's projection of 28% is optimistic.
  • It is clear that HDD areal density progress slowed dramatically about 2010 to around 11% per year. But the developments Jon Trantham reported, see the next section, could lead to a significant acceleration in HDD areal density.
  • Flash has continued a steady 30% per year growth since about 2010, thanks to stacking cells vertically and storing multiple bits in them. Both of these have limits, into which the industry will eventually run.
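
To put those annual growth rates on a common footing, a small worked calculation may help: at a steady compound rate r, density doubles every ln 2 / ln(1 + r) years. This is only a sketch using the rates quoted above; the function name `doubling_time` is made up for the example.

```python
import math


def doubling_time(annual_growth: float) -> float:
    """Years needed to double at a steady compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Rates as quoted in the bullets above.
rates = [
    ("HDD since about 2010", 0.11),
    ("tape (Lauhoff and Shahidi)", 0.24),
    ("tape (INSIC projection)", 0.28),
    ("flash since about 2010", 0.30),
]
for label, rate in rates:
    print(f"{label}: doubles every {doubling_time(rate):.1f} years")
```

At 11% per year HDD areal density takes roughly 6.6 years to double, against roughly 2.6 years for flash at 30%.
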
As regards the relative cost per TB of the three media, the big picture is that since around 2010 change has been very gradual. Tape and flash have both become cheaper relative to HDD, but the rate of change has been much lower than predicted.

Lauhoff and Shahidi conclude that:
  • Tape Storage: continues to evolve.
  • HDD: improvements slow down but recently high demand.
  • NAND: well-suited for hot storage but not for archival purposes.
  • Lack of Alternatives: Within the foreseeable future (within 10 years), there are no viable alternatives to Tape, HDD, and NAND storage.
  • AI leads to storage demands across the tiers
This last point was a theme for the entire meeting. But it is important to note that the meeting was too early to capture the full impact of AI on the cost and availability of media and systems.

Mass Capacity Storage in an AI Era

Jon Trantham of Seagate reported that after more than a quarter-century of work and 14 years after HAMR was demonstrated in the lab, Seagate has finally been shipping HAMR drives in quantity since early 2025.
He also announced that they have started to ship their 40TB HAMR drives. Their roadmap to 100TB/drive presents some significant challenges, as shown in Trantham's slide. The history of HAMR shows that Seagate can surmount major technical challenges, but it may take longer than they project.
One of Trantham's slides vividly illustrated the technology challenges the HDD industry faces, showing to scale the evolution since 1997 of the sizes of the bits on the media, the reader, and the writer. Note the 1610-fold decrease in the area of the writer, the 305-fold decrease in the area of the bit, and the 289-fold decrease in the area of the reader.

Flash for Archival Storage

Fifteen years ago, Ethan Miller, Ian Adams and I published Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. It was inspired by work at Carnegie-Mellon from 2009, FAWN: a fast array of wimpy nodes, which argued that implementing fast storage using large numbers of small nodes built from cell-phone technology could save two orders of magnitude in energy per query. We argued that it would be possible to build low cost, low energy archival storage systems using a similar approach.

Our idea was ignored, but at this meeting Ethan Miller revived the idea of using flash as an archival medium. He argues for a rack-scale system storing 500PB/rack built from 5U shelves, similar to Backblaze's, each holding 216 of Pure Storage's 300TB DFMs (direct flash modules) stacked vertically.
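
As a sanity check on those figures, here is a back-of-the-envelope calculation. It assumes decimal units (1PB = 1000TB) and raw capacity with no erasure-coding overhead; the shelf count is my inference from the numbers, not something stated in the talk.

```python
import math

DFM_TB = 300          # capacity of one direct flash module, from the text
DFMS_PER_SHELF = 216  # modules per 5U shelf, from the text
SHELF_HEIGHT_U = 5
TARGET_PB = 500       # stated rack capacity

shelf_pb = DFM_TB * DFMS_PER_SHELF / 1000  # raw PB per shelf
shelves = math.ceil(TARGET_PB / shelf_pb)  # shelves needed to reach the target
rack_pb = shelves * shelf_pb               # raw PB with that many shelves
rack_units = shelves * SHELF_HEIGHT_U      # rack space consumed

print(f"{shelf_pb:.1f} PB/shelf, {shelves} shelves, "
      f"{rack_pb:.1f} PB in {rack_units}U")
```

Under these assumptions each shelf holds about 64.8PB, so eight shelves (40U) would give a little over 500PB.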

There are three big challenges:
  • First, if all the DFMs were actively I/O-ing the rack would draw 45kW. Supplying the rack with that much power and cooling it would be very difficult (see the design of Nvidia's racks). But, just as with Facebook's hard disk cold storage, this can be mitigated by scheduling accesses so that only a small proportion of the drives are active.
  • Second, flash cells gradually leak electrons, so must be regularly refreshed by reading and re-writing them. This task must be scheduled along with the application's reads and writes, but doing so is fairly easy since the refresh timing isn't critical.
  • Third, flash is more expensive per TB than hard disk or tape. As I have argued for a long time, in the archival storage market the time value of money makes it difficult to justify trading increased capex for decreased opex:
    • The opex savings are significant, with essentially no mechanical failures, more benign failure modes, and much higher bandwidth for erasure code recovery.
    • Miller argues that the capex isn't as bad as the cost of the media makes it look, because at 0.5EB/rack there are savings in space, power and cooling. He doesn't point out that the lower latency for read access potentially allows for the elimination of an entire warm layer of the storage hierarchy.
    But he acknowledges that AI is driving up the media cost. This is probably only relative to tape, since hard drive prices are also skyrocketing.
Miller argues that, over time, flash costs will come down. The scope for further shrinkage of the cells, and the addition of more layers, is limited. Once those limits are reached, the fabs that manufacture flash will gradually fall behind the leading edge and become depreciated.

Although I'm naturally biassed, I think Miller's case for archival flash is worth a detailed investigation.

Avoiding the Pitfalls of Cloud Storage for AI Applications

Fourteen years ago in Cloud vs. Local Storage Costs and More on Glacier Pricing I started writing about the way the complex and somewhat opaque pricing models of cloud storage platforms made it difficult to estimate how much you would end up paying. People are just now figuring out that AI has the same problem. Neither is an accident; these pricing models serve two goals important for the platform's business model. First, the purchase decision is based on the "Low, Low" advertised price. Second, once you discover how much more you're actually paying, you face the lock-in created by egress fees. In 2019's Cloud for Preservation I wrote about how egress charges implement vendor lock-in.
David Boland of Wasabi presented a current analysis of this issue. He reports that about half of all the organizations they surveyed exceeded their budget for public cloud storage.
The budget overruns arose because the actual spend was about double the sticker price for the storage. The culprit was fees, which by design are much harder to project.

Digital Storage Architectures for AI and ML

Will Cavin of Amazon had two important items of news:

Announcing AI for Libraries – a weekly newsletter / Artefacto

AI is one of those generational tech topics that isn’t going away soon. But the signal to noise (or hype to reality) ratio can be truly overwhelming.  There are just so many links, opinions, new resources that are getting lost in the mix. And that’s for us information and tech nerds – we can only [...]


2026-06-09: Teaching Database Concepts for Senior Undergraduate and Graduate Students at ODU / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

In the Spring 2026 semester at Old Dominion University (ODU), I taught CS 450 (Undergrad) / CS 550 (Graduate): Database Concepts. The course was fully online, with synchronous live Zoom sessions held twice a week. Attendance was not mandatory but strongly encouraged. All lectures were recorded and made available for students to access whenever needed.


Figure 1: Canvas course page for CS 450/550: Database Concepts


Through this blog post, I want to share my experience of teaching a senior-level undergraduate/graduate course for the first time, the behind-the-scenes realities of course preparation through to the end of the course, and how student feedback actively shaped the course as it progressed. 


Since the course had been taught previously by other instructors, materials were already available, which made things easier. Rather than building everything from scratch, I started by copying over the existing course structure and then carefully updating it to align with the current semester. The more time-consuming part was setting everything up, cleaning up the Canvas course, especially updating deadlines and revising the syllabus, while ensuring the topics were properly aligned with assignment deadlines. If you are instructing for the first time, it is very important to make sure you get access to the course in time, so you can set everything up without a rush.

Throughout the semester, to make the most of class time, I spent a couple of hours before each session preparing: reviewing material, planning examples, and thinking through how topics would connect. I tried to debug issues during the class in real time whenever possible. If something took longer than expected, I pushed it to the end of class or moved it to office hours. This helped me maintain the flow of the topic without interruptions.


I experienced firsthand that handling a class of 50 students without a teaching assistant (TA) was, honestly, a lot more work than I expected. Grading labs, homework, quizzes, and discussions while also preparing for lectures and responding to emails required a constant balance. I wasn’t always perfect, but I made a steady effort to stay on top of it. Grades were returned as quickly as I could manage, and emails were typically answered within 24 hours, often sooner. Again, it reinforced something I had already noticed as a student: timeliness matters. Things do not have to be instant, but when there is a clear effort to respond and follow through, it builds trust and keeps students engaged.


One of the first challenges I faced as an instructor of this course involved managing classroom dynamics. After a few classes, a student shared a concern that some well-intentioned peer engagement (jumping in to answer questions or adding explanations during lecture) was making the lecture hard to follow. It was a fair concern, and an important one. At the same time, I didn’t want to discourage participation. Active engagement is something every instructor hopes for, and it was clear that students were eager to contribute. My challenge was to find the right balance. I responded by acknowledging the concern and assuring the student that I would make adjustments so that participation remained helpful rather than overwhelming. Before taking action, I also reached out to a mentor for advice, which helped me approach the situation more thoughtfully. I thanked students for being engaged and willing to contribute, but also clarified expectations: participation was welcome, but lectures and question answering would be primarily instructor-led, with designated moments for peer discussion. I also reflected on something I had noticed during the class introductions: students were coming from a wide range of backgrounds. Some had prior experience with databases, while others were encountering these concepts for the first time. Because of that, maintaining a consistent pace and structure was important. I believe that framing it this way helped convey to the students that my goal was not to limit participation but to support a better learning environment for all. No further concerns were raised afterwards, and the students remained engaged while being supportive of the entire class.

   

Midway through the semester, I conducted an anonymous check-in survey to better understand how students were experiencing the course. To encourage participation, I offered a small amount of extra credit, which resulted in a strong response rate.


Figure 2: Screenshot of the CS 450/550 mid-semester student check-in survey page 


Overall, the feedback was encouraging: most students agreed or strongly agreed that assignments were clear, the workload was manageable, and the pace was appropriate (Figure 3). But what mattered more were the written responses. They highlighted patterns that helped me see the course from the students’ perspective (full set of responses).


A few consistent concerns stood out:

  1. Some students said they weren’t always sure what to prepare before class or whether a session would lean more toward lecture or lab. That feedback pushed me to be more specific in my announcements, clearly laying out what each class would cover.

  2. Several students pointed out that while their answers were marked incorrect or partially correct, the reasoning behind it wasn’t always clear. This was a fair point, and a difficult balance when grading at scale. Still, I made a more conscious effort to leave clearer comments.  

  3. Even when students understood the concepts, many struggled to translate them into SQL queries or ER diagrams. That reinforced something I kept coming back to: the need for more in-class examples and live coding, which I continued to prioritize.

  4. Interestingly, a lot of students said the challenge wasn’t the material itself, but managing their time. A few students shared situations where missing a single assignment significantly impacted their grade. This feedback later influenced my decision to allow requests for reopening missed work.


At the same time, there were plenty of positive notes that helped confirm what was working:

  1. Students consistently appreciated the clarity of explanations and examples.

  2. The labs and live coding sessions were frequently mentioned as highlights.

  3. Many felt the course structure was organized and manageable.   

  4. Some even described it as one of the best online courses they had taken.


I also asked students a simple question: what’s one thing I should keep doing, and one thing I could do better? Here are some of the responses that stood out:


“The instructor is great. Instructions are clear, vibes are good, I would recommend this class. The homework is work intensive but not unreasonable.”


“You have been doing a great job and this has been one of the best online courses I have taken at ODU”


“very good at explaining things, even when the students dont seem to get something she fines a new way of explaining it so they get it.”


“The instructor is accommodating to students within reason and I believe that is something they should keep doing.”


“keep being a great teacher :)”


Figure 3: Summary of student responses to four questions: assignments, workload, grading, and pace

At the mid-semester point, once the grades were up to date, I started reaching out to students who had missing work or were falling behind. The intention wasn’t to penalize them, but to give them an opportunity to catch up. At the same time, I made a point of recognizing those who were consistently performing well and, to maintain fairness, gave all students the same opportunity to request to catch up on any missed assignments. Many students responded well to that nudge.


One practice I intentionally carried forward from my own experience as a student was leaving comments on graded work, not just when points were deducted, but also to acknowledge strong submissions. It is a small effort on my end, but it helps students feel seen and motivates them to keep improving. As students, those were the moments we looked forward to, knowing the instructor noticed good work.


As the semester came to an end, the focus shifted to final evaluations, especially grading the course projects and submitting final grades to the university. One thing I did not fully anticipate during this phase was the time needed to carefully evaluate student projects. Each submission reflected a significant amount of effort, and I wanted to give them the attention they deserved. As a result, grading ran later than I had initially expected, although it was still well within the official deadline. 


Teaching this course taught me some important things. Good teaching is not about getting everything perfect; it’s a way to strengthen your own knowledge while sharing that knowledge in a way others can truly grasp. It is also about being responsive, thinking about what’s working and what isn’t, and being willing to adjust along the way. Managing a full class without a TA was basically a one-person band situation (except I was the entire percussion section, keeping tempo, fixing the rhythm mid-performance, and still trying not to miss a beat while everyone else expected a flawless show). But throughout the semester, I focused on doing the best I could and continuously improving based on student input. Overall, this experience was incredibly rewarding and reaffirmed my plan to pursue a career in academia.


Acknowledgements


I sincerely thank my advisors, Dr. Michele C. Weigle, Dr. Michael L. Nelson, and Associate Professor & Assistant Chair of the Department of Computer Science, Dr. Steven J. Zeil for providing me with this invaluable opportunity to gain teaching experience as a PhD student. I am also grateful to my advisors and my colleague Dr. Bhanuka Mahanama, for always being available to answer questions. Special thanks to Dr. Santosh Nukavarapu for his mentorship throughout the semester and Syed R. Rizvi for providing the course slides. Credit for establishing and continuously refining this structure should go to the instructors who have taught the course over the years, including but not limited to Drs. Irwin Levinstein, Jian Wu, Vikas Ashok, Syed Rizvi, and Santosh Nukavarapu.


And finally, a very special thank you to my husband, Skanda Siva, for being endlessly flexible with his schedule and for his constant support, and to Yara Siva, who may not know it yet but was my tiniest companion through it all.

~ Himarsha Jayanetti (HimarshaJ)

"No way to prevent this" say users of only language where this regularly happens / Xe Iaso

In the hours following the release of CVE-2026-45447 for the project OpenSSL, site reliability workers and systems administrators scrambled to desperately rebuild and patch all their systems to fix a heap use-after-free in PKCS7_verify(). This is due to the affected components being written in C, the only programming language where these vulnerabilities regularly happen. "This was a terrible tragedy, but sometimes these things just happen and there's nothing anyone can do to stop them," said programmer Prof. Fabian Greenholt, echoing statements expressed by hundreds of thousands of programmers who use the only language where 90% of the world's memory safety vulnerabilities have occurred in the last 50 years, and whose projects are 20 times more likely to have security vulnerabilities. "It's a shame, but what can we do? There really isn't anything we can do to prevent memory safety vulnerabilities from happening if the programmer doesn't want to write their code in a robust manner." At press time, users of the only programming language in the world where these vulnerabilities regularly happen once or twice per quarter for the last eight years were referring to themselves and their situation as "helpless."

Giving your Go apps Tigris superpowers / Xe Iaso

Tigris is S3-compatible, which means you can point the AWS SDK at it and most things just work. The catch is that the Tigris-exclusive features—bucket forking, snapshots, object renaming, and the like—need verbose workarounds because the AWS SDK doesn't know they exist.

So we wrote a Go SDK that does. It comes in two flavors: the storage package is a drop-in replacement for the standard S3 client with first-class methods for the Tigris-specific operations, and simplestorage is a higher-level client for the common single-bucket case that infers its configuration from the environment so you stop passing the same parameters over and over. You can adopt the Tigris features incrementally without refactoring your existing S3 code, and the simpler API still works against other S3-compatible providers.

I wrote up how it works and why we built it over on the Tigris blog.

WARCbench: A Swiss Army Knife for WARC Processing / Harvard Library Innovation Lab

The Perma team is excited to announce WARCbench, an open-source tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.

WARCbench builds on over a decade of experience gained from developing Perma.cc. Over that time, we’ve accumulated a collection of scripts, utilities, debugging workflows, and one-off experiments for dealing with web archives. WARCbench brings together those processes into a simple command-line tool that helps web archivists make sense of the wild, occasionally malformed, and deeply heterogeneous web archives that they encounter in practice.

WARCbench was designed to make as few assumptions as possible about your familiarity with web archives, the kind of WARC you are working with, or what you want to do with it. It is intentionally a command-line tool. You can use it to explore and work with WARC files even without deep prior knowledge of the format, though it does assume you’re comfortable using a terminal and open to a bit of experimentation. The goal is not to hide the complexity of web archives. It is to make that complexity easier to inspect, manipulate, and learn from so you can experiment and iterate.

While many existing WARC tools are optimized for specific production workflows, the exploratory, in-the-moment WARC wrangling and debugging work archivists and developers often need to do benefits from different design choices. Sometimes you need to inspect a malformed or misbehaving WARC. Sometimes you need hooks and custom callbacks for an experiment. Sometimes you need to optimize for speed, memory, or convenience. Sometimes you just need to look and see what is there before deciding what to do next. WARCbench was designed for those moments.

We don’t know all the ways researchers or web archivists might use WARCbench, but we hope it becomes a versatile Swiss Army knife that others will find valuable to keep in their toolkit too.

Links

Slide Deck from IIPC Web Archiving Conference Presentation on April 21

 

Thanks and acknowledgments

We would like to thank our colleagues Chris Setzer and Ben Steinberg for their help and support in developing this tool.

WARCbench logo by Jacob Rhoades.

Welcome Ann McCranie, Director of Research Insights / HangingTogether

We’re delighted to welcome Ann McCranie, PhD, who joins OCLC on June 8 as our new Director of Research Insights.

Ann joins OCLC at an important moment as OCLC Research advances Research Reimagined, a strategic effort to strengthen the relevance, visibility, and impact of our research for library leaders and their institutions. In her role, Ann will help connect research priorities to practical insights that support decision‑making across a rapidly changing library and higher education landscape. She will lead a team of research scientists and engineers focused on advancing the Research Reimagined strategy.

Portrait photo of Ann McCranie

Ann brings more than a decade of experience leading research programs in higher education, with expertise in mixed methods research, research operations, and research communication. Most recently, she held senior leadership roles at Indiana University.

Her work has focused on building durable research services, guiding cross-functional teams, and helping researchers and administrators navigate change. Throughout her career, she has paired rigorous analysis with practical application.

Ann also brings a perspective shaped by close collaboration with researchers, research administrators, and campus leaders beyond the library. That experience informs how she thinks about the evolving roles of libraries as institutions respond to changes in technology, AI-informed scholarly workflows, and research infrastructure. This perspective will be especially valuable as OCLC Research continues exploring future-focused questions facing libraries and higher education.

To help introduce Ann to the community, we asked her a few informal questions.

What drew you to this role at OCLC?

What attracted me was the opportunity to help connect research to the decisions library leaders are making today, while also contributing to longer-term thinking about where libraries are headed. In higher education, I’ve worked with researchers, administrators, and institutional leaders who rely on strong evidence and practical insights to guide strategy, services, and priorities.

I was especially excited by the chance to bring that experience to OCLC and support work that can have both immediate and lasting value for libraries. Research is most meaningful to me when it helps people navigate change, make informed decisions, and think differently about what comes next.

How do you think about “research insights”?

I tend to think about research insights through the lens of impact. I once asked a doctor about a medical test, and she explained that she would not order it because the result would not change the treatment plan. At first, I was a little disappointed because I was genuinely curious, but that idea stayed with me.

It became a useful way for me to think about research. I’m always asking whether the work can help inform decisions, shape action, or open up new possibilities. If the findings don’t create an opportunity to do something differently, it’s worth asking how we can make the research more purposeful and useful.

What are you most looking forward to as you get started?

I’m really looking forward to getting to know my team and connecting with colleagues across OCLC to understand their work, priorities, and how Research Insights can support them.

As a social networks scholar, I’ve always been interested in the connections between people and how relationships help ideas spread and grow. So much innovation comes from those informal networks, whether that is among coworkers, library partners, or the broader community. I’m excited to learn from those connections and help build on the momentum already underway with Research Reimagined.

Ann will be attending ALA at the end of June, and we look forward to introducing her to many of you there. Until then, please join us in welcoming her to OCLC Research.

The post Welcome Ann McCranie, Director of Research Insights appeared first on Hanging Together.

Systems Life: Lost (and Found) in Log Data / Library | Ruth Kitchin Tillman

This post is part of a series in which I write about experiences or specific challenges from my day-to-day work. I’m hoping that these will be interesting for other librarians that work in entirely different areas, for my colleagues who are solving different problems on different systems (or maybe eventually the same one after we migrate), and for those who are thinking about doing this kind of work in the future.

Building from navigating the distributed database, I want to get more deeply into what cross-system problem solving can look like. To re-set the stage (but for more details about these tools, check the previous post), transaction history of items is only available for most users via our Analytics tool.

Transaction Histories

Transaction history represents the ways an item has traveled: checkouts, but also transits and receipts. This is one of the many transactions created while my request for the Alien: Romulus DVD was filled. In this transaction, a coworker at York (I’ve redacted any details, but the user ID is in the actual log) sets the item to transit for reason “HOLD” to “UP-PAT”:

| Trans Hist Datetime | Trans Hist Workstation | Trans Hist Command Desc | Trans Hist Data Code Desc | Trans Hist Data Value |
| --- | --- | --- | --- | --- |
| 2025-08-27 12:10:48 | 0173 | Transit Item | call number | POPULAR |
| 2025-08-27 12:10:48 | 0173 | Transit Item | copy number | 1 |
| 2025-08-27 12:10:48 | 0173 | Transit Item | item ID | 000080622957 |
| 2025-08-27 12:10:48 | 0173 | Transit Item | Max length of transaction response | 3000000 |
| 2025-08-27 12:10:48 | 0173 | Transit Item | station library | UP-PAT |
| 2025-08-27 12:10:48 | 0173 | Transit Item | station login clearance | NONE |
| 2025-08-27 12:10:48 | 0173 | Transit Item | station login user access | REDACTED |
| 2025-08-27 12:10:48 | 0173 | Transit Item | station user’s user ID | REDACTED |
| 2025-08-27 12:10:48 | 0173 | Transit Item | transit from | UP-PAT |
| 2025-08-27 12:10:48 | 0173 | Transit Item | transit reason | HOLD |
| 2025-08-27 12:10:48 | 0173 | Transit Item | transit to | UP-ANNEX |

This is the Analytics export, which I transformed from a CSV into a table for readability in this post.

Unfortunately, even though the underlying Symphony database has unique item keys for records, Analytics seems to use the barcode as the primary key of an item table, not just the primary way to find an item record. An item’s transaction history is completely wiped from Analytics if someone changes the barcode. And sometimes, barcodes change. In our case, we change barcodes on everything that’s permanently shifted to the annex (see my post on macros). We also have barcodes wear out or fall off. So we have hundreds of thousands of items whose histories were lost, at least from Analytics.

These lost records came to a head when our Collection Maintenance team needed to be able to track large sets of items being moved to the Annex. Once the items arrived, their barcodes would be replaced with an Annex barcode, which serves a different function. So one could follow a set of barcodes on their journey until “poof,” every record related to them vanished. On the one hand, one could assume the item had been processed by the Annex since it had now disappeared. But it made tracking uneven and meant Collection Maintenance couldn’t tell what route an item had taken to get there or how long it’d taken.

First, I’ll note that our systems work is also quite distributed. While I was working with our collections maintenance data expert on getting access to older data, the Symphony admins were configuring item extended information to include an original barcode field, which is now populated when a barcode updates. They’ve also done some work hunting down barcode changes to update the original barcode fields. These will be exportable, even though they won’t be searchable the same way in our Analytics. Systems takes a village.

Where the Data Still Lives

Getting back to the problem-solving, this data can still be found through the oldest method of ILS data access: Workflows reports.

By running a Scan History Logs report against a set of barcodes, we can export every log in which that barcode shows up. This data wasn’t nearly as easy to use as an Analytics or Data Control export. It’s exported in a text file and uses opaque datacodes.1 Here are two example log entries from a barcode change (the actual user’s ID has been replaced with REDACTED):

2/4/2025,10:23:04 Station: 0265 Request: Sequence #: 59 Command: Edit Item Part B
$<datacode_FF>:REDACTED  $<datacode_FE>:UP-ANNEX  $<datacode_Fc>:NONE  $<datacode_FW>:REDACTED  $<datacode_NQ>:000009387393  $<datacode_IQ>:DD3.M825M66 1976  $<datacode_NX>:A2  $<datacode_NY>:2046  $<datacode_0A>:0902  $<datacode_0B>:015  $<datacode_IN>:CATO-PARK  $<datacode_NR>:20460902015  0y:Y  $<datacode_Fv>:3000000  
 
2/4/2025,10:23:04 Station: 0265 Request: Sequence #: 71 Command: Edit Item Part B
$<datacode_FF>:REDACTED  $<datacode_FE>:UP-ANNEX  $<datacode_Fc>:NONE  $<datacode_FW>:REDACTED  $<datacode_NQ>:20460902015  $<datacode_IQ>:DD3.M825M66 1976  $<datacode_Io>:USERID  $<datacode_Fv>:3000000

That top entry is really important because, even though there are other ways of accessing a permanent item ID, it’s not in the logs. So by scanning for that original barcode, we can get the entry where the original barcode is in NQ and the new barcode is in NR.

I wrote a Python script that processed entire log entries, since the colleague from Collection Maintenance wasn’t just looking for old/new barcodes but for the transaction histories of that item. He could apply a date range to the log export itself, so he could set it just to export the last few months. I’m not going to share the entire script here, but this is the overall approach that I used:

current = re.search(r'\<datacode_NQ\>:(.+?) ',line)

I wrote a conditional function for all the entries which might not be present. For the handful whose data might contain a space, I wrote the search to break at the $<datacode that begins the next entry and trimmed space off the right side.
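As a rough illustration of that approach (a minimal sketch, not the actual script): the helper names `field` and `parse_entry` are hypothetical, and the mapping of datacodes to readable names is an assumption inferred from the example entry above and the condensed output. Fields that are absent are skipped, ordinary values stop at the next whitespace, and values that may contain a space (here, only the call number) run to the next `$<datacode` and are trimmed on the right.

```python
import json
import re

# Header line of an entry, e.g.
# "2/4/2025,10:23:04 Station: 0265 Request: Sequence #: 59 Command: Edit Item Part B"
HEADER = re.compile(
    r"(?P<date>\d{1,2}/\d{1,2}/\d{4}),(?P<time>\d{1,2}:\d{2}:\d{2})\s+"
    r"Station:\s*\S+\s+Request:\s*Sequence #:\s*(?P<sequence>\d+)\s+"
    r"Command:\s*(?P<command>.+)"
)

# ASSUMPTION: datacode -> (readable name, value may contain spaces),
# inferred from the example log entry; not an official Symphony mapping.
FIELDS = {
    "FF": ("station_user_login", False),
    "FE": ("station_library", False),
    "FW": ("station_user_ID", False),
    "NQ": ("current_barcode", False),
    "NR": ("new_barcode", False),
    "IQ": ("call", True),
}


def field(code, line, spaces=False):
    """Return the value stored under a datacode, or None if it is absent."""
    # Ordinary values end at the next whitespace. Values that may contain
    # spaces run to the next "$<datacode" (or end of line), then get trimmed.
    end = r"(?=\$<datacode_|$)" if spaces else r"(?=\s|$)"
    match = re.search(r"\$<datacode_" + re.escape(code) + r">:(.+?)" + end, line)
    return match.group(1).rstrip() if match else None


def parse_entry(header_line, data_line):
    """Turn one two-line log entry into a flat dict of the fields we care about."""
    entry = {}
    header = HEADER.match(header_line)
    if header:
        for key in ("date", "time", "sequence", "command"):
            entry[key] = header.group(key).strip()
    for code, (name, spaces) in FIELDS.items():
        value = field(code, data_line, spaces)
        if value is not None:  # skip datacodes missing from this entry
            entry[name] = value
    return entry


EXAMPLE_HEADER = (
    "2/4/2025,10:23:04 Station: 0265 Request: Sequence #: 59 "
    "Command: Edit Item Part B"
)
EXAMPLE_DATA = (
    "$<datacode_FF>:REDACTED  $<datacode_FE>:UP-ANNEX  $<datacode_Fc>:NONE  "
    "$<datacode_FW>:REDACTED  $<datacode_NQ>:000009387393  "
    "$<datacode_IQ>:DD3.M825M66 1976  $<datacode_NX>:A2  $<datacode_NY>:2046  "
    "$<datacode_0A>:0902  $<datacode_0B>:015  $<datacode_IN>:CATO-PARK  "
    "$<datacode_NR>:20460902015  0y:Y  $<datacode_Fv>:3000000  "
)

print(json.dumps(parse_entry(EXAMPLE_HEADER, EXAMPLE_DATA), indent=4))
```

Keeping the datacode mapping in one dictionary means a new field can be added without touching the parsing logic, though a real log export would need extra code to pair header and data lines.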

The output is a very large JSON object, which I’ve condensed below to reflect the key fields from this transaction. Even with its size, it’s a lot more compact and efficient than the Analytics output shared above and thus might be easier to process, so this script may end up being useful in other contexts.

    {
        "date": "2/4/2025",
        "time": "10:23:04",
        "sequence": "59",
        "command": "Edit Item Part B",
        "station_user_login": "REDACTED",
        "station_library": "UP-ANNEX",
        "station_user_ID": "REDACTED",
        "current_barcode": "000009387393",
        "new_barcode": "20460902015",
        "call": "DD3.M825M66 1976"
    },

Going forward (to migration) this new process should meet the use cases of:

  1. connecting old and new barcodes in collection maintenance logs,
  2. tracking item histories that had been dropped from Analytics, and
  3. creating a friendly JSON object vs. the same entry spread across a dozen lines or more of a CSV, making it of potential use for reporting on new barcodes as well.

So in sum:

  1. Sysadmins created a new field for original barcodes and set it to populate when barcodes are changed.
  2. Sysadmins began hunting through logs to find barcode changes and wrote a script to populate them in the database for export/reference.
  3. I created a way to extract JSON objects for item transaction history out of log reports run on old barcodes since those transactions were no longer accessible in Analytics.

  1. There are also ways to export a formatted log which is human readable, but those logs are much harder to turn into data structures. ↩︎

Weekly Bookmarks / Ed Summers

These are some things I’ve wandered across on the web this week.

🔖 Graph database-ball! Exploring the Game with the graph capabilities of LadybugDB, DuckDB and PostgreSQL

This article presents a comparison of graph capabilities in three different databases: DuckDB (v1.4.4 with duckpgq), LadybugDB (0.16.1), and PostgreSQL (19devel). We will load a large volume of records (5,635,972 rows of baseball data covering people, parks, team records, and game play-by-plays) into each database, define the entities and relationships, and write a variety of queries that take full advantage of the graph structure.

🔖 Ambient Church

Ambient Church transforms architecturally stunning spaces into immersive audio-visual environments. Our events feature pioneering artists presenting vibrant works in a context that elevates both the music and the space.

Founded in Brooklyn in 2016, we facilitate collective peak experiences through the soundscapes of modern contemplative music. With an emphasis on education and environment, we seek to illuminate an underacknowledged lineage of sonic exploration.

🔖 Language models transmit behavioural traits through hidden signals in data

Large language models (LLMs) are increasingly used to generate data to train improved models1,2,3, but it remains unclear what properties are transmitted in this model distillation4,5. Here we show that distillation can lead to subliminal learning—the transmission of behavioural traits through semantically unrelated data. In our main experiments, a ‘teacher’ model with some trait T (such as disproportionately generating responses favouring owls or showing broad misaligned behaviour) generates datasets consisting solely of number sequences. Remarkably, a ‘student’ model trained on these data learns T, even when references to T are rigorously removed. More realistically, we observe the same effect when the teacher generates math reasoning traces or code. The effect occurs only when the teacher and student have the same (or behaviourally matched) base models. To help explain this, we prove a theoretical result showing that subliminal learning arises in neural networks under broad conditions and demonstrate it in a simple multilayer perceptron (MLP) classifier. As artificial intelligence systems are increasingly trained on the outputs of one another, they may inherit properties not visible in the data. Safety evaluations may therefore need to examine not just behaviour, but the origins of models and training data and the processes used to create them.

🔖 The Archive in Art Art in the Archive

In this essay we will attempt to look at both the archive of art as well as the archive as art. When we draw a distinction between those materials that we treat as documents with a ‘factual’ historical significance (those which offer themselves in the service of scholarship), and the uses which artists make of the archive as one of the media of expression that intersect with their documentary value, we ask ourselves: which theories about the archive’s nature and function are applicable to Syrian art? What are the roles adopted by ‘the document’ and ‘the archivist’? To what extent do these roles alternate and intersect?

🔖 Cave of Forgotten Dreams

Cave of Forgotten Dreams is a 2010 3D documentary film by Werner Herzog about the Chauvet Cave in Southern France, which contains some of the oldest human-painted images yet discovered—some of them were crafted around 32,000 years ago. It consists of footage from inside the cave, as well as of the nearby Pont d’Arc natural bridge, alongside interviews with various scientists and historians. The film premiered on 13 September 2010 at the Toronto International Film Festival.

🔖 Starbucks ditches AI inventory system after just 9 months

Starbucks is saying goodbye to its artificial intelligence inventory management system about nine months after its debut, Reuters reported Thursday. The tool, which used computer vision to track some parts of the chain’s inventory, was announced in September as a method to simplify inventory record-keeping and prevent stockouts.

🔖 FediRoster

FediRoster is a slightly more heavyweight alternative to David Adler’s Sociologists on Mastodon software. It is intended to function as a public list of Mastodon and other fediverse accounts, geared primarily towards academic communities, but suitable for others as well. It offers functions for following listed accounts individually or in bulk. The main novelty here is that you can add yourself to the list through an authentication process instead of all the work falling on a list maintainer. You can sign in through your Mastodon account or send a message to the list’s bot to verify your account ownership. This also means that the hosting process for new lists is a bit more involved (it’s a Python/WSGI application).

🔖 Rust for Python Programmers: Complete Training Guide

A comprehensive guide to learning Rust for developers with Python experience. This guide covers everything from basic syntax to advanced patterns, focusing on the conceptual shifts required when moving from a dynamically-typed, garbage-collected language to a statically-typed systems language with compile-time memory safety.

🔖 CS336: Language Modeling from Scratch

Language models serve as the cornerstone of modern natural language processing (NLP) applications and open up a new paradigm of having a single general purpose system address a range of downstream tasks. As the field of artificial intelligence (AI), machine learning (ML), and NLP continues to grow, possessing a deep understanding of language models becomes essential for scientists and engineers alike. This course is designed to provide students with a comprehensive understanding of language models by walking them through the entire process of developing their own. Drawing inspiration from operating systems courses that create an entire operating system from scratch, we will lead students through every aspect of language model creation, including data collection and cleaning for pre-training, transformer model construction, model training, and evaluation before deployment.

🔖 wasteback-machine

Wasteback Machine is a JavaScript library for analysing archived web pages, measuring their size and composition to enable retrospective, quantitative web research.

🔖 No, Artificial Intelligence Is Not Conscious

The primary difference between deepfake photos and LLM conversations is that the people who generate the former are deliberately trying to fool others, and many of the people who elicit the latter from LLMs have inadvertently fooled themselves.

🔖 A Visual Guide to Gemma 4 12B

The removal of the encoders, which are typically in charge of making sense of the multimodal inputs, places the burden of making sense of all outputs on the LLM. Although the model is encoder-free, all modalities are now unified within the LLM. Instead of the model having to wait for the encoders to finish processing the audio and image inputs, the LLM can get started earlier processing the input and generating output!

In this guide, I want to showcase what it took to remove the vision and audio encoders and replace them with something much faster. The result, a 12B model that can handle audio and image inputs but without the need for encoders.

🔖 LiteRT-LM Python API

The Python API of LiteRT-LM for Linux, macOS and Windows. Features like multi-modality, tools use, and GPU and NPU acceleration are supported.

🔖 LiteRT-LM

LiteRT-LM is Google’s production-ready, high-performance, open-source inference framework for deploying Large Language Models on edge devices

🔖 Google AI Edge Gallery

AI Edge Gallery is the premier destination for running the world’s most powerful open-source Large Language Models (LLMs) on your mobile device. Experience high-performance Generative AI directly on your hardware—fully offline, private, and lightning-fast.

🔖 Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Today, we are introducing Gemma 4 12B, our latest model designed to bring agentic multimodal intelligence directly to laptops. Bridging the gap between our edge-friendly E4B and our more advanced 26B Mixture of Experts (MoE), Gemma 4 12B packages powerful capabilities inside a reduced memory footprint. It is also our first mid-sized model to feature native audio inputs

🔖 Solid State Book

Solid State Books is a full-service, Black-owned general interest bookstore with a great selection of fiction & non-fiction titles. We stock literary gifts, stationery, greeting cards & puzzles for all ages. We have a carpeted, playful children’s books area in both stores for kids & parents alike to spread out & read together. Come by for weekly children’s story hours, catch monthly book groups, author readings/signings, local interest panels, political conversations & more!

🔖 minisearch

MiniSearch is a tiny but powerful in-memory fulltext search engine written in JavaScript. It is respectful of resources, and it can comfortably run both in Node and in the browser.

🔖 Documents the Department of Justice tried to disappear.

In May 2026, the Justice Department began systematically removing material from its web sites regarding the many indictments and convictions related to the Jan. 6 attack on the U.S. Capitol. This archive reconstructs the vast bulk of those thousands of deleted records.

🔖 The Justice Department Erases History; Lawfare Restores It

Last week, the Justice Department began systematically removing material from its web sites regarding the many indictments and convictions related to the Jan. 6 attack on the U.S. Capitol.

The operation started without fanfare or formal announcement and proceeded largely unnoticed. Until, that is, journalists such as the Washington Post’s Meryl Kornfield took notice of certain press releases and other materials that had conspicuously disappeared from www.justice.gov.

“The Trump admin is quietly deleting info about the Capitol attack from the DOJ website as it prepares to give funds to J6ers,” Kornfield posted. “This week, DOJ deleted a press release about one man with an ongoing child solicitation case who came to the Capitol with bear spray.”

Then, with typical bombast, the Justice Department responded by taking issue with one particular aspect of Kornfield’s characterization. “Nothing ‘quiet’ about it,” the DOJ Rapid Response account replied. “We are proud to reverse the DOJ’s weaponization under the Biden administration. We will do everything in our power to make whole those who were persecuted for political purposes. This includes stripping DOJ’s website of partisan propaganda.”

We are not erasing history quietly, the Justice Department seemed to suggest. We are erasing history loudly and proudly.

At Lawfare, we have restored the vast bulk of what was deleted. We have also started to preemptively archive a raft of material that has not yet been deleted but probably will be, given its thematic relationship to the material that was 86ed.

🔖 Data Center Policy Database

Data centers are the physical facilities that power cloud services, AI systems, streaming, and nearly every digital platform people use each day. As demand for artificial intelligence accelerates, data centers are becoming major sources of electricity demand and local infrastructure pressure, which means their growth affects energy systems, communities, and long-term public planning.

🔖 The State of Data Center Policy in the United States

The regulatory landscape for data centers in the United States has shifted dramatically in recent years from a period of aggressive economic incentives to a phase of intense scrutiny, restriction, and community-led resistance. To track these legislative changes, the DIGS Lab at the University of Virginia reviewed more than 700 federal, state, and local policies related to data centers. The data center policy database aims to bring transparency around zoning, permitting, and regulating data centers and their impacts on communities. This is what we found.

🔖 R.E.M. Live - 1981-11-07 Viceroy Park, Charlotte, North Carolina

Another early R.E.M. set, from the same state but a different city as the previous show. Pretty much the same library of songs, but this one’s the superior show to get - it sounds slightly nicer and doesn’t have the equipment failures of the previous show. There’s already a source on here, but that’s a different master of the same recording.

🔖 itsjunetime / tdf

A terminal-based PDF viewer.

Designed to be performant, very responsive, and work well with even very large PDFs. Built with ratatui.

🔖 ratatui

Ratatui (ˌræ.təˈtu.i) is a Rust crate for cooking up terminal user interfaces (TUIs). It provides a simple and flexible way to create text-based user interfaces in the terminal, which can be used for command-line applications, dashboards, and other interactive console programs.

IPv6 zones in URLs are a mistake / Xe Iaso

IPv6 is weird. One of the more strange parts of the standard is that every interface's link local addresses are in fe80::whatever. If you have a machine with two network interfaces, both of them will be in fe80::, so if you have a packet destined to fe80::4, how do you disambiguate it?

The answer is you use IPv6 scopes/zones. The exact format of what goes into a zone is OS dependent, but on Linux it's the interface name and on Windows it's the interface ID. This lets the kernel's routing table know how to handle an address range conflict.

On my tower, this would be represented like this:

fe80::4%eth0
        

Where eth0 is the name of my tower's ethernet device.

When you create a host:port bindhost, you normally separate the hostname and port with a colon. IPv6 uses colons to separate hex groups. In order to disambiguate what's the host and what's the port, you typically format the IPv6 address in square brackets, so fe80::4 on port 80 would look like this:

[fe80::4]:80
        

And with the right scope it looks like this:

[fe80::4%eth0]:80
        

Now let's get URL encoding into the mix. From high orbit, you can imagine a URL's format as being something like this:

<scheme>:[//][<username>[:<password>]@][<hostname>][:<port>][/<path>][?<query>][#<fragment>]
        

An IPv6 zone would then be part of the hostname, just like with that fe80::4 port 80 example from earlier. So you'd think the URL would be something like this:

http://[fe80::4%eth0]:80
        

But if you try to parse this as a URL in Go, you get an error:

package main

import "net/url"

func main() {
	if _, err := url.Parse("http://[fe80::4%eth0]:80"); err != nil {
		panic(err)
	}
}

Yields:

panic: parse "http://[fe80::4%eth0]:80": invalid URL escape "%et"
        

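This happens because a literal % has a reserved meaning in URLs. URLs can't represent all Unicode values, so any values that don't fit into the grammar of a URL become percent-encoded: a % followed by two hexadecimal digits. This is why sometimes you'll see a %20 in URLs in the wild; that's encoding the ASCII space character, which is invalid in URLs. The parser therefore reads the % in the zone as the start of an escape, and %et is not valid hexadecimal.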

In order to work around this, you need to percent-encode the percent sign in the IPv6 zone:

package main

import (
	"fmt"
	"net/url"
)

func main() {
	u, err := url.Parse("http://[fe80::4%25eth0]:80")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.Hostname())
}

Yields:

fe80::4%eth0
        

In theory, there is guidance for how to properly handle IPv6 zones in user interfaces in RFC 9844, but there's no such guidance for URLs. Go also does not seem to follow this RFC in net/url.

Cadey

EDIT: It seems that this behaviour is compliant with RFC 6874 and that this is in fact how it is meant to be done.

IP-literal = "[" ( IPv6address / IPv6addrz / IPvFuture  ) "]"

ZoneID = 1*( unreserved / pct-encoded )

IPv6addrz = IPv6address "%25" ZoneID

Our industry confounds me.

So in the meantime in order for Anubis to point to IPv6 zoned addresses, you need to encode the % with percent encoding. This is horrible, but it seems that this is an edge case that applies to other frameworks, programming languages, and libraries:

Maybe some day in the future there will be a better option here. In the meantime my policy of not forking the Go standard library means that this somewhat terrible UX for an edge case is acceptable. I hate it, but what can you do?

TL;DR: computers were a mistake.

Sustaining Open-Source Web Archiving Infrastructure: Takeaways from IIPC Conference 2026 / Harvard Library Innovation Lab

The Perma team recently attended the International Internet Preservation Consortium’s (IIPC) Web Archiving Conference, held this year at the KBR—Royal Library of Belgium in Brussels. A recurring theme was that web archiving depends on collective stewardship of the open-source tools, institutions, and people that make preservation possible. At a moment when the web is becoming more difficult to archive, the conference offered an assessment of current challenges and a reminder that the sustainability of the field relies heavily on collaboration and shared responsibility.

The opening keynote panel—“Sustainability for Open Source Web Archiving Tools”—brought together perspectives from libraries, consortia, and open source service providers: Lauren Ko (University of North Texas Libraries), Tessa Walsh (Webrecorder), Neil Jefferies (Open Preservation Foundation), Yves Maurer (National Library of Luxembourg), and LIL’s very own Clare Stanton (Perma.cc). The conversation focused on the structural pressures now reshaping the digital landscape, and what collective stewardship might realistically look like. Key takeaways from this conversation are outlined below.

Five keynote speakers sitting in front of a slide in an auditorium describing the Library Innovation Lab and Perma.cc

Clare Stanton (center) discusses Perma.cc during the opening keynote.

Need for sustained investment in open-source software

The web archiving community no longer has the luxury of treating tool and infrastructure maintenance as someone else’s problem. Nearly every institution in the room relies on these open-source tools, including Perma itself. For example, the replay functionality for Perma.cc is built on replayweb.page, part of the software suite developed by our long-time collaborators at Webrecorder. Despite almost everyone using these open-source tools, almost no one is funding them proportionally. Historically, many projects survived on grants and foundation support, but that funding landscape is shrinking. Yves framed open-source work as a shared mission and responsibility, especially for national libraries and cultural heritage institutions whose mandates depend on long-term stewardship. Institutions should be contributing back to the web archiving ecosystem they depend on.

An asymmetric fight against a complex and closing web

Web archiving has become more difficult in the past few years, and the scale and pace of change are only accelerating. Tessa described the current environment as an “asymmetric fight” because bot detection and anti-scraping systems increasingly treat archiving crawlers the same way they treat commercial scrapers. Several panelists pointed to the collateral damage caused by large-scale scraping and large language model (LLM) training. Infrastructure providers are tightening access controls across the web, often in ways that make legitimate archival crawling significantly harder. Tessa noted that archivists now need to spend more time simply observing crawls to determine whether captures succeeded or whether crawlers archived nothing but bot verification pages. Clare suggested that the closing web may create an opportunity for archiving institutions to advocate collectively for differentiated treatment, making the case to infrastructure companies like Cloudflare that preservation work serves a fundamentally different purpose from commercial scraping.

Beyond single maintainers: Sustaining people, not just code

The panelists repeatedly returned to governance and community structure as equally important to technical capability, and also discussed the human labor behind open source tooling. Multiple panelists emphasized that storage and compute are not the primary costs in web archiving operations. The expensive part is retaining highly skilled people capable of adapting tools to a rapidly changing web environment. Neil argued that sustainability problems become especially acute when projects depend too heavily on single maintainers. The goal, Neil suggested, is not to remove human dependency, but to move from person-dependent systems to people-dependent systems, with succession planning, multiple technical leads, and stronger organizational support structures.

Digital preservation as collective responsibility

There was some cautious optimism about potential sources for more sustainable support. Panelists discussed adding funding requirements for upstream open-source projects into public tenders for web archiving services, creating institutional budget lines specifically for open-source maintenance, and treating contributions to community software as legitimate professional development work for developers within libraries and archives. Some panelists pointed to growing interest in digital sovereignty policies in Europe, where governments increasingly want more direct control over digital infrastructure and collections stewardship. Yves suggested that this political shift could create opportunities for open-source preservation tooling, particularly if public sector procurement rules begin explicitly rewarding contributions back to shared infrastructure.

Benefits and limitations of AI-assisted coding

Not surprisingly, AI hovered over much of the discussion. AI-assisted coding may reduce some development overhead, and some panelists described productive uses for code review, bug detection, and scripting assistance. However, the panel was skeptical of the idea that AI meaningfully solves the underlying sustainability problem. Faster code generation does not automatically create maintainable systems, healthy governance, or resilient communities. As Tessa noted, velocity without understanding creates its own risks.

Open-source software is critical preservation infrastructure

The key takeaway that emerged from the opening keynote was a reframing of open-source web archiving infrastructure not as ancillary technical tooling, but as critical preservation infrastructure. The field behaves as though these systems are indispensable, but there is a significant underinvestment in open-source tools. The harder question, and the one the panel kept circling back to, is whether institutions are willing to fund, maintain, and steward them accordingly.

Reading Rooms for the Archived Web / Ed Summers

Archives de Bevaix by Service intercommunal d’archivage

The Wayback Machine is (usually) good at preserving web pages, but it’s not always good at helping you find your way around what’s been preserved. URLs from a vanished website may be archived, but if the original site is gone, the paths into it (its navigation, its search, its tables of contents) are sometimes gone too.

This creates a need, and opportunity, for sites I want to call reading rooms for the archived web: standalone sites that sit to the side of archived web content and provide the index, browse, search and curation layers that the original site used to, with provenance links back to the captures they’re drawn from. The metaphor I have in mind is the reading room in a brick & mortar archive, the place you go to consult a collection, with finding aids close at hand and the records themselves a request slip away. Perhaps a finding aid is the better metaphor here?

The most recent example of this I’ve come across is work from Lawfare Media, who recovered 5,772 pages deleted from the Department of Justice website related to the Jan. 6 attack on the US Capitol. They’ve built a standalone archival viewer of the extracted content that links back to the Wayback Machine. There is more about the motivation for the project in their post The Justice Department Erases History. Lawfare Restores It. (Sadly the GitHub repo for the archive itself looks to be private.)

This is a bit archive-eating-its-own-tail, but one feature of the site that Lawfare Media built is that the search is operational from within the Wayback Machine’s own snapshot of the site, since the search runs client-side. A user search doesn’t require an API back to the server.

Searching the archive from inside the Wayback Machine

Looking at the HTML it appears the site is using minisearch for client-side search. A nice side effect of client-side search is that the indexed corpus (metadata for all the DoJ content) is itself available on the open web, as corpus.json. Some caring person has even already thought to archive corpus.json using Save Page Now:

A Wayback Machine snapshot of corpus.json from May 29, 2026

Other “Reading Rooms”

Lawfare’s archive sits in a small but growing genre. Or maybe it’s well established and I’m just noticing it for the first time? Another example is Ben Welsh’s FiveThirtyEight Index which he built after Disney shut down fivethirtyeight.com in March 2025. It catalogs over 38,000 articles, datasets, podcasts and graphics, browsable by author, date and series, with every record linked back to its Wayback Machine snapshot. (The Internet Archive also runs a companion collection.)

Another example is Internet Archive’s Scholar, which provides a catalog of published research (mostly journal articles) that is found in the Wayback Machine. I believe this is a presentation layer over data collected by IA’s FatCat project, which provides some ability to edit the metadata about the archived content.

In archival terms what these projects are doing is effectively what finding aids do: describing scope, arrangement, and provenance, but wrapping it in something that feels more like a reading room than a paper inventory. They are themselves websites that will eventually need to be archived. I think it’s interesting to think about them as a continuation of something archives have been doing for a very long time. It’s also interesting to think about the role that agentic coding tools played in their production (at least in the case of the Jan 6 Archive).

Jonathan Gray and the Public Data Lab at King’s College London run a project called Repurposing Web Archives (with the Internet Archive and Internet Archive Europe) that looks at the tools, methods, and stories of how researchers, journalists, and artists actually work with the archived web: see their recent Follow the Changes post. Perhaps this idea of Reading Rooms for web archives is a subset of the types of practices this project is interested in? It seems like there is a gray area between research that incorporates web archives, and more documentation oriented content for providing an entry point into web archives?

If you know of other examples of Reading Rooms (or finding aids) for Web Archives I’d love to hear about them!


This post was originally a thread over in the Fediverse. Thanks to (freegovinfo?) for the pointer to the Lawfare Media work.

2026-05-28: If LLMs can write abstracts, what's our job? The Uncanny Valley and Gell-Mann Amnesia Effect in the ACM Digital Library / Web Science and Digital Libraries (WS-DL) Group at Old Dominion University

 If LLMs can write abstracts, what's our job? 

The Uncanny Valley and Gell-Mann Amnesia Effect in the ACM Digital Library


Michael L. Nelson

2026-05-28


I serve on the ACM Digital Libraries Board, and we are navigating a number of changes to the ACM's Digital Library, which, for a professional society and memory organization, is arguably the ACM's primary asset. A recent article (March 2026) by Jack Davidson and Wayne Graves provides a status update on the ACM's move to open access, which includes establishing a "basic" and "premium" service level. Although there are some questions regarding the long-term implications of moving to open access, I, and presumably all authors, welcome the ACM's bold strategy for ensuring that our content reaches the widest possible audience.


Jack's and Wayne's article also addressed the DL's recent experimentation with AI/LLM enrichment of articles, specifically landing pages. And unfortunately, the experimentation got off on the wrong foot. Just before the holidays in 2025, the landing page for articles in the DL added AI-generated summaries as a sort of alternate or rival abstract. To make matters worse, these summaries were shown by default, and users had to select a tab to show the original, author-supplied abstracts. The figure below is an example taken from Dr. Casey Fiesler (CU Boulder), whose social media post about the summaries being shown by default instead of the abstracts gained a lot of traction.


AI-generated summary shown by default (2025-12-16) for https://doi.org/10.1145/3706598.3713322 


Fortunately, the expected behavior of showing the authors' abstract by default returned very quickly, and the AI-generated summary is now clearly marked as such, including the date that the summary was generated:



Author-generated abstract is now shown by default https://doi.org/10.1145/3706598.3713322 



The AI-generated summary is now clearly marked as such, and includes the date the summary was generated https://doi.org/10.1145/3706598.3713322 


First, let me be clear: showing the AI-generated summary by default instead of the authors' abstract was a terrible idea and was uniformly rebuked.  The DL board was not informed that this was going to happen, and I can't recall anyone on the DL board even suggesting it; perhaps it was just an oversight by an ACM staff member or engineer at Atypon. I don't recall exactly when the expected default behavior was restored, but it was soon after the author community complained. 


My original suggestion at the DL board meetings (echoed by Dr. Fiesler) was to provide wiki-style editing on the AI-generated summaries, possibly limited to logged-in authors (a possible premium feature?).  One can make a good argument for either opt-out or opt-in, but neither option adequately addresses the problem of the sizable back catalog of unreachable authors (JACM began in 1954).  


But what I find interesting is the level of author backlash against AI-generated summaries, at least as I observed on social media.  This is all anecdotal, and I realize people don't post about things toward which they feel neutral or even mildly positive because, let's face it: carping is a lot more fun.  But Dr. Fiesler and the others in the thread are all reasonable people and aren't just trolling. I think there's something more fundamental happening.  I think our collective reaction (revulsion?) to AI-generated summaries can be explained by adapting two phenomena: the Uncanny Valley, and the Gell-Mann Amnesia Effect.  


The Uncanny Valley is a hypothesis positing that our emotional response to depictions of humans (expressions, speech, movement, etc.) initially rises as the likeness becomes more human-like, and then takes a sharp dive as the likeness becomes nearly, but not quite, human. Basically, most cartoon characters, anthropomorphized animals, etc. are "cute", but the more realistic animated humans in movies like "The Polar Express" (2004) are just creepy.  



The Uncanny Valley (Source: Wikipedia)


I propose that something similar happens with text.  Most authors have no problem with AI tools enriching the work, for example: language translation, extracting citations, repairing/rewriting hyperlinks, suggesting related works, suggesting/assigning keywords and ACM CCS values, and any number of other services and derived content.  But generating a summary that rivals the abstract?  Yuck.  No thanks.  An error in citation parsing or CCS assignment?  Meh, who cares, either ignore it or fix it, but no one takes to social media to complain.  A subtle but detectable (if only by the author) error in a summary?  That's glaring and viscerally wrong. And even if we can find no substantive errors, knowing the text is AI-generated, we will find fault with phrasing, the structure, and various minutiae (cf. humans' negative attitudes to replicants in Blade Runner).  Extracting keywords is what computers do. Writing abstracts is what we do. If LLMs can write abstracts, what's our job? 


Those assessments inevitably derive from us reviewing AI-generated summaries of our own work.  Presumably, no one knows the material better than us, so the best anyone / anything else can do is be "as good as", certainly not "better". We're writing for our peers, and we share a nuanced, high-bandwidth vocabulary that outsiders just can't appreciate.  On the other hand, if we have to read articles outside of our area of expertise, we often wonder why are the authors so obtuse? Why can't "those people" just write plainly?  


gocomics.com/calvinandhobbes 


This is the essence of the Gell-Mann Amnesia Effect, a term coined by Michael Crichton to describe the phenomenon that the more you know about a topic, the more likely you are to see the flaws in a third party's analysis, while being far less critical when that same third party summarizes a topic on which you are not an expert. Anyone who has been interviewed by the media has experienced this: the reporters inevitably butcher your hour-long exposition, provided in painstaking detail, covering all the nuances, edge cases, historical review, and possible future directions – all reduced to a minute or less of decontextualized soundbites. But that news outlet suddenly becomes a trusted and valuable source when they cover a topic outside of your expertise.  


dilbert.com


I suspect the Gell-Mann Amnesia Effect applies to AI-generated summaries as well: they are an abomination when applied to my work, but a useful de-jargoning tool for exploring unfamiliar or even adjacent sub-fields.  This even presupposes that there should be multiple AI-generated summaries, aimed at different audiences (e.g., lay person, High School, undergraduate, researcher).  In fact, the rival abstract in Dr. Fiesler's example might be the least useful summary, precisely because it does rival the author's abstract.  But writing for audiences other than our own is a different skill set: writing for my fellow researchers at JCDL, Hypertext, Web Science, etc. is what I do, but writing for high schoolers is not what I do.  Casting my work into something appropriate for high schoolers would be a good use of LLMs, and simplifications (if not outright errors) are to be expected.  


In summary, I think it's natural to feel revulsion when LLMs are used to rival our work: it falls into the textual uncanny valley, in a way that other generative works, such as translation, do not (at least not currently).  But at the same time and based on the Gell-Mann Amnesia Effect, our harshest judgement of AI-generated summaries is reserved for areas in which we are experts, and our assessment of AI-generated summaries improves as we apply them to areas further from our own.  


With that in mind, it would make sense for the ACM DL to enable wiki-style editing on summaries, move away from the model of a single summary that rivals the author's abstract in length and complexity, and introduce multiple summaries, tailored to audience and intended purpose. 



–Michael 


2026-05-29 Update: I was chatting with Martin Klein, and he informed me that bioRxiv introduced in late 2023 on-demand summaries at variable reading levels. bioRxiv is far from my field, so I'm not completely clear on its status as a production service or just a prototype. For example, this recently published preprint doesn't show the option for AI-generated summaries: 


 

Clicking on the "Automated Services" for the recently published https://www.biorxiv.org/content/10.1101/2025.05.23.655690v1


…shows "There are no automated services for this paper."  


However, I was able to find this preprint from a year ago that does have that option available:


The "Automated Services" option is active for https://www.biorxiv.org/content/10.1101/2025.05.23.655690v1 


When clicked, the default AI-generated summary is for the "General" audience:


 

The "General" AI-generated summary for https://www.biorxiv.org/content/10.1101/2025.05.23.655690v1 

The "Expert" AI-generated summary for https://www.biorxiv.org/content/10.1101/2025.05.23.655690v1 


Are these good summaries?  I guess so – although I'm not sure what else to evaluate them against. I don't know the first thing about proteomics, so the "General" summary is certainly the most accessible to me.  The "Expert" summary is more detailed than the "General" summary, but still more accessible to me than the authors' abstract. That's not a surprise because 1) I haven't studied biology or chemistry since High School, some 40 (!) years ago, so Schär et al. aren't writing for me, and 2) the summaries are both about half the length of the authors' abstract. I saved all three into separate files:


% wc -w bio-*txt | grep -v total
     219 bio-abs.txt
     107 bio-expert.txt
      88 bio-general.txt
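
To make the "about half the length" comparison concrete, here is a minimal sketch that turns the word counts above into ratios against the authors' abstract. The numbers are copied from the wc output; the function name and the choice to round to two decimals are my own assumptions for illustration.

```python
# Word counts copied from the wc output above.
counts = {"bio-abs.txt": 219, "bio-expert.txt": 107, "bio-general.txt": 88}


def length_ratios(counts, baseline="bio-abs.txt"):
    """Return each file's word count as a fraction of the baseline file."""
    base = counts[baseline]
    return {
        name: round(n / base, 2)
        for name, n in counts.items()
        if name != baseline
    }


ratios = length_ratios(counts)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f} of the abstract's length")
```

By this measure the "Expert" summary is a little under half the abstract's length and the "General" summary is closer to two fifths.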


Two hundred words is a good target for abstracts. I'm guessing the prompts for the AI-generated summaries had a target of about 100 words, so by design even the "Expert" summary will not rival the authors' abstract (though metadata and wiki-style editing would be nice). The "Automated Services" tab has at the bottom a link to "Explore Further on ScienceCast":


The target of the "Explore Further on ScienceCast" link https://sciencecast.org/casts/jpdm4k710oet 



I don't have an account (yet) on ScienceCast, so that's the end of my exploration for now.  But there's clearly a bigger AI↔paper ecosystem to explore, for both me personally and the ACM DL.  


–Michael 


2026-06-02 Update: In another chat, Martin Klein told me he had just discovered the institutional repository at Niigata University. It does not have a native English interface, so all of the translations shown below are via Chrome and thus a little clunky.  When you first visit the repository, it asks you to choose a persona or level from three choices: "adult", "junior and senior High School students", and "Elementary school student".


Choosing a persona when visiting https://repolab.lib.niigata-u.ac.jp/ 


Selecting a persona brings up the search page (with the persona changeable via a dropdown menu in the upper right-hand side):


Search page for https://repolab.lib.niigata-u.ac.jp/ 


I did a search for "web archiving".  The hits are not especially relevant (perhaps no one at Niigata is active in the field), but they are sufficient to demonstrate the personas.

Result #1 in the SERP for https://repolab.lib.niigata-u.ac.jp/ 


Clicking on "View AI explanation", there are three tabs corresponding to the three personas previously introduced:

AI Explanation for Middle and High School Students https://repolab.lib.niigata-u.ac.jp/records/record-2000416/  


AI Explanation for Adults https://repolab.lib.niigata-u.ac.jp/records/record-2000416/ 



AI Explanation for Elementary School Students https://repolab.lib.niigata-u.ac.jp/records/record-2000416/ 


Chrome's translation for Elementary School students is not smooth, but I'm guessing that's an issue with Chrome and not the LLM that Niigata is using – presumably there is less training data for translating "children's" Japanese?


The landing page of Niigata's institutional repository does have the regrettable "embedded PDF" interface, and it does list a truncated "AI Explanation" above the "Summary by the author" (to be fair, perhaps its being named "summary by the author" instead of "abstract" is a function of the translation).

Top of the landing page for https://repolab.lib.niigata-u.ac.jp/records/record-2000416/ 


The bottom of the landing page for https://repolab.lib.niigata-u.ac.jp/records/record-2000416/ 



It is a little hard to evaluate this three-level approach, since there's the added dimension of language translation.  But it feels like an interesting application of LLMs, and aside from being listed at the top of the SERP, it does not seem to be in competition with the authors' abstract.  


Note that the landing page displayed above is likely an experimental and/or local UI since it is hosted at niigata-u.ac.jp, and is very different from the more conventional looking landing page for the associated handle, which resolves to nii.ac.jp.


The handle http://hdl.handle.net/10191/0002000416 resolves to https://niigata-u.repo.nii.ac.jp/records/2000416  



I appreciate all of Martin's suggestions and pointers, and welcome more from other readers.


–Michael



*Apologies for including Dilbert, but the options for Gell-Mann Amnesia Effect cartoons are limited. 



AI's PR Problem / David Rosenthal

J.P. Morgan hits photographer with cane
This is just a brief post to explain to my old boss, Eric Schmidt, why he and his ilk are getting booed at college commencements, and why laws against data centers are getting passed. The explanation is below the fold.

Let us start from an under-appreciated fact. Paul Campos reports that:
The college wage premium, that is, the increased earnings associated with having a college degree as opposed to only being a high school graduate, hasn’t changed at all in the past 25 years, because median real wages have been flat as a pancake for everybody, no matter what their formal education level, for the past quarter century.
But:
I wonder what’s happened to capital over this time? Value of S & P 500, inflation-adjusted, 1/2000 to 9/2025 (same period as the wage data):

2000: $1,394

2025: $6,688
On average, for more than the students' entire lives, stock-owners like Schmidt and (to a much lesser extent) I have stolen every last drop of the productivity increase of US workers at every age and education level. (See the actual numbers in the appendix)

Now, the perpetrators of this theft are telling their victims, the students and the public at large, that whether they like it or not they will be subjected to AI because that will make the perpetrators even richer. The victims have been informed that this new technology will:
Nothing better illustrates the contempt of the Epstein class for the proletariat than that these oligarchs would expect the graduating class to enthusiastically accept this prospect.

Appendix

Here are the actual numbers from Paul Campos' 25 years of flat wages and no increase in the college wage premium, while value of capital has skyrocketed:
I was fooling around with FRED this morning, as one does, and here are some stats: (The FRED numbers are presented in nominal dollars; I’ve converted them to CPI-adjusted dollars).

Median usual weekly earnings of workers with a high school degree only:

2000: $968

2025: $980

Median usual weekly earnings of workers with a bachelor degree only:

2000: $1,587

2025: $1,580
...
Median usual weekly earnings of people with a bachelor’s degree or higher:

2000: $1,705

2025: $1,747
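
As a rough check on the figures quoted above, the following sketch computes the percentage change from 2000 to 2025 for each series. The dollar values are copied from the quoted numbers; the labels and the helper name are assumptions added here for illustration.

```python
# (2000 value, 2025 value) in CPI-adjusted dollars, as quoted above.
figures = {
    "S&P 500": (1394, 6688),
    "High school only (weekly)": (968, 980),
    "Bachelor's only (weekly)": (1587, 1580),
    "Bachelor's or higher (weekly)": (1705, 1747),
}


def pct_change(start, end):
    """Percentage change from start to end."""
    return (end - start) / start * 100


changes = {
    label: round(pct_change(start, end), 1)
    for label, (start, end) in figures.items()
}
for label, change in changes.items():
    print(f"{label}: {change:+.1f}%")
```

On these numbers each wage series moves by less than three percent in either direction, while the index rises by several hundred percent.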
Here is a short list of YouTube videos on this topic: As a boomer, I think this post might be the exception that proves Ms. Baba's rule.

Note that every single one of the ads that I saw watching these videos in an incognito window was advertising an AI company! As are 49% of all the billboards in the Bay Area. Read the room, guys!

An Automated Data Monitoring Toolkit and the AI Benchmarking Exercise at the Public Data Project / Harvard Library Innovation Lab

This post is being shared on both the dataindex.us newsletter and the Library Innovation Lab Blog.


“Is data changing? Is it being disappeared? How do we know? How can we know?” This interrogative refrain rang through just about every conversation I had when, almost a year ago, I came to Harvard Law School Library to lead the Public Data Project. Thanks to the dataindex.us Data Checkup, a plan is in place to do this complicated but essential work. Through the careful scaffolding dataindex.us has constructed and the assiduous research of its staff, more than a dozen federal datasets have “health assessments,” and the team continues to add to this list.

In October 2025, the Public Data Project partnered with dataindex.us to develop a data monitoring toolkit that could both work at scale and be user-driven. In addition to creating an automated tool that can process large numbers of datasets, we also want the user to determine which datasets they want to monitor. Let’s face it, when it comes to federal data, one person’s byzantine, inscrutable dataset is another person’s trove of invaluable ground truth. The anecdotes of data use collected by essentialdata.us offer varied examples of the ways people benefit from federal datasets. The range of uses is a clear indication that people need to be able to monitor the data that matters to them.

At the Public Data Project, we are creating a toolkit that will enable users to detect and monitor changes to federal datasets over time. It will enable users to select a dataset and track changes within the data itself, as well as to automate the monitoring of external sources that indicate whether the data might be changing. Indicators of change to a given dataset range from somewhat obvious sources, like major news sites, to more obscure sources, like the U.S. Code. At present, our tool development has produced two components.

First, Binoc is a command-line tool and library to generate changelogs for datasets that don’t have them.

Scanned illustration depicting a man made out of optical equipment; advertisement for L. Srisheim, optician (ca. 1840) Advertising card for L. Srisheim, optician. Source: American Antiquarian Society.

Unlike generic diffing utilities intended to describe line-level differences in plain-text content such as source code or Markdown, Binoc aims to efficiently summarize changes in real-world datasets, including file additions and deletions, row-level updates, and schema alterations. Given a series of dataset snapshots captured at different points in time, Binoc detects what changed, expresses any changes as a minimal structured diff, and produces a human-readable summary. Binoc is currently in a collaborative design phase of development, with new features being added regularly. We welcome feedback from early adopters.
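
To illustrate the kinds of change described above (file additions and deletions, row-level updates, and schema alterations), here is a minimal sketch of a snapshot comparison. It is not Binoc's implementation or API: the function name, the in-memory snapshot layout, the "id" key column, and the sample file names are all assumptions made for this example.

```python
def diff_snapshots(old, new, key="id"):
    """Summarize what changed between two dataset snapshots.

    Each snapshot is a dict mapping a file name to a list of row dicts.
    This layout and the `key` column are assumptions for illustration.
    """
    summary = {
        "files_added": sorted(set(new) - set(old)),
        "files_removed": sorted(set(old) - set(new)),
        "files_changed": {},
    }
    for name in sorted(set(old) & set(new)):
        old_cols = set().union(*(row.keys() for row in old[name]))
        new_cols = set().union(*(row.keys() for row in new[name]))
        shared = old_cols & new_cols
        old_rows = {row[key]: row for row in old[name]}
        new_rows = {row[key]: row for row in new[name]}

        def project(row):
            # Compare rows only on columns present in both snapshots, so a
            # schema change is not also reported as an update to every row.
            return {col: row.get(col) for col in shared}

        change = {
            "columns_added": sorted(new_cols - old_cols),
            "columns_removed": sorted(old_cols - new_cols),
            "rows_added": sorted(set(new_rows) - set(old_rows)),
            "rows_removed": sorted(set(old_rows) - set(new_rows)),
            "rows_updated": sorted(
                k for k in set(old_rows) & set(new_rows)
                if project(old_rows[k]) != project(new_rows[k])
            ),
        }
        if any(change.values()):
            summary["files_changed"][name] = change
    return summary


# Two hypothetical snapshots of the same dataset.
earlier = {
    "counties.csv": [{"id": 1, "pop": 100}, {"id": 2, "pop": 200}],
    "legacy.csv": [{"id": 1, "value": "a"}],
}
later = {
    "counties.csv": [
        {"id": 1, "pop": 100, "year": 2025},
        {"id": 2, "pop": 250, "year": 2025},
        {"id": 3, "pop": 50, "year": 2025},
    ],
    "extra.csv": [{"id": 1}],
}

result = diff_snapshots(earlier, later)
print(result)
```

A real tool would read files from disk and handle formats beyond keyed rows; the point here is only that a dataset-aware diff reports changes in terms of files, columns, and rows rather than lines of text.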

We have also begun the research for a second component of the data monitoring toolkit development.

Photograph of cast bronze USGS benchmark Cast bronze benchmark. Source: United States Geological Survey.

We have created an AI benchmarking exercise to evaluate how well AI can monitor data and assess its risk, compared with the processes and conclusions of a careful researcher. The goals of the exercise are to:

  • Test how well AI can assess various types of risk to federal datasets;
  • Evaluate what baseline a popular search model would use to answer those without a custom search harness;
  • Surface and reflect on the tacit knowledge necessary to perform risk assessment, including the sources needed, the steps involved, and the difficulty of defining criteria;
  • Create awareness and community through an intellectually engaging activity that includes both individual research and group reflection.

We have conducted an initial test run of this exercise with a group of 10 information professionals. After introducing the participants to the dataindex.us rubric to assess the risk level of a given dataset, each participant was assigned a dataset and asked to evaluate it across three of the six risk dimensions outlined in the rubric. Each participant was assigned either the first three dimensions (Historical Data Availability, Future Data Availability, and Data Quality) or the latter three (Statutory Context, Staffing and Funding, and Policy). For the first hour, participants more or less worked alone, diligently researching a subject that they lacked expertise in, but for which they had clear guidelines for the kind of information they sought. Participants then opened ChatGPT, and fed it prompts that we had scripted and tailored for each dataset. Participants reflected on their findings first in a form that asked them specific questions, and then as a group, comparing their results with ChatGPT’s. Going through their three assessment dimensions, participants compared their conclusions to those of AI, reflecting on what AI missed, what they missed, and on what parts of the rubric may have led to confusion.

This exercise gave us an early insight into the potentials and pitfalls of AI’s ability to assess data risk, as well as ways in which we might tweak both the exercise and the assessment rubric. This group of participants were information professionals, not policy wonks, and we are eager to see how area specialists’ experience might lead to different outcomes in this exercise. In addition, we want to experiment with prompt engineering and give participants more leeway in their interaction with AI. In the next iteration of the exercise, we will rely on the transcription of each participant’s interactions with AI for analysis, rather than asking individuals to respond in a form.

What we liked most about this exercise, however, were the collective reflections not just on AI, but on public data more generally. One participant described it as an “excellent empathy-building exercise” because, through the work, both alone and as a group, participants become aware of the importance of and perils to public data. They reflected on whether and how to translate their own empathetic experience to AI.

June 2026 Early Reviewers Batch Is Live! / LibraryThing (Thingology)

Win free books from the June 2026 batch of Early Reviewer titles! We’ve got 251 books this month, and a grand total of 3,098 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.

If you haven’t already, sign up for Early Reviewers. If you’ve already signed up, please check your mailing/email address and make sure they’re correct.

» Request books here!

The deadline to request a copy is Thursday, June 25th at 6PM EDT.

Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the UK, the US, Canada, Australia, Germany, New Zealand, Ireland, Malta, Italy, Latvia and more. Make sure to check the message on each book to see if it can be sent to your country.

  • Employee No. 9
  • The Weight of Angels
  • Food Is Medicine: Healing Our Bodies, Nourishing Our Minds, and Transforming Our Food System
  • A Voice Like Mine: A Memoir
  • Last Seen in Sea Isle
  • Amarisa's Cooking Pot: Tales of Life in All Its Wonders
  • Mount Misery
  • Funjeepups: A Beautiful Song
  • Funjeepups: A Star Wish
  • Good Families Don't
  • My Best Friend Is a Butternut Squash
  • Toad on the Go
  • The Wise Pickle
  • Whale, That Was Unexpected
  • Confessions of the Green River Killer: A True Story of Manipulation, Madness, and a Search for Justice
  • The Roman Holiday Rule
  • The Crazy Test
  • I Know Things
  • Jonathan's Journal
  • Every Nanny Before Me
  • Antitherapies
  • Redwork
  • The Rise and Fall of the Republic of West Delphi: A Memoir
  • No One Will Ever Hear You: Stories
  • A Day in the Life: An NPC LitRPG Anthology
  • The Set of All Spies
  • The Set of All Spies
  • Switching Sides
  • Bubbles, Roses, and Rump
  • Is It Poop?: A Guessing Game With Poop and Animals That Look Like Poop
  • Ruptured: Jewish Woman in Australia Reflect on Life Post - October 7
  • Diode
  • Destiny or Defeat
  • Erkül Bwaroo, Elf Detective - At Your Service
  • Woman Outside the City
  • Shoulders of Giants
  • The Very Unremarkable Life of Mrs. Etty Bloom
  • Parallel Circuits
  • The Durbar's Reckoning
  • Padani: A Family Story
  • A Misfit's Guide to Magic and Mayhem
  • Quo Vadis, Jane Mitchell?
  • Wave 2: The Sequel
  • Eden at Dawn
  • The Necklace of Seven Souls
  • Long Weekend
  • The Chumash of Heroes: Bereshit (Genesis)
  • The Girl Who Watched the Trains Depart
  • Writing Memoir in Flashes: Creative Ways to Tell Your True Stories, One Memory at a Time
  • Exalted Objects
  • Circus of the Vanishing Elephant
  • Tessa's Landing
  • Dispatches from Grief: A Mother's Journey Through the Unthinkable
  • The Curse of Teed House
  • Heracles: The Hydra of Lerna
  • Phantom of the Galleria
  • The Quaint Convictions of Kit Bennet
  • The Timeless Teachings of Conny Mendez: An Essential Collection of Metaphysics in Plain English
  • Ocean Animals: An Animal Guessing Game Book Full of Fun and Facts
  • Love in the Abstract
  • The Only Catch
  • Bigger Than Versace
  • Branded
  • Bare
  • The Shattered Mirror
  • A Voice of Wrath
  • Like Magic
  • Devoted to His Sword
  • Woman Afraid of Water
  • Hunter's Blood
  • Brutal Country: Ten Short Stories
  • Finding My Way Through Cancer: A Gentle Journey Through Early-Stage Lung Cancer
  • Where Water Meets Sky: An Isekai Romantasy
  • Scales of Destiny
  • Change By Doing Nothing: The Hidden Science of Self-Sabotage and Why You Can Change Only When You Stop Forcing It
  • The Window
  • Zombies of the Upper East Side
  • Encoded Minds: A Biological Thriller
  • Not All Hands
  • irl
  • Good Grooming and a Healthy Respect for Authority
  • The Ash Cycle: How the Trypillians Defeated Urbanization Through Fire
  • Brilliant Life: The 5 Science-Backed Pillars to Boost Energy, Improve Sleep, and Build Healthy Habits That Last
  • Only Breath and Shadow
  • To The Moon and Back
  • Rest For The Weary: Biblical Support for Autistic Burnout and Recovery
  • Takes One to Know One
  • Flash Point
  • The Relief of Not Knowing: Stop Overthinking Decisions Start Trusting Yourself
  • The Fifth Silence
  • After the Altar: Living the Promises of the Wedding Day
  • Maillane: That Morning Sun Comes Rising Up
  • Death in the End Zone
  • The Devil of Tarsyn Forest
  • The Agentic CMO: How Artificial Intelligence Is Rewriting the Rules of Marketing Leadership
  • Tech Equity: Freedom Through Enabling Technology: A Dream Officer's Playbook for Tech Equity in Disability & Aging Services
  • Bio-Logic Herbalism: Evidence-Based Natural Herbal Remedies & Home Apothecary Protocols for the Whole Family
  • Home Apothecary for Healthy Lifestyle: The Practical DIY Herbal Guide for Household Use, Immune Support, and Natural First Aid
  • Goodbye
  • The Journey Beyond the Map
  • I Know When You're Asleep
  • Chronicle of the Stellar Bridges
  • The Legend of Kaali
  • The Family Liar
  • Blue Year: A Literary Lesbian Erotic Novel
  • Black Hole Guns For Hire
  • How to Show Up for Your Life: Start Living the Life You were MEANT to Live
  • Parents Praying… Heaven Answers: When God Hears the Cry of Parents and Intercessors
  • Tinker
  • How to Be Enough: Your Worth Isn't Up for Negotiation
  • That McKenzie Girl
  • Murder on the Shuffleboard Court
  • The Light That Devours the Sky
  • Freedom Quest: A Love Story
  • A Touch of Magic & Desire
  • The Last Sethu Queen
  • The Ever-Changing Sands
  • A Spark of Earth and Flame
  • Healing Your Inner Bestie: A Practical Guide to Master Self-Love and Overcome Self-Doubt
  • Lebanon: A Country for No One & Everyone
  • Outsphere
  • Buduneli
  • Every Last Bone
  • Introspection: Exploring the Racialized Politics and Conception of Ideal-Blackness Within African American Culture
  • Between Home and Silence: A Memoir of Family, Silence, Work, Migration, and Survival Between Two Worlds
  • Hold On to Me
  • Where the Willow Bends
  • The Redux of Sam Murdoch
  • Arcadian Alcove
  • Everyone Kept Quiet So I Did Too: Tales of a Reluctant Soldier
  • Sire, Oleander Isn't Dead! (Yet)
  • The Awakening
  • Imago Nine: The Popstar Apocalypse
  • Blood Forged
  • The Critic
  • Whispers on Flowers
  • Dead Exit
  • Dead Exit
  • Nursing Flagstop
  • Don't Believe a Word
  • Unshaken: A 30-Day Anxiety Management Workbook for High-Functioning Men
  • The Ranch
  • The Soufflé Also Rises
  • Sketch
  • Gates of Loryndas
  • TY, Thel: Films of Thelma Ritter
  • A Vamp Revamped
  • The Supreme Luna
  • The Three Creature Curse
  • What Hears You
  • Altars in the Ruins: Twenty-Five Sermons from the Ruins Redeemed by Grace
  • Cathedral of Scars: Fifteen Sermons from the Ruins That Became Sanctuary
  • Waterspout
  • The 1-Thing Way: A Sustainable Path to Reach Your Goals Without Burnout
  • Reclaim Your Body for Life: The 1-Thing Way to Sustainable Fat Loss, Metabolic Health, and Energy
  • Radical Son Back to Roots
  • The Man with the Blue Suede Shoes
  • The Divine Feminine Scent
  • The Hollow Gospel: Scripture of Wounds
  • Conspiracy In Time
  • A Small Tree in a Texas Hurricane: A Memoir
  • What No One Tells You About Caring for an Aging Parent: Real-Life Lessons, Emotional Survival, and Practical Wisdom From 14 Years as My Mother’s Caregiver
  • A Perfectly Normal Childhood (and other lies I tell myself)
  • Sterne: Valerie
  • Who Is Singing?
  • The Devil You Know
  • Buying Wealth With Money: A Workbook On Legacy
  • H. A. L. T. Own Your Emotions: A Workbook on Self-Control
  • Teen Slang for Parents: What Your Kids Are Actually Saying
  • My Voice: A Guide to Mastering Life, Truth, and Purpose
  • The Park Race
  • A Tale of Two Chinas: A Fifteen-Year Odyssey Through China's Cultural Heartlands
  • Angel's Salvation
  • Diddly Duggins and the Great Memory Misplacement
  • A Devil Amidst
  • Harbinger of Darkness
  • The Next Hundred Years
  • 404 Love Not Found: The Story of Harper and Jonah
  • The Resilience of Red Thread
  • Murder At The Radio Station
  • The Summer That Changed Us
  • Sultana: The Last Road Home: The Titanic of the Mississippi
  • The Falsehood
  • The Thirteenth Dream
  • The One
  • 24 Hours to Forget
  • Pelagic Shores
  • Are We Friends Yet?: How to Deepen Your Relationships and Create the Community You Need
  • MAX and the Beanstalk!
  • Helium
  • The Dying Tide
  • A Selfless Marriage: How Mutual Service Rebuilds Love, Respect, and Emotional Connection
  • AI Adventures with Maya and Byte
  • The Quiet Night Hug
  • Where Does It Live?: Learn Where Emotions Live in Your Body and What to Do About Them
  • Ordinary Souls
  • Cardboard Spaceship
  • This Sea Within
  • The Statistically Unlikely Reunion
  • The Statistically Unlikely Rebound
  • Ticket to Mars
  • The Moonscorn Mandate
  • Achieve Financial Peace Budget Planner: 12 Month Practical Debt Workbook for Beginners in Large Size
  • The Shipton Principle
  • Crown and Chronos
  • Walking Along the Ancient Tokaido Road: A Pilgrim's Path: Adventures and Transformations (Vol. 1: Departure)
  • Walking Along the Ancient Tokaido Road: A Pilgrim's Path: Adventures and Transformations (Vol. 2: Insight and Memories)
  • My Mother Said My Name
  • Whispers on Flowers
  • Come, Play with Me: Writer's Camp 3rd Anthology
  • Reflections from a Shoebox
  • Haiku Redo: A Collection of Haiku, Companion Pieces, and Space for Your Own
  • How to Conquer The Billionaires
  • Kink-Affirming Therapy Worksheets: A Clinician’s Guide to Sex-Positive and Consensually Non-Monogamous Integration
  • In the Flesh: Why Manifestation Fails in Your Head, and Works in Your Body
  • Catamorphosis
  • The Fortress of Us
  • Leo and the Dragon of Sound: A Journey Through the Kingdom of Noise
  • Notes on Hope
  • Kevin The Werewolf: Shattered Moon
  • The 7-Day Dopamine Detox: A Beginner's Guide to Unplugging, Resetting, and Not Falling Apart Online
  • Author and Finisher Volume I
  • Because I Deserve It: What Chronic Illness Taught Me about Finding My Voice in the Healthcare System
  • Calling Out the Shadows: A Father's Stand Against the Current
  • Kiera and Lamby: Tokyo
  • Calling Out the Shadows: A Father's Stand Against the Current
  • Odyssey
  • The Brink of Becoming: Designing a Future Beyond Zionism and Cultural Programming
  • The Last Summer on Hawthorne Street
  • Eternal
  • We Never Signed the Contract
  • Wildfire & The Sun Prince
  • Shadow & The Air Trickster
  • The Cave of Past and Present
  • Mary Falcon
  • Deadly Ground
  • The Vow Rewritten
  • Lost Hero
  • Repatriated: Sons of the Soil
  • Same Ice
  • Our Lady of the Artilects
  • Fart, Laugh, and Be Happy: Inspiring Bathroom Humor Stories to Uplift Your Spirit
  • The Great Bathroom Humor Cover-Up: An Investigation into the Lost History of Bodily Function Comedy
  • The Coin of Forever
  • A Literary Offering: Observations & Commentary
  • Striking Justice
  • The Question of When: A Practical Guide to Knowing When It's Time for Assisted Living, Memory Care, or Skilled Nursing
  • The Protector and the Annihilation
  • Cornelius & The Sneak Goose Attack
  • Blind Item
  • The Echo She Left Behind

Thanks to all the publishers participating this month!

  • ALIO Publishing Group
  • Attwater Books
  • Autumn House Press
  • Bricolage Lit
  • Brother Mockingbird
  • City of Words
  • City Owl Press
  • Crooked Lane Books
  • Entrada Publishing
  • Flat Sole Studio
  • Gefen Publishing House
  • Haven
  • Henry Holt and Company
  • History Through Fiction
  • HTF Publishing
  • Inferno Books
  • Infinite Books
  • Inkd Publishing LLC
  • LaPuerta Books and Media
  • Learning Spark Educational Publishing
  • NeoParadoxa
  • Plexus Publishing, Inc.
  • Pocketbook Press
  • PublishNation
  • Restless Books
  • Riverfolk Books
  • RIZE Press
  • Rootstock Publishing
  • Running Wild Press, LLC
  • Somewhat Grumpy Press
  • Thinking Ink Press
  • Tundra Books
  • Tuxtails Publishing, LLC
  • Type Eighteen Books
  • University of Nevada Press
  • University of New Mexico Press
  • UpLit Press
  • WorthyKids

DLF Digest: June 2026 / Digital Library Federation

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation. See all past Digests here

Happy June, DLF community! Thanks to everyone who participated in Community Voting for the 2026 Virtual DLF Forum. We appreciate your input as we work with the Forum Planning Committee to build this year’s program.

Look out for updates this month: the program release, registration opening, and Digital Storytelling Fellows applications. We’re excited to share what’s next!

Warmly,

-Shaneé

This month’s news

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus conferences and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.

  • DLF Born-Digital Access Working Group (BDAWG): Tuesday, 6/2, 2pm ET / 11am PT.
  • DLF Digital Accessibility Working Group (DAWG): Tuesday, 6/2, 2pm ET / 11am PT. 
  • DLF AIG Cultural Assessment Working Group: Monday, 6/8, 1pm ET / 10am PT.
  • AIG Metadata Assessment Group: Friday, 6/12, 2pm ET / 11am PT.
  • AIG User Experience Working Group: Friday, 6/19, 11am ET / 8am PT.
  • Digitization Interest Group: Monday, 6/22, 2pm ET / 11am PT.
  • Committee for Equity & Inclusion: Monday, 6/22, 3pm ET / 12pm PT.
  • DLF Open Source Capacity Resources Group: Wednesday, 6/24, 1pm ET / 10am PT.
  • DAWG Policy & Workflows: Friday, 6/26, 1pm ET / 10am PT.
  • DAWG IT & Development: Monday, 6/29, 1pm ET / 10am PT.
  • DLF Climate Justice Working Group: Tuesday, 6/30, 3pm ET / 12pm PT.

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member organization. Learn more about our working groups on our website. Interested in scheduling an upcoming working group call or reviving a past group? Check out the DLF Organizer’s Toolkit. As always, feel free to get in touch at info@diglib.org.

Get Involved / Connect with Us

Below are some ways to stay connected with the digital library community and us: 

The post DLF Digest: June 2026 appeared first on DLF.

"No way to prevent this" say users of only package manager where this regularly happens / Xe Iaso

In the hours following the news that Redhat Insights' JavaScript packages fell victim to a supply chain attack via NPM, developers and systems administrators scrambled to ensure all of their projects were unaffected by a supply chain attack that steals credentials for AWS, GCP, Azure, Kubernetes, HashiCorp Vault, npm, and CircleCI before then self-propagating via said stolen npm credentials and the bypass_2fa setting. This establishes persistence via Claude Code hooks and VS Code task injection. If you have installed the affected package, reprovision your development hardware. This is due to the affected dependencies being distributed via NPM, the only package manager where these supply-chain attacks regularly happen. "This was a terrible tragedy, but sometimes these things just happen and there's nothing anyone can do to stop them," said programmer Lady Eulah Howell, echoing statements expressed by hundreds of thousands of programmers who use the only package manager where 90% of the world's supply-chain attacks have occurred in the last decade, and whose projects are 20 times more likely to fall victim to supply chain attacks. "It's a shame, but what can we do? There really isn't anything we can do to prevent supply-chain attacks from happening if the maintainers don't want to secure access to their accounts in a robust manner." At press time, users of the only package manager in the world where these vulnerabilities regularly happen once or twice per week for the last year were referring to themselves and their situation as "helpless".

For more information, please see upstream documentation published by Redhat Insights' JavaScript packages at the following link: redhat-javascript-clients-06-2026.

Systems Life: Navigating the Distributed Database / Library | Ruth Kitchin Tillman

This post is the first in a series in which I write about experiences or specific challenges from my day-to-day work. Planned posts include descriptions of a bug and how it affected my coworkers, how I wrote a script to parse log data… I’m hoping that these will be interesting for other librarians who work in entirely different areas, for my colleagues who are solving different problems on different systems (or maybe eventually the same one after we migrate), and for those who are thinking about doing this kind of work in the future.

When we talk about the ILS or LSP, it can sound like we’re talking about a single system. And we are, some of the time. But just like our permissions shape what we can see and do, the ways we access the system and its data may lead to entirely different experiences. More importantly, if you don’t know how different tools and even databases work, you may end up with inaccurate results or not knowing that something is possible.

For example, our Sirsi ILS and reporting system(s) consist of two separate databases. These databases can be accessed in: one way for most folks (two for people using a BLUEcloud module), two-to-three ways for some, four ways if you’re special, and five ways if you’re one of two people.

Diffusion of Databases

The Sirsi Symphony Database, fka Unicorn1, underlies the whole thing. This Oracle database is the ultimate database of record. If we load MARC, it ends up in the Symphony database. If we place orders, they become entries across Symphony tables. If we loan materials, it triggers a series of updates in the Symphony database.

BLUEcloud Analytics runs off a separate database, also Oracle.2 This separation is common and appropriate. Alma also uses a separate Oracle database and FOLIO has the option of Metadb built with PostgreSQL. The analytics databases don’t contain live data. Instead, they’re updated regularly overnight, based on things that have occurred in the primary database. Change a title? It’ll show up in analytics tomorrow.3 Check out a book? That transaction will show up in circ stats tomorrow.

This is an appropriate choice for three reasons:

  1. It’s a bad idea to run large analytical queries on production. Plus, static indexes are much more efficient to search.
  2. The analytics system has no real demand overnight, so its server can do a full reindex before running any scheduled jobs.
  3. The analytics database can be designed differently.

Following that last point, the analytics database isn’t just a snapshot of production. It has a fundamentally different design. It anonymizes circulation transactions, but it also builds completely different indexes from the ones we need for daily work. For example, it indexes circulation data by hour, day, month, and year as well as by circulation desk. Sometimes we want big numbers. Sometimes we want to see which desks get the most traffic. Those aren’t the kind of searches we need to do in day-to-day work. It indexes MARC as fields and subfields, including invalid ones like λ.
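As a rough illustration of what indexing circulation data by hour, day, month, year, and desk can mean, here is a minimal Python sketch. It is not the actual Analytics schema: the event tuples, the desk names, and the `build_indexes` function are all hypothetical, and the only point is that counts are pre-aggregated at each grain from rows that carry no patron identifier.

```python
from collections import Counter
from datetime import datetime

# Hypothetical anonymized checkout events: (timestamp, circulation desk).
# No patron identifier is kept, mirroring the anonymization described above.
events = [
    (datetime(2026, 5, 4, 9, 15), "MAIN"),
    (datetime(2026, 5, 4, 9, 40), "MAIN"),
    (datetime(2026, 5, 4, 14, 5), "MUSIC"),
    (datetime(2026, 6, 1, 9, 30), "MAIN"),
]

def build_indexes(events):
    """Pre-aggregate counts at each grain so reports never scan raw rows."""
    indexes = {grain: Counter() for grain in ("hour", "day", "month", "year", "desk")}
    for ts, desk in events:
        indexes["hour"][ts.hour] += 1
        indexes["day"][ts.date()] += 1
        indexes["month"][(ts.year, ts.month)] += 1
        indexes["year"][ts.year] += 1
        indexes["desk"][desk] += 1
    return indexes

indexes = build_indexes(events)
print(indexes["desk"].most_common(1))  # busiest desk and its count
print(indexes["hour"][9])              # checkouts in the 9 o'clock hour, across all days
```

A nightly rebuild of counters like these is cheap to query the next day, which is the trade-off described above: the numbers are a day old, but big statistical questions never touch production.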

Accessing the Databases

Most of my coworkers only access Symphony using one tool: Workflows. A few also use BLUEcloud Circ.4 Using the client, they look for records, update them, perform transactions, etc. We import single MARC records using Workflows wizards. We import batches of MARC records using Workflows reporting (and FTP). Global item updates are done in Workflows. The Workflows reporting module can be used to load, transform, or extract data, history, or (some) statistics.

Next, we have BLUEcloud Analytics. A much smaller set of people (but still plenty) have rights here. As described above, Analytics is a completely separate database. It’s also designed in a way that’s more oriented toward statistical work. Folks use it to extract shelf lists, acquisitions data, spreadsheets of MARC subfields, etc. The indexes are enormous and joined queries can take some time to run (and you can only run joined queries which are supported by the system), but you can get a lot of data and can’t accidentally bring down production.

About four years ago, we got access to Data Control. This is probably my favorite Sirsi product5. Unlike Analytics, Data Control gives you the power to query or even update the Symphony database itself. That means it doesn’t have some things that are in Analytics. You can’t see an item’s transaction history, for example, just its current data.6 Even fewer people have access to this, most use it on our Stage server, and just a couple of us are allowed to run batch updates to production.7

seltools is like Data Control for the command line. More properly, Data Control is an interface that lets ordinary humans use seltools with enough scaffolding not to mess quite as many things up. seltools can do even more and can do it very quickly. It is a sysadmin tool and only two people here have rights to use it. It can do extraordinary work in seconds and could cause irreparable damage (or at least, damage that requires restoring from backup). AFAIK it dates back to the launch of Unicorn.

How I Access the Data

I have rights in Workflows, BLUEcloud, Analytics, and Data Control. I tend to use them as a kind of grab bag and often chain Analytics and Data Control in my work, sometimes performing interim steps with Python or OpenRefine.

Because Analytics isn’t querying live data, it’s a much better place to do initial MARC searches. If I want to find every record with a 699, for example, Analytics is the place to do that fast. Or I could look for every 100 or 700 with a subfield “e” or search for a particular piece of text in one or more fields.

But in terms of output, Analytics leaves a lot to be desired for MARC work. It shows a field’s subfields like a table. For example:

Field  Subfield  Data
264    a         New York
       b         Grosset & Dunlap
       c         [1972]

That’s fine if I only want to facet down to the subfield b in each row, but if I want to deal with the MARC data as a field it becomes a problem.
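To make the problem concrete, here is a small Python sketch of folding one-row-per-subfield output back into one string per field. The row layout (bib key, tag, subfield code, data) and the `fold_fields` name are assumptions for illustration, not the format of any actual Analytics export.

```python
from itertools import groupby

# Hypothetical rows as they might sit in a spreadsheet export:
# (bib key, field tag, subfield code, subfield data), already in record order.
rows = [
    ("12345", "264", "a", "New York"),
    ("12345", "264", "b", "Grosset & Dunlap"),
    ("12345", "264", "c", "[1972]"),
    ("67890", "264", "a", "Boston"),
]

def fold_fields(rows):
    """Collapse consecutive subfield rows sharing a bib key and tag into one string."""
    folded = []
    for (bib_key, tag), group in groupby(rows, key=lambda r: (r[0], r[1])):
        content = "".join(f"|{code}{data}" for _, _, code, data in group)
        folded.append((bib_key, tag, content))
    return folded

folded = fold_fields(rows)
for bib_key, tag, content in folded:
    print(bib_key, tag, content)
```

This sketch has an obvious weakness: two 264 fields in the same record would be merged into one, because nothing in the rows marks where one field ends and the next begins. That is part of why working from per-subfield rows is awkward when the field is the unit you care about.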

In the Analytics reports I use, it’s easy to add the bib key to a report if it wasn’t already in there. Before we got Data Control, my next step would be to actually switch to something like Z39.50 and download all the bibs manually, hoping I got everything (because our keys are not always in the 001, it’s a long story). I then had to do a delimited export in MarcEdit or write a pymarc script to get the fields I wanted.

Now, if I want to see a set of fields from the record, I simply upload that same set of bibkeys in Data Control8. I structure my query to include the tables I want and output the fields I need from each table. I can then export them into a much nicer spreadsheet with the MARC field (and indicators, if desired) printed the way it appears in the original MARC. I can also export the entire set of records as MARC.

264  |aNew York :|bGrosset & Dunlap,|c[1972]

An Example Update

But, even better, I have the rights to update the data. In most cases, I can even use regular expressions. For example, when we added a new ILLiad request placement module to our MyAccount app, we grabbed the 020 (ISBN field) straight from the Symphony API.9 Unfortunately, about 600,000 of our 020 fields followed the pre-2013 structure, when qualifying information was still included in the subfield a. In 2013, subfield q was introduced to handle things like “(paperback)”. This unexpected data was messing with ILLiad’s automated processes. We could’ve changed the script, but it made more sense to fix the actual data, since we now had the tools.

First, I ran an Analytics query to find all records where the 020a contained “(”, “)”, or any letter except x. I exported the data, extracted the bibkey column, and then broke it into batches of 25,000 bibkeys.
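The selection and batching step can be sketched in Python as follows. This is only an illustration of the logic, not the Analytics query itself: the `exported` pairs, the `needs_cleanup` test, and the `batches` helper are hypothetical names, and the pattern assumes plain ASCII letters.

```python
import re

# Matches "(", ")", or any ASCII letter other than x/X
# (X is a valid ISBN-10 check digit, so it is not a sign of a qualifier).
SUSPECT = re.compile(r"[()]|[A-WYZa-wyz]")

def needs_cleanup(subfield_a: str) -> bool:
    """True when an 020 $a value appears to carry qualifying text."""
    return bool(SUSPECT.search(subfield_a))

def batches(keys, size=25_000):
    """Yield consecutive slices of at most `size` keys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]

# Hypothetical (bib key, 020 $a) pairs standing in for the exported data.
exported = [
    ("1001", "0448021234 (paperback)"),
    ("1002", "044802123X"),
    ("1003", "9780448021232 :"),
    ("1004", "0448021234 pbk."),
]
flagged = [key for key, value in exported if needs_cleanup(value)]
print(flagged)
print([len(batch) for batch in batches(list(range(60_000)))])
```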

I spent a few weeks working on our stage server to develop the appropriate regex-based find and replace patterns to move qualifying data into a subfield q. I had to handle various edge cases: no parentheticals, only one half of the parenthetical, etc. Once I felt confident, I ran a batch of about 5000 on stage and QA’d my results thoroughly. I then spent the next month running batches in production. I limited batch sizes and chose days when we didn’t have other jobs which would trigger big reindexes (you can only do so many jobs in a night or the reindex will take forever and throw off all the other cron jobs).
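For readers who want a feel for this kind of find and replace, here is a minimal Python sketch. It is not the set of patterns actually used in Data Control, whose regex dialect may differ. The `|a` / `|q` notation follows the field display shown earlier, `move_qualifier` is a hypothetical name, and dropping the parentheses in the new subfield q is an assumption made here, since that is a local cataloging decision.

```python
import re

# An ISBN (13 digits, or 9 digits plus a digit or X), then a qualifier that may
# have both parentheses, only one, or none. The lookahead stops the 10-character
# alternative from matching the front of a 13-digit ISBN.
PATTERN = re.compile(
    r"^\|a(?P<isbn>[0-9]{13}|[0-9]{9}[0-9Xx])(?![0-9Xx])"
    r"\s*\(?(?P<qual>[^()|]+?)\)?\s*$"
)

def move_qualifier(field: str) -> str:
    """Move qualifying text out of 020 $a into a new $q; leave other shapes alone."""
    m = PATTERN.match(field)
    if not m:
        return field  # no qualifier, or a shape this sketch does not handle
    return f"|a{m['isbn']}|q{m['qual'].strip()}"

samples = [
    "|a0448021234 (paperback)",   # full parenthetical
    "|a0448021234 (pbk.",         # only the opening half
    "|a0448021234 pbk.)",         # only the closing half
    "|a9780448021232",            # nothing to move
    "|a0448021234 (set) (pbk.)",  # two qualifiers: left for manual review
]
for field in samples:
    print(field, "->", move_qualifier(field))
```

Anything the pattern does not recognize comes back unchanged, which makes leftovers easy to find afterward by re-running the original search. It also suits running on a stage server first, where unexpected shapes can be inspected before anything touches production.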

Once the project was done, I was able to re-run queries in Analytics to ensure there weren’t any issues remaining.

I can also click into and update single records from the Data Control results page or set it to let me modify a particular field and paste repeating data into that field. The former is useful when there might be other related fields which need to be updated or I need more context. The latter is useful when only some of the results need to be updated or the person hasn’t yet got regex privileges on production.

Clashing Designs

So that’s what it looks like when things go well. Tech librarianship so often involves what Marshall Breeding called “Knitting Systems Together” that I almost don’t think about the ways I hop across tools. At most I feel a minor irritation. Recently, I ran across a case where the difference between system designs and who had permissions to access what was making a huge difference in my coworkers’ abilities to get their work done.

In theory, the data in Analytics should mirror what’s in Symphony, at most with a different structure. However, when a barcode is updated in Symphony (generally via Workflows), Analytics completely drops entries related to that barcode. The entries are not transferred to the new barcode. Data that’s still in the item record is retained, so we have the item last activity date, the circulation count (an incremented field), etc. But we can’t see the item transaction history.

Now, there were a couple things we could do about this… I’ll describe how system logs come into play in my next post!


  1. Still labeled Unicorn in some places. ↩︎

  2. Specifically, it’s MicroStrategy, whose Wikipedia page starts off like any other data analytics software and then …pivots to Bitcoin. It’s Michael Saylor’s company, if that name means anything to you. ↩︎

  3. Timing could be more frequent, but I believe most have daily updates. ↩︎

  4. BLUEcloud is Sirsi’s next-gen browser client. To my knowledge, we still only use the circulation module and many people still use Workflows for circulation. ↩︎

  5. It’s extremely powerful, though extremely fragile – but that could also describe me, so I can only be so annoyed by it. ↩︎

  6. Transactions here meaning every time the item was scanned, some of which is available via Analytics. There is also transaction history in Symphony but it’s in logs. ↩︎

  7. It also supports two kinds of batch updates – a batch modify which lets you edit fields individually in a browser interface and a batch substitute which lets you run updates on fields using regular expressions. If you wanted to update a MARC 500 field on a set of items, for example, someone with batch modify permissions could display all 500 fields on the records, click Modify, and then paste a new text into any field they wanted to replace (while skipping 500 fields which didn’t match). Someone with regex permissions could find all notes matching the old note and sub it with the new note. ↩︎

  8. Why not do the whole search in Data Control? It is painfully slow compared to Analytics, especially for MARC searches. For the cases when Data Control is designed better for searching, I’ll export a set of keys for the overall records I want to search within and then perform it as a scoped search, which is much faster. ↩︎

  9. We only use the APIs for integrations not for reporting/updates/etc., so I didn’t list it above. Seltools are much faster and more powerful. ↩︎

Untitled / Ed Summers

Untitled

by John Summers