Thanks to the June release team: Galen Charlton (Equinox), Gina Monti (Bibliomation), Andrea Buntz Neiman (Equinox) and Chris Sharp (PINES); as well as everyone who contributed fixes and testing to this release.
Anubis is about to get WebAssembly-based proof of work checks so that administrators can use a non-SHA256 proof of work method to protect their websites. Part of the implementation goals of this work is that the check logic is defined in one place on both client and server. The client and server will then hook into the WebAssembly in order to make sure they're running in lockstep.
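To make the "defined in one place" goal concrete, here is a minimal Python sketch of a SHA-256 proof of work where the solver and the verifier share a single check function. The challenge string, the leading-zero-hex-digit difficulty rule, and the function names are assumptions for illustration, not Anubis's actual implementation.

```python
import hashlib
from itertools import count


def check(challenge: str, nonce: int, difficulty: int) -> bool:
    """Return True if sha256(challenge + nonce) starts with `difficulty` zero hex digits."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode("utf-8")).hexdigest()
    return digest.startswith("0" * difficulty)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force the smallest nonce that passes check()."""
    for nonce in count():
        if check(challenge, nonce, difficulty):
            return nonce


# Hypothetical challenge value; a real server would issue a random one per visitor.
challenge = "example-challenge"
nonce = solve(challenge, difficulty=3)

# Server side: a single call to the same check() the client used.
print(nonce, check(challenge, nonce, difficulty=3))
```

Because both sides call the same check function, the rule cannot drift apart between client and server, which is the property the WebAssembly module is meant to provide.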
However, one small problem comes up. What do you do when the client has WebAssembly disabled? I really don't want to de facto lock people out of websites. Anubis exists in an impossible balance of user experience, administrator experience, and developer experience, and any change to any of these factors disrupts the balance for the others.
To work around this and also fulfill the goal of having check logic defined once, I decided to take inspiration from the legendary talk The Birth and Death of JavaScript and just recompile the WebAssembly to JavaScript. Sure, the resulting JavaScript will be slower than the equivalent WebAssembly (even more so because disabling WASM usually disables the JavaScript JIT, the thing that makes JavaScript fast), but it will finish eventually. Hopefully it will be more efficient than the existing JavaScript is on lower end hardware, but research is required.
Luckily enough, the tool I need (wasm2js from the binaryen project) is packaged in Linux distributions. The bad news is that distributions ship ancient versions of it that don't produce the same output as the Homebrew copy on my development machine.
In order to really make sure that the output of this is deterministic (essential for reproducible builds), I need to bundle a copy of wasm2js. So I did that by building a version of wasm2js compiled to WebAssembly with wasi-sdk. The rest of the article is the tale of reproducibility woe that led to the implementation I ended up with. Buckle up and enjoy the ride!
Back up a sec, this doesn't make sense to me. If you have the same bytes of
input to a compiler, you should get the same bytes of output, assuming
that the compiler flags, target, and other platform details are controlled
for, right? A compiler is just a deterministic function that turns input
source code into output bytecode, right?
lol you'd think, but no, it's not. In theory it is (and for small scale
compilers it definitely is), but in practice compilers are strange and
complicated beasts containing multitudes that no mere mortal can fully
comprehend on their own.
There are a shocking number of ways to accidentally create nondeterministic output when doing C/C++ development. One of the easiest is to use the builtin __DATE__ and __TIME__ macros to stamp a build with the time the compiler was executed at:
$ make clean && make hello.wasm && wasmtime run -W exceptions=y ./hello.wasm
rm -f hello.o hello.wasm
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false -c hello.cpp -o hello.o
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false -fwasm-exceptions -lunwind --no-wasm-opt hello.o -o hello.wasm
Jun 18 2026 00:00:59
Another time it gets me this:
$ make clean && make hello.wasm && wasmtime run -W exceptions=y ./hello.wasm
rm -f hello.o hello.wasm
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false -c hello.cpp -o hello.o
wasi-sdk-33.0-x86_64-linux/bin/wasm32-wasip1-clang++ -O3 -fwasm-exceptions -mllvm -wasm-use-legacy-eh=false -fwasm-exceptions -lunwind --no-wasm-opt hello.o -o hello.wasm
Jun 18 2026 00:01:11
Even though the source code had the same bytes, the output of the compiler was wildly different.
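The effect can be modeled with a toy "compiler" in Python. This is only a sketch: toy_compile is a hypothetical stand-in that appends the build time to the source bytes, roughly the way __DATE__ and __TIME__ bake the clock into a binary, and the pinned timestamp mirrors the idea behind the SOURCE_DATE_EPOCH convention from the Reproducible Builds project. Whether the toolchain used here honors that variable is not something this article establishes.

```python
import hashlib


def toy_compile(source: bytes, build_time: str) -> bytes:
    """Stand-in for a compiler that stamps its output the way __DATE__/__TIME__ do."""
    return source + b"\x00" + build_time.encode("ascii")


def digest(artifact: bytes) -> str:
    """SHA-256 of the build artifact, as a hex string."""
    return hashlib.sha256(artifact).hexdigest()


source = b'int main() { puts(__DATE__ " " __TIME__); }'

# Same source, two different wall-clock times: the artifacts differ.
first = toy_compile(source, "Jun 18 2026 00:00:59")
second = toy_compile(source, "Jun 18 2026 00:01:11")
print(digest(first) == digest(second))  # False

# Pinning the timestamp to one agreed value makes the output repeatable.
pinned = "Jan  1 1970 00:00:00"
print(digest(toy_compile(source, pinned)) == digest(toy_compile(source, pinned)))  # True
```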
In order for users and packagers to trust the binaries of wasm2js I'm committing to the Anubis repo, I need to make sure that you can build the same version I built, down to the same bytes. For an added bonus, you should be able to build this on your machine and get the same bytes I got.
That sure does sound like a great ideal, it would be horrible if something
unforeseen came up to ruin it!
Clang silently runs wasm-opt from $PATH behind your back
Besides wasm2js, binaryen ships a bunch of other useful tools such as wasm-opt. wasm-opt optimizes WebAssembly compiler output to let you eke out more performance. This doesn't work in every circumstance, but when it does work it makes a huge difference. As such, clang shells out to wasm-opt when doing builds.
This normally makes sense, but in this case it caused builds to fail on my DGX Spark because its version of wasm-opt is too old:
$ uname -m && which wasm-opt && wasm-opt --version
aarch64
/usr/bin/wasm-opt
wasm-opt version 108
Compared to my workstation which installs wasm-opt from Homebrew:
$ uname -m && which wasm-opt && wasm-opt --version
x86_64
/home/linuxbrew/.linuxbrew/bin/wasm-opt
wasm-opt version 130
Turns out that wasi-sdk and binaryen rely on the WebAssembly Exceptions extension. This is a reasonable thing to assume given that wasi-sdk mostly assumes you're building things for web browsers and 93.86% of browser users have a browser engine new enough to support it. C++ is also one of the main places where exceptions are used, so I guess WebAssembly-native exception handling removes a lot of boilerplate here.
Both wasmtime and wazero require you to opt into exception support with a flag. This is fine; we can just pass -W exceptions=y to wasmtime and use a custom runner harness for wazero. The annoying part is what happens when my arm machine's anemic build of wasm-opt sees exception handling instructions: it exits, which made the build fail.
The solution was to pass --no-wasm-opt at the linking step. This removed one angle of irreproducibility.
I guess in the future we could make it use the version of wasm-opt it just
built to optimize the output, but that may be a premature optimization for
now.
Clang relies on address layout for ordering things
The version of clang that I use to compile wasm2js has some address-sensitive code generation hiding in its exception handling path. Raw pointer values leak into the order a handful of try_table blocks come out in. This surfaces as every build differing from the next by about 29 bytes.
The computation is nearly identical, but the byte order is just different enough to also make the catch references differ. This also fires when you build this pinned version of wasm2js on arm64 machines because the pointer iteration order there is different from what it is on my workstation.
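As an analogy only (this is not LLVM's code), the Python sketch below shows how ordering output by memory address makes it depend on allocator behavior, while ordering by a stable key does not. The TryTable class and its labels are hypothetical. In CPython, id() returns the object's memory address, which is what makes the comparison fit.

```python
import random


class TryTable:
    """Hypothetical stand-in for a compiler's internal block object."""

    def __init__(self, label: str) -> None:
        self.label = label


def emit_by_address(blocks):
    # In CPython, id() is the object's memory address, so this order depends
    # on where the allocator happened to place each object.
    return [b.label for b in sorted(blocks, key=id)]


def emit_by_label(blocks):
    # Sorting on a stable, content-derived key gives the same order every time.
    return [b.label for b in sorted(blocks, key=lambda b: b.label)]


blocks = [TryTable(f"catch_{i}") for i in range(5)]
random.shuffle(blocks)

print(emit_by_address(blocks))  # order may vary between runs and machines
print(emit_by_label(blocks))    # always catch_0 through catch_4
```

Both functions emit the same set of blocks; only the order differs, which mirrors a diff of a few bytes over an otherwise identical computation.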
To work around this, I took two steps:
1. Disable address-space randomization for this build using setarch --addr-no-randomize.
2. Create known good sha256 checksums for both x86_64 and arm64 via building this program on machines I trust.
I also made a CI job ensure this:
- name: Ensure reproducibility
  run: |
    cd ./utils/wasm/wasm2js
    ./build.sh
    if sha256sum -c --status shasums.x86_64; then
      echo "OK: rebuilt modules match the recorded x86_64 checksums"
    elif sha256sum -c --status shasums.arm64; then
      echo "OK: rebuilt modules match the recorded arm64 checksums"
    else
      echo "::error::rebuilt wasm2js/wasm-opt match neither recorded checksum set on ${{ matrix.runner }}" >&2
      sha256sum wasm-opt_130.wasm wasm2js_130.wasm
      exit 1
    fi
To be extra sure, we have this job run on both x86_64 and arm64 hosts. I'd really love to have this be reproducible across hosts, but that's an upstream LLVM bug that I am not powerful enough to tackle. If you work on LLVM and are reading this, it would be nice to set a seed of some kind to ensure that this iteration order is fixed across architectures.
At the very least builds are deterministic within architectures. This may have to be good enough for now.
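The "match either recorded checksum set" logic from the CI job can also be sketched in Python, which may help when checking a local build without sha256sum on hand. This is a sketch under assumptions: the helper names are hypothetical, and it assumes the checksum files use the usual sha256sum line format of a hex digest followed by a file name.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Optional


def sha256_of(path: Path) -> str:
    """Hex SHA-256 digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def parse_shasums(text: str) -> Dict[str, str]:
    """Parse lines of the form '<hex digest>  <filename>' into {filename: digest}."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            entries[name.lstrip("*")] = digest  # '*' marks binary mode
    return entries


def matching_arch(build_dir: Path, checksum_sets: Dict[str, str]) -> Optional[str]:
    """Return the first architecture whose recorded checksums all match, else None."""
    for arch, text in checksum_sets.items():
        expected = parse_shasums(text)
        if expected and all(
            (build_dir / name).is_file() and sha256_of(build_dir / name) == digest
            for name, digest in expected.items()
        ):
            return arch
    return None


# Demo with a throwaway file standing in for a rebuilt module.
with tempfile.TemporaryDirectory() as tmp:
    build_dir = Path(tmp)
    module = build_dir / "wasm2js_130.wasm"
    module.write_bytes(b"pretend module bytes")
    good = f"{sha256_of(module)}  wasm2js_130.wasm\n"
    bad = f"{'0' * 64}  wasm2js_130.wasm\n"
    print(matching_arch(build_dir, {"x86_64": bad, "arm64": good}))  # arm64
```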
Collaboration. Serendipity. Diversity. These are the qualities that come to mind when I think about this year’s OCLC quilt and the community that created it.
The OCLC Quilters, a group of current and retired OCLC employees, have spent months creating a quilt of 120 cross-cut blocks to donate to the silent auction held during the ALA 2026 Annual Conference. The ALA BiblioQuilters annually host this auction as a fundraiser for the Christopher Hoy Scholarship, which awards a $5,000 scholarship each year to a U.S./Canadian citizen or permanent resident who is pursuing an MLS in an ALA-accredited program.
This is the fourth year in a row that the OCLC Quilters have donated a quilt to the silent auction. Their work inspired me to take up sewing about a year ago, and I’m proud to move from admirer to participant, contributing to the OCLC quilt for the first time. Although left-handed people are about 10% of the population, three of the 13 people contributing to the OCLC quilt, including myself, are left-handed. While that doesn’t affect the result, it requires a few adjustments in technique and having the appropriate scissors. Sharing advice on adapting equipment and shopping for left-handed supplies is one of the ways we support each other.
Nine of the 13 contributors to the OCLC Quilt for the ALA 2026 Annual Conference
Like all handicrafts, quilting is an activity with its own nomenclature. As a quilter and cataloger, I found myself wondering: “What controlled vocabulary terms could I use to describe the OCLC quilt?” There are several from vocabularies such as Library of Congress Subject Headings (LCSH) and Getty Art and Architecture Thesaurus (AAT). These are listed at the end of this blog.
A quilt is created from many elements that may not be individually significant but form a meaningful whole, just like a WorldCat bibliographic record. The blocks of the quilt function like data elements in a WorldCat record, with contributions from multiple individuals creating the larger work.
Assembling the quilt
OCLC quilters sewed 120 blocks, which are the fabric squares comprising the quilt’s front. The blocks are a cross-cut design—a pattern chosen because it is accessible for novice sewists and makes good use of small fabric pieces. Quilters often save these leftover pieces, called “scraps,” from other projects for future use. Reusing scraps makes quilting a sustainable craft, and quilters often share them with one another. An experienced OCLC quilter, who keeps her scrap collection organized in true librarian fashion, donated most of the fabric pieces used for the blocks.
Experienced quilters arranged and sewed the blocks together and cut the batting (soft material used between the front and back sides of the quilt). The next step, in which three layers are sewn together with a decorative stitch, is quilting. This is the strict definition of the term “quilting,” although it is often used to refer to the entire process of creating a quilt. The pattern used for the quilt stitching is called “modern ties,” and it looks a bit like tied shoelace loops.
An OCLC logo is incorporated into one of the quilt blocks
The final step is to sew a long strip of binding fabric around the edges of the quilt, which keeps them from fraying and adds a decorative touch. Two labels were sewn into the binding: “Made in OH” and “Is it perfect? No.” Both of these labels are accurate descriptions of this quilt, but unlike in bibliographic descriptions, a certain amount of imperfection is not only tolerated but may be considered part of the quilt’s charm.
A quilting tradition at ALA Annual
The OCLC quilt will be one of many available at the ALA BiblioQuilters silent auction during ALA Annual in Chicago, Illinois. The BiblioQuilters were founded at the 1998 ALA Annual Conference in Washington, D.C. Since 2000, the BiblioQuilters have had a silent auction of quilts every year except 2020 and 2021 (because of the pandemic). The quilts are usually available to view and bid on near the registration area. If you are attending ALA in Chicago, I highly recommend you visit the auction table to view them. After ALA, you may be inspired to browse the shelves of your local public library for 746.46, the Dewey Decimal number for quilting.
Subject vocabulary terms
For those readers who appreciate quilting and metadata, the following controlled vocabulary terms reflect concepts discussed in this blog. You might even find it fun to match the concepts to the natural language descriptions!
LibraryThing is pleased to sit down this month with groundbreaking author and poet Cynthia Pelayo, who in 2022 became the first Puerto Rican and first Latina to win a Bram Stoker Award after her Crime Scene took the prize in the Poetry Collection category. Her Into The Forest And All The Way Through was a 2020 nominee, also in the Poetry Collection category, while her Children of Chicago was a 2021 nominee in the Novel category. Pelayo earned a BA in Journalism from Columbia College Chicago, an MS from Roosevelt University, and an MFA in Writing from the School of the Art Institute of Chicago. She is currently pursuing a PhD in English. Her MFA writing thesis, Lotería, was republished in 2023, winning an International Latino Book Award Silver Medal in the Best Collection of Short Stories category. A co-publisher of Burial Day Books, which focuses on horror writing, she is the author of numerous other books, stories and poems, including novels such as The Shoemaker’s Magician (2023), Forgotten Sisters (2024), and Vanishing Daughters (2025). Her new novel, It Came from Neverland, a work of horror inspired by the classic Peter Pan, was published by Crooked Lane Books earlier this month. Pelayo sat down with Abigail this month to discuss her new book.
Tell us a little bit about It Came from Neverland. How did the idea for the story first come to you?
Like many people, I grew up watching the Disney version of Peter Pan, and then I remember watching Hook with Robin Williams and being captivated, seeing Peter Pan as an adult who had to remember who he was. There was an older Wendy in that film, and that always stayed with me because I really wanted to know what Wendy’s story looked like aged into young adulthood.
When I went back to J.M. Barrie’s Peter and Wendy and the original play, I worked out that Wendy would be in her early twenties at the start of the First World War. And then I learned that many young men at that time lied about their age to enlist; some of them were barely more than boys. It all felt like a perfect juxtaposition, these boys going off to war and Wendy’s trauma around caring for the Lost Boys from Neverland.
A woman of that time period would certainly not be believed if she tried to tell the truth about what she experienced as a child in Neverland, and so that certainly played into her experience. Then I thought well, Wendy at this age would certainly be in a position that reflected her character, and schoolteacher fit perfectly. The story wrote itself, because I knew that Wendy would do all that she could to protect those children and I also knew Peter Pan would surely return to whisk more children away to Neverland. So this story is that tale, what does she do to stop Peter Pan.
What drew you to Peter Pan, and what made you feel it was horror?
Peter Pan without Wendy Darling is just a boy screaming into the dark. The story only works because Wendy agrees to go with him to Neverland, and she goes because she is sweet and kind and she believes him. Wendy’s only failure here was that she had a good heart, which is just so sad because her being nice is why she was taken advantage of. She trusted someone who was manipulating her. Peter told her she was special, but what he really meant was that she was useful. She was given the role of mother to the Lost Boys, not because she was truly loved or valued, but because someone needed to do the mending. This is not fantasy. This is a domestic horror story dressed in fairy dust.
In Peter and Wendy we’re also essentially told that growing up is a curse, but I push back on that. Growing up is the adventure, becoming yourself, and gaining autonomy is the gift.
Peter’s entire pitch was stay here, never change, never leave me. Shrink. Lose yourself to praise me. That’s not love. That’s control.
The horror was always there. I just removed the glitter.
What is it about fairy tales that speaks to you?
Fairy tales were the very first stories I was told as a child. “Little Red Riding Hood,” “Hansel and Gretel,” “Cinderella,” more. I hold all of them dear, even though many of them have a thread of terror, but I suppose that’s why I’m the writer I am today.
What I’ve come to understand is that fairy tales are an early societal warning system, in a way. They prepared us for danger, and that’s why they still work. Little Red Riding Hood is everyone who has been told not to talk to strangers on their way home. Bluebeard is everyone who has been warned to be cautious with suitors. Snow White is every person who has fallen victim to the cruelty of jealousy. These stories survived centuries because beneath them there is some truth that can be applied to many of our experiences today. They encode the things that we should say out loud, but don’t because of all of those strange polite society rules, things like – don’t trust the stranger who flatters you, the beautiful thing is likely the trap, or even, the person who promises you forever can very well be the one who seeks to destroy you.
When looking at all of these through a horror lens, they echo to the horror writer what our job is – and that is to tell the truth. Horror is the genre of truth, to highlight the danger, to be a witness to survival, more. So much of what fairy tales do speaks to this.
Are there other classic works you’re interested in transforming?
Yes, and the one upcoming is a Frankenstein retelling titled Everina from Union Square & Co. I have more, but I can’t mention those quite yet.
Tell us a little bit about your writing process.
I generally write in the early morning. In the evenings that’s when I tend to answer email or work on lectures for any workshops I’m teaching or work on any of my own homework.
In terms of my actual writing sessions, I read before I write; generally I will read some poetry and then start writing. If you’re asking about the big questions regarding how I create something, I think I’m a pretty methodical writer in that yes, I allow discovery to happen, but as of late, there is a lot of researching, planning, and pre-writing that goes into my actual writing. Then there is editing, and that is an entirely different process. I tell writers to think of these processes as three separate demands: research and preparation require one aspect of your brain, editing requires a different aspect of your brain, and the actual writing is a completely different process. When you are writing you are creating, so allow yourself to have fun and explore when you’re actually writing.
What comes next for you?
Something Followed Us Home: Tales of Latiné Horror comes out September 29 from Simon & Schuster. It’s an anthology I edited that features Mariana Enríquez, Agustina Bazterrica, Mónica Ojeda, Isabel Cañas, Daniel José Older, Zoraida Córdova, and others, with a foreword by Brenda Lozano. Latiné horror isn’t a subgenre, it’s a tradition that I’m grateful to have had the opportunity to share with readers.
After that, Everina, which is my Frankenstein retelling.
Tell us about your library. What’s on your own shelves?
What have you been reading lately, and what would you recommend to other readers?
I am reading Piranesi by Susanna Clarke. I love world building, and here, we have a world that operates on its own logic, the architecture of the space, and the feelings it evokes. Clarke gives us infinite halls that become both prison and sanctuary, and it’s this tension that I’m drawn to as both a reader and writer.
Rockets, unlike other anthropogenic pollution sources, emit gaseous and solid chemicals directly into the upper atmosphere. We compile inventories of these chemicals from rocket launches in 2019 and projections of future growth and speculative space tourism activity. We incorporate these in a 3D atmospheric chemistry model to simulate the impact on climate and the protective stratospheric ozone layer. We find that loss of ozone due to current rockets is small, but that routine space tourism launches may undermine progress made by the Montreal Protocol in reversing ozone depletion in the Arctic springtime upper stratosphere. The BC (or soot) particles from rockets are also of great concern, as these are almost five hundred times more efficient at warming the atmosphere than all other sources of soot combined.
Note that even four years ago it was already clear that the space industry was both depleting ozone and aggravating global warming. But this was before the scale of the proposed mega constellations was evident.
So far, models of spacecraft reentry have focused on understanding the hazard presented by objects that survive to the surface rather than on the fate of the metals that vaporize. Here, we show that metals that vaporized during spacecraft reentries can be clearly measured in stratospheric sulfuric acid particles. Over 20 elements from reentry were detected and were present in ratios consistent with alloys used in spacecraft. The mass of lithium, aluminum, copper, and lead from the reentry of spacecraft was found to exceed the cosmic dust influx of those metals. About 10% of stratospheric sulfuric acid particles larger than 120 nm in diameter contain aluminum and other elements from spacecraft reentry. Planned increases in the number of low earth orbit satellites within the next few decades could cause up to half of stratospheric sulfuric acid particles to contain metals from reentry.
Much of the reentry burn happens above the stratosphere, and it takes time for the aluminum nanoparticles to drift down to the levels where they were collected. So the 10% number represents pollution from an earlier period with fewer reentries than the 2020s. Murphy notes that:
Most of the meteoric mass is deposited at altitudes between 75 and 110 km by a very large number of sub-millimeter meteoroids. Reentering spacecraft, which are larger and moving more slowly, ablate between 40 and 70 km over a ~300 km long footprint
This paper investigates the oxidation process of the satellite's aluminum content during atmospheric reentry utilizing atomic-scale molecular dynamics simulations. We find that the population of reentering satellites in 2022 caused a 29.5% increase of aluminum in the atmosphere above the natural level, resulting in around 17 metric tons of aluminum oxides injected into the mesosphere. The byproducts generated by the reentry of satellites in a future scenario where mega-constellations come to fruition can reach over 360 metric tons per year. As aluminum oxide nanoparticles may remain in the atmosphere for decades, they can cause significant ozone depletion.
Ferreira et al confirm the potentially long delay between reentry and the nanoparticles reaching the ozone layer and depleting it:
we find that these reentry byproducts may take up to 30 years to settle from the top of the mesosphere into the stratospheric ozone layer. Upon reaching an altitude of about 40 km, aluminum oxides catalyze chlorine activation which promotes ozone depletion. This suggests that concentrations of aluminum oxide compounds may start increasing in the mesosphere well before reaching the stratospheric ozone layer. This would introduce a noticeable delay between the beginning of the injection process when orbiting bodies are decommissioned and the eventual ozone-depletion consequences in the stratosphere.
A lack of observations and validated models of reentry demise limits our ability to simulate the complex aerosols associated with reentry, which makes estimating the climate impacts difficult. Aluminum is a primary satellite component and will likely be emitted during reentry vaporization in the form of alumina. Unmodified alumina is a useful approximation for metallic reentry aerosol. In this study, we simulate a potential yearly emission of 10,000 metric tons of alumina from reentering space debris. We investigate how the location of atmospheric accumulation, aerosol size distribution, and radiative properties of reentry alumina impacts the middle atmosphere. We find that 20,000–40,000 metric tons of alumina accumulates at high latitudes between 10 and 30 km in both hemispheres. Small changes in mesospheric heating rates lead to 1.5-K temperature anomalies in the middle atmosphere at high latitudes. These temperature anomalies are accompanied by changes in wind speed in the polar vortex.
So there are thermal effects on the climate as well as the effects on the ozone layer.
To understand if significant ozone losses could occur as the launch industry grows, we examine two scenarios. Our ‘ambitious’ scenario (2040 launches/year) yields a −0.29% depletion in annual-mean, near-global total column ozone in 2030. Antarctic springtime ozone decreases by 3.9%. Our ‘conservative’ scenario (884 launches/year) yields −0.17% annual, near-global depletion; current licensing rates suggest this scenario may be exceeded before 2030. Ozone losses are driven by the chlorine produced from solid rocket motor propellant, and black carbon which is emitted from most propellants. The ozone layer is slowly healing from the effects of CFCs, yet global-mean ozone abundances are still 2% lower than measured prior to the onset of CFC-induced ozone depletion. Our results demonstrate that ongoing and frequent rocket launches could delay ozone recovery. Action is needed now to ensure that future growth of the launch industry and ozone protection are mutually sustainable.
Note that this paper addresses only the ozone depletion from launches, not from reentry. But their 'ambitious' scenario of 5.6 launches/day is far short of Musk's ambitions, let alone the other planned megaconstellations. My understanding is that the 2040 launches/year in their scenario are of Falcon 9 class vehicles but "only 4.4% of launches are using vehicles designed for re-entry", which is implausible. But the mega-constellations can't be built or maintained with Falcon 9s.
To achieve that, they would need to launch 120,000 satellites per year. Over the 15 years, they would launch 1.8 million satellites, but 800,000 of them would fail (as part of our 9% failure rate), leaving a total operational fleet of one million satellites. This equates to 3,158 Starship launches per year, or nearly nine launches per day. For some context, the current launch rate for Starship is just five per year.
...
In order to keep a million satellites in the constellation, it needs to be maintained. So, each year, SpaceX would have to launch 90,000 AI Sat Minis to replace the roughly 9% of the constellation that failed. That equates to 2,368 Starship launches per year, or 6.4 per day.
That's 9 launches/day for 15 years, then 6.4 launches/day indefinitely, of a rocket that is vastly bigger than Falcon 9 and is completely re-usable.
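The arithmetic behind these figures can be checked with a few lines of Python. This is only a sanity check that takes the quoted numbers at face value and assumes a 365-day year; the variable names are mine. It suggests the quote implies about 38 satellites per Starship in both phases, and that 2,368 launches per year works out closer to 6.5 per day than the quoted 6.4.

```python
DAYS_PER_YEAR = 365  # assumption: launches spread evenly over a calendar year

# Figures taken at face value from the quoted text above.
build_years = 15
build_sats_per_year = 120_000
build_launches_per_year = 3_158
failed_sats = 800_000
failure_rate = 0.09
upkeep_launches_per_year = 2_368
paper_ambitious_launches_per_year = 2_040

launched = build_years * build_sats_per_year          # 1,800,000 satellites
fleet = launched - failed_sats                        # 1,000,000 operational
upkeep_sats_per_year = round(fleet * failure_rate)    # 90,000 replacements

print(round(build_sats_per_year / build_launches_per_year, 1))     # 38.0 per launch
print(round(upkeep_sats_per_year / upkeep_launches_per_year, 1))   # 38.0 per launch
print(round(build_launches_per_year / DAYS_PER_YEAR, 1))           # 8.7 per day
print(round(upkeep_launches_per_year / DAYS_PER_YEAR, 1))          # 6.5 per day
print(round(paper_ambitious_launches_per_year / DAYS_PER_YEAR, 1)) # 5.6 per day
```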
Of course, these claims are ridiculous - neither logistically nor economically feasible. But assuming Starship or a competitor such as Blue Origin does manage to create a reliable, reusable, 100 ton to LEO launch vehicle, there will be a lot more mass in LEO and a lot more of it reentering.
A 10-fold enhancement of lithium atoms was detected at 96 km altitude by a resonance lidar at Kühlungsborn, Germany, approximately 20 hours after the uncontrolled re-entry of a Falcon 9 upper stage. The upper-atmospheric extension of the ICON general circulation model, nudged to ECMWF, was used to calculate winds. Backwards trajectories, including wind variability as measured by radar, traced air masses to the Falcon 9 re-entry path at 100 km altitude, west of Ireland. This study presents the first measurement of upper-atmospheric pollution resulting from space debris re-entry and the first observational evidence that the ablation of space debris can be detected by ground-based lidar. The analysis of geomagnetic conditions, atmospheric dynamics, and ionospheric measurements supports the claim that the enhancement was not of natural origin. Our findings demonstrate that identifying pollutants and tracing them to their sources is achievable, with significant implications for monitoring and mitigating space emissions in the atmosphere.
The effect of lithium and other spacecraft ingredients on the ozone layer doesn't appear to have been studied as much as that of aluminum. To be fair, there will be a lot more aluminum.
We use a global inventory of launch and re-entry emissions covering the onset of the megaconstellation era (2020–2022), and project these to 2029 based on 2020–2022 growth rates. We implement this inventory into a 3D atmospheric chemistry model to determine the impacts of megaconstellations on the ozone layer and climate. We find that global stratospheric ozone depletion from all mission types is relatively small compared to surface sources and megaconstellation missions only account for about one-tenth of this depletion. This is because rockets launching megaconstellations almost all use kerosene, a large source of black carbon or soot particles, but not of chemicals such as chlorine that directly destroy ozone. Soot from rockets absorbs sunlight, warming the upper layers of the atmosphere and decreasing the amount of sunlight reaching Earth's lower atmosphere, causing it to cool. Megaconstellation missions are responsible for about half of this climate effect. In this regard, rockets launching megaconstellations and other missions are like small-scale stratospheric aerosol injection experiments without forethought for potential unintended consequences.
Again, this paper addresses only atmospheric impacts from launches, not from reentries. And, the launch rate for 2020-2022 is far less, and uses much smaller rockets, than the proposed "million satellite data center" and its competitors.
The open-source movement emphasizes the power of freely modifiable, flexible code to support transparency, collaboration, and building outside vendor lock-in. Open hardware extends that logic to the physical layer: chips you can read, modify, and build on. All software runs on hardware, and over the past few years, the ground under the hardware industry has been shifting. Since 2022, the United States, the Netherlands, and Japan have progressively tightened export controls on advanced chips and the equipment used to manufacture them; China has responded with a state-backed effort to reproduce every layer of that supply chain at home. Policy analysts now routinely describe the trajectory as a “fragmentation” or “decoupling” of the global semiconductor market into separate technology spheres. Hardware costs are climbing across the board, so accessing open hardware feels all the more relevant for a group building open-source software.
Open hardware doesn’t make the chips any cheaper. What it changes is what a chip you already own is allowed to become or what kinds of application-specific chips you can create. A single FPGA (Field-Programmable Gate Array) on a shelf can be a video codec today, a custom search accelerator tomorrow, and a faithful copy of a decommissioned architecture the day after that. The unit cost is what it is; the value you can extract from that unit is no longer fixed by a vendor. For institutions whose time horizons stretch over decades, like libraries, archives, and public-interest research groups, that reconfigurability is the core of the open-hardware argument, because its value compounds over time.
There are no known open-source hardware implementations of Google’s TPU, the in-house chip family Google designed to accelerate neural network math. The silicon is proprietary and not directly purchasable by consumers except in edge TPU form (e.g., Google Coral). This field note reports a small experiment porting the OpenTPU project, a Python simulation of Google’s TPU published by UCSB’s ArchLab, to an inexpensive FPGA board to explore the pros and cons of using more open hardware. To keep the experiment lightweight, most of the code was written by AI coding agents, with a human directing the work.
Why FPGAs, For A Lab Like This One
An FPGA (Field-Programmable Gate Array) is a chip whose internal logic can be reconfigured. You describe the circuit in a hardware description language such as SystemVerilog, compile it to a bitstream, and load it onto the board. A CPU runs your program. An FPGA becomes your program. FPGAs are the obvious place to start, because they’re the one piece of reconfigurable silicon a small lab can actually buy and program today.
The practical difference lies in the shape of the problems they solve. A CPU is a generalist reading a manual one step at a time. A GPU is a factory floor of thousands executing the exact same standard math in unison. An FPGA is a machine whose gears are configured to fit exactly one algorithm. They excel at problems with strange shapes. If your workload involves multiplying massive, uniform matrices to train a language model, you want a GPU. But if your work requires parsing millions of irregular text strings as they stream, finding exact bit-level matches across a sprawling archive, or piping data through a custom hash without ever pausing to fetch instructions from memory, an FPGA is arguably the more elegant approach. A dataset is sometimes only as legible as the software that reads it, and that software is only as runnable as the hardware underneath. An open FPGA design can preserve a faithful copy of obsolete circuitry, thereby preserving the means to read data, not just the data itself.
For an institution like the lab, that distinction matters. Libraries, archives, public-interest research groups, and smaller labs deal with workloads and questions that are awkwardly sized: too large for a laptop, too specialized to deserve a recurring cloud bill, too long-running for any one grant cycle. For a well-resourced lab, the answer is cloud GPUs. For everyone else, the answer has historically been to wait or to scale down the question. That is why open hardware matters here: it gives smaller institutions a path to specialized computing without waiting for ideal market conditions.
A $300 board on a desk can run a custom-designed circuit, tailored to one workload, indefinitely, without anyone’s permission and without a metered bill. A physical circuit sheds the overhead of an operating system and the constant fetch-and-exec cycle, drawing a fraction of the power of a standard CPU. For a public-interest lab, this efficiency makes running some workloads both financially and environmentally sustainable.
TPU-style architectures are particularly well-suited to this kind of work because of the systolic array at their core. A systolic array is a grid of small multiply-add units that pass partial results to their neighbors on every clock tick, making it efficient for matrix math because data flows through the grid rather than being fetched repeatedly from memory. That structure is designed for exactly the dense-matrix operations that underpin embedding generation, similarity search, and the neural network inference behind modern document analysis. This is the kind of regular, spatial structure FPGAs are built to host. A grid of identical small units, wired to their neighbors, all ticking together, is close to a literal description of what an FPGA’s fabric already is.
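To make the data flow concrete, here is a minimal software sketch of a weight-stationary systolic array, written for illustration only. It is not taken from OpenTPU or from the port described here; the function name `systolic_matmul` and the register layout are assumptions. Weights sit still in the grid, activations move one cell to the right per tick, and partial sums move one cell down per tick.

```python
def systolic_matmul(X, W):
    """Compute X @ W by simulating a K x N grid of multiply-add cells.

    X is M x K (one input vector per row), W is K x N (one weight per cell).
    Illustrative model only: real hardware does all cells of a tick in parallel.
    """
    M, K, N = len(X), len(W), len(W[0])
    a_reg = [[0] * N for _ in range(K)]  # activation leaving each cell, heading right
    p_reg = [[0] * N for _ in range(K)]  # partial sum leaving each cell, heading down
    Y = [[0] * N for _ in range(M)]

    for t in range(M + K + N - 2):       # ticks until the last result drains out
        new_a = [[0] * N for _ in range(K)]
        new_p = [[0] * N for _ in range(K)]
        for k in range(K):
            for j in range(N):
                if j == 0:
                    # Left edge: row k receives vector m = t - k (skewed input).
                    m = t - k
                    a_in = X[m][k] if 0 <= m < M else 0
                else:
                    a_in = a_reg[k][j - 1]          # from the left neighbor
                p_in = p_reg[k - 1][j] if k > 0 else 0  # from the neighbor above
                new_a[k][j] = a_in
                new_p[k][j] = p_in + a_in * W[k][j]
        a_reg, p_reg = new_a, new_p

        # The bottom row emits a finished dot product for vector m in column j.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                Y[m][j] = p_reg[K - 1][j]
    return Y


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4], [5, 6]], [[1, 0, 2], [0, 1, 3]]))
```

Each cell only ever reads its left and upper neighbors, which is why the structure maps so directly onto a grid of identical hardware blocks.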
Designs published openly in SystemVerilog are reusable across institutions, as open code is. What has kept this out of reach was never the silicon; it was the labor: months of vendor-tool learning, a small group of knowledgeable practitioners, and debugging cycles unique to physical systems. That cost is what AI coding agents have started to chip away at, making the open-hardware case more practical.
Porting OpenTPU and the Silicon Boundary
OpenTPU is a published academic re-implementation of Google’s first-generation TPU, written as Python code that models the hardware’s behavior; it is not software meant to run on a TPU. The hardware target of this experiment was the Alchitry Pt v2, a roughly $300 board built on a Xilinx Artix-7 chip, with an add-on that exposes USB 3.0 to a host PC. I worked with AI coding agents to describe goals in plain English, review edits, and iterate. A sandbox repository came first: blink an LED, echo bytes, talk to the USB chip. That groundwork paid off within hours of starting the real port. The OpenTPU translation itself (the systolic matrix unit, the weight memory, the instruction decoder, the activation logic, and the host interface) went through seven planned phases. A few days of wall-clock time later, the SystemVerilog testbench ran the same matrix-multiply program as the original Python simulator and produced the same answer.
Source: Jenevieve Haggard, HLS LIL
Translating Python that describes hardware into SystemVerilog that synthesizes hardware turns out to be something AI agents are somewhat capable of. The source is unambiguous, and the testbenches give instant feedback. The hard part was always the silicon boundary: a physical board, a vendor toolchain with subtle caching behavior, a USB chip with several gotchas. What worked was a scaffold around the agent that provided deterministic simulation tests for everything testable in software, plus on-board LEDs wired to specific finite-state-machine states, so a human eye could see what the testbench could not. A small notes tool called “cq” (by Mozilla) recorded what each session learned and made future sessions read those notes first. By the end, the store held 48 entries, almost all painfully, expensively earned.
Performance Realities and the USB Latency Bottleneck
By the end, the FPGA produced output that matched a NumPy reference lane-for-lane on a 200-vector benchmark. End-to-end, it was about 2x faster than the original Python simulator; on compute alone, about 4x. It was also comfortably slower than a CPU running optimized BLAS on small inputs. The point of the result is not that the FPGA wins everywhere, but that it shows where open hardware can be useful.
At this workload size, USB round-trip latency dominates the FPGA’s time budget. FPGAs win when data is moved to the device once, stays there, and is processed many times. Tiny inputs in, tiny outputs out, on every call, is the worst case for this design. We are running the worst case since we’re just at the “make it work” stage. PCIe-attached FPGAs sit much closer to the CPU and avoid this bottleneck entirely; with batched workloads on that kind of board, the compute-side 4x advantage we already see should carry through to the end-to-end number. They’re the natural next step for any workload where the dev-board numbers are encouraging enough to justify the cost.
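A back-of-the-envelope model shows why batching matters so much. The numbers below are invented for illustration and are not measurements from this experiment; the function name and parameters are likewise assumptions. The model charges one fixed round trip per call plus a per-vector compute cost.

```python
import math


def end_to_end_seconds(n_vectors, batch_size, round_trip_s, compute_s_per_vector):
    """Total time when each call pays one fixed round trip plus compute."""
    calls = math.ceil(n_vectors / batch_size)
    return calls * round_trip_s + n_vectors * compute_s_per_vector


# Hypothetical costs: 1 ms per round trip, 10 microseconds of compute per vector.
ROUND_TRIP_S = 1e-3
COMPUTE_S_PER_VECTOR = 10e-6

if __name__ == "__main__":
    for batch in (1, 20, 200):
        total = end_to_end_seconds(200, batch, ROUND_TRIP_S, COMPUTE_S_PER_VECTOR)
        print(f"batch={batch:>3}: {total * 1e3:.1f} ms")
```

Under these assumed costs, one vector per call spends almost all of its time on round trips, while a single batched call is dominated by compute. That is the regime where a compute-side speedup can show up end to end.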
Open Hardware and the Lowered Cost of Specialization
The lab occasionally runs computations large enough to be uncomfortable: searches over case law corpora and analyses across millions of documents. A pattern for designing a small piece of custom hardware that does those tasks well is a real option for problems that do not fit on a laptop and do not deserve a cloud bill.
More than preservation, open hardware offers freedom of access and the power to control your own collections, your own work, and your own computing. That opens up questions we are only beginning to ask. As archival media like silica and DNA-based storage mature, could open hardware lower the cost or complexity of building the readers required for those formats? As AI model architectures keep shifting, could a reconfigurable board keep pace in a way fixed silicon can’t? FPGAs already take on inference tasks at places like CERN (see their FPGA Developers Forum), adapting to software advances with matching hardware; could FPGAs do the same for our tasks?
AI coding agents are reducing the cost of specialized work that used to require a dedicated team. Custom hardware has historically been among the most specialized. If a small library research lab can produce a working SystemVerilog port of a real architecture in a few weeks, the question of what else has quietly come into reach is suddenly much broader.
The author examines the migration of Indiana University Libraries’ interlibrary loan platform, ILLiad, from a locally-hosted server to OCLC hosting through the perspective of a new department head inheriting this critical technology decision. He explores how staffing changes, lost institutional knowledge, recurring system instability, and limited technical capacity prompted a reassessment of long-standing local practices. The piece outlines research, consortium consultation, approval processes, implementation challenges, authentication and workflow issues, and post-migration tradeoffs. Ultimately, the author offers practical guidance for new leaders tasked with managing inherited systems, vendor relationships, imperfect information, and strategic change in complex academic library environments.
While incarcerated students face many challenges when commencing higher education, a lack of access to the internet is a considerable barrier. This technological exclusion has implications for the delivery of course materials, most of which are offered only electronically. A project team from Curtin University Library sought to understand and address the challenges faced by incarcerated students in accessing library services, particularly ebooks and audiovisual content. It was found that restrictions related to contract terms, digital rights management, and copyright contribute to a reactive and uncertain situation for library services. This article outlines the state of the problem and offers possible pathways academic libraries can take to improve the state of information access for incarcerated students.
Countless research questions arise when investigating connections between library resource discovery and student success. Existing literature explores best practices of database description language and style, the usability of database A–Z lists, and library resource jargon. Academic libraries continue to grapple with these challenges in resource discovery, even as online searching behavior evolves and new research tools emerge. A research team at the University of Arizona Libraries builds on the literature by examining these topics with a focus on the impact of a user’s academic discipline, university affiliation (faculty, staff, or student), and research experience on their understanding of database terminology, resource content and applications, and A–Z list type filters. The authors conducted an environmental scan of library websites along with several usability tests to identify and reduce library and disciplinary jargon on their A–Z list to make databases more understandable and approachable to all users. This article presents the results of these assessments as a case study for exploring external and internal factors that impact users’ understanding and discovery of databases.
By April 2027 and 2028, institutions covered by Title II of the Americans with Disabilities Act are expected to be legally required to ensure that digital content created or used at the institution is accessible as defined by Web Content Accessibility Guidelines (WCAG) 2.1 Level AA. The new law strongly emphasizes accessibility of course materials—including PDFs. This case study demonstrates how an R2 academic library staff can enhance the accessibility of PDF course materials by improving the accessibility of electronic reserves (e-reserves) PDFs at Hunter College Library (HCL).
Processes described here can be adapted by other libraries. Supporting campuses’ work to make course readings accessible may be a natural role for academic libraries. Locating or procuring the best quality version of a text available to the institution is a critical task for which libraries are optimally equipped. Furthermore, when readings are available only in print format, libraries can create higher-quality scans than those typically produced when the task is left to individual faculty members.
HCL began improving the accessibility of e-reserves PDFs in 2020. This article shares the knowledge acquired, established processes, limitations, and future directions. The workflow comprises checking each e-reserves reading. For those deemed poor, we locate an HCL collection or open access copy, purchase a digital copy, or remediate. Remediation involves optical character recognition (OCR), fixing errors therein, correcting reading order, removing repetitive headers and footers, and tagging. Literature the authors found on libraries proactively correcting OCR and tagging PDFs—that is, preceding a user’s request—was sparse, with the exceptions of the University of Toronto and the University of Michigan. Literature about proactively doing so for e-reserves was even narrower. This case study is intended to help fill the gap.
This study evaluates the performance of four generative AI models—ChatGPT, DeepSeek, Gemini, and Copilot—in generating descriptive metadata for bibliographic resources. Models were tested on a small, diverse set of resources using four prompt types: a basic prompt, a basic prompt with an example, a detailed prompt referencing Resource Description and Access (RDA) guidelines, and a detailed prompt with an example. Results show that both detailed RDA guidance and the inclusion of sample outputs improved metadata quality, particularly in formatting and field structure. While DeepSeek and ChatGPT showed better performance on the tasks, all models displayed limitations in parsing and following the prompts, using descriptive metadata fields, analyzing subject headings, and assigning URIs. These findings suggest that while generative AI holds potential to assist in metadata creation, its current capabilities fall short of meeting cataloging standards without human review.
One of the generative artificial intelligence tools developed for use in libraries, including academic libraries, is the AI Primo Research Assistant. Of the 65 academic libraries in Poland, only 19 have access to software that supports this tool. In practice, only 9 libraries have implemented it (data from March 2025). For the purposes of this study, original research was conducted to assess the implementation status of the Primo Assistant in academic libraries in Poland. Two anonymous surveys were developed for this purpose and sent to libraries that had implemented the feature, as well as to those with the capability to run the Primo Assistant (i.e., the Primo VE Discovery admin role), in order to gather information on why they had chosen not to implement it. The analysis revealed several positive aspects, mainly a reduction in the workload of staff tasked with preparing publication lists on topics requested by library users. Some concerns were also raised by library employees, mainly regarding the reliability of the metadata provided and the accuracy of the recommended publications. The study also revealed a general lack of awareness and a need for further implementation. This paper presents the first scientific study focused on the implementation of the AI Primo Research Assistant in Polish academic libraries.
Effective information technology (IT) governance is essential for the University of Riau (UNRI) Library to achieve its research and educational objectives. This paper presents a qualitative pilot study investigating the library’s current IT governance processes, focusing on two COBIT 5 processes—DSS01 (Manage Operations) and DSS05 (Manage Security Services). These processes were selected in consultation with library and IT leadership due to their direct relevance to ensuring operational reliability and safeguarding the library’s information assets. COBIT 5 principles and capability models guide the assessment, emphasizing regulatory compliance, performance monitoring, and stakeholder collaboration. Using a detailed questionnaire and capability model, the study evaluates base practices and work products for DSS01 and DSS05. Results indicate varying proficiency levels, with DSS01 at level 0 and DSS05 at level 1, highlighting significant gaps between current and desired capability levels. Recommendations include implementing standard operating procedures, enhancing security measures, and optimizing resource management. In conclusion, the findings underscore the need for standardized processes, continuous monitoring, and alignment with established frameworks like COBIT 5. By addressing identified gaps and implementing recommended improvements, the UNRI Library can strengthen its IT governance, enhance operational efficiency, and better support its academic mission.
This study critically explores the transformative potential of human-computer interaction (HCI) in reimagining African public libraries as dynamic, user-centered, and culturally grounded spaces. Based on a literature review and comparative analysis of libraries across several African countries, the research investigates how HCI principles can enhance user engagement, usability, and inclusivity, particularly in multilingual, resource-constrained, and postcolonial contexts. The paper situates libraries as sociotechnical infrastructures that mediate between technology, local knowledge systems, and community needs, and argues for the importance of participatory and culturally responsive design approaches in library digitization efforts. The findings highlight significant gaps in current implementations of HCI within library services, including the lack of localized interfaces and limited user involvement in design processes. The study concludes by offering practical recommendations for integrating HCI into library development strategies and advocating for the co-creation of digital public spaces that reflect and empower Africa’s diverse knowledge ecologies. In doing so, the paper contributes to the growing discourse on decolonial approaches to technology and the future of public libraries in the digital age.
Writing has been light around here recently for a wonderful reason: our twins graduated from their respective colleges over the past month, and we have been in nearly nonstop revelry (and packing, and schlepping…). We are so fortunate to have two great kids; I’m super proud of them.
Speakers at our kids’ commencements, thankfully and remarkably, said little about artificial intelligence, but they did talk a lot about the complex circumstances and especially the psychology of this rising generation, and offered advice on how the graduating seniors should move forward in life given significant headwinds. I suppose it’s tempting to describe and analyze the troubles facing each graduating class, and provide sage guidance in response to the historical moment, but I’m not sure that my kids, their friends, and their generation overall are so very different from any other, or that any distinct advice is needed.
The Great Class of 2026 is, I’m afraid, just like every graduating class: happy and sad, confused and hopeful about the future, striving and procrastinating. Young adults, in other words. Sure, they seem to be impacted by new technology and our dreadful national politics and nerve-racking global challenges, but hasn’t it always been so? My college class graduated into a recession, the rise of the internet, the fall of the Berlin Wall, the chaotic end of the Soviet Union, and a messy war in the Middle East — all of these dominoes falling after a childhood in which we were fairly sure we would perish at any moment in a nuclear war. That was a lot to absorb! Back then, commencement speakers picked up on our anxiety, which had apparently morphed into excessive irony and a general lack of motivation, epitomized by the title and content of a Richard Linklater film: Slacker.
It may have taken some time, but we muddled through. So did the generation another turn of the clock back from ours (Vietnam, stagflation, etc.) and the generations before that (pick your World War and/or the Great Depression, etc.). History is, unfortunately, a procession of horrible developments, but also a showcase of astonishing resilience and creativity. Is it so Pollyannaish to simply say that Gen Z will also find a way forward, and frankly might be better off without pithy advice from the olds? Must we unconsciously mimic the opening of Woody Allen’s fictional commencement address, raising the graduating class’s blood pressure by declaring, “More than at any other time in history, mankind faces a crossroads. One path leads to despair and utter hopelessness. The other, to total extinction. Let us pray we have the wisdom to choose correctly”?
Instead, I saw hope in every joyful row of begowned seniors, students who, despite all of the radical changes and stressful tensions around them, had nevertheless maintained their curiosity and maybe even cultivated a passion during college. Students who found their special niche in music, writing, art, or science, who felt compelled to listen to it all, read it all, see it all, or experiment late into the night, regardless of the requirements of the classroom. I have a feeling that this kind of deep and abiding engagement, born not from careerism but from genuine profound interest, will serve these graduates well in the years ahead. As it always has.
Books I Have Not Written
The class-action lawsuit of authors against Anthropic and its subsequent settlement have helpfully informed me of the many, many other writers named Daniel Cohen, because the settlement administrators, in their quest to match authors and texts, have sent emails and letters asking if I am the Dan Cohen who wrote this or that book. There are too many volumes by The Daniel Cohens to list in full here, but as a public service to a handful of special fellow Dans, I hereby declare:
I am not the Daniel Cohen who wrote The Monsters of Star Trek, but I would wager 100 quatloos on Triskelion that I would greatly enjoy meeting that Dan Cohen.
I am #$%@# mad I am not the Daniel Cohen who penned Famous Curses, because my family is on a mission to bring back the useful exclamation “Gordon Bennett!”
I did not write Southern Fried Rat and Other Gruesome Tales, but, based on the delightful cover of this not-me Daniel Cohen book, I probably read it at camp the year it was published.
My final confession: The settlement administrators believe there is a Daniel Cohen who authored a book titled Final Confession, but, alas, I am not the one.
English Edition: floppy disks, hard drives, CDs, DVDs, SSD drives - no
matter what you choose to store your data on - ultimately they all
decay. With my guests Callum McKean, Leontien Talboom and Adrian
Page-Mitchell, we’re going to talk about what kinds of data we find on
old drives, why we want to get them in the first place, and what can go
wrong with the storage media. To all of you who love all things retro -
we’ll be talking about floppy disks a bit.
I run a RAG application for Italian pension and tax consultants. Users
ask questions about INPS, professional pension funds, laws and
regulations, and the app answers using a knowledge base of uploaded
documents.
For a long time the app used the classic single-shot RAG pipeline: take
the question, search the database, stuff the results into a system
prompt, ask the model. It works, but it has a hard limit: the retrieval
happens once, before the model has any chance to reason about the
question. If the first search misses, the answer is bad and there is
nothing the model can do about it.
So I rebuilt the pipeline as an agent. Now the model drives the
retrieval itself: it decides what to search, reads the results, searches
again with different terms, follows cross references between documents,
and only then writes the answer. All in plain Ruby, with RubyLLM and
Rails. No LangChain, no Python sidecar.
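The difference between the two designs can be sketched with a toy example. This is not the application's code (which is Ruby with RubyLLM); it is a Python illustration in which the knowledge base is invented, `search` is a naive substring match, and the hypothetical `follow_references` function stands in for the decision a real model would make through tool calls.

```python
import re

# Invented two-document knowledge base with one cross reference.
KNOWLEDGE_BASE = {
    "circular-1": "Questions about leaving work early are covered by regulation-7.",
    "regulation-7": "A minimum contribution period applies before benefits start.",
}


def search(query):
    """Naive retrieval: substring match on document id or text."""
    q = query.lower()
    return [(doc_id, text) for doc_id, text in KNOWLEDGE_BASE.items()
            if q in doc_id or q in text.lower()]


def single_shot(question):
    """Classic pipeline: retrieve once, before any reasoning happens."""
    return [doc_id for doc_id, _ in search(question)]


def follow_references(question, seen):
    """Stand-in for the model: ask for any cited document not yet read."""
    for text in seen.values():
        for ref in re.findall(r"regulation-\d+", text):
            if ref not in seen:
                return ref
    return None  # nothing left to look up, ready to answer


def agentic(question, decide, max_steps=5):
    """Agent loop: search, let the decider pick the next query, repeat."""
    seen = {}
    query = question
    for _ in range(max_steps):
        for doc_id, text in search(query):
            seen[doc_id] = text
        query = decide(question, seen)
        if query is None:
            break
    return list(seen)


if __name__ == "__main__":
    print(single_shot("leaving work early"))
    print(agentic("leaving work early", follow_references))
```

The single-shot call only ever sees the circular, while the loop goes on to read the regulation it cites. The `max_steps` cap is a design choice in this sketch to keep a confused decider from searching forever.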
In this article I will show you exactly how it works, with the real code
from my application. One note before we start: since the app serves
Italian consultants, all the prompts, tool descriptions and user-facing
strings are in Italian in the real codebase. I translated them to
English here so you can follow along, but the structure is identical.
Wikimedia and GLAM institutions share a challenge. How do we make
cultural heritage collections accessible at scale without sacrificing
quality, provenance, sustainability, or community control? The
International Image Interoperability Framework, IIIF, is now used by
thousands of institutions to serve high-resolution media through open
standards. Wikimedia does not currently integrate IIIF in its core
architecture. Should it?
Since 2023, Montgomery Planning staff have been working on the Eastern
Silver Spring Communities Plan, drafting recommendations on zoning and
land use, transportation, housing, parks and the environment, economic
development and urban design. The plan is expected to set a vision for
the area’s future development for decades to come. The plan is bordered
by Colesville Road, University Boulevard and New Hampshire Avenue and
will include three future Purple Line stations, the Piney Branch Road,
Long Branch and Manchester Place
Design is broken. Young and not-so-young designers are becoming
increasingly aware of this. Many feel impotent: they were told they had
the tools to make the world a better place, but instead the world takes
its toll on them. Beyond a haze of hype and bold claims lies a barren
land of self-doubt and impostor syndrome. Although these ‘feels’ might
be the Millennial norm, design culture reinforces them. In conferences
we learn that “with great power comes great responsibility” but, when it
comes to real-life clients, all they ask is to “make the logo bigger.”
On our strictest tests, Gemini 3 achieved a CER of 1.67% and a WER of
4.42%. On these tests, any difference between the ground truth and test
texts counts as an error. WER is thus almost always a bit more than
double the CER because if a single character in a word is wrong,
including leading or trailing punctuation like commas, single quotes vs
double quotes, etc, the whole word is marked as an error. On this
measure, Gemini 3 performs nearly 50% better than the best, fine-tuned
specialized models and achieved performance comparable to an early
career, professional human typist.
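For readers unfamiliar with the two metrics, here is a minimal sketch of how strict CER and WER are commonly computed, using edit distance over characters and over whitespace-separated words. It is not the evaluation code behind the figures above, and the function names are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "the quick, brown fox"
    output = "the quick brown fox"  # one missing comma
    print(cer(truth, output), wer(truth, output))  # 0.05 0.25
```

A single dropped comma is one character error out of twenty, but it invalidates one word out of four, which is why WER runs well above CER under strict matching.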
FacilMap is a privacy-friendly, open-source versatile online map that
combines different services based on OpenStreetMap. FacilMap offers the
following features:
Show different map styles, for example maps optimized for driving,
cycling, hiking or showing the topography or public transportation
networks.
Search for places
Show amenities and POIs
Calculate a route, optionally showing the elevation profile.
Find out what is at a particular point on the map
Open geographic files, for example GPX, KML or GeoJSON files
Show your location on the map
Share a link to a particular view of the map.
Add FacilMap as an app to your device.
Change the language settings in the user preferences.
FacilMap is privacy-friendly and does not track you
SQL makes sense. But when it breaks, you reach for EXPLAIN. Vector
search offers no such comfort. Multi-thousand-dimension embeddings,
approximate nearest-neighbour indexes, and quantisation tradeoffs make
it hard to know what your system is doing, and harder still to diagnose
when results quietly degrade. Through interactive visualisations, Simon
Hearne shows what embeddings look like in high-dimensional space, what
quantisation does to your recall, and how to catch retrieval failures
before your agents do. You’ll leave with a sharper mental model and a
diagnostic toolkit for the production problems hardest to see.
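As a rough illustration of the recall question (and not material from the talk), the sketch below compares exact top-k search on random vectors with the same search over quantised copies. It assumes a very simple symmetric scalar quantiser applied per vector; production systems use more elaborate schemes, and the data here is synthetic.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, vectors, k):
    """Indices of the k vectors with the highest dot product to the query."""
    order = sorted(range(len(vectors)),
                   key=lambda i: dot(query, vectors[i]), reverse=True)
    return order[:k]


def quantise(vec, bits):
    """Snap each component to one of 2**bits - 1 evenly spaced signed levels."""
    levels = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits; needs bits >= 2
    scale = max(abs(x) for x in vec) or 1.0   # per-vector scale (an assumption)
    return [round(x / scale * levels) * scale / levels for x in vec]


def recall_at_k(exact_ids, approx_ids):
    """Fraction of the true top-k that the approximate search also returned."""
    return len(set(exact_ids) & set(approx_ids)) / len(exact_ids)


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n, k = 32, 500, 10
    vectors = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    exact = top_k(query, vectors, k)
    for bits in (8, 4, 2):
        approx = top_k(query, [quantise(v, bits) for v in vectors], k)
        print(f"{bits}-bit recall@{k}: {recall_at_k(exact, approx):.2f}")
```

Tracking a recall figure like this against a small exact-search baseline is one way to notice when results quietly degrade.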
Once again I am reminded that modern web tech is amazing, and web
browsers are incredibly capable.
There’s a Screen Capture API to record the screen. You can select a tab,
a window, or the entire screen. The feature has limited browser support
so I don’t think I’d use it in a big web app, but it’s fine for a
one-off screen recording. (I wonder how browser-based video conference
apps like Google Meet do screen sharing? Do they use this API, or do
they use something with wider support?)
TL;DR if you have a TASCAM 788 backup and don’t know how to get the
audio out of it, this script might help. Also: AI tools work best when
paired with expertise.
I needed to take a very personal excursion into digital preservation
recently as I attempted to listen to some audio recordings my brother
John had made about 20 years ago. John died recently,
and is sorely missed by his friends and family.
John was a continuous source of inspiration for me, because of his many
varied interests and projects. One thing he did consistently since he
was a teenager was perform music as a singer-songwriter.
As my family and I went through the very difficult process of emptying
his apartment, we discovered a set of recordings he had made on CD-R.
Three of these CDs were clearly conceived of as albums, and easily
mounted as CDDA
when I popped them in my CD player.
However he also left a binder of CD-Rs, where each CD was neatly labeled
with a song title and a year. All in all there are 108 of them, from the
2003-2008 time period. There is a lot of material on these CDs that is
not present on the three albums. However, when I popped these in my CD
player all I saw was a macOS error dialog box saying:
The disk you attached was not readable by this computer.
John’s binder of CD-Rs
At first I thought they might be damaged or corrupted. But it seemed
unlikely that so many of them would be. After some asking around I got
pointed to two excellent guides to working with CDs:
These guides were great, and did help me extract the raw data from the
CD-R with cdrdao, but ultimately I was unable to determine what format
the data was in using tools like file, Siegfried and DROID.
In a fit of desperation I spent some time in Claude Code trying to see
if it could help me identify what format the data was in. Despite
several forays, it kept going round in circles, burning tokens.
One of those forays led me on a wild goose chase installing an old
version of macOS in order to see if an old version of Retrospect
might be able to read the CDs (it didn’t).
During this time I got some excellent advice over in the Fediverse at
digipres.club. One of
those messages was from Ross Spencer who took a look
at a sample raw CD image. He was able to spot some markers that pointed
to it possibly being a backup from a TASCAM DAW,
specifically a TASCAM 788 (I
believe Ross was using either strings or a hex editor to look
for these clues).
TASCAM 788
Unfortunately, after poking around in various user forums, I discovered
that there were not really any tools for working with TASCAM 788
backups. Everyone seemed to be recommending the purchase of a TASCAM 788
and its CD Burner, since the data was in a proprietary format, and there
were no emulators.
Before dropping some money on eBay I decided to roll the dice with
Claude Code again, but this time with the more specific
guidance that this was likely a TASCAM 788 backup, and asking about
options for recovery. If you are interested you can read the
transcript for this session. The key part of the back and forth for
me was:
The 2488 stores audio as raw 16-bit or 24-bit PCM at 44.1kHz in a
proprietary block structure. Once you identify the byte offset where
audio data starts, you can use Audacity’s “Import Raw Data” with 24-bit
signed big-endian PCM, 44.1kHz, to listen and verify.
I prompted it to try to identify the offset, so I could attempt the
import in Audacity. It did some work writing Python snippets and
executing them for a few minutes, and then output a likely offset. The
first time I read it in I only heard white noise. But after twiddling
some of the import options in Audacity I saw some promising waveforms
appear in the Audacity display. And when I pressed play ✨✨✨✨ instead
of white noise I heard John’s guitar and voice!
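What "Import Raw Data" does can be sketched in a few lines. This is an illustration, not the script from this post: the function name is hypothetical, and the 24-bit signed big-endian default comes from the quoted suggestion above, which needed adjusting before the audio played correctly. Trying the other byte order is the kind of twiddling that turns white noise into a waveform.

```python
def decode_pcm24(raw, byteorder="big"):
    """Interpret raw bytes as consecutive 24-bit signed PCM samples.

    byteorder is "big" or "little"; any trailing partial sample is ignored.
    """
    usable = len(raw) - len(raw) % 3
    return [int.from_bytes(raw[i:i + 3], byteorder, signed=True)
            for i in range(0, usable, 3)]


if __name__ == "__main__":
    data = bytes([0x00, 0x00, 0x01, 0xFF, 0xFF, 0xFF, 0x7F, 0xFF, 0xFF])
    print(decode_pcm24(data))            # [1, -1, 8388607]
    print(decode_pcm24(data, "little"))  # same bytes, very different samples
```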
Audacity screenshot of imported raw data
What appeared to be a single track turned out to be multiple tracks
created with the TASCAM, that were joined together. The final segment
was the completed mix.
I continued to work with Claude on a program that would identify the
offset in the raw CD data, then extract a WAV file, and then extract the
separate tracks, as well as the complete track. It did this by looking
for gaps inside the audio. I put the program here:
Here is the guitar / vocal first track (there are a few seconds of
silence at the beginning):
And here is the mix including percussion and keyboards:
These recordings are Copyright John Summers CC-BY-NC
I have since been able to find John’s TASCAM 788 at my brother Matt’s
house–although it doesn’t have the SCSI external CD burner anymore. So
there’s no way to read the CDs with it.
These CDs and songs are important enough to me that I want to see if the
actual hardware can do a better job of preserving John’s work. So I’ve
got a bid on one of the external CD-Recorder devices I found on eBay.
John clearly spent a lot of time and care taking a snapshot of these
songs he used to perform in coffee shops around Bucks County
Pennsylvania. I plan to release some of them on his Bandcamp, with some
of his artworks as album covers. I want to share them with people who
knew him, and put these songs out into the world in a way that respects
his memory and creative work, while also being something that he just
wasn’t focused on as an artist. For John it was the creative process
itself that mattered most.
None of this will bring John back of course. He’s gone now, and at
peace. But he will always be remembered by those who loved him. Look for
more posts here after I’ve been able to extract these songs in total.
If you manage to have most of your input tokens be cached, you save a huge amount, in this case $0.20 per million tokens. What does this mean though? What does caching do that makes you save so much, in some cases upwards of tens of kilodollars?
Someone explain the cached vs not thing to me for how this is $10,000
worth of savings lol
I'm gonna be totally honest, I barely understand the basic outline of the math
involved here. Where possible I aim to not be completely wrong here, but I'm
not going to emit something 1:1 accurate with the mathematical truth of large
language models' inner workings. Bear with me.
When you call a large language model service, the request looks like the following:
curl http://localhost:11434/api/chat -d'{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
]
}'
That messages element is the key bit. Every time you accumulate messages from the initial system prompt, initial user request, AI responses and any tool use requests/responses, you add to that array and make it grow bigger and bigger.
A good way to think about this is that sending a conversation to a large language model is like having a pair of people share a roll of paper on two different typewriters. Every time you finish your message, you send the roll of paper back to the AI model and it has to re-read through the entire conversation in order to start typing on the end with its response. As the conversation gets longer, this gets more and more expensive because the model has to recalculate its internal state all over again for every additional message.
However, large language model inference is complicated but deterministic. Given the same inputs, you will always get the same output. This means that you can use a technique called key-value caching (KV caching) in order to save that intermediate state and use it for next time. Most of the time this cache is a prefix cache because that allows you to just add on more messages to the end of the request pretty easily and be fine.
Imagine something like this:
curl http://localhost:11434/api/chat -d'{
"model": "llama3.2",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
},
{
"role": "assistant",
"content": "The sky is blue because of a phenomenon..."
},
{
"role": "user",
"content": "But I am looking outside right now and it is orange!"
}
]
}'
If the model has already processed the question about the sky being blue and generated the response about Rayleigh scattering, it doesn't need to process both of those messages again to answer the user's question about sunsets. In production AI model deployments you would put that generated intermediate state into the KV cache so that the model doesn't need to run twice for the same data. This saves time and effort on the side of the AI model provider, and currently model providers decide to pass that savings onto API users in the form of cheaper inference costs for cached lookups.
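To make the prefix idea concrete, here is a minimal sketch in Python. It is a toy, not how any real inference server works: the `ToyPrefixCache` class and `prefix_key` helper are hypothetical names invented for this illustration, and it tracks whole messages rather than tokens or actual model state. It only counts how much of a conversation has been seen before and how much is fresh work.

```python
import hashlib
import json


def prefix_key(messages):
    """Stable hash of a message list, used as the cache key for that prefix."""
    blob = json.dumps(messages, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class ToyPrefixCache:
    """Hypothetical stand-in for a KV cache: remembers which prefixes were seen."""

    def __init__(self):
        self._seen = set()

    def process(self, messages):
        # Find the longest prefix of this conversation that was seen before.
        cached = 0
        for n in range(len(messages), 0, -1):
            if prefix_key(messages[:n]) in self._seen:
                cached = n
                break
        # Everything after the cached prefix is new work; remember each new prefix.
        for n in range(cached + 1, len(messages) + 1):
            self._seen.add(prefix_key(messages[:n]))
        return {"cached": cached, "fresh": len(messages) - cached}
```

Sending the one-message conversation and then the three-message follow-up reports one cached message on the second call, because the first message is an unchanged prefix. Editing that first message, even slightly, changes every prefix hash and drops the cached count back to zero, which is why rewriting earlier messages defeats the cache. A real cache would also keep the state for the response the model generated itself, which this toy does not model.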
As you develop an application with AI in it, try to avoid changing any inference settings or previous messages between prompts. This makes your application's queries much more likely to read from the cache, making it faster, reducing the environmental impact, and saving you(r users) money.
This is the first installment of a three-part series on global library leadership engagement, contributed by Ellen Hartman, OCLC Leaders Council Manager. We’re grateful to Ellen for sharing her perspectives on this topic.
Proof we engaged face to face
At a recent gathering of the OCLC Leaders Council, something happened that I always hope for but never take for granted. Connections were being made: there was laughter, sidebar conversations over lunch and dinner, a willingness to challenge each other’s ideas, honesty about what people were struggling with, and genuine curiosity about what others were doing. All of this was built on a foundation of trust that made these in-depth conversations possible.
These moments don’t happen automatically. In my experience, they take time—and often, the opportunity to meet in person. Meeting online can be very efficient, but it can feel rushed and impersonal—it’s hard to truly get to know each other through a screen. Being in the room together over the course of a few days, in a small enough group that you actually get to speak to everyone, creates a solid foundation for future opportunities to meet again, online or in person, to build on the connections, themes, and conversations that started there.
What made this gathering particularly significant was its global dimension. Library leaders do come together regularly, but often within their own region, or among peers from the same library type. Academic and public library leaders, for instance, don’t always get the opportunity to meet for in-depth conversation, even though there is much they can learn from each other. Conversations organized by library type or region have real value, of course, but there is something additional that comes with a broader perspective that is still rooted in the library ecosystem while extending beyond your usual network. Every perspective in the room adds something, regardless of what an institution has or hasn’t yet achieved. The value of these conversations comes from the range of experiences present.
Across international leadership spaces, a remarkably consistent vocabulary tends to surface. Terms recur across sessions, regions, and formats, and their repetition signals that we are all on the same page: a reassurance that participants are engaged with the same broad challenges and moving in a broadly similar direction.
The problem is that shared language doesn’t necessarily mean a shared understanding or a shared reality. One of the things that becomes apparent, watching these conversations unfold, is how often the same word lands differently depending on who is in the room.
Take efficiency, a term that surfaces regularly in conversations about how libraries operate and plan strategically. In some contexts, efficiency encompasses decisions about workforce size and structure. In others, those decisions are shaped by employment frameworks that lead to a very different kind of conversation, shifting the focus instead toward technology, software, or finding different ways of working within existing structures. The word is the same. The need it describes, and the range of solutions available, are not. This is why you need a deeper understanding of each other’s context to find out where you are using the same words but aren’t speaking the same language.
Glimpses, not full pictures
Even with that understanding in place, international leadership conversations can only ever offer glimpses of each other’s reality rather than the full picture. You see enough of someone else’s context to recognize the challenge, but rarely enough to understand all the constraints behind it.
This matters because those constraints are often what make the difference. Take something many library leaders struggle with: making the case for their library’s value to the broader institution or community they serve (for more on this topic, see OCLC Research’s latest report!). Some leaders have, through long-term effort and considerable perseverance, managed to position the library as visibly central to their institution’s priorities and a key part of its success. For others, making that same case remains difficult. The reasons could be structural or personal: the physical or organizational distance between the library and the part of the institution that makes key decisions, the data available to demonstrate the library’s impact, or the library leader’s own position, voice, and access to the right conversations at the right time.
In international settings, what tends to surface is the success story. What is harder to showcase is the full path to that success: the years of lobbying, the hundreds of stakeholder conversations, the incremental steps that made this outcome possible. A leader who has achieved that recognition may share what they did in good faith and genuinely want to help others reach the same goal. But because the conditions that made their success possible are often invisible in how the story gets told, it can be hard for them to understand why the same challenge feels insurmountable to a peer.
The value outside of the program
International leadership meetings are often evaluated by what happens in the formal program. But some of the most valuable exchanges happen elsewhere. Recognizing that is part of understanding how these spaces work in practice.
In smaller gatherings, it’s the time outside the formal agenda where a lot of the magic happens. When a group of library leaders meet for the first time, they are still in the process of getting to know one another. This is why you can’t expect them to immediately share their biggest challenges or most acute pain points. There is a measure of trust building that happens as a gathering takes place, especially over multiple days. It’s often after the official program ends, and there is room for leaders to relax and reflect together (for example, during dinner or at the bar) that the more personal and complex topics get discussed.
That kind of conversation requires enough prior exchanges that people feel safe being a little vulnerable. Admitting that your library is struggling to secure its position, or that you haven’t found a way to make your value proposition tangible enough to institutional leadership or other stakeholders that control funding, is not something most people are willing to do in a room full of peers they’ve just met. It becomes possible when the group has had time to become something more than a collection of strangers.
This is one of the reasons smaller, sustained gatherings tend to produce a different quality of exchange than large conferences. It is also why the informal spaces within those gatherings deserve to be nurtured rather than left entirely to chance.
No neat resolutions needed
One expectation worth setting aside is that international leadership conversations should resolve into clear conclusions. They rarely do, and that is not a failure.
Conversations like these do not need to end in consensus or a neat step-by-step path forward. It’s often the process of sharing and reflecting on both differences and commonalities that provides the greatest benefit. It might be an idea you hear and want to incorporate in your own library. A perspective that’s truly new to you and makes you see a topic in a different way. Or simply the opportunity to take a subject that was discussed at surface level and deepen the conversation in future gatherings.
That is why continued engagement matters more than resolution. Understanding accumulates across multiple conversations, multiple gatherings, and sometimes multiple years. It cannot be compressed into a single meeting, however well designed. The friction and the moments of genuine surprise are part of the value. Smoothing those moments away or rushing toward consensus risks losing exactly what makes international exchange worthwhile.
Conclusion
International leadership spaces are often judged by the ideas they surface or the alignment they appear to produce. But their deeper value lies in the glimpses they offer into realities that are different from our own. Those glimpses don’t tell the full story of what other library leaders are experiencing, but taken together, they help form a better understanding of what experiences are out there.
When designed well and when opportunities for informal interactions are cultivated, global library leadership spaces create the conditions for the kinds of conversations that go deepest. Those conversations rarely happen on the agenda, but rather emerge when enough trust has been built that people are willing to be open and candid with one another. That is not something that happens automatically: it requires continued investment in bringing people together, and repeated exposure to each other’s contexts, experiences, and points of view over time. Trust is not built overnight.
The next post in this series takes a closer look at what global engagement actually involves beyond the conversation itself and why showing up, in every sense of the phrase, costs more for some than others.
I have a problem with RSS. Not RSS itself, RSS is great!
The problem is that I subscribe to more feeds than I can possibly read,
so the unread count in FreshRSS climbs faster than I can bring
it down. Some days I skim titles, declare bankruptcy, and mark
everything as read. Other days I let it pile up and feel guilty.
I’ve tried using newer tools like Current which was
definitely an improvement, but still didn’t quite do it. My friend Dan
has been working
on a new RSS tool that works a bit like a personal newspaper, that seems
like it could be extremely helpful, and I’m keeping my eye on it. But
meanwhile the list of unread posts grows…
Now, I’ve been very reluctant and slow to introduce LLMs into my daily
work. But even from under my rock, in a cave, down by the river, I’ve
heard that LLMs are good at text summarization.
I thought maybe, just maybe, I could try using one to summarize
my unread posts? It seemed like a good fit for an experiment since the
impact of getting things wrong is basically zero (in theory).
I wanted to try routing my unread RSS posts through an LLM to get a
daily digest. From under my rock I’d also heard about
Model-Context-Protocol (MCP),
and how it is going to change everything. So I thought it would
be a good exercise in seeing how that works in practice with a tool like
Claude Code. I’d use Claude Code’s MCP support to connect directly to
FreshRSS and ask Claude to summarize what I’d missed. Yeah, that’s the
ticket.
This is the Way?
The first thing I tried was ChrisLAS’s
freshrss-mcp server, which wraps the FreshRSS GReader API and exposes
it as a set of MCP tools. The idea is that you drop it into your Claude
configuration and Claude can then call those tools to fetch and read
your articles.
I gave it a try, and it worked! But the results were… mixed. Claude
would usually fetch articles. But then it would produce a lot of
diagnostic chatter alongside the actual summary: narrating its own tool
calls, noting what it was about to do, explaining why it was skipping
certain things, asking for permission for this and that.
And more frustratingly, it would sometimes take strange detours:
executing inline Python code, and Unix tools to do things it could have
done by calling the MCP tools more directly, wandering into unnecessary
computation. The experience felt noisy and unpredictable, and (frankly)
just a bit scary.
I started by creating some “skills” and some scripts for those skills
thinking it would make things a bit more deterministic. It kinda did?
I thought maybe my problem was that the skills weren’t bundled together,
so I built my own plugin: freshrss-claude. This
version bundled the MCP server as a Claude Code plugin with a set of
“skills”, the structured prompts to guide Claude through fetching and
summarizing in a more controlled way.
It seemed better? Not needing to start the MCP server was definitely
better. But ultimately it wasn’t as big an improvement as I’d hoped for.
Claude still exhibited strange behaviors: writing and executing Python
scripts unnecessarily, going off-script in ways that were hard to
anticipate. The summaries themselves were fine when they arrived, but
the path to getting them there was erratic and unpredictable.
The last straw for me was the idea of running this Rube Goldberg machine
from a cron job to generate the summary for me automatically. To run it
automatically I needed to grant it all kinds of permissions to ensure it
ran through. This scared the shit out of me, given that I was giving it
permission to run arbitrary Python programs and reach out to the web,
and interact with the filesystem. Running it once or twice manually was
ok. But sticking it in my crontab and forgetting about it? Forget about
it. I experimented briefly with putting things in a Docker container,
and Claude Cowork’s sandboxing, but then…
Turning it inside out
I stepped back and rethought the problem. The thing I’d been trying to
do, have an LLM orchestrate a set of tools to accomplish a task, is one
(seemingly popular) way to use an LLM. But it turns out to be kinda
demented. You’re asking the model to plan, to sequence, to decide.
You are asking it to be An Agent. Sure, models can do this, but they are
not reliable in the way a simple program is. They wander. They
improvise. They sometimes decide to take a detour. Do I really benefit
from this runtime model in this little RSS digest app? Nah, not really.
So the alternative, and this is the inversion that made things click for
me, is to write a deterministic program that calls the LLM as a
component, rather than letting the LLM drive the program as an Agent. My
code fetches the articles. My code shapes the prompt. My code writes the
output to a file. The LLM does exactly one thing: it reads the content I
hand it and produces a summary.
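As a rough illustration of that inversion, here is a minimal Python sketch. It is not the code of any real tool: `Article`, `build_prompt`, and `write_digest` are hypothetical names, and the fetcher and the model call are passed in as plain functions so the control flow stays in ordinary code.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List


@dataclass
class Article:
    title: str
    url: str
    text: str


def build_prompt(articles: Iterable[Article]) -> str:
    """Shape the prompt in ordinary code so the model sees the same layout every run."""
    parts: List[str] = ["Summarize the following unread articles as a short digest.", ""]
    for i, article in enumerate(articles, start=1):
        parts.append(f"[{i}] {article.title} ({article.url})")
        parts.append(article.text.strip())
        parts.append("")
    return "\n".join(parts)


def write_digest(
    fetch: Callable[[], Iterable[Article]],
    summarize: Callable[[str], str],
    out_path: Path,
) -> Path:
    articles = list(fetch())         # deterministic: our code fetches the articles
    prompt = build_prompt(articles)  # deterministic: our code shapes the prompt
    summary = summarize(prompt)      # the single probabilistic step
    out_path.write_text(summary, encoding="utf-8")  # deterministic: our code writes
    return out_path
```

The model is called exactly once and never decides what to fetch or where to write. Because `summarize` is just a function from prompt to text, it can be swapped for a stub in tests or for a thin wrapper around whichever LLM client you prefer.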
Take Two (or Three, or Four?)
I threw it all on the fire and started over by writing rss-digest instead. Well,
truth be told, Claude and I wrote it. Ok, ok, mostly Claude.
It’s a small Python CLI that connects to any GReader API-compatible
RSS reader (FreshRSS, Miniflux, Tiny Tiny RSS, The Old Reader), fetches
your recent unread articles, and asks an LLM to produce a digest.
Because it uses LiteLLM under the
hood, you can point it at any compatible model: OpenAI, a local model
running in LM Studio, whatever you prefer.
The output is a Markdown file (or HTML with --html). I have
a cron job run it in the morning and drop a file on my desktop for me to
read. Here’s an example
of what it looks like.
For smaller batches (≤25 articles) it gives you a structured list. For
larger ones it produces a curated prose summary grouped by theme. You
can pass a custom system prompt file if you want to tune the style or
grouping. You can pass --mark-read if you want it to mark
everything as read afterward.
The tool is on PyPI and the code is on
GitHub. I’ve just
started using it, so it quite possibly has problems. The prompt that is
used for doing the summarization is configurable. If you have a
different take on the prompt or want to extend it, please send me a pull
request so I can add it as an alternative.
So…
What I keep coming back to is the design lesson underneath all of this.
There’s real value in being thoughtful about which part of your
system is deterministic and which part is probabilistic. There’s no
doubt that LLMs are magical things, but an LLM is not a reliable program. It
shouldn’t always be the thing making decisions about what to fetch, when
to stop, or how to structure output. Hand it a well-formed input, ask it
a clear question, and (hopefully) it will return something useful.
Everything else (the plumbing, the sequencing, the file I/O) stays in
your code, which you can look at, test, and run directly.
I’m not saying all programs using LLMs need to take this approach. I’m
just saying maybe you don’t need MCP, Agentic AI, etc, etc all the time.
Experiment with it, but don’t forget to turn it inside out when you need
to.
Once again I attended most of the Library of Congress' Designing Storage Architectures workshop remotely. I apologize for the delay in posting this; domestic duties have kept me very busy recently. Below the fold are notes on the talks that caught my attention, based on my now somewhat faded memory and the slide decks for the talks from the Library of Congress website.
As usual, IBM's Georg Lauhoff provided an invaluable overview of the storage industry as of late 2025, co-authored with Sassan Shahidi. They make an important point that I have been making since at least 2018's Archival Media: Not a Good Business:
Challenges of Alternative Archival Technologies
• Alternative archival technologies face technical and economic hurdles.
This justifies their focus on flash, hard disk and tape. Their "exabytes shipped" graph shows that indeed Hard Disk Unexpectedly Not Dead; the dramatic decline in HDD's share since 2008 reversed in 2024.
The key metric for technological progress in traditional storage media is areal density:
Lauhoff and Shahidi's graph shows that tape, which has the easiest path because of the relatively large size of the bits, has continued its steady growth, although one could argue both that their 24% annual growth exaggerates the period since 2017, and that INSIC's projection of 28% is optimistic.
It is clear that HDD areal density progress slowed dramatically about 2010 to around 11% per year. But the developments Jon Trantham reported, see the next section, could lead to a significant acceleration in HDD areal density.
Flash has continued a steady 30% per year growth since about 2010, thanks to stacking cells vertically and storing multiple bits in them. Both of these have limits, into which the industry will eventually run.
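To put the annual growth rates quoted above on a common footing, it can help to convert them into doubling times. The sketch below is my own arithmetic on those quoted percentages, assuming steady compound growth; the function name `doubling_time_years` is just an illustrative choice.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years for areal density to double at a steady compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


# Annual growth rates as quoted in the text above.
RATES = {
    "tape (24%)": 0.24,
    "tape, INSIC projection (28%)": 0.28,
    "HDD (11%)": 0.11,
    "flash (30%)": 0.30,
}

for name, rate in RATES.items():
    print(f"{name}: doubles about every {doubling_time_years(rate):.1f} years")
```

At these rates tape density doubles roughly every 3.2 years (2.8 at the projected rate), flash every 2.6 years, and HDD only every 6.6 years, which is why an acceleration in HDD areal density would matter so much.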
As regards the relative cost per TB of the three media, the big picture is that since around 2010 change has been very gradual. Tape and flash have both become cheaper relative to HDD, but the rate of change has been much lower than predicted.
Lauhoff and Shahidi conclude that:
Tape Storage: continues to evolve.
HDD: improvements slow down but recently high demand.
NAND: well-suited for hot storage but not for archival purposes.
Lack of Alternatives: Within the foreseeable future (within 10 years), there are no viable alternatives to Tape, HDD, and NAND storage.
AI leads to storage demands across the tiers
This last point was a theme for the entire meeting. But it is important to note that the meeting was too early to capture the full impact of AI on the cost and availability of media and systems.
He also announced that they have started to ship their 40TB HAMR drives. Their roadmap to 100TB/drive presents some significant challenges, as shown in Trantham's slide. The history of HAMR shows that Seagate can surmount major technical challenges, but it may take longer than they project.
One of Trantham's slides vividly illustrated the technology challenges the HDD industry faces, showing to scale the evolution since 1997 of the sizes of the bits on the media, the reader, and the writer. Note the 1610-fold decrease in the area of the writer, the 305-fold decrease in the area of the bit, and the 289-fold decrease in the area of the reader.
Fifteen years ago, Ethan Miller, Ian Adams and I published Using Storage Class Memory for Archives with DAWN, a Durable Array of Wimpy Nodes. It was inspired by work at Carnegie-Mellon from 2009, FAWN: a fast array of wimpy nodes, which argued that implementing fast storage using large numbers of small nodes built from cell-phone technology could save two orders of magnitude in energy per query. We argued that it would be possible to build low cost, low energy archival storage systems using a similar approach.
Our idea was ignored, but at this meeting Ethan Miller revived the idea of using flash as an archival medium. He argues for a rack-scale system storing 500PB/rack built from 5U shelves, similar to Backblaze's, each holding 216 of Pure Storage's 300TB DFMs (direct flash modules) stacked vertically.
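As a quick sanity check on those figures, the following sketch works out how many shelves the 500PB target implies. The shelf count per rack is my own inference from the numbers quoted above, not something stated in the talk, and it uses decimal units (1PB = 1000TB).

```python
import math

# Figures quoted in the text above.
DFM_TB = 300              # capacity of one direct flash module
DFMS_PER_SHELF = 216      # modules per shelf
SHELF_HEIGHT_U = 5        # rack units per shelf
TARGET_PB_PER_RACK = 500  # stated rack capacity

# Derived values (assumption: the rack is filled with identical shelves).
shelf_pb = DFM_TB * DFMS_PER_SHELF / 1000
shelves_needed = math.ceil(TARGET_PB_PER_RACK / shelf_pb)
rack_pb = shelves_needed * shelf_pb
rack_units = shelves_needed * SHELF_HEIGHT_U

print(f"{shelf_pb:.1f} PB per shelf")
print(f"{shelves_needed} shelves = {rack_pb:.1f} PB in {rack_units}U")
```

Each shelf holds 64.8PB, so eight shelves give about 518PB in 40U, which is consistent with the 500PB/rack claim and leaves a little room in a standard rack for networking and power distribution.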
There are three big challenges:
First, if all the DFMs were actively I/O-ing the rack would draw 45kW. Supplying the rack with that much power and cooling it would be very difficult (see the design of Nvidia's racks). But, just as with Facebook's hard disk cold storage, this can be mitigated by scheduling accesses so that only a small proportion of the drives are active.
Second, flash cells gradually leak electrons, so must be regularly refreshed by reading and re-writing them. This task must be scheduled along with the application's reads and writes, but doing so is fairly easy since the refresh timing isn't critical.
Third, flash is more expensive per TB than hard disk or tape. As I have argued for a long time, in the archival storage market the time value of money makes it difficult to justify trading increased capex for decreased opex:
The opex savings are significant, with essentially no mechanical failures, more benign failure modes, and much higher bandwidth for erasure code recovery.
Miller argues that the capex isn't as bad as the cost of the media makes it look, because at 0.5EB/rack there are savings in space, power and cooling. He doesn't point out that the lower latency for read access potentially allows for the elimination of an entire warm layer of the storage hierarchy.
But he acknowledges that AI is driving up the media cost. This is probably only relative to tape, since hard drive prices are also skyrocketing.
Miller argues that, over time, flash costs will come down. The scope for further shrinkage of the cells, and the addition of more layers, is limited. Once that happens the fabs that manufacture flash will gradually fall behind the leading edge and become depreciated.
Although I'm naturally biassed, I think Miller's case for archival flash is worth a detailed investigation.
Fourteen years ago in Cloud vs. Local Storage Costs and More on Glacier Pricing I started writing about the way the complex and somewhat opaque pricing models of cloud storage platforms made it difficult to estimate how much you would end up paying. People are just now figuring out that AI has the same problem. Neither is an accident; these pricing models serve two goals important for the platform's business model. First, the purchase decision is based on the "Low, Low" advertised price. Second, once you discover how much more you're actually paying, you face the lock-in created by egress fees. In 2019's Cloud for Preservation I wrote about how egress charges implement vendor lock-in.
David Boland of Wasabi presented a current analysis of this issue. He reports that about half of all the organizations they surveyed exceeded their budget for public cloud storage.
The budget overruns were caused by the fact that the actual spend was about double the sticker price for the storage. Fees were the culprit, which by design are much harder to project.
Using Amazon's cloud services for long-term preservation always suffered from the fact that, to confirm the fixity of the preserved content, it had to be read and thus incur fees. Finally, two advances have improved things. First, it is now possible to use SHA-256 and SHA-512 checksums when uploading data. Second, it is now possible to use an S3 batch job to validate the checksums on objects without reading them.
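For readers who want to see what the local half of such a fixity check looks like, here is a minimal Python sketch that computes a file's SHA-256 digest before upload, using only the standard library. The function name `sha256_fixity` is hypothetical, and the sketch deliberately makes no calls to any cloud API. It returns both the familiar hex form and a base64 form, since to my knowledge S3 reports its checksum values base64-encoded.

```python
import base64
import hashlib
from pathlib import Path
from typing import Tuple, Union


def sha256_fixity(path: Union[str, Path], chunk_size: int = 1024 * 1024) -> Tuple[str, str]:
    """Return (hex, base64) SHA-256 digests of a file, read in chunks.

    Chunked reading keeps memory use flat even for very large files.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), base64.b64encode(digest.digest()).decode("ascii")
```

Recording this value at ingest time gives you something to compare against whatever checksum the storage service later reports, without having to download the object again.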
AI is one of those generational tech topics that isn’t going away soon. But the signal to noise (or hype to reality) ratio can be truly overwhelming. There are just so many links, opinions, new resources that are getting lost in the mix. And that’s for us information and tech nerds – we can only [...]
In the Spring 2026 semester at Old Dominion University (ODU), I taught CS 450 (Undergrad) / CS 550 (Graduate): Database Concepts. The course was fully online, with synchronous live Zoom sessions held twice a week. Attendance was not mandatory but strongly encouraged. All lectures were recorded and made available for students to access whenever needed.
Figure 1: Canvas course page for CS 450/550: Database Concepts
Through this blog post, I want to share my experience of teaching a senior-level undergraduate/graduate course for the first time, the behind-the-scenes realities of course preparation through to the end of the course, and how student feedback actively shaped the course as it progressed.
Since the course had been taught previously by other instructors, materials were already available, which made things easier. Rather than building everything from scratch, I started by copying over the existing course structure and then carefully updating it to align with the current semester. The more time-consuming part was setting everything up, cleaning up the Canvas course, especially updating deadlines and revising the syllabus, while ensuring the topics were properly aligned with assignment deadlines. If you are instructing for the first time, it is very important to make sure you get access to the course in time, so you can set everything up without a rush. Throughout the semester, to make the most of class time, I spent a couple of hours before each session preparing things such as reviewing material, planning examples, and thinking through how topics would connect. I tried to debug issues during the class in real time whenever possible. If something took longer than expected, I pushed it to the end of class or moved it to the office hours. It helped me to continue the flow of the topic without interruptions.
I experienced firsthand that handling a class of 50 students without a teaching assistant (TA) was, honestly, a lot more work than I expected. Grading labs, homework, quizzes, and discussions while also preparing for lectures and responding to emails required a constant balance. I wasn’t always perfect, but I made a steady effort to stay on top of it. Grades were returned as quickly as I could manage, and emails were typically answered within 24 hours, often sooner. Again, it reinforced something I had already noticed as a student: timeliness matters. Things do not have to be instant, but when there is a clear effort to respond and follow through, it builds trust and keeps students engaged.
One of the first challenges I faced as an instructor of this course involved managing classroom dynamics. After a few classes, a student shared a concern that some well-intentioned peer engagement (jumping in to answer questions or adding explanations during lecture) was making it hard to follow along. It was a fair concern, and an important one. At the same time, I didn’t want to discourage participation. Active engagement is something every instructor hopes for, and it was clear that students were eager to contribute. My challenge was to find the right balance. I responded by acknowledging the concern and assuring the student that I would make adjustments so that participation remained helpful rather than overwhelming. Before taking action, I also reached out to a mentor for advice, which helped me approach the situation more thoughtfully. I thanked students for being engaged and willing to contribute, but also clarified expectations: participation was welcome, but lectures and question answering would be primarily instructor-led, with designated moments for peer discussion. I also reflected on something I had noticed during the class introductions: students were coming from a wide range of backgrounds. Some had prior experience with databases, while others were encountering these concepts for the first time. Because of that, maintaining a consistent pace and structure was important. I believe that framing it this way helped convey to the students that my goal was not to limit participation but to support a better learning environment for all. No further concerns were raised afterwards, and the students remained engaged while being supportive of the entire class.
Midway through the semester, I conducted an anonymous check-in survey to better understand how students were experiencing the course. To encourage participation, I offered a small amount of extra credit, which resulted in a strong response rate.
Overall, the feedback was encouraging: most students agreed or strongly agreed that assignments were clear, the workload was manageable, and the pace was appropriate (Figure 3). But what mattered more were the written responses. They highlighted patterns that helped me see the course from the students’ perspective (full set of responses).
A few consistent concerns stood out:
Some students said they weren’t always sure what to prepare before class or whether a session would lean more toward lecture or lab. That feedback pushed me to be more specific in my announcements, clearly laying out what each class would cover.
Several students pointed out that while their answers were marked incorrect or partially correct, the reasoning behind it wasn’t always clear. This was a fair point, and a difficult balance when grading at scale. Still, I made a more conscious effort to leave clearer comments.
Even when students understood the concepts, many struggled to translate them into SQL queries or ER diagrams. That reinforced something I kept coming back to: the need for more in-class examples and live coding, which I continued to prioritize.
Interestingly, a lot of students said the challenge wasn’t the material itself, but managing their time. A few students shared situations where missing a single assignment significantly impacted their grade. This feedback later influenced my decision to allow requests for reopening missed work.
At the same time, there were plenty of positive notes that helped confirm what was working:
Students consistently appreciated the clarity of explanations and examples.
The labs and live coding sessions were frequently mentioned as highlights.
Many felt the course structure was organized and manageable.
Some even described it as one of the best online courses they had taken.
I also asked students a simple question: what’s one thing I should keep doing, and one thing I could do better? Here are some of the responses that stood out:
“The instructor is great. Instructions are clear, vibes are good, I would recommend this class. The homework is work intensive but not unreasonable.”
“You have been doing a great job and this has been one of the best online courses I have taken at ODU”
“very good at explaining things, even when the students dont seem to get something she fines a new way of explaining it so they get it.”
“The instructor is accommodating to students within reason and I believe that is something they should keep doing.”
“keep being a great teacher :)”
Figure 3: Summary of student responses to four questions: assignments, workload, grading, and pace
At the mid-semester point, once the grades were up to date, I started reaching out to students who had missing work or were falling behind. The intention wasn’t to penalize them, but to give them an opportunity to catch up. At the same time, I made a point to recognize those who were consistently performing well, and, to maintain fairness, I gave all students the same chance to request to make up any missed assignments. Many students responded well to that nudge.
One practice I intentionally carried forward from my own experience as a student was leaving comments on graded work, not just when points were deducted, but also to acknowledge strong submissions. It is a small effort from my end, but it helps students feel seen and motivates them to keep improving. As a student, those were the moments we looked forward to, knowing the instructor noticed good work.
As the semester came to an end, the focus shifted to final evaluations, especially grading the course projects and submitting final grades to the university. One thing I did not fully anticipate during this phase was the time needed to carefully evaluate student projects. Each submission reflected a significant amount of effort, and I wanted to give them the attention they deserved. As a result, grading ran later than I had initially expected, although it was still well within the official deadline.
Teaching this course taught me some important things. Good teaching is not about getting everything perfect; it’s a way to strengthen your own knowledge while sharing that knowledge in a way others can truly grasp. It is also about being responsive, thinking about what’s working and what isn’t, and being willing to adjust along the way. Managing a full class without a TA was basically a one-person band situation (except I was the entire percussion section, keeping tempo, fixing the rhythm mid-performance, and still trying not to miss a beat while everyone else expected a flawless show). But throughout the semester, I focused on doing the best I could and continuously improving based on student input. Overall, this experience was incredibly rewarding and reaffirmed my plan to pursue a career in academia.
Acknowledgements
I sincerely thank my advisors, Dr. Michele C. Weigle, Dr. Michael L. Nelson, and Associate Professor & Assistant Chair of the Department of Computer Science, Dr. Steven J. Zeil for providing me with this invaluable opportunity to gain teaching experience as a PhD student. I am also grateful to my advisors and my colleague Dr. Bhanuka Mahanama, for always being available to answer questions. Special thanks to Dr. Santosh Nukavarapu for his mentorship throughout the semester and Syed R. Rizvi for providing the course slides. Credit for establishing and continuously refining this structure should go to the instructors who have taught the course over the years, including but not limited to Drs. Irwin Levinstein, Jian Wu, Vikas Ashok, Syed Rizvi, and Santosh Nukavarapu.
And finally, a very special thank you to my husband, Skanda Siva, for being endlessly flexible with his schedule and for his constant support, and to Yara Siva, who may not know it yet but was my tiniest companion through it all.
In the hours following the release of CVE-2026-45447 for the project OpenSSL, site reliability workers
and systems administrators scrambled to desperately rebuild and patch all their systems to fix a heap use-after-free in PKCS7_verify(). This is due to the affected components being
written in C, the only programming language where these vulnerabilities regularly happen. "This was a terrible tragedy, but sometimes
these things just happen and there's nothing anyone can do to stop them," said programmer Prof. Fabian Greenholt, echoing statements
expressed by hundreds of thousands of programmers who use the only language where 90% of the world's memory safety vulnerabilities have
occurred in the last 50 years, and whose projects are 20 times more likely to have security vulnerabilities. "It's a shame, but what can
we do? There really isn't anything we can do to prevent memory safety vulnerabilities from happening if the programmer doesn't want to
write their code in a robust manner." At press time, users of the only programming language in the world where these vulnerabilities
regularly happen once or twice per quarter for the last eight years were referring to themselves and their situation as "helpless."
Tigris is S3-compatible, which means you can point the AWS SDK at it and most things just work. The catch is that the Tigris-exclusive features—bucket forking, snapshots, object renaming, and the like—need verbose workarounds because the AWS SDK doesn't know they exist.
So we wrote a Go SDK that does. It comes in two flavors: the storage package is a drop-in replacement for the standard S3 client with first-class methods for the Tigris-specific operations, and simplestorage is a higher-level client for the common single-bucket case that infers its configuration from the environment so you stop passing the same parameters over and over. You can adopt the Tigris features incrementally without refactoring your existing S3 code, and the simpler API still works against other S3-compatible providers.
I wrote up how it works and why we built it over on the Tigris blog.
The Perma team is excited to announce WARCbench, an open-source tool for exploring, analyzing, transforming, recombining, and extracting data from WARC (Web ARChive) files.
WARCbench builds on over a decade of experience gained from developing Perma.cc. Over that time, we’ve accumulated a collection of scripts, utilities, debugging workflows, and one-off experiments for dealing with web archives. WARCbench brings together those processes into a simple command-line tool that helps web archivists make sense of the wild, occasionally malformed, and deeply heterogeneous web archives that web archivists encounter in practice.
WARCbench was designed to make as few assumptions as possible about your familiarity with web archives, the kind of WARC you are working with, or what you want to do with it. It is intentionally a command-line tool. You can use it to explore and work with WARC files even without deep prior knowledge of the format, though it does assume you’re comfortable using a terminal and open to a bit of experimentation. The goal is not to hide the complexity of web archives. It is to make that complexity easier to inspect, manipulate, and learn from so you can experiment and iterate.
While many existing WARC tools are optimized for specific production workflows, the exploratory, in-the-moment WARC wrangling and debugging work archivists and developers often need to do benefits from different design choices. Sometimes you need to inspect a malformed or misbehaving WARC. Sometimes you need hooks and custom callbacks for an experiment. Sometimes you need to optimize for speed, memory, or convenience. Sometimes you just need to look and see what is there before deciding what to do next. WARCbench was designed for those moments.
We don’t know all the ways researchers or web archivists might use WARCbench, but we hope it becomes a versatile Swiss Army knife that others will find valuable to keep in their toolkit too.
We’re delighted to welcome Ann McCranie, PhD, who joins OCLC on June 8 as our new Director of Research Insights.
Ann joins OCLC at an important moment as OCLC Research advances Research Reimagined, a strategic effort to strengthen the relevance, visibility, and impact of our research for library leaders and their institutions. In her role, Ann will help connect research priorities to practical insights that support decision‑making across a rapidly changing library and higher education landscape. She will lead a team of research scientists and engineers focused on advancing the Research Reimagined strategy.
Ann brings more than a decade of experience leading research programs in higher education, with expertise in mixed methods research, research operations, and research communication. Most recently, she held senior leadership roles at Indiana University.
Her work has focused on building durable research services, guiding cross-functional teams, and helping researchers and administrators navigate change. Throughout her career, she has paired rigorous analysis with practical application.
Ann also brings a perspective shaped by close collaboration with researchers, research administrators, and campus leaders beyond the library. That experience informs how she thinks about the evolving roles of libraries as institutions respond to changes in technology, AI-informed scholarly workflows, and research infrastructure. This perspective will be especially valuable as OCLC Research continues exploring future-focused questions facing libraries and higher education.
To help introduce Ann to the community, we asked her a few informal questions.
What drew you to this role at OCLC?
What attracted me was the opportunity to help connect research to the decisions library leaders are making today, while also contributing to longer-term thinking about where libraries are headed. In higher education, I’ve worked with researchers, administrators, and institutional leaders who rely on strong evidence and practical insights to guide strategy, services, and priorities.
I was especially excited by the chance to bring that experience to OCLC and support work that can have both immediate and lasting value for libraries. Research is most meaningful to me when it helps people navigate change, make informed decisions, and think differently about what comes next.
How do you think about “research insights”?
I tend to think about research insights through the lens of impact. I once asked a doctor about a medical test, and she explained that she would not order it because the result would not change the treatment plan. At first, I was a little disappointed because I was genuinely curious, but that idea stayed with me.
It became a useful way for me to think about research. I’m always asking whether the work can help inform decisions, shape action, or open up new possibilities. If the findings don’t create an opportunity to do something differently, it’s worth asking how we can make the research more purposeful and useful.
What are you most looking forward to as you get started?
I’m really looking forward to getting to know my team and connecting with colleagues across OCLC to understand their work, priorities, and how Research Insights can support them.
As a social networks scholar, I’ve always been interested in the connections between people and how relationships help ideas spread and grow. So much innovation comes from those informal networks, whether that is among coworkers, library partners, or the broader community. I’m excited to learn from those connections and help build on the momentum already underway with Research Reimagined.
Ann will be attending ALA at the end of June, and we look forward to introducing her to many of you there. Until then, please join us in welcoming her to OCLC Research.
This post is part of a series in which I write about experiences or specific challenges from my day-to-day work. I’m hoping that these will be interesting for other librarians that work in entirely different areas, for my colleagues who are solving different problems on different systems (or maybe eventually the same one after we migrate), and for those who are thinking about doing this kind of work in the future.
Building from navigating the distributed database, I want to get more deeply into what cross-system problem solving can look like. To re-set the stage (but for more details about these tools, check the previous post), transaction history of items is only available for most users via our Analytics tool.
Transaction Histories
Transaction history represents the ways an item’s traveled, checkouts but also transits and receipts. This is one of the many transactions created while my request for the Alien: Romulus DVD was filled. In this transaction, a coworker at York (I’ve redacted any details, but the user ID is in the actual log) sets the item to transit for reason “HOLD” to “UP-PAT”:
Trans Hist Datetime | Trans Hist Workstation | Trans Hist Command Desc | Trans Hist Data Code Desc | Trans Hist Data Value
2025-08-27 12:10:48 | 0173 | Transit Item | call number | POPULAR
2025-08-27 12:10:48 | 0173 | Transit Item | copy number | 1
2025-08-27 12:10:48 | 0173 | Transit Item | item ID | 000080622957
2025-08-27 12:10:48 | 0173 | Transit Item | Max length of transaction response | 3000000
2025-08-27 12:10:48 | 0173 | Transit Item | station library | UP-PAT
2025-08-27 12:10:48 | 0173 | Transit Item | station login clearance | NONE
2025-08-27 12:10:48 | 0173 | Transit Item | station login user access | REDACTED
2025-08-27 12:10:48 | 0173 | Transit Item | station user’s user ID | REDACTED
2025-08-27 12:10:48 | 0173 | Transit Item | transit from | UP-PAT
2025-08-27 12:10:48 | 0173 | Transit Item | transit reason | HOLD
2025-08-27 12:10:48 | 0173 | Transit Item | transit to | UP-ANNEX
This is the Analytics export, which I transformed from a CSV into a table for readability in this post.
Unfortunately, even though the underlying Symphony database has unique item keys for records, Analytics seems to use the barcode as the primary key of an item table, not just the primary way to find an item record. An item’s transaction history is completely wiped from Analytics if someone changes the barcode. And sometimes, barcodes change. In our case, we change barcodes on everything that’s permanently shifted to the annex (see my post on macros). We also have barcodes wear out or fall off. So we have hundreds of thousands of items whose histories were lost, at least from the Analytics.
These lost records came to a head when our Collection Maintenance team needed to be able to track large sets of items being moved to the Annex. Once the items arrived, their barcodes would be replaced with an Annex barcode, which serves a different function. So one could follow a set of barcodes on their journey until “poof,” every record related to them vanished. On the one hand, one could assume the item had been processed by the Annex since it had now disappeared. But it made tracking uneven and meant collections maintenance couldn’t tell what route an item had taken to get there or how long it’d taken.
First, I’ll note that our systems work is also quite distributed. While I was working with our collections maintenance data expert on getting access to older data, the Symphony admins were configuring item extended information to include an original barcode field, which is now populated when a barcode updates. They’ve also done some work hunting down barcode changes to update the original barcode fields. These will be exportable, even though they won’t be searchable the same way in our Analytics. Systems takes a village.
Where the Data Still Lives
Getting back to the problem-solving, this data can still be found through the oldest method of ILS data access: Workflows reports.
By running a Scan History Logs report against a set of barcodes, we can export every log in which that barcode shows up. This data wasn’t nearly as easy to use as an Analytics or Data Control export. It’s exported in a text file and uses opaque datacodes [1]. Here are two example log entries from a barcode change (the actual user’s ID has been replaced with REDACTED):
That top entry is really important because, even though there are other ways of accessing a permanent item ID, it’s not in the logs. So by scanning for that original barcode, we can get the entry where the barcode is in NQ and the new barcode is in NR.
I wrote a Python script that processed entire log entries, since the colleague from Collection Maintenance wasn’t just looking for old/new barcodes but for the transaction histories of that item. He could apply a date range to the log export itself, so he could set it just to export the last few months. I’m not going to share the entire script here, but this is the overall approach that I used:
import re
current = re.search(r'\<datacode_NQ\>:(.+?) ', line)  # non-greedy capture of the value after the NQ datacode, up to the next space
I wrote a conditional function for all the entries which might not be present. For the handful whose data might contain a space, I wrote the search to break at the $<datacode that begins the next entry and trimmed space off the right side.
The output is a very large JSON object, which I’ve condensed below to reflect the key fields from this transaction. Even with its size, it’s a lot more compact and efficient than the Analytics output shared above and thus might be easier to process, so this script may end up being useful in other contexts.
Going forward (to migration) this new process should meet the use cases of:
connecting old and new barcodes in collection maintenance logs,
tracking item histories that had been dropped from Analytics, and
creating a friendly JSON object vs. the same entry spread across a dozen lines or more of a CSV, making it of potential use for reporting on new barcodes as well.
So in sum:
Sysadmins created a new field for original barcodes and set it to populate when barcodes are changed.
Sysadmins began hunting through logs to find barcode changes and wrote a script to populate them in the database for export/reference.
I created a way to extract JSON objects for item transaction history out of log reports run on old barcodes since those transactions were no longer accessible in Analytics.
[1] There are also ways to export a formatted log which is human readable, but those logs are much harder to turn into data structures.
This article presents a comparison of graph capabilities in three
different databases: DuckDB (v1.4.4 with duckpgq), LadybugDB (0.16.1),
and PostgreSQL (19devel). We will load a large volume of records
(5,635,972 rows of baseball data covering people, parks, team records,
and game play-by-plays) into each database, define the entities and
relationships, and write a variety of queries that take full advantage
of the graph structure.
Ambient Church transforms architecturally stunning spaces into immersive
audio-visual environments. Our events feature pioneering artists
presenting vibrant works in a context that elevates both the music and
the space.
Founded in Brooklyn in 2016, we facilitate collective peak experiences
through the soundscapes of modern contemplative music. With an emphasis
on education and environment, we seek to illuminate an underacknowledged
lineage of sonic exploration.
Large language models (LLMs) are increasingly used to generate data to
train improved models1,2,3, but it remains unclear what properties are
transmitted in this model distillation4,5. Here we show that
distillation can lead to subliminal learning—the transmission of
behavioural traits through semantically unrelated data. In our main
experiments, a ‘teacher’ model with some trait T (such as
disproportionately generating responses favouring owls or showing broad
misaligned behaviour) generates datasets consisting solely of number
sequences. Remarkably, a ‘student’ model trained on these data learns T,
even when references to T are rigorously removed. More realistically, we
observe the same effect when the teacher generates math reasoning traces
or code. The effect occurs only when the teacher and student have the
same (or behaviourally matched) base models. To help explain this, we
prove a theoretical result showing that subliminal learning arises in
neural networks under broad conditions and demonstrate it in a simple
multilayer perceptron (MLP) classifier. As artificial intelligence
systems are increasingly trained on the outputs of one another, they may
inherit properties not visible in the data. Safety evaluations may
therefore need to examine not just behaviour, but the origins of models
and training data and the processes used to create them.
In this essay we will attempt to look at both the archive of art as well
as the archive as art. When we draw a distinction between those
materials that we treat as documents with a ‘factual’ historical
significance (those which offer themselves in the service of
scholarship), and the uses which artists make of the archive as one of
the media of expression that intersect with their documentary value, we
ask ourselves: which theories about the archive’s nature and function
are applicable to Syrian art? What are the roles adopted by ‘the
document’ and ‘the archivist’? To what extent do these roles alternate
and intersect?
Cave of Forgotten Dreams is a 2010 3D documentary film by Werner Herzog
about the Chauvet Cave in Southern France, which contains some of the
oldest human-painted images yet discovered—some of them were crafted
around 32,000 years ago. It consists of footage from inside the cave, as
well as of the nearby Pont d’Arc natural bridge, alongside interviews
with various scientists and historians. The film premiered on 13
September 2010 at the Toronto International Film Festival.
Starbucks is saying goodbye to its artificial intelligence inventory
management system about nine months after its debut, Reuters reported
Thursday. The tool, which used computer vision to track some parts of
the chain’s inventory, was announced in September as a method to
simplify inventory record-keeping and prevent stockouts.
FediRoster is a slightly more heavyweight alternative to David Adler’s
Sociologists on Mastodon software. It is intended to function as a
public list of Mastodon and other fediverse accounts, geared primarily
towards academic communities, but suitable for others as well. It offers
functions for following listed accounts individually or in bulk. The
main novelty here is that you can add yourself to the list through an
authentication process instead of all the work falling on a list
maintainer. You can sign in through your Mastodon account or send a
message to the list’s bot to verify your account ownership. This also
means that the hosting process for new lists is a bit more involved
(it’s a Python/WSGI application).
A comprehensive guide to learning Rust for developers with Python
experience. This guide covers everything from basic syntax to advanced
patterns, focusing on the conceptual shifts required when moving from a
dynamically-typed, garbage-collected language to a statically-typed
systems language with compile-time memory safety.
Language models serve as the cornerstone of modern natural language
processing (NLP) applications and open up a new paradigm of having a
single general purpose system address a range of downstream tasks. As
the field of artificial intelligence (AI), machine learning (ML), and
NLP continues to grow, possessing a deep understanding of language
models becomes essential for scientists and engineers alike. This course
is designed to provide students with a comprehensive understanding of
language models by walking them through the entire process of developing
their own. Drawing inspiration from operating systems courses that
create an entire operating system from scratch, we will lead students
through every aspect of language model creation, including data
collection and cleaning for pre-training, transformer model
construction, model training, and evaluation before deployment.
Wasteback Machine is a JavaScript library for analysing archived web
pages, measuring their size and composition to enable retrospective,
quantitative web research.
The primary difference between deepfake photos and LLM conversations is
that the people who generate the former are deliberately trying to fool
others, and many of the people who elicit the latter from LLMs have
inadvertently fooled themselves.
The removal of the encoders, which are typically in charge of making
sense of the multimodal inputs, places the burden of making sense of all
outputs on the LLM. Although the model is encoder-free, all modalities
are now unified within the LLM. Instead of the model having to wait for
the encoders to finish processing the audio and image inputs, the LLM
can get started earlier processing the input and generating output!
In this guide, I want to showcase what it took to remove the vision and
audio encoders and replace them with something much faster. The result,
a 12B model that can handle audio and image inputs but without the need
for encoders.
AI Edge Gallery is the premier destination for running the world’s most
powerful open-source Large Language Models (LLMs) on your mobile device.
Experience high-performance Generative AI directly on your
hardware—fully offline, private, and lightning-fast.
Today, we are introducing Gemma 4 12B, our latest model designed to
bring agentic multimodal intelligence directly to laptops. Bridging the
gap between our edge-friendly E4B and our more advanced 26B Mixture of
Experts (MoE), Gemma 4 12B packages powerful capabilities inside a
reduced memory footprint. It is also our first mid-sized model to
feature native audio inputs.
Solid State Books is a full-service, Black-owned general interest
bookstore with a great selection of fiction & non-fiction titles. We
stock literary gifts, stationery, greeting cards & puzzles for all
ages. We have a carpeted, playful children’s books area in both stores
for kids & parents alike to spread out & read together. Come by
for weekly children’s story hours, catch monthly book groups, author
readings/signings, local interest panels, political conversations &
more!
MiniSearch is a tiny but powerful in-memory fulltext search engine
written in JavaScript. It is respectful of resources, and it can
comfortably run both in Node and in the browser.
In May 2026, the Justice Department began systematically removing
material from its web sites regarding the many indictments and
convictions related to the Jan. 6 attack on the U.S. Capitol. This
archive reconstructs the vast bulk of those thousands of deleted
records.
Last week, the Justice Department began systematically removing material
from its web sites regarding the many indictments and convictions
related to the Jan. 6 attack on the U.S. Capitol.
The operation started without fanfare or formal announcement and
proceeded largely unnoticed. Until, that is, journalists such as the
Washington Post’s Meryl Kornfield took notice of certain press releases
and other materials that had conspicuously disappeared from
www.justice.gov.
“The Trump admin is quietly deleting info about the Capitol attack from
the DOJ website as it prepares to give funds to J6ers,” Kornfield
posted. “This week, DOJ deleted a press release about one man with an
ongoing child solicitation case who came to the Capitol with bear
spray.”
Then, with typical bombast, the Justice Department responded by taking
issue with one particular aspect of Kornfield’s characterization.
“Nothing ‘quiet’ about it,” the DOJ Rapid Response account replied. “We
are proud to reverse the DOJ’s weaponization under the Biden
administration. We will do everything in our power to make whole those
who were persecuted for political purposes. This includes stripping
DOJ’s website of partisan propaganda.”
We are not erasing history quietly, the Justice Department seemed to
suggest. We are erasing history loudly and proudly.
At Lawfare, we have restored the vast bulk of what was deleted. We have
also started to preemptively archive a raft of material that has not yet
been deleted but probably will be, given its thematic relationship to
the material that was 86ed.
Data centers are the physical facilities that power cloud services, AI
systems, streaming, and nearly every digital platform people use each
day. As demand for artificial intelligence accelerates, data centers are
becoming major sources of electricity demand and local infrastructure
pressure, which means their growth affects energy systems, communities,
and long-term public planning.
The regulatory landscape for data centers in the United States has
shifted dramatically in recent years from a period of aggressive
economic incentives to a phase of intense scrutiny, restriction, and
community-led resistance. To track these legislative changes, the DIGS
Lab at the University of Virginia reviewed more than 700 federal, state,
and local policies related to data centers. The data center policy
database aims to bring transparency around zoning, permitting, and
regulating data centers and their impacts on communities. This is what
we found.
Another early R.E.M. set, from the same state but a different city as
the previous show. Pretty much the same library of songs, but this one’s
the superior show to get - it sounds slightly nicer and doesn’t have the
equipment failures of the previous show. There’s already a source on
here, but that’s a different master of the same recording.
Ratatui (ˌræ.təˈtu.i) is a Rust crate for cooking up terminal user
interfaces (TUIs). It provides a simple and flexible way to create
text-based user interfaces in the terminal, which can be used for
command-line applications, dashboards, and other interactive console
programs.
IPv6 is weird. One of the more strange parts of the standard is that every interface's link local addresses are in fe80::whatever. If you have a machine with two network interfaces, both of them will be in fe80::, so if you have a packet destined to fe80::4, how do you disambiguate it?
The answer is you use IPv6 scopes/zones. The exact format of what goes into a zone is OS dependent, but on Linux it's the interface name and on Windows it's the interface ID. This lets the kernel's routing table know how to handle an address range conflict.
On my tower, this would be represented like this:
fe80::4%eth0
Where eth0 is the name of my tower's ethernet device.
When you create a host:port bindhost, you normally separate the hostname and port with a colon. IPv6 uses colons to separate hex groups. In order to disambiguate what's the host and what's the port, you typically format the IPv6 address in square brackets, so fe80::4 on port 80 would look like this:
[fe80::4]:80
And with the right scope it looks like this:
[fe80::4%eth0]:80
Now let's get URL encoding into the mix. From high orbit, you can imagine a URL's format as being something like this:
An IPv6 zone would then be part of the hostname, just like with that fe80::4 port 80 example from earlier. So you'd think the URL would be something like this:
http://[fe80::4%eth0]:80
But if you try to parse this as a URL in Go, you get an error:
package main

import "net/url"

func main() {
	// The bare "%" in the zone makes url.Parse return an error.
	if _, err := url.Parse("http://[fe80::4%eth0]:80"); err != nil {
		panic(err)
	}
}
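This happens because URLs can't represent all Unicode values, so any values that don't fit into the grammar of a URL become percent-encoded. This is why sometimes you'll see a %20 in URLs in the wild; that's encoding the ASCII space character, which is invalid in URLs. Because a % always starts one of these escapes and must be followed by two hex digits, the parser chokes on the %et in %eth0.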
In order to work around this, you need to percent-encode the percent sign in the IPv6 zone:
package main

import (
	"fmt"
	"net/url"
)

func main() {
	// "%25" is the percent-encoded form of "%", so the zone survives parsing.
	u, err := url.Parse("http://[fe80::4%25eth0]:80")
	if err != nil {
		panic(err)
	}
	fmt.Println(u.Hostname())
}
Yields:
fe80::4%eth0
In theory, there is guidance for how to properly handle IPv6 zones in user interfaces in RFC 9844, but there's no such guidance for URLs. Go also does not seem to follow this RFC in net/url.
So in the meantime in order for Anubis to point to IPv6 zoned addresses, you need to encode the % with percent encoding. This is horrible, but it seems that this is an edge case that applies to other frameworks, programming languages, and libraries:
Maybe some day in the future there will be a better option here. In the meantime my policy of not forking the Go standard library means that this somewhat terrible UX for an edge case is acceptable. I hate it, but what can you do?
The Perma team recently attended the International Internet Preservation Consortium’s (IIPC) Web Archiving Conference, held this year at the KBR—Royal Library of Belgium in Brussels. A recurring theme was that web archiving depends on collective stewardship of the open-source tools, institutions, and people that make preservation possible. At a moment when the web is becoming more difficult to archive, the conference offered an assessment of current challenges and a reminder that the sustainability of the field relies heavily on collaboration and shared responsibility.
The opening keynote panel—“Sustainability for Open Source Web Archiving Tools”—brought together perspectives from libraries, consortia, and open source service providers: Lauren Ko (University of North Texas Libraries), Tessa Walsh (Webrecorder), Neil Jefferies (Open Preservation Foundation), Yves Maurer (National Library of Luxembourg), and LIL’s very own Clare Stanton (Perma.cc). The conversation focused on the structural pressures now reshaping the digital landscape, and what collective stewardship might realistically look like. Key takeaways from this conversation are outlined below.
Clare Stanton (center) discusses Perma.cc during the opening keynote.
Need for sustained investment in open-source software
The web archiving community no longer has the luxury of treating tool and infrastructure maintenance as someone else’s problem. Nearly every institution in the room relies on these open-source tools, including Perma itself. For example, the replay functionality for Perma.cc is built on replayweb.page, part of the software suite developed by our long-time collaborators at Webrecorder. Despite almost everyone using these open-source tools, almost no one is funding them proportionally. Historically, many projects survived on grants and foundation support, but that funding landscape is shrinking. Yves framed open-source work as a shared mission and responsibility, especially for national libraries and cultural heritage institutions whose mandates depend on long-term stewardship. Institutions should be contributing back to the web archiving ecosystem they depend on.
An asymmetric fight against a complex and closing web
Web archiving has become more difficult in the past few years, and the scale and pace of change are only accelerating. Tessa described the current environment as an “asymmetric fight” because bot detection and anti-scraping systems increasingly treat archiving crawlers the same way they treat commercial scrapers. Several panelists pointed to the collateral damage caused by large-scale scraping and large language model (LLM) training. Infrastructure providers are tightening access controls across the web, often in ways that make legitimate archival crawling significantly harder. Tessa noted that archivists now need to spend more time simply observing crawls to determine whether captures succeeded or whether crawlers archived nothing but bot verification pages. Clare suggested that the closing web may create an opportunity for archiving institutions to advocate collectively for differentiated treatment, making the case to infrastructure companies like Cloudflare that preservation work serves a fundamentally different purpose from commercial scraping.
Beyond single maintainers: Sustaining people, not just code
The panelists repeatedly returned to governance and community structure as equally important to technical capability, and also discussed the human labor behind open source tooling. Multiple panelists emphasized that storage and compute are not the primary costs in web archiving operations. The expensive part is retaining highly skilled people capable of adapting tools to a rapidly changing web environment. Neil argued that sustainability problems become especially acute when projects depend too heavily on single maintainers. The goal, Neil suggested, is not to remove human dependency, but to move from person-dependent systems to people-dependent systems, with succession planning, multiple technical leads, and stronger organizational support structures.
Digital preservation as collective responsibility
There was some cautious optimism about potential sources for more sustainable support. Panelists discussed adding funding requirements for upstream open-source projects into public tenders for web archiving services, creating institutional budget lines specifically for open-source maintenance, and treating contributions to community software as legitimate professional development work for developers within libraries and archives. Some panelists pointed to growing interest in digital sovereignty policies in Europe, where governments increasingly want more direct control over digital infrastructure and collections stewardship. Yves suggested that this political shift could create opportunities for open-source preservation tooling, particularly if public sector procurement rules begin explicitly rewarding contributions back to shared infrastructure.
Benefits and limitations of AI-assisted coding
Not surprisingly, AI hovered over much of the discussion. AI-assisted coding may reduce some development overhead, and some panelists described productive uses for code review, bug detection, and scripting assistance. However, the panel was skeptical of the idea that AI meaningfully solves the underlying sustainability problem. Faster code generation does not automatically create maintainable systems, healthy governance, or resilient communities. As Tessa noted, velocity without understanding creates its own risks.
Open-source software is critical preservation infrastructure
The key takeaway that emerged from the opening keynote was a reframing of open-source web archiving infrastructure not as ancillary technical tooling, but as critical preservation infrastructure. The field behaves as though these systems are indispensable, but there is a significant underinvestment in open-source tools. The harder question, and the one the panel kept circling back to, is whether institutions are willing to fund, maintain, and steward them accordingly.
Together with AGESIC, we are piloting a traceable AI system with Uruguay’s open data catalogue – so that citizens can receive verifiable answers from national data
The Wayback Machine is (usually)
good at preserving web pages, but it’s not always good at helping you
find your way around what’s been preserved. URLs from a vanished website
may be archived, but if the original site is gone, the paths into it
(its navigation, its search, its tables of contents) are sometimes gone
too.
This creates a need, and opportunity, for sites I want to call
reading rooms for the archived web: standalone sites that sit
to the side of archived web content and provide the index, browse,
search and curation layers that the original site used to, with
provenance links back to the captures they’re drawn from. The metaphor I
have in mind is the reading room in a brick & mortar archive, the
place you go to consult a collection, with finding aids close
at hand and the records themselves a request slip away. Perhaps a
finding aid is the better metaphor here?
The most recent example of this I’ve come across is work from Lawfare Media, who recovered
5,772 pages deleted from the Department of Justice website related to
the Jan. 6 attack on the US Capitol. They’ve built a standalone archival
viewer of the extracted content that links back to the Wayback Machine. There is more about
the motivation for the project in their post The
Justice Department Erases History. Lawfare Restores It. (Sadly the
GitHub repo for the archive itself looks to be private.)
This is a bit archive-eating-its-own-tail, but one feature of the site
that Lawfare Media built is that the search
is operational from within the Wayback Machine’s own snapshot
of the site, since the search runs client-side. A user search doesn’t
require an API back to the server.
Searching the archive from inside the Wayback Machine
Looking at the HTML it appears the site is using minisearch for
client-side search. A nice side effect of client-side search is that the
indexed corpus (metadata for all the DoJ content) is itself available on
the open web, as corpus.json.
Some caring person has even already thought to archive
corpus.json using Save Page Now:
A Wayback Machine snapshot of corpus.json from May 29, 2026
Other “Reading Rooms”
Lawfare’s archive sits in a small but growing genre. Or maybe it’s well
established and I’m just noticing it for the first time? Another example
is Ben Welsh’s
FiveThirtyEight Index
which he built after Disney shut down fivethirtyeight.com
in March 2025. It catalogs over 38,000 articles, datasets, podcasts and
graphics, browsable by author, date and series, with every record linked
back to its Wayback Machine snapshot. (The Internet Archive also runs a
companion
collection.)
Another example is Internet Archive’s Scholar, which provides a
catalog of published research (mostly journal articles) that are found
in the Wayback Machine. I believe this is a presentation layer over data
collected by IA’s FatCat project, which
provides some ability to edit the metadata about the archived content.
In archival terms what these projects are doing is effectively what
finding aids
do: describing scope, arrangement, and provenance, but wrapping it in
something that feels more like a reading room than a paper inventory.
They are themselves websites that will eventually need to be archived. I
think it’s interesting to think about them as a continuation of
something archives have been doing for a very long time. It’s also
interesting to think about the role that agentic coding tools played in
their production (at least in the case of the Jan 6 Archive).
Jonathan Gray and the Public Data Lab at King’s College
London run a project called Repurposing
Web Archives (with the Internet Archive and Internet Archive Europe)
that looks at the tools, methods, and stories of how researchers,
journalists, and artists actually work with the archived web: see their
recent Follow
the Changes post. Perhaps this idea of Reading Rooms for web
archives is a subset of the types of practices this project is
interested in? It seems like there is a gray area between research that
incorporates web archives, and more documentation oriented content for
providing an entry point into web archives?
If you know of other examples of Reading Rooms (or finding aids) for Web
Archives I’d love to hear about them!
This post was originally a thread over in the
Fediverse. Thanks to (freegovinfo?) for
the pointer to the Lawfare Media work.
The Uncanny Valley and Gell-Mann Amnesia Effect in the ACM Digital Library
Michael L. Nelson
2026-05-28
I serve on the ACM Digital Libraries Board, and we are navigating a number of changes to the ACM's Digital Library, which as a professional society and memory organization, is arguably the ACM's primary asset. A recent article (March, 2026) by Jack Davidson and Wayne Graves provides a status update of the ACM's move to open access, which includes establishing a "basic" and "premium" service level. Although there are some questions regarding the long-term implications of moving to open access, I, and presumably all authors, welcome the ACM's bold strategy for ensuring that our content reaches the widest possible audience.
Jack's and Wayne's article also addressed the DL's recent experimentation with AI/LLM enrichment of articles, specifically landing pages. And unfortunately, the experimentation got off on the wrong foot. Just before the holidays in 2025, the landing page for articles in the DL added AI-generated summaries as a sort of alternate or rival abstract. To make matters worse, these summaries were shown by default, and users had to select a tab to show the original, author-supplied abstracts. The figure below is an example taken from Dr. Casey Fiesler (CU Boulder), whose social media post about the summaries being shown by default instead of the abstracts gained a lot of traction.
Fortunately, the expected behavior of showing the authors' abstract by default returned very quickly, and the AI-generated summary is now clearly marked as such, including the date that the summary was generated:
First, let me be clear: showing the AI-generated summary by default instead of the authors' abstract was a terrible idea and was uniformly rebuked. The DL board was not informed that this was going to happen, and I can't recall anyone on the DL board even suggesting it; perhaps it was just an oversight by an ACM staff member or engineer at Atypon. I don't recall exactly when the expected default behavior was restored, but it was soon after the author community complained.
My original suggestion at the DL board meetings (echoed by Dr. Fiesler) was to provide wiki-style editing on the AI-generated summaries, possibly limited to logged-in authors (a possible premium feature?). One can make a good argument for either opt-out or opt-in, but neither option adequately addresses the problem of the sizable back catalog of unreachable authors (JACM began in 1954).
But what I find interesting is the level of author backlash against AI-generated summaries, at least as I observed on social media. This is all anecdotal, and I realize people don't post about things about which they are neutral or have even mildly positive feelings because, let's face it: carping is a lot more fun. But Dr. Fiesler and the others in the thread are all reasonable people and aren't just trolling. I think there's something more fundamental happening. I think our collective reaction (revulsion?) to AI-generated summaries can be explained by adapting two phenomena: the Uncanny Valley, and the Gell-Mann Amnesia Effect.
The Uncanny Valley is a hypothesis that posits that our emotional response to depictions of humans (expressions, speech, movement, etc.) initially rises as the likeness becomes more human-like, and then takes a sharp dive as the likeness becomes nearly human-like but not quite. Basically, most cartoon characters, anthropomorphized animals, etc. are "cute", but the more realistic animated humans in movies like "Polar Express" (2004) are just creepy.
I propose that something similar happens with text. Most authors have no problem with AI tools enriching the work, for example: language translation, extracting citations, repairing/rewriting hyperlinks, suggesting related works, suggesting/assigning keywords and ACM CCS values, and any number of other services and derived content. But generating a summary that rivals the abstract? Yuck. No thanks. An error in citation parsing or CCS assignment? Meh, who cares, either ignore it or fix it, but no one takes to social media to complain. A subtle but detectable (if only by the author) error in a summary? That's glaring and viscerally wrong. And even if we can find no substantive errors, knowing the text is AI-generated, we will find fault with phrasing, the structure, and various minutiae (cf. humans' negative attitudes to replicants in Blade Runner). Extracting keywords is what computers do. Writing abstracts is what we do. If LLMs can write abstracts, what's our job?
Those assessments inevitably derive from us reviewing AI-generated summaries of our own work. Presumably, no one knows the material better than us, so the best anyone / anything else can do is be "as good as", certainly not "better". We're writing for our peers, and we share a nuanced, high-bandwidth vocabulary that outsiders just can't appreciate. On the other hand, if we have to read articles outside of our area of expertise, we often wonder why are the authors so obtuse? Why can't "those people" just write plainly?
This is the essence of the Gell-Mann Amnesia Effect, which was coined by Michael Crichton to describe the phenomenon that the more you know about a topic, the more likely you are to see the flaws in a third party analysis, while at the same time being less critical when that same third party summarizes a topic on which you are not an expert. Anyone who has been interviewed by the media has experienced this: the reporters inevitably butcher your hour-long exposition, provided in painstaking detail, covering all the nuances, edge cases, historical review, and possible future directions – all reduced to a minute or less of decontextualized soundbites. But that news outlet suddenly becomes a trusted and valuable source when they cover a topic outside of your expertise.
I suspect the Gell-Mann Amnesia Effect applies to AI-generated summaries as well: they are an abomination when applied to my work, but a useful de-jargoning tool for exploring unfamiliar or even adjacent sub-fields. This even presupposes that there should be multiple AI-generated summaries, aimed at different audiences (e.g., lay person, High School, undergraduate, researcher). In fact, the rival abstract in Dr. Fiesler's example might be the least useful summary, precisely because it does rival the author's abstract. But writing for audiences other than our own is a different skill set: writing for my fellow researchers at JCDL, Hypertext, Web Science, etc. is what I do, but writing for high schoolers is not what I do. Casting my work into something appropriate for high schoolers would be a good use of LLMs, and simplifications (if not outright errors) are to be expected.
In summary, I think it's natural to feel revulsion when the LLMs are used to rival our work: it falls into the textual uncanny valley, in a way that other generative works, such as translation, do not (at least not currently). But at the same time and based on the Gell-Mann Amnesia Effect, our harshest judgement of AI-generated summaries is reserved for areas in which we are experts, and our assessment of AI-generated summaries improves as we apply them to areas further from our own.
With that in mind, it would make sense for the ACM DL to enable wiki-style editing on summaries, move away from the model of a single summary that rivals the author's abstract in length and complexity, and introduce multiple summaries, tailored to audience and intended purpose.
Are these good summaries? I guess so – although I'm not sure what else to evaluate them against. I don't know the first thing about proteomics, so the "General" summary is certainly the most accessible to me. The "Expert" summary is more detailed than the "General" summary, but still more accessible to me than the authors' abstract. That's not a surprise because 1) I haven't studied biology or chemistry since High School, some 40 (!) years ago, so Schär et al. aren't writing for me, and 2) the summaries are both about half the length of the authors' abstract. I saved all three into separate files:
% wc -w bio-*txt | grep -v total
219 bio-abs.txt
107 bio-expert.txt
88 bio-general.txt
Two hundred words is a good target for abstracts. I'm guessing the prompts for the AI-generated summaries had a target of about 100 words, so by design even the "Expert" summary will not rival the authors' abstract (though metadata and wiki-style editing would be nice). The "Automated Services" tab has at the bottom a link to "Explore Further on ScienceCast":
I don't have an account (yet) on ScienceCast, so that's the end of my exploration for now. But there's clearly a bigger AI↔paper ecosystem to explore, for both me personally and the ACM DL.
–Michael
2026-06-02 Update: I was in another chat with Martin Klein, and had just discovered the institutional repository at Niigata University. It does not have a native English interface, so all of the translations shown below are via Chrome and thus a little clunky. When you first visit the repository, it asks you to choose a persona or level from three choices: "adult", "junior and senior High School students", and "Elementary school student".
I did a search for "web archiving". The hits are not especially relevant (perhaps no one at Niigata is active in the field), but they are sufficient to demonstrate the personas.
Chrome's translation for Elementary School students is not smooth, but I'm guessing that's an issue with Chrome and not the LLM that Niigata is using – presumably there is less training data for translating "children's" Japanese?
The landing page of Niigata's institutional repository does have the regrettable "embedded PDF" interface, and it does list a truncated "AI Explanation" above the "Summary by the author" (to be fair, perhaps the fact that it's named "summary by the author" instead of "abstract" is a function of the translation).
It is a little hard to evaluate this three-level approach, since there's the added dimension of language translation. But it feels like an interesting application of LLMs, and aside from being listed at the top of the SERP, it does not seem to be in competition with the authors' abstract.
Note that the landing page displayed above is likely an experimental and/or local UI since it is hosted at niigata-u.ac.jp, and is very different from the more conventional looking landing page for the associated handle, which resolves to nii.ac.jp.
The college wage premium, that is, the increased earnings associated with having a college degree as opposed to only being a high school graduate, hasn’t changed at all in the past 25 years, because median real wages have been flat as a pancake for everybody, no matter what their formal education level, for the past quarter century.
I wonder what’s happened to capital over this time? Value of S & P 500, inflation-adjusted, 1/2000 to 9/2025 (same period as the wage data):
2000: $1,394
2025: $6,688
On average, for more than the students' entire lives, stock-owners like Schmidt and (to a much lesser extent) I have stolen every last drop of the productivity increase of US workers at every age and education level. (See the actual numbers in the appendix)
Now, the perpetrators of this theft are telling their victims, the students and the public at large, that whether they like it or not they will be subjected to AI because that will make the perpetrators even richer. The victims have been informed that this new technology will:
Nothing better illustrates the contempt of the Epstein class for the proletariat than that these oligarchs would expect the graduating class to enthusiastically accept this prospect.
I was fooling around with FRED this morning, as one does, and here are some stats: (The FRED numbers are presented in nominal dollars; I’ve converted them to CPI-adjusted dollars).
Median usual weekly earnings of workers with a high school degree only:
2000: $968
2025: $980
Median usual weekly earnings of workers with a bachelor’s degree only:
2000: $1,587
2025: $1,580
...
Median usual weekly earnings of people with a bachelor’s degree or higher:
2000: $1,705
2025: $1,747
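For readers who want the percentages behind these figures, here is a small sketch that just does the arithmetic on the numbers quoted in this post. The helper names pctChange and premium are made up for illustration, and treating the "bachelor's degree only" versus "high school only" ratio as the wage premium is an assumption about how to read the table.

```go
package main

import "fmt"

// pctChange returns the percentage change from start to end.
func pctChange(start, end float64) float64 {
	return (end - start) / start * 100
}

// premium returns how much more, in percent, the college figure is
// than the high school figure.
func premium(college, highSchool float64) float64 {
	return (college/highSchool - 1) * 100
}

func main() {
	// Real median weekly earnings, 2000 vs 2025, as quoted above.
	fmt.Printf("high school only:     %+.1f%%\n", pctChange(968, 980))
	fmt.Printf("bachelor's only:      %+.1f%%\n", pctChange(1587, 1580))
	fmt.Printf("bachelor's or higher: %+.1f%%\n", pctChange(1705, 1747))

	// Inflation-adjusted S&P 500 value over the same period.
	fmt.Printf("S&P 500:              %+.1f%%\n", pctChange(1394, 6688))

	// Bachelor's-only earnings relative to high-school-only earnings.
	fmt.Printf("premium in 2000:      %.1f%%\n", premium(1587, 968))
	fmt.Printf("premium in 2025:      %.1f%%\n", premium(1580, 980))
}
```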
Here is a short list of YouTube videos on this topic:
As a boomer, I think this post might be the exception that proves Ms. Baba's rule.
Note that every single one of the ads that I saw watching these videos in an incognito window was advertising an AI company! As are 49% of all the billboards in the Bay Area. Read the room, guys!
This post is being shared on both the dataindex.us newsletter and the Library Innovation Lab Blog.
“Is data changing? Is it being disappeared? How do we know? How can we know?” This interrogative refrain rang through just about every conversation I had when, almost a year ago, I came to Harvard Law School Library to lead the Public Data Project. Thanks to the dataindex.us Data Checkup, a plan is in place to do this complicated but essential work. Through the careful scaffolding dataindex.us has constructed and the assiduous research of its staff, more than a dozen federal datasets have “health assessments,” and the team continues to add to this list.
In October 2025, the Public Data Project partnered with dataindex.us to develop a data monitoring toolkit that could both work at scale and be user-driven. In addition to creating an automated tool that can process large numbers of datasets, we also want the user to determine which datasets they want to monitor. Let’s face it, when it comes to federal data, one person’s byzantine, inscrutable dataset is another person’s trove of invaluable ground truth. The anecdotes of data use collected by essentialdata.us offer varied examples of the ways people benefit from federal datasets. The range of uses is a clear indication that people need to be able to monitor the data that matters to them.
At the Public Data Project, we are creating a toolkit that will enable users to detect and monitor changes to federal datasets over time. It will enable users to select a dataset and track changes within the data itself, as well as to automate the monitoring of external sources that indicate whether the data might be changing. Indicators of change to a given dataset range from somewhat obvious sources, like major news sites, to more obscure sources, like the U.S. Code. At present, our tool development has produced two components.
First, Binoc is a command-line tool and library to generate changelogs for datasets that don’t have them.
Unlike generic diffing utilities intended to describe line-level differences in plain-text content such as source code or Markdown, Binoc aims to efficiently summarize changes in real-world datasets, including file additions and deletions, row-level updates, and schema alterations. Given a series of dataset snapshots captured at different points in time, Binoc detects what changed, expresses any changes as a minimal structured diff, and produces a human-readable summary. Binoc is currently in a collaborative design phase of development, with new features being added regularly. We welcome feedback from early adopters.
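To make the idea of a structured diff between snapshots concrete, here is a minimal Go sketch of just the file-level part: additions, deletions, and modifications. This is not Binoc's code or API. The Snapshot and Change types and the diff function are hypothetical names, and representing a snapshot as a map from file path to checksum is an assumption made purely for illustration. Row-level and schema changes are out of scope here.

```go
package main

import (
	"fmt"
	"sort"
)

// Snapshot is a hypothetical representation of a dataset at one point in
// time: each file path maps to a checksum of that file's contents.
type Snapshot map[string]string

// Change is a hypothetical structured diff between two snapshots.
type Change struct {
	Added, Removed, Modified []string
}

// diff compares an older snapshot with a newer one.
func diff(old, cur Snapshot) Change {
	var c Change
	for path, sum := range cur {
		prev, ok := old[path]
		switch {
		case !ok:
			c.Added = append(c.Added, path) // only in the newer snapshot
		case prev != sum:
			c.Modified = append(c.Modified, path) // same path, new contents
		}
	}
	for path := range old {
		if _, ok := cur[path]; !ok {
			c.Removed = append(c.Removed, path) // gone from the newer snapshot
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(c.Added)
	sort.Strings(c.Removed)
	sort.Strings(c.Modified)
	return c
}

func main() {
	old := Snapshot{"a.csv": "111", "b.csv": "222", "d.csv": "555"}
	cur := Snapshot{"a.csv": "111", "b.csv": "333", "c.csv": "444"}
	c := diff(old, cur)
	fmt.Printf("added=%v removed=%v modified=%v\n", c.Added, c.Removed, c.Modified)
}
```

A human-readable changelog entry could then be rendered from the Change value, one line per category.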
We have also begun the research for a second component of the data monitoring toolkit development.
We have created an AI benchmarking exercise to compare and to evaluate how well AI can monitor data and assess its risk when considered next to the processes and conclusions of a careful researcher. The goals of the exercise are to:
Test how well AI can assess various types of risk to federal datasets;
Evaluate what baseline a popular search model would use to answer those without a custom search harness;
Surface and reflect on the tacit knowledge necessary to perform risk assessment, including the sources needed, the steps involved, and the difficulty of defining criteria;
Create awareness and community through an intellectually engaging activity that includes both individual research and group reflection.
We have conducted an initial test run of this exercise with a group of 10 information professionals. After we introduced the participants to the dataindex.us rubric for assessing the risk level of a given dataset, each participant was assigned a dataset and asked to evaluate it across three of the six risk dimensions outlined in the rubric. Each participant was assigned either the first three dimensions (Historical Data Availability, Future Data Availability, and Data Quality) or the latter three (Statutory Context, Staffing and Funding, and Policy). For the first hour, participants more or less worked alone, diligently researching a subject that they lacked expertise in, but for which they had clear guidelines for the kind of information they sought. Participants then opened ChatGPT and fed it prompts that we had scripted and tailored for each dataset. Participants then reflected on their findings, first in a form that asked them specific questions and then as a group, comparing their results with ChatGPT’s. Going through their three assessment dimensions, participants compared their conclusions to those of AI, reflecting on what AI missed, what they missed, and on what parts of the rubric may have led to confusion.
This exercise gave us an early insight into the potentials and pitfalls of AI’s ability to assess data risk, as well as ways in which we might tweak both the exercise and the assessment rubric. This group of participants were information professionals, not policy wonks, and we are eager to see how area specialists’ experience might lead to different outcomes in this exercise. In addition, we want to experiment with prompt engineering and give participants more leeway in their interaction with AI. In the next iteration of the exercise, we will rely on the transcription of each participant’s interactions with AI for analysis, rather than asking individuals to respond in a form.
What we liked most about this exercise, however, were the collective reflections not just on AI, but on public data more generally. One participant described it as an “excellent empathy-building exercise” because, through the work, both alone and as a group, participants become aware of the importance of and perils to public data. They reflected on whether and how to translate their own empathetic experience to AI.
Win free books from the June 2026 batch of Early Reviewer titles! We’ve got 251 books this month, and a grand total of 3,098 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.
The deadline to request a copy is Thursday, June 25th at 6PM EDT.
Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the UK, the US, Canada, Australia, Germany, New Zealand, Ireland, Malta, Italy, Latvia and more. Make sure to check the message on each book to see if it can be sent to your country.
Thanks to all the publishers participating this month!
Happy June, DLF community! Thanks to everyone who participated in Community Voting for the 2026 Virtual DLF Forum. We appreciate your input as we work with the Forum Planning Committee to build this year’s program.
Look out for updates this month: the program release, registration opening, and Digital Storytelling Fellows applications. We’re excited to share what’s next!
Early Bird Registration Open: Early bird registration for iPRES 2026 is open until July 13. The conference will be held in Copenhagen, Denmark, from September 21-25. A call for Ad Hoc sessions is
Office closure: CLIR and DLF are closed on Thursday, June 18 and Friday, June 19, in observance of Juneteenth.
This month’s open DLF group meetings:
For the most up-to-date schedule of DLF group meetings and events (plus conferences and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find the meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.
DLF Born-Digital Access Working Group (BDAWG): Tuesday, 6/2, 2pm ET / 11am PT.
DLF Digital Accessibility Working Group (DAWG): Tuesday, 6/2, 2pm ET / 11am PT.
DLF AIG Cultural Assessment Working Group: Monday, 6/8, 1pm ET / 10am PT.
AIG Metadata Assessment Group: Friday, 6/12, 2pm ET / 11am PT.
AIG User Experience Working Group: Friday, 6/19, 11am ET / 8am PT.
Digitization Interest Group: Monday, 6/22, 2pm ET / 11am PT.
Committee for Equity & Inclusion: Monday, 6/22, 3pm ET / 12pm PT.
DLF Open Source Capacity Resources Group: Wednesday, 6/24, 1pm ET / 10am PT.
DAWG Policy & Workflows: Friday, 6/26, 1pm ET / 10am PT.
DAWG IT & Development: Monday, 6/29, 1pm ET / 10am PT.
DLF Climate Justice Working Group: Tuesday, 6/30, 3pm ET / 12pm PT.
In the hours following the news that Redhat Insights' JavaScript packages fell
victim to a supply chain attack via NPM, developers and systems administrators
scrambled to ensure all of their projects were unaffected by a supply chain attack that steals credentials for AWS, GCP, Azure, Kubernetes, HashiCorp Vault, npm, and CircleCI before self-propagating via said stolen npm credentials and the bypass_2fa setting. The attack establishes persistence via Claude Code hooks and VS Code task injection. If you have installed the affected package, reprovision your development hardware.
This is due to the affected dependencies being distributed via
NPM, the only package manager where these supply-chain
attacks regularly happen. "This was a terrible tragedy, but sometimes these
things just happen and there's nothing anyone can do to stop them," said
programmer Lady Eulah Howell, echoing statements expressed by hundreds of thousands of
programmers who use the only package manager where 90% of the world's
supply-chain attacks have occurred in the last decade, and whose projects are
20 times more likely to fall victim to supply chain attacks. "It's a shame, but
what can we do? There really isn't anything we can do to prevent supply-chain
attacks from happening if the maintainers don't want to secure access to their
accounts in a robust manner." At press time, users of the only package manager
in the world where these vulnerabilities regularly happen once or twice per
week for the last year were referring to themselves and their situation as
"helpless".
For more information, please see upstream documentation published by
Redhat Insights' JavaScript packages at the following link: redhat-javascript-clients-06-2026.
This post is the first in a series in which I write about experiences or specific challenges from my day-to-day work. Planned posts include descriptions of a bug and how it affected my coworkers, how I wrote a script to parse log data… I’m hoping that these will be interesting for other librarians who work in entirely different areas, for my colleagues who are solving different problems on different systems (or maybe eventually the same one after we migrate), and for those who are thinking about doing this kind of work in the future.
When we talk about the ILS or LSP, it can sound like we’re talking about a single system. And we are, some of the time. But just like our permissions shape what we can see and do, the ways we access the system and its data may lead to entirely different experiences. More importantly, if you don’t know how different tools and even databases work, you may end up with inaccurate results or not knowing that something is possible.
For example, our Sirsi ILS and reporting system(s) consist of two separate databases. These databases can be accessed in: one way for most folks (two for people using a BLUEcloud module), two-to-three ways for some, four ways if you’re special, and five ways if you’re one of two people.
Diffusion of Databases
The Sirsi Symphony Database, fka Unicorn[1], underlies the whole thing. This Oracle database is the ultimate database of record. If we load MARC, it ends up in the Symphony database. If we place orders, they become entries across Symphony tables. If we loan materials, it triggers a series of updates in the Symphony database.
BLUEcloud Analytics runs off a separate database, also Oracle.[2] This separation is common and appropriate. Alma also uses a separate Oracle database and FOLIO has the option of Metadb built with PostgreSQL. The analytics databases don’t contain live data. Instead, they’re updated regularly overnight, based on things that have occurred in the primary database. Change a title? It’ll show up in analytics tomorrow.[3] Check out a book? That transaction will show up in circ stats tomorrow.
This is an appropriate choice for three reasons:
It’s a bad idea to run large analytical queries on production. Plus, static indexes are much more efficient to search.
The analytics system has no real demand overnight, so its server can do a full reindex before running any scheduled jobs.
The analytics database can be designed differently.
Following that last point, the analytics database isn’t just a snapshot of production. It has a fundamentally different design. It anonymizes circulation transactions, but it also builds completely different indexes from the ones we need for daily work. For example, it indexes circulation data by hour, day, month, and year as well as by circulation desk. Sometimes we want big numbers. Sometimes we want to see which desks get the most traffic. Those aren’t the kind of searches we need to do in day-to-day work. It indexes MARC as fields and subfields, including invalid ones like λ.
Accessing the Databases
Most of my coworkers only access Symphony using one tool: Workflows. A few also use BLUEcloud Circ.[4] Using the client, they look for records, update them, perform transactions, etc. We import single MARC records using Workflows wizards. We import batches of MARC records using Workflows reporting (and FTP). Global item updates are done in Workflows. The Workflows reporting module can be used to load, transform, or extract data, history, or (some) statistics.
Next, we have BLUEcloud Analytics. A much smaller set of people (but still plenty) have rights here. As described above, Analytics is a completely separate database. It’s also designed in a way that’s more oriented toward statistical work. Folks use it to extract shelf lists, acquisitions data, spreadsheets of MARC subfields, etc. The indexes are enormous and joined queries can take some time to run (and you can only run joined queries which are supported by the system), but you can get a lot of data and can’t accidentally bring down production.
About four years ago, we got access to Data Control. This is probably my favorite Sirsi product.[5] Unlike Analytics, Data Control gives you the power to query or even update the Symphony database itself. That means it doesn’t have some things that are in Analytics. You can’t see an item’s transaction history, for example, just its current data.[6] Even fewer people have access to this, most use it on our Stage server, and just a couple of us are allowed to run batch updates to production.[7]
seltools is like Data Control for the command line. More properly, Data Control is an interface that lets ordinary humans use seltools with enough scaffolding not to mess quite as many things up. seltools can do even more and can do it very quickly. It is a sysadmin tool and only two people here have rights to use it. It can do extraordinary work in seconds and could cause irreparable damage (or at least, damage that requires restoring from backup). AFAIK it dates back to the launch of Unicorn.
How I Access the Data
I have rights in Workflows, BLUEcloud, Analytics, and Data Control. I tend to use them as a kind of grab bag and often chain Analytics and Data Control in my work, sometimes performing interim steps with Python or OpenRefine.
Because Analytics isn’t querying live data, it’s a much better place to do initial MARC searches. If I want to find every record with a 699, for example, Analytics is the place to do that fast. Or I could look for every 100 or 700 with a subfield “e” or search for a particular piece of text in one or more fields.
But in terms of output, Analytics leaves a lot to be desired for MARC work. It shows a field’s subfields like a table. For example:
Field   Subfield   Data
264     a          New York
        b          Grosset & Dunlap
        c          [1972]
That’s fine if I only want to facet down to the subfield b in each row, but if I want to deal with the MARC data as a field it becomes a problem.
In the Analytics reports I use, it’s easy to add the bib key to a report if it wasn’t already in there. Before we got Data Control, my next step would be to actually switch to something like Z39.50 and download all the bibs manually, hoping I got everything (because our keys are not always in the 001, it’s a long story). I then had to do a delimited export in MarcEdit or write a pymarc script to get the fields I wanted.
Now, if I want to see a set of fields from the record, I simply upload that same set of bibkeys in Data Control.[8] I structure my query to include the tables I want and output the fields I need from each table. I can then export them into a much nicer spreadsheet with the MARC field (and indicators, if desired) printed the way it appears in the original MARC. I can also export the entire set of records as MARC.
264     |aNew York :|bGrosset & Dunlap,|c[1972]
An Example Update
But, even better, I have the rights to update the data. In most cases, I can even use regular expressions. For example, when we added a new ILLiad request placement module to our MyAccount app, we grabbed the 020 (ISBN field) straight from the Symphony API.[9] Unfortunately, about 600,000 of our 020 fields followed the pre-2013 structure, when qualifying information was still included in the subfield a. In 2013, subfield q was introduced to handle things like “(paperback)”. This unexpected data was messing with ILLiad’s automated processes. We could’ve changed the script, but it made more sense to fix the actual data, since we now had the tools.
First, I ran an Analytics query to find all records where the 020a contained “(”, “)”, or any letter except x. I exported the data, extracted the bibkey column, and then broke it into batches of 25,000 bibkeys.
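As a rough illustration of that filter-and-batch step, here is a minimal Python sketch. It is not the Analytics query itself (that runs inside BLUEcloud Analytics); it only assumes the exported report has been reduced to pairs of bibkey and 020a text. The names SUSPECT_020A, needs_cleanup, and batches are hypothetical, and treating both "x" and "X" as allowed is an assumption, since X is a valid ISBN-10 check digit.

```python
import re
from typing import Iterator

# Matches "(", ")", or any ASCII letter other than x/X.
# Assumption: x/X is allowed because it can be an ISBN-10 check digit.
SUSPECT_020A = re.compile(r"[()]|[A-WYZa-wyz]")


def needs_cleanup(subfield_a: str) -> bool:
    """Return True if an 020 $a value holds more than a bare ISBN."""
    return SUSPECT_020A.search(subfield_a) is not None


def batches(keys: list[str], size: int = 25_000) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` bibkeys."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def flagged_bibkeys(rows: list[tuple[str, str]]) -> list[str]:
    """Collect each bibkey once, in first-seen order, if any 020a is suspect."""
    flagged = (bibkey for bibkey, value in rows if needs_cleanup(value))
    return list(dict.fromkeys(flagged))
```

A record with several suspect 020 fields appears only once in the output, so each batch file lists every bibkey a single time.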
I spent a few weeks working on our stage server to develop the appropriate regex-based find and replace patterns to move qualifying data into a subfield q. I had to handle various edge cases: no parentheticals, only one half of the parenthetical, etc. Once I felt confident, I ran a batch of about 5000 on stage and QA’d my results thoroughly. I then spent the next month running batches in production. I limited batch sizes and chose days when we didn’t have other jobs which would trigger big reindexes (you can only do so many jobs in a night or the reindex will take forever and throw off all the other cron jobs).
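To make the idea concrete, here is a small Python sketch of one way such a find-and-replace could work. It is not the pattern actually used in Data Control, whose regex syntax and options are not shown here. The function name split_020a and the pipe-delimited output are hypothetical, chosen to mirror the field display shown earlier. The sketch also assumes that parentheses are dropped from the new subfield q and that trailing ISBD punctuation (":" or ";") stays at the end of the field. Values with more than one parenthetical are returned unchanged for manual review.

```python
import re

# ISBN, optional "(", qualifier text, optional ")", optional trailing ":" or ";".
ISBN_THEN_REST = re.compile(
    r"^\s*([0-9][0-9-]*[0-9Xx])\s*\(?\s*(.*?)\s*\)?\s*([:;]?)\s*$"
)


def split_020a(subfield_a: str) -> str:
    """Move qualifying text out of an 020 $a value into a $q."""
    match = ISBN_THEN_REST.match(subfield_a)
    if match is None:
        # Does not start with something ISBN-like: leave untouched.
        return "|a" + subfield_a
    isbn, qualifier, trailing = match.groups()
    if "(" in qualifier or ")" in qualifier:
        # Several qualifiers would need repeated $q: leave for manual review.
        return "|a" + subfield_a
    field = "|a" + isbn
    if qualifier:
        field += "|q" + qualifier
    if trailing:
        field += " " + trailing
    return field
```

For example, split_020a("0448021234 (paperback)") returns "|a0448021234|qpaperback", and a half-open value such as "044802123X (pbk." comes out the same way as its fully parenthesized form.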
Once the project was done, I was able to re-run queries in Analytics to ensure there weren’t any issues remaining.
I can also click into and update single records from the Data Control results page or set it to let me modify a particular field and paste repeating data into that field. The former is useful when there might be other related fields which need to be updated or I need more context. The latter is useful when only some of the results need to be updated or the person hasn’t yet got regex privileges on production.
Clashing Designs
So that’s what it looks like when things go well. Tech librarianship so often involves what Marshall Breeding called “Knitting Systems Together” that I almost don’t think about the ways I hop across tools. At most I feel a minor irritation. Recently, I ran across a case where the difference between system designs and who had permissions to access what was making a huge difference in my coworkers’ abilities to get their work done.
In theory, the data in Analytics should mirror what’s in Symphony, at most with a different structure. However, when a barcode is updated in Symphony (generally via Workflows), Analytics completely drops entries related to that barcode. The entries are not transferred to the new barcode. Data that’s still in the item record is retained, so we have the item last activity date, the circulation count (an incremented field), etc. But we can’t see the item transaction history.
Now, there were a couple things we could do about this… I’ll describe how system logs come into play in my next post!
[2] Specifically, it’s MicroStrategy whose Wikipedia page starts off like any other data analytics software and then …pivots to Bitcoin. It’s Michael Saylor’s company, if that name means anything to you.
[3] Timing could be more frequent, but I believe most have daily updates.
[4] BLUEcloud is Sirsi’s next-gen browser client. To my knowledge, we still only use the circulation module and many people still use Workflows for circulation.
[5] It’s extremely powerful, though extremely fragile – but that could also describe me, so I can only be so annoyed by it.
[6] Transactions here meaning every time the item was scanned, some of which is available via Analytics. There is also transaction history in Symphony but it’s in logs.
[7] It also supports two kinds of batch updates – a batch modify which lets you edit fields individually in a browser interface and a batch substitute which lets you run updates on fields using regular expressions. If you wanted to update a MARC 500 field on a set of items, for example, someone with batch modify permissions could display all 500 fields on the records, click Modify, and then paste a new text into any field they wanted to replace (while skipping 500 fields which didn’t match). Someone with regex permissions could find all notes matching the old note and sub it with the new note.
[8] Why not do the whole search in Data Control? It is painfully slow compared to Analytics, especially for MARC searches. For the cases when Data Control is designed better for searching, I’ll export a set of keys for the overall records I want to search within and then perform it as a scoped search, which is much faster.
[9] We only use the APIs for integrations not for reporting/updates/etc., so I didn’t list it above. Seltools are much faster and more powerful.