I’m not going to apologize for not posting for, um, almost a year. As you can see from my last post, which I really just recommend skipping and taking this lesson from: 2017 was extremely hard on me. It turns out, 2018 hasn’t gone especially well, either, in many ways that I don’t want to talk about right now. Instead, I want to talk about one specific thing that has been very good: I’m finally on a road toward using my data skills professionally!
I’ve mentioned that I’m an adjunct for the local community college, working as a reference librarian. That is not where I want to be, career-wise, for a number of reasons. (I’m not demeaning my coworkers or the college’s students. Rather, “adjunct” and “reference” both really need to be temporary appellations for me. I’m a technical person: give me your website or your metadata or, heck, even some spreadsheets, I’m not picky; have me teach people how to use technology; let me shepherd policies about shared software usage through the approval process; even have me answer chat and email questions from patrons; just, please, never make me physically sit at the reference desk, because I hate almost everything about that experience.) Really, I want to be working with metadata or data, which is pretty funny when you think about how much of my first master’s degree I did in MATLAB; I should have stuck with that, I guess, rather than … all that other stuff I did afterward.
Anyway, when the college announced a Data Analytics Certificate for [post-bachelors] professionals, you bet I used my free tuition to sign up for as many classes as I could! (Two. We get two classes per semester.)
So this semester, my Monday nights were spent in DAT-102, Introduction to Data Analytics. It was a 20,000-foot intro course, so we didn’t get into a lot of depth on anything or cover much that was new for me—the stated goal was to give students an opportunity to decide whether data analysis was a good path for them or not. I did learn some Excel tricks (and relearn how to do pivot tables, for probably the third time, because I keep going multiple years at a time without needing them—which really just points to bad life choices, doesn’t it?), and I picked up a couple of new data sources to play with. I also finally understand and can explain p values, which was not a thing my engineering-focused probability and statistics courses ever really covered. (Should I not admit that? Eh. We all learn things that “everyone knows,” every day.) It was also a fun class, because the professor broke up the three-hour(!) time block with activities to try to build up students’ intuition about things like probability distributions and experiment design. It was a good experience.
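The simulation-based way of thinking about p values is the one that finally made them click for me. As an illustration (this sketch is mine, not from the course, and the function name and numbers are invented for the example), here is how you can estimate a one-sided p value for a coin-flip experiment by repeatedly simulating the null hypothesis:

```python
import random

def p_value_coin(heads_observed, flips, trials=20_000, seed=42):
    """Estimate the one-sided p value for seeing at least `heads_observed`
    heads in `flips` tosses of a fair coin, by simulating the null
    hypothesis (a fair coin) many times and counting how often the
    simulated result is at least as extreme as the observed one."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(flips))
        if heads >= heads_observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# 60 heads in 100 flips of a supposedly fair coin: how surprising is that?
print(p_value_coin(60, 100))
```

For 60 heads in 100 flips, the estimate lands near the exact binomial answer of about 0.028: under the usual 0.05 threshold, so you’d call the result statistically significant. The point of the simulation framing is that the p value is just “how often would chance alone look at least this extreme?”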
Tuesday nights were CIT-129, Python 2. Yes, I minored in computer science, but it’s been *cough* more than 10 years since then. And yes, I’ve written Python professionally, but only for a very short time, under very bad circumstances, which didn’t really enhance my learning much. So, again, I was in a course where we didn’t really go over anything I hadn’t seen before, but we went over things formally! Which was great! I’m definitely a better programmer for having taken it. Plus, it was a small class, everyone in it was really good, and our professor (same professor for both classes) was a fantastic teacher, so we went so much faster than the original syllabus had suggested we might. I really enjoyed it! Our final meeting, where we shared our semester projects, was last night, and I’m pretty sad that it’s over. Sadder still: there is no Python 3.
One cool thing about all of this, assuming that no one comes along and bumps me from Python 1 (adjuncts can always be bumped) and that the dean signs off on my colleague and me splitting Data Analytics 1 (this is likely but has not yet happened, that I know of), is that I’ll get to have a major impact on a program that is both new and fairly unique. Not a lot of community colleges are doing data analytics, yet. Ours is offering two versions of the program: an associate’s degree (a normal enough thing for a community college to offer) and that post-bachelors certificate (a bit less common of an offering). There’s an official list of courses with descriptions on the college website, but it was put together before they hired anyone with a lot of data expertise. (Makes sense: you can’t hire the data faculty until you have a data program!) Now they have a couple of data-focused faculty members (more, if you include adjuncts like me), who I think have made some updates to the high-level program, but when it comes to really detailed planning (e.g. syllabi), the courses are still very much in development. “We are building the ship as we sail,” my colleague says. … Which is a lot of potential for me to have an impact! (For instance: we will be covering data ethics in 201. And hopefully every other course in the program. But definitely 201, if I’m helping to write the syllabus!)
I’m pretty excited about it! And, yes, I did say above that adjuncting isn’t something I want to do forever. Of course, few people do! But adjunct teaching is a better gig—both in terms of how much I can help students and, probably, in pay (adjunct instructors are in a union; adjunct librarians are not)—plus, adjunct-teaching these classes will be a lot of fun. And challenging: both classes are taught in three-hour chunks one night a week! Also, with only 1.5 classes, I’ll still be able to work at the library for a reasonable number of hours per week, which is nice, because we’re implementing most of the Springshare suite over the next few months, and there’s a lot I can contribute to that, as well. (I’m a reference librarian and just an adjunct, sure, but our web librarian has too much to do. Which means sometimes fun projects trickle down to me! I make sure to get scheduled time at-work-but-off-desk each week, so that I can do projects.)
I do have a couple of job applications open right now, and if you happen to be a potential employer who has read this far, 1) thanks(!), and 2) I want to point out: both of the classes start at 6pm, rather than taking place midday, so we can make it work. I’m also going to prepare really thoroughly during December and early January, to make the load lighter during the semester.
* I would absolutely teach online again, but most students in their first programming course need some dedicated in-person time. Also, the expectations that many students seem to have about online courses are not really compatible with something as work intensive as learning to program. (May vary by degree?)
As an intern in Digital Content and Collections (DCC), I have been working with various older digital collections and projects to address any issues that have come up since their creation. Each of these collections is unique in content and format, with digitization, description, and access processes designed within the context of grants, stakeholder goals, user needs, and technical capacity. Most recently, I completed a project to ensure access to digitized materials from the Superiorland Library Cooperative (SLC), and through this project I learned several lessons that are useful for me, and for anyone else working with older digitized materials, to keep in mind for future projects with digital collections.
This post was written by Josh Hogan, who received a DLF HBCU Fellowship to attend the 2018 Forum.
Josh Hogan is the Assistant Head of the Digital Services Department at the Atlanta University Center Robert W. Woodruff Library. His primary responsibilities include taking a leadership role in a variety of digital curation activities, including digitization, metadata creation, repository management, and digital preservation.
Prior to assuming this role, he was the Metadata & Digital Resources Librarian at AUC Woodruff Library and spent several years as a manuscript archivist at the Atlanta History Center. Josh is strongly interested in digital scholarship/humanities research as well as the potential uses of open source software in digital preservation workflows.
“This neat separation, keeping your nose to the professional grindstone and leaving politics to your left-over moments, assumes that your profession is not inherently political. It is neutral. Teachers are objective and unbiased. Textbooks are eclectic and fair. The historian is even-handed and factual. The archivist keeps records, a scrupulously neutral job.” –Howard Zinn
In reflecting on my first experience at the DLF Forum, I am sure I’m not the first or only person to liken the experience to drinking from a fire hose. The many and varied experiences I had while in Las Vegas could fill several reflection pieces and would, perhaps, make for disjointed reading for those not living in my brain. After giving it some thought, I decided that the most important facet of my experience was finding an organization to which I could wholeheartedly and enthusiastically contribute.
I’ve chosen the quote above from Howard Zinn as an illustration of what I mean. As information professionals—librarians, archivists, etc.—we are often encouraged to make neutrality a central tenet of our ethical lives. I am not sure that approach is very healthy for us as well-rounded human beings. I have no beef with objectivity, i.e., a willingness to base our conclusions on the facts or evidence as those are presented to us, but I agree with Zinn that standing by neutrally only perpetuates the status quo. In other words, you can’t really be neutral on a moving train. I say all of this because I believe that the community encouraged by DLF is one that demands not a limp neutrality but an informed engagement.
This was evident across many of the sessions I attended, from Anasuya Sengupta’s excellent and thought-provoking plenary, “Decolonizing Knowledge, Decolonizing the Internet,” to the many engaging and important sessions put together by the Labor Working Group members. The work of DLF’s practitioners is grounded in reality and facts, but it is also engaged passionately with the issues and concerns of the broader communities that we serve.
As part of my experience, I had an opportunity to serve as a moderator to a couple of excellent sessions. The first was a session organized by the Labor Working Group on “Organizing for Change, Organizing for Power” to address ways to change one’s work place, community, etc. for the better. The presenters were so well organized that I had only to be there as a backup and was able to participate in the discussion. I also (lightly) moderated a powerful session on Citizen Scholarship which aimed to “discuss how we are building communities, supporting engaged pedagogies, and transforming institutional cultures through collaborative and situated knowledge-making work.” These activities both inspire me and remind me why I wanted to pursue work as a librarian/archivist in the first place.
Finally, attending DLF gave me a great opportunity to get plugged into the emerging Digital Scholarship Working Group, formerly known as the “Miracle Workers.” I was struck by the diversity of backgrounds and interests of people working on Digital Humanities and Digital Scholarship issues, and I am excited to contribute to their activities as the group grows.
In short, being a Fellow was an excellent experience that introduced me to an organization I really believe in. I plan to be a contributor for years to come.
If you’d like to get involved with the scholarship committee for the 2019 Forum (October 13-16, 2019 in Tampa, FL), sign up to join the Planning Committee now! More information about 2019 fellowships will be posted in late spring.
Both the book Cane, and its author, Jean Toomer, resist easy classification. It’s now considered by many literary critics to be one of the major American literary works of its time, but despite admiring reviews, it did not sell widely, was out of print for notable intervals, and is not well-known to the general public. The book’s depictions of African Americans (both in the characters portrayed and in the expression of the author) didn’t fit into the expectations of many white and black readers. As Langston Hughes put it, “‘O, be respectable, write about nice people, show how good we are,’ say the Negroes. ‘Be stereotyped, don’t go too far, don’t shatter our illusions about you, don’t amuse us too seriously. We will pay you,’ say the whites. Both would have told Jean Toomer not to write Cane.” Toomer himself, who had both black and white ancestry and attended segregated schools on both sides of the color line growing up, didn’t want his publisher to emphasize his “colored blood” in the marketing of the book. His racial self-identification would vary over the course of his life.
The book’s structure was also unusual, being a series of vignettes, not always straightforwardly connected, employing prose, poetry, and dramatic dialogue. It’s often classified as part of the Harlem Renaissance that was getting increasing attention in the 1920s, but it could just as easily be grouped with the sort of experimental modernist literature that white writers like James Joyce, Virginia Woolf, and Gertrude Stein were publishing.
The book was published and copyrighted in 1923, and Toomer renewed the copyright in 1950. By then, Toomer had decided not to pursue literary fame, and was occasionally publishing pieces like this one on his adopted Quaker religion. In between, he wrote and sometimes published other literary works, but I have not found copyright renewals under his name other than the renewal for Cane, though he lived until 1967. The influence of the book lives on in other writers who have appreciated it, though. One of those writers is Alice Walker, who said in a 1973 interview that the book “has been reverberating in me to an astonishing degree. I love it passionately, could not possibly exist without it.”
Many 20th century literary works by African American authors are now in the public domain, either because they were published before 1923, or because they did not have their copyrights renewed. (For instance, many of Zora Neale Hurston’s early works were never renewed, and the influential NAACP magazine The Crisis also did not renew its copyrights, though some of its contributors renewed theirs.) Some of these works are online now, and some are not. In 20 days, I’m looking forward to seeing Cane join the public domain here, and have it and other often-overlooked works by African Americans become free to read and appreciate online.
From October 2 to November 20, 2017, a working group of volunteers representing five NDSA member institutions and interest groups conducted a survey of organizations in the United States actively involved in, or planning to start, programs to archive content from the Web. This survey builds upon and extends three earlier surveys that the working group has conducted since 2011.
The goal of these surveys is to better understand the landscape of Web archiving activities in the United States by investigating the organizations involved; the history and scope of their Web archiving programs; the types of Web content being preserved; the tools and services being used; access and discovery services being offered; and overall policies related to Web archiving programs.
A few major takeaways from the report include:
Public libraries participating in the survey increased to 13% (15 of 119) of respondents from less than 3% of respondents in each of the previous surveys.
A growing comfort with archiving without permission or notification. Seventy percent (71 of 101) of institutions in 2017 did not seek permission or attempt to notify the content owner. Also, 91% (106 of 117) of respondents reported never receiving a takedown or stop-crawling request.
A notable 51% (23 of 45) of organizations reported using Webrecorder, which was publicly launched in 2016 as a browser-based tool for capturing content that is difficult to collect via traditional link-based crawling.
Archive-It continued to be the preferred external service for harvesting Web content, with 94% (97 of 103) of respondents using this service. While the vast majority of respondents are utilizing Archive-It, few (20 of 108) organizations reported downloading their WARCs for local preservation or access, continuing a trend noted in previous surveys.
Diversification of the field, maturation of programs, and technological developments presented areas of progress for the profession, while access to archived content and institutional support for program expansion remained relatively unchanged from prior survey years.
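As a quick sanity check, the headline percentages quoted above can be reproduced from the raw counts in the report with a few lines of arithmetic (the labels here are shorthand for the findings, not the survey’s own wording):

```python
# Reproduce the rounded percentages from the NDSA survey counts above.
counts = {
    "public libraries responding": (15, 119),
    "archived without permission/notification": (71, 101),
    "never received a takedown request": (106, 117),
    "using Webrecorder": (23, 45),
    "using Archive-It": (97, 103),
}
for label, (n, total) in counts.items():
    print(f"{label}: {n}/{total} = {100 * n / total:.0f}%")
```

Each rounded figure matches the percentage reported in the takeaways (13%, 70%, 91%, 51%, and 94%, respectively).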
NDSA Web Archiving Survey Working Group
Interested in activities like this, or in joining with other organizations committed to the long-term preservation of digital information? Get involved with NDSA yourself at: http://ndsa.org/get-involved/
SharePoint is just a document management system, so why are there long threads on Reddit devoted to its inadequacy? Why do your users hate it so much? Can you make it better? Here are a few tips to improve the SharePoint experience.
Besides all of that drama, many companies create separate SharePoint domains for every minor department or subdivision of the company. Moreover, some of what you need for context in order to find things in SharePoint may not be in SharePoint at all. In other words, to do search well, you need a system designed for modern search. You should also consider using security groups rather than separate SharePoint domains, and think carefully about how to organize your documents into libraries.
It Is Too Slow
Obviously, you need adequate disk and CPU capacity for your SharePoint servers.
Use groups instead of individual user permissions: individual permissions slow down the indexing process, and they are also a pain to manage.
Out of the box, SharePoint isn’t the prettiest thing. However, it is highly customizable, especially the web layer. Brand it with your corporate look and feel, or a more attractive variant of it. Using HTML and CSS, you can change the look and feel with minimal deep technical effort.
They Can’t Share Things
You can improve your users’ ease of sharing if you:
In the end, SharePoint is a powerful tool for organizing internal content, controlling access, and sharing content externally, if used and configured correctly. Its built-in search capabilities aren’t great, but they can be tuned; you will often be better off using dedicated search software to provide more relevant, personalized, and context-sensitive results.
The makerspace at Abilene Christian University has been operational since 2015. It is unusual in that it is an academic space at a private university that offers equal service privileges to both the campus and the community. In an attempt to encourage a maker mindset within our broader region, we began offering a series of day camps for elementary and middle school students. To our surprise and delight, the homeschool community became our biggest group of participants. What started as serendipity is now a conscious awareness of this group of patrons. In this article, I outline how our camps are structured, what we discovered about the special needs and interests of homeschool families, and how we incorporate this knowledge into outreach and camp activities. I also share how we evaluate the camps for impact not only upon campers but also within the larger goals of the library and university.
Makerspaces, also known as Fablabs or hackerspaces, are a growing area of service for libraries. Makerspaces are places that combine tools, technology, and expertise to let people create physical and sometimes digital objects. Initially found mostly in engineering departments, makerspaces have grown internationally to encompass all disciplines and ages (Lou & Peek, 2016), placing them well within the mission scope of academic, school, and public libraries.
Central to the ethos of makerspaces is the idea of community. Makerspaces allow individuals from all ages, experience levels, and backgrounds to come together and collaborate in a shared space. In his formative essay, Dale Dougherty, founder of the maker movement, suggests that the ability of makerspaces to transform education is that they offer anyone a “chance to participate in communities of makers of all ages by sharing your work and expertise. Making can be a compelling social experience, built around relationships” (Dougherty, 2013, p. 9).
The library at Abilene Christian University (ACU) seeks to encourage this educational transformation through its own makerspace, which we call the Maker Lab. First opened in 2015, the ACU Library Maker Lab is an academic makerspace that is open to all areas of the campus as well as to the public. In an attempt to grow the maker mindset both in our community and among our campus families, we decided to host a series of children’s and youth day camps called Maker Academy. To our surprise and delight, the homeschool community became our biggest group of participants. What started as serendipity is now a conscious awareness of this group of patrons. This article will outline how our camps are structured, what we discovered about the special needs and interests of homeschool families, and how we incorporate this knowledge into outreach and camp activities. We will also share how we evaluate the camps for impact not only upon campers but also within the larger goals of the library and university.
College preparedness and overall learning is a concern that academic librarians share with their colleagues in public and school libraries, so involving the kindergarten to 12th grade (K-12) population at universities is not unheard of even though it may not be commonplace (Tvaruzka, 2009). The University of Nevada began holding workshops for teachers of K-12 students to familiarize them with college research assignments and to help them infuse age-appropriate library skills into their own assignments. They expanded the program to involve middle school students to introduce them to the support and services a college library can provide (Godbey, Fawley, Goodman, & Wainscott, 2015). At the University of South Alabama, librarians conduct summer enrichment programs to enhance the research literacy of high school students who are preparing for health careers (Rossini, Burnham, & Wright, 2013). Although they may not work extensively with children and teens, libraries of higher education do reach out to younger people.
Camps and summer experiences are a different environment from formal classes, and necessarily require different activities and teaching techniques suitable to the age of the campers. Academic libraries vary in the degree to which they are directly involved with K-12 summer activities. Many offer tours and scavenger hunts for children on campus as part of other events (North Carolina State University, n.d.; University of Michigan, 2017; Washington State University Libraries, 2018). Others interact with academic camps hosted by different departments to do a session on research skills, often with a more informal and problem-oriented approach. Ohio university librarians, for example, developed interactive ways of introducing high school engineering students to their discipline’s literature and the role it plays in developing new products (Huge, Houdek, & Saines, 2002), and a team of librarians at Carnegie Mellon expanded their roles to work with middle school girls pursuing engineering (Beck, Berard, Baker, & George, 2010). Each example offers a good description of adapting traditional library content, and teaching brief research workshops for youths.
The literature has fewer examples of extended camps that involve the library for the entirety of the camp experience. Temple University Library in Philadelphia involves high school students in a summer intensive initiative to do community mapping and thus build knowledge of Geographic Information Systems (GIS) coding (Masucci, Organ, & Wiig, 2016). In a collaborative model, the University of St. Thomas partners with St. Paul Public Library to host a five-session series for middle school students to implement laser cutting, circuitry, and basic engineering design skills (Haugh, Lang, Thomas, Monson, & Besser, 2016). Virginia Commonwealth University Library works with its School of Engineering to offer extensive research instruction appropriate for youths participating in a campus camp where they design and prototype their own inventions (Arendt, Hargraves, & Roseberry, 2017). In these examples, the library is more of a full partner and host of the camp experience, although they typically share the work with other units.
With respect to homeschool students, a surprisingly small body of literature addresses the educational needs of the audiences the library serves, as opposed to the programs the library offers. Nearly all of the information regarding library service to homeschoolers comes from a public library perspective. This paucity of articles, especially of recent publication, suggests a gap in the literature. The majority of articles discuss the interests of families who homeschool and the implications for library programming and collections (Blankenship, 2008; Johnson, 2012; Shinn, 2008). Paradise (2008) discusses the need for child-friendly environments, especially with computer accessibility, that accommodate families and children working for extended sessions. Others offer observations about communication and outreach, noting that families who homeschool tend to connect with each other at the community level, making it imperative that the library use these dialog channels to reach members (Hilyard, 2008; Willingham, 2008). Particularly relevant to those planning educational activities is an article on creating successful programs for homeschooled children and the need to include hands-on learning (Mishler, 2013). Although these sources have a public library orientation, they provide useful suggestions from libraries experienced in working with homeschooled youths, especially for librarians in higher education, who may be less familiar with this age group.
Brief History of the ACU Library Maker Lab
Abilene Christian University is a private university located in Abilene, TX. With an enrollment of just under 5,000 students, ACU falls within the Carnegie classifications as a “Masters Colleges and Universities: Larger Programs” institution. There is one library on campus that serves all disciplines.
The Maker Lab is an 8,000 square foot space within the library. It was established in 2013 with funding from the university Provost’s Office and the library’s own budget. It has many of the tools typically found in makerspaces: 3D printers, a laser cutter, vinyl cutter, sewing machines and fiber arts supplies, power tools and hand tools for woodworking, and electronics workbenches. It is staffed by a full-time lab manager, a librarian who divides her time between the Maker Lab and the regular reference and instructional services within the library, and a cadre of student workers who supply a total of 132 student hours a week, Monday through Saturday.
Most of what the Maker Lab offers is free. There is no membership fee or charge for tool usage or machine time. We do charge for some materials, like 3D printer filament, sign vinyl, or large sheets of plywood. The full list of materials and their prices are available via the Maker Lab Store1. Makers are welcome to bring their own materials if they want something other than what is already on the premises. We maintain a fairly robust “scrap pile” of leftover and donated material, and these are free. We have found that having materials on hand, particularly free scrap, is a good way to encourage newcomers to try something and helps overcome initial barriers to getting started.
A significant distinction of the ACU Library Maker Lab is that it is completely open to all individuals regardless of whether they are affiliated with the university. This policy was a deliberate decision from the space’s inception. The ACU Library as a whole has a rich tradition of serving anyone who is in the building and of collaborating with other libraries of all types in the area. The Maker Lab continues that tradition. Even the computers in the Maker Lab do not require a special university login. The space is designed to be as convenient as possible for all makers to use.
Since its opening, the Maker Lab has enjoyed a healthy yet somewhat narrow use among our institution’s population. We wanted to build upon our user base and encourage a maker mindset among other groups. This desire was the inspiration for Maker Academy.
Maker Academy is a series of day camps for kids. We first offered it in the summer of 2014 and have continued every summer since, refining the program each year. We presently have three camps: one for children in 4th and 5th grade, and two middle school camps for youths in 6th through 8th grades. Each camp lasts three days and runs from 9:00 a.m. to 4:00 p.m. We charge $100 per maker. This fee covers all making supplies, lunch and snacks for each day, and a camp T-shirt. We host the camps as outreach, not as a profit-making activity, and we cap registration at 20 makers per camp because that is the maximum our space and staff can accommodate.
Maker Academy introduces kids to tinkering and learning through making. From activities like building catapults, kites, and go-carts, they learn science principles as well as prototyping and fabrication skills. They might learn soldering and basic circuitry to wire a lamp or build a robot with an Arduino, a small, simplified microcontroller board designed to control mechanical devices. They experiment with various pieces of graphics software to create their own T-shirt designs or decorations for their projects. Each activity fosters curiosity, problem solving, and creativity. Activities change each summer to give makers a new experience each time.
We initially advertised the camps to those we thought would be our primary audience, namely children of faculty and youth in local schools. We made posters, sent email blasts, wrote campus newsletter articles, posted on blogs, and advertised on the library’s web page. We talked with many colleagues personally. We sent out fliers and emails to public elementary and middle schools in the area. Then we eagerly waited to see who would enroll.
For the first set of camps, only 12 kids registered. It was a very disappointing turnout for all the effort we expended. We distributed more emails and more reminders, but they made little difference.
The turnaround occurred when a parent who homeschooled heard about the camps through a source unknown to us, and volunteered to share the news on the local homeschool listserv. Within one day, all the camps filled up and overflowed to a waitlist. We were overwhelmed and amazed. We realized that we had unintentionally overlooked a significant portion of our community and sphere of influence. We needed to focus more on the homeschool community and how we could include them, not only in our marketing, but as a planned audience in our programming. But first, we needed to know more about this population.
Characteristics of Homeschoolers and Library Implications
We wanted to understand the needs of those who homeschool and how the library could speak to those needs. We found helpful published information that we mentioned earlier in this article, but most of what we learned came from speaking with parents, observing the kids, and getting feedback along the way. Over time, we noted several characteristics of local homeschool families that inform how we structure our outreach and services.
Homeschoolers are a tight community with a culture of sharing.
Those who homeschool often do so because they can choose their own curriculum and approach to education. No national group oversees homeschool education, and because of homeschooling’s independent nature, there is no single association to which homeschool families belong and from which the library can get a convenient list of members in its area. While there are broad-based companies that offer customized curriculum guides, homeschooling tends to be organized around multiple state or regional groups and small co-ops for common interests. It is incumbent upon the library to identify these groups in order to reach members.
The local homeschool groups not only connect members but they also foster information sharing. Since there is no central governing agency, families share news among themselves about homeschooling. The regional groups provide a loose organization in which the sharing takes place. If the library can identify the local homeschool groups, it can become part of the resources that will be shared with others.
Homeschool groups have specialized but very effective communication channels.
Parents exchange information, curriculum ideas, and expertise via email lists. Local chapters will announce education activities that support independent learning. These listservs and announcements often constitute the main vehicle of support for homeschooling families, so they are very active and very efficient (Turner, 2016). More importantly, they also tend to lie outside a library’s traditional outreach channels.
When the parent who homeschooled posted the news about our Maker Academy, all it took was one initial email for most of the local homeschool community to know about the camps and to respond. It was a very effective means of communication. To use it, however, we had to get on the email list. We realized that our usual outreach channels were not as widespread as we thought, and we needed to expand our reach to draw in this special yet substantial portion of our community.
Homeschoolers are eager for innovation and meaningful educational opportunities.
A national survey indicates that dissatisfaction with academic instruction at traditional schools is the second most common reason families decide to homeschool (McQuiggan & Megra, 2017). The same survey reports that 39 percent of homeschool families rank a desire for nontraditional instruction as “important.” Real-world, personalized learning is valued by many who practice homeschooling.
This is good news for libraries because innovative learning is very much what makerspaces are all about. Nearly every activity in our Maker Academy involves hands-on learning with a direct tie to real-world skills. For example, students participate in making their camp T-shirts and learn about screen printing and vinyl cutting. They learn how a basic circuit works and put that knowledge into practice by wiring their own custom-made lamps. Especially effective are activities that build upon recently acquired skills and naturally scaffold up to more advanced knowledge. We frequently start beginners with a “learn to solder” project in which they connect LEDs to a battery to make a wearable badge that lights up. Soldering, we explain, is the basis of many electronic projects. Then we introduce a switch so they can turn the light on and off. Next they solder multiple switches to connect a speaker to amplify music from a cell phone. Learning by doing lets young makers build their knowledge in ways that are effective, practical, and that both they and their parents can appreciate.
Homeschool families, like anyone else, appreciate free tools they can use at home and shared resources they can use at libraries.
When asked where they get educational material, over 70 percent of homeschool families cite free content on websites and material from the public library as their main sources of curriculum material (Redford, Battle, & Bielick, 2016). They are looking for tools that are easily accessible and affordable yet provide quality education.
This creates a rich opportunity for the library to teach with open source/open access tools. It is the nature of makerspaces, and increasingly of libraries, to embrace and even favor open source and open access. Although many of the high-tech machines in a makerspace are expensive, they run on open source software that anyone with a computer can download for free. We promote these tools in Maker Academy. We take the classes to our computer lab and introduce them to design programs like Inkscape for graphics and Tinkercad for 3D design. We tell them and their parents that this software is free, and that they can use these tools at home and then come to the library to print their designs on a laser cutter or a 3D printer. This opens enormous possibilities and leverages the library-community relationship. It creates opportunities to maintain contact as homeschool families come back after camp to use the library.
Many homeschool families look for opportunities for their children to engage socially and to work as part of a team.
Researchers define socialization differently, and the answer to the question of what it means to be properly socialized is a highly personal one. However, there is some general consensus that socialization involves a common set of abilities: (a) functional life skills that enable people to operate successfully in the “real” world; (b) social skills including the ability to listen, to interact, and to develop a strong sense of identity and values; and (c) a sense of civic engagement, or of being willing and able to give back to society (Kunzman, 2016; Neuman & Guterman, 2017). Homeschool parents are aware of the need to avoid isolationism, and they actively seek opportunities that provide socialization experiences in ways that conventional schools may not. They want their children to respect others, to develop a sense of teamwork, and to get along with people of diverse backgrounds. They use a variety of resources outside the family to provide these opportunities (Medlin, 2013).
To respond to the need for social enabling, we deliberately schedule Maker Academy activities that facilitate these softer skills. Makers may work on individual projects, helping each other as needed, but they also have group projects that they develop collaboratively. They may divide into groups to make gravity powered go-carts where they have to share design decisions, take turns using tools, and agree on a team name and logo. Sometimes they construct individual paper roller coasters, but then they combine their individual models into one big camp model. Simply having a camp experience lets participants take part in positive social interactions, while group activities encourage them to practice brainstorming, collaboration, and consensus-building skills.
As a further way to build social skills, we incorporate a grand Show and Tell on the last day of camp. We invite friends and families to see the final results of what their children have been doing. Makers show what they have made, explain how they made it, and describe what they learned along the way. They often will take their parents on a tour of the Maker Lab, explaining in detail how the machines work and how they used them. We serve refreshments, and there is lots of interaction. The whole event caps the camp experience for everyone. A fun presentation time completes the creative cycle for makers by allowing them to share their results and practice answering questions publicly. Parents see evidence of the broader, holistic social skills beyond just the technical learning.
Evaluation is an important yet often challenging part of every library initiative. It is especially crucial for a nontraditional service like makerspaces and youth day camps since university administration may question the expenditure of time and effort on a population not directly affiliated with the institution. Gibson and Dixon (2011) offer a framework that libraries can use for assessing the impact of various programs. It is particularly relevant in that it stresses degree of engagement rather than only the metrics of attendance and popularity as indicators of success.
Effectiveness encompasses not only how many people a program reached but also the depth of outreach. Were we effective in reaching people in terms of numbers, and how wide a net did we cast? Were we inclusive?
Our original outreach to faculty families and public schools was not very effective. It garnered only 12 responses for our three maker camps. It was not until we serendipitously discovered the additional segment of the homeschool community that our outreach became more effective. We went from 12 campers that we had to go to great lengths to recruit to registration that now fills up within the first day, plus a waitlist.
The key for us was learning about the homeschool segment of our community, and how they communicate. Our typical outreach channels were too narrow; they excluded this significant community. We had to broaden our promotional methods to include more than we had been doing.
We learned an important lesson. At the start of any new marketing initiative, we now try to ask ourselves, “What groups are we missing because we are defining our user base too narrowly?”
Proof of Educational Benefit
Proof of educational benefit is especially important for academic libraries. Academic libraries are, at their heart, educational institutions. Their mission is tied to learning. While gate counts and program attendance can indicate popularity, the deeper question we have to answer is whether or not anyone learned anything worthwhile.
Fortunately, educational evidence is fairly easy to gather for something like Maker Academy. The objects the campers make serve as proof of what they learned. Campers make T-shirts, build catapults, assemble speakers powered by a Raspberry Pi, create robotic drawing machines from a small motor, and make many other things. Parents see these objects displayed in the Show and Tell at the close of camp, the children can explain how their creations work, and they can take their projects home. These objects are physical evidence of learning.
Image 1: Paper roller coaster activity and whiteboard showing lessons learned from the project. From ACU Maker Academy 2017.
We have also discovered that sometimes we have to make other types of learning a little more evident. A paper roller coaster might seem like a fun construction activity, but there are some important principles of speed and motion, not to mention principles of life, that kids learn from it. Asking young makers to articulate what they discovered and then writing these on a whiteboard uncovers the deeper learning behind the activity (Image 1). Consider these examples from what young makers in our 2017 Maker Academy said they learned:
Mistakes are fixable.
Collaboration = better ideas.
Heavy goes slow. Light goes fast.
Tape is good.
Tape is bad.
Test your ideas.
For parents, we call the completed projects and accompanying responses “what the kids made” and “what they learned.” For university administrators, we use the pedagogical terms “learning artifacts” and “reflective practice.” Each audience has its own jargon, but no matter the terminology, it is proof of learning.
According to Gibson and Dixon (2011), strategic positioning involves factors like increasing the attractiveness of the library as a partner, developing friends and allies, and increasing the quality of relationships. It implies taking conscious steps that will put the library in a good place in the future. With noticeable growth of homeschooling (Ray, 2018), it is likely that a significant portion of future college students will come from a homeschool environment. Colleges and universities would benefit by cultivating relationships with this market segment.
Library programming, particularly that which involves creating something, can play a part in recruiting (Scalfani & Shedd, 2015). At ACU, Maker Academy brings many people to campus who would not otherwise come. It introduces them to the layout of the buildings. It gets them inside the library where they see the environment, meet the librarians, and hopefully learn that the campus is not totally closed off to them. Inviting them here early encourages them to think favorably about higher education, and possibly the library’s parent institution, as part of their future. Making friends with external constituencies, demonstrating how the university is a partner for success, and aiding in recruiting are all forms of social capital that administrators value.
Long-term Learning Outcomes
Outcomes refer to changed behaviors or attitudes that program participants exhibit as a result of the program. We see program outcomes in the form of voluntary extra involvement in the Maker Lab. This comes from homeschool co-ops who want to bring their group in for a tour and a hands-on activity. It also comes from a homeschool robotics group that designs robotic parts with the open source software they learned in Maker Academy and visits in person to print the parts using our 3D printers. A particularly significant outcome indicator is when we host a Maker Festival and some of the young Maker Academy graduates volunteer to exhibit new things they have made or to teach a workshop on a skill they have learned in the makerspace (Image 2). Going from student to teacher is a powerful outcome.
Image 2: ACU Library MakerFest 2017.
Using maker day camps as an outreach tool has been a very successful program for us. We learned some important lessons about our users as well as about our own practices. But these lessons are not only for us. They can speak to other libraries as well.
Libraries of all types seek to serve others. Whether providing storytime in a public library, curriculum resources for a school, or research help for a paper or dissertation, we all seek to connect our services with others’ needs. Serving our patrons starts with an awareness of them. It may be that other libraries, like ours, tend to define their patron base too narrowly. We expend more effort looking in the same places with the same outreach tools, and fail to realize who we might be missing in the process. For ACU, that group was homeschoolers. Simply asking the question, “Who are we missing?” is a good reminder for all of us to look more widely and inclusively at those we can reach.
Secondly, we in libraries can realize that we have more to offer than we may think. Pedagogical research has long advocated for participatory learning, social engagement, reflective inquiry, and technology blended with traditional methods. Libraries offer these things, and they are a natural match for homeschool families. Hands-on activities, group learning, and teaching with open source tools are what libraries do, and makerspaces are innately geared to many of the forms of teaching and learning that educational reform is calling for. Libraries can confidently come forward to show how we demonstrate solutions and learning outcomes, and we can be leaders in reaching new markets beyond the people already on campus.
Incorporating families who school at home into the library’s larger community is first a matter of communication rather than one of program content. Many of the needs and interests of homeschool families are similar to those in mainstream education. Hilyard notes, “The only real challenge libraries face in serving homeschooling families is reaching them” (2008, p. 18). Ten years later, that initial challenge remains the same. Plugging into the existing local listservs and co-ops is imperative; doing so transformed our effectiveness in reaching those around us.
Part of thinking beyond the immediate college-age population is thinking more inclusively of special interest segments of a library’s external community. For us, this special interest was those who homeschool. Parents, children, and—in the future—teachers and homeschool co-ops are all part of our expanded patron base. We remind ourselves that sometimes success depends less on fishing in the same spot than on how wide we cast our nets.
My gratitude goes out to Sarah Naper, external reviewer for this article, and to Amy Koester, internal reviewer. You gave extensively of your time and experience, and you encouraged me to go deeper by your questions. Not only this article but this librarian is better because of your involvement. Thanks, too, to Kellee Warren whose work as publishing editor guided me through the publication process, further polished the article, and kept me focused along the way. To Darren Wilson and my other colleagues in the ACU Library Maker Lab, thanks for your work every day and for making learning fun.
Arendt, J., Hargraves, R. S. H., & Roseberry, M. I. (2017). University library services to engineering summer campers. In 2017 ASEE Annual Conference & Exposition. Retrieved from https://peer.asee.org/29060
Beck, D., Berard, G., Baker, B., & George, N. (2010). Summer engineering experience for girls (SEE): An evolving hands-on role for the engineering librarian. In 2010 Annual Conference & Exposition (pp. 15.1146.1-15.1146.25). Retrieved from https://peer.asee.org/16030
Blankenship, C. (2008). “Is today a homeschool day at the library?” Public Libraries, 47(5), 24–26.
Dougherty, D. (2013). The maker mindset. In M. Honey & D. E. Kanter (Eds.), Design, Make, Play: Growing the Next Generation of STEM Innovators (pp. 7–11). New York, NY: Routledge.
Godbey, S., Fawley, N., Goodman, X., & Wainscott, S. (2015). Ethnography in action: Active learning in academic library outreach to middle school students. Journal of Library Administration, 55(5), 362–375. https://doi.org/10.1080/01930826.2015.1047262
Haugh, A., Lang, O., Thomas, A. P., Monson, D., & Besser, D. (2016). Assessing the effectiveness of an engineering summer day camp. In 2016 ASEE Annual Conference & Exposition (Vol. Paper #15045). https://doi.org/10.18260/p.26311
Hilyard, N. B. (2008). Welcoming homeschoolers to the library. Public Libraries, 47(5), 17–27.
Huge, S., Houdek, B., & Saines, S. (2002). Teams and tasks: Active bibliographic instruction with high school students in a summer engineering program. College & Research Libraries News, 63(5), 335–337.
Johnson, A. (2012). Youth matters: Make room for homeschoolers. American Libraries, 43(5/6), 86.
Kunzman, R. (2016). Homeschooler socialization: Skills, values, and citizenship. In M. Gaither (Ed.), The Wiley Handbook of Home Education. John Wiley & Sons, Inc.
Lou, N., & Peek, K. (2016, February 23). By the numbers: The rise of the makerspace. Popular Science, 288(2), 88.
Masucci, M., Organ, D., & Wiig, A. (2016). Libraries at the crossroads of the digital content divide: Pathways for information continuity in a youth-led geospatial technology program. Journal of Map & Geography Libraries, 12(3), 295–317. https://doi.org/10.1080/15420353.2016.1224795
McQuiggan, M., & Megra, M. (2017). Parent and family involvement in education: Results from the national household education surveys program of 2016. (No. NCES 2017-102). National Center for Education Statistics. Retrieved from https://eric.ed.gov/?id=ED575972
Mishler, T. (2013). Hands-on homeschool programs. Florida Libraries, 56(1), 7–8.
Neuman, A., & Guterman, O. (2017). What are we educating towards? Socialization, acculturization, and individualization as reflected in home education. Educational Studies, 43(3), 265–281. https://doi.org/10.1080/03055698.2016.1273763
Paradise, C. (2008). Our homeschool alliance is a winner. Public Libraries, 47(5), 21–22.
Ray, B. D. (2018, January 13). Research facts on homeschooling. Retrieved May 16, 2018, from https://www.nheri.org/research-facts-on-homeschooling/
Redford, J., Battle, D., & Bielick, S. (2016). Homeschooling in the United States: 2012. National Center for Education Statistics. Retrieved from https://eric.ed.gov/?id=ED569947
Rossini, B., Burnham, J., & Wright, A. (2013). The librarian’s role in an enrichment program for high school students interested in the health professions. Medical Reference Services Quarterly, 32(1), 73–83. https://doi.org/10.1080/02763869.2013.749136
Scalfani, V. F., & Shedd, L. C. (2015). Recruiting students to campus. College & Research Libraries News, 76(2), 76–91.
Shinn, L. (2008). A home away from home: Libraries & homeschoolers. School Library Journal, 54(8), 38–42.
Turner, M. L. (2016). Getting to know homeschoolers. Journal of College Admission, 233, 40–43.
Tvaruzka, K. (2009). Warning: children in the library! Welcoming children and families into the academic library. Education Libraries, 32(2), 21–26.
Thomas Nelson Page’s Washington and its Romance was a popular history of Washington, DC prior to its becoming the nation’s capital in 1800. It was one of Page’s last works to be published. He died in 1922, and the book came out in 1923, with illustrations by Walter and Emily Shaw Reese. Copyright to the book was renewed in 1950, and it will enter the public domain 21 days from now.
The book may be of more interest today as a period piece than as a historical reference. As its title suggests, the tone of the book is more sentimental than scholarly. And those sentiments tended to be partial. As a summary of his work at the University of Alabama notes, Page did much to promote a “moonlight and magnolias” view of the Old South that romanticized the perspective of white Southerners (such as Page, who was born on a slave plantation in Virginia), and largely excused slavery and lynching.
But Page’s perspective was quite popular in 1923, a year notable for the rise of the Ku Klux Klan, which had been revived after the success of D. W. Griffith’s 1915 movie The Birth of a Nation, and which would peak in influence later in the 1920s. Page was a popular and prominent author, even serving for six years as ambassador to Italy under Woodrow Wilson.
Washington and its Romance was reviewed generally positively in a 1924 issue of the Catholic Historical Review, a journal published by Washington’s Catholic University of America. (The reviewer takes particular interest in the book’s account of the founding of Georgetown College, a Jesuit institution.) The copy of the review I’m linking to is a copy in JSTOR, which I can read thanks to my institution’s subscription, but which may restrict access for readers who don’t have a JSTOR subscription. HathiTrust also has a scan of this issue, but their copy is also only available to search, and not to read, since it’s after the 1922 “bright line” where publications are known to be public domain without requiring any special research.
As I noted in my talk yesterday, though, many American scholarly journals did not renew copyrights. That includes The Catholic Historical Review. The only active renewal I have found associated with it (before the automatic renewals that apply to copyrights after 1963) is for a single article in the January 1961 issue. The review I cite above, and decades of later articles from this journal, are already in the US public domain.
I’ve created a copyright information page for this journal, and as I noted in my talk yesterday, I’m happy to do the same for other journals published in the mid-20th century that people tell me they’re interested in, if they aren’t already represented in our copyright renewals inventory. (You can use this form to suggest titles; if you say in the “anything else we should know?” blank that you’re interested in a copyright information page, that would be helpful.) Also, if folks want to add more information about journals already in the inventory, I’m happy to show them how to research periodical copyrights and create or add to the JSON files for them that my knowledge base uses.
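For a sense of what a per-journal record in such an inventory could contain, here is a minimal sketch rendered in Python. Every field name below is my own invention for illustration; it is not the actual schema the knowledge base uses.

```python
import json

# Illustrative record for a mid-20th-century journal's renewal status.
# Field names are hypothetical, not the knowledge base's real schema.
record = {
    "title": "The Catholic Historical Review",
    "renewal_exceptions": [
        # The one pre-1964 renewal found: a single article in one issue.
        {"issue": "1961-01", "scope": "single article"}
    ],
    "notes": "No blanket issue renewals found before automatic renewal applies.",
}

def to_json(rec):
    """Serialize a journal record for inclusion in the inventory."""
    return json.dumps(rec, indent=2, sort_keys=True)

print(to_json(record))
```

A record like this, kept in version control, lets others check the research behind a public-domain determination and add corrections over time.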
I hope to make the inventory, and its accompanying decision guide, trusted enough that people can rely on them to open access to content like the review I cite above. I’ll consider it a notable early success if we see a copy of this review, and the rest of its issue, become openly available before all of 1924 enters the public domain in 2020.
Marina Georgieva presented a poster at Digital Preservation 2018. Please read on for a closer look at her work, one of the many great offerings from this year’s event. For more posters, please visit https://osf.io/view/ndsa2018/.
Marina holds a Master’s degree in Library Science with an Information Technology concentration from the University of Wisconsin – Milwaukee. She is currently a Visiting Digital Collections Librarian at the University of Nevada – Las Vegas. Her passion is large-scale digitization with cutting-edge technologies. Her research interests include project management in large-scale digitization and approaches for achieving higher digitization efficiency, such as staffing and training and the development of workflows, procedures, and guidelines. Marina is also involved in metadata and authority work as well as metadata remediation projects.
The digital librarian: the liaison between digital collections and digital preservation
At UNLV Libraries, the role of the Digital Collections Librarian goes beyond the traditional routine tasks of digitization, metadata management, project management, workflow development and team management. Digital Collections Librarians serve as links between digitization and digital preservation and do everything in between to draft sustainable digital preservation workflows alongside their colleagues in the Special Collections Technical Services Department. Technical Services Librarians are responsible for the preservation of born-digital archival materials, whereas the Digital Collections Librarians’ roles entail being information architects directly engaged in the process of preparing master files of in-house and outsourced reformatted materials for digital preservation.
In recent years, the UNLV Libraries Digital Collections Department has completed numerous large-scale digitization projects that yielded hundreds of thousands of new archival digital objects that require long-term preservation. Currently all these archival files are stored on a server referred to as ‘the Digital Vault’.
One of the invisible, often overlooked, yet very important roles of the Digital Librarian is to verify that all images from completed digitization projects are properly organized in meaningful, easy-to-navigate directories and that all files are in the appropriate file format. It is common practice for folder directories (created and organized during the actual process of digitization) to remain intact and be moved to the Digital Vault for long-term storage in their original order. There they get merged into the existing collection-appropriate folders or, if necessary, a new folder is created.
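As a rough illustration of this kind of verification, a short script could walk a completed project folder and flag anything that is not in the expected archival format. This is a sketch under stated assumptions – TIFF as the sole approved format, and a file-extension check standing in for true format identification, which in practice would use a characterization tool:

```python
from pathlib import Path

# Assumption for this sketch: TIFF is the approved archival format.
APPROVED = {".tif", ".tiff"}

def flag_unexpected_files(root):
    """Return paths under `root` whose extension is not an approved archival format."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() not in APPROVED
    )
```

Run against a completed project directory before it moves to the Digital Vault, an empty result suggests the data set is format-consistent; anything flagged gets reviewed by the librarian.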
Additionally, UNLV Digital Collections has thousands of images from legacy collections stored in the Digital Vault. All of these digital objects live on the Digital Collections website, but some of the archival master folders have redundant data; others are saved in inappropriate file formats, and still others have non-normalized file naming. In recent years, there has been an effort to clean up and restructure these legacy folders in order to make the archival files easily discoverable and to optimize the storage space before the content of the Digital Vault gets migrated to a new, more robust system (UNLV Special Collections and Archives is currently building an instance of Islandora CLAW that will back up files in Amazon Glacier).
The role of the UNLV Libraries Digital Librarian as it relates directly to digital preservation is outlined in the poster presented at the 2018 NDSA DigiPres Forum. Here we will briefly touch upon a few of the major responsibilities:
File naming conventions
For current digitization projects, file naming has been normalized and it happens in a structured and logical way depending on the type of collection being digitized. During the process of preparing collections for digitization, the librarian analyzes the content, makes decisions regarding the grouping of the digital objects and assigns collection-level and item-level digital identifiers. To achieve consistency and logical arrangement, the digital librarian maintains and updates spreadsheets with assigned and available digital identifiers.
For example, if the collection consists of archival photographic materials, the assigned digital collection alias will be ‘PHO’ with sequential numeric identifiers. These identifiers logically follow the structure and numbering of all other previously digitized photo collections.
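A minimal sketch of this kind of sequential assignment might look like the following. The ‘PHO’ alias comes from the example above, but the identifier pattern and the zero-padding width are assumptions made for illustration:

```python
# Hypothetical identifier scheme: <ALIAS>-<zero-padded sequence number>.
def next_identifier(alias, assigned, width=4):
    """Return the next sequential digital identifier for a collection alias."""
    numbers = [
        int(ident.split("-")[1])
        for ident in assigned
        if ident.startswith(alias + "-")
    ]
    return f"{alias}-{(max(numbers) + 1 if numbers else 1):0{width}d}"

# e.g. with 'PHO-0001' and 'PHO-0002' already recorded in the spreadsheet:
print(next_identifier("PHO", ["PHO-0001", "PHO-0002"]))  # PHO-0003
```

In practice the librarian maintains this sequence in a shared spreadsheet, but the same logic applies: find the highest number already assigned for the alias and issue the next one.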
As mentioned earlier, most of the newly digitized collections remain in the original directory structure that was developed during the scanning process. The digital librarian ensures that the file naming on directory level and on file level is accurate and the data set is ready to be moved to the Digital Vault.
It is important to mention that digital librarians often need to manage identifiers beyond those that identify the archival structure (collection, folder) and those that identify the intellectual unit (item), so that they can accurately reflect the structure of materials. They also need to create a third type for the multiple image files that comprise a single digital object; for example, the back and front of a printed item, or multiple items on a page in a scrapbook.
Legacy collections bring more challenges and sometimes need cleanup, as their file naming may be inconsistent. Depending on the project, the digital librarian may decide to keep the file structure intact or to rearrange the folders in a more normalized way that follows current preservation practices.
Decisions on archival file formats
UNLV Libraries Digital Collections has chosen the TIFF file format for long-term preservation of archival master files. TIFF is the preferred format for in-house digitized reflective materials and transparencies.
The file format for digitized periodicals may vary depending on the project. In-house digitized periodicals and newspaper clippings are preserved in TIFF just as photographs and films are, while periodicals digitized as part of the National Digital Newspaper Program are stored in the original Library of Congress–approved data sets. These data sets include newspaper pages in JP2, PDF, and TIFF formats along with the accompanying metadata encoded in the METS/ALTO XML schema.
Legacy collections may contain files in JPG format. This usually applies to collections accessioned as already digitized materials. They usually remain in this format because UNLV Libraries Special Collections does not hold the original materials, and it is therefore impossible to re-digitize the items in the proper archival format.
Building directories in the Digital Vault
Current digitization and digital preservation efforts follow well-established practices regarding how files are nested in directories so that they have logical structure and are easy to navigate.
For example, the archival master files of a digitized photographic collection get migrated to the general folder that holds the archival files of all photo collections. This directory contains a blend of sub-folders that represent compound objects and files that represent single objects. It is nested in a higher-level Photo Collections folder.
To illustrate the scenario above, please examine the following example.
[…] Digital Vault\PHO Photo Collections
> PHO Archival Images
Legacy collections usually get reorganized, especially if the folder structure is not logical or there are redundant files and folders.
Outsourced periodicals are kept in the original directory structure as created by the vendor. Data sets arrive separately and each of them is considered a batch and is stored separately in a parent-level folder hosting only outsourced periodicals.
[…] Digital Vault\NDNP Local Backup
> Batch Aurora
– Original data set as received by vendor
> Batch Beatty
– Original data set as received by vendor
> Batch Caliente
– Original data set as received by vendor
Communication with the IT Department
The Digital Vault is a directory with limited access – librarians have “view only” mode and must communicate all data migration needs, remediation requests, and new decisions about folder structure to the IT department, which maintains the Digital Vault.
The role of the digital librarian in this communication is critical because of the limited access to the server. Usually the communication is informal: requests or updates are simply emailed along with instructions on what needs to be accomplished. Recently, for some larger and more complicated clean-up projects, the UNLV digital librarians adopted Google Sheets. The advantage of Google Sheets is that more than one person can access and edit the document simultaneously to communicate changes and project updates.
The archival files prepared for migration are stored in a temporary location and once the move is complete, the digital librarian checks if all files and directories were moved successfully, and if there are any corrupted files or other discrepancies that need attention. Upon verification, the files in the temporary location are deleted permanently.
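A verification pass like the one described can be sketched as a checksum comparison between the temporary location and the destination (a simplified illustration; the actual UNLV workflow and tooling are not specified in this post):

```python
import hashlib
from pathlib import Path

def checksum(path, algo="md5"):
    """Return the hex digest of a file, read in chunks to handle large TIFFs."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_migration(source_dir, dest_dir):
    """Compare every file under source_dir against its copy under dest_dir.
    Returns a list of relative paths that are missing or whose contents differ."""
    problems = []
    for src in Path(source_dir).rglob("*"):
        if src.is_file():
            rel = src.relative_to(source_dir)
            dst = Path(dest_dir) / rel
            if not dst.exists() or checksum(src) != checksum(dst):
                problems.append(str(rel))
    return problems
```

Only after a pass like this returns an empty list would the files in the temporary location be deleted.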
In remediation scenarios, when legacy content needs to be cleaned up (deleted, moved elsewhere, or restructured), the communication usually includes initial written instructions followed by a one-on-one meeting. The meeting walks through all the necessary changes and serves as a final review of the project, since these actions are irreversible.
The day-to-day job of digital librarians has a slightly different focus than digital preservation, and yet digital librarians play a valuable role in building structure, organizing, and cleaning up data. Digital librarians are very user-focused – not just on internal end-users of the archival masters, but on library users who may need delivery of master images via a library system (online) or directly (via image reproduction). Their outstanding organizational skills and attention to detail not only make the data easily discoverable and ready for migration, but also optimize the storage space and lay the groundwork for a smooth migration to a new, more robust system. In institutions just taking their first steps in digital preservation, the digital librarian plays a key role in advancing a step closer to a robust and efficient long-term digital preservation strategy.
Following the pattern we established with the first two Islandoracons, we're going to have one day during the main conference that's comprised of 90-minute workshops running in multiple tracks. Also following the first two, we're looking to you, the Islandora community and potential Islandoracon attendees, to tell us what kinds of workshops you want to see.
We have a table here where you can add your suggestions. A few weeks into 2019, we'll take these suggestions and turn them into a survey so the community can vote on some favorites. The Islandoracon Planning Committee will then do their level best to find the best possible instructors for the topics you choose.
Please give us a topic, a rough description of what you think the workshop would cover, and if you can, the level it should aim for and the platform (Islandora 7.x or Islandora CLAW) that you think should be covered.
Natural language search is the insane idea that maybe we can talk to computers in the same way we talk to people. Absolutely nuts, I know.
With the increasing popularity of virtual assistants like Siri and Alexa, and devices like Google Home and Apple’s HomePod, natural language search is ready for prime time in the devices in our homes, our offices, and in our pockets.
Alexa, Siri, and Google Home are Search Apps
All these devices and virtual assistants making their way into our homes and hearts have search technology at their core. Any time you query a system or database or application and the system has to decide which results to display – or say – it’s a search application. OpenTable, Tinder, and Google Maps are all search-based applications. Search technology is at the core of nearly every popular software application you use today at work, at home, at play, at your desk, or on your smartphone.
But how you interact with these systems is changing.
The Annoyance of Search
In the old days, if you were searching a database or set of files or documents for a particular word or phrase, you’d have to learn a whole arcane set of commands and operators.
Boolean in a Nutshell
You’d have to know Boolean operators and other logic so you could search this table WHERE this word AND that word OR that other word appear but NOT this other word and then SORT by this field. (Got it?)
Each system had its own idiosyncrasies that only experienced users would know and you might have to run queries and reports multiple times to make sure you got the right results back – or to make sure you got all the results possible. You’d have to know the structure of the database or data set you’re querying and which fields to look at.
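The AND/OR/NOT logic users had to master can be sketched in a few lines (a contrived illustration — the documents and terms below are invented):

```python
# Invented miniature document set for illustrating old-style Boolean filtering.
docs = [
    "annual report revenue growth",
    "quarterly revenue decline",
    "annual revenue and profit growth",
]

def matches(doc, all_of=(), any_of=(), none_of=()):
    """A document must contain every AND term, at least one OR term,
    and no NOT term -- the operator logic users had to learn."""
    words = set(doc.split())
    return (all(t in words for t in all_of)
            and (not any_of or any(t in words for t in any_of))
            and not any(t in words for t in none_of))

# "revenue AND (annual OR quarterly) NOT decline"
hits = [d for d in docs if matches(d, all_of=("revenue",),
                                   any_of=("annual", "quarterly"),
                                   none_of=("decline",))]
```

Get one operator or one field name wrong and you silently get the wrong results back, which is exactly the annoyance natural language search removes.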
These requirements put up barriers to entry for people trying to find information to do their jobs at work or to do research at a library. You’d have to ask a specialist who knew the ins and outs of each system, wait for them to run the report or query for you and print out the results (and hope the results answered the question you originally had).
Evolution of Natural Language Search
You couldn’t just say out loud what you wanted to know and then have it delivered to you instantly.
But what if you could? That’s natural language search.
“What was last year’s recognized revenue from the AIPAC region?”
“What restaurant was it Mary mentioned last week in a text message?”
“Who hosted the highest rated Oscars telecast ever and what year was it?”
And the system takes that request, whether spoken or typed into a box, takes it apart, figures out what you’re looking for, what you aren’t, where you’re searching, and what to include, and turns it into a query it can submit to a database or search system in order to return the results right back to you.
The Technology Behind Natural Language Search
On the back end, there are several bits of technology at play.
Let’s say you ask your favorite nearby listening smart device:
What band is Joe Perry in?
First the device wakes up and records an audio file. The audio file gets sent across the internet and is received by the search system.
The audio file is processed by a speech-to-text API that filters out background noise, analyzes it to find the various phonemes, matches them up to words, and converts the spoken words into a plain English sentence.
This query gets examined by the search system, which notices a two-word proper name in the sentence: Joe Perry. Picking out people, places, and things from a data set, collection of files, or body of text is called named entity recognition, and it is a standard feature of most search applications.
So the system knows that the words Joe Perry refers to a person. But there might be several notable Joe Perrys in the database so the system has to resolve these ambiguities.
There’s Joe Perry the NFL football player, Joe Perry the snooker champion (totally not kidding), Joe Perry the Maine politician, and Joe Perry the popular musician. The word band in the query alerts the system that we’re probably looking for careers associated with a band, like composers, musicians, or singers. That’s how it disambiguates which Joe Perry we’re looking for.
A database of semantic information about musicians might have information about Perry’s songs, career, and yes the bands he’s been a part of during his career.
The system takes apart the sentence, sees the user is asking for a band associated with Joe Perry. It looks at a database of musicians, performers, songs, albums, and bands. It sees that Joe Perry is semantically associated with several bands but the main one is Aerosmith.
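The lookup and disambiguation steps above can be sketched as a toy (the entity table and career hints below are invented for illustration; real systems use trained NER models and large knowledge graphs):

```python
# Invented miniature knowledge base: several Joe Perrys, each with a career.
ENTITIES = {
    "Joe Perry (musician)": {"career": "musician", "band": "Aerosmith"},
    "Joe Perry (snooker)":  {"career": "snooker player"},
    "Joe Perry (football)": {"career": "NFL player"},
}

# Invented hints mapping context words in the query to careers.
CAREER_HINTS = {"band": "musician", "album": "musician",
                "snooker": "snooker player", "touchdown": "NFL player"}

def answer(query):
    words = query.lower().rstrip("?").split()
    # "Named entity recognition": spot the known name in the query.
    hits = [k for k in ENTITIES if "joe" in words and "perry" in words]
    # Disambiguate using context words such as "band".
    hints = {CAREER_HINTS[w] for w in words if w in CAREER_HINTS}
    for name in hits:
        if ENTITIES[name]["career"] in hints:
            return f"Joe Perry is in the band {ENTITIES[name]['band']}."
    return "Sorry, I couldn't work out which Joe Perry you mean."

print(answer("What band is Joe Perry in?"))
```

The word "band" is what steers the toy toward the musician rather than the snooker champion, which is exactly the disambiguation described above.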
The query has been spoken, converted to text, turned into a query, sent to the system and it comes back with the answer that Joe Perry is in a band called Aerosmith along with related metadata that he’s been in Aerosmith since it formed in 1970. The system puts this answer together into a sentence.
It ships that sentence off to a text-to-speech API, which pieces the words together into audio that sounds like a human being, then sends that audio file back to the device, which answers:
Joe Perry has been in the band Aerosmith since 1970.
The question is asked just like you’d ask a human being — and answered in exactly the same way (and correctly).
Natural language search reduces the barriers to information and access to enhance our lives during work or play or when trying to settle a bar bet over a piece of pop culture. When users can talk to devices just like they talk to their friends, more people can get more value out of the applications and services we build.
Top image from 1986’s Star Trek IV: The Voyage Home where Scotty tries to talk to a 1980s computer through its mouse (video).
First, a little more about the Interest Group. The group includes more than 80 individuals, representing nearly 50 Research Library Partnership member institutions in nine countries. Participants are distributed across a range of RDM roles, and represent both strategic and practitioner perspectives. The Interest Group is an opportunity for participants to interact with OCLC Research staff and each other, sharing experiences about RDM services and practices, and pooling knowledge about the current state – and future evolution – of RDM. Interest Group discussions are catalyzed by the topics covered in the accompanying webinar series, but are free-ranging and flexible to accommodate participants’ interests.
Like our Interest Group discussions following the first webinar, our latest discussions drew participants from North America, Europe, and the Asia-Pacific region. The starting point for our discussion was a model from the penultimate report in the Realities of RDM report series, in which we depicted four categories of incentives driving universities to acquire RDM capacity.
We discussed how these incentives operated in different university contexts. A number of participants indicated that compliance with mandates from funders, government agencies, and national directives was a strong initial driver in incentivizing the acquisition of RDM capacity at their institutions; however, some also noted that more recently, meeting publisher requirements for data availability as a condition of publication has been a key source of researcher requests for RDM support. Participants agreed that tracking the increasingly complex web of mandates and compliance requirements from funders, the public sector, and now publishers is challenging. While the job of monitoring the appearance and ongoing evolution of data mandates from a variety of sources often falls to the university library, staff also engage with other campus units to keep abreast of the latest developments, including the Research Office and even individual academic departments. Several participants noted that their institutions track mandates through DMPTool, an open-source online tool supporting the creation of data management plans.
Institutional interest in data and data management can be a strong driver for acquiring RDM capacity. We discussed the ways institutional strategy in the RDM space was articulated in university data policies. The responses varied considerably, with many participants indicating that their university had an institutional data policy in place, while others either had a draft policy under consideration or none at all. For participants whose university did have a data policy, feelings were mixed in regard to its effectiveness. For example, one participant indicated that the data policy helped align the university’s stance on data management with that of research funders, and served notice that the university had certain expectations of its researchers in regard to data management. Similarly, another participant observed that a data policy can clarify data management expectations, as well as signal institutional interest and commitment in this area. But one participant noted that although their university had a data policy, there was little in the way of enforcement built into it; in practice, it took the form more of a set of suggestions for good data management rather than a statement of requirements. It was additionally noted that university data policies in Europe were perhaps more advanced than elsewhere, due to more extensive national reporting requirements for research outputs.
Much of our discussion focused on the class of incentives represented in the figure by researcher demand for RDM services, and in particular, what universities were doing to engage with researchers around data management needs and requirements. Several participants noted that engagement with researchers often begins by providing support for creating data management plans, while others mentioned the provision of data storage capacity as another good way to gain entry into the researcher’s data management workflow. An interesting strand of discussion took place around the idea that much of the current RDM support being offered tends to cluster around the beginning and end of the data management process – that is, with data management plans at the beginning of the research process, and then the storage and sharing of final data once the project is complete. Looking ahead, a key question for many RDM practitioners is how to extend RDM services into the “valley” between these endpoints.
We had a spirited conversation around the problem of incentivizing researchers to engage in good data management practices. For example, several participants noted that while researchers often embraced the provision of data storage capacity, it was much more difficult to motivate them to provide adequate documentation for the data sets stored there, or to indicate the period of retention needed. Helping researchers see that data management is a worthy investment of their time is crucial: for example, at one institution, staff try to engage researcher interest by providing dedicated trainers to help researchers use REDCap (an electronic data capture system used extensively in academia for collecting clinical data), and show them how the system improves their data management practices. It was generally agreed that a key obstacle to motivating researchers to optimize their data for re-use – such as providing adequate documentation and metadata to support understanding and discovery of the data – is the lack of good data re-use stories that illustrate the tangible benefits of observing good data practices.
In addition to these topic areas, participants offered many wide-ranging comments about RDM services. For example, we learned that one university initiated a pilot project with departments in the life sciences in which RDM librarians sat in on project review boards and offered feedback on data management plans. Another participant noted the potential for drawing lessons on re-use from the sharing of software in academic circles, where there is a prevailing ethos that encourages code to be shared and built upon through services like GitHub. As we have seen with software, establishing a “culture of sharing” is very important in encouraging data re-use.
Several participants noted that while scarcity of resources can sometimes put interactions with other campus units on a competitive footing, it can also facilitate collaboration based on a mutual need to stretch limited resources for RDM service provision. The intersection between research ethics, privacy, and data management was mentioned by several participants as a growing concern, with researchers sometimes required to address ethical issues around data storage and sharing in their data management plans, or as part of a project’s ethics review. The European General Data Protection Regulation (GDPR), which went into effect in 2018, has amplified the need to address privacy issues in data storage and re-use. And finally, participants noted again and again how important it was to engage closely with researchers to understand how data management aligned with their daily workflows – as one participant noted, it is difficult to understand researcher needs without getting into their office!
These are just some of the highlights from a wide-ranging and informative discussion with RDM professionals within the Research Library Partnership. We thank all who participated in the discussion!
I gave a talk at the Fall CNI meeting entitled Blockchain: What's Not To Like? The abstract was:
We're in a period when blockchain or "Distributed Ledger Technology" is the Solution to Everything™, so it is inevitable that it will be proposed as the solution to the problems of academic communication and digital preservation. These proposals typically assume, despite the evidence, that real-world blockchain implementations actually deliver the theoretical attributes of decentralization, immutability, anonymity, security, scalability, sustainability, lack of trust, etc. The proposers appear to believe that Satoshi Nakamoto revealed the infallible Bitcoin protocol to the world on golden tablets; they typically don't appreciate or cite the nearly three decades of research and implementation that led up to it. This talk will discuss the mis-match between theory and practice in blockchain technology, and how it applies to various proposed applications of interest to the CNI audience.
Below the fold, an edited text of the talk with links to the sources, and much additional material. The colored boxes contain quotations that were on the slides but weren't spoken.
It’s one of these things that if people say it often enough it starts to sound like something that could work,
I'd like to start by thanking Cliff Lynch for inviting me back even though I'm retired, and for letting me debug the talk at Berkeley's Information Access Seminar. I plan to talk for 20 minutes, leaving plenty of time for questions. A lot of information will be coming at you fast. Afterwards, I encourage you to consult the whole text of the talk and much additional material on my blog. Follow the links to the sources to get the details you may have missed.
The first comes from commercial interests where management of rights, IP and ownership is complex, hard to do, and has led to unusable systems that are driving researchers to sites like SciHub, scaring the bejesus out of publishers in the process.
The other trend is for a desire to move to a decentralised web and a decentralised system of validation and reward, in a way trying to move even further away from the control of publishers.
It is absolutely fascinating to me that two diametrically opposite philosophical sides are converging on the same technology as the answer to their problems. Could this technology perhaps be just holding up an unproven and untrustworthy mirror to our desires, rather than providing any real viable solutions?
This is not to diminish Nakamoto's achievement but to point out that he stood on the shoulders of giants. Indeed, by tracing the origins of the ideas in bitcoin, we can zero in on Nakamoto's true leap of insight—the specific, complex way in which the underlying components are put together.
More than fifteen years ago, nearly five years before Satoshi Nakamoto published the Bitcoin protocol, a cryptocurrency based on a decentralized consensus mechanism using proof-of-work, my co-authors and I won a "best paper" award at the prestigious SOSP workshop for a decentralized consensus mechanism using proof-of-work. It is the protocol underlying the LOCKSS system. The originality of our work didn't lie in decentralization, distributed consensus, or proof-of-work. All of these were part of the nearly three decades of research and implementation leading up to the Bitcoin protocol, as described by Arvind Narayanan and Jeremy Clark in Bitcoin's Academic Pedigree. Our work was original only in its application of these techniques to statistical fault tolerance; Nakamoto's only in its application of them to preventing double-spending in cryptocurrencies.
We're going to walk through the design of a system to perform some function, say monetary transactions, storing files, recording reviewers' contributions to academic communication, verifying archival content, whatever. Being of a naturally suspicious turn of mind, you don't want to trust any single central entity, but instead want a decentralized system. You place your trust in the consensus of a large number of entities, which will in effect vote on the state transitions of your system (the transactions, reviews, archival content, ...). You hope the good entities will out-vote the bad entities. In the jargon, the system is trustless (a misnomer).
Techniques using multiple voters to maintain the state of a system in the presence of unreliable and malign voters were first published in The Byzantine Generals Problem by Lamport et al in 1982. Alas, Byzantine Fault Tolerance (BFT) requires a central authority to authorize entities to take part. In the blockchain jargon, it is permissioned. You would rather let anyone interested take part, a permissionless system with no central control.
In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently.
The security of your permissionless system depends upon the assumption of uncoordinated choice, the idea that each voter acts independently upon its own view of the system's state.
If anyone can take part, your system is vulnerable to Sybil attacks, in which an attacker creates many apparently independent voters who are actually under his sole control. If creating and maintaining a voter is free, anyone can win any vote they choose simply by creating enough Sybil voters.
From a computer security perspective, the key thing to note ... is that the security of the blockchain is linear in the amount of expenditure on mining power, ... In contrast, in many other contexts investments in computer security yield convex returns (e.g., traditional uses of cryptography) ... analogously to how a lock on a door increases the security of a house by more than the cost of the lock.
So creating and maintaining a voter has to be expensive. Permissionless systems can defend against Sybil attacks by requiring a vote to be accompanied by proof of the expenditure of some resource. This is where proof-of-work comes in, a concept originated by Cynthia Dwork and Moni Naor in 1992. To vote in a proof-of-work blockchain such as Bitcoin's or Ethereum's requires computing very many otherwise useless hashes. The idea is that the good voters will spend more, compute more useless hashes, than the bad voters.
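Proof-of-work can be sketched in a few lines: search for a nonce whose hash has a given number of leading zero bits. The difficulty here is trivially low for illustration; Bitcoin's is astronomically higher:

```python
import hashlib

def proof_of_work(block_data: bytes, difficulty_bits: int = 16):
    """Search for a nonce such that sha256(block_data + nonce) has
    `difficulty_bits` leading zero bits -- the 'useless hashes' a miner
    must compute to cast a vote."""
    target = 1 << (256 - difficulty_bits)
    nonce = 0
    while True:
        digest = hashlib.sha256(block_data + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, digest.hex()
        nonce += 1

nonce, digest = proof_of_work(b"example block")
# Verification is cheap: anyone can re-hash once to check the work,
# but finding the nonce took ~2**16 hash attempts on average.
assert digest.startswith("0000")  # 16 zero bits = 4 leading hex zeros
```

The asymmetry — expensive to produce, cheap to verify — is what makes a vote costly without requiring a central authority to count the expense.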
much of the innovation in blockchain technology has been aimed at wresting power from centralised authorities or monopolies. Unfortunately, the blockchain community’s utopian vision of a decentralised world is not without substantial costs. In recent research, we point out a ‘blockchain trilemma’ – it is impossible for any ledger to fully satisfy the three properties shown in [the diagram] simultaneously ... In particular, decentralisation has three main costs: waste of resources, scalability problems, and network externality inefficiencies.
Brunnermeir and Abadi's Blockchain Trilemma shows that a blockchain has to choose at most two of the following three attributes:
Obviously, your system needs the first two, so the third has to go. Running a voter (mining in the jargon) in your system has to be expensive if the system is to be secure. No-one will do it unless they are rewarded. They can't be rewarded in "fiat currency", because that would need some central mechanism for paying them. So the reward has to come in the form of coins generated by the system itself, a cryptocurrency. To scale, permissionless systems need to be based on a cryptocurrency; the system's state transitions will need to include cryptocurrency transactions in addition to records of files, reviews, archival content, whatever.
Your system needs names for the parties to these transactions. There is no central authority handing out names, so the parties need to name themselves. As proposed by David Chaum in 1981 they can do so by generating a public-private key pair, and using the public key as the name for the source or sink of each transaction.
we created a small Bitcoin wallet, placed it on images in our honeyfarm, and set up monitoring routines to check for theft. Two months later our monitor program triggered when someone stole our coins.
This was not because our Bitcoin was stolen from a honeypot; rather, the graduate student who created the wallet kept a copy, and his account was compromised. If security experts can't safely keep cryptocurrencies on an Internet-connected computer, nobody can. If Bitcoin is the "Internet of money," what does it say that it cannot be safely stored on an Internet-connected computer?
In practice this is implemented in wallet software, which stores one or more key pairs for use in transactions. The public half of the pair is a pseudonym. Unmasking the person behind the pseudonym turns out to be fairly easy in practice.
The security of the system depends upon the user and the software keeping the private key secret. This can be difficult, as Nicholas Weaver's computer security group at Berkeley discovered when their wallet was compromised and their Bitcoins were stolen.
The capital and operational costs of running a miner include buying hardware, power, network bandwidth, staff time, etc. Bitcoin's volatile "price", high transaction fees, low transaction throughput, and large proportion of failed transactions mean that almost no legal merchants accept payment in Bitcoin or other cryptocurrency. Thus one essential part of your system is one or more exchanges, at which the miners can sell their cryptocurrency rewards for the "fiat currency" they need to pay their bills.
Who is on the other side of those trades? The answer has to be speculators, betting that the "price" of the cryptocurrency will increase. Thus a second essential part of your system is a general belief in the inevitable rise in "price" of the coins by which the miners are rewarded. If miners believe that the "price" will go down, they will sell their rewards immediately, a self-fulfilling prophecy. Permissionless blockchains require an inflow of speculative funds at an average rate greater than the current rate of mining rewards if the "price" is not to collapse. To maintain Bitcoin's price at $4K requires an inflow of $300K/hour.
In order to spend enough to be secure, say $300K/hour, you need a lot of miners. It turns out that a third essential part of your system is a small number of “mining pools”. Bitcoin has the equivalent of around 3M Antminer S9s, and a block time of 10 minutes. Each S9, costing maybe $1K, can expect a reward about once every 60 years. It will be obsolete in about a year, so only 1 in 60 will ever earn anything.
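The arithmetic behind the $300K/hour and the once-every-60-years figures is straightforward, using the talk's round numbers (12.5 BTC block reward, $4K/BTC, a 10-minute block time, and ~3M S9-equivalents of hash power):

```python
# Round numbers from the talk, not live market data.
block_reward_btc = 12.5
price_usd = 4_000
blocks_per_hour = 6        # one block every 10 minutes

# New coins minted per hour, which speculators must absorb.
reward_per_hour = block_reward_btc * price_usd * blocks_per_hour
print(reward_per_hour)     # 300000.0 -> $300K/hour

# Each S9-equivalent wins a block with probability 1/miners per block.
miners = 3_000_000
blocks_per_year = blocks_per_hour * 24 * 365   # 52,560 blocks/year
years_between_wins = miners / blocks_per_year
print(round(years_between_wins))               # ~57 years, "about 60"
```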
To smooth out their income, miners join pools, contributing their mining power and receiving the corresponding fraction of the rewards earned by the pool. These pools have strong economies of scale, so successful cryptocurrencies end up with a majority of their mining power in 3-4 pools. Each of the big pools can expect a reward every hour or so. These blockchains aren’t decentralized, but centralized around a few large pools.
Since then there have been other catastrophic bugs in these smart contracts, the biggest one in the Parity Ethereum wallet software ... The first bug enabled the mass theft from "multisignature" wallets, which supposedly required multiple independent cryptographic signatures on transfers as a way to prevent theft. Fortunately, that bug caused limited damage because a good thief stole most of the money and then returned it to the victims. Yet, the good news was limited as a subsequent bug rendered all of the new multisignature wallets permanently inaccessible, effectively destroying some $150M in notional value. This buggy code was largely written by Gavin Wood, the creator of the Solidity programming language and one of the founders of Ethereum. Again, we have a situation where even an expert's efforts fell short.
Recent game-theoretic analysis suggests that there are strong economic limits to the security of cryptocurrency-based blockchains. For safety, the total value of transactions in a block needs to be less than the value of the block reward.
Your system needs an append-only data structure to which records of the transactions, files, reviews, archival content, whatever are appended. It would be bad if the miners could vote to re-write history, undoing these records. In the jargon, the system needs to be immutable (another misnomer).
The blockchain is mutable, it is just rather hard to mutate it without being detected, because of the Merkle tree’s hashes, and easy to recover, because there are Lots Of Copies Keeping Stuff Safe. But this is a double-edged sword. Immutability makes systems incompatible with the GDPR, and immutable systems to which anyone can post information will be suppressed by governments.
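The "hard to mutate without being detected" property comes from chaining hashes. A minimal sketch (real blockchains hash Merkle tree roots over many transactions, not a flat list of strings):

```python
import hashlib

def chain(blocks):
    """Hash each block together with the previous block's hash, so
    changing any block changes every hash after it."""
    hashes, prev = [], b""
    for block in blocks:
        prev = hashlib.sha256(prev + block).digest()
        hashes.append(prev)
    return hashes

ledger = [b"alice pays bob 5", b"bob pays carol 2"]
original = chain(ledger)

# Rewriting history changes the head hash, which the copies held by
# other nodes will not match -- detection and recovery, not prevention.
tampered = chain([b"alice pays bob 500", b"bob pays carol 2"])
assert tampered[-1] != original[-1]
```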
Cryptokitties’ popularity exploded in early December and had the Ethereum network gasping for air. ... Ethereum has historically made bold claims that it is able to handle unlimited decentralized applications ... The Crypto-Kittie app has shown itself to have the power to place all network processing into congestion. ... at its peak [CryptoKitties] likely only had about 14,000 daily users. Neopets, a game to which CryptoKitties is often compared, once had as many as 35 million users.
A user of your system wanting to perform a transaction, store a file, record a review, whatever, needs to persuade miners to include their transaction in a block. Miners are coin-operated; you need to pay them to do so. How much do you need to pay them? That question reveals another economic problem: fixed supply plus variable demand equals a variable "price". Each block is in effect a blind auction among the pending transactions.
So let's talk about CryptoKitties, a game that brought the Ethereum blockchain to its knees despite the bold claims that it could handle unlimited decentralized applications. How many users did it take to cripple the network? Far fewer than non-blockchain apps handle with ease; CryptoKitties peaked at about 14K users. NeoPets, a similar centralized game, peaked at about 2,500 times as many.
The first big smart contract, the DAO or Decentralized Autonomous Organization, sought to create a democratic mutual fund where investors could invest their Ethereum and then vote on possible investments. Approximately 10% of all Ethereum ended up in the DAO before someone discovered a reentrancy bug that enabled the attacker to effectively steal all the Ethereum. The only reason this bug and theft did not result in global losses is that Ethereum developers released a new version of the system that effectively undid the theft by altering the supposedly immutable blockchain.
The loot was restored by a "hard fork", the blockchain's version of mutability. Since then it has become the norm for "smart contract" authors to make them "upgradeable", so that bugs can be fixed. "Upgradeable" is another way of saying "immutable in name only".
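The DAO's reentrancy bug can be sketched in Python (a deliberately simplified model: the real contract was Solidity, and `send` here stands in for Ethereum's external call, which hands control to the recipient's code before the caller finishes):

```python
class VulnerableVault:
    """Pays out BEFORE zeroing the balance, like the DAO contract did."""
    def __init__(self, balances):
        self.balances = dict(balances)

    def withdraw(self, who, send):
        amount = self.balances[who]
        if amount > 0:
            send(amount)               # external call: recipient code runs here
            self.balances[who] = 0     # too late -- state updated after payout

vault = VulnerableVault({"attacker": 10, "others": 90})
stolen = []

def attacker_receive(amount):
    stolen.append(amount)
    if len(stolen) < 5:                # re-enter before the balance is zeroed
        vault.withdraw("attacker", attacker_receive)

vault.withdraw("attacker", attacker_receive)
print(sum(stolen))  # 50 drained from a 10-unit balance
```

Because the balance is only zeroed after the external call returns, each re-entrant call still sees the original balance, which is the shape of the bug that drained the DAO.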
Permissionless systems trust:
The core developers of the blockchain software not to write bugs.
The developers of your wallet software not to write bugs.
The developers of the exchanges not to write bugs.
The operators of the exchanges not to manipulate the markets or to commit fraud.
The developers of your upgradeable "smart contracts" not to write bugs.
The owners of the smart contracts to keep their secret key secret.
The owners of the upgradeable smart contracts to avoid losing their secret key.
The owners and operators of the dominant mining pools not to collude.
The speculators to provide the funds needed to keep the “price” going up.
Users' ability to keep their secret key secret.
Users’ ability to avoid losing their secret key.
Other users not to transact when you want to.
So, this is the list of people your permissionless system has to trust if it is going to work as advertised over the long term.
You started out to build a trustless, decentralized system but you have ended up with:
A trustless system that trusts a lot of people you have every reason not to trust.
A decentralized system that is centralized around a few large mining pools that you have no way of knowing aren’t conspiring together.
An immutable system that either has bugs you cannot fix, or is not immutable.
A system whose security depends on it being expensive to run, and which is thus dependent upon a continuing inflow of funds from speculators.
A system whose coins are convertible into large amounts of "fiat currency" via irreversible pseudonymous transactions, which is thus an irresistible target for crime.
If the “price” keeps going up, the temptation for your trust to be violated is considerable. If the "price" starts going down, the temptation to cheat to recover losses is even greater.
Maybe it is time for a re-think.
Suppose you give up on the idea that anyone can take part and accept that you have to trust a central authority to decide who can and who can’t vote. You will have a permissioned system.
The first thing that happens is that it is no longer possible to mount a Sybil attack, so there is no reason running a node need be expensive. You can use BFT to establish consensus, as IBM's Hyperledger, the canonical permissioned blockchain system, does. You need far fewer nodes in the network, and running a node just got much cheaper; overall, the aggregate cost of the system drops by orders of magnitude.
Now that there is a central authority, it can collect "fiat currency" for network services and use it to pay the nodes. No need for cryptocurrency, exchanges, pools, speculators, or wallets, so much less temptation for bad behavior.
Permissioned systems trust:
The central authority.
The software developers.
The owners and operators of the nodes.
The secrecy of a few private keys.
This is now the list of entities you trust. Trusting a central authority to determine the voter roll has eliminated the need to trust a whole lot of other entities. The permissioned system is more trustless and, since there is no need for pools, the network is more decentralized despite having fewer nodes.
[A] Byzantine quorum system of size 20 could achieve better decentralization than proof-of-work mining at a much lower resource cost.
How many nodes does your permissioned blockchain need? The rule for BFT is that 3f + 1 nodes can survive f simultaneous failures. That's an awful lot fewer than you need for a permissionless proof-of-work blockchain. What you get from BFT is a system that, unless it encounters more than f simultaneous failures, remains available and operating normally.
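The 3f + 1 rule makes the node count concrete. A minimal sketch (`bft_nodes` is just an illustrative helper, not from any BFT library):

```python
def bft_nodes(f):
    """Minimum replicas for BFT consensus tolerating f Byzantine failures."""
    return 3 * f + 1

# Tolerating even 3 simultaneous Byzantine failures needs only 10 nodes,
# versus the thousands of mining nodes a proof-of-work network runs.
print([bft_nodes(f) for f in (1, 2, 3)])  # [4, 7, 10]
```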
The problem with BFT is that if it encounters more than f simultaneous failures, the state of the system is irrecoverable. If you want a system that can be relied upon for the long term you need a way to recover from disaster. Successful permissionless blockchains have Lots Of Copies Keeping Stuff Safe, so recovering from a disaster that doesn't affect all of them is manageable.
So in addition to implementing BFT you need to back up the state of the system each block time, ideally to write-once media so that the attacker can't change it. But if you're going to have an immutable backup of the system's state, and you don't need continuous uptime, you can rely on the backup to recover from failures. In that case you can get away with, say, 2 replicas of the blockchain in conventional databases, saving even more money.
I've shown that, whatever consensus mechanism they use, permissionless blockchains are not sustainable for very fundamental economic reasons. These include the need for speculative inflows and mining pools, security that scales linearly with cost, economies of scale, and fixed supply versus variable demand. Proof-of-work blockchains are also environmentally unsustainable: the top 5 cryptocurrencies are estimated to use as much energy as The Netherlands. This isn't to take away from Nakamoto's ingenuity; proof-of-work is the only consensus system shown to work well for permissionless blockchains. The consensus mechanism works, but energy consumption and emergent behaviors at higher levels of the system make it unsustainable.
It can be very hard to find reliable sources about cryptocurrencies because almost all cryptocurrency journalism is bought and paid for.
When cryptocurrency issuers want positive coverage for their virtual coins, they buy it. Self-proclaimed social media personalities charge thousands of dollars for video reviews. Research houses accept payments in the cryptocurrencies they are analyzing. Rating “experts” will grade anything positively, for a price.
All this is common, according to more than two dozen people in the cryptocurrency market and documents reviewed by Reuters. ... “The main reason why so many inexperienced individuals invest in bad crypto projects is because they listen to advice from a so-called expert,” said Larry Cermak, head of analysis at cryptocurrency research and news website The Block. Cermak said he does not own any cryptocurrencies and has never promoted any. “They believe they can take this advice at face value even though it is often fraudulent, intentionally misleading or conflicted.”
The boxer Floyd Mayweather and the music producer DJ Khaled have been fined for unlawfully touting cryptocurrencies.
The two have agreed to pay a combined $767,500 in fines and penalties, the Securities and Exchange Commission (SEC) said in a statement on Thursday. They neither admitted nor denied the regulator’s charges.
According to the SEC, Mayweather and Khaled failed to disclose payments from three initial coin offerings (ICOs), in which new currencies are sold to investors.
The women on this boat are polished and perfect; the men, by contrast, seem strangely cured—not like medicine, but like meat. They are almost all white, between the ages of 30 and 50, and are trying very hard to have the good time they paid thousands for, while remaining professional in a scene where many thought leaders have murky pasts, a tendency to talk like YouTube conspiracy preachers, and/or the habit of appearing in magazines naked and covered in strawberries. That last is 73-year-old John McAfee, who got rich with the anti-virus software McAfee Security before jumping into cryptocurrencies. He is the man most of the acolytes here are keenest to get their picture taken with and is constantly surrounded by private security who do their best to aesthetically out-thug every Armani-suited Russian skinhead on deck. Occasionally he commandeers the grand piano in the guest lounge, and the young live-streamers clamor for the best shot. John McAfee has never been convicted of rape and murder, but—crucially—not in the same way that you or I have never been convicted of rape or murder.
On 7th December 2018 Bitcoin's "price" was around $3,700.
Bitcoin now at $16,600.00. Those of you in the old school who believe this is a bubble simply have not understood the new mathematics of the Blockchain, or you did not cared enough to try. Bubbles are mathematically impossible in this new paradigm. So are corrections and all else
Similarly, most of what you read about blockchain technology is people hyping their vaporware. A "trio of monitoring, evaluation, research, and learning (MERL) practitioners in international development" started out enthusiastic about the potential of blockchain technology, so they did some research:
We documented 43 blockchain use-cases through internet searches, most of which were described with glowing claims like “operational costs… reduced up to 90%,” or with the assurance of “accurate and secure data capture and storage.” We found a proliferation of press releases, white papers, and persuasively written articles. However, we found no documentation or evidence of the results blockchain was purported to have achieved in these claims. We also did not find lessons learned or practical insights, as are available for other technologies in development.
We fared no better when we reached out directly to several blockchain firms, via email, phone, and in person. Not one was willing to share data on program results, MERL processes, or adaptive management for potential scale-up. Despite all the hype about how blockchain will bring unheralded transparency to processes and operations in low-trust environments, the industry is itself opaque. From this, we determined the lack of evidence supporting value claims of blockchain in the international development space is a critical gap for potential adopters.
Every time the word "price" appears here, it has quotes around it. The reason is that there is a great deal of evidence that the exchanges, operating an unregulated market, are massively manipulating the exchange rate between cryptocurrencies and the US dollar. The primary mechanism is the issuance of billions of dollars of Tether, a cryptocurrency that is claimed to be backed one-for-one by actual US dollars in a bank account, and thus whose value should be stable. There has never been an audit to confirm this claim, and the trading patterns in Tether are highly suspicious. Tether, and its parent exchange Bitfinex, are the subject of investigations by the CFTC and federal prosecutors:
As Bitcoin plunges, the U.S. Justice Department is investigating whether last year’s epic rally was fueled in part by manipulation, with traders driving it up with Tether -- a popular but controversial digital token.
While federal prosecutors opened a broad criminal probe into cryptocurrencies months ago, they’ve recently homed in on suspicions that a tangled web involving Bitcoin, Tether and crypto exchange Bitfinex might have been used to illegally move prices, said three people familiar with the matter.
John Lewis is an economist at the Bank of England. His "The seven deadly paradoxes of cryptocurrency" provides a skeptical view of the economics of cryptocurrencies that nicely complements my more technology-centric view. My comments on his post are here. Remember that a permissionless blockchain requires a cryptocurrency; if the economics don't work, neither does the blockchain.
You can find my writings about blockchain over the past five years here. In particular:
The DAO was designed as a series of contracts that would raise funds for ethereum-based projects and disperse them based on the votes of members. An initial token offering was conducted, exchanging ethers for "DAO tokens" that would allow stakeholders to vote on proposals, including ones to grant funding to a particular project.
That token offering raised more than $150m worth of ether at then-current prices, distributing over 1bn DAO tokens.
[In June 2016], however, news broke that a flaw in The DAO's smart contract had been exploited, allowing the removal of more than 3m ethers.
Subsequent exploitations allowed for more funds to be removed, which ultimately triggered a 'white hat' effort by token-holders to secure the remaining funds. That, in turn, triggered reprisals from others seeking to exploit the same flaw.
An effort to blacklist certain addresses tied to The DAO attackers was also stymied mid-rollout after researchers identified a security vulnerability, thus forcing the hard fork option.
Exit scams are rife in the ICO world. Here is a recent example:
Blockchain company Pure Bit has seemingly walked off with $2.7 million worth of investors’ money after raising 13,000 Ethereum in an ICO. Transaction history shows that hours after moving all raised funds out of its wallet, the company proceeded to take down its website. It now returns a blank page. ... This is the latest in a string of exit scams that took place in the blockchain space in 2018. Indeed, reports suggested exit scammers have thieved more than $100 million worth of cryptocurrency over the last two years alone. Subsequent investigations hint the actual sum of stolen cryptocurrency could be even higher.
More detail on the lack of decentralization in practice:
in Bitcoin, the weekly mining power of a single entity has never exceeded 21% of the overall power. In contrast, the top Ethereum miner has never had less than 21% of the mining power. Moreover, the top four Bitcoin miners have more than 53% of the average mining power. On average, 61% of the weekly power was shared by only three Ethereum miners. These observations suggest a slightly more centralized mining process in Ethereum.
Although miners do change ranks over the observation period, each spot is only contested by a few miners. In particular, only two Bitcoin and three Ethereum miners ever held the top rank. The same mining pool has been at the top rank for 29% of the time in Bitcoin and 14% of the time in Ethereum. Over 50% of the mining power has exclusively been shared by eight miners in Bitcoin and five miners in Ethereum throughout the observed period. Even 90% of the mining power seems to be controlled by only 16 miners in Bitcoin and only 11 miners in Ethereum.
"Ethereum’s smart contract ecosystem has a considerable lack of diversity. Most contracts reuse code extensively, and there are few creators compared to the number of overall contracts. ... the high levels of code reuse represent a potential threat to the security and reliability. Ethereum has been subject to high-profile bugs that have led to hard forks in the blockchain (also here) or resulted in over $170 million worth of Ether being frozen; like with DNS’s use of multiple implementations, having multiple implementations of core contract functionality would introduce greater defense-in-depth to Ethereum."
P&Ds have dramatic short-term impacts on the prices and volumes of most of the pumped tokens. In the first 70 seconds after the start of a P&D, the price increases by 25% on average, trading volume increases 148 times, and the average 10-second absolute return reaches 15%. A quick reversal begins 70 seconds after the start of the P&D. After an hour, most of the initial effects disappear. ... prices of pumped tokens begin rising five minutes before a P&D starts. The price run-up is around 5%, together with an abnormally high volume. These results are not surprising, as pump group organizers can buy the pumped tokens in advance. When we read related messages posted on social media, we find that some pump group organizers offer premium memberships to allow some investors to receive pump signals before others do. The investors who buy in advance realize great returns. Calculations suggest that an average return can be as high as 18%, even after considering the time it may take to unwind positions. For an average P&D, investors make one Bitcoin (about $8,000) in profit, approximately one-third of a token’s daily trading volume. The trading volume during the 10 minutes before the pump is 13% of the total volume during the 10 minutes after the pump. This implies that an average trade in the first 10 minutes after a pump has a 13% chance of trading against these insiders and on average they lose more than 2% (18%*13%).
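The expected-loss arithmetic at the end of that passage checks out; a quick back-of-the-envelope using the study's own numbers:

```python
# Chance that an average post-pump trade is against an insider, times the
# insiders' average gain, approximates the average outsider's expected loss.
p_against_insider = 0.13  # 13% of post-pump volume is insider selling
insider_return = 0.18     # insiders' average return

expected_loss = p_against_insider * insider_return
print(round(expected_loss * 100, 2))  # 2.34 (% per trade, "more than 2%")
```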
A summary of the bad news about vote-buying in blockchains:
The existence of trust-minimizing vote buying and Dark DAO primitives imply that users of all on-chain votes are vulnerable to shackling, manipulation, and control by plutocrats and coercive forces. This directly implies that all on-chain voting schemes where users can generate their own keys outside of a trusted environment inherently degrade to plutocracy, ... Our schemes can also be repurposed to attack proof of stake or proof of work blockchains profitably, posing severe security implications for all blockchains.
This is the briefest of take-aways from my attendance at Fantastic Futures, a conference on artificial intelligence (AI) in libraries.  From the conference announcement introduction:
The Fantastic Futures conference, which takes place in Oslo December 5th, 2018, is a collaboration between the National Library of Norway and Stanford University Libraries, and was initiated by the National Librarian at the National Library of Norway, Aslak Sira Myhre, and the University Librarian at Stanford University Libraries, Michael Keller.
First of all, I had the opportunity to attend and participate in a pre-conference workshop. Facilitated by Nicole Coleman (Stanford University) and Svein Brygfjeld (National Library of Norway), the workshop's primary purpose was to ask questions about AI in libraries and to build community. To those ends, the two dozen or so of us were divided into groups where we discussed what a few AI systems might look like. I was in a group discussing the possibilities of reading massive amounts of text and/or refining information retrieval based on reader profiles. In the end our group thought such things were feasible, and we outlined how they might be accomplished. Other groups discussed things such as metadata creation and collection development. Towards the end of the day we brainstormed next steps; at the very least, we will try to use the ai4lib mailing list to a greater degree.
The next day, the first real day of the conference, was attended by a couple hundred people. Most were from Europe, obviously, and from my perspective about as many were librarians as non-librarians. There was an appearance by Nancy Pearl, who, as you may or may not know, is a Seattle Public Library librarian embodied as an action figure. She was brought to the conference because the National Library of Norway's AI system is named Nancy. A few notable quotes from some of the speakers, at least from my perspective, included:
George Zarkadakis – “Robots ought not to pretend to not be robots.”
Meredith Broussard – “AI uses quantitative data but qualitative data is necessary also.”
Barbara McGillivray – “Practice the typical research process but annotate it with modeling; humanize the algorithms.”
Nicole Coleman – “Put the human in the loop … The way we model data influences the way we make interpretations.”
The presenters generated lively discussion, and I believe the conference was deemed a success by the vast majority of attendees. It is quite likely the conference will be repeated next year and held at Stanford.
What are some of my take-aways? Hmmm:
Machine learning is simply the latest incarnation of AI, and machine learning algorithms are only as unbiased as the data used to create them. Be forewarned.
We can do this. We have the technology.
There is too much content to process, and AI in libraries can be used to do some of the more mechanical tasks. The creation and maintenance of metadata is a good example. But again, be forewarned: we were told this same thing with the advent of word processors, and in the end, we didn't go home early because we got our work done. Instead we output more letters.
Metadata is not necessary. Well, that was sort of a debate, and (more or less) deemed untrue.
It was an honor and a privilege to attend the pre-conference workshop and conference. I sincerely believe AI can be used in libraries, and that the use can be effective. Putting AI into practice will take time, energy, and prioritization. How to do this while simultaneously "keeping the trains running" will be a challenge. On the other hand, AI in libraries can be seen as an opportunity to demonstrate the inherent worth of cultural heritage institutions. ai4lib++
P.S. Along the way I got to see some pretty cool stuff: Viking ships, a fort, “The Scream”, and a “winterfest”. I also got to experience sunset at 3:30 in the afternoon.
The New Yorker this week has a profile of the Google programmer pair Jeff Dean and Sanjay Ghemawat. If the annoying phrase "superstar programmer" applies to anyone, it's probably these two, who among other things conceived and wrote the original Google MapReduce implementation. The profile includes some comments I find unusually insightful about the craft of writing code. I was going to say "for a popular press piece," but really, even programmers talking to each other don't talk about this sort of thing much. I recommend the article, but was especially struck by this passage:
At M.I.T., [Sanjay’s] graduate adviser was Barbara Liskov, an influential computer scientist who studied, among other things, the management of complex code bases. In her view, the best code is like a good piece of writing. It needs a carefully realized structure; every word should do work. Programming this way requires empathy with readers. It also means seeing code not just as a means to an end but as an artifact in itself. “The thing I think he is best at is designing systems,” Craig Silverstein said. “If you’re just looking at a file of code Sanjay wrote, it’s beautiful in the way that a well-proportioned sculpture is beautiful.”
…“Some people,” Silverstein said, “their code’s too loose. One screen of code has very little information on it. You’re always scrolling back and forth to figure out what’s going on.” Others write code that’s too dense: “You look at it, you’re, like, ‘Ugh. I’m not looking forward to reading this.’ Sanjay has somehow split the middle. You look at his code and you’re, like, ‘O.K., I can figure this out,’ and, still, you get a lot on a single page.” Silverstein continued, “Whenever I want to add new functionality to Sanjay’s code, it seems like the hooks are already there. I feel like Salieri. I understand the greatness. I don’t understand how it’s done.”
I aspire to write code like this; it's a large part of what motivates and challenges me.
I think it’s something that (at least for most of us, I don’t know about Dean and Ghemawat), can only be approached and achieved with practice — meaning both time and intention. But I think many of the environments that most working programmers work in are not conducive to this practice, and in some cases are actively hostile to it. I’m not sure what to think or do about that.
It is most important when designing code for re-use, when designing libraries to be used in many contexts and by many people. If you are only writing code for a particular business, "seeing code not just as a means to an end but as an artifact in itself" may not be what's called for; the code really is a means to the business's ends, and spending too much time on "the artifact itself" has a lot of overlap with what is often derisively called "bike-shedding." But when creating an artifact that is intended to be used by lots of other programmers in lots of other contexts to build things that meet their own business purposes (say, a Rails… or a samvera), "empathy with readers" (which is very well said) and creating an artifact where "it seems like the hooks are already there" are pretty much indispensable to creating something that actually increases the efficiency and success of the developers using the code.
It’s also not easy even if it is your intention, but without the intention, it’s highly unlikely to happen by accident. In my experience TDD can (in some contexts) actually be helpful to accomplishing it — but only if you have the intention, if you start from developer use-cases, and if you do the “refactor” step of “red-green-refactor”. Just “getting the tests to pass” isn’t gonna do it. (And from the profile, I suspect Dean and Ghemawat may not write tests at all — TDD is neither necessary nor sufficient). That empathy part is probably necessary — understanding what other programmers are going to want to do with your code, how they are going to come to it, and putting yourself in their place, so you can write code that anticipates their needs.
I’m not sure what to do with any of this, but I was struck by the well-written description of what motivates me in one aspect of my programming work.
MCN affiliate Rachael Winter Durant is the Digital Assets Manager at the Portland Art Museum, where she implements the workflows and determines the long-term strategies for digital delivery and preservation of cultural heritage information.
She places deep value in utilizing standards and procedures that allow people and institutions to engage digitally with cultural objects and preserve these assets for the future. To promote this engagement, she has chosen three primary areas to focus her early career professional development: digital accessibility, the description of visual resources, and digital preservation.
As a member of the Museum Computer Network (MCN), she’s excited about the opportunity to attend DLF Forum 2018 and NDSA Digital Preservation 2018 and to engage with colleagues outside of the museum field who are also active practitioners working and innovating around these issues, specifically through the lens of librarianship.
As I prepared to attend DLF Forum 2018 — my first DLF Forum, which I would be attending as the Museum Computer Network (MCN) affiliate through a GLAM Cross-Pollinator Fellowship — I read the code of conduct, combed through the schedule, and attended museum cohort visioning sessions. I felt energized by the focus on social justice and ethics in our profession, but in many ways preparing for the Forum did not feel dissimilar from a courting ritual between two organizations. I found myself a little anxious, gauging what my institution and DLF could offer each other and how we could help each other grow and be supported. Coming from a mid-size art museum that functions as a 501(c)(3) nonprofit, I was determined to bring knowledge and ideas back to my institution that I could put into action, even though my department doesn't have the team of open-source software developers that many of the DLF member institutions appear to have in their libraries.
When I arrived at the Forum, what I found was a community of deeply thoughtful librarian technologists examining the very core of digital librarianship: the power structures, cultural contexts, and unacknowledged invisible labor in our profession, and their unique repercussions in the digital sphere. These foundational contexts for how knowledge repositories have developed and been sustained throughout history are as critical at the smallest institutions as they are at the largest. In a recent New York Times article titled "The Newest Jim Crow" by Michelle Alexander, the author quotes data scientist Cathy O'Neil saying, "It's tempting to believe that computers will be neutral and objective, but algorithms are nothing more than opinions embedded in mathematics." This dangerous assumption of fairness and neutrality was one of the critical issues that DLF Forum 2018 tackled, and my biggest takeaways from the week centered around strategies to combat it.
Discussions about colonialism, ethics, and power in digital libraries happened in both explicit and implicit ways throughout the Forum. But the presentation I have revisited again and again since the Forum is Rafia Mirza's and Brett D. Currier's "Towards a Praxis of Library Documentation." Admittedly, having taken an early career detour as a technical writer, I am predisposed towards calls for good documentation. Yet I had not considered the deep-seated ethical implications and power dynamics of thoughtful documentation, or a lack thereof. As Mirza and Currier explain, "Documentation is a fixation of everyone's common understanding upon a decision in written form." And the implications of ethical documentation go much deeper than just increased productivity. When we take the time to examine the power dynamics of how documentation is created (if it is created), who has access to it, and who uses it, it has the ability to build trust and even support ethical labor practices. A praxis of documentation should push us to examine where unethical practices are introduced into our processes. When expectations are made explicit, labor becomes more visible, and laborers feel empowered and even respected.
In the weeks since the Forum, I have returned to the Museum and reflected on my notes from the sessions. How am I showing up in my work for my colleagues? Student workers? Under-represented communities? Where do gaps in documentation exist for my projects, and what problems could be solved if good documentation existed? Mirza and Currier point out that “along a long enough timeline, everything becomes collaborative.” With this in mind, I pull up my wiki and begin writing.
If you’d like to get involved with the scholarship committee for the 2019 Forum (October 13-16, 2019 in Tampa, FL), sign up to join the Planning Committee now! More information about 2019 fellowships will be posted in late spring.
Washington, DC, where I’m going for a conference today, is the source of a lot of public domain material. As noted in an earlier post, “works of the United States government”, regardless of their age, are not subject to copyright protection. As explained on this government website, those works include any work by a federal employee or official that’s part of their official duties. For instance, when the President of the United States makes a speech to Congress, or issues an official proclamation, that’s a government work, and it’s not copyrighted. But if the President publishes a novel while in the White House, that novel would be copyrightable, since writing novels is not part of the President’s job.
Likewise, once Presidents leave office, anything they write is copyrightable, even if it’s the sort of thing they might have written or delivered while they occupied the White House. That’s why Woodrow Wilson’s article “The Road Away From Revolution” is still under US copyright, at least for the next 22 days. He might well have delivered it as a speech if he were President, but by 1923, when it was published, he was a private citizen. The piece was copyrighted, and his widow renewed the copyright in 1950.
The piece shows Wilson uneasy with the state of the world after he left office. “The world has been made safe for democracy,” he writes, “but democracy has not yet made the world safe against irrational revolution”. The revolution he specifically names is the 1917 revolution in Russia, which by 1923 had ended up with the Bolsheviks firmly in power, and establishing the Soviet Union at the end of 1922. But Wilson is also concerned that the revolutionaries may have had a point in their condemnation of capitalism. “Is it not,” he asks, “too true that capitalists have often seemed to regard the men whom they used as mere instruments of profit… legitimate to exploit with as slight cost to themselves as possible, either of money or sympathy?”
Looking back, I wonder if Wilson was implicitly rebuking his successor in office, who campaigned on the promise “Less government in business and more business in government,” and whose administration was facing a number of scandals involving uses of government office for personal enrichment. Wilson called for a higher standard of justice and society, which “must include sympathy and helpfulness and a willingness to forgo self-interest in order to promote the welfare, happiness, and contentment of others and of the community as a whole.” He describes these values as requirements of “Christian civilization”, but they’re also values I see many non-Christians uphold and promote today. And they’re in as much need of support now as they were in 1923.
Wilson’s exhortation ran in the August 1923 issue of the Atlantic Monthly, though it was copyrighted as a book (presumably since it was also issued as a pamphlet), and the copyright renewal doesn’t mention the Atlantic. I’ve added a note to the copyright information we have on The Atlantic Monthly to warn that the renewal for the book also presumably covers the work’s magazine publication. (Otherwise, one might presume the issue was in the public domain, since it has no other renewals that I’m aware of.)
In general, if you’re looking at a periodical that seems to be in the public domain due to nonrenewal, but has a particularly famous work or author in its content, it often doesn’t hurt to do a double-check of that work’s copyright, to avoid misunderstandings. If you find any such non-obvious renewals covering content in a periodical, let me know and I’ll add notes in appropriate places in our inventory of periodical renewals.
You may remember that in August this year, mySociety and Open Knowledge International launched a survey, looking for the sources of digital files that hold electoral boundaries… for every country in the world. Well, we are still looking!
There is a good reason for this hunt: the files are integral for people who want to make online tools that help citizens contact their local politicians; such tools need to be able to match users to the right representative. From mySociety’s site TheyWorkForYou to Surfers against Sewage’s Plastic Free Parliament campaign, to Call your Rep in the US, all these tools required boundary data before they could be built.
We know that finding this data openly licensed is still a real challenge for many countries, which is of course why we launched the survey. We encourage people to continue to submit links to the survey, and we would love it if people experienced in electoral boundary data could help by reviewing submissions: if you are able to offer a few hours of help, please email email@example.com
The EveryBoundary survey FAQs tell you everything you need to know about what to look for when boundary hunting. But we also wanted to share some top tips that we have learnt through our own experiences.
Start the search by looking at authoritative sources first: electoral commissions, national mapping agencies, national statistics bodies, government data portals.
Look for data formats (.shp, .geojson, .kml, etc.), not just a PDF.
Ask around if you can’t find the data: if a map is published digitally, then the data behind it exists somewhere!
Don’t confuse administrative boundaries with electoral boundaries — they can be the same, but they often aren’t (even when they share a name).
Don’t assume boundaries stay the same — check for redistricting, and make sure your data is current.
If you get stuck
Electoral boundaries are normally defined in legislation; sometimes this takes the form of lists of the administrative subdivisions which make up the electoral districts. If you can get the boundaries for the subdivisions you can build up the electoral districts with this information.
Make FOI requests to get hold of the data.
If needed, escalate the matter. We have heard of groups writing to their representatives, explaining the need for the data. And don’t forget: building tools that strengthen democracy is a worthwhile cause.
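The tip above about building electoral districts out of administrative subdivisions can be sketched in a few lines of code. This is a minimal, hypothetical illustration: the district and subdivision names are made up, and the “geometries” are stand-in coordinate lists rather than real shapefile polygons (a real workflow would merge the polygons with a GIS library).

```python
# Hypothetical example: legislation often defines each electoral district
# as a list of administrative subdivisions. If you can find boundary data
# for the subdivisions, you can assemble the districts from them.

district_definitions = {
    "District A": ["Subdivision 1", "Subdivision 2"],
    "District B": ["Subdivision 3"],
}

# Boundary geometries you *can* find, keyed by subdivision name.
# Stand-in coordinate lists here; in practice, polygons from .shp/.geojson.
subdivision_boundaries = {
    "Subdivision 1": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "Subdivision 2": [(1, 0), (2, 0), (2, 1), (1, 1)],
    "Subdivision 3": [(2, 0), (3, 0), (3, 1), (2, 1)],
}

def build_districts(definitions, boundaries):
    """Collect the subdivision geometries that make up each district,
    flagging any subdivision with no boundary data."""
    districts = {}
    for district, parts in definitions.items():
        missing = [p for p in parts if p not in boundaries]
        if missing:
            raise KeyError(f"No boundary data for: {missing}")
        districts[district] = [boundaries[p] for p in parts]
    return districts

districts = build_districts(district_definitions, subdivision_boundaries)
print(sorted(districts))             # ['District A', 'District B']
print(len(districts["District A"]))  # 2 subdivision geometries
```

The useful part of the pattern is the missing-data check: it tells you exactly which subdivision boundaries you still need to hunt down.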
mySociety is asking people to share electoral boundary data as part of efforts to make information on every politician in the world freely available to all, and support the creation of a Democratic Commons. Electoral boundary files are an essential part of the data infrastructure of a Democratic Commons. A directory of electoral boundary sources is a potential benefit to many people and organisations — so let’s keep up the search!
It is no secret that library user needs are evolving quickly, and OCLC services are evolving even faster to keep up with these trends. This is why OCLC originally launched DEVCONNECT, so developers can learn about OCLC APIs and build skills that allow libraries to get more out of their OCLC services. Based on feedback from our community, DEVCONNECT 2018 workshops involved more hands-on coding and were online as opposed to in-person. Nonetheless, it was a productive, informative, and engaging series!
Titia van der Werf and I, with our colleague Shenghui Wang, organized a “mini-symposium on Linked Data” held in the OCLC Leiden office on 19 November 2018. We brought together 43 staff from a range of Dutch academic libraries, cultural heritage institutions, the National Library, the National Archive, and the Institute for Sound and Vision, along with OCLC colleagues based in Europe, to discuss how linked data meets interoperability challenges in the online information environment. Masterfully facilitated by Titia, the mini-symposium featured Art of Hosting techniques to amplify interaction among the participants. Participants convened around tables of six or seven, determined by their interest in a specific topic, and took notes on flip charts that served as the basis for reporting on their discussions.
Participants came primarily to learn more about linked data, network with others, and find out what other institutions were doing with linked data. The afternoon was devoted to two sessions: 1) Knowledge Exploration, where experts described their linked data implementations and the lessons learned, and answered questions from the other participants; and 2) Open Space, where participants selected specific topics proposed by other participants and discussed them. By the end of the day, 85% of the participants rated the mini-symposium “useful” or “valuable.”
Knowledge Exploration: The six linked data implementations discussed:
OCLC Linked Data Wikibase Prototype: Karen gave an overview of OCLC Research’s work with 16 U.S. libraries using the Wikibase entity-based platform that generates linked data. The prototype participants enjoyed working with the platform, as they could concentrate on entities and create linked data without needing to know the technical underpinnings like RDF and SPARQL. Wikibase embeds multilingualism, and it was easy to create statements in any script. We learned that differentiating presentation from the data itself was a challenge for catalogers unfamiliar with using Wikidata. We added a discovery layer so that catalogers could see the relationships they created and see the added value of interoperating with data retrieved from other linked data sources. In the absence of rules, documentation and best practices became more important. We also learned the need to add a “retriever” application, so prototype participants could ingest data already created elsewhere. Participants also valued the ability to include the source for each statement created, so people could evaluate their trustworthiness.
Tresoar linked data project ECHOES (RedBot.frl): Olav Kwakman explained that Project ECHOES provides the infrastructure for RedBot.frl, a site that aggregates digital collections describing the Frisian cultural heritage. The project aims to dissolve barriers to accessing diverse collections of different groups and languages through an integrated platform, offering a view of Europe as a whole. The project partners are: Erfgoed Leiden en Omstreken (project lead), Tresoar (secretary of the Deltaplan Digitalisering Cultureel Erfgoed in Fryslân), Diputació de Barcelona, Generalitat de Catalunya, and the Consorci de Serveis Universitaris de Catalunya (technology partner). The source data sets come from libraries, archives, and museums, but most are not available as linked data; the project transforms them into linked data using Schema.org, the Europeana Data Model, and the CIDOC Conceptual Reference Model. He noted the difficulties of transforming legacy data into linked data; “data quality is king”. On its first attempt, only 30% of the data could be converted into linked data; the remaining 70% had technical errors because of typos, inconsistent data, or misuse of fields. A key lesson is the value of enabling the community to help correct the data. The project uses the FryskeHanne platform both to improve the data quality and to expand its linked data datasets. Because the linked data is stored in a separate linked data datastore, it can be made available to the public to expand and enrich the source data. He also noted that no single presentation will fit all use cases, and suggested building presentations that showcase your data around a theme, while enabling communities to build their own presentations.
International Institute for Social History website (changing to https://iisg.amsterdam 31 January 2019): Eric de Ruijter noted that the original intention of the IISH was to improve its website and make its data re-usable and interoperable. Linked data provided the means to do this and was not the goal itself. They hired a contractor, Triple, which aggregated three sources of data: MARC records; Encoded Archival Descriptions of the IISH’s archival collections; and data sets and some articles. The data was marked up as RDF and saved in a local triple store, which is currently available only to the web site. It supports searching through all collections and provides a faceted overview of all entries extracted from the aggregation of data sources. They made use of “unifying resources” such as the Getty’s Art and Architecture Thesaurus (AAT), Library of Congress headings, and the Virtual International Authority File (VIAF) for identifiers which “turned names into people.” Lessons learned: Monitor your contractors; they know the technology but not the semantics of library data. Employ an iterative process, and just start and learn, rather than trying to address all possible eventualities. The extra work helps improve your source data.
Koninklijke Bibliotheek’s data.bibliotheken.nl: René Voorburg noted that as the KB is responsible for collecting everything published in the Netherlands and publishing the corresponding metadata, linked data would expose it to a larger audience on the web. But linked data does not yet have a “proper place” in the KB’s daily workflow. Although the KB has converted its data to RDF triple stores using Schema.org (where vagueness is considered a plus), only those who are familiar with SPARQL (an RDF query language) can take advantage of it, such as scientists who are happy with the results. They have enriched their metadata for authors with identifiers pulled from VIAF and Wikidata, and publish linked data about book titles on the edition level. But the published linked data dataset is not maintained and has not had any updates since it was first published. They are still working out how to make it easier to provide the data, and how to handle privacy issues.
Data Archiving and Networked Services (DANS): Andrea Scharnhorst focused on issues around the long-term archiving of RDF datasets. Increasingly, social sciences and humanities (SSH) produce RDF datasets, and consequently they become objects for long-term archiving with DANS services (EASY, Dataverse). What to do with specific vocabularies used in SSH? Which vocabularies should be deposited alongside the RDF datasets which use part of them? Who is responsible for the maintenance of these vocabularies and for resolving an archive’s URI when the domain name changes? How much of an “audit trail” (provenance) is necessary (who did what when and why)? What criteria should be used to decide whether a web site should be archived, and which URIs should be snapshotted for ingest? Together with the Vrije Universiteit Amsterdam, host of the LOD Laundromat—a curated version of the LOD cloud that DANS archives here—DANS is working on a Digging into the Knowledge Graph project on issues of indexing and preserving Knowledge Organization Systems to provide access to data both in the present and the future.
Rijksdienst Cultureel Erfgoed (RCE) [Cultural Heritage Agency of the Netherlands]: Joop Vanderheiden noted that the RCE is in the process of implementing a linked open dataset describing Dutch monuments and archaeology, to expose their data to a wider audience on the web. RCE will be publishing all its data about built monuments (62,000) and archaeological monuments (1,400). In addition, all data from its archaeological information system ARCHIS will be published as linked data. Its API and SPARQL endpoint will be available to everyone in January 2019. Even in the initial stages, they can demonstrate what is possible. They have garnered more interest in their work through “LOD for Dummies” presentations. Their work has benefited from the Digital Environment Act (Digitaal Stelsel Omgevingswet or DSO) and the Dutch Digital Heritage Network (NDE).
Open Space: Participants discussed three questions:
How can we explain/understand the benefits of linked data without getting lost in technical details? Among the benefits cited: pointing to specific examples of successful linked data implementations; bridging different silos of data without destroying the integrity of the respective sources; taking advantage of others’ databases so that you don’t have to replicate the work, saving labor costs; enriching your own data with the expertise provided by others; providing cross-cultural, multilingual access; giving your users a better, richer experience; and increasing the visibility of your collections and expanding your user base now and in the future.
Many individual projects—how can they be related to each other? Participants referred to Tim Berners-Lee’s original set of linked data principles and stressed the need to conform to international standards. Wikidata is an example of bringing together multiple sources. Relationships among individual projects could be more easily established if implementers reused existing ontologies rather than creating their own. We need to share best practices with each other.
How to integrate LOD into your CBS/local cataloging system? Participants recommended that source data should not be modified but linked (via link tables) to trustworthy “sources of truth.” WorldCat was viewed as a “meta-cache” for a discovery layer. Participants wondered what criteria to use to determine which data sources should be consumed. They noted that maintaining relationships among entities in linked data sources was important, and that all thesauri need to be publicly available as linked data.
Takeaways: Participants valued the discussions. One noted that they received some answers to the questions they came with, but were returning home with even more questions—“potentially a good thing?” Some found consolation that cultural heritage institutions were more or less on the same level. The brief descriptions of a variety of specific linked data projects were appreciated. If linked data is published, you have to “keep it alive” (update it). Some noted the gap between people who work with linked data and those with the technical know-how. Proper planning and funding at the institutional level are needed. “We are not alone in this! I would like to come together more often.”
“In previous years he had let the festival which for centuries had illuminated the marvel of the Maccabees with the glow of candles pass by unobserved. Now, however, he used it as an occasion to provide his children with a beautiful memory for the future.”
That’s a (translated) quote from “The Menorah”, a short story by Theodor Herzl originally published in Die Welt in 1897. It seems an apt story both for Hanukkah, which ends at sundown tomorrow, and for illustrating Herzl’s own character. Hanukkah, commemorating the rededication of the Second Temple, has themes of light and self-determination for Jews. Herzl, too, envisioned a brighter, self-determined future for the Jewish people. He was an advocate of Zionism so influential that he was mentioned by name, decades after his death, in the 1948 Declaration of the Establishment of the State of Israel as “the spiritual father of the Jewish State”.
Herzl died in 1904, so most of his writings, including “The Menorah”, have been in the public domain for a while. (This page links to online copies of many of his writings.) But US copyright law treats unpublished works differently from published ones. For instance, unpublished works have been the only ones entering the public domain here for the last 20 Public Domain Days– specifically, works that had not been published before 2003 or registered for copyright before 1978, by authors who died more than 70 years ago. Back in 1923, though, unpublished works were under indefinite “common-law” copyright. The limited-time statutory copyright terms of federal copyright law started once a work was published or registered for copyright.
Thus, the US copyright clock for Herzl’s diaries (or “Tagebücher” in German) would not start until their first publication, well after his death, in 1922 and 1923. The first volume of a three-volume edition was published and copyrighted in 1922, and is in the public domain now. Copyrights for the other two volumes were registered effective January 1, 1923, and renewed in 1950. Those volumes, then, are still in copyright in the US for another 23 days, but at the new year, the complete edition will be in the public domain here.
The first published edition of a revered figure’s private papers is not always the best. In “Theodor Herzl’s Diaries as a Bildungsroman”, published in the Spring-Summer 1999 issue of Jewish Social Studies, Shlomo Avineri writes that the compilers of the first edition “chose to erect an heroic monument, not provide a full text”. Passages unflattering to Herzl or his contemporaries in the Zionist movement were cut, with no notice of the omissions. “Such defensive strategies”, Avineri writes, “tend to diminish the stature of the person they aim to protect.”
A more complete version did not appear in print until the 1960s, says Avineri, and that was an American English translation. The Complete Diaries of Theodor Herzl, edited by Raphael Patai, was a 5-volume set translated from manuscripts in the Central Zionist Archives in Jerusalem, and published in New York by the Theodor Herzl Foundation. It was registered for copyright in 1961, with a 1960 copyright notice. Unlike the 1922-1923 edition, though, copyright for this version was not renewed, as was required for US publications of the time.
So is this better version in the public domain now? Probably not, in my view, since the work includes translations of previously published German diary entries that are still under copyright– namely, those portions first published in the 1923 volumes. But come January 1, the copyright for those 1923 volumes will expire in the US, and if there were no other earlier publications for what was first published in Patai’s edition, then we may have this edition, as well as the 1922-1923 first edition, joining the public domain in the US then.
I’m not an expert in the publishing history of Herzl’s papers, or in all the ins and outs of copyrights of unpublished and partly-published works. Comments from readers who know more about either are welcome.
Happy Hanukkah to all my readers who celebrate it!
1923 was a big year for novelty songs. We’ve already noted “Barney Google”, one of the hits of the year. On Twitter, Bill Higgins, who recalls his mother singing that song when he was young, reminded me of an even bigger novelty hit, “Yes! We Have No Bananas”. The rapid expansion of radio broadcasting quickly put the song into the ears of listeners across the country and beyond, and its sheet music reportedly sold 2 million copies within 3 months of its initial publication. Copyrighted in 1923, and renewed in 1950, the song will join the public domain 24 days from now.
How much of this resemblance was deliberate recycling, and how much was just coincidence? At least some of it was deliberate; in particular, the “old fashioned to-mah-to” phrase in “Bananas” is sung to the melodic motif of Porter’s “Old Fashioned Garden”, and an article about the song that appeared in the June 14, 1923 issue of Variety says that Porter’s publisher granted permission for that bit. On the other hand, Spaeth admits in The Common Sense of Music (1924) that the melodic phrase used for both “oh bring back my Bonnie to me” and “we have no bananas today” is a common melodic ending used in many songs.
With a limited number of tones in the western scale, and a limited number of common chord progressions, patterns, and basic rhythms, pretty much every song reuses elements from other songs. If the “Bananas” chorus could be completely derived from one previous work, one could make a credible argument for plagiarism, but having to quote five different works to reconstruct (most of) the chorus supports the argument that the chorus is mostly an original melange of generic musical patterns that live in the public domain. Indeed, after Spaeth became famous from his performances and books, he was frequently called as an expert witness to make arguments of that sort in court cases alleging musical plagiarism.
If someone had brought such a plagiarism suit against “Yes! We Have No Bananas”, it would have also helped that the songs Spaeth cites in his routine (other than the licensed “Old Fashioned Garden”) were legally in the public domain as well. Handel’s Hallelujah Chorus premiered in the 18th century; the “My Bonnie” melody is a traditional folk tune; “Marble Halls” dates from 1843; and “Seeing Nellie Home” is from the 1850s. Since US copyright terms in 1923 ran a maximum of 56 years, copyrights to any of these works had expired well before “Bananas” came out. If the 95-year maximum terms that now apply to 1923’s songs applied then, though, Silver and Cohn might have had more reason to be cautious in their composing.
Much of this article is based on what I learned from Gary A. Rosen’s book Unfair to Genius (2012) as well as Sigmund Spaeth’s Words and Music (1926). I thank Kip Williams and Bill Higgins for pointing me towards these sources. This song’s calendar entry goes out to the folks at Making Light, where the song and its predecessors were discussed some years back. If you’d like to make your own requests or dedications, you can leave them as a blog comment or contact me.
In yesterday’s advent calendar post, I noted that in the 1920s, “newspapers and the comic strips that appeared in them were considered ephemeral, and copyright renewal was very rare”. One newspaper that represents an exception to this rule is the New York Tribune, which now has active copyright renewals for all of its issues from January 1, 1923 onward. All of its 1923 issues will be joining the public domain 25 days from now.
As I see it, the Tribune‘s most distinctive additions to the public domain in January will be its columns and features. Those included regular pieces by Don Marquis, including new dispatches from his characters archy and mehitabel. The Tribune also ran serialized fiction, as did many other newspapers then. From July through September of 1923 it ran weekly installments of a new adventure of Hugh Lofting‘s Doctor Dolittle, one that would not be published in book form until 1948 (as Doctor Dolittle and the Secret Lake).
I’ll be happy to see the New York Tribune and Herald available for digitization in January. But even now we could be reading digitized news stories from 1923 in other public domain newspapers, following the further revelations in the Teapot Dome Scandal after President Harding’s sudden death in office, or reading American coverage of the end of the Irish Civil War. We could follow day-to-day the pennant races of the Yankees and the Giants, who would eventually face each other in the 1923 World Series, through New York’s other newspapers now in the public domain. Right now, though, there’s not a whole lot of newspaper content freely readable online from 1923, compared to what’s available from 1922.
The main effect of Public Domain Day 2019 for American newspapers, then, is not so much putting a lot of 1923 newspaper content in the public domain, but making it much easier to know that 1923 newspaper content is in the public domain. Once January 1 arrives, you can safely assume that the contents of any US newspaper with a 1923 dateline are free to digitize, share, and adapt. Before then, you may have to do a lot more research to be sure of that. So there’s less of it online.
The draft guide for determining the copyright status of serial issues that I announced yesterday is meant to make it easier to tell whether something of interest in a newspaper, magazine, journal, or other serial from the mid-20th century is public domain. Of those types of serials, newspapers will still be among the most challenging to copyright-clear, since many of them routinely included reprints from other publications or syndicated content. But original news articles, which are often among the most historically interesting parts of newspapers, should be significantly easier to clear. And I haven’t seen systematic renewals in the Catalog of Copyright Entries of syndicated comics or photos published before the 1930s, so many 1920s newspaper issues might not be that hard to clear in their entirety. Magazines, which tend to rely less on reprints and syndications, should be less difficult to clear, and journals, which typically contain completely original articles, will probably be easier still.
I’m glad that the upcoming Public Domain Day will give us access to more of American newspaper memory, both through the outright expiration of copyright, and through putting another year’s worth of newspapers past the easy “bright line” for determining public domain status. But I also want to make it easier to identify and bring back to light public domain newspapers and other serial content well past 1923. There’s a lot to remember over the last 95 years, and much that can be forgotten over that length of time.
Margo Padilla manages services, programs, and initiatives related to archiving and born-digital stewardship for the Metropolitan New York Library Council (METRO). She serves as the primary contact with archives in the METRO region and designs and implements archives-related services in Studio599. Before joining METRO in 2014, she was a resident in the inaugural cohort of the National Digital Stewardship Residency program. Prior to that, she worked at The Bancroft Library at the University of California, Berkeley on digital projects and initiatives. Margo received her MLIS with a concentration in Management, Digitization, and Preservation of Cultural Heritage and Records from San Jose State University and her undergraduate degree from the University of California, Berkeley.
As a first-time attendee at the DLF Forum, I was immediately struck by the diverse group of individuals that were pulled together and the welcoming environment that was created by organizers. It felt like a space where everyone could be their authentic self and speak honestly about their experiences. This provided an opportunity to engage in impactful conversations and generated an elevated awareness on a number of different topics. It felt like a particularly critical time to be in such an environment, when long-held professional practices, methods for representation in memory organizations, and workplace and labor issues are being challenged and beginning to shift.
The sessions that drew my interest were those focused on community archives. Through my own work with independent memory projects, I have observed a rejection of mainstream archival institutions and practices. Traditional archives are, in most cases, perceived as being part of the ivory tower of academia and inaccessible to marginalized groups, a place where their stories are interpreted through the lens of white culture. In the Community Archives: New Theories of Time, Access, Community, and Agency session, panelists explained that community archives are creating a space where individuals can document their own histories and ensure collections are given the right context. While there is the desire to increase representation in traditional archival collections, the panel encouraged the audience to respect the autonomy of community archives and instead seek ways to empower independent memory workers through equitable partnerships. The support, collaboration, and autonomy fostered in digital scholarship labs is one potential model for how we can transform use of traditional archival reading rooms, changing them to learning environments where memory workers are invited to develop skills to steward their own collections. Two notable projects in this area are the Community Archives Lab at UCLA and the Southern Historical Collection’s Community-Driven Archives.
I reflected further on the perception of institutional archives during Jennifer Ferretti’s talk on the Building Community and Solidarity: Disrupting Exploitative Labor Practices in Libraries and Archives panel. In addition to pointing out the lack of POC in management and leadership roles, Ferretti called for an end to “fit culture” within LAMs which, as she pointed out, implicitly expects POC to leave their culture at the door in order to fit within the dominant white culture of the profession. As we explore issues of representation in our collections and how we might collaborate with community-based archives, we must also continue to do the work and take action to address representation within our own profession. Independent memory workers’ mistrust of traditional archives and their perception of themselves as outsiders may correlate with the absence of representation on LAM staff, an issue that may be compounded by perceiving those few POC that are in administrative roles as having to modify their identities in order to be the “right fit” for the organization.
Attending the DLF Forum was a distinct conference experience and I am so grateful for the generous support that enabled me to attend a conference where involuted issues like these could be candidly discussed. I’m particularly grateful that the DLF recognizes the need to support mid-career professionals as we continue to expand our knowledge and develop our careers.
If you’d like to get involved with the scholarship committee for the 2019 Forum (October 13-16, 2019 in Tampa, FL), sign up to join the Planning Committee now! More information about 2019 fellowships will be posted in late spring.
In this “From The Field” series, we’ll explore the Query Workbench within Fusion Server and walk through helpful tips and tricks on making the most of your search results. This post discusses how to quickly (in less than five minutes) highlight search terms within search results and explore other available highlighting features. Let’s start the timer:
What Is Highlighting?
When users are presented with search results, they often see snippets of information related to their search. Highlighting reveals the keywords inside those snippets of results so the user can visually see the occurrences. This functionality enhances the user experience and usability of search results.
To get started, we’re going to use a previously built Fusion App that performed a website crawl of lucidworks.com. After logging in to Fusion, selecting our app, and opening the Query Workbench from the Querying menu, we’ll be presented with the crawled documents.
The highlighting features are driven by Solr query parameters, through the Additional Query Parameters stage. Open the Add a Stage dropdown menu and select Additional Query Parameters to add the stage to the Query Pipeline. (See the Query Pipelines documentation for details.)
On the Additional Query Parameters stage, name the stage by adding a label, such as “Highlighting.” We’ll begin by adding the two required Solr parameters (hl and hl.fl):
We give the hl parameter a value of true to enable the highlighting, and the hl.fl (field list) parameter a wildcard value of * to match all fields where highlighting is possible. In production, you will want to explicitly define the fields to match. Click Save to apply the changes. Hint: You can click the Cancel button to close out the stage panel.
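As a sketch of what this stage ultimately sends to Solr, the two required parameters can be expressed as raw query arguments. The query term and the use of Python's `urlencode` here are illustrative only, not a Fusion API:

```python
from urllib.parse import urlencode

# The two required highlighting parameters from the stage above,
# expressed as raw Solr-style query arguments.
params = {
    "q": "data",   # the user's search terms
    "hl": "true",  # enable highlighting
    "hl.fl": "*",  # match all highlightable fields; narrow this in production
}
query_string = urlencode(params)
print(query_string)
```

The wildcard in `hl.fl` is URL-encoded automatically, which is one reason to build query strings with an encoder rather than by hand.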
By default, the Query Workbench does not display highlighted results. To enable display of highlighted results, open the Format Results options at the bottom and check the Display highlighting? option. Click Save to apply the change.
Now let’s test a query to see the highlighting in action. In our query field, we’ll perform a search for data:
We can now see matches from the query being highlighted, as well as the fields that contain the matches. The highlighted fragments shown under each result in the Query Workbench come from the highlighting section of the response. To view the response, click on the URI tab and copy/paste the Working URI into a new browser tab:
This Query Pipeline API response provides a highlighting section for each document with the matching snippets per field:
"Lucidworks | Dark <em>Data</em>"
"What you know about your <em>data</em> is only the tip of the iceberg. #darkdata @Lucidworks"
"Lucidworks: The <em>Data</em> that Lies Beneath"
"Lucidworks: The <em>Data</em> that Lies Beneath"
"Dark <em>Data</em> is Power."
"00.100 THE <em>DATA</em> THAT LIES BENEATH What you know about your <em>data</em> is only the tip of the iceberg"
"Big <em>Data</em> is Failing Pharma"
"Big <em>Data</em> is Failing Pharma"
"Big <em>Data</em> is Failing Pharma"
" machine learning, and artificial intelligence. Learn more › Quickly create bespoke <em>data</em> applications for"
Using a tool such as Fusion App Studio, highlighting will be parsed and displayed automatically on the front-end UI. For custom UI integrations, the Query Pipeline API’s response with highlighting information can be easily parsed for presentation.
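For a custom integration, pulling the snippets out of that response is a small amount of work. Here is a minimal sketch in Python, assuming a Solr-style JSON body with a top-level highlighting section; the document id and field names below are invented for illustration:

```python
import json

# A hypothetical response fragment shaped like the highlighting section
# shown above; "doc-1", "title_t", and "body_t" are made-up names.
raw = """
{
  "highlighting": {
    "doc-1": {
      "title_t": ["Lucidworks | Dark <em>Data</em>"],
      "body_t": ["Big <em>Data</em> is Failing Pharma"]
    }
  }
}
"""
response = json.loads(raw)

# Flatten the per-document, per-field snippet lists for display.
for doc_id, fields in response["highlighting"].items():
    for field, snippets in fields.items():
        for snippet in snippets:
            print(f"{doc_id} [{field}]: {snippet}")
```

A front-end would typically render the `<em>` wrappers as-is (or map them to its own markup) rather than strip them.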
Additional Highlighting Parameters
Up to this point, we’ve only enabled highlighting and used default parameters to demonstrate the core functionality. When deploying in production, however, we may want to be more selective about the fields that require highlighting, the tags placed before and after a highlighted term, and which highlighter best fits our needs.
When choosing a highlighter, be conscious of the index cost of storing additional highlighting features. For example, besides the stored value, terms, and positions (where the highlighted terms begin and end), the FastVector Highlighter also requires full term vector options on the field. The chosen highlighter can therefore affect both index size and query execution time. See the Solr Highlighters section below for more information.
By default, only one snippet is returned per field. The parameter hl.snippets controls the number of snippets that will be generated. For example, the default value of 1 returns the following:
When this value is increased to 3, additional snippets within the body_t will be highlighted:
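Sketched as raw query arguments (the `body_t` field name comes from the example above; the rest of the values are illustrative):

```python
from urllib.parse import urlencode

# Raise hl.snippets from its default of 1 so that up to three matching
# fragments per field are returned.
params = {
    "q": "data",
    "hl": "true",
    "hl.fl": "body_t",
    "hl.snippets": "3",
}
print(urlencode(params))
```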
Most commonly, an HTML tag is placed before and after the highlighted term for the presentation layer. By default, the pre tag is <em> and the post tag is </em>. Depending on the chosen highlighter, the parameter prefix will be either hl.tag. (Unified Highlighter) or hl.simple. (Original Highlighter). Any string can be used for the respective pre and post parameters.
For example, if we wanted to change to a <strong> HTML tag, we configure the following parameters:
Note that the parameter value for an HTML tag must be escaped.
This would generate the following result:
The highlighting section of the Query Pipeline API response would also reflect this change:
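As a sketch, assuming the Original Highlighter’s hl.simple. prefix, the <strong> configuration might look like this when built as raw query arguments; note how URL-encoding takes care of escaping the angle brackets:

```python
from urllib.parse import urlencode

# Swap the default <em></em> wrappers for <strong></strong>.
# hl.simple. is the parameter prefix used by the Original Highlighter;
# the encoder escapes the angle brackets in the tag values for us.
params = {
    "hl": "true",
    "hl.fl": "*",
    "hl.simple.pre": "<strong>",
    "hl.simple.post": "</strong>",
}
query_string = urlencode(params)
print(query_string)
```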
Solr offers several highlighters: the Original (default) Highlighter, the Unified Highlighter (new as of Solr 6.4), and the FastVector Highlighter. Each has tradeoffs between accuracy and speed. Depending on your workload and needs, you may want to evaluate each one to see how it performs on searches for terms, phrases, and wildcards.
The study comprises an overview report summarising the overall findings and identifying opportunities for knowledge transfer and regional cooperation as well as specific reports assessing to what extent governments in Latvia, Sweden and Finland have implemented internationally agreed-upon open data principles as part of their anti-corruption regime, providing recommendations for further improvement at the national level.
The study is the outcome of a project funded by the Nordic Council of Ministers. The aim of the project was to gain a better understanding of how Nordic and Baltic countries are performing in terms of integration of anti-corruption and open data agendas, in order to identify opportunities for knowledge transfer and promote further Nordic cooperation in this field. The study assessed whether 10 key anti-corruption datasets in Latvia, Finland and Sweden are in line with international open data standards. The datasets considered in the frame of the study are:
Beneficial ownership register
Public officials’ directories
Public procurement register
Political financing register
Parliament’s voting records
In this respect, Sweden has made only 3 of the 10 key anti-corruption datasets available online and fully in line with open data standards, whereas Finland has made 8 of these datasets available online, six of which are fully in line with open data standards. As for Latvia, 5 have been found to be available and in line with the standards. When the three countries are scored on anti-corruption datasets, Sweden fares worst: it has the lowest score, 5.3 out of 9, while Finland and Latvia scored 6.1 and 6.0, respectively. There are also signals that transparency in Sweden has been worsening in recent years, despite its long tradition of efficiency and transparency in public administration, good governance, and rule of law, and despite ranking in the top 10 of Transparency International’s Corruption Perceptions Index (CPI) for several years.
The problem in Sweden stems from the fact that the government has had to cope with the high decentralization of the Swedish public administration, which seems to have resulted in little awareness of open data policies and practices and their potential for anti-corruption among public officials. Thus, engaging the new agency for digitalisation, Agency for Digital Government (DIGG), and all other authorities involved in open data could be a solution to develop a centralised, simple, and shared open data policy. Sweden should also take legal measures to formally enshrine open data principles in PSI (Public Sector Information) law such as requiring that all publicly released information be made ‘open by default’ and under an ‘open license’.
The situation in Finland and Latvia is more promising. In Finland, a vibrant tech-oriented civil society in the country has played a key role in promoting initiatives for the application of open data for public integrity in a number of areas, including lobbying and transparency of government resources.
As for Latvia, it has made considerable progress in implementing open data policies in recent years, and the government has actively sought to release data to increase public accountability in areas such as public procurement and state-owned enterprises. However, the report finds that much of this data is still not available in open, machine-readable formats, making it difficult for users to download and work with it.
Overall, in all three countries it seems that there has been little integration of open data in the agenda of anti-corruption authorities, especially with regard to capacity building. Trainings, awareness-raising and guidelines have been implemented for both open data and anti-corruption; nonetheless, these themes seem not to be interlinked within the public sector. The report also emphasizes the lack of government-funded studies and thematic reviews on the use of open data in fighting corruption. This applies both to the national and regional level.
On the other hand, there is also a considerable potential for cooperation among Nordic-Baltic countries in the use of open data for public integrity, both in terms of knowledge transfer and implementation of common policies. While Nordic countries are among the most technologically advanced in the world and have shown the way with regard to government openness and trust in public institutions, the Baltic countries are among the fastest-growing economies in Europe, with a great potential for digital innovation and development of open data tools.
Such cooperation among the three states would be easier in the presence of networks of “tech-oriented” civil society organisations and technology associations, as well as the framework of cooperation with authorities with the common goal of promoting and developing innovation strategies and tools based in open data.
I recently re-read a gem of a paper by Leonard Rapport about the process of reappraisal, or reviewing an archives’ holdings and reevaluating whether they still need to be retained (Rapport, 1981). It was written just as the deluge of data generated by computation was ramping up, but before the explosion of the Internet and the web had begun. Rapport’s argument hinges on the difference between “permanent” and “continued” preservation. It is the latter that the Federal Records Act of 1950 stipulated:
The Administrator, whenever it appears to him to be in the public interest, is hereby authorized to accept for deposit with the National Archives of the United States the records of any Federal agency or of the Congress of the United States that are determined by the Archivist to have sufficient historical or other value to warrant their continued preservation by the United States Government.
The use of “continued preservation” made it through the various amendments to the act over time, including Obama’s addition of electronic records in 2014. So it’s funny we often talk about records going to archives “permanently”, and design systems that enforce or enshrine this idea.
Obviously not all archives are run the same way as NARA in the US of A. But it is comforting to recognize that the architects of this system understood that not only could we not keep all records, but there were limits on how long they could be kept. Rapport makes a strong humanistic argument that the cost of keeping records should always be weighed as best they can against their perceived use, and that this is an iterative process.
Just a few years later Bearman (1989) argued, again pretty persuasively, that the process of appraisal was fundamentally broken in the age of electronic records because of the staggering increase in volume of records. While he is sympathetic to Rapport’s argument, Bearman says that we can’t reappraise because our value-based process for appraisal is itself broken. Reappraising without changing what it means to appraise would just make the problem many times worse. Instead Bearman says we need to shift the analysis from the value of records, to the activities that generate records, and assessing the importance of records to those activities. Of course this anticipates the work on Macro-Appraisal (Cook, 2004) and Documentation Strategies (Samuels, 1991) that was to follow.
So it was fun to follow this little thread, but I mostly wanted to blog about Rapport’s article because it closes with this really wonderful and enigmatic little thought experiment about archives:
If that does not put a troubled appraiser in a more comfortable frame of mind, share with me two apocalyptic visions. In the first it suddenly becomes possible to keep a copy of every single document created, and, for these documents, a perfect, instantaneous retrieval system. In the second, and less blissful, vision the upper atmosphere fills with reverse neutron bombs, heading toward every records repository. These are bombs that destroy records only, not people. They come down and obliterate every record of any sort.
Keeping these two events in separate parts of your mind, project forward a century. How different would the two resultant worlds be? In the first would our descendants, having all the information that it is possible to derive from documents, have, therefore, all knowledge? And if they have all knowledge would they have, therefore, all wisdom?
In the second, lacking the records we have as of this moment, would our descendants wander in a world of anarchy, in a world in which they would be doomed to repeat the errors of the past?
I leave it to you to conjecture as you please. My own guess is that between these two worlds there wouldn’t be all that much difference.
Why would there be no difference between these two scenarios? Do you agree? How prescient are they as we look at what technologies like IPFS are attempting to make available, and we consider the threat of a strategically deployed Nuclear Electromagnetic Pulse?
Back in June, I announced that we had completed an inventory of all serials with active copyright renewals made through 1977, based on listings in the Copyright Office’s Catalog of Copyright Entries. At the time, I said we’d also be releasing a draft of suggested procedures for using the information there, along with other data, to quickly identify and check public domain serial content. (If you’ve been following the Public Domain Day advent calendar I’ve been publishing this month, you’ll have seen the inventory or its records mentioned in some recent entries.)
It took a little longer than I’d hoped, but after having some librarians and IP experts have a look at it, I’m pleased to announce that the draft of “Determining copyright status of serial issues” is open for public comment. I hope this will become something that people can use or adapt to identify public domain content of interest to them, so it can be digitized, adapted, or otherwise shared with the world.
It’s challenging to come up with a guide that will work for every audience, whether that’s folks wanting to digitize a lot of stuff quickly without a lot of fuss, or dig deeply into the status of certain publications they really want to work with, or who are lawyers or serious intellectual property nerds. But I hope the document I’ve produced will have some use for all these folks, and since it’ll be licensed CC-BY once it comes out of draft status, folks should be free to adapt it for more targeted audiences and projects. (I’d also love to eventually see visually effective graphics or flow-charts based on it; graphic design of that sort isn’t really my forte.)
My aim at this point is to bring the document out of draft status at the start of next month. (Public Domain Day would be a very appropriate time to make it official.) If you want to comment on the draft, getting your comments to me by December 25 should give me enough time to make any appropriate responses or revisions. You can email them to me at (ockerblo) at (upenn) dot (edu), or post a public comment on this blog post, or get a hold of me via my other public contacts or forums I frequent.
“Barney Google, with the goo-goo-googly eyes,
Barney Google bet his horse would win the prize.
When the horses ran that day, Spark Plug ran the other way!
Barney Google, with the goo-goo-googly eyes!”
Several novelty songs introduced in 1923 have been remembered long afterwards. The one quoted above is “Barney Google”, written by Billy Rose (whose other songwriting credits include “Me and My Shadow” and “Does Your Chewing Gum Lose its Flavor on the Bedpost Overnight?”) and Con Conrad. The song’s registration with the Copyright Office calls it a “fox-trot”, the foxtrot being one of the most popular dances of the time. (The dance would stay popular up to the early days of rock’n’roll; the label on a 1953 single record for “Rock Around the Clock” describes it as a “novelty foxtrot”. But I digress.)
Originally performed by Eddie Cantor, “Barney Google” was also adopted by vaudeville acts like Olsen and Johnson, and other singers. It was still recognizable enough by the 1970s to be adapted for TV commercials I recall seeing on Saturday mornings for a brand of sweetened peanut butter (“Koogle, with the koo-koo-koogly eyes!”)
Copyrighted in 1923, and renewed in 1950, the song will join the public domain 26 days from now. It’s based on a comic strip whose beginnings are already in the public domain, and which is still running.
Barney’s comic strip was originally titled Take Barney Google, F’rinstance. It began in 1919 in the Chicago Herald Examiner, but got a much wider distribution after King Features began syndicating it later that year. It got even more popular with the introduction of Barney’s horse, Spark Plug, in 1922. So when the “Barney Google” song came out the following year, listeners were already familiar with the characters in it. Barney would also appear in numerous silent films and cartoons in the 1920s and 1930s. In the 1930s, he met a hillbilly named Snuffy Smith, who would eventually take over the comic strip. (It’s now titled “Barney Google and Snuffy Smith”, though when I read it in the papers as a kid, “Snuffy Smith” was in much bigger lettering, and I wondered who Barney Google was, since I never saw him in the strip. He’s reportedly made cameo appearances more recently.)
The early years of Barney Google’s strip are in the public domain, though you may have difficulty finding good copies. For the most part, newspapers and the comic strips that appeared in them were considered ephemeral, and copyright renewal was very rare. But after the success of Blondie and Superman, which both began in the 1930s, and the ongoing popularity of Disney’s characters, comics were increasingly seen as valuable intellectual property, and started to be renewed regularly. A large number of periodicals in my periodical renewals inventory that renewed from the very first issue are comic books.
It’s a bit tricky to identify the point at which Barney Google’s strips stop being public domain and start having active copyright. You’ll search in vain for his name in book, periodical, or artwork renewal sections of the Catalog of Copyright Entries. The first active copyright renewal covering the Barney Google strips, as far as I can tell, doesn’t mention his name at all. It’s the renewal for the May 4, 1933 issue of King Features Comic Weekly, one of a set of periodicals from King Features that I have a hard time finding in any libraries. It appears to have existed solely for distribution to newspapers, copyright protection, or both. From what I can tell, the King Features weeklies included a copy of every comic strip, newspaper column, or other feature that King Features distributed. Registering an issue with the Copyright Office would register the copyrights for the features it included. And renewing the issue would also renew the copyrights to those features.
Those renewals also mean that any newspaper from May 1933 and onward running any of those renewed King Features comic strips still has some copyrighted material in it, even if there was no renewal for the newspaper itself or any of its other contents. That may complicate cultural initiatives to put newspapers from that era online. Later today, we’ll be releasing the draft of a guide for determining the copyright status of serials, which discusses some of the relevant issues (including that of syndicated features). I’ll have a special post on that shortly. (Update: It’s out now.) And I’ll also discuss newspaper copyrights more in tomorrow’s calendar entry.
1. Most digital transactions require verified claims
Much of Tim’s narrative assumes that there is clear ownership of data, which is far from straightforward. Different entities are looking for different kinds of data:
For the majority of digital transactions and interactions (buying things online, applying for services, booking a flight, proving my age), the most valuable data is data asserted about me from an authoritative source. For example, that I have a valid driving license or verified address, bank account, passport.
For advertising, it’s what I bought and where I clicked as well as profile data (email address, demographic and interests info). This data is generated by the services I use (e.g. Facebook, Google, Twitter).
For AirBnB and Uber it’s the ratings that other users have given me that’s important, which isn’t data I obviously ‘own’.
Yes, in theory I own data on the Internet that I originate, such as this blog post. Storing and disseminating these items from a personal pod doesn't pose problems. But this kind of data isn't valuable, partly at least because it isn't objective:
Yes, some of this can be self-asserted, but organisations often want objective data based on behaviour and decisions made about us not what we say is true. Mortgage brokers don’t just want my assertion that I have income, they want proof.
The valuable data almost all comes from Web interactions that have at least two sides; logins, purchases, clicks, etc. These generate data on both sides. Why do I think I own the data that the other side of these transactions collects? So, even if my pod stores a copy of my purchase data, Amazon will also have a copy. Anyone wanting to use my purchase data will want to get it from Amazon as being more objective.
My pod could store signed attestations of data about me from third parties such as Amazon, but why would anyone want those in preference to data straight from the horse’s mouth? If the only data that has to be accessed from my pod is data I generated there’s not going to be much decentralization.
Decentralized systems will continue to lose to centralized systems until there’s a driver requiring decentralization to deliver a clearly superior consumer experience. Unfortunately, that may not happen for quite some time.
So if we can’t sell privacy as a product in social media, we need evidence of where else these priorities will bring users. Alternatively, decentralised or PDS-integrated tech must deliver novel and valued functionality or be solving major problems users have with existing centralised solutions.
For companies, service providers and app developers the value proposition is hazy. I have yet to come across a PDS provider with an impressive or long list of partners and companies. Most existing business models depend on controlling the data and using it to improve a service and provide valuable analytics to up-sell paid plans or directly monetise the data collected through advertisers and third party data marketplaces. Giving this up requires incentives or regulation.
I agree with Cory Doctorow that the key requirement for a decentralized Web is anti-trust enforcement. Otherwise even if a decentralized Web company has a viable business model and starts to be successful, it'll get bought by one of the FAANGs.
Bolychevsky gets to the heart of the problem in a different way, via the idea of "consent":
If Solid uptake is big enough to attract app developers, what stops the same data exploitation happening, albeit now with an extra step where the user is asked for ‘permission’ to access and use their data in exchange for a free or better service? Consent is only meaningful if there are genuine alternatives and as an industry we have yet to tackle this problem (see how Facebook, Apple, Google, Amazon ask for ‘consent’). What’s really going on when users are asked to agree to the terms and conditions of software on a phone they’ve already bought that won’t work otherwise? Or agreeing to Facebook’s data selling if there’s no other way for users to invite friends to events, message them or see their photos if those friends are Facebook users? I wouldn’t call this consent.
Under the new directive, every time a European's personal data is captured or shared, they have to give meaningful consent, after being informed about the purpose of the use with enough clarity that they can predict what will happen to it.
Building pods that conform to the GDPR will involve a very complex protocol.
Bolychevsky concludes with a call for public funding to build a decentralized Web infrastructure. Alas, in the absence of meaningful anti-trust enforcement this would simply subsidize the FAANGs. Overall, her post is well worth reading.
Tip of the hat to Herbert Van de Sompel for pointing me to Bolychevsky's post.
I was honored to be asked by Europeana, the indispensable, unified digital collection of Europe’s cultural heritage institutions, to write a piece celebrating the 10th anniversary of their launch. My opening words:
‘The future is already here – it’s just not very evenly distributed,’ science fiction writer William Gibson famously declared. But this is even more true about the past.
The world we live in, the very shape of our present, is the profound result of our history and culture in all of its variety, the good as well as the bad. Yet very few of us have had access to the full array of human expression across time and space.
Cultural artifacts that are the incarnations of this past, the repository of our feelings and ideas, have thankfully been preserved in individual museums, libraries, and archives. But they are indeed unevenly distributed, out of the reach of most of humanity.
Europeana changed all of this. It brought thousands of collections together and provided them freely to all. This potent original idea, made real, became an inspiration to all of us, and helped to launch similar initiatives around the world, such as the Digital Public Library of America.
Despite the ongoing discussions about the benefits of Open Access, Open Data, Open Government Information, and Open Content more generally, the extent to which such content is impacting the spread of knowledge and libraries’ role therein are yet to be understood.
It is a complex topic, as Rachel Frick recently wrote in her blog post “An open discussion”. Some predict that in an open access world, where most articles and monographs are available online and for free, the library’s role in discovery and fulfillment will further erode. Others argue that a library’s quality control role will become only more important with open access, because this content is more susceptible to fraud. The impact on library budgets is also a central topic, with growing concern that if content is freely available, libraries will lose acquisition budget. David Lewis’ article on “The 2.5% Commitment” proposes that all libraries devote at least 2.5% of their budgets to support a common infrastructure for open content. What this initiative has made clear is that no-one really knows how much libraries are actually investing in OA and open content. If libraries are not able to measure their investments and efforts in open content, how can they make planning decisions moving forward? If the benefits for the library and their users are unclear, how can libraries measure how successful they are in realizing those benefits? With all these challenges still ahead, it is timely to start the conversation about impact and use of OA and open content.
This year, OCLC Global Council expressed interest in exploring this theme. To help fuel a meaningful discussion around this very vast topic, a survey has been developed that focuses on the following questions: What are libraries’ ambitions and realities with open content? How invested are they in open content? How can OCLC assist to leverage their efforts?
This is an open invitation to all libraries of any type and any region in the world. The survey aims to bring more voices to the table and to share a baseline understanding of libraries’ commitment to open content and their ambitions in this field. The scope of the survey is intentionally broad to gather as many different perspectives as possible and to be inclusive.
If access to open content is important to your library, please consider participating!
VRA affiliate Chris Day is the Digital Services Librarian at the Flaxman Library of the School of the Art Institute of Chicago.
He started at SAIC in 2008, managing the library’s first digital collection, the Joan Flasch Artists’ Books Collection. In 2011 he oversaw the deaccession of the 35mm slide library and the start of a local repository of digital art images for teaching, which now features over 200,000 images. The past year saw the migration of 13 digital collections from ContentDM to Islandora, which now features 4 new collections and over 26,400 digital objects.
Learning to Live DLF
I’ve been a librarian for over twelve years and am still developing my conference skills. Conferences are valuable events, but can be difficult when you struggle to find the balance between learning opportunities and self-care. Last year, at my first DLF Forum, I found a community and atmosphere that made it easier to find peace in the chaos. This sense of ease drove my desire to make the most of my time there and bring what I learned back to my job. As 2017 turned into 2018, I looked back on my time in Pittsburgh and saw that I had used just a small portion of what I’d been exposed to. I had a notebook full of ideas, but hadn’t given myself the chance to follow through. My personal quest for DLF Forum 2018, declared just an hour into the start of Learn@DLF, was to absorb, process, and act on what I experienced. I want to learn to live DLF.
Forum 2018 started the same way Forum 2017 finished for me, with the excellent Metadata Analysis Workshop developed by the DLF Assessment and Metadata Working groups. I wasn’t repeating the workshop because I failed to learn, I simply wanted another opportunity to act on what I had learned. One digital collection migration and clean-up wiser, I returned refreshed and ready to absorb everything. OpenRefine, regular expressions, GREL, application profiles; I wanted all that wonderful techy metadata stuff. The rush of knowledge felt real good!
I made a vow, that when I got back to Chicago I would turn my pages of notes into real tasks and real projects. I would try every tool, visit every collection, read every article, and archive my favorite presentations for future referral. I lost a few days to jet lag and con crud, but I have taken positive steps towards my goals. I started with a Google Doc full of raw notes, which I’ll now organize into direct actions (digitizing historic course catalogs? create a faculty name authority list like Jeremy Floyd at UNV Reno), research opportunities (get Hope Olson’s “The power to name”), email contacts to follow-up (tell me more about your Syllabrowser project, University of Utah), and random thoughts (did I leave the download link on when I left?).
Three weeks have passed since our time in Las Vegas. How am I doing so far? I’ve had a very good start, thank you for asking. I’ve reached out to colleagues and had helpful, substantive follow-up conversations. I used new tools and skills to normalize accession numbers in our Artists’ Book collection (this allowed us to better search, filter, and analyze those numbers; I squealed like a child on their birthday when I got that to work). This is just the beginning; I have so many new projects to try that I am freshly enthused for the coming year.
And what of the next year? Will I continue to operationalize my experiences from the 2018 Forum, or will entropy take control? Between now and next fall I intend to find out. I will track and analyze my successes and my failures, and I hope to report back in Tampa. If this proves a success, I can add another goal for the 2019 DLF Forum: pacing my physical and mental energy and making it through the whole conference without exhaustion. I’ll be seeing you in the Meditation Room!
If you’d like to get involved with the scholarship committee for the 2019 Forum (October 13-16, 2019 in Tampa, FL), sign up to join the Planning Committee now! More information about 2019 fellowships will be posted in late spring.
In yesterday’s post, I linked to a law review article that cited a court decision as “755 F.3d 496”. That’s a kind of citation that pops up a lot both in law review articles and in court decisions and briefings themselves. It’s a reference to a decision as published in the third series of the Federal Reporter, volume 755, page 496. The Federal Reporter isn’t a government publication, though; it’s a serial issued by the private firm of West Publishing, now a part of Thomson Reuters. And it’s copyrighted.
In fact, it’s pretty emphatically copyrighted, like many other law reporters from private publishers. If you look at the inventory I’ve compiled of early 20th-century serials with renewed copyrights, you’ll see that a lot of the serials with renewals all the way back to 1923 are law reporters or related legal publications. West Publishing was particularly thorough with its renewals; for 1923, it renewed both initial issues of the Federal Reporter published shortly after the court decisions they cover, and “permanent edition” volumes, published a little bit later, that had updated and corrected copies of the ruling texts and accompanying notes.
It’s not surprising that West, and other legal publishers reporting on other jurisdictions, would be so assertive about copyright: If you publish something that’s a “must-have” for law firms, libraries, and other serious legal professionals, you can charge a lot of money for it. Pricing for Westlaw, Thomson Reuters’ online database that includes the Federal Reporter, is hard to find, but third parties note that it can easily run well over $100 per month per user. If I wanted instead to buy print copies of the Federal Reporter, the purchase options the publisher offers me include buying individual volumes at $890 apiece, or buying the whole series as far back as 1993 for the low, low price of just over $30,000. And this pricing reflects a time when readers have more choices than they had in the past for getting court decisions. Nowadays, you can enter a recent federal case citation or a set of parties into Google or another search engine and have a pretty good chance of finding copies of the court’s rulings and decisions online. In 1923, there was no Google, and most people had few or no options for obtaining authoritative copies of many court rulings or other laws other than through a private publisher’s law reporter.
This level of monopolization and pricing for law reporters may seem odd, given their contents. By long-standing custom, the law is public domain in the United States. Section 105 of US copyright law explicitly states “Copyright protection under this title is not available for any work of the United States Government”, and that includes the official rulings and opinions of the federal courts covered in the Federal Reporter. So what can the publishers claim copyright on?
Historically, the answer has been “whatever they can get away with”. That was a lot, even as late as the early days of the Web, when there were intense struggles to make the law freely available online. Gary Wolf’s article “Who Owns the Law?”, published in Wired in 1994, gives us a detailed, dark picture of the situation then. Online access to much of the law was controlled by a duopoly of West and Lexis-Nexis (the latter now a part of RELX). West claimed (and still claims) copyright to the summary headnotes and other annotations it added to cases. That seems fairly reasonable in itself, as it can represent substantial original creative work. But since those notes are interleaved with the issuances of the court in the reporters, it can be difficult for a lay person to determine what’s public domain and what’s not; it also means that people can’t simply mass-digitize the reporters and put the page images online, as has been done for millions of completely public-domain books. Hence, HathiTrust’s openly accessible run of the Federal Reporter as of today stops at volume 281, the last permanent-edition volume published in 1922.
But West went further than just claiming copyright on its annotations. It also claimed copyright on its editing of the decision texts (including incorporating corrections as they were issued), and even on the page numbering system it used for the decisions. As Wolf’s article notes, in the 1980s, the 8th Circuit supported West’s claims over its page numbering, which effectively gave it a monopoly as long as the court system used citations based on that page numbering system. (West would later license the numbering system to Lexis-Nexis, but no one else had a license in the early 1990s.)
The situation would improve not long after Wolf’s article was published. That same year, Hyperlaw sued West over its claims to copyright over its page numbering and edited versions of court decisions. In 1998, the 2nd Circuit issued two rulings (158 F.3d 674 and 158 F.3d 693) that struck down West’s more expansive copyright claims. As quoted by the Association of Research Libraries, the court ruled that “all of West’s alterations to judicial opinions involve the addition and arrangement of facts, or the rearrangement of data already included in the opinions, and therefore any creativity in these elements of West’s case reports lies in West’s selection and arrangement of this information. In light of accepted legal conventions and other external constraining factors, West’s choices on selection and arrangement can reasonably be viewed as obvious, typical, and lacking even minimal creativity.” On that basis, the Second Circuit denied copyright protection to West’s edited case reports and to its page numbering system. (A detailed contemporary analysis of the case, published after the district court ruled but before the appeals court did, can be found in Peter Thottam’s 1998 article, “Matthew Bender v. West Publishing”, published in the Berkeley Technology Law Journal.)
Since then, it’s become much easier to find court opinions online, including those derived from the Federal Reporter and other law reporters, complete with page number citations, but stripped of any external annotations. And even those annotations may not always be copyrighted. A recent decision by the 11th Circuit notes that when annotations are “an inextricable part of the official codification” of the law, they might not qualify for copyright protection. In Georgia, an annotated code was declared to be the official version of state law, and Georgia asserted a copyright over it based on the annotations. (Ironically, those annotations were made by Matthew Bender and Co., the same firm that challenged West’s expansive copyright claims to its law reporters.) After Georgia sued Public.resource.org, an organization that digitized the complete annotated code, the appeals court ruled that “no valid copyright interest can be asserted in any part” of the annotated code. (There’s a nice summary of this case and some related ongoing legal actions at the website of the Electronic Frontier Foundation, which is representing Public.resource.org.)
Could this ruling serve as a precedent to open up other annotated publications, such as, for instance, annotated court decisions, if they were effectively the official records of those decisions? I can’t say for sure, or whether that would apply to older volumes of the Federal Reporter, since I’m not a lawyer and don’t have deep knowledge of how that publication was treated back in the 1920s. But I find it an interesting question to consider. In the meantime, we’ll have some more Federal Reporter volumes entering the public domain the long way, via expiration of their copyright terms, 27 days from now.
I say above that “there was no Google” in 1923. Strictly speaking, there was a Google then: not the search and advertising company we know today, but another Google that was the brand of a media franchise that continues to this day. It’s partly in the public domain now, and more of it will be in the public domain next month. I’ll talk more about it, and the challenges in determining which parts are in the public domain, in tomorrow’s calendar entry.
Yesterday the Digital Preservation Network (DPN) announced plans to cease operations. The individual nodes that collectively provided preservation services to DPN seek to reassure the DPN membership as well as the larger academic and digital preservation communities that we remain confident about the future of digital preservation.
We continue to support long-term distributed digital preservation. The rich array of collaborative, community-driven digital preservation services in higher education offers reliable benefits to the academic community, despite DPN’s departure. Several of the services represented by DPN nodes provided robust technical infrastructure to DPN depositors from strong organizational bases that serve other constituents as well. That strength is unshaken by this turn of events. In no way does DPN’s end obviate the need for continued redundant, resilient, diverse preservation services working together.
Consistent with the values we affirmed last year when we all signed the “Digital Preservation Declaration of Shared Values,” we hold at the core of our digital preservation mission the belief that “we can accomplish these goals better together rather than separately.” We are united in our dedication to continue exploring future collaborative opportunities. Our resolve and our resilience in pursuit of our common goals remain strong.
In the immediate future we will work together with DPN to assist its depositors during DPN’s shut down process. While we continue to identify next steps, we will be moving ahead from a collaborative position of strength. We are available to work with you to support current and future preservation needs in whatever way we can. You can expect further communication from the node where you deposited content very soon.
In April-May this year we conducted the third “International Linked Data Survey for Implementers,” as we were curious what might have changed in the three years since the last survey. We wanted to learn details of new implementations that format metadata as linked data and/or make subsequent uses of it, and what might have changed in linked data implementations reported in previous surveys.
Eighty-one institutions responded to the 2018 survey, describing 104 linked data projects or services, compared to 71 institutions describing 112 linked data implementations in 2015. Of the 104 linked data implementations, only 42 had been described previously.
Among the highlights I’ve noticed and their implications:
This was the first time we received responses from service providers, which provide linked data services for their customers. This may lead to fewer individual institutions launching their own linked data projects. Among the 2018 survey respondents, 37% relied at least partially on a system vendor, corporation, or external consultants/developers to implement their linked data project or service, and several respondents were clients of providers who also responded to the survey.
40% of the linked data implementations in production that were described in the 2018 survey had been in production for more than four years. These more mature implementations can serve as exemplars for others.
42% of the “new” responses to the survey (not described in previous surveys) were outside the library domain. We see more linked data initiatives from research institutions and cultural heritage organizations, reflecting a more diverse range and usage of linked data.
More respondents reported that their linked data project or service was successful or “mostly” successful in 2018 than in 2015 (56% compared to 41%). The respondents whose linked data implementations had been in production for at least four years commented that their indicators of success were:
Usage, evidenced by substantial increases in usage and more contributors
Data re-use, shown by other applications making use of their linked data implementations and by more bulk downloads
Interoperability, providing access to other resources
User satisfaction, providing users a richer experience that is much more contextualized and inter-related. One noted their “happy users” were “probably unaware that the service is based on linked data.”
Influence, as successful implementations that gain attention have influenced other initiatives and moved the discussion on linked data forward in the library community
Professional development, as even absent metrics demonstrating linked data’s value to others, linked data projects still provide professional development for staff.
Among those publishing linked data, we observe substantial increases in the use of Schema.org and BibFrame, and decreased usage of SKOS and FOAF, in particular.
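To make the publishing side concrete, here is a minimal sketch of what a bibliographic record looks like when expressed with Schema.org vocabulary and serialized as JSON-LD, one common way survey respondents publish linked data. The record itself (title, author, identifier) is entirely made up for illustration.

```python
import json

# A minimal, hypothetical bibliographic record expressed with
# Schema.org vocabulary and serialized as JSON-LD.
record = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "An Example Artists' Book",                      # placeholder title
    "author": {"@type": "Person", "name": "A. N. Author"},   # placeholder author
    "datePublished": "1923",
    "identifier": "AB-1923-0001",                            # hypothetical accession number
}

print(json.dumps(record, indent=2))
```

Embedded in a web page, a snippet like this lets search engines and other consumers understand the record without knowing anything about library-specific formats, which is a large part of Schema.org’s appeal over vocabularies like SKOS or FOAF.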
Among the top ten linked data sources consumed, the biggest change was the surge in consuming Wikidata, more than four times that reported by respondents in 2015. The National Library of Finland observed, “Wikidata is becoming more and more significant for cultural heritage organizations, including our library.” We also saw big increases in consuming WorldCat.org and ISNI.
Most linked data implementations remain experimental or educational in nature. Few are integrated into daily workflows. Commented the Oslo Public Library, “As far as I can see, Oslo Public Library is still the first and only library with its production catalogue and original cataloguing workflows done directly with linked data.”
We have updated the Linked Data Survey section on our website which compiles summaries, articles, and presentations about the three surveys and a link to the Excel workbook with the survey responses. The responses to all three surveys (2014, 2015, and 2018), without the contact information which we promised we’d keep confidential, are publicly available in this Excel workbook so you can conduct your own analysis or focus on the institutions that most resemble yours.
My friend John Wharton died last Wednesday of cancer. He was intelligent, eccentric, opinionated on many subjects, and occasionally extraordinarily irritating. He was important in the history of computing as the instruction set architect (PDF) of the Intel 8051, Intel's highest-volume microprocessor, and probably the most-implemented instruction set of all time, and as the long-time chair of the Asilomar Microcomputer Workshop. I attended AMW first in 1987, and nearly all years since. I served on the committee with John from 2000 through 2016, when grandparent duties forced a hiatus.
On hearing of his death, I thought to update his Wikipedia page, but found none. I collected much information from fellow Asilomar attendees and drafted a page for him, which has since been published. Most of the best stories about John have no chance of satisfying Wikipedia's strict standards for sourcing and relevance, so I have collected some below the fold for posterity.
John was a founding member of the editorial board of Microprocessor Report, writing for it frequently. His opinion columns were often contrarian, and ended up being called "Oblique Perspective". A collection from August 1988 to August 1995 includes among others:
Architecture vs. Implementation in RISC Wars (8/88)
Unanswered Questions on the i860 (5/89)
The "Truth" About Benchmarks (5/18/90)
Does Microcomputer R&D Really Pay Off? (9/19/90)
Have The Marketing Gurus Gone Too Far? (5/15/91)
A Software Emulation Primer (10/2/91)
The Irrelevance Of Being Earnest (4/15/92)
Why RISC Is Doomed (8/19/92)
Brave New Worlds (12/9/92)
Breaking Moore's Law (5/8/95)
Is Intel Sandbagging on Speed? (8/21/95)
They're all worth reading; I've just hit the highlights.
One of the "Oblique Perspectives" deserves special mention. In How To Win Design Contests (10/17/90) John propounds a set of rules for winning design contests and, in a section entitled A Case-Study In Goal-Oriented Design, shows how he used them to win a red Porsche 944! The rules are:
Go ahead and enter. No-one else will, and you can't win otherwise.
Another department staged a single-board-computer design contest, with development systems and in-circuit emulators worth thousands of dollars as prizes. Their most creative entry proposed to install computers in public lavatories to monitor paper towel and toilet paper consumption and alert the janitor if a crisis was imminent. No one could tell if the proposal was a joke, but it won top honors anyway.
Consider what the sponsor really wants.
So the best way to win is to reverse engineer the sponsor's intentions: figure out what characteristics he'd most like to publicize and then put together an application (possibly contrived) with each of these characteristics.
Keep it really, really, really simple.
The judges will have a number of entries to evaluate, and those they understand will have the inside track. ... It's far better to address a real-world problem the public already understands so they can begin to grasp your solution and see the widget's advantages immediately.
Devote time to your entry commensurate with the value of the prize.
The entry form may request a five-line summary of your proposal and its advantages, but this isn't a quick-pick lottery. If the contest is worth entering, it's worth entering right, and neatness counts.
John starts by applying rule 4:
The Porsche was worth about $25,000. Based on the turn-out of previous contests, I guessed Seeq would get at best four other serious entries, which put my odds of winning at one in five. That justified an investment of up to $5,000 worth of my time - about two weeks - enough to go thoroughly overboard on my entry.
And goes on to describe designing and prototyping a "smart lock". He concludes:
It seems my expectations were overly optimistic on two fronts. I'd underestimated the number of competing designs by an order of magnitude, and while my entry's basic concepts and gimmicks were all developed in one evening, it took a week longer than I'd planned to debug the breadboard and document the design.
Even so, I, too, was happy with the results. Some months after the contest ended - on my birthday, by happy coincidence - I got the call. My lock had been judged best-of-show; I'd won the car. The award ceremony - with full media coverage - was one week later.
So, if you notice a shiny, red, no-longer-new Porsche cruising the streets of Silicon Valley, sporting vanity plates "EEPRIZ," you'll be seeing one of the spoils of design contests. Fame and fortune can be yours, too, if you simply apply a little creative effort.
The Asilomar Microcomputer Workshop
AMW started in 1975 with sponsorship from IEEE. David Laws writes in his brief history of the workshop about the first one John attended and spoke at:
The last IEEE-sponsored workshop in 1980 featured a rich program of Silicon Valley nobility. Jim Clark of Stanford University spoke on the geometry engine that kick-started Silicon Graphics. RISC pioneer Dave Patterson of UC Berkeley covered “Single Chip Computers of the Future,” a topic that evolved over the subsequent year and led to his 1981 “The RISC” talk. In 2018 Patterson shared the Turing Award with John Hennessy of Stanford for their work on RISC architecture. Gary Kildall, who both influenced and was influenced by discussions at the workshop, described his PL/I compiler. Designer of the first planar IC and the first MOS IC, Bob Norman talked about applications of VLSI. Carver Mead capped this off with a keynote talk on his design methodology.
John started chairing sessions in 1983, and became Chair of the workshop in 1985, a position he continued to hold through 1997. He was Program Chair from 1999 through 2017. The format, the eclectic content, and the longevity of AMW are all testament to John's work over three decades.
The title of John's 1980 talk was "Microprocessor-controlled carburetion", which presumably had something to do with ...
Engine Control Computers
In Found Technology (4/17/91) John recounted having a problem with his Toyota in Tehachapi, CA and attempting to impress the Master Mechanic:
"In fact, I developed Ford's very first engine computer, back in the '70s." That should impress him, I thought.
He pondered briefly, then asked: "EEC-3 or -4?"
Damn! This guy was good. "I thought it was EEC-1," I began, trying to remember the "electronic engine control" designators. "It was the first time a computer ... "
"Nah, EEC-1 and -2 used discrete parts," he interrupted. "EEC-3 was the first with a microprocessor."
"That was it, then. It had an off-the-shelf 8048."
"You mean you designed EEC-3?" the Master Mechanic asked incredulously. "Hey, George!" he shouted to the guy working under the hood. "When you're done fixing this guy's car, push it out back and torch it! He designed EEC-3!"
So much for impressing the Mechanic. "Huh?" I shot back defensively. "Did EEC-3 have a problem?"
"Reliability, mostly," he replied. "The O2-sensor brackets could break, and the connectors corroded."
I beat a hasty retreat. "Those sound like hardware problems," I said. "All I did was the software."
Stan Mazor recounts the early history of engine control computers:
While an App Engineer at Intel, GM engaged me to help them design a car that used an on board computer to do lots of stuff, even measure the tire pressure of a moving vehicle. Little did I understand at the time that auto companies HATED electronic company's components, and their motive was to prove that computer chips COULD NOT be used. I only learned that late in their project work. When VW announced and demonstrated an on board diagnostic computer, the auto industry (USA) was hugely embarrassed and tried to catch up with VW.
Due to pollution and government mandates, Ford implemented a catalytic converter with a Zirconium Oxide sensor. Their 8048 computer measured the pollution, and servo'd the car's carburetor mixture of fuel and oxygen. (recall ancient cars had a manual choke). John and his associate were Intel app engineers on the project. As I best recall their program used pulse-width duty cycle modulation to control the 'solenoid' controlling fuel mix, (Hint: too lean, too rich, too lean, too lean, etc.)
Now comes the interesting story: Cranking the starter on a cold car injected huge transient voltage into the CPU and scrambled the program counter, so the app could start at any instruction. No, that's not quite the issue.
The 8048 has 1- and 2-byte instructions, so the program counter could end up at any byte, yes, the middle of an instruction, and interpret that byte as an op code, even if it was a numeric constant or half of a jump address!!!
No that's only half the story: It turns out that one of the 8048 instructions is irreversible under program control and the only way out (of mode 2), was to hit the reset line!!!. So the poor guys (John) had to re-write their code to insure that no single byte of object code could be the same as that magical (and unwanted) instruction operation code.
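The defensive rewrite Mazor describes boils down to one check: scan every byte of the assembled object code, not just the opcode positions, for the dangerous value, since a scrambled program counter can land anywhere. Here is a toy sketch in Python; the `0x75` value is a placeholder, not the actual 8048 opcode in question.

```python
# Placeholder for the irreversible opcode; the real value isn't given
# in the story above.
FORBIDDEN_OPCODE = 0x75

def find_forbidden_bytes(object_code):
    """Return the offsets of every byte equal to the forbidden opcode.

    Because the 8048 mixes 1- and 2-byte instructions, a scrambled
    program counter can land on ANY byte, so operands and constants
    must be checked too, not just opcode positions."""
    return [i for i, b in enumerate(object_code) if b == FORBIDDEN_OPCODE]

rom = bytes([0x23, 0x75, 0x04, 0xA4, 0x75])  # toy "firmware" image
print(find_forbidden_bytes(rom))  # -> [1, 4]: bytes needing a code rework
```

Every flagged offset means rewriting the surrounding code (picking a different register, constant, or jump target) until the image is clean, which is presumably what cost John's team their extra week.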
This paper describes the implementation of model vehicles with neural network control systems based on Valentino Braitenberg's thought experiment. The vehicles provide a platform for experimentation with neural network functions in a novel format. Their operation is facilitated by the use of a commercially available neural network simulator to define the network for downloading to the vehicles.
The block diagram shows a vehicle with right and left sensor arrays feeding an 8051 running neural network code from an EPROM. The network definition is downloaded into RAM via a serial link. The 8051 drives right and left motors. Their development environment was MacBrain:
The network model implemented in the current firmware is similar to that used by MacBrain, a neural network simulator for Macintosh computers, ... MacBrain allows the user to create and edit networks on screen, load and save network definitions to disk, and run simulations of networks while observing the changing activation of the various network units.
In "Further Work" they wrote:
One area of potential interest not addressed in initial version of the vehicle's network simulation model was the dynamic alteration of network parameters during operation based on sensory input and network activation. While any of the network's parameters could be changed in this manner, the most likely candidate would be the link weights. The alteration of link weights based on some criteria is a widely used model for "learning" in neural network experimentation and, indeed, is thought to be part of the mechanism for learning and memory in real biological nervous systems.
In 1996, when the Letterman show came to San Francisco, Mr. Wharton made a calculated effort to get noticed. He figured out which seats the camera would be most likely to focus on and made sure that he was seated there. He made himself conspicuous by undoing his ponytail and donning a tie-dyed shirt. He looked “just like the sort of San Francisco hippie” the show's producers would expect to see, he said.
It worked. Mr. Letterman himself strode into the audience, asked Mr. Wharton his name, then asked if he would agree to take a shower in the host's dressing room -- an ongoing gag of Mr. Letterman's. Mr. Wharton happily obliged. The cameras followed Mr. Wharton, from his torso up, as he disrobed, stepped into the shower and lathered up. On his way back into the audience, clad in a white bathrobe, he managed to snatch a copy of the script.
Mr. Wharton derives much of his satisfaction from the thrill of the puzzle itself, like the inner workings of the Furby. Apart from Dave Hampton, the inventor and engineer who created Furby and whom Mr. Wharton reveres, Mr. Wharton may understand Furby's innards better than anyone else. Although he and Mr. Hampton are acquainted, Mr. Wharton would never think to ask Mr. Hampton for a road map.
“That would be cheating,” he said. “It would be like asking the guy who wrote the crossword puzzle for the answers.”
Dave Hampton was a frequent attendee at AMW. One year he bought a large box full of Furbies as gifts for the attendees. At John's instigation, a few of us turned them all on, replaced them carefully in the box, carried the box carefully into the room where the meeting was underway, and shook the box. Whereupon they all woke up and started to talk to their friends. The sound of about a hundred Furbies in full chatter has to be heard to be believed.
The chapter included the story of how Tim Paterson's QDOS (Quick and Dirty Operating System), reverse-engineered from CP/M, became Microsoft's 86-DOS, which led Paterson to file:
a defamation lawsuit, Paterson v Little, Brown & Co, against Sir Harry in Seattle, claiming the book's assertions had caused him "great pain and mental anguish". The court heard the detailed API evidence, and rejected Paterson's suit in 2007. US federal Judge Thomas Zilly observed that Evans' description of Paterson's software as a "rip-off" was negative, but not necessarily defamatory, and said the technical evidence justified Sir Harry's characterisation of QDOS as a "rip-off".
Much of the technical evidence came from John, who was tasked at Intel with evaluating 86-DOS and showed that it was only a partial clone of CP/M, as described in the image of a letter to Microprocessor Report in 1994. The image comes from John's obituary by Andrew Orlowski in The Register.
Mark Olson provides this annotated scan of page 9-102 of the 1983 edition of Intel's Microcontroller Handbook in which it was included (it is page 46 of the PDF linked from the reference).
Update 3: Memorial Web Site
Michael Takamoto has set up a memorial Web site for John, from which I obtained the pictures of John's 944, alas before the vanity license plate, and of the newspaper article about it. The site includes links to two other appearances John made in the New York Times:
As I noted in yesterday’s post, under US copyright law we sometimes get early works by long-lived authors in our public domain earlier than most other countries do. But we also often have to wait longer for authors’ late works. Today I consider the case of Sir Arthur Conan Doyle, and his most famous creation, the ingenious detective Sherlock Holmes.
It didn’t stick, though. Public demand for more Holmes stories led the author to publish another Holmes novel in 1901 (but set before his plunge over the Reichenbach Falls), The Hound of the Baskervilles. By 1903, Conan Doyle gave in and brought Holmes back to life in “The Adventure of the Empty House”, one of the more famous early examples of retroactive continuity now common in comics and other long-running entertainment series. Conan Doyle would continue to publish Holmes stories, at a somewhat slower pace than before, through 1927, the same year the last ones were collected in The Case-Book of Sherlock Holmes.
Conan Doyle died in 1930, so his published works entered the public domain in the UK and many other countries in 1981, after 50 years had passed from his death. In 2001, they entered the public domain under “life + 70 years” regimes (which the UK and many other countries had adopted in the interval). But while early Holmes stories have been in the public domain in the United States for a long time, his last stories are still under copyright here. Their copyright terms start with their first publication, though, and the last stories were published in magazines before they were published in the Case-Book. In particular, “The Adventure of the Creeping Man” was published in the March 1923 issues of both the Strand Magazine and Hearst’s International. The author’s children renewed the copyright to the story in 1950, and copyrights were also routinely renewed for Hearst’s issues. US copyrights for the story, and for the other content in the 1923 issues of Hearst’s, will end 28 days from today.
The character of Sherlock Holmes was well-established in his early stories, and many Holmes fans consider the later Case-Book stories relatively minor. That hasn’t stopped the Conan Doyle estate from claiming proprietary rights over Holmes stories generally, though, based on its US copyrights to the last few stories. When the estate tried to block an anthology of new Holmes stories that Leslie Klinger planned to publish, unless he got permission and paid a fee, Klinger sued for a declaratory judgment that he did not need any such permission. In 2014, the 7th Circuit agreed with him, noting that Klinger was not proposing to reuse elements introduced in the late stories, and that “the alterations [in those stories] do not revive the expired copyrights on the original characters”. In a later ruling, the same court awarded Klinger attorney’s fees from the estate, calling their practices “a form of extortion” and saying “It’s time the estate, in its own self-interest, changed its business model”.
A more detailed legal analysis of the case can be found in Jessica L. Malekos Smith’s article “Sherlock Holmes & the Case of the Contested Copyright” (2016). That article refers to the initial ruling I’ve linked above under the citation “755 F.3d 496”. That notation, cryptic as it may appear to laypeople, refers to a commonly cited legal source that has had its own copyright controversies, and that also includes material that will soon be added to the public domain. I’ll talk about it more in tomorrow’s calendar entry.
DuraSpace and the University of Toronto Libraries in collaboration with Scholars Portal and COPPUL (Council of Prairie and Pacific University Libraries) are pleased to announce a new joint project “DuraCloud Canada: Linking Data Repositories to Preservation Storage” funded by CANARIE, a vital component of Canada’s digital infrastructure supporting research, education and innovation.
The purpose of the proposed project is to connect Canadian research data repositories to preservation storage services through a common deposit layer based on DuraCloud, open source software from DuraSpace.
DuraCloud is a package of software components — some server based and others web and desktop based — that provides brokering services for cloud-based storage as well as a set of deposit tools and APIs that standardize the way users interact with cloud storage providers.
These software components are made available under open source licenses by DuraSpace and have been used to set up national services such as DuraCloud in the US and DuraCloud Europe. Using the same model and a similar deployment approach, the project will create a service called DuraCloud Canada with the goal of connecting preservation storage services to data repositories and bridging the current gap that exists between both.
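The brokering idea described above can be sketched in a few lines: clients deposit content through a single interface, and the broker fans it out to whichever storage providers sit behind it. This is purely an illustration of the pattern under assumed names; none of the classes or methods below are DuraCloud's actual API.

```python
# Illustration of the storage-brokering pattern: one deposit layer in
# front of multiple storage providers. All names here are hypothetical,
# not DuraCloud's real API.
class InMemoryProvider:
    """Stands in for a cloud storage backend."""
    def __init__(self, name):
        self.name = name
        self.objects = {}

    def store(self, key, data):
        self.objects[key] = data


class StorageBroker:
    """Single deposit interface that replicates content to every provider."""
    def __init__(self, providers):
        self.providers = providers

    def deposit(self, key, data):
        for provider in self.providers:
            provider.store(key, data)


primary = InMemoryProvider("cloud-a")
replica = InMemoryProvider("cloud-b")
broker = StorageBroker([primary, replica])

broker.deposit("dataset-001", b"research data bytes")
print(primary.objects["dataset-001"] == replica.objects["dataset-001"])  # True
```

The point of the indirection is that repositories integrate once, against the deposit layer, rather than once per storage provider.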
The proposed project will make DuraCloud available for research data preservation in Canada by contributing to the DuraCloud open source codebase in order to facilitate the integration of existing data repositories. In doing so, it will fulfill multiple goals relating to the preservation of research data in Canada. It will provide an interoperability layer between different preservation storage providers, expose a relatively easy-to-use API that data repositories may use to integrate with preservation storage options, and expose the set of pre-existing integrations that are already part of the DuraCloud system.
Work on this project will begin in November of 2018 and continue into 2020. Initial work will make use of DuraCloud within the Amazon Web Services (AWS) cloud environment by adding storage integrations to existing Canadian data repositories. Once these connectors are in place, development will continue with the goal of allowing the DuraCloud software to be run within the University of Toronto data center rather than in AWS.
CANARIE strengthens Canadian leadership in science and technology by delivering digital infrastructure that supports world-class research and innovation.
CANARIE and its twelve provincial and territorial partners form Canada’s National Research and Education Network. This ultra-high-speed network connects Canada’s researchers, educators and innovators to each other and to global data, technology, and colleagues.
Beyond the network, CANARIE funds and promotes reusable research software tools and national research data management initiatives to accelerate discovery, provides identity management services to the academic community, and offers advanced networking and cloud resources to boost commercialization in Canada’s technology sector.
Established in 1993, CANARIE is a non-profit corporation, with the majority of its funding provided by the Government of Canada.
For more information, please visit: www.canarie.ca
DuraSpace, an independent 501(c)(3) not-for-profit organization, provides leadership and innovation for open technologies that promote durable, persistent access to digital data. We collaborate with academic, scientific, cultural, technology, and research communities by supporting projects and advancing services to help ensure that current and future generations have access to our collective digital heritage. Our vision is expressed in our organizational byline, “Working together to provide enduring access to the world’s digital heritage.” DuraSpace is the organizational home of the DuraCloud open source software.
About The University of Toronto Libraries
The University of Toronto Libraries system is the largest academic library in Canada and is ranked sixth among peer institutions in North America. The system consists of 42 libraries located on three university campuses: St. George, Mississauga, and Scarborough. This array of college libraries, special collections, and specialized libraries and information centres supports the teaching and research requirements of over 280 graduate programs, more than 60 professional programs, and about 700 undergraduate degree programs. In addition to more than 15 million volumes in 341 languages, the library system currently provides access to millions of electronic resources in various forms and over 31,000 linear metres of archival material. More than 150,000 new print volumes are acquired each year. The Libraries’ data centre houses more than 500 servers with a storage capacity of 1.5 petabytes.
Scholars Portal was formed in 2002 as a service of the Ontario Council of University Libraries (OCUL) with the University of Toronto as service provider. The Scholars Portal technological infrastructure preserves and provides access to information resources collected and shared by Ontario’s 21 university libraries. Through the Scholars Portal online services, Ontario’s university students, faculty and researchers have access to an extensive and varied collection of e-journals, e-books, social science data sets, geo reference data and geospatial sets. Scholars Portal continues to respond to the research needs of Ontario universities through the creation of innovative information services and by working to ensure access to and preservation of this wealth of information.
Last year's series of posts and PNC keynote entitled The Amnesiac Civilization were about the threats to our cultural heritage from inadequate funding of Web archives, and the important content that, as a result, is never preserved. But content that Web archives do collect and preserve is also under a threat that can be described as selective amnesia. David Bixenspan's When the Internet Archive Forgets makes the important, but often overlooked, point that the Internet Archive isn't an elephant:
On the internet, there are certain institutions we have come to rely on daily to keep truth from becoming nebulous or elastic. Not necessarily in the way that something stupid like Verrit aspired to, but at least in confirming that you aren’t losing your mind, that an old post or article you remember reading did, in fact, actually exist. It can be as fleeting as using Google Cache to grab a quickly deleted tweet, but it can also be as involved as doing a deep dive of a now-dead site’s archive via the Wayback Machine. But what happens when an archive becomes less reliable, and arguably has legitimate reasons to bow to pressure and remove controversial archived material? ... Over the last few years, there has been a change in how the Wayback Machine is viewed, one inspired by the general political mood. What had long been a useful tool when you came across broken links online is now, more than ever before, seen as an arbiter of the truth and a bulwark against erasing history.
Below the fold, some commentary on the vulnerability of Web history to censorship. Bixenspan discusses, with examples, the two main techniques for censoring the Wayback Machine:
That archive sites are trusted to show the digital trail and origin of content is not just a must-use tool for journalists, but effective for just about anyone trying to track down vanishing web pages. With that in mind, that the Internet Archive doesn’t really fight takedown requests becomes a problem. That’s not the only recourse: When a site admin elects to block the Wayback crawler using a robots.txt file, the crawling doesn’t just stop. Instead, the Wayback Machine’s entire history of a given site is removed from public view.
The ability of anyone claiming to own the copyright on some content to issue a takedown notice under the DMCA in the US (and corresponding legislation elsewhere) is a problem for the Web in general, not just for archives. It is a problem that the copyright industries are constantly pushing to make worse. In almost no case is there a penalty for false claims of copyright ownership, which are frequently made by automated systems prone to false positives. The onus is on the recipient of the takedown to show that the claim is false. Given that in most cases copyright ownership is never registered, and that even when it is the registration may be fraudulent or out of date, this poses a nearly impossible barrier to contesting claims:
if someone were to sue over non-compliance with a DMCA takedown request, even with a ready-made, valid defense in the Archive’s pocket, copyright litigation is still incredibly expensive. It doesn’t matter that the use is not really a violation by any metric. If a rightsholder makes the effort, you still have to defend the lawsuit.
The Internet Archive's policy with respect to takedowns was based on a 2001 meeting which resulted in the Oakland Archive Policy. It is being reviewed, at least partly because the Internet Archive's exposure to possible litigation is now so much greater.
The fundamental problem here is that, lacking both a registry of copyright ownership, and any effective penalty for false claims of ownership, archives have to accept all but the most blatantly false claims, making it all too easy for their contents to be censored.
Mark Graham of the Internet Archive has described a change in the Wayback Machine's robots.txt policy:
A few months ago we stopped referring to robots.txt files on U.S. government and military web sites for both crawling and displaying web pages (though we respond to removal requests sent to firstname.lastname@example.org). As we have moved towards broader access it has not caused problems, which we take as a good sign. We are now looking to do this more broadly.
We see the future of web archiving relying less on robots.txt file declarations geared toward search engines, and more on representing the web as it really was, and is, from a user’s perspective.
Robots.txt files affect two stages of Web archiving. As regards collection, archival crawlers should clearly respect entries specifically excluding them. But, as Graham points out, most robots.txt files are designed for Search Engine Optimization, and don't clearly express any preference about archiving.
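The distinction matters because robots.txt rules are scoped per user-agent: a site can welcome search engines while excluding an archival crawler by name. A short sketch using Python's standard-library parser; the rules shown are hypothetical, and "ia_archiver" is the user-agent historically associated with the Internet Archive's crawler:

```python
# Demonstrates how robots.txt directives target specific user-agents:
# the archival crawler is excluded entirely, while other crawlers are
# only barred from one path. The rules themselves are made up.
from urllib.robotparser import RobotFileParser

rules = """
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# The archival crawler is blocked from everything...
print(parser.can_fetch("ia_archiver", "https://example.com/article.html"))  # False
# ...while an ordinary search crawler is only blocked from /private/.
print(parser.can_fetch("Googlebot", "https://example.com/article.html"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x.html"))  # False
```

A robots.txt with only SEO-oriented `User-agent: *` rules, as most are, simply never states a preference about archiving one way or the other.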
A particular concern regarding dissemination is the policy of automatically preventing access to past crawls of a website that took place while its robots.txt permitted them, because its current robots.txt forbids them, perhaps because the domain name has changed hands, with the new owner having no ownership of the past content. This seems wrong. Web site owners wishing to exclude past permitted crawls should have to request exclusion, and show ownership of the past content they are asking to have redacted.
Clearly, one way to deal with the selective amnesia problem is LOCKSS, Lots Of Copies Keep Stuff Safe, especially when they are in diverse jurisdictions and have different policies about takedowns and robots.txt. Nearly two decades ago, in the very first paper about the LOCKSS system, we wrote (my emphasis):
Librarians' technique for preserving access to material published on paper has been honed over the years since 415 AD, when much of the world's literature was lost in the destruction of the Library of Alexandria. Their method may be summarized as:
Acquire lots of copies. Scatter them around the world so that it is easy to find some of them and hard to find all of them. Lend or copy your copies when other librarians need them.
But, alas, this doesn't work as well for Web archives:
A comprehensive Web archive is so large (the Internet Archive has around 20PB of Web history) that maintaining two copies in California is a major expense. The effort to establish a third copy in Canada awaits funding. Three isn't "lots".
The copyright industries have assiduously tried to align other countries' legislation with the DMCA, so the scope for different policies is limited.
The rise of OCLC's WorldCat, which aggregates library catalogs in the interests of informing readers of accessible copies, has made finding all of the paper copies in libraries much easier. Similarly, the advent of Memento (RFC 7089), which aggregates Web archive catalogs in the interests of informing browsers of accessible copies, has made finding all the copies in Web archives much easier. In practice, Memento is much more of a double-edged sword than WorldCat because the paper world has no effective equivalent of DMCA takedowns.
So, especially with the help of Memento, it is easy for malefactors to target all accessible copies of content in the small number of archives that will have them.
Libraries have a special position under the DMCA, and the Internet Archive has always positioned itself as the Web equivalent of a public library in the paper world. This is, for example, the basis of its successful lending program. But:
"Under current copyright law, although there are special provisions that give certain rights to libraries, there is no definition of a library," explained Brandon Butler, the Director of Information Policy for the University of Virginia Library.
And in the one case in point:
"The court didn’t really care that this place called itself a library; it didn’t really shield them from any infringement allegations."
So the Internet Archive and other Web archives may not in practice qualify as libraries in the legal sense.
It would be hard for a copyright owner to argue that a national library, such as the Library of Congress or the British Library, wasn't a library. National libraries typically have special rights supporting their copyright deposit programs, but these rights have to be extended via new legislation to cover Web archives, and these collections are typically inaccessible outside the national library's campus. In any case, many national libraries are under sustained budget pressure, and many of their programs would be difficult without cooperation from copyright holders, so they are in a weak negotiating position.
Jay Colbert (@SpookyColbert) is the current Resident Librarian at the University of Utah’s J. Willard Marriott Library. Their responsibilities include subject liaison work, reference, and library instruction along with metadata and cataloging duties, primarily for the digital library and digital exhibits.
Their research investigates the relationship between language and power in libraries, particularly in subject access and descriptive cataloging. Jay received their MSLIS from the University of Illinois at Urbana-Champaign and their BA in English from the College of William & Mary. They are also active within the ALCTS CaMMS division and GLBT Round Table of ALA, and OLAC. In their free time, they watch too many movies, practice Buddhism and yoga, and hang out with their bearded dragon.
This year, I was selected as one of the DLF New Professionals Fellows. I was beyond thrilled to be accepted and to attend the DLF Forum. Although I focused my graduate coursework on metadata, it was geared more toward traditional bibliographic description than toward online resources. Since I work a lot with our digital library services in my current position, I thought it would be prudent to expand my knowledge of digital libraries, and I am glad the DLF Forum helped me do so.
I attended many types of sessions on varying topics, from system migrations to linked data to privacy. One theme running through most of the sessions that I wish to highlight is that of recognizing our power as librarians over our patrons and even our coworkers. Most sessions I attended touched on this theme, whether as their focus or by implication.
Metadata librarians of all types have a long history of trying to fix the way we describe items so that the language does not replicate oppressive structures. However, simply trying to correct our practices is not enough. As the presentation on indigenous visual culture and subject access in the Decolonizing Vocabularies session stressed, we must work alongside communities and/or consult them when developing our vocabularies and describing items. Through this, we shift our power to those communities.
We do not just hold power over communities. Indeed, as the session on Disrupting Exploitative Labor demonstrated, most of our entry-level positions put new-career librarians at a power disadvantage. This session pointed out how unpaid internships and temporary contracts exploit the labor of those in those positions and contribute to a workplace culture which devalues labor, especially labor done by early-career librarians. I often see job postings for digitization and description that are project based, meaning they end when the project ends. In digital libraries, as this session emphasized, we need to create permanent positions, question and oppose temporary positions, and create workplace atmospheres where new librarians, especially those who belong to marginalized groups, feel welcome.
I feel that the keynote speaker, Anasuya Sengupta, tied this theme of power in digital libraries together. In her keynote, she spoke of the unifying power of digital libraries and how they can empower marginalized peoples. However, they can also reinforce hegemonies; most of the internet (including Wikipedia) is in English.
As digital library workers, it is our job to help make the internet and the information in it accessible to all people. At the 2018 DLF Forum, many sessions stressed this aspect of our job, challenging us to be better. Although our field is a highly technical one, the sessions at the DLF Forum reminded us that we do this for people, and that our work therefore does not exist in a vacuum made of ones and zeros.
If you’d like to get involved with the scholarship committee for the 2019 Forum (October 13-16, 2019 in Tampa, FL), sign up to join the Planning Committee now! More information about 2019 fellowships will be posted in late spring.
In recognition of their many contributions to the community and to the development of Islandora CLAW, the Islandora CLAW committers have asked Rosie Le Faive (UPEI) and Mark Jordan (SFU) to become committers, and we are pleased to announce that they have accepted!
Rosie has brought her dedication to user experience and documentation from the 7.x stack to Islandora CLAW, providing guidance on how to improve the front end and working as the convenor of the UI Interest Group (currently on hiatus) to develop a welcoming first experience for the Islandora CLAW sandbox environment. As co-convenor of the Metadata Interest Group, Rosie has also been integral to the process of plotting out our MODS to RDF mapping so that users of Islandora 7.x can make the move to Islandora CLAW with their MODS in tow.
Mark has joined the Islandora CLAW party more recently, but hit the ground running, developing tools such as RipRap, a fixity-auditing microservice that acts as a successor to Islandora Checksum Checker. Mark's focus on preservation tools fills an important gap in the CLAW ecosystem.
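Fixity auditing of the kind RipRap performs boils down to a simple idea: record a checksum when content is stored, then periodically recompute it and compare. The sketch below shows only that underlying idea, not RipRap's actual design or API:

```python
# Minimal sketch of fixity auditing: a stored digest is recomputed later
# to detect corruption or tampering. Illustrative only; RipRap itself is
# a full microservice, not these few lines.
import hashlib

def checksum(data: bytes) -> str:
    """SHA-256 digest of the stored bytes, hex-encoded."""
    return hashlib.sha256(data).hexdigest()

# At ingest time, record the digest alongside the object.
original = b"some preserved content"
recorded = checksum(original)

# During a later audit, recompute and compare.
def fixity_ok(data: bytes, expected: str) -> bool:
    return checksum(data) == expected

print(fixity_ok(original, recorded))             # True: content intact
print(fixity_ok(b"tampered content", recorded))  # False: fixity failure
```

In a real preservation system the audit runs on a schedule and failures trigger repair from a replica rather than just a report.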
Both Rosie and Mark are also Committers on Islandora 7.x, and now join the short list of dual Committers.
Further details of the rights and responsibilities of being an Islandora committer can be found here:
Few authors have had as longstanding an influence on their genres as Agatha Christie. In her more than half-century career writing detective fiction, she established (and then often went on to break) many of the expectations of the English “whodunit” puzzle story. Her novels continue to be adapted for movies and television (recent examples being 2017’s Murder on the Orient Express and 2018’s Ordeal by Innocence), and their fingerprints are all over many crime novels to this day. (For instance, Anthony Horowitz’s best-selling 2016 novel Magpie Murders, which I read while traveling last week, very explicitly both imitates and reacts to Christie’s style.)
Christie died in 1976, less than 50 years ago, so her works are still under copyright in most countries. But her career was long enough that her early work has started to enter the public domain in the United States. That’s because the United States used fixed copyright terms, rather than terms based on the author’s lifetime, until the 1970s. Christie’s first published novel, The Mysterious Affair at Styles, introduced what may be her most famous character, Hercule Poirot. The novel entered the public domain in the United States in 1996. In 1998, it was joined there by The Secret Adversary (1922), which introduced the detective couple Tommy and Tuppence. (Copies of both novels are online.)
1998 was also the year that the US Congress passed the Sonny Bono Copyright Term Extension Act, which extended most copyrights for 20 years and froze the US public domain for published works – until the end of this year. 29 days from now, though, we should see more Christie works entering the public domain here, including The Murder on the Links, in which Hercule Poirot returns to solve a murder mystery in France. The Murder on the Links was published in both the US and the UK in 1923, and its copyright was renewed in the US in 1950. That copyright, like other 1923 copyrights still in force, is set to expire here at the end of this year.
Another English author’s famous detective also made his debut in 1923: Dorothy L. Sayers’ Lord Peter Wimsey. Unlike The Murder on the Links, his debut novel, Whose Body?, will not be entering the US public domain this coming January. But that’s because it’s already in the public domain here (at least in its first edition), and has been since 1951, as I found out some years ago after a little detective work of my own.
How did this happen? Under the copyright system used in the US in 1923, copyrights were initially secured for a 28-year term, and then had to be renewed in their 28th year to remain in effect. A 1923 copyright, then, needed to be renewed in either 1950 or 1951 to get another term. (Initially that term was 28 more years, but was later extended to 47 more years, and further extended to 67 more years with the 1998 copyright term extension noted above). Later changes in copyright law (which I’ll discuss further in later calendar entries) exempted many works first published outside the US from the renewal requirement. However, when reading Barbara Reynolds’ biography Dorothy L. Sayers: Her Life and Soul, I found out that due to publisher delays, Whose Body? had come out first in the United States, before Sayers’ British publisher issued it. So if its copyright wasn’t renewed, it was no longer in effect here.
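The term arithmetic in that parenthetical works out as follows for a 1923 work whose copyright was renewed on time. This is only a sketch of the arithmetic; a work's actual status turns on renewal records, publication history, and later statutes:

```python
# US term arithmetic for a renewed 1923 copyright, as described above.
# Illustrative only: real copyright status depends on many more factors.
publication_year = 1923
initial_term = 28            # first term under the 1909 Copyright Act
renewal_term = 28 + 19 + 20  # renewal term of 28 years, extended to 47
                             # (1976 Act), then to 67 (1998 extension)

# Terms now run through the end of the calendar year, so a renewed 1923
# copyright lasts through 1923 + 28 + 67 = 2018...
last_year_protected = publication_year + initial_term + renewal_term
print(last_year_protected)      # 2018

# ...and the work joins the public domain on January 1 of the next year.
print(last_year_protected + 1)  # 2019
```

An unrenewed 1923 copyright, by contrast, simply expired after its initial 28-year term, which is what happened to the US edition of Whose Body?.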
A search through the US Copyright Office’s Catalog of Copyright Entries confirmed that no renewal had been filed for the book. So I digitized it and contributed it to my wife’s Celebration of Women Writers website, where you can read it today (if you’re in the US, or a country whose copyright terms last for 60 years or less from the author’s death).
In less than a month, American readers will have more mysteries to read online, share, or adapt – and not just from Christie (or Sayers). Tomorrow’s advent calendar entry will feature another famous detective whose stories are all in the public domain almost everywhere in the world, but not quite all there in the US, and note one of the copyright battles that situation sparked.
Lucidworks was recently recognized as a leader in the Gartner Magic Quadrant in their Insight Engines category. I wanted to write down what this means for you — and why it’s so important to your organization.
For those unfamiliar, Gartner is the gold standard of analysts with over 2,000 research experts covering various computing and IT trends and vendor categories. As such, they wield enormous influence with IT buyers in the largest companies in the world.
Insight Engines is the term Gartner uses for search that goes beyond mere keywords. An Insight Engine is search that uses AI and advanced algorithmic techniques to deliver more relevant personalized results to customers and employees alike. In other words, Gartner awards this term to search infrastructure that provides real insights. A modern search engine.
Being Named a Leader
So. What does all this mean? Three big things are top of mind to me:
1. The industry agrees with our vision. We believe that search is the single best way to activate AI in the enterprise. We want to enable people to maximize every single digital moment at work or at play. Whether they are looking for an esoteric topic to help them complete a project or looking through a camping website for the perfect cold-weather tent, we are the technology that helps people make connections to topics, insights, and experts at the exact moment when they can best use it.
2. Our customers believe in our vision. We launched Fusion four years ago by taking the most robust and reliable search technology in the world, Apache Solr, and fusing it with the popular cluster-computing framework Apache Spark. On top of that foundation we added essential enterprise-friendly AI and operational features to the stack so that some of the most influential organizations in the world could rely on our platform to solve their toughest problems.
Since that time, more than 400 of the largest organizations in the world have given us the honor of running their most important workloads on our Fusion platform. These include top 5 global banks, retailers, and energy companies.
3. We’re primed and ready for much more. As you know, search is not just a box to find things; search technology is everywhere you look and on every screen you see. Search has gone way beyond just 10 blue links. AI requires vision and a human focus to be more than hype.
Over the next 18-24 months, we will accelerate the delivery of capabilities for our customers, and humanize AI. This means more user-friendly apps for casual users, power analysts, and service and support agents.
It also means more deployment-friendly tools for system administrators and DevOps professionals, all containerized and available to use. All of these tools and apps must be worry-free on any combination of private and public cloud infrastructures.
Search might be 30 years old, but don’t confuse 30-year old technology with the search of today.
We’re thrilled about what’s to come. Your trust in our products and team continues to keep us motivated to keep innovating with you, as together we bring the power of our platform to more use cases where we can effectively humanize AI. Together we can make the most complex technology easy and accessible for anyone, where they can best make use of it — through each digital moment in their lives.
Today marks the start of the Christian season of Advent in my church, four Sundays before Christmas. It’s an important time of the year for vocal music, as choirs both sacred and secular rehearse and perform Christmas-related music in services and concerts. We like singing public domain classics– those have stood the test of time and are often well-known to our audiences, and we can copy scores cheaply and don’t have to worry about performance rights or royalties. We also like to sing newer copyrighted works– those can be fresh and exciting, and can often be easily found in print and in recordings. Older works still under copyright, though, can often fade into obscurity, even when they’re by creators who were once well-known.
Today’s Public Domain Day advent calendar entry falls into that category. Though Frances McCollin had a long career in Philadelphia, where I’ve lived and sung for nearly 20 years, I don’t recall hearing of her before doing research for this calendar. Blind from an early age, she took a synesthetic approach to music, according to her biography Frances McCollin: Her Life and Music (1990), associating various keys with different colors and moods. She published nearly 100 pieces of music, and her works have been performed by groups that include the Philadelphia Orchestra and the Warsaw Symphony. The library where I work has some of her papers in its special collections; the Free Library of Philadelphia has others. Some of her early public domain music is now freely available online, such as “The Singing Leaves” (1918). And some of it is still performed today. I’ve found online a sample of a recent recording by Canada’s Elektra Women’s Choir of her arrangement of “In the Bleak Midwinter” for women’s voices. (McCollin died in 1960, so most of her published work is in the public domain in Canada. I’m not sure of the US copyright status of this particular composition.)
I had a much harder time, however, finding traces of another Christmas vocal piece she was known for in her day, a children’s chorale setting of “‘Twas the Night Before Christmas”. Based on the famous poem by Clement Clarke Moore (which, published 100 years earlier, had been in the public domain for her to adapt), her cantata was copyrighted in 1923, and renewed in 1950. That copyright persists to this day, even as the piece has faded into obscurity. But 30 days from now, the work will enter the public domain in the US, and anyone interested in it can digitize it, make copies, perform it, or create new works based on it, without having to find a rightsholder to ask permission or pay royalties.
As Public Domain Day approaches, I’m looking forward to seeing familiar works join the public domain in the US, like the perennial international best-seller featured in yesterday’s calendar post. But for every work like that, there are many more works that are less well-known, on all kinds of topics of interest, by people who lived all over, including in one’s own hometown. They may have been largely out of sight for years, but their entry into the public domain gives them a new opportunity to be discovered, enjoyed, and shared by anyone.
If you know of any other works that will be coming into the public domain this January you’d like to share, I’d love to hear about them. I’ll post about another one tomorrow.
“Your children are not your children […]
For they have their own thoughts […]
You are the bows from which your children as living arrows are sent forth.”
The lines above are from one of the more well-known parts of The Prophet, written by Lebanese-American poet and artist Kahlil Gibran. Though they talk about one’s offspring, one could also see them as applying to one’s creations. They may originate from people (parents or authors) who start with great control and influence over them, but eventually they become independent of their origin. To quote another line in the poem, they “dwell in the house of tomorrow, which you cannot visit”.
Since The Prophet was first published in 1923, it has to a large extent been under the control and influence of its author, and then, after his death in 1931, his estate. But over time it has entered the public domain in various countries around the world. It entered the public domain in many countries at the start of 1982, after 50 years had passed from the author’s death. It entered (and in some cases, re-entered) the public domain in many others in 2002, more than 70 years after his death. And finally, 31 days from now, it will enter the public domain in the United States, as part of the first batch of published works to enter the public domain in more than 20 years.
During the month of December, this blog will feature various works from 1923 that will be joining the public domain in the US this coming January 1, Public Domain Day. The Prophet is fairly well-known and still easy to find in print. Many other interesting works from 1923 are not so well-known or easy to find, and I hope to feature a wide variety of works over the next 31 days. (I already have some works planned to feature, but have not yet filled out a full roster; if there are any in particular you’d like to suggest, let me know by commenting here or by contacting me.)
I plan to keep most of the posts short, but I’ll have links to more information on the works and authors featured, and folks are also welcome to discuss them further here in the comments if they wish. Come the new year, I’ll also go back and add live links to online copies of the works when possible. (For now, I’ll just add a link to all online books I know of by Gibran, some of which are already public domain in the US, and some of which aren’t yet but are public domain elsewhere.)
I may also sometimes take the opportunity to point out aspects of copyright relevant to the featured work. For instance, in this post I’ve directly copied material from Gibran’s book (by quoting it) even though I’ve noted that it’s still under copyright where I am. I can do this thanks to the principle of fair use, which allows such copying under certain circumstances. There’s no universal algorithm for determining whether fair use applies in a particular case, but a fair use defense would very likely prevail here if a copyright dispute arose: I’m quoting only a small portion, in a noncommercial educational context, to provide original commentary rather than a substitute for the original work. Once The Prophet enters the public domain here, though, we’ll be able to use larger portions, or the whole work, for a much wider variety of purposes.
I’ll post my next Public Domain Day advent calendar entry tomorrow. I hope you’ll come back then to read it, and welcome your comments.
by the aid of symbolism, we can make transitions in reasoning almost mechanically by the eye which would otherwise call into play the higher faculties of the brain.
…Civilization advances by extending the number of important operations that we can perform without thinking about them. Operations of thought are like cavalry charges in a battle — they are strictly limited in number, they require fresh horses, and must only be made at decisive moments.
One very important property for symbolism to possess is that it should be concise, so as to be visible at one glance of the eye and be rapidly written.
This guest post by Grow with Google Vice President Lisa Gevelber originally appeared in Google’s blog The Keyword.
Since we launched Grow with Google a little over a year ago, we’ve traveled to cities and towns, partnering with local organizations from Kansas to Michigan to South Carolina to bring job skills to job seekers and online savvy to small businesses. No matter where we went, big cities or small towns, libraries were at the heart of these communities.
To support the amazing work of libraries throughout the country, Google and the American Library Association are launching the Libraries Ready to Code website, an online resource for libraries to teach coding and computational thinking to youth. Since we kicked off this collaboration last June, 30 libraries across the U.S. have piloted programs and contributed best practices for a “by libraries, for libraries” hub. Now, the 120,000 libraries across the country can choose the most relevant programs for their communities.
Libraries have long been America’s go-to gathering place for learning. Now more than ever, people are using libraries as resources for professional growth. And libraries are stepping up: 73% of public libraries are making free job and interview support available in their communities.
That’s why starting in January, we’ll also work hand-in-hand with libraries around the country, using technology to help ensure that economic opportunity exists for everyone, everywhere. We’ll bring Grow with Google in-person workshops for job seekers and small businesses, library staff trainings, and ongoing support to libraries in all 50 states.
We’re also announcing a $1M sponsorship of the American Library Association, creating a pool of micro-funds that local libraries can access to bring digital skills training to their communities. An initial group of 250 libraries will receive funding to support coding activities during Computer Science Education Week. Keep an eye out for a call for applications from the ALA as Grow with Google comes to your state.
Google is proud to partner with libraries all over the country to ensure economic opportunities for more Americans.