Planet Code4Lib

A writer of pessimism and grace / John Mark Ockerbloom

William Golding called the bipolar Catholic author Graham Greene “the ultimate chronicler of twentieth-century man’s consciousness and anxiety”. Both Greene’s thrillers and his more serious novels are suffused with concerns of politics and religion, flawed institutions, characters who betray others and their own consciences, and grace and redemption in unexpected places.

His first novel, The Man Within, was published in 1929. It joins the public domain in 56 days.

Author Interview: Andrea Jo DeWerd / LibraryThing (Thingology)

Andrea Jo DeWerd

LibraryThing is pleased to sit down this month with author Andrea Jo DeWerd, who, in addition to her career in publishing and as an independent book marketer, recently saw her debut novel, What We Sacrifice for Magic, released by Alcove Press. DeWerd worked for more than a decade in the marketing and publicity departments of a number of Big 5 publishers, including Crown, Random House, Simon & Schuster, and most recently, the Harvest imprint of HarperCollins. In 2022 she launched her own marketing and publishing consulting agency, the future of agency LLC. Her authorial debut, published in late September, is a fantastical coming-of-age story following three generations of Minnesota witches during the 1960s. DeWerd sat down with Abigail to answer some questions about this new book.

How did the idea for What We Sacrifice for Magic first come to you, and how did the story develop? Did your heroine Elisabeth come first? Was it always a multi-generational family story in your mind, always a witchy tale?

I was trying to write a very different book about the American Dream, and my own family’s experience with it. My grandfather’s family were Dutch immigrants in Minnesota. My great-grandfather and his cousin operated several feed mills and fish hatcheries. The next generation, my grandfather and his brothers, all became doctors. I was fascinated by this story, and by what happens after the American Dream is achieved—what happens to the next generation? But it was too close to home for me to write in the years after my grandfather passed away.

What We Sacrifice for Magic grew out of the question: what were the women doing while the men were building their empire? I started to imagine a world in which the men ostensibly held the power, but beneath the surface, it was really the women pulling the strings; a world in which the women could be running a full-on witchcraft operation out of the side door of the kitchen while the men were off fighting their wars and building their supposed influence.

Elisabeth’s voice came to me first. I started to hear her voice, and the first thing I knew about her was that she was ruled by water. From there, I explored how she would’ve come to be that way, who would’ve taught her about her power, and Magda, her grandmother, her teacher, emerged pretty quickly.

Your book addresses themes of familial history, obligation and conflict, and the individual’s struggle to both belong to and be independent of the family circle. How does the witchy element in your story add to or complicate those themes? How different would your story be if the Watry-Ridder women weren’t witches?

In many books with magic, the magic acts as the deus ex machina that lifts the characters out of their unfortunate situations. Magic breaks oppressive forces in many ways. For Elisabeth, magic is what is holding her back, her burden. Aside from that magical burden, Elisabeth would still need her coming-of-age journey. I believe that even without magic, Elisabeth would’ve always felt separate from her family. She needed to learn who she is on her own, away from the reputation of her family and the name she was born to.

Without magic, this story becomes a much more familiar one. Anyone who has ever dealt with the pressures of a family business knows what it feels like to be torn between wanting to forge your own path and getting pulled back into the family responsibility. Adult children who take care of their aging parents know that tug-of-war as well. I think we all feel family pressure in some way or another in our lives, and beneath the magic, that is what I wanted to explore in this book.

What We Sacrifice for Magic is set in your own home state of Minnesota, and opens in 1968. What significance do the setting and time period have to your story?

The setting came to me first. Elisabeth, ruled by water, was always going to be from a small lakeside town in Minnesota. The town of Friedrich was inspired by my own beloved Spicer, Minnesota, where my family has had a cabin on Green Lake since 1938. The lake felt so integral to this story and this community that the Watry-Ridder family serves.

Moreover, this family had to come from a place that was rural enough for them to fly under the radar, a pastoral community that just accepted their local eccentrics, and even came to depend on them. I was also fascinated by the sort of gossip that happens in a small town. In a close-knit community, it’s impossible to walk down the street without everybody knowing everything about you, who you’re dating, etc. I wanted to see Elisabeth and her younger sister, Mary, engage with that gossip, and it certainly shapes them as they’re growing up in Friedrich with the sometimes unwanted attention.

More broadly, 1968 was a time when many young women were starting to have more choices in their education and the opportunity for careers outside of the home, in large part due to contraception. Those choices were not available to Elisabeth—she is stuck in this small town, tied to her community, as she watches her high school classmates going off to their next chapters.

What influence has your career in publishing and book marketing had on your storytelling? Have you been inspired by any of the authors whose books you have promoted?

I started writing this book when I was working full-time as a book marketer at Random House. I had been a creative writing minor in college, but I wasn’t really writing in my first 8 years in New York while I was in grad school and volunteering and focused on other things. I was inspired to start writing again in earnest when I would be in meetings with these amazing authors like Catherine Banner and Emma Cline, who were both a few years younger than me. I thought if they found time to do it, why couldn’t I? On the flip side, I was working with Helen Simonson at the time, who said that she didn’t really get to start writing until her kids were grown and out of the house, and I thought, “I’m single, I don’t have kids, what am I waiting for?”

I was also greatly inspired by Laura Lynne Jackson’s books The Light Between Us and Signs. Her first-person account of how close we are to the spirits on the other side very much influenced my own personal spiritual beliefs, some of which are woven into Elisabeth’s outlook and her experiences with her guide from the other side, Great-Grandma Dorothy, and the energy healing work that the family does.

Tell us about your writing process. Do you have a particular place you prefer to write, a specific way of mapping out your story? Did you know from the beginning what the conclusion would be?

I wrote at least 50% of this book long-hand in a journal. I write in the morning in bed before the rest of the world comes crashing in, i.e. before I look at my phone or email. My phone stays in the kitchen until after I’m done writing for the day. Once I got further into the story, though, I switched to drafting on my laptop when I was really building momentum.

I don’t believe you have to write every day. I have a day job! I write maybe a few days a week, and this book came together 100 words at a time. I would write a single paragraph in the morning before hopping in the shower and heading into Random House. My writing group talks often about setting realistic goals because the minute you set a lofty goal and miss that first day of “write every day,” it makes it that much harder to get back on track.

I barely outlined this book. This was very much a discovery writing project, but when I got into revision, I reverse-outlined what had happened so far in the book so that I could confidently write my way through to the end. I didn’t know the exact ending of the book until I was about ⅓ of the way through. I remember emailing my writing group one day to say, “I think I just wrote the last line of my book.”

For revision, the book Dreyer’s English by friend and former Random House colleague Benjamin Dreyer was essential to me. It was very helpful to read books like his as I was enmeshed in the revision process.

What can we look forward to next from you? Do you have other writing projects in the offing?

I am working on something completely different next! I am finishing a first draft this fall of my second novel, a contemporary Christmas rom-com set in southern Minnesota. There’s Christmas cookies, a local hottie, and a girl home from the big city. I’m approaching this book a little differently—starting with an outline!

Tell us about your library. What’s on your own shelves?

I am very much a mood reader and I read just about every genre out there. I love sci-fi and fantasy or romance for a quick vacation read. I try to keep up with the new, big literary novels. I have my section of craft books, like Big Magic and Bird by Bird. I have sections of series that I’m hoping to finish one day, like Outlander. I’m always reading our clients’ books for work. I have a celebrity chef’s memoir and a book by a performance and productivity expert to read next for work. But truthfully, my shelves are full of books I haven’t read that have come with me from job to job. I have classics, I have the hot releases dating back to 2010, I have signed copies of books I’ve worked on, like Educated and Born a Crime. I also have an amazing cookbook collection from my time working in lifestyle books, lots of Mark Bittman and Jacques Pépin and Dominique Ansel.

What have you been reading lately, and what would you recommend to other readers?

I just finished the new Louise Erdrich novel, The Mighty Red. She’s my favorite author and as a contemporary Minnesotan author, she has had a huge impact on me as a reader and a writer. I think Erdrich most accurately captures contemporary women—and the myriad ways the world disappoints us—like no one else I’ve ever read. I make a point to buy the new books by Louise Erdrich and William Kent Krueger, another Minnesotan author, in hardcover from indie bookstores when I’m back in MN. If you haven’t read Louise Erdrich before, one of my favorite books is The Round House. I recommend that book to everyone.

Information Quality Lab at the 2024 iSchool Research Showcase / Jodi Schneider

While I’m in Cambridge, members of my Information Quality Lab are presenting a talk and 9 posters today as part of the iSchool Research Showcase 2024, noon to 4:30 PM in the Illini Union. View posters from 12 to 1 PM, during the break between presentation sessions from 2 to 2:45 PM, and from 4 to 4:30 PM.

TALK by Dr. Heng Zheng, based on our forthcoming JCDL 2024 paper:
Addressing Unreliability Propagation in Scientific Digital Libraries
Heng Zheng, Yuanxi Fu, M. Janina Sarol, Ishita Sarraf, Jodi Schneider

POSTERS
Addressing Biomedical Information Overload: Identifying Missing Study Designs to Design Multi-Tagger 2.0
Puranjani Das, Jodi Schneider

Assessing the Quality of Pathotic Arguments
Dexter Williams

Cognitive and Behavioral Approaches to Disinformation Inoculation through a Hidden Object Game
Emily Wegrzyn

Distinguishing Retracted Publications from Retraction Notices in Crossref Data
Luyang Si, Malik Oyewale Salami, Jodi Schneider

Harmonizing Data: Discovering “The Girl From Ipanema”
John Rutherford, Liliana Giusti Serra, Jodi Schneider

“I Lost My Job to AI” — Social Movement Emergence?
Ted Ledford, Jodi Schneider

Recognizing People, Organizations, and Locations Mentioned in the News
Xioran Zhou, Heng Zheng, Jodi Schneider

Representation of Socio-technical Elements in Non-English Audio-visual Media
Puranjani Das, Travis Wagner

What People Say Versus What People Do: Developing a Methodology to Assess Conceptual Heterogeneity in a Scientific Corpus
Yuanxi Fu, Jodi Schneider

Panel: The Tech We Want is Built and Maintained with Care / Open Knowledge Foundation

The Tech We Want Summit took place on 17 and 18 October 2024 – in total, 43 speakers from 23 countries interacted with 700+ registered people about new practical ways to build software that is useful, simple, long-lasting, and focused on solving people’s real problems.

In this series of posts, OKFN brings you the documentation of each session, making the content generated during these two intense days of reflection and joint work open and accessible.

Above is the video and below is a summary of the topics discussed in:

[Panel 2] The Tech We Want is Built and Maintained with Care

17 October 2024 – 11:30 UTC

Digital technologies need people to care for them and keep them alive. At a time of obsession with innovation and disruption, this panel shines a light on the invisible but essential work of maintenance.

Summary

This panel sheds light on the often invisible, essential work of maintaining digital infrastructure, particularly open source software. The speakers argue passionately that the maintenance of software systems, like the ongoing care of a garden, is crucial to the sustainability of digital ecosystems. They highlight the systemic problems that maintainers face, such as burnout, lack of recognition and inadequate funding, and call for a radical shift in how this work is valued and supported.

Emphasising the ethical and social consequences of neglect, and the urgent need for a supportive community and adequate funding, the panellists argue for a culture of shared responsibility and visibility. They urge both corporations and open source communities to recognise this work, to create supportive structures, and to recognise that maintenance is as critical as innovation. The discussion is a clarion call to action, emphasising that we must prioritise care and sustainability in our digital world.

Read More

BIBFRAME Dilemmas for Libraries: Challenges and Opportunities / Richard Wallis

I recently attended the 2024 BIBFRAME Workshop in Europe (BFWE), hosted by the National Library of Finland in Helsinki. It was an excellent conference in a great city!

Having attended several BFWEs over the years, it’s gratifying to witness the continued progress toward making BIBFRAME the de facto standard for linked data in bibliographic metadata. BIBFRAME was developed and is maintained by the Library of Congress to eventually replace the flat record-based metadata format utilised by the vast majority of libraries – MARC (a standard in use since 1968).

This year, Sally McCallum from the Library of Congress shared significant updates about their transition to becoming a BIBFRAME-native organisation. In August 2024, they began a pilot with 15 cataloguers inputting records directly into BIBFRAME, marking the start of the next stage of a long journey. This process not only involved adopting a new system but also retraining a large number of staff—a significant challenge but a major step forward.

Several other organisations, including the Share Community, OCLC, Ex Libris, and FOLIO LSP, also presented their advancements in linked bibliographic metadata and BIBFRAME. While the progress is encouraging, there are some dilemmas, not really addressed in the conference, that libraries face as they consider adopting BIBFRAME, and I’d like to explore those here.

#1: Should linked data only be limited to bibliographic resources?

One of the key benefits of linked data is its ability to connect and relate resources across different domains, not just within traditional library systems. However, many libraries aiming to leverage linked data are primarily focused on bibliographic resources, especially as current BIBFRAME-enabled cataloguing solutions are often seen only as replacements for MARC-based systems.

The challenge arises when libraries want to integrate other types of resources—such as archival collections, historical documents, or art-related information—that don’t neatly fit into the BIBFRAME model. BIBFRAME excels at describing bibliographic resources, but it struggles with the nuances of these other resource types. There are initiatives to extend BIBFRAME to handle arts materials etc., but they are still very [bibliographic] library system focused.

Dilemma: Should a library implement a linked data solution solely for bibliographic resources (essentially as a MARC replacement), or should they adopt a broader linked data strategy that integrates all types of resources across the organisation?

My thought: If a [linked data enabled] replacement for a current library system is all you are looking for, that’s fine. However, if that is all, you need to examine the benefits that would accrue from such a significant move and investment. If your ambition is to present a linked aggregated view of all your resources to your users, a BIBFRAME replacement library system probably will not be flexible enough. 

#2: How to bridge the gap between the library world and the wider web?

One of the widely-touted benefits of BIBFRAME is the ability to share library data more openly across the web. In theory, other libraries, research institutions, and even the broader public could link to a library’s BIBFRAME data. For the library community, BIBFRAME offers a comprehensive linked data vocabulary that facilitates data sharing.

However, outside of the library world, the web at large, driven by the search engines, is largely adopting Schema.org as the preferred vocabulary for sharing data. Libraries have long been seen as silos, with their data mostly confined to standalone search interfaces and complex data formats such as MARC. 

BIBFRAME, while a step forward, doesn’t fully resolve this issue. Yes, it makes data more open and linked, but it still speaks primarily to the library community. If libraries want their data to enrich the wider web, they may need to also incorporate Schema.org alongside BIBFRAME to ensure comprehension and therefore visibility of their resources.

Dilemma: Should libraries focus exclusively on sharing data within the library and research community using BIBFRAME, or should they also aim to make their data more accessible to the general web audience by enriching their data with Schema.org terms?

My thought: Whatever specialist online discovery routes our users may take, they and we are also users of the wider web in general. To make best use of our resources we need our potential users to be guided to those resources. Guided from where they are, which is often not within a library interface or specialist site. To be visible beyond library focused sites, our resources need to be also described using the de facto vocabulary for the rest of the web – Schema.org.  
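
As a rough illustrative sketch (a hypothetical record, not output from any particular system): alongside its internal BIBFRAME description, a catalogue could emit a Schema.org JSON-LD description of the same item for search engines to consume. The short Go program below shows the shape of such a snippet; the Schema.org terms (Book, Person, name, author, datePublished) are real vocabulary, everything else is made up for the example.

// Sketch: publishing a Schema.org JSON-LD description of a catalogue item,
// alongside whatever BIBFRAME description the library keeps internally.
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	item := map[string]any{
		"@context":      "https://schema.org",
		"@type":         "Book",
		"name":          "A Room of One's Own",
		"author":        map[string]any{"@type": "Person", "name": "Virginia Woolf"},
		"datePublished": "1929",
	}
	out, _ := json.MarshalIndent(item, "", "  ")
	// The resulting JSON-LD would typically be embedded in the item's web page
	// in a <script type="application/ld+json"> element.
	fmt.Println(string(out))
}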

#3: The costs and challenges of transitioning to BIBFRAME

Transitioning to BIBFRAME can involve significant upheaval for a library, especially for those still reliant on MARC-based systems. Replacing these systems often comes with substantial costs, retraining efforts, and disruptions to daily operations.

Many libraries may question whether the perceived benefits of linked data and BIBFRAME—such as improved data sharing and discoverability—are worth the investment. For smaller institutions, the costs of a full-scale BIBFRAME implementation may seem prohibitive, especially when the advantages are not always immediately tangible.

Dilemma: Should libraries undertake a full-scale, costly transition to BIBFRAME and linked data, or is there a way to adopt linked data principles more gradually, without completely overhauling existing systems?

My thought: My many years working with libraries have taught me that any significant change in systems and/or practices often results in far greater investment in time, people, and money than was initially envisaged. Part of the reason for this is the integrated nature of traditional library systems. Swapping out one system for another, say to change cataloguing practices, will often result in changes to circulation and acquisition processes, for example. All this whilst the library needs to continue its business as usual. Equally, is retraining of staff a necessary first step to adopting linked data, or could/should it be a more evolutionary process?

My recent work, in partnership with metaphacts, for the National Library Board Singapore has demonstrated that it is possible to make significant beneficial moves into linked data, without replacing established systems and processes or disrupting business as usual. A route others may want to consider.

In addition to attending the BFWE conference, I had the privilege of delivering a presentation titled “Building a Semantic Knowledge Graph at National Library Board Singapore” [slides, video]. This project represents a two-year effort to develop and deliver a linked data management system based on both BIBFRAME and Schema.org, powered by metaphactory. What makes this initiative unique is that it integrates data from various systems across the library without requiring a complete systems replacement.

Conclusion

Since its launch 18 months ago, this system has continued to evolve, delivering linked data services back into the library. The approach has allowed the library to realise many of the benefits of linked data without the disruption of replacing its core systems. These benefits include cross-system entity aggregation & reconciliation, navigational widgets for non-linked systems, and an open linked data knowledge graph interface. Besides leveraging the benefits of linked data for library curators, the immense knowledge graph built across data sources united using Schema.org data modelling opens the opportunities of publishing rich cross-domain data to the general public. To learn more about our work with NLB, have a look at this metaphacts blog post.

For those grappling with any of the dilemmas I’ve outlined here or interested in exploring linked data further, feel free to reach out—I’d be happy to help facilitate a discussion.

(Note: This post is also featured as a guest post on the metaphacts blog)

A woman who made her mark on the map / John Mark Ockerbloom

Emma Willard had remarkable persistence. She founded the first higher education institution for women in America, and appealed tirelessly for its support in multiple states. She wrote textbooks for it that include groundbreaking work in history and graphic design.

Alma Lutz’s 1929 biography of Willard, joining the public domain in 57 days, is titled Emma Willard, Daughter of Democracy. May all American daughters and other children of democracy vote to defend it today.

“You know it too well already…” / John Mark Ockerbloom

“I listen to Mussolini’s gentle voice talking to me of friendship, while my ears still ring with the death threats…”

French Prix Goncourt laureate Maurice Bedel wrote in the 1920s and 30s of the appeal and threat of fascism, and the people seduced by it in Italy and Germany. Parts of his book Fascisme An VII appeared in English translation in the November 1929 Atlantic as “A Frenchman Looks at Fascism”. It joins the public domain in both Europe and America in 58 days.

SantaThing 2024: Bookish Secret Santa! / LibraryThing (Thingology)

It’s the most wonderful time of the year: the Eighteenth Annual SantaThing is here at last!

This year we’re continuing to focus on indie bookstores. You can still order Kindle ebooks, we have Kennys and Blackwell’s for international orders, and also stores local to Australia, New Zealand, and Ireland.
» SIGN UP FOR SANTATHING NOW!

What is SantaThing?

SantaThing is “Secret Santa” for LibraryThing members.

How it Works

You pay $15–$50 and pick your favorite bookseller. We match you with a participant, and you play Santa by selecting books for them. Another Santa does the same for you, in secret. LibraryThing does the ordering, and you get the joy of giving AND receiving books!

Sign up once or thrice, for yourself or someone else.

Even if you don’t want to be a Santa, you can help by suggesting books for others. Click on an existing SantaThing profile to leave a suggestion.

Every year, LibraryThing members give generously to each other through SantaThing. If you’d like to donate an entry, or want to participate, but it’s just not in the budget this year, be sure to check out our Donations Thread here, run once again by our fantastic volunteer coordinator, mellymel1713278.

Important Dates

Sign-ups close MONDAY, November 25th at 12pm EST. By the next day, we’ll notify you via profile comment who your Santee is, and you can start picking books.

You’ll then have a little more than a week to pick your books, until THURSDAY, December 5th at 12pm EST (17:00 GMT). As soon as the picking ends, the ordering begins, and we’ll get all the books out to you as soon as we can.

» Go sign up to become a Secret Santa now!

Supporting Indie Bookstores

To support indie bookstores we’re teaming up with independent bookstores from around the country to deliver your SantaThing picks, including BookPeople in Austin, TX, Longfellow Books in Portland, ME, and Powell’s Books in Portland, OR.

And to continue previous years’ success, we’re bringing back the following foreign retail partners: Readings for our Australian participants, Time Out Books for the Kiwi participants, and Kennys for our Irish friends.

And since Book Depository has closed, this year we’re offering international deliveries through Kennys and Blackwell’s.

Kindle options are available to all members, regardless of location. To receive Kindle ebooks, your Kindle must be registered on Amazon.com (not .co.uk, .ca, etc.). See more information about all the stores.

Shipping

Some of our booksellers are able to offer free shipping, and some are not. Depending on your bookseller of choice, you may receive $6 less in books, to cover shipping costs. You can find details about shipping costs and holiday ordering deadlines for each of our booksellers here on the SantaThing Help page.
» Go sign up now!

Questions? Comments?

This is our EIGHTEENTH year of SantaThing. See the SantaThing Help page for further details and FAQ.
Feel free to ask your questions over on this Talk topic, or you can contact Kate directly at kate@librarything.com.
Happy SantaThinging!

Open Knowledge Achieves US Charitable Organisations Equivalency Status / Open Knowledge Foundation

We’re thrilled to announce that the Open Knowledge Foundation (OKFN) has achieved NGOsource Equivalency Determination (ED) certification, formally establishing our recognition as equivalent to a US public charity. This status represents a major milestone for OKFN and opens new avenues for partnerships and support from US-based donors and foundations.

What is NGOsource Equivalency Determination (ED)?

The Equivalency Determination process, administered by NGOsource, evaluates nonprofit organisations outside the United States to confirm their operations are in accordance with the guidelines that US tax authorities require for public charitable organisations. By meeting NGOsource’s rigorous criteria, Open Knowledge demonstrates its commitment to transparency, accountability, and impact on a global scale. This designation means that foundations and individuals in the United States can now make tax-deductible grants and donations to OKFN with fewer restrictions, knowing that their contributions are directed toward a recognised, vetted nonprofit organisation.

What This Means for Our Work and Communities

As a certified organisation, OKFN can now access new grant opportunities and accept tax-deductible donations from US-based donors. This expanded support base will enable us to continue our work on a global scale, advancing open knowledge, promoting transparency, and advocating for and building accessible, digital tools that serve the public interest worldwide. With US charity recognition, OKFN is now even better positioned to partner with organisations, donors, and advocates who share our vision of a world open by design, where all knowledge is accessible to all.

Renata Avila, CEO of Open Knowledge, shared her appreciation for the recognition: “Receiving NGOsource’s Equivalency Determination isn’t just an acknowledgement of OKFN’s work; it’s a profound opportunity to expand our mission. With US charity recognition, we can continue to nurture and grow a network of leaders and communities in every region of the world. We will also continue to innovate legal, technical and accessibility tools for citizens and governments to unlock the potential of open knowledge, data and digital technologies that can be applied to their work and transform lives. This will lead to open knowledge, data and technologies that are open, participatory, accountable and sustainable for a better world and empowered communities everywhere.” 

This achievement is directly in line with OKFN’s vision and strengthens its ability to advocate and implement open knowledge initiatives worldwide. You can find this and other relevant information about OKFN’s institutional functions on our Governance page.

We’re excited about the opportunities this opens up and grateful for the continued support of our community. Thank you for being part of our journey towards a fair, sustainable and open future.

“He himself is so much bigger than his books” / John Mark Ockerbloom

It’s the last day of Diwali, the Hindu festival of lights that’s also celebrated by various other traditions in India, and in the Indian diaspora.

Among the Indian diaspora’s cultural ambassadors was Newbery medalist Dhan Gopal Mukerji. His 1929 books include Hindu Fables for Little Children, illustrated by Kurt Wiese, introducing tales he grew up with in India to a wide variety of readers. John Neihardt reviewed it when it came out. It goes public domain in 59 days.

Cut-ups and LLMs / Ed Summers

If language is a virus, what are LLMs?

I’ve had this kinda random notion about Large Language Models (LLM) and the Cut-up technique rumbling around in my brain for the past year. Unless you’ve been living in a cave I’m guessing you already know about LLMs. You probably already know about Cut-ups too, but just in case here is how Burroughs and Gysin describe this creativity tool (Burroughs & Gysin, 1982, p. 34):

Writing is fifty years behind painting. I propose to apply the painters’ techniques to writing; things as simple and immediate as collage or montage. Cut right through the pages of any book or newsprint… lengthwise, for example, and shuffle the columns of text. Put them together at hazard and read the newly constituted message. Do it for yourself. Use any system which suggests itself to you. Take your own words or the words said to be “the very own words” of anyone else living or dead. You’ll soon see that words don’t belong to anyone. Words have a vitality of their own and you or anybody else can make them gush into action.

p. 34 of The Third Mind
p. 35 of The Third Mind

Burroughs famously used this technique in his Nova Trilogy (and elsewhere) to mix together the works of other authors (Shakespeare, Rimbaud, Kerouac, Genet, Kafka, Eliot, Conrad, …). It has since been widely used as a creativity tool, apparently by musicians like David Bowie, Kurt Cobain and Thom Yorke. The purpose of this hack isn’t simply to come up with new ideas, but to dismantle discursive systems of control:

The Burroughs machine, systematic and repetitive, simultaneously disconnecting and reconnecting—it disconnects the concept of reality that has been imposed on us and then plugs normally dissociated zones into the same sector–eventually escapes from the control of its manipulator; it does so in that it makes it possible to lay down a foundation of an unlimited number of books that end by reproducing themselves. (Burroughs & Gysin, 1982, p. 17)

So what do LLMs and Cut-ups have to do with each other?

One superficial way of thinking about LLMs is as the cut-up machine, par excellence. LLMs are built by taking a massive amount of content from the Web, chopping it up into words (tokens), and then creating a neural network that represents the likelihood of one token following another. This allows new text to be generated word by word given an initial sequence, or prompt. Similar to the cut-up, it’s no longer possible to attribute LLM generated text to a particular author or authors. The very idea of authorship and attribution is completely dissolved in the model.

However, the big difference with LLMs is that they are optimized for predicting the next likely word, given an initial sequence of words. An LLM is ultimately a statistical representation of likely text. The Cut-up, on the other hand, is specifically designed to break the typical associations of words, but without totally obscuring where those words came from.
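
To make that contrast concrete, here is a toy sketch (a deliberately crude illustration, nothing like a real language model and not Burroughs' actual procedure): the cut-up shuffles fragments of its sources so they collide in new ways, while a tiny bigram chain only ever emits a word that already followed the current word somewhere in its training text.

// Toy contrast: a cut-up scrambles fragments of the source texts, while a
// tiny bigram chain only reproduces word orders it has already seen.
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// cutUp splits each text into comma-delimited fragments and shuffles them together.
func cutUp(texts []string) string {
	var fragments []string
	for _, t := range texts {
		fragments = append(fragments, strings.Split(t, ",")...)
	}
	rand.Shuffle(len(fragments), func(i, j int) {
		fragments[i], fragments[j] = fragments[j], fragments[i]
	})
	return strings.Join(fragments, ",")
}

// bigram emits up to n words, each chosen from the words that followed the
// current word in the input text.
func bigram(text, start string, n int) string {
	words := strings.Fields(text)
	next := map[string][]string{}
	for i := 0; i < len(words)-1; i++ {
		next[words[i]] = append(next[words[i]], words[i+1])
	}
	out := []string{start}
	w := start
	for i := 0; i < n; i++ {
		candidates := next[w]
		if len(candidates) == 0 {
			break
		}
		w = candidates[rand.Intn(len(candidates))]
		out = append(out, w)
	}
	return strings.Join(out, " ")
}

func main() {
	a := "words have a vitality of their own, and you or anybody else can make them gush into action"
	b := "language is a virus, from outer space"
	fmt.Println("cut-up:", cutUp([]string{a, b}))
	fmt.Println("bigram:", bigram(a+" "+b, "words", 12))
}

The cut-up output puts fragments of the two sources next to each other regardless of how likely the juxtaposition is; the bigram output can only wander along paths the corpus already contains.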

LLMs discipline communication, and routinize language in an attempt to simulate meaningful text. Cut-ups intentionally break word associations in order to reveal non-obvious, possibly absurd, latent meanings in given texts. Comparing LLMs to Cut-ups unmasks the LLM as a normalization tool for language control, and the Cut-up as a tool for wresting back control, for peeking inside the discursive machinery of language.

I was reminded of this today when I ran across a lovely short paper by Max Kreminski entitled Computational Poetry is Lost Poetry which he presented at the Halfway to the Future conference (open-access Proceedings) recently.

In this paper he draws a comparison between Found Poetry, where poetry is discovered in everyday use of language, and LLMs “whose central purpose is to arrange units of language, without fully understanding them, in combinations that can later be found to be poetry”. He calls this LLM generated text “Lost Poetry”. Of course not everyone using LLMs is trying to write poetry, or even think creatively, so this analogy doesn’t totally work for all LLM use cases. But he goes on to make some insightful observations about the flaws of generative AI:

I argue that machines are often usefully creative because they fail to see things completely as humans do: their oversights and inabilities lead them to mix human-like with non-human-like creative decisions in unanticipated ways, and thereby to supply human creators with ideas that they otherwise never would have considered. Somewhat counter-intuitively, then, I suggest that a dogged pursuit of perfect overlap between human and machine understanding of aesthetic domains may in fact inhibit the usefulness of machines as generators of unexpected inputs to the human creative ecosystem.

The flaws that we see in these generative systems are what make them useful, and the quest to build bigger and bigger models that better model “reality” is at cross purposes with their use in creative endeavors. It’s the glitches that provide value. He goes on to say:

… the design of novel computationally creative systems could be guided in part by a deliberate choice of what to make invisible to the machine. By selectively limiting the machine’s capacity to take certain facets of human aesthetic perception into account, we can produce different kinds of losers that can help to break us out of familiar patterns toward new techniques of expressive communication.

This is a provocative idea I think, that it’s the limitations that we build (intentionally or not) into computational systems that make them legible to us humans. These limitations help distinguish one tool from another. Just as Cut-ups engineer for the unexpected, and transgress predictable narrative structures, these LLM generated “losers” have more potential for creative thought because they are errors. Maybe this is one move too far, but there seem to be some parallels here to seamful design, where “strategic revelation of complexity, error, or backgrounded tasks” provide value instead of distraction (Inman & Ribes, 2019).

But perhaps Kreminski, as a chief scientist at a generative AI company, is trying very hard to find value in these statistical models that ultimately drive out and exploit creativity. They do this by disciplining language, normalizing our words to fit the types of words found on the World Wide Web at a particular point in time. I wish him well in his efforts to make these models smaller, more quirky, and more useful for actual artists–instead of larger and smoother for people who don’t want to employ human artists anymore.

I do wonder, what would it look like if LLMs worked more like Cut-ups, where we got unfamiliar juxtapositions and the sources weren’t completely obfuscated/concealed?

References

Burroughs, W. S., & Gysin, B. (1982). The third mind (First paperbound edition). New York: Seaver Books.
Inman, S., & Ribes, D. (2019). Beautiful Seams. In CHI Conference on Human Factors in Computing Systems proceedings. Glasgow, Scotland. https://doi.org/10.1145/3290605.3300508

A Room of One’s Own, for all / John Mark Ockerbloom

“A woman must have money and a room of her own if she is to write fiction.”

Virginia Woolf’s classic 1929 essay on feminism and creative work has inspired numerous analyses (like this one), adaptations (like this one), and projects (like this one).

Copyright is one way writers get money, but it often enriches publishers and estates more than it helps creators. We begin this year’s countdown anticipating Woolf’s A Room of One’s Own arriving in the US public domain in 60 days.

The remainder of the Roaring 20s about to join the public domain / John Mark Ockerbloom

Just two months from now, much of the world will celebrate another Public Domain Day, welcoming a year’s worth of works into the public domain. Many countries that have had life+70 years copyright terms for a while will get works by authors who died in 1954. Those fortunate enough to still have life+50 years terms will get works by authors who died in 1974. The rules in the United States are more complicated, but we’ll have nearly all our remaining copyrights from 1929 expire. That means that, for us, essentially all of the publication history of the “roaring 20s” will be public domain when the new year arrives.1 That’s a wide sweep of culture available for everyone to enjoy, share, build on, and reuse.

The Twenties encompass the start of national women’s suffrage, the rise of the Jazz Age and the Harlem Renaissance, and the dawn of “talking” motion pictures, and extend to the “Black Tuesday” stock market crash and the beginning of the Great Depression. The Twenties had political upheaval to match the cultural and economic upheaval, including civil war in Ireland and many other places around the world, the birth of fascism in Europe, and the revival and decline of the Ku Klux Klan as waves of anti-immigrant and racist sentiment washed over much of America. But the decade also saw widespread international efforts to try to end war generally among nations. While the 1928 pact that many nations signed on to has often been viewed as a failure for not preventing World War II, it set a precedent for later international cooperation and peacekeeping efforts that can be credited with more success.

As I have in past years, I’ll be featuring a Public Domain Day countdown in the days leading up to New Year’s Day 2025, each day featuring an interesting work that will be joining the public domain then. You can follow it on this blog, or using RSS readers or social media that can connect with this blog. That includes Mastodon and other “fediverse” sites that connect with Mastodon using the ActivityPub protocol. I’ll also boost or link to the daily posts from my Mastodon account. (Most of the posts will have 500 characters or fewer, the size of a typical Mastodon post; a few may be longer.) You might also be able to follow my boosts and links from Bluesky (since my account is hooked up to Bridgy Fed), as well as possibly from Threads if they’ve enabled following Mastodon accounts. (That was on their roadmap for 2024, but I don’t know if it’s working yet.) My posts will include the hashtag . I’ll be focusing on works joining the US public domain that are of interest to me, but you’re also welcome to post about works of interest to you joining the public domain where you are, and use the same hashtag if you like.

Right now for me, and for many others I’ve talked to, it’s hard to think much beyond next Tuesday. But I hope these posts help us anticipate some good things coming in the future, built on the knowledge and creativity of the past. May we all see and help bring about a better future in the days to come!

  1. The rules in the US are different for unpublished works, and for sound recordings that aren’t part of motion pictures. (I told you US copyright law was complicated.) But this January 1, along with publications from 1929, we will be welcoming sound recordings released in 1924 (which have a 100-year term) into the public domain, as well as many unpublished works by people who died in 1954. For lots more details and special cases, see Cornell University Library’s public domain table. ↩

November 2024 Early Reviewers Batch Is Live! / LibraryThing (Thingology)

Win free books from the November 2024 batch of Early Reviewer titles! We’ve got 209 books this month, and a grand total of 4,102 copies to give out. Which books are you hoping to snag this month? Come tell us on Talk.

If you haven’t already, sign up for Early Reviewers. If you’ve already signed up, please check your mailing/email address and make sure they’re correct.

» Request books here!

The deadline to request a copy is Monday, November 25th at 6PM EST.

Eligibility: Publishers do things country-by-country. This month we have publishers who can send books to the US, the UK, Canada, Australia, Ireland, Netherlands, New Zealand, Germany, Italy, Spain and more. Make sure to check the message on each book to see if it can be sent to your country.

A Place No Flowers GrowTunnel of Hope: Escape from the Novogrudok Forced Labor CampFarmhouse on the Edge of Town: Stories from a B&B in the Mountains of Western MaineWould You Rather? True Crime Edition: 1,000+ Thought-Provoking Questions and Conversation Starters on Serial Killers, Mysteries, Crimes, Supernatural Activity and MoreHeart of the GlenSerial BurnThe Indigo HeiressA Furnace SealedEmotional Confidence: 3 Simple Steps to Manage Emotions with Science and ScriptureMade to Be She: Reclaiming God's Plan for Fearless FemininityBring Back Your People: Ten Ways Regular Folks Can Put a Dent in White Christian NationalismIllusory Dwellings: Aesthetic Meditations in KyotoHere Goes NothingFrom the Ground Up: The Women Revolutionizing Regenerative AgricultureDivision Street: AmericaConfidentialWhat If We Were All Friends!I Am Wind: An AutobiographyWicked KingSingle PlayerA Death in DiamondsMother's First Aid: Mother's Guide from Birth to FourI Refused to Be a War BrideHard FoodMarry a Mensch: Timeless Jewish Wisdom for Today's Single WomanA Fateful EncounterThe Language of MothersAlterationsSoviet Jewry Reborn: A Personal JourneyAcademy of Unholy BoysAccess All AreasCaribbean HolidayThe WeirdotsA Simple Guide to Staying Healthy & Living LongerThe Bugs1950s Nostalgia Activity Book for Seniors: 50 Retro Themed Word Search Puzzles with Illustrated Fun Facts and Trivia for a Fun Walk Down Memory LaneNostalgic Trivia for Seniors: Relive Your Favorite Memories of 5 Decades of Americana (1950s-1990s) with 500 Multiple-Choice Questions and Illustrated ThemesEnemies of the StateHunger In The StonesThe Mall WalkersOur Comeback Tour is Slaying MonstersObsidian PrinceBewitching RosemaryHeart Games 3: A Christian Romance (Christmas) Puzzling ExperienceNew Prize for These Eyes: The Rise of America's Second Civil Rights MovementSomewhere Toward Freedom: Sherman's March and the Story of America's Largest EmancipationBearslayerDeath Of A Spy?Love is the Answer, No Matter what the QuestionHow to Chop Tops: A Pictorial Guide to Hot Rodding's Most Popular ModificationTime BeforeFehuUnderstanding Adolescence for Girls: A Body-Positive Guide to PubertyUnderstanding Adolescence for Boys: A Body-Positive Guide to PubertyAmorphous: Breaking the MoldThe Old Secret at Hotel OregonHotel ImpalaAntidote: A New Emotional Wellth Framework™ to Build ResiliencePractical Money Skills for Teens: Personal Finance Simplified, with a Quickstart Guide to Budgeting, Saving, and Investing for a Stress-Free Transition to Financial Independence365 Inspirational & Motivational Quotes to Live By: Daily Wisdom to Inspire Personal Growth, Resilience, Positivity, and MindfulnessLandscapes & Landmarks Coloring Book for Adults: Scenic Beauty and Iconic Places from All 50 States of America for Mindful Relaxation and Stress ReliefLeadership Bites: An Approachable Handbook for Emerging LeadersConspiracy of CatsFive Minutes from a MeltdownBefore the King: Joanna's StoryWhen Stars Light the SkyDiversity, Equity, and Inclusion Essentials You Always Wanted to KnowPython Essentials You Always Wanted to KnowBlockchain Essentials You Always Wanted to KnowDigital Shock: Seven Shocks That Are Shaping the FutureLove by the BookSarah: Discovering LoveThe Matrix of the MindBeing a Woman Over Forty: The 40 Things You Should Know by NowA Choir of WhispersConductoid - Scars of the DominatayEvolving Through Life Transitions: A Coping Strategies Guide, Resource, and Workbook Designed for Individuals, Therapists, and Their ClientsCursed EarthAllies, Arson, and 
Prepping for the ApocalypseNo One Will Save UsTrance Formation: My Hero's Journey of How I Turn Life's Greatest Challenges into Life's Greatest Gifts. A Spiritual Awakening Real Life StoryDentro di Te: Un Viaggio Illustrato di Mindfulness per Bambini CuriosiConversation with XenexThe God FrequencyBeyond the Dismal Veil: Five Short Horror Romance StoriesPaper FaceDiez recordatorios para la mujer Cristiana solteraSpared: A Memoir of Risk and ResolveTeslamancerAnny in LoveRooted and RememberedHouse of SecretsThe Lightning SeedThe Twisted Tree Dig3 Strikes: Finding Love in Forbidden PlacesThe Focused Faith : Detox Your Digital Life Reclaim Hijacked Attention Build Habits for Focus & JoyStrawberry GoldBackupThe Happy Hunting Ground of All MindsFood Freedom: Empowerment Manual for Liberation Through FoodMeantime in GreenwichThe Inheritance of Amaya MontgomeryPlea to a Frozen GodOne Night BoyfriendA Fate Far Sweeter: Passion & Peril In UkraineMien: CurrentsHealing Your Innocent Inner Child: Your Workbook to Overcoming Past Trauma, Regaining Emotional Stability and Practicing Self-Compassion With 20 Scientific-Backed Practical ExercisesThe Songs of MagicLight LockedDragon FlameAliens Versus FootballEffortless Monthly Bills Checklist: Stop Stressing and Achieve Financial Clarity in Just Minutes a Month with this Easy to Use 4-Year Workbook for Tracking your Bills!Monthly Bill Payment Checklist: Take Control Of Your Finances Today!Poseidon's Progress: The Quest to Improve Life at SeaLeave No Trace: FestiFellThe QuietThe Unexpected GuestsHolding On To Her Identity: Losing My Wife To Alzheimer'sLouisa Sophia and a Legion of SistersMy Thanksgiving Coloring Book for Kids: Ages 4-6Be the Weight Behind the SpearThe Fixer: The Good Criminal: Part OneA Madness UnmadeGrandma Mcbee, How Slow Can She Be?Prompted: Synaesthete and Other StoriesIdentity Crisis (A Lawyer's Tale): How Divorce Nearly Ended My LifeMilk Before MeatWife No. 
56One Year Without Sugar: Unlocking the Secrets to Weight LossThe Last Nuclear WarGitel's FreedomShort StoriesThe Street Illuminati: DragonShort StoriesDelusions of ChenilleMutated Files: Case OneThe Trench of the DeadAbout the BoyThe Trillion Dollar CowBetween Taste and SoundThe Curse of the Smoky Mountain TreasureLa Vaca Cuesta DemasiadoWhen the Roman Bough Breaks: How a History of Violence and Scandal Shaped the Roman Church, and Hope for Catholics in the GospelThe Misfits: Tails of AlienationsHow the HeussKid Moved the Mole-Lid!Devil's DefenseNew American CaféOld White Man WritingThe King of Myths: Gods and Legends from Every CultureThis Thing is Starving3 by 3: Self Help Discovery and Inner Growth BookThe Keeper's SecretA Hot Chocolate for TwoBlades of ObsessionAn Author's Guide to MonstersPreterism From The BeginningThe Vanishing Heiress: The Unsolved Disappearance of Dorothy ArnoldThe Silent Witness: The Unsolved Murder of Mary Rogers: A Scandal That Shook New York and Inspired Edgar Allan PoeAnywhenWhispers from the Murder Farm: The Case of Belle Gunness: Inside the Mind of America’s Darkest Femme FataleThe Emotional Intelligence Advantage: Transform Your Life, Relationships, and CareerCurse of the Maestro and Other StoriesA Certain Slant of LightPresented with LoveLet the Purring Begin: Sapphire's TaleThe Ultimate Guide to Rapport: How to Enhance Your Communications and Relationships with Anyone, Anytime, AnywhereTangled up in MurderThe Starlight ContingencyFree Life Revolution: Zero-Cost Hacks to Transform Your Body & Mind: Book OneDelitti Fuori OrarioNo One Will Save UsThe Little Hedgehog's Second ChristmasUnboundBlair CountyA Legend of the SailorsThe Displacement Dilemma: Navigating the Survival of Human Expertise in an AI-Driven WorldVivid Visions: Tales Woven from the Threads of Diverse ImaginationsWould You Rather? True Crime Edition: 1,000+ Thought-Provoking Questions and Conversation Starters on Serial Killers, Mysteries, Crimes, Supernatural Activity and MoreThe Business Rescue Casebook IIIThe BreakdownThe Life and Times of Sherlock Holmes: Essays on Victorian England, Volume FiveEchoes of the TombSuper Psyched: Unleash the Power of the 4 Types of Connection and Live the Life You LoveFinding KIND: Discovering Hope and Purpose While Loving Kids with Invisible Neurological DifferencesSatan, Get Out in the Name of Jesus: An Autobiographical Account of a Personal Struggle Against Demonic Forces of DarknessSide Quest: StoriesWillful Wanderer: A MemoirThe Fields of Britannia: The Darkness Before the DawnFree Life Revolution: Zero-Cost Hacks to Transform Your Habits & Horizons: Book TwoShallow DepthsThe Intentional Leader: A Guide To Elevate Your Residential Service BusinessDark ArteriesPoinsettia LaneThe ProjectionistA Change in Destiny: Dark SuspicionsTales of a Toy Soldier: The CorpseVesselFelones de Se: Poems about SuicideNoir Dirt Cheap: Film Noir in the Public Domain, Volume 1Queen of TradesShadows Under a Dipping SunDavid and the Lost NookThe Evolution of Nora O'Brien PachecoShort Stories from Faraway PlacesBeyond Beliefs: The Incredible True Story of a German Refugee, an Indian Migrant and the Families Left BehindLosing ItA Blessed FallPick Me!The Secret of the Mind-Garden

Thanks to all the publishers participating this month!

Alcove Press Arctis Books USA Baker Books
Before Someday Publishing Bethany House Broadleaf Books
CarTech Books Census Press City Owl Press
Crooked Lane Books Entrada Publishing eSpec Books
Gefen Publishing House IngramSpark Islandport Press
Lerner Publishing Group The New Press Prosper Press
PublishNation Purple Diamond Press, Inc Purple Moon Publishing
Revell Riverfolk Books Rootstock Publishing
Running Wild Press, LLC Simon & Schuster Somewhat Grumpy Press
Stone Bridge Press Tundra Books Twisted Road Publications
Unsolicited Press Vibrant Publishers Wise Media Group
Yorkshire Publishing

The digital objects of the SBN catalogue / Raffaele Messuti

I recently discovered that the SBN catalogue APIs are now available, although I don't know how long ago they were released. It's a topic I took an interest in in the past, more out of personal curiosity than professional need, since I strongly believe in the value of open data and metadata in the cultural heritage sector. Years ago I had found some unofficial APIs used by the catalogue's mobile applications; they still work today, albeit with limited functionality. Those APIs continue to attract interest from researchers and developers who contact me for further details which, unfortunately, I'm not able to provide.

What follows is my analysis of these new official SBN catalogue APIs, and the way I used them for a specific case study: obtaining the list of documents for which a digital resource is available. The entire SBN catalogue holds 20+ million documents. The subset I'm interested in, the digitised documents, is just under a million (938,000+). Getting the list of documents that have a digital object seemed like a good experiment for exploring the catalogue at random and discovering some relevant content (serendipity!).

I came across a few peculiarities in the data modelling, and the lack of detailed, complete documentation meant I had to proceed by trial and error and intuition. I don't mean to criticise or belittle the work done by ICCU; on the contrary, I think it's an important achievement, and I hope that more public discussion of these data tools and interfaces can help improve them and encourage their use.

I should point out, though, that recently I've come to a different, less orthodox view of how cultural heritage data should be distributed: I wrote about it in Beyond HTTP APIs: the case for database dumps in Cultural Heritage, arguing that we should prefer complete, self-contained, ready-to-use exports over APIs.

Quickstart for using the APIs

The APIs are reachable from this portal: https://api.iccu.sbn.it/devportal/apis. Access is not public and anonymous: to use them you need to register an account and then create OAuth2 keys, which are used to generate a token to include in every call.

The software product used here is WSO2 API Manager and, from what I could tell, it directly exposes Solr APIs (read-only, of course). There are several APIs, divided by service and presented graphically as a sort of periodic table. It is not immediately clear what they refer to, and the terminology is aimed at people who already know the ecosystem of SBN services. I have no idea what CA (Cataloghi Storici) or IC (ICFE Services) are, and I guessed that AB refers to the Anagrafe Biblioteche. What I'm interested in, though, is SB, SBN Integrato.

ICCU API Portal

Each API obviously has its own calls and response types. Ready-made SDKs are available in Java and JavaScript. For my work I preferred to start writing a library in Go: you can find it here: https://github.com/atomotic/iccu. It is not a complete SDK, it's still a very spartan module, and I may complete it over time.

Harvesting the documents with a digital object

To explore the catalogue I immediately abandoned the idea of interpreting the API responses in real time: I decided to save all the data locally and parse it later. I saved the responses in an extremely simple SQLite database: a doc field of type json holding the raw JSON returned by the API, and a bid column populated automatically from the UNIMARC 003 field:

CREATE TABLE sbn (
        bid TEXT GENERATED ALWAYS AS (json_extract(doc, '$.unimarc.fields[1].003')) VIRTUAL,
        doc json
);

CREATE INDEX bid_idx on sbn(bid);

The API call is the following; the relevant arguments are presenza_digitale=Y and detail=full (otherwise you get a minimal object that does not include the full UNIMARC).

GET https://api.iccu.sbn.it/sbn/1.0.0/search
	format=json
	detail=full
	page-size=500
	presenza_digitale=Y

Here is a complete example of the response for a single record, RAV0302299 (http://id.sbn.it/bid/RAV0302299).
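
For illustration, here is a minimal Go sketch of a single call to the search endpoint. It is not the sbn-metadata-fetch code: the Authorization: Bearer header, the ICCU_TOKEN environment variable and the User-Agent string are my assumptions, while the query parameters are the ones shown above.

// Minimal sketch of one search call; real harvesting needs paging and error
// handling, and the raw JSON body would be stored in the SQLite doc column.
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

func main() {
	token := os.Getenv("ICCU_TOKEN") // OAuth2 token generated from the dev portal keys

	q := url.Values{}
	q.Set("format", "json")
	q.Set("detail", "full")
	q.Set("page-size", "500")
	q.Set("presenza_digitale", "Y")

	req, err := http.NewRequest("GET", "https://api.iccu.sbn.it/sbn/1.0.0/search?"+q.Encode(), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("User-Agent", "sbn-digital-harvest-example") // always identify yourself

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status, len(body), "bytes")
}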

I used a fairly large page size, 500 documents per response. Increasing the number of documents returned reduces the number of HTTP calls and can speed up the harvest enormously; the problem is that some documents contain encoding errors and the JSON returned is not valid. When you hit one of those, you lose the content of the documents in that page window: it happened in my analysis too, and I neither investigated further nor tried to implement more robust parsing. I lost a few thousand documents, which is an acceptable margin of error.

The result is a database of 936,500 rows, weighing 4.7 GB. I won't distribute this database publicly (the licence terms for this data are not clear to me), but if anyone is interested I'll share it.

As with scraping, the usual rules of good conduct also apply when using APIs: limit the aggressiveness and rate of your calls, and always identify yourself in the User-Agent header of your HTTP calls (even though these APIs require a token, so I assume the origin of every activity is always traceable).

The code used for the harvest is available here: https://github.com/atomotic/iccu/cmd/sbn-metadata-fetch

Analysing and exploring the metadata

I naively thought that SQL queries over the JSON field in the SQLite database would be enough to explore these data; unfortunately the lack of a schema and the way some data are modelled make it hard to do everything in SQL, so I had to write methods on my Go type implementing some logic over these data.
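
As an example of the kind of helper I mean, here is a sketch (not the actual code of the library) of a function that normalizes a link value which sometimes packs several URLs into a single string separated by " | ", one of the quirks discussed further below; the URLs are made up:

package main

import (
	"fmt"
	"strings"
)

// splitLinks normalizes a link value: some records pack several URLs into one
// string separated by " | ", so split on the pipe, trim whitespace and drop
// empty entries.
func splitLinks(raw string) []string {
	var out []string
	for _, part := range strings.Split(raw, "|") {
		if s := strings.TrimSpace(part); s != "" {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(splitLinks("http://example.org/a | http://example.org/b"))
	// output: [http://example.org/a http://example.org/b]
}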

I am not interested in ALL the available metadata, only in a reduced subset: what I need are the links to the digitized objects more than the metadata themselves. From the transformation of the source metadata I wanted to obtain simplified objects like the following (data such as authors, etc. are deliberately missing).

{
  "bid": "IT\\ICCU\\VIAE\\007373",
  "id": "http://id.sbn.it/bid/VIAE007373",
  "idmanus": "",
  "title": "Risposta apologetica, e critica alle osservazioni, ed alla lettera del molto reverendo padre Cantova della Compagnia di Gesu, stampate in Milano l'anno 1752. Contro a chi ha ultimamente difesa la necessita dell'amor di Dio nel sagramento della penitenza",
  "iiif": [
    "https://jmms.iccu.sbn.it/jmms/metadata/UW01alpnX18_/b2FpOmJuY2YuZmlyZW56ZS5zYm4uaXQ6MjE6RkkwMDk4Ok1hZ2xpYWJlY2hpOlZJQUUwMDczNzM_/manifest.json"
  ],
  "link": [
    "http://books.google.com/books?vid=IBSC:SC000005684",
    "http://books.google.com/books?vid=IBSC:SC000008356",
    "http://teca.bncf.firenze.sbn.it/ImageViewer/servlet/ImageViewer?idr=BNCF0003334533"
  ],
  "type": "Testo",
  "material": [
    "Libro antico"
  ],
  "thumbnails": [
    "https://jmms.iccu.sbn.it/jmms/resource/ad/first/UW01alpnX18_/b2FpOmJuY2YuZmlyZW56ZS5zYm4uaXQ6MjE6RkkwMDk4Ok1hZ2xpYWJlY2hpOlZJQUUwMDczNzM_"
  ],
  "start_date": 1753,
  "end_date": 1753
}

The script https://github.com/atomotic/iccu/cmd/sbn-metadata-transform extracts the data from the SQLite db and generates a file in JSON Lines format (~500 MB). This export is then ready to be loaded into other tools better suited to data analysis, such as Solr or DuckDB.

I preferred to use DuckDB, and this is how I loaded the data:

~ duckdb sbn.duckdb "create table digital as select * from read_json_auto('sbn.jsonl');"
~ duckdb sbn.duckdb

D .schema
CREATE TABLE digital(bid VARCHAR, id VARCHAR, idmanus VARCHAR, title VARCHAR, iiif VARCHAR[], link VARCHAR[], "type" VARCHAR, material VARCHAR[], thumbnails VARCHAR[], start_date BIGINT, end_date BIGINT);

I exported the DuckDB database to Parquet format; it can be downloaded from https://atomotic.github.io/data/sbn.digital.parquet (93 MB).

The Parquet file can be used directly in the DuckDB shell in the browser, without installing anything. You just need to create a table (example):

CREATE TABLE digital AS FROM 'https://atomotic.github.io/data/sbn.digital.parquet';

A few demonstration queries:

Number of documents grouped by type

D SELECT
    type,
    COUNT(*) AS count
  FROM digital
  GROUP BY type order by count desc;
┌───────────────────────────────────┬────────┐
│               type                │ count  │
│              varchar              │ int64  │
├───────────────────────────────────┼────────┤
│ Testo                             │ 506962 │
│ Registrazione sonora musicale     │ 310053 │
│ Risorsa grafica                   │  53829 │
│ Musica manoscritta                │  20721 │
│ Testo manoscritto                 │  19221 │
│ Musica a stampa                   │  11565 │
│ Registrazione sonora non musicale │   7180 │
│ Risorsa cartografica a stampa     │   4483 │
│ Risorsa elettronica               │   1965 │
│ Risorsa cartografica manoscritta  │    406 │
│ Risorsa da proiettare o video     │     72 │
│ Oggetto tridimensionale           │     29 │
│ Risorsa multimediale              │     14 │
├───────────────────────────────────┴────────┤
│ 13 rows                          2 columns │
└────────────────────────────────────────────┘

Number of distinct IIIF manifests and external links

D SELECT COUNT(*) as manifest
    FROM (
        SELECT DISTINCT unnest(iiif)
        FROM digital
    );
┌──────────┐
│ manifest │
│  int64   │
├──────────┤
│   341324 │
└──────────┘
D SELECT COUNT(*) as link
    FROM (
        SELECT DISTINCT unnest(link)
        FROM digital
    );
┌─────────┐
│  link   │
│  int64  │
├─────────┤
│ 1045225 │
└─────────┘

For the external links I wanted to extract the server host and then group them, in order to identify where they come from. I used trurl to parse the URLs; it also flagged several parsing errors, which I ignored as marginal:

~ duckdb --list sbn.duckdb "SELECT DISTINCT TRIM(unnest(link)) AS unique_links FROM digital;" \
    | trurl -f - --get "{host}" --accept-space > urls.txt

The urls.txt file contains the unsorted list of hosts. sort, uniq and wc would be enough to do the counting, but topfew (by none other than Tim Bray!) is much more efficient. Google Books, the Istituto Centrale per i Beni Sonori, and the BNCF's Teca are the predominant sources.

~ topfew -n 30 urls.txt

363190 books.google.com
312041 opac2.icbsa.it
134072 teca.bncf.firenze.sbn.it
58043 www.internetculturale.it
46714 books.google.it
12614 www.braidense.it
8558 www.bibliotecamusica.it
6290 www.widejef.com
6091 www.bdl.servizirl.it
5020 archive.org
4284 www.14-18.it
4276 corago.unibo.it
3772 www.google.it
3574 sbn.comune.eboli.sa.it
3562 www.cmarchiviodigitale.com
3177 digiteca.bsmc.it
3103 www.polodigitalenapoli.it
2602 www.aggiornamentisociali.it
2330 hdl.handle.net
2304 www.proquest.com
2280 atena.beic.it
1879 www.fondazionecircoloartistico.it
1698 badigit.comune.bologna.it
1546 doi.org
1431 digital.fondazionecarisbo.it
1431 5.175.50.107
1311 www.omeka.unito.it
1274 www.byterfly.eu
1196 www.repubblicaromana-1849.it
1164 turismo.comune.sanginesio.mc.it

Among the hosts there are some bizarre things: many bare IP addresses and also several files linked from Google Drive (linking objects in a catalogue from a file storage service seems like a terrible idea to me).

~ grep drive.google urls.txt | wc -l
467

Even worse, there are also several links to Facebook. At the same time I am surprised that there are no links to Wikisource or Wikimedia Commons (but I plan to investigate further).

Issues encountered

The problems I ran into are not technical problems with the APIs, but concern the modelling of the metadata:

  1. The structure is not uniform. There is a unimarc object that is a JSON representation of the UNIMARC XML (not very convenient to parse, but fine as it is), while a number of accessory fields live outside that object (for example the IIIF manifests), along with other data that duplicate information already contained in the UNIMARC. I suspect these data are there to make access easier. I think it is normal, in any case, for a long-lived database like SBN to have to add accessory fields as the need arises.

  2. Some values are not complete: for example the IIIF manifests only carry the path, and the host is always missing. With some heuristics I managed to reconstruct it, but it would be better if values were always complete. In other cases I noticed that some fields contain multiple values joined with a separator character: this is the case for the external links, which are sometimes separated by " | ".

  3. Location of the digital object. As far as I understand there can be two kinds: IIIF manifests, which are also displayed with a viewer directly in the web catalogue, or links to external pages (and a record can have both manifests and links). The manifests are given in fields at the top level of the object: there are dig_cover, dig_manifest, dig_preview and dig_preview_URL, and the redundancy among them is not always clear to me. The external links, on the other hand, are given in the unimarc object, in 899.u or other fields.

  4. Some vocabularies use single letters (for example in the type and material fields). These vocabularies are poorly documented; in such cases it would be better to use a (resolvable!) URI pointing to a documentation page (a possible lookup table is sketched after this list). Example:

    One-character code for the document type: a=Testo b=Testo manoscritto c=Musica a stampa d=Musica manoscritta e=Risorsa cartografica a stampa f=Risorsa cartografica manoscritta g=Risorsa da proiettare o video i=Registrazione sonora non musicale j=Registrazione sonora musicale k=Risorsa grafica l=Risorsa elettronica r=Oggetto tridimensionale m=Risorsa multimediale
    ---
    One-character code for the material type: v=Audiovisivi c=Cartografia g=Grafica A=Libro antico N=Libro moderno M=Musica
    
  5. There is no schema: this is the biggest problem. I had to proceed by trial and error and heuristics to parse those responses, and I am sure I have not identified all the possible cases or error conditions. Metadata absolutely need schemas, against which validation and constraints can be enforced. Several possible technologies exist, of varying complexity: JSONSchema, Avro, Protobuf. I think a good JSONSchema would be enough to start with. There are also some newer things such as PKL or CUE, so far never used for metadata serialization, which I find interesting and which the digital libraries world could start to evaluate.
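
To make point 4 concrete, the document-type codes listed above can be expressed as a simple lookup table in Go. This is a sketch of exactly the kind of mapping that currently has to be reverse-engineered from the data, and that a resolvable vocabulary URI or a published schema would make unnecessary:

package main

import "fmt"

// documentType maps the one-character document-type codes listed above to
// their labels, as reverse-engineered from the data; labels are kept in the
// original Italian because those are the values that appear in the catalogue.
var documentType = map[string]string{
	"a": "Testo",
	"b": "Testo manoscritto",
	"c": "Musica a stampa",
	"d": "Musica manoscritta",
	"e": "Risorsa cartografica a stampa",
	"f": "Risorsa cartografica manoscritta",
	"g": "Risorsa da proiettare o video",
	"i": "Registrazione sonora non musicale",
	"j": "Registrazione sonora musicale",
	"k": "Risorsa grafica",
	"l": "Risorsa elettronica",
	"r": "Oggetto tridimensionale",
	"m": "Risorsa multimediale",
}

func main() {
	fmt.Println(documentType["j"]) // Registrazione sonora musicale
}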

Conclusions

Data modelling problems aside, the technological infrastructure of this API product seems to me to work very well. I would love to know whether there are usage statistics or real examples of integration into external catalogues or portals. I also think that the Wikidata world, where several integrations with the SBN catalogue already exist, could benefit from these APIs and make several existing processes faster and more automatic.

DLF Digest: November 2024 / Digital Library Federation

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation. See all past Digests here

 

Hello! We hope those who attended the DLF Virtual Forum enjoyed the panels. November brings Election Day, Thanksgiving, and a round of working group meetings. 

— Team DLF

This month’s news:

  • Now available from the Digital Library Pedagogy Group: #DLFteach Toolkit Volume 4: Critical Digital Literacies, an open-access resource designed to support both information professionals and educators.
  • Now available from the Cultural Assessment Working Group: The Inclusive Metadata Toolkit serves as a centralized guide to the range of inclusive metadata tools and resources currently available to equip practitioners to implement inclusive metadata practices in their day-to-day work.
  • Register: 2024 IIIF Online Meeting, November 12-14.
  • Climate Action Webinar #3: Combatting Climate Anxiety Through Data: December 5, 2024, 3:30 pm – 5:00 pm ET. Learn how curating scientific data orients GLAMR institutions in the public conversation and can help combat climate anxiety through action.
  • Closures: CLIR and DLF offices will be closed for Thanksgiving 11/25 – 11/29. 

 

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Meeting dates are subject to change. Can’t find meeting call-in information? Email us at info@diglib.org. Reminder: Team DLF working days are Monday through Thursday.

  • DLF Born-Digital Access Working Group (BDAWG) Monthly Meeting: Tuesday, 11/05, 2 pm ET/11 am PT
  • DLF Digital Accessibility Working Group: Wednesday, 11/06, 2 pm ET/11 am PT
  • DLF AIG Metadata Working Group: Thursday, 11/07, 1:15 pm ET/10:15 am PT
  • DLF AIG Cultural Assessment Working Group: Monday, 11/11, 2 pm ET/11 am PT
  • DLF AIG Cost Assessment Working Group: Monday, 11/11, 3 pm ET/12:00 pm PT
  • DLF AIG User Experience Working Group: Friday, 11/15, 11 am ET/8 am PT
  • DLF Committee for Equity & Inclusion: Monday, 11/18, 3 pm ET/12:00 pm PT
  • DLF Digital Accessibility Working Group – IT Subgroup (DAWG-IT): Monday, 11/25, 1:15 pm ET/10:15 am PT
  • DLF Climate Justice Working Group: Wednesday, 11/27, 12:00 pm ET/ 9 am PT
  • DLF Digital Accessibility Policy & Workflows Subgroup: Friday, 11/29, 1:00 pm ET/10 am PT

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member organization. Learn more about our working groups on our website. Interested in scheduling an upcoming working group call or reviving a past group? Check out the DLF Organizer’s Toolkit. As always, feel free to get in touch at info@diglib.org

 

Get Involved / Connect with Us

Below are some ways to stay connected with us and the digital library community: 

Contact us at info@diglib.org.

The post DLF Digest: November 2024 appeared first on DLF.

1.5C Here We Come / David Rosenthal

John Timmer's With four more years like 2023, carbon emissions will blow past 1.5° limit is based on the United Nations Environment Programme's report Emissions Gap Report 2024. The "emissions gap" is:
the difference between where we're heading and where we'd need to be to achieve the goals set out in the Paris Agreement. It makes for some pretty grim reading. Given last year's greenhouse gas emissions, we can afford fewer than four similar years before we would exceed the total emissions compatible with limiting the planet's warming to 1.5° C above pre-industrial conditions.
...
The report ascribes this situation to two distinct emissions gaps: between the goals of the Paris Agreement and what countries have pledged to do and between their pledges and the policies they've actually put in place.
Back in 2021 in my TTI/Vanguard talk I examined one of these gaps, the one between the crypto-bros' energy consumption:
The leading source for estimating Bitcoin's electricity consumption is the Cambridge Bitcoin Energy Consumption Index, whose current central estimate is 117TWh/year.

Adjusting Christian Stoll et al's 2018 estimate of Bitcoin's carbon footprint to the current CBECI estimate gives a range of about 50.4 to 125.7 MtCO2/yr for Bitcoin's opex emissions, or between Portugal and Myanmar.
and their rhetoric:
Cryptocurrencies assume that society is committed to this waste of energy and hardware forever. Their response is frantic greenwashing, such as claiming that because Bitcoin mining allows an obsolete, uncompetitive coal-burning plant near St. Louis to continue burning coal it is somehow good for the environment.

But, they argue, mining can use renewable energy. First, at present it doesn't. For example, Luxxfolio implemented their commitment to 100% renewable energy by buying 15 megawatts of coal-fired power from the Navajo Nation!.

Second, even if it were true that cryptocurrencies ran on renewable power, the idea that it is OK for speculation to waste vast amounts of renewable power assumes that doing so doesn't compete with more socially valuable uses for renewables, or indeed for power in general.
Note that the current CBECI estimate shows that Bitcoin's energy consumption has increased 43% since 2021, a 12.7%/yr increase.

Follow me below the fold for more details of the frantic greenwashing, not just from the crypto-bros but from the giants of the tech industry that aims to ensure that:
Following existing policies out to the turn of the century would leave us facing over 3° C of warming.
Luxxfolio wasn't an exception. The latest example of Bitcoin greenwashing comes from Hunterbrook Media:
  • TeraWulf Inc. (NASDAQ: $WULF) brands itself as a “zero-carbon Bitcoin miner” — and claims its commitment to renewable energy will help it land AI data center contracts. But the New York Power Authority, which supplies 45% of the facility’s energy, told Hunterbrook Media: “None of the power that NYPA provides the firm can be claimed as renewable power.”
  • The rest of TeraWulf’s power is sourced from the New York grid, which is less than half zero-carbon, according to the New York Independent System Operator, the organization responsible for managing the state’s wholesale electric marketplace.
  • The only way TeraWulf can legally substantiate its zero-carbon claims is by purchasing renewable energy credits (RECs), according to New York and federal regulators, but a TeraWulf spokesperson confirmed that the company has not done so. “Without the REC, there is no legal claim to the renewable attributes of electricity,” a spokesperson for the New York State Energy Research and Development Authority confirmed in an email to Hunterbrook.
These lies were just the start, Hunterbrook documents lies about most aspects of their business. Note TeraWulf's pivot to AI. In Bitcoin Miners Take Divergent Paths Six Months After Revenue ‘Halving’, David Pan explains that TeraWulf is part of a trend:
Six months after rewards for validating transactions on the Bitcoin network were reduced by half, crypto mining companies are choosing between two divergent paths to remain viable.

Public miners including MARA Holdings, Riot Platforms and CleanSpark are keeping the Bitcoin they produce with the expectation that the digital asset will rise in value. At the same time, an increasing number of companies are spending more on developing data centers that power artificial intelligence applications.
It isn't just the crypto-bros who are apparently lying about using renewables. Back in July Adele Peters revealed that Amazon says it hit a goal of 100% clean power. Employees say it’s more like 22%:
Today, Amazon announced that it hit its 100% renewable electricity goal seven years early. But a group of Amazon employees argues that the company’s math is misleading.

A report from the group, Amazon Employees for Climate Justice, argues that only 22% of the company’s data centers in the U.S. actually run on clean power. The employees looked at where each data center was located and the mix of power on the regional grids—how much was coming from coal, gas, or oil versus solar or wind.

Amazon, like many other companies, buys renewable energy credits (RECs) for a certain amount of clean power that’s produced by a solar plant or wind farm. In theory, RECs are supposed to push new renewable energy to get built. In reality, that doesn’t always happen. The employee research found that 68% of Amazon’s RECs are unbundled, meaning that they didn’t fund new renewable infrastructure, but gave credit for renewables that already existed or were already going to be built.
And in August Amy Castor and David Gerard posted How to fix AI’s ghastly power consumption? Fake the numbers!:
Big tech uses a stupendous amount of power, so it generates a stupendous amount of CO2. The numbers are not looking so great, especially with the ever-increasing power use of AI.

So the large techs want to fiddle how the numbers are calculated!

Companies already have a vast gap between “market-calculated” CO2 and actual real-world CO2 production. The scam works a lot like carbon credits. Companies cancel out power used on the coal/gas-heavy grid in northern Virginia by buying renewable energy credits for solar energy in Nevada.

So in 2023, Facebook listed just 273 tonnes of “net” CO2 and claimed it had hit “net zero” — but it actually generated 3.9 million tonnes.

In practice, RECs don’t drive new clean energy or any drop in emissions — they only exist for greenwashing.

It gets worse. Large techs are already the largest buyers of RECs. So they’re lobbying the Greenhouse Gas Protocol organization to let them report even more ludicrously unrealistic numbers.

RECs currently have to be on the same continent at the same time of day. Amazon and Facebook propose a completely free system with no geographical constraints. They could offset coal power in Virginia with wind power from Norway or India.

This will make RECs work even more like the carbon credit market — where companies can claim hypothetical “avoided” CO2 against actual, real-world CO2.
In Data center emissions probably 662% higher than big tech claims. Can it keep up the ruse? Isabel O'Brien reinforced the message:
Amazon is the largest emitter of the big five tech companies by a mile – the emissions of the second-largest emitter, Apple, were less than half of Amazon’s in 2022. However, Amazon has been kept out of the calculation above because its differing business model makes it difficult to isolate data center-specific emissions figures for the company.

As energy demands for these data centers grow, many are worried that carbon emissions will, too. The International Energy Agency stated that data centers already accounted for 1% to 1.5% of global electricity consumption in 2022 – and that was before the AI boom began with ChatGPT’s launch at the end of that year.

AI is far more energy-intensive on data centers than typical cloud-based applications. According to Goldman Sachs, a ChatGPT query needs nearly 10 times as much electricity to process as a Google search, and data center power demand will grow 160% by 2030. Goldman competitor Morgan Stanley’s research has made similar findings, projecting data center emissions globally to accumulate to 2.5bn metric tons of CO2 equivalent by 2030.

In the meantime, all five tech companies have claimed carbon neutrality, though Google dropped the label last year as it stepped up its carbon accounting standards. Amazon is the most recent company to do so, claiming in July that it met its goal seven years early, and that it had implemented a gross emissions cut of 3%.
Because the tech giants are funnelling vast amounts of cash to Nvidia for hardware to train AIs to, for example, tell people to eat at Angus Steakhouse, put glue on pizza, convince them that black people's IQ is inferior to whites', hallucinate patients' responses to doctors, persuade teens to commit suicide, and so on, they will need lots of power. The smart miners have figured out that their access to lots of power is worth more to the AI bubble than the Bitcoin it could mine, especially since the halvening. The market has figured this out too:
while the shares of the majority of the companies have underperformed Bitcoin’s more than 60% rally this year with future mining revenue constrained, traders appear to be voting which strategy will succeed, with those embracing AI posing the largest gains.

MARA and Riot, two of the largest publicly traded Bitcoin miners and both “hodlers,” have seen their shares slump 20% and 36%, respectively, this year.
On the other hand:
Northern Data AG is examining a possible sale of its crypto mining business to free up funds for expanding its artificial-intelligence operations.

The Frankfurt-listed company, whose main shareholder is stablecoin issuer Tether Holdings Ltd., would use proceeds from the sale of Peak Mining to focus on its AI solutions unit, it said in a statement Monday. Shares of Northern Data jumped as much as 12% on the news, and were up 9.8% as of 12:06 p.m. in Frankfurt.
The big tech companies are desperate for power:
  • They are continuing to burn coal at plants that were due to shut down in Montana, Omaha (Google & Facebook), Utah, Georgia and Wisconsin:
    “This is very quickly becoming an issue of, don’t get left behind locking down the power you need, and you can figure out the climate issues later,” said Aaron Zubaty, CEO of California-based Eolian, a major developer of clean energy projects. “Ability to find power right now will determine the winners and losers in the AI arms race. It has left us with a map bleeding with places where the retirement of fossil plants are being delayed.”
  • Morgan Stanley estimates that:
    The datacenter industry is set to emit 2.5 billion tonnes of greenhouse gas (GHG) emissions worldwide between now and the end of the decade, three times more than if generative AI had not been developed.
  • S&P Global Commodity Insights:
    noted that only 54 gigawatts of the US coal industry is projected to be powered off by 2030 – down 40 percent from a prediction made in July last year. The total number of coal plants retired by 2050 is still expected to be roughly the same, but the pace of retirement from now to the end of the decade will be significantly slower compared to last year's estimates.
    ...
    Coal plants can credit their new lease on life to the datacenter industry, which is expanding and upgrading existing bit barns as well as building new facilities. The age of AI requires lots of energy – Google search powered by AI alone is expected to use ten times the power of a more traditional information request, according to the International Energy Agency's (IEA) January report.
  • Microsoft signed a 20-year contract to restart Three Mile Island:
    Constellation Energy shut down the Unit 1 reactor in 2019 — not the one that melted down in 1979, the other one — because it wasn’t economical. Inflation Reduction Act tax breaks made it viable again, so Constellation went looking for a customer. Microsoft has signed up for 835 megawatts for the next 20 years.
    ...
    Other mothballed nuclear reactors want to restart for data centers, including Palisades in Michigan and Duane Arnold in Iowa. These both shut down because renewables and natural gas were cheaper — but the data centers need feeding.

    TMI Unit 1 should be back online in 2028, going into the strained local grid — so when the AI bubble pops, the clean-ish power will still be there.
  • Google and Amazon have signed deals for Small Modular Reactors (SMRs), and so has Oracle, but:
    Google has signed a deal with California startup Kairos Power for six or seven small modular reactors. The first is due in 2030 and the rest by 2035, for a total of 500 megawatts.

    Amazon has also done three deals to fund SMR development.
    ...
    Only three experimental SMRs exist in the entire world — in Russia, China, and Japan. The Russian and Chinese reactors claim to be in “commercial operation” — though with their intermittent and occasional hours and disconcertingly low load factors, they certainly look experimental.

    Like general AI, SMRs are a technology that exists in the fabulous future. SMR advocates will talk all day about the potential of SMRs and gloss over the issues — particularly that SMRs are not yet economically viable.

    Kairos doesn’t have an SMR. They have permission to start a non-powered tech demo site in 2027. Will they have an approved and economically viable design by 2030?
Of course, the nuclear options won't add CO2 to the atmosphere, but they won't come on line until after we've breached 1.5C. The result is the rapidly increasing "emissions gap" of the large tech companies. But the problem is even worse than it appears. In my EE380 talk I discussed the carbon emissions from Bitcoin's hardware:
Bitcoin's growing e-waste problem by Alex de Vries and Christian Stoll concludes that:
Bitcoin's annual e-waste generation adds up to 30.7 metric kilotons as of May 2021. This level is comparable to the small IT equipment waste produced by a country such as the Netherlands.
That's an average of one whole MacBook Air of e-waste per "economically meaningful" transaction.
Why does Bitcoin generate so much e-waste?:
The reason for this extraordinary waste is that the profitability of mining depends on the energy consumed per hash, and the rapid development of mining ASICs means that they rapidly become uncompetitive. de Vries and Stoll estimate that the average service life is less than 16 months. This mountain of e-waste contains embedded carbon emissions from its manufacture, transport and disposal. These graphs show that for Facebook and Google data centers, capex emissions are at least as great as the opex emissions.
Lindsay Clark's GenAI's dirty secret: It's set to create a mountainous increase in e-waste points out that AI has the same problem:
Computational boffins' research claims GenAI is set to create nearly 1,000 times more e-waste than exists currently by 2030, unless the tech industry employs mitigating strategies.

The study, which looks at the rate AI servers are being introduced to datacenters, claims that a realistic scenario indicates potential for rapid growth of e-waste from 2.6 kilotons each year in 2023 to between 400 kilotons and 2.5 million tons each year in 2030, when no waste reduction measures are considered.
Assuming that the tech giants eventually succeed in generating profits from their massive investments in AI data centers, it is likely that the economic life of Nvidia's hardware is longer than that of Bitmain's mining rigs. But the investment is much bigger, so it is likely that the capex emissions from AI data centers add greatly to the overall climate impact of AI. Even if they never make profits, the capex emissions from the current build-out will still be in the atmosphere.

Interestingly, the mainstream media has started to pay attention. Back in June the Washington Post's Evan Halper and Caroline O'Donovan's AI is exhausting the power grid. Tech firms are seeking a miracle solution reported on the latest shiny object:
So near the river’s banks in central Washington, Microsoft is betting on an effort to generate power from atomic fusion — the collision of atoms that powers the sun — a breakthrough that has eluded scientists for the past century. Physicists predict it will elude Microsoft, too.

The tech giant and its partners say they expect to harness fusion by 2028, an audacious claim that bolsters their promises to transition to green energy but distracts from current reality.
Even if they could "harness fusion by 2028", it would be too late to avoid 1.5C. But no-one has yet built a fusion reactor with a positive power output, so the 2028 claim is obvious BS. Pay attention to their actions not words:
In fact, the voracious electricity consumption of artificial intelligence is driving an expansion of fossil fuel use — including delaying the retirement of some coal-fired plants.
...
The data-center-driven resurgence in fossil fuel power contrasts starkly with the sustainability commitments of tech giants Microsoft, Google, Amazon and Meta, all of which say they will erase their emissions entirely as soon as 2030. The companies are the most prominent players in a constellation of more than 2,700 data centers nationwide, many of them run by more obscure firms that rent out computing power to the tech giants.

“They are starting to think like cement and chemical plants. The ones who have approached us are agnostic as to where the power is coming from,” said Ganesh Sakshi, chief financial officer of Mountain V Oil & Gas, which provides natural gas to industrial customers in Eastern states.
And this month the New York Times' David Gelles' The A.I. Power Grab reported that Nvidia was also pushing the "AI will solve the climate" fantasy:
Nvidia’s chips are incredibly power-hungry. As the company rolls out new products, analysts have taken to measuring the amount of electricity needed to power them in terms of cities, or even countries.

There are already more than 5,000 data centers in the U.S., and the industry is expected to grow nearly 10 percent annually. Goldman Sachs estimates that A.I. will drive a 160 percent increase in data center power demand by 2030.

Dion Harris, Nvidia’s head of data center product marketing, acknowledged that A.I. was creating a huge spike in power usage. But he said that over time, that demand would be offset as A.I. made other industries more efficient.

“There is sort of a myopic view on the data center,” he said, “but not really an understanding that a lot of those technologies are going to be the main way that we’re going to innovate our way to a net-zero future.”
Apart from continuing to burn fossil fuels as fast as they can and signing deals that won't make a difference until after the world has committed to 1.5C, what are the tech giants doing? Just like the crypto-bros, they are greenwashing, and spinning ludicrous futures to prevent current action. Here, for example, is Eric Schmidt:
Eric Schmidt, the former chief executive of Google, recently said that the artificial intelligence boom was too powerful, and had too much potential, to let concerns about climate change get in the way.

Schmidt, somewhat fatalistically, said that “we’re not going to hit the climate goals anyway,” and argued that rather than focus on reducing emissions, “I’d rather bet on A.I. solving the problem.”
Schmidt at Sun
Full disclosure: I reported to Schmidt at Sun Microsystems, and my opinion of him is less negative than that of most of my then peer engineers. But I would not expect him to sacrifice immediate profits for the health of the planet. He is right that “we’re not going to hit the climate goals anyway”, but that is partly his fault. Even assuming that he's right and AI is capable of magically "solving the problem", the magic solution won't be in place until long after 2027, which is when at the current rate we will pass 1.5C. And everything that the tech giants are doing right now is moving the 1.5C date closer.

Celebrating Halloween with Gothic fiction in WorldCat / HangingTogether

We love Halloween at OCLC. Some of us decorate our cubicles. Some of us dress in costume. All of us rejoice in the amazing resources represented in WorldCat that are often read at this time of year. In this post I share with you, my fellow bibliophiles and Gothic fiction fans, a few of my favorite resources available in WorldCat—hopefully at a library near you!

Office cubicle wall decorated for Halloween with the theme "Gothic fiction." The OCLC cubicle of Kate James; photo courtesy of the author

Tales of the Grotesque and Arabesque

This two-volume collection of short stories contains “The Fall of the House of Usher.” Told by an unnamed narrator, this story describes a seemingly haunted house that splits into half after all the members of the Usher family die. The story is an exemplar of Gothic fiction and has been adapted multiple times as a film and television program. The 2023 limited series The Fall of the House of Usher, created by Mike Flanagan, is actually a loose adaptation of multiple Poe stories including “The Fall of the House of Usher,” “The Tell-Tale Heart,” and “The Black Cat.” Tales of the Grotesque and Arabesque also includes several lesser-known Poe stories such as “The Duc de L’Omelette.” Poe is best known for writing horror, but “The Duc de L’Omelette” is humorous. After dying from eating an ortolan, the Duc goes to hell and plays cards with Baal-Zebub, Prince of the Fly.

Only 750 copies of this 1840 publication of Tales of the Grotesque and Arabesque were printed. During the printing run, the type for pages 213 and 219 of volume 2 loosened, causing variations such as some copies having page 213 numbered as 231. Member libraries holding copies of this book include the Newberry Library, National Library of Scotland, and University of Sydney. Harvard University has a copy inscribed by Poe on the front endleaf: “For Miss Anna and Miss Bessie Pedder, from their most sincere friend, The Author.”

Frankenstein

The title page of volume 1 of the 1818 publication of Frankenstein, with the title given as "Frankenstein; or, The Modern Prometheus"; courtesy of the Library of Congress, Rare Book and Special Collections Division

It's well known that this novel is a result of a competition among Mary Shelley, Percy Bysshe Shelley, John Polidori, and Lord Byron. However, it is less known among today’s readers that the first edition, published in 1818, lacked any statement of authorship. The preface was written by Mary’s husband, Percy, and the novel was dedicated to her father, the writer and philosopher William Godwin. Some critics speculated that Percy Bysshe Shelley was the author, and others speculated that the author was a woman. While anonymous novels were not rare in this time period, the British Critic’s harsh review of Frankenstein reveals a contempt for female authorship that Shelley would have anticipated: “The writer of it is, we understand, a female; this is an aggravation of that which is the prevailing fault of the novel; but if our authoress can forget the gentleness of her sex, it is no reason why we should; and we shall therefore dismiss the novel without further comment.” (For more information on anonymous authorship of this time, see the University of Minnesota Press blog.)

Many subsequent editions and adaptations as motion pictures, plays, musicals and comic books dispute the British Critic’s review. That journal ceased publication in 1843, but in what is probably the most recent publication of the novel, Dover Publications published Frankenstein in August 2024. The novel appeals to horror fans with its reanimated monster, but it has broad appeal for any reader who has ever felt like they don’t belong. Member libraries holding copies of the 1818 edition include the British Library and Smith College. The Library of Congress owns a copy and has digitized it for anyone who wants to read it freely online.

Varney the Vampire, or, The Feast of Blood

This horror story, generally attributed to James Malcolm Rymer and Thomas Peckett Prest, was first published as a penny dreadful between 1845 and 1847. It was published as a book in 1847, but sadly I could not find any records in WorldCat for the 1847 print edition. (Catalogers, if you are reading this and your library has a copy of this edition, please contribute your record to WorldCat!) We do have a record for a reprint of the 1847 edition with new prefatory matter, which I have provided the link for above. This is not a classic like Dracula or Frankenstein. In fact, it is more like the 19th-century version of the low-budget horror movie. The 1847 book ran to 232 chapters and 847 pages, with two columns of text on each page. This is because the author was paid by the typeset line. The protagonist is the vampire Sir Francis Varney, and he is the first vampire described in fiction as having sharpened teeth. Perhaps Bram Stoker was inspired by Varney in his description of Dracula.

Whether you celebrate Halloween by Trick or Treating, watching a scary movie, reading a good novel, or attending a costume party, may you have a Happy Halloween!

The post Celebrating Halloween with Gothic fiction in WorldCat appeared first on Hanging Together.

Now Available: Inclusive Metadata Toolkit from the Cultural Assessment Working Group / Digital Library Federation

From the Cultural Assessment Working Group

The DLF Cultural Assessment Working Group (CAWG) is excited to announce the publication of the Inclusive Metadata Toolkit! This toolkit serves as a centralized guide to the range of inclusive metadata tools and resources currently out there, in order to equip practitioners to implement inclusive metadata practices in their day-to-day work.

The toolkit consists of two components:

  1. The Inclusive Metadata Toolkit guide document, which provides context for the listed tools and resources in order to make them easier to use and navigate

  2. The complete Inclusive Metadata Toolkit Resource Directory, which serves as a sortable and filterable directory of inclusive metadata tools and resources to help you wherever your institution is at

We hope the Inclusive Metadata Toolkit Resource Directory can continue to change and grow, providing a living directory as more inclusive metadata tools and resources are created and published over time. Additional resources can be suggested through the Inclusive Metadata Toolkit Suggested Resource & Feedback Form. General feedback or questions are also welcome.

The post Now Available: Inclusive Metadata Toolkit from the Cultural Assessment Working Group appeared first on DLF.

Mapping Openness in Europe: A Regional Meeting with Open Knowledge Foundation / Open Knowledge Foundation

On 10 October 2024 the regional call for Europe for the Open Knowledge Network was held online. The discussion was facilitated by Esther Plomp, the Regional Coordinator for Europe.

Objectives and context of the meeting

The Europe regional call aimed to understand whether it would be helpful to map the connections between existing OKFN network and chapter members, based on a pilot map created by Esther. Discussions led to the conclusion that instead of mapping the network members, it would be more helpful to map the projects on which they are working.

Mapping 

Esther kicked off the call by presenting the pilot map for the European region with a heavy focus on the OKFN and its individual members. She also shared alternative mappings that have been made available by others in the Open landscape: 

As well as overviews related to Open Knowledge such as: 

After a quick review of the pilot map and the available resources, our discussion went in a different direction: it would be more helpful to find synergies between the activities in the different topic areas, rather than mapping individual members. This would result in more focus on concrete activities geared towards international cooperation. A good way forward would be the working groups that were discussed at the OKFN Gathering in Katowice, such as the ‘Open Knowledge Festival 2025’, a ‘toolkit for regional advocacy’, and the ‘Mentorship programme’ for the network. This led to a discussion of the Project Repository as a good example of how the ongoing activities in the network are already mapped. 

Project Repository

The Project Repository is an overview of open projects set up by Network Members, together with other projects promoting openness developed by organisations that are close allies of the Network. During the Europe regional call it became clear that it would be helpful to continue building on the Project Repository and make it easier for Network Members to find, understand and replicate projects. Ultimately, the Project Repository can support outreach and capacity building: when projects have similar goals, the teams can take action together. 

Increase awareness

The Project Repository may increase awareness of projects and help people work together on similar goals. The way the Repository is currently set up may not yet facilitate this collaboration, as it is currently underused by Network members. If the Repository is not widely used amongst network members, it is highly unlikely that all their projects are currently listed. The overview may also become outdated if there are no clear mechanisms to update existing projects. To avoid a flood of information and make it easier to get started, it may be more helpful to list fewer projects that are still active and focus on their progress or replicability for others. There are many benefits to copying existing projects instead of reinventing the wheel. One benefit is that it is easier to get funding for projects if you have a proof of concept. The Project Repository can be especially helpful here.

More information about how successful projects can be replicated

It is currently difficult to determine which projects would be easier to build on for others based on the Project Repository structure. For example, right now both Network projects and external projects are listed in the same colours.

It would be helpful if the Project Repository focused more on projects that could be replicated in other regions. For this, different information is needed than is currently available in the Project Repository. For example:

  • What are the requirements to successfully implement the project?
  • What are the success factors?
  • How can you get started with a small prototype version of the project?
  • How can the existing project support other teams that want to replicate the project? 
  • User stories: how can this project be useful for certain audiences?
  • How does this project link with other projects in the Project Repository? Which Network members were involved?
  • What is the level of a project? Is it part of a larger (international) initiative?

This may require some restructuring to make it clearer which projects are active and which ones are easier to replicate. Additionally, it may be easier to find relevant projects if they are filtered by the problems they aim to address. The possibility of using the Project Repository as an incubator space was also raised. 

Share success stories

One example highlighted by the Switzerland chapter was the Prototype fund, where Switzerland learned from the German experiences. Both teams will co-present their collaboration to inspire others in the next OKFN network call!

Next Steps

To improve awareness of the Project Repository, the next Open Knowledge Foundation Network call on the 26th of November will focus on this topic. 

To move us into action, Esther will kick off the working group on the Mentoring Programme. In the future it will be important to track the progress of these different working groups: what is moving forward and where are we moving? 

Support / Ed Summers

I like Molly White’s idea that the web isn’t just a place for big corporations. It’s a place where we can try new things, and support others that are doing work that helps and inspires us.

The early web was marked by a lot of idealism, which has turned out to have been way off the mark given the degree to which we are exploited online. But not all the web is this way. We have more autonomy and agency than we think. We are able to experiment in ways that big tech can’t. We can wire things together in ways that they won’t, and talk about things that they won’t, and focus on our communities and co-ops and unions in ways that they can’t and won’t. And we can choose to support the people we see building the web this way.

It’s a simple idea, but worth watching the whole talk to fully understand this point she is making.

With that in mind, I thought it would be kind of fun to add a page to my website listing the people and projects I choose to financially support who work on the web. You can find it here.

I was wondering, is there a way to markup the HTML to indicate that I support these projects?

The primary purpose is to communicate this list to other people, so there’s not really a strong use case for marking it up so it could be understood by software. But I do like how tools like StreetPass can discover people in the Fediverse as you browse the web.

I suppose there’s probably some way to cobble something together with schema.org or Microformats, but perhaps a well-known URI for discovery would be helpful?

A Long Time Coming: Building Browse Features for Our Library Catalog / Library Tech Talk (U of Michigan)

A sample Author Browse record showing the main entry for Mark Twain along with other versions he used.

The Author Browse entry for "Twain, Mark, 1835-1910"

When we moved our library catalog from Aleph to Alma in 2020, we left behind the Aleph OPAC (also known as Mirlyn Classic), which we had used as our “legacy” catalog for years even after moving first to a VuFind-based discovery layer (known as VuFind Mirlyn), and then to our current, homegrown Library Search application. This post describes how we built our authority browse features.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 29 October 2024 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by a team of OCLC contributors.

Rebuilding trust in authoritative sources

Two hands cupping a baby fern. Photo by Noah Buscher on Unsplash

Libraries have long been THE trusted source for authoritative information, whether one was trying to settle a bet or compose a doctoral dissertation. Our current era, though, has seen a disconcerting decline in people’s trust in so many institutions and disciplines, from science and the news media to government and organized religion. Helping to both cause and exacerbate the plunge in trust has been our ever-increasing reliance on technology that blurs the boundaries between verifiable truths and outright falsehoods. Because of the library’s traditional position as a reliable source, today’s libraries also have a vital role in the effort to re-establish and enhance the trust that has been dismantled. On 12 November 2024 at 3:00 p.m. Eastern Time, OCLC’s WebJunction will present a free webinar “How do we rebuild trust in authoritative information sources?” Rachel Moran, Senior Research Scientist at the University of Washington (OCLC symbol: WAU) Center for an Informed Public; Jamie Collins, Director of Kentucky’s Marion County Public Library (OCLC symbol: KJ6); and Kristen Calvert, Programs and Events Administrator of Dallas Public Library (OCLC symbol: IGA) in Texas, will lead an hour-long discussion about the problem and suggested paths toward dealing with it.

For the past year, the Center for an Informed Public, “an interdisciplinary research initiative at the University of Washington dedicated to resisting strategic misinformation, promoting an informed society and strengthening democratic discourse” has been working with WebJunction on a multiyear project to create an information literacy program for libraries and their communities nationwide. The webinar registration site includes links to a related wealth of information from the Center for an Informed Public, which is itself well worth exploring. Contributed by Jay Weitz.

DLF Inclusive Metadata Toolkit

Earlier this month, the Digital Library Federation Cultural Assessment Working Group (CAWG), released the Inclusive Metadata Toolkit, a resource to support the work of reparative and inclusive description. The toolkit includes a guide and a resource directory. The guide contains contextual information to support learning, strategic approaches, and implementation. The guide is static, whereas the directory can grow to accommodate more tools and resources. 

I was delighted to see this toolkit, and love how it is structured. I particularly appreciate the set of tools that can be used to interrogate and take action on sets of existing metadata. An area where there is an opportunity to add resources is in developing practices for working with local communities. This will be the focus of a conversation later this month at the ATALM Conference (aka the 2024 International Conference of Indigenous Archives, Libraries, and Museums) with my colleague Mercy Procaccini; Selena Ortega-Chiolero (Museum Specialist, Chickaloon Village Traditional Council); and Melissa Stoner (Native American Studies Librarian, University of California, Berkeley – Ethnic Studies Library, OCLC Symbol: CUY). In a discussion session titled “Opening Doors, Inviting Critique: Indigenizing Metadata Practices,” they will highlight creating meaningful, respectful and reciprocal relationships with communities. Mercy and I have learned so much from discussions with Melissa and Selena and look forward to outcomes and additional knowledge sharing. Contributed by Merrilee Proffitt

Conference centers “Community of Care”

As attendees gather for the Library Assessment Conference (LAC) next week in Portland, OR, they’re invited to help co-create a “Community of Care” that will support the individual and collective needs of the conference. This initiative transcends traditional conference setups by providing spaces and resources that prioritize diverse needs—including sensory comfort, accessibility, and mental well-being. Such measures foster a welcoming, inclusive environment where all participants can fully engage and feel valued. Building on similar intentions at recent ALA conferences, “a community of care is an extension of self-care to remove the burden of navigating problematic systems and harmful cultural norms from the individual. ARL staff and the LAC planning group recognize that facilitating a community of care is one solution for harm reduction and that there is more work to be done to truly disrupt, change, and eliminate systems that perpetuate inequity.” This is a significant step toward ensuring that all participants, regardless of their needs, can experience the conference fully and comfortably.

The term Community of Care really resonated with me as I read through the conference materials. It feels like it is coming from a place of inclusion instead of mere compliance, which they reference. I am excited to attend the conference and see what this commitment to inclusivity and wellness looks like on the ground. Contributed by Brooke Doyle.

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 29 October 2024 appeared first on Hanging Together.

Now Available: #DLFteach Toolkit Volume 4: Critical Digital Literacies / Digital Library Federation

From the Digital Library Pedagogy Group (#DLFteach)

We are excited to announce the release of #DLFteach Toolkit Volume 4: Critical Digital Literacies, an open-access resource designed to support both information professionals and educators.
With a thematic focus on critical digital literacies, Volume 4 of the Toolkit offers adaptable lesson plans and learning objects that help learners develop the skills necessary to consume and create information in a digital landscape, as well as the habits of mind necessary to understand and critique information systems and their underlying power structures. It encourages both skills-based outcomes and contextual thinking, making clear the inequities and structural biases of many digital tools.

Instructors using the Toolkit will learn strategies for promoting inclusivity, accessibility, and digital pedagogy in their teaching practices, helping learners engage thoughtfully with emerging technologies and enact strategies to correct inequities of use and impact.

Please feel free to make the Toolkit your own, and to share it with interested colleagues and professional networks!
Those interested in editing future volumes of the DLFTeach Toolkits can contact Alex Wermer-Colan (alex.wermer-colan@temple.edu).
Shared on behalf of the editors,
Ashley Peterson, Alexandra Solodkaya, Mackenzie Salisbury

The post Now Available: #DLFteach Toolkit Volume 4: Critical Digital Literacies appeared first on DLF.

Panel: The Tech We Want is Political / Open Knowledge Foundation

The Tech We Want Summit took place between 17 and 18 October 2024 – in total, 43 speakers from 23 countries interacted with 700+ registered people about new practical ways to build software that is useful, simple, long-lasting, and focused on solving people’s real problems.

In this series of posts, OKFN brings you the documentation of each session, opening the content generated during these two intense days of reflection and joint work accessible and open.

Above is the video and below is a summary of the topics discussed in:

[Panel 1] The Tech We Want is Political

17 October 2024 – 10:30 UTC

Since the Snowden revelations, citizen efforts have been focused on patching a broken system of surveillance, extractivism of people and the planet, and rights erosion. This conversation will discuss the current state of things and the viability of uniting technical and political efforts to move in a different direction.

Summary

Renata Ávila, CEO of OKFN, moderates a discussion exploring the broad intersections of technology, politics and society. Panellists, including Anita Gurumurthy (IT for Change), Bolaji Ayodeji (DPGA) and Poncelet Ileleji (Jokkolabs Banjul), explore the political implications of technology, the need for a new social paradigm, and the role of public governance and investment in promoting democratic and sustainable technological development.

The conversation spans global examples, including Brazil and Africa, highlighting challenges such as the digital divide and emphasising international cooperation for digital inclusion. A significant part focuses on Africa’s Digital Transformation Strategy 2020-2030, showcasing digital public goods (DPGs) and their role in sustainable development. Discussions include examples from India, such as app-based platforms for women’s empowerment and public AI for oral language assessment.

The panel also explores the crucial balance between technological opportunities and risks in Africa, the importance of digital sovereignty and innovative local solutions, and the need for political will, stakeholder engagement, and sound governance to ensure equitable technological progress.

Read More

Six things that caught my eye / Hugh Rundle

Like many people I often read things and feel the urge to share them with other people. I've experimented with a few different ways to do this, but having a "newsletter" feels like a second job. I set up a section on this blog just for sharing links but I looked at all my open tabs today and decided they might warrant a full blog post. This is my blog, I can break the rules if I want to 😛.

Nearly Universal Principles of Projects (NUPP)

As you might expect from the name, NUPP identifies project management principles rather than trying to provide a toolkit. The principles all seem sensible, and I especially like this comment on planning document templates:

Looking for a “template” is the opposite of doing something based on a purpose.

Vale

Getting involved in more open source software projects has exposed me to the wide range of linting tools for software developers. Long-term readers of this blog may know of my interest in plain text, and I've dreamed recently of a customisable linter for prose writing. Enter Vale, "an open-source, command-line tool that brings your editorial style guide to life".

I spent a little bit of time yesterday setting up configuration for Vale, including downloading a Hunspell Australian English dictionary. Expect me to write more on this topic in a future post!

Jupyter Notebooks that run entirely locally

Normally running Jupyter notebooks requires a special server. Jupyter Lite is a new project leveraging WASM to run Jupyter notebooks from any modern web browser:

The goal is to provide a lightweight computing environment accessible in a matter of seconds with a single click, in a web browser and without having to install anything.

Blacklight Query

Blacklight from The Markup shows how many trackers a given website uses. Try your local university or state-owned media site - you may be surprised. For researchers testing hundreds of sites at a time, the Blacklight interface can be tedious, so they have introduced a command-line tool called Blacklight Query.

This story from 404 Media is about US law but many countries tend to follow where the US leads. This story caught my attention because it exists in the same universe as tech companies blatantly violating copyright laws to build "AI" tools. It seems politicians aren't interested in letting people fix things they own, but would like to help corporations plunder intellectual property.

Governance on Fediverse microblogging servers

This report from Erin Kissane and Darius Kazemi outlines their findings from a research project about governance on medium-to-large sized fediverse servers. I appreciate the thoughtful approach to how this report is presented: there are both web and PDF versions, and they have also provided "suggested reading pathways" rather than a single "executive summary".


When Is a Book Not a Book? / Distant Reader Blog

Question: When is a book not a book? Answer: When it is on a computer.

I draw a strong distinction between the things we call books, and the things we call books when they are saved on a computer. The former are codices, and the latter are digital files. For the most part, codices are akin to collections of pages bound between a pair of covers, and the latter are manifested in formats such as, but not limited to: Portable Document Format (PDF), HTML, epub, etexts, and various word processor files.

Why do I draw a strong distinction between these things? Because they are manifested differently, they lend themselves to different functions. These differences offer various advantages, disadvantages, strengths, and weaknesses. Consequently, one set of these things (books) can be read one way, and the other set of things (digital files) can be read another way.

I have enjoyed using the traditional reading process to read the book by Mortimer Adler called How To Read A Book. The book outlines four over-arching reading processes: 1) elementary, 2) inspectional, 3) analytic, and 4) syntopical. The processes make sense to me.

I have also enjoyed the combined processes of text mining, natural language processing, and machine learning to read books. Moreover, I have created a tool allowing the student, researcher, or scholar to read digital files of narrative text. The tool echoes the processes outlined by Adler, but does them in a digital environment.

Increasingly, academics do not read real books (codices). Instead they increasingly read digital files, and since the formats are inherently different, so must be the processes of reading them. The balance of this essay outlines how...

[This essay was never finished.]

Building a Collection of HathiTrust Items / Distant Reader Blog

While collections are rarely finished, I have finished creating and curating the collection of HathiTrust files. To cut to the chase, I collected and curated approximately 345,000 items. See:

The process required just about every aspect of librarianship:

  • Collections - I needed to articulate and implement a collection management policy. Of the 800,000 items available to me, I wanted only the items that were written in English, described as books, and deduplicated. Deduplication was the most difficult aspect of the problem. In the end, I reduced duplication from 20%-30% to about 2%. I identified 345,000 items to collect.
  • Acquisitions - Given the 345,000 identifiers, the acquisitions process locally cached the items from the Trust's computers. This was easy, but took about 24 hours to complete.
  • Cataloging - Given the 345,000 identifiers, the cataloging process harvested MARC records describing each item and modified them to meet my local cataloging practice. More specifically, pre-coordinated subject headings were converted into simpler FAST headings, two 856 fields were added denoting original/canonical locations and local/cached locations, and local notes were added denoting data format (etext) and collection (HathiTrust). The resulting records were then poured into an open source integrated library system called Koha. (See the sketch after this list.)
  • Stacks maintenance - Given the 345,000 identifiers, the set of plain text files -- OCRed versions of the originals -- was saved on a local Web server; thus, every item has a URL in the "stacks". See: https://distantreader.org/stacks/trust/
  • Public service - The Koha application supports a very simple cataloging interface and a more sophisticated index. The former is easier to use; the latter is more expressive and more fully featured. More importantly, search results point directly to found items. No landing pages. No splash pages. No authentication. Moreover, there are zero links to maintain. Most importantly, the index allows one to create, curate, and use data sets (I call them "study carrels") from search results. Do a search. Download the results. Curate the results to suit your particular research question. Create a data set. Analyse and read the result.
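
To make the cataloging step above more concrete, here is a minimal sketch of that kind of record massaging using the pymarc library (assuming pymarc 5.x). The file names, the URL, and the choice of a 590 local note are hypothetical examples rather than the exact fields and scripts used.

  # a hedged sketch of the cataloging step; assumes pymarc 5.x
  # file names, the URL, and the 590 note are hypothetical examples
  from pymarc import MARCReader, Field, Subfield

  with open('hathitrust.mrc', 'rb') as marc_in, open('local.mrc', 'wb') as marc_out:
      for record in MARCReader(marc_in):

          # add an 856 field pointing at the locally cached plain text version
          record.add_field(Field(
              tag='856',
              indicators=['4', '0'],
              subfields=[Subfield(code='u', value='https://distantreader.org/stacks/trust/example.txt')]))

          # add a local note denoting data format and collection
          record.add_field(Field(
              tag='590',
              indicators=[' ', ' '],
              subfields=[Subfield(code='a', value='etext; HathiTrust')]))

          # save the modified record
          marc_out.write(record.as_marc())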

What is the use case for this whole thing? What is the problem that I'm trying to address? The answer is simple. I'm addressing information overload. Using the index the student, scholar, or researcher is able to:

  1. create large sets of relevant content, such as: the complete works of any given author, a comprehensive set of things published as broadsides, a set of dozens if not hundreds of scholarly articles on a given topic, et cetera
  2. create a study carrel of the results
  3. employ both computer technology and traditional reading techniques to use and understand the content of the carrel

Using these features, I am easily able to:

  • compare and contrast the works of Plato and Aristotle
  • list dozens of definitions of "social justice"
  • observe the ebb and flow of ideas across just about any book
  • observe the ebb and flow of ideas across a collection of books
  • reduce a set of thousands of articles on a given topic to a couple dozen most relevant items

Fun fact: It took me about one month (roughly twenty working days) to do this work. Thus, I did the whole of library processing at an average rate of about 17,000 items/day or, over eight-hour days, about 35 items/minute.

Another fun fact: The computer hosting the library catalog application (Koha) runs on a 2-core computer with 4 GB of RAM and 60 GB of disk space. This is about the size of your desktop computer, if not smaller. It costs me $25/month to keep the catalog up and running. The Distant Reader application -- the tool used to create study carrels -- is much bigger: 60 cores, 200 GB of RAM, and 5 TB of disk storage. The Center for Research Computing hosts the Reader application.

Final fun fact: The whole of the Reader's library holdings is now about .7 million items.

Finished Creating a Collection of Carrels / Distant Reader Blog

I have finished creating and curating a collection of data sets I call "study carrels". †

As data sets, study carrels are intended to be computed against, and they are akin to "collections as data". On average, each study carrel includes about 100 textual items on a given topic, of a particular genre, or by a given author. They can be analyzed ("read") in a myriad of ways, including but not limited to:

  • bibliographics
  • concordancing
  • feature analysis
  • full-text indexing
  • large-language models
  • linked data
  • network graph analysis
  • semantic indexing
  • topic modeling

Through these reading techniques all sorts of research questions can be addressed, and they range from the mundane to the sublime:

  • how big is this collection?
  • how difficult is it to read the items in this collection?
  • what are the most frequent ngrams and named-entities?
  • what is discussed, what do those things do, and how are they described?
  • what words can be used to denote the aboutness of the collection, and what sentences contain those words?
  • what are the latent themes in the collection, and how have those themes ebbed and flowed over time or varied between authors?
  • how is Penelope in Homer's epics similar and different from the main characters of Jane Austen's works?
  • what is the definition of climate change, and how has it been manifested?

My next steps are four-fold: 1) describe the collection in greater detail, 2) describe how the collection can be accessed programmatically or through a Web browser, 3) demonstrate how to model carrels, and 4) address big philosophic questions like what is truth, beauty, honor, and justice.

Here are a few fun facts:

  • the collection includes 3,000 carrels comprised of 315,000 items for a total of 3.5 billion words
  • the largest carrel is on the topic of English literature and is comprised of 72,000,000 words which is equal to 90 Bibles or 280 copies of Moby Dick
  • the content of the carrels comes from repositories such as Project Gutenberg, EarlyPrint, a dataset called CORD-19, and journal articles harvested via OAI-PMH

Finally, I assert the process of reading something online is inherently different from reading something in an analog form. "Duh!" For example, in an analog form we inherently observe the size of a document and state whether it is long or short. Such is not nearly as easy to do in a digital environment. What are you to do? Measure sizes in bytes? Similarly, analog books include all sorts of tools to assist in the reading process: tables of contents, running chapter headings, page numbers, back-of-the-book indexes, maybe annotations written in the margins, etc. Things like this are poorly manifested in the digital environment. On the other hand, the digital environment does include rudimentary find (control-f). If the process of reading is different in different environments, then we need different tools to do our reading. Heck, I use my glasses to help me read. I read with my pencil in hand. Why not use a computer to help me read? Moreover, traditional -- close -- reading does not scale, but distant reading does. For example, how long would it take you to use traditional reading techniques to outline the characteristics of 280 novel-length things? Study carrels are an attempt to address all of these issues.

Peruse the collection at http://carrels.distantreader.org/

Thank you for listening.

† - All that said, collections are never really finished.

Author Interview: Andrew K. Clark / LibraryThing (Thingology)

Andrew K. Clark

LibraryThing is pleased to present our interview with novelist and poet Andrew K. Clark, whose work has been published in The American Journal of Poetry, UCLA’s Out of Anonymity, Appalachian Review, Rappahannock Review, and The Wrath Bearing Tree. Deeply influenced by his upbringing and family history in western North Carolina, Clark received his MFA from Converse College, and made his book debut in 2019, with the poetry collection Jesus in the Trailer. His first novel, Where Dark Things Grow, a work of magical realism set in the Southern Appalachian Mountains in the 1930s, is due out this month from Cowboy Jamboree Press, and is available in our current monthly batch of Early Reviewer giveaways. Clark sat down with Abigail to answer some questions about his new book.

Where Dark Things Grow follows the story of a teenage boy with a troubled home life, who finds something magical and uses it to embark on a course of revenge. How did the story idea first come to you? Did it start with the character of Leo, with the theme of revenge, or with something else?

The novel came from a short story I wrote about my grandfather’s childhood growing up in Southern Appalachia and grew from there. I’ve always been drawn to magical realism and supernatural stories, so I was interested in mixing a sort of hardscrabble Appalachian setting with those more fantastical elements. Initially the story started with Leo, but as I got into the difficulties he faced, I realized he, like all of us, has a choice: to respond to adversity with anger or with resilience. His story is finding his way to resilience after a dark turn toward revenge and violence borne out of his family’s struggles, what he sees happening to missing young women, and a lack of empathy from the community.

Tell us more about wulvers. What are they, where do they come from, and what kinds of stories and traditions are associated with them?

One of the decisions I made early on in writing the novel was that I would use folklore elements from my own cultural heritage, as much as possible. So wulvers come from Scottish folklore. I use them quite differently than they appear in the lore, mixing in elements of horror and even the notion of direwolves from the Game of Thrones books. In Scottish tradition, wulvers are benevolent, and there are stories of them doing things like placing fish in the window sill of families that were struggling, that sort of thing. So in my novel there is a benevolent wulver, but there is also a dark, sinister one causing mischief. In the folklore, one thing that stuck with me is the wulvers can walk on their hind legs, much like a human, so mine do this when they want to seem imposing.

What made you decide to set Where Dark Things Grow during the 1930s, at the height of the Great Depression? Is there something significant about that period, in terms of the story you wanted to tell?

My grandparents grew up during the Great Depression in Southern Appalachia, and that period of time has always fascinated me. My grandfather was a story teller in the Appalachian tradition (my people came to Western NC in 1739), so I grew up hearing a lot of stories, including what it was like to grow up in the 1930s. One thing that always interested me is that Asheville is seen as this wealthy Gilded Age kind of place in literature and popular culture, but for my grandparents, the Great Depression brought almost no change to their lives – they were very poor before it started and so they didn’t feel the pain that some did. As a matter of fact, my grandfather would say their lives got better because of the Great Depression because my great grandfather got a job with the TVA. I always knew I wanted to write a story about a teenager growing up in this time period, and that story grew into Where Dark Things Grow.

You have described yourself as deeply rooted in the region of western North Carolina, where your ancestors have lived since before the American Revolution. In what ways has this geographic and cultural background influenced your storytelling? Which parts of your story are universal, and which parts could only happen in Southern Appalachia?

What’s often said about Appalachian writers is that the landscape is often a central character to story. That’s true for Where Dark Things Grow and so I don’t think it could happen anyplace else, in the same way. The major themes of the novel: revenge, the corrupting influence of power, criminal behavior (human trafficking), the struggle between good and evil, friendship and family, are universal and could be present in any setting. I think at the heart of every story is this sense of conflict, and so in that way, even if my reader doesn’t have reference points for Southern Appalachia, they can connect to the story and see themselves in the characters.

Your first book was a collection of poetry, and you have published individual poems in numerous publications. What was it like to write a novel instead? Does your writing process differ, when approaching different genres? Are there things that are the same?

I think one thing I carry to my prose is a focus on the structure and sound of the individual sentence. I always admire a well crafted sentence in a book I’m reading. So in that focus on language, there doesn’t feel to be as much of a difference as one might think. What’s different is that a single poem captures a more singular feeling or scene in the case of a narrative poem. In fiction, scenes build on each other and excavate themes more deeply over time. What I do find is that I feel comfortable with the novel form and the poem form; I am not as comfortable with the in between, short stories, if that makes sense. If I have that little to say, it feels more natural to distill it down into a poem. That said, I love short fiction, and read a lot of short story collections. In some ways a poetry collection or short story collection is a perfect vehicle for our modern attention challenged brains. But I love to get immersed in a world, in the lives of characters, the way I can with a novel. I think I’ll always write both.

What’s next for you? Are you working on more poetry, do you intend to write more novels, or branch out still further?

One thing I am happy about for readers is that my second novel, Where Dark Things Rise, is coming next fall from Quill and Crow Publishing House. It is a loose sequel to Where Dark Things Grow, which was published by Cowboy Jamboree Press. These two novels took about seven to eight years to write, and while the first book is set in the 1930s, the second is set in the 1980s, both in the Asheville / Western North Carolina area. I have started a third novel, which is quite different but also in the horror / magical realism genre. I have some poems assembled for a second poetry collection as well.

Tell us about your library. What’s on your own shelves?

My taste is pretty eclectic. You’ll find a lot of southern fiction by writers like William Gay, Ron Rash, Taylor Brown, Daniel Woodrell, S.A. Cosby, etc. You’ll also find a lot of magical realism novels: Murakami, Marquez, Toni Morrison, Jesmyn Ward, Robert Gwaltney, etc. And of course horror novels by Andy Davidson, Paul Tremblay, Stephen King, Stephen Graham Jones, Nathan Ballingrud, etc. I also have a couple of shelves dedicated to poetry books. Some favorites: Ilya Kaminsky, Kim Addonizio, Jessica Jacobs, Tyree Daye, bell hooks, Anne Sexton, W.S. Merwin, Ada Limón – I could go on and on.

What have you been reading lately, and what would you recommend to other readers?

One of my favorites this year is Taylor Brown’s Rednecks, about the West Virginia mine wars of the 1910s and 1920s. It’s a rich narrative; one of the most compelling historical fiction novels I’ve read. I’d also recommend The Hollow Kind by Andy Davidson, which mixes historical fiction elements, horror, and folklore in a delightful way. The Red Grove by Tessa Fontaine is a 2024 favorite, and definitely has elements of magical realism. For poetry, I’m really digging Bruce Beasley’s Prayershreds right now.

readme.txt / Distant Reader Blog

About Distant Reader Study Carrels
==================================

tl;dnr - Distant Reader study carrels are data sets, and they are
designed to be read by computers as well as people. The purposes of
study carrels are to: 1) address the problem of information overload,
and 2) facilitate reading at scale. See the Distant Reader home page
(https://distantreader.org) for more detail.


Introduction
------------

The Distant Reader and the Distant Reader Toolbox take collections of
files as input, and they output data sets called "study carrels".
Through the use of study carrels, students, researchers, and scholars
can analyze, use, and understand -- read -- large corpora of narrative
text, where "large" is anything from a dozen journal articles to
hundreds of books. Through this process you can quickly and easily
address research questions ranging from the mundane to the sublime:

  * How many items are in this carrel; is the size of this corpus
    big or small?
  
  * What are the things mentioned in this carrel, what do they do,
    and how do they do it?
  
  * In more than a few sentences, what is the content of this
    carrel about? Provide specific examples.
  
  * What are the over-arching themes in the carrel, and how have
    they ebbed and flowed over time?
  
  * What is St. Augustine's definition of love, and how does it
    compare to Rousseau's?

  * How do the sum of writings by Plato and Aristotle compare?
  
The balance of this document outlines the structure of every study
carrel and introduces how to use them.


Layout
------

Study carrels are directories made up of many subdirectories and files.
Each study carrel contains these two directories:

  1. cache - original documents used to create the carrel

  2. txt - plain text versions of the cached content; almost all
     analysis is done against the files in this directory
  
There are additional subdirectories filled with tab-delimited files of
extracted features:

  1. adr - email addresses
  2. bib - bibliographics (authors, titles, dates, etc.)
  3. ent - named-entities (people, organizations, places, etc.)
  4. pos - parts-of-speech (nouns, verbs, adjectives, etc.)
  5. urls - URLs and their domains
  6. wrd - statistically significant keywords
  
Even though none of the files in the subdirectories have extensions of
.tsv or .tab, they are all tab-delimited files, and therefore they can
be imported into any spreadsheet, database, or programming language.
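
For example, the following is a minimal sketch -- assuming Python and
the pandas library -- of loading one of these tab-delimited files. The
path is a hypothetical example, and the column names vary from feature
to feature:

  # load one of a carrel's tab-delimited feature files; the path is hypothetical
  import pandas as pd

  features = pd.read_csv('./pos/homer.pos', sep='\t')
  print(features.columns.tolist())   # list the available columns
  print(features.head())             # peek at the first few rows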

Each study carrel includes two more subdirectories:

  1. figures - where images are saved

  2. etc - everything else; of greatest importance are the
     carrel's stop word list, the bag-of-words representation of the
     carrel, and the carrel's SQLite database file
  
Depending on how the carrel was computed against (modeled), there may be
a number of files at the root of each study carrel, and these files are
readable by a wide variety of desktop applications and programming
languages:

  * index.csv - if the study carrel creation process was augmented
    with a metadata values file (authors, titles, dates, etc.), then
    that file is echoed here

  * index.gml - a Graph Modeling Language file of the carrel's
    author(s), titles, and computed keywords, and useful for
    visualizing their relationships

  * index.htm - an HTML file summarizing the characteristics of the
    extracted features; start here

  * index.json - same as the index.txt file, but in a JSON form

  * index.rdf - bibliographic characteristics encoded in the form
    of the Resource Description Framework, and intended for the
    purposes of supporting the Semantic Web

  * index.tsv - a very, very rudimentary list of characteristics
    denoting whence the carrel came and when

  * index.txt - a bibliographic report in the form of a plain text
    file

  * index.xml - a browsable interface to the study carrel; renders
    much more easily on the Web than on your local computer

  * index.zip - the whole study carrel compressed into a single
    file for the purposes of collaborating, sharing, and downloading


Desktop Applications
--------------------

Study carrels are designed to be platform- and network-independent. What
does this mean? It means two things: 1) no special software is needed to
read study carrel data, and 2) if the study carrel is saved on your
local computer, then no Internet connection is needed to analyze it.
That said, you will want to employ the use of a variety of desktop
applications (or programming languages) in order to get the most out of
a study carrel.


Text Editors

Text editors are not word processors. While text editors and word
processors both work with text, the former are more about the
manipulation of the text, and the latter are more about graphic design.
The overwhelming majority of data found in study carrels is in the form
of plain text, and you will find the use of a decent text editor
indispensable. Using a text editor, you can open and read just about any
file found in a study carrel. That's very important!

A good text editor supports powerful find and replace functionality,
supports regular expressions, has the ability to open multi-megabyte
files with ease, can turn on and off line wrapping, and reads text files
created on different computer platforms. The following two text editors
are recommended. Don't rely on Microsoft Word or Google Docs; they are
word processors.

  * BBEdit (https://www.barebones.com/products/bbedit/)
  * NotePad++ (https://notepad-plus-plus.org/)


Word Cloud Applications

The use of word clouds is often viewed as sophomoric. This is true
because they are too often used to illustrate the frequency of all words
in a text. On the other hand, if word clouds illustrate the frequencies
of specific things -- keywords, parts-of-speech, or named entities --
then word clouds become much more compelling. After all, "A picture is
worth a thousand words."

A program called Wordle is an excellent word cloud program. It takes raw
text as input. It also accepts delimited data as input. The resulting
images are colorful, configurable, and exportable. Unfortunately, it is
no longer supported; while it will run on most Macintosh computers, it
will no longer run (easily) on Windows computers. (I would pay a fee to
have Wordle come back to life and be brought up to date.) If Wordle does
not work for you, then there is an abundance of Web-based word cloud
applications.

  * Wordle (https://web.archive.org/web/20191115162244/http://www.wordle.net/)
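
If you would rather script the process, the following is a minimal
sketch using the Python wordcloud library (not Wordle); the input file
is a hypothetical example, such as the plain text of a carrel item:

  # generate a word cloud image from a plain text file; the path is hypothetical
  from wordcloud import WordCloud

  with open('homer.txt') as handle:
      text = handle.read()

  cloud = WordCloud(width=800, height=600, background_color='white').generate(text)
  cloud.to_file('homer-cloud.png')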
  

Concordances

Developed in the 13th century, concordances are among the oldest of
text mining tools. They function like the rudimentary find function you
see in many applications. Think control-f on steroids.

Concordances locate a given word in a text, display the text surrounding
the word, and help you understand what other words are used in the same
context. After all, to paraphrase a linguist named John Firth, "One
shall know a word by the company it keeps." The following is a link to a
concordance application that is worth way more than what you pay for it,
which is nothing.

  * AntConc (https://www.laurenceanthony.net/software/antconc/)
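
If you prefer code to a desktop application, a keyword-in-context
display takes only a few lines of Python. The sketch below is not
AntConc's implementation, and the file name is a hypothetical example:

  # a poor man's concordance; the file name is a hypothetical example
  QUERY = 'whales are'
  WIDTH = 40

  with open('moby.txt') as handle:
      text = ' '.join(handle.read().split())   # flatten all whitespace

  start = 0
  while (found := text.find(QUERY, start)) != -1:
      print(text[max(0, found - WIDTH):found + len(QUERY) + WIDTH])
      start = found + 1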
  

Spreadsheet-Like Applications

The overwhelming majority of the content found in study carrels is in
the form of plain text, and most of this plain text is structured in the
form of tab-delimited text files -- matrices, sometimes called "data
frames". These files are readable by any spreadsheet, database, or
programming language. Microsoft Excel, Google Sheets, or Macintosh
Numbers can import Reader study carrel delimited data, but these
programs are more about numerical analysis and less about analyzing
text.

Thus, if you want to do analysis against Reader study carrel data, and
if you do not want to write your own software, then the use of an
analysis program called OpenRefine is highly recommended. OpenRefine
eats delimited data for lunch. Once data is imported, OpenRefine
supports powerful find and replace functions, counting and tabulating
functions, faceting, sorting, exporting, etc. While text editors and
concordances supplement traditional reading functions, OpenRefine
supplements the process of understanding study carrels as data.

  * OpenRefine (https://openrefine.org/)


Topic Modeling Applications

Topic modeling is a type of machine learning process called
"clustering". Given an integer (I), a topic modeler will divide a corpus
into I clusters, and each cluster is akin to a theme. Thus, after
practicing with a topic modeler, you can address questions like: what
are the things this corpus is about, to what degree are themes
manifested across the corpus, and which documents are best represented
by the themes? After supplementing the corpus with metadata (authors,
titles, dates, keywords, genres, etc.), topic modeling becomes even more
useful because you can address additional questions, such as: how did
these themes ebb and flow over time, who wrote about what, and how is
this style of writing different from that style?

The venerable MALLET application is the grand-daddy of topic modeling
tools, but it is command-line driven. On the other hand, a program
called Topic Modeling Tool, which is rooted in MALLET, brings topic
modeling to the desktop. Like all the applications listed here, its use
requires practice, but it works well, it works quickly, and the data it
outputs can be used in a variety of ways.

  * MALLET (https://mimno.github.io/Mallet/)
  * Topic Modeling Tool (https://github.com/senderle/topic-modeling-tool)
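
For readers who would rather stay in code, the following is a minimal
topic modeling sketch using the Python gensim library rather than
MALLET; the tiny list of documents is a toy stand-in for a carrel's
plain text files:

  # a minimal gensim sketch; the documents are toy stand-ins for carrel files
  from gensim import corpora, models

  documents = ['whales are mammals of the sea'.split(),
               'ships sail the stormy sea'.split(),
               'mammals nurse their young'.split()]

  dictionary = corpora.Dictionary(documents)               # map words to integers
  corpus = [dictionary.doc2bow(doc) for doc in documents]  # bag-of-words vectors
  model = models.LdaModel(corpus, num_topics=2, id2word=dictionary)

  for topic in model.print_topics():
      print(topic)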
  

Network Analysis Applications

Texts can be modeled in the form of networks -- nodes and edges. For
example, there are authors (nodes), there are written works (additional
nodes), and specific authors write specific works (edges). Similarly,
there are works (nodes), there are keywords (additional nodes), and
specific works are described with keywords (edges). Given these sorts of
networks you can address -- and visualize -- all sorts of questions: who
wrote what, what author wrote the most, what keywords dominate the
collection, or what keywords are highly significant (central) to many
works and therefore authors?

Network analysis is rooted in graph theory, and it is not a trivial
process. On the other hand, a program called Gephi makes the process
easier. Import one of any number of different graph formats or
specifically shaped matrices, apply any number of layout options to
visualize the graph, filter the graph, visualize again, apply clustering
or calculate graph characteristics, and visualize a third time. The
process requires practice, some knowledge of graph theory, and an
aesthetic sensibility. In the end, you will garner a greater
understanding of the content in your carrel.

  * Gephi (https://gephi.org)
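
The carrel's index.gml file can also be inspected without Gephi. The
following is a minimal sketch using the Python networkx library, and it
assumes an index.gml file is in the current directory:

  # read a carrel's index.gml and list the most connected nodes
  import networkx as nx

  graph = nx.read_gml('index.gml')
  print(graph)                              # number of nodes and edges

  centrality = nx.degree_centrality(graph)  # which nodes touch the most edges?
  ranked = sorted(centrality.items(), key=lambda item: item[1], reverse=True)
  for node, score in ranked[:10]:
      print(round(score, 3), node)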


Command-Line (Shell) Interface
------------------------------

The Distant Reader and its companion, the Distant Reader Toolbox, are
implemented as a set of Python modules. If you have Python installed,
then from the command line you can install the modules -- the Toolbox:

  pip install reader-toolbox

If you are a developer, then you may want to use GitHub to install from
source:

  git clone https://github.com/ericleasemorgan/reader-toolbox.git
  cd reader-toolbox
  pip install -e .

Once installed, you can run variations of the rdr ("reader") command.
For example, running rdr without any arguments returns a menu:
  
  Usage: rdr [OPTIONS] COMMAND [ARGS]...

  Options:
    --help  Show this message and exit.
  
  Commands:
    about          Output a brief description and version number of the...
    adr            Filter email addresses from <carrel>
    bib            Output rudimentary bibliographics from <carrel>
    browse         Peruse <carrel> as a file system
    build          Create <carrel> from files in <directory>
    catalog        List study carrels
    cluster        Apply dimension reduction to <carrel> and visualize the...
    collocations   Output network graph based on bigram collocations in...
    concordance    A poor man's search engine
    documentation  Use your Web browser to read the Toolbox (rdr) online...
    download       Cache <carrel> from the public library of study carrels
    edit           Modify the stop word list of <carrel>
    ent            Filter named entities and types of entities found in...
    get            Echo the values denoted by the set subcommand
    grammars       Extract sentence fragments from <carrel> as in:
    info           Output metadata describing <carrel>
    ngrams         Output and list words or phrases found in <carrel>
    notebooks      Download, list, and run Toolbox-specific Jupyter Notebooks
    play           Play the word game called hangman
    pos            Filter parts-of-speech, words, and lemmas found in <carrel>
    rdfgraph       Create RDF (Linked Data) file against <carrel>
    read           Open <carrel> in your Web browser
    readability    Report on the readability (Flesch score) of items in...
    search         Perform a full text query against <carrel>
    semantics      Apply semantic indexing against <carrel>
    sentences      Given <carrel> save, output, and process sentences
    set            Configure the location of study carrels, the subsystem...
    sizes          Report on the sizes (in words) of items in <carrel>
    sql            Use SQL queries against the database of <carrel>
    summarize      Summarize <carrel>
    tm             Apply topic modeling against <carrel>
    url            Filter URLs and domains from <carrel>
    web            Experimental Web interface to your Distant Reader study...
    wrd            Filter statistically computed keywords from <carrel>
    zip            Create an archive (index.zip) file of <carrel>

Use the rdr command to build study carrels and do analysis against them.
For example: 1) create a directory on your desktop and call it
"practice", 2) copy a few PDF files into the directory, 3) open your
terminal, 4) change directories to the desktop, and 5) run the following
command to create a carrel named "my-first-carrel":

  rdr build my-first-carrel practice -s

Once you get this far, you can run many other rdr commands:

  * rdr info my-first-carrel
  * rdr bib my-first-carrel
  * rdr concordance my-first-carrel
  * rdr tm my-first-carrel
  
For more detail, run the rdr command with the --help flag. See also the
documentation: https://reader-toolbox.readthedocs.io/

Power-user hints: The output of many rdr commands is designed to be
post-processed by the command line shell. For example, suppose you have
a study carrel named "homer", then the following command will display
the results of the bib command one screen at a time:

  rdr bib homer | more

A carrel's bibliographics can also be output as a JSON stream, and by
piping the output to many additional commands, you can create a prettier
bibliography:

  rdr bib homer -f json              | \
  jq -r '.[]|[.title,.summary]|@tsv' | \
  sed "s/\t/ -- /"                   | \
  sed "s/$/\n/"                      | \
  fold -s                            | \
  less

Suppose you wanted to extract rudimentary definitions of the word
"whales" from a carrel named "moby", then the following is a quick and
dirty way to get the job done:

  Q='whales are'; rdr concordance moby -q "$Q" -w 60 | sed "s/^.*$Q/$Q/"
  
Creating a nice list of sentences is similar:

  Q='whales are'; rdr sentences moby -p filter -q "$Q" | \
  sed "s/$/\n/"                                        | \
  fold -s                                              | \
  less -i --pattern="$Q"
  

Write your own software
-----------------------

The Reader Toolbox can also be imported into Python scripts, and
consequently you can combine its functionality with other Python
modules. For example the power-user concordance command, above, can be
written as a Python script:

  # configure
  CARREL = 'moby'
  QUERY  = 'whales are'
  WIDTH  = 60
  
  # require
  from rdr import concordance
  from re  import sub
  
  # do the work, output, and done
  lines = concordance(CARREL, query=QUERY, width=WIDTH)
  [print(sub('^.*{}'.format(QUERY), QUERY, line)) for line in lines]
  exit()

Use pydoc to learn more: pydoc rdr


Summary
-------

Distant Reader and the Distant Reader Toolbox take sets of narrative
text as input, and they output data sets called study carrels. The
content of study carrels is intended to be read by computers as well as
people. Study carrels are platform- as well as network-independent, and
therefore they are designed to stand the test of time.

Use desktop software and/or the Reader Toolbox to build, download,
search, browse, peruse, investigate, and report on the content of study
carrels. Use the extracted features as if they were items found in a
back-of-the-book index, and use them as input to concordances for the
purpose of closer reading. Topic model study carrels to enumerate latent
themes and address the question, "How do themes ebb and flow over time?"
Import the index.gml files into Gephi (or any other network graph
application) to visualize how authors, titles, and dates are related.
All of this is just the tip of the iceberg; study carrels can do much
more.

Study carrels are intended to address the problem of information
overload. They make it easier to use and understand large volumes of
text -- dozens of books or hundreds of journal articles. Through the
process you can address all sorts of research questions, and in the end
you will have supplemented your traditional reading process and you will
have been both more thorough and more comprehensive in your research.

The 2024 DLF Forum is a Wrap! / Digital Library Federation

What a Journey! Thank You for Joining Us at the Second DLF Forum of the Year

We’ve just concluded our second DLF Forum of the year, following the in-person event at Michigan State University in July. A heartfelt thank you to everyone who joined us virtually this week!

We were thrilled to welcome nearly 700 digital library, archives, and museum professionals from member institutions and beyond. With over 100 speakers and 35 sessions, including an insightful talk by Featured Speaker Andrea Jackson Gavin, the event was full of valuable discussions and collaborations.

A special thanks to our incredible Program Committee for their hard work in reviewing and selecting sessions for both the virtual and in-person programs, and to our generous sponsors who provided essential support, from technology to coffee breaks and swag. We couldn’t have done it without you!

If you weren’t able to register for the Virtual Forum, here are some ways to see what happened:

Subscribe to the DLF Forum newsletter to hear news and updates about the forthcoming 2025 DLF Forum.

The post The 2024 DLF Forum is a Wrap! appeared first on DLF.

The Tech We DON’T Want: Bring your scary tech story to our Halloween / Open Knowledge Foundation

It started as an inside joke: ‘Why don’t we have a Halloween party with the tech we don’t want?’ We could talk about bugs, bad code, closed and proprietary stacks, disappearing dependencies, PDFs, things we generally hate.

The idea got people excited. And then we thought, ‘Why not open it up to anyone who wants to come?’

So we asked some AI we hate to create a poster (which turned out awful). And here we are.

Next Thursday, October 31st, 11:00 CEST, bring your scary tech story and celebrate Halloween – or Buggyween – at this open meeting with the Open Knowledge Foundation team. We’ve just been inspired by last week’s The Tech We Want Summit, and thought it would be a great opportunity to unload the worst of what we see out there in a session of the opposite spirit.

Let’s make a toast with bad coffee and sweets to the technologies we don’t want (like Zoom!).

Sacred Mountain / Ed Summers

Otherwise known as Dziãgais’â-ní or Sierra Blanca. It is sacred ground for the Mescalero Apache Tribe.

Open Data Commons in the age of AI and Big Data / Open Knowledge Foundation

Text originally published by CNRS, Paris

Earlier this year, the Centre for Internet and Society, CNRS convened a panel at CPDP.ai. The panel brought together researchers and experts of digital commons to try and answer the question at the heart of the conference – to govern AI or to be governed by AI?

The panel was moderated by Alexandra Giannopoulou (Digital Freedom Fund). Invited panelists were Melanie Dulong de Rosnay (Centre Internet et Société, CNRS), Renata Avila (Open Knowledge Foundation), Yaniv Benhamou (University of Geneva) and Ramya Chandrasekhar (Centre Internet et Société, CNRS).

The common(s) thread running across all our interventions was that AI is bringing forth new types of capture, appropriation and enclosure of data that limit the realisation of its collective societal value. AI development entails new forms of data generation as well as large-scale re-use of publicly available data for training, fine-tuning and evaluating AI models. In her introduction, Alexandra referred to the MegaFace dataset – a dataset created by a consortium of research institutions and commercial companies containing 3 million CC-licensed photographs sourced from Flickr. This dataset was subsequently used to train facial-recognition AI systems. She referred to how this type of re-use illustrates the new challenges for the open movement – how to encourage open sharing of data and content while protecting privacy and artists’ rights and preventing data extractivism.

There are also new actors in the AI supply chain, as well as new configurations between state and market actors. Non-profit actors like OpenAI are leading the charge in consuming large amounts of planetary resources as well as entrenching more data extractivism in the pursuit of different types of GenAI applications. In this context, Ramya spoke about the role of the state in the agenda for more commons-based governance of data. She noted that the state is no longer just a sanctioning authority, but also a curator of data (such as open government data which is used for training AI systems), as well as a consumer of these systems themselves. EU regulation needs to engage more with this multi-faceted role of the state.

Originally, the commons held the promise of preventing capture and enclosure of shared resources by the state and by the market. The theory of the commons was applied to free software, public sector information, and creative works to encourage shared management of these resources.

But now, we also need to rethink how to make the commons relevant to data governance in the age of Big Data and AI. Data is most definitely a shared resource, but the ways in which value is being extracted out of data and the actors who share this value is determined by new constellations of power between the state and market actors.

Against this background, Yaniv and Melanie spoke about the role that licenses can continue to play in instilling certain values to data sharing and re-use, as well as serving as legal mechanisms for protecting privacy and intellectual property of individuals and communities in data. They presented their Open Data Commons license template. This license expands original open data licenses, to include contractual provisions relating to copyright and privacy. The license contemplates four mandatory elements (that serve as value signals):

  • Share-alike pledge (to ensure circularity of data in the commons)
  • Privacy pledge (to respect legal obligations for privacy at each downstream use)
  • Right to erasure (to enable individuals to exercise this right at every downstream use)
  • Sustainability pledge (to ensure that downstream re-uses undertake assessments of the ecological impact of their proposed data re-use)

The license then contemplates new modular elements that each licensor can choose from – including the right to make derivatives, the right to limit use to an identified set of re-users, and the right to charge a fee for re-use where the fee is used to maintain the licensor’s data sharing infrastructure. They also discussed the need for trusted intermediaries like data trusts (drawing inspiration from Copyright Management Organisations) to steward data of multiple individuals/communities, and manage the Open Data Commons licenses.

Finally, Renata offered some useful suggestions from the perspective of civil society organisations. She spoke about the Open Data Commons license as a tool for empowering individuals and communities to share more data, but be able to exercise more control over how this data is used and for whose benefit. This license can enable the individuals and communities who are the data generators for developing AI systems to have more say in receiving the benefits of these AI systems. She also spoke about the need to think about technical interoperability and community-driven data standards. This is necessary to ensure that big players who have more economic and computational resources do not exercise disproportionate control over accessing and re-using data for development of AI, and that other smaller as well as community-based actors can also develop and deploy their own AI systems.

All panelists spoke about the urgent need to not just conceive of, but also implement viable solutions for community-based data governance that balances privacy and artists’ rights with innovation for collective benefit. The Open Data Commons license presents one such solution, which the Open Knowledge Foundation proposes to develop and disseminate further, to encourage its uptake. There is significant promise in initiatives like the Open Data Commons license to ensure inclusive data governance and sustainability. It’s now the time for action – to implement such initiatives, and work together as a community in realising the promises of data commons.

Mapping Civil Society Organisations on Open Data in Francophone Africa: A Regional Meeting with Open Knowledge Foundation / Open Knowledge Foundation

On 14 October 2024, between 12:30 and 13:30, a crucial regional meeting for the coordination of French-speaking African countries in the Open Knowledge Network was held online. This virtual meeting brought together various stakeholders in the field of open data in French-speaking Africa, with the main aim of mapping the civil society organisations active in this field. The initiative was spearheaded by Narcisse Mbunzama, the Regional Coordinator for Francophone Africa, who led the presentation and discussions.

Objectives and context of the meeting

The meeting aimed to understand and assess the current landscape of civil society organisations engaged in open data across Francophone African countries. The idea was to create a mapping of these organisations to better understand their activities, structures, missions, as well as the challenges and opportunities they face.

A central point of this discussion was the exploration of the sources of funding for these organisations, as well as their relationships and collaborations with the Open Knowledge Network.

Presentation by Mr Narcisse Mbunzama: An Overview of Open Data

During this session, Mr Narcisse Mbunzama gave a detailed presentation that gave participants a clear view of the current state of open data initiatives in French-speaking African countries. The presentation highlighted a number of civil society organisations already active in this field, while also highlighting the specific dynamics in each country.

It emerged that while some organisations have succeeded in developing innovative and impactful projects, they often face a lack of financial support and recognition at international level. The presentation also highlighted a major challenge: the lack of formal collaboration between these local organisations and the Open Knowledge Foundation, as well as the lack of local chapters and individuals affiliated to the Open Knowledge Foundation in many French-speaking countries.

Challenges and opportunities for civil society organisations

The discussions revealed a number of challenges to the growth and impact of open data initiatives in the region. Some of the key barriers identified include:

  1. Lack of sustainable funding: The majority of civil society organisations rely on one-off funding, which limits their ability to develop long-term projects and make strategic plans for the sustainable development of open data.
  2. Lack of structured collaboration: Participants highlighted the lack of formal links between local organisations and the global Open Knowledge Network. This hinders the spread of good practice in open data.
  3. Lack of awareness of the Open Knowledge Foundation: In many French-speaking African countries, the existence of the Open Knowledge Foundation and its role in promoting open data is not well known. This limits the involvement of local players who could otherwise benefit from this global network.

However, the meeting also highlighted significant opportunities, including:

The rise of local initiatives: Several countries in French-speaking Africa are seeing a surge in innovative initiatives and projects promoting the use of open data in various sectors, such as governance, education and health.

Potential for collaboration: There is a strong desire among local organisations to collaborate and connect with the Open Knowledge Network to share resources, expertise and solutions adapted to local contexts.

Strengthening Collaboration and the Membership Process

A key part of the meeting was devoted to discussing the Open Knowledge Network membership process for organisations and individuals in French-speaking African countries. Mr Mbunzama explained the steps involved in joining the network, which include registering as a member and setting up local chapters to represent the Network in their respective countries.

Setting up local chapters was seen as a crucial step in strengthening the presence and impact of the Open Knowledge Foundation in the region. This would not only support local organisations but also facilitate better coordination and cooperation between open data initiatives across French-speaking African countries.

Next Steps and Future Call for Meetings

At the end of the meeting, it was proposed to issue a new call for a follow-up meeting that would focus on implementing the ideas discussed. This call, the date of which will be announced later, aims to deepen discussions on strategic partnerships and explore practical ways in which local organisations can work with the Open Knowledge Foundation to promote the adoption of open data.

The long-term goal is to build a strong and connected community of open data stakeholders in Francophone Africa, capable of overcoming local challenges while aligning with international standards. This will not only help to increase transparency and access to information in the region, but also promote sustainable development through policies based on reliable data that is accessible to all.

Conclusion

This regional meeting was a significant step towards a better understanding and integration of open data initiatives in French-speaking African countries. It laid the foundations for a more structured collaboration between local organisations and the Open Knowledge Network. By building on this momentum, it will be possible to create a robust and inclusive ecosystem that will support efforts towards transparency, innovation and sustainable development in the region.

Joining the Network and collaborating with the Open Knowledge Foundation is a crucial step for local organisations. They will be able to benefit from global expertise and shared resources to maximise the impact of their initiatives on the ground. The next meeting will be an opportunity to deepen these exchanges and define concrete actions to promote open data throughout the French-speaking region of Africa.

Open Data Editor: Our Open Source Dependency Just Disappeared / Open Knowledge Foundation

As the title says, both the repository and website of ReactDataGrid, an important dependency for our Open Data Editor, have suddenly disappeared—404 errors, DNS not resolving, just gone. Normally, we would create an issue in the repository (which we did), explore alternatives, allocate time and resources, and replace it. However, given the context of The Tech We Want initiative we’re currently running, I’d like to share a few additional thoughts.

Thinking of Open Source as Infrastructure

Interestingly, just a couple of days ago, I watched a conference talk titled Building the Hundred-Year Web Service with htmx by Alexander Petros that explores the analogy between physical infrastructure (bridges) and web pages. Now, this situation feels to me like a bridge in my city has vanished, and here I am in my car, staring at an empty gap, not understanding what happened or how to get to the other side. It feels strange and unexpected, something that shouldn’t happen: how can this bridge that I cross every day not be here anymore? My brain does not compute at the moment.

While I know that dependencies or projects disappearing isn’t the norm, this situation still gives me the unsettling feeling that the open-source ecosystem may not be as stable or reliable as I’d like to believe. I may be overreacting to this one example, but then my thoughts quickly turn to the recent takeover of Advanced Custom Fields and then to the back-and-forth licensing issues with Elasticsearch and, more recently, Redis, to name a few examples (my overthinking could keep going).

I don’t have any clear answers or suggestions at this point, but I am left with a sense of unreliability. One lesson for me here is that just because something is open source and hosted on GitHub doesn’t mean it will always be accessible. Is GitHub becoming a critical piece of the internet infrastructure on which the whole ecosystem relies? I’d say yes. But what are the consequences of that? Is it good or bad? Should we be concerned? Should we panic? Should we design a plan B? I don’t think so, but I do think it’s worth discussing or at least writing these questions somewhere.

And what about the Open Data Editor?

Our goal with The Tech We Want is to promote the creation of software that can endure over time. So, having this happen just before an important release is doubly ironic and funny.

That said, due to recent changes in the project’s goals, we were already planning to migrate to a simpler stack with fewer dependencies and less turbulent release cycles (more on this later). The sudden disappearance of one of our core dependencies only reinforces the idea that we should aim to build simpler, less dependent technologies.

Read more

The Tech We Want Summit: Review the recordings of what was a great community moment in 2024 / Open Knowledge Foundation

💙 What can we say apart from THANK YOU? 💙

The Tech We Want Summit was a great moment in our year, bringing together our beloved community of technologists, practitioners and creators for two days to show that a different technology stack is possible (and we’re already doing it) – one that’s more useful, simpler, more durable and focused on solving people’s real problems.

At the Open Knowledge Foundation, we are grateful and motivated to continue promoting a fair, sustainable, and open future through technology.

Many thanks to the speakers and hundreds of participants from all over the world!

You can view the recordings by clicking on the links below:

Our team is now working on the summit documentation, which will be published in the coming weeks. Each panel will have its video edited with a summary and notes of what was discussed. We’ll be in touch with the community soon about the next steps in this initiative.

Some top-level stats:

🗣 43 Speakers in total
🌐 23 Countries represented
🤓 15 Demos of the tech we want
🌟 711 Participants
📺 14 Hours of live streaming
🤗 13 Content partners

Huge thanks again to our content partners:

2023 NDSA Storage Survey Report Published / Digital Library Federation

The NDSA is pleased to announce the release of the 2023 Storage Infrastructure Survey Report, available at https://doi.org/10.17605/OSF.IO/9QP4W 

From October 24 to November 22, 2023, the 2023 NDSA Storage Infrastructure Survey Working Group conducted a 51-question survey designed to gather information on the technologies and practices used in preservation storage infrastructure. 

This effort builds upon three previous surveys, conducted in 2011, 2013, and 2019. The survey encouraged responses from NDSA and non-NDSA members to gain a broader understanding of storage practices within the digital preservation community. The survey received 138 complete responses, most of them from the United States, but it did have a global reach. The 2023 survey also incorporated two new questions on storage and environmental impact.

Some major takeaways from the report include:

  • The amount of preservation storage required for all managed copies appeared to stabilize relative to previous surveys. Fewer organizations reported higher allocations of storage, but the anticipated need for storage over the next three years remains elevated. 
  • Only 28% of respondents currently participate in a cooperative system – down from 45% in 2019 – and 63% indicate they are not considering a distributed storage cooperative. The use of commercial cloud storage providers rose from 46% in 2019 to 55% in 2023. 
  • Heavy use of an onsite storage element was reported by academic institutions (91%), archives (88%), and government agencies (71%). The report also shows that onsite storage is most often combined with either independently managed offsite storage or commercial cloud storage managed by the organization.
  • The leading offsite storage provider used by 56% of the responding academic institutions is Amazon Web Services. For responding archives, Amazon Web Services (36%) and Preservica (21%) are the most prevalent. Non-profits, museums, historical societies and public libraries use Amazon Web Services 45% of the time.
  • 52% of respondents said their organization is considering their environmental impact during storage planning. 

The Storage Infrastructure Survey is proposed to be conducted every three years, allowing for ongoing tracking and analysis of approaches to preservation storage over time. The next Storage Infrastructure Survey Working Group is scheduled to kick off in 2026. Interested in participating? A call for group members will go out in late 2025 or early 2026.

~ NDSA 2023 Storage Infrastructure Survey Working Group

The post 2023 NDSA Storage Survey Report Published appeared first on DLF.

New OCLC Research report on open access discovery launched / HangingTogether

Our research report on Improving Open Access Discovery for Academic Library Users has just been published. It is a study of strategies to make scholarly, peer-reviewed open access (OA) publications more discoverable for library users. The findings are based on research conducted at seven academic library institutions in the Netherlands. We interviewed library staff about their efforts around OA discovery and surveyed library users about their experiences with OA. The synthesis of these findings provides new insights into the opportunities to improve OA discovery.

From OA availability to discoverability: bridging the gap

Cover of the OCLC Research report titled "Improving Open Access Discovery for Academic Library Users". The cover is an aerial view of a rural Dutch landscape.

From the very beginning we co-designed and carried out the OA discovery study in collaboration with two Dutch academic library consortia—Universiteitsbibliotheken en Nationale Bibliotheek (UKB) and Samenwerkingsverband Hogeschoolbibliotheken (SHB)—which have been, and still are, instrumental in the progress toward full OA to Dutch scholarly publications. Precisely because they were at the forefront of the shift to OA and investing heavily in OA publishing, they had arrived at a point where they wanted to assess the discoverability of OA publications and address the emerging gap between OA availability and discoverability.

This gap was first revealed by findings from the 2018-2019 OCLC Global Council survey of open content activities in libraries worldwide. The results clearly indicated an imbalance in academic library investment: more effort went into making previously closed content open than into promoting the discovery of open content. Yet, most respondents indicated that the latter was equally important to them. Also noteworthy was the near unanimity with which respondents indicated that OCLC had a role in supporting libraries to make open content discoverable. This was an encouraging acknowledgment of the importance of OCLC’s role in the open access ecosystem.

A series of knowledge sharing consultations with the Dutch academic library community in 2021 confirmed this perceived gap and the need to better understand the role of OA in user discovery behavior. As a result, UKB, SHB, and OCLC decided to carry out a research study that would investigate how expectations and behaviors of academic students, teachers, researchers, and professors could inform libraries’ efforts in making OA discoverable. This was the genesis of the Open Access Discovery project.

The making of the OA discovery landscape: libraries have a role to play

Library staff we interviewed described the emergence of a complex landscape for making OA publications discoverable. New players were eagerly staking out their territory while librarians did what they thought was best, but OA publications did not fit in their traditional processes. There were no guidelines, best practices, or benchmarks for adding OA publications to their collections and integrating them into user workflows. Although national collaborations and new processes were in place to create and expose metadata for institutionally authored OA publications, library staff faced challenges with publication deposits and metadata quality.

Our interviewees were not convinced that their efforts were making a difference for their users, but our report shows they were.

While they were correct in believing that the library was not the first place that users searched, the library search page was in the top three most searched systems. Users’ survey responses paint a somewhat confused picture of the role that OA plays in their discovery journey. Respondents did not find OA publications very easy to search for and access, and nearly half reported not knowing much about OA. However, most relied on OA alternatives when they encountered barriers to full-text access. Although OA was not their first consideration, the increasing amount of OA publications downstream affected their processes of discovery, access, and use. These findings led to the following observation in the report:

“Library staff’s outreach and instruction had been primarily focused on increasing users’ awareness of publishing OA. Users needed additional instruction on discovering, evaluating, and using these new types of publications.”

Introducing the report to the Dutch library community

A group of four people shaking hands. Handing over the report to SHB and UKB representatives.

It was with pleasure and pride that Ixchel Faniel and I presented the final report, with findings and main takeaways, to UKB and SHB representatives at the OCLC Contactdag on 8 October 2024, in Amersfoort, the Netherlands. Contactdag is an annual gathering of professionals from Dutch academic and public libraries interested in the latest news about OCLC’s strategic direction and product development. It is also a forum where they share practices and innovative project results.

In my short remarks introducing the OA discovery report, I shared the main takeaway for the Dutch library community as follows:

“If you’re wondering whether your library’s investment in OA discovery is worth it, the answer is a resounding YES!”

The cover of the report—a photo of a Dutch polder landscape—is a nod to the Dutch setting of our research. It also serves as an analogy for the hard work needed to make OA publications discoverable. A polder is created by digging ditches and building dams and dikes to drain tracts of lowland. As I told the audience, similarly to the polder, “there is still much work to be done. OA is still uncharted territory that needs to be explored and cultivated. We cannot afford to sit and watch!”

Next steps: working smarter together

A group of people sit around a table. One person has an open laptop; others are looking at a printed document. Break-out group at the workshop session on improving OA discovery, during the OCLC Contactdag, 8 October 2024

During the afternoon session of the OCLC Contactdag, participants discussed findings, challenges, opportunities, and next steps in break-out groups. Many recognized the dilemmas around OA discovery, as reflected in the report. They also were interested in using the findings to strategize how to proceed with improving OA discoverability.

A recurring theme was the need to collaborate. Participants discussed the potential benefits of working together on selecting OA titles by subject area and increasing users’ awareness of OA resources. They wanted to share practices on exposing institutional metadata, cooperating on metadata harvesting, and partnering with OCLC to improve the quality of metadata. They also talked about greater engagement, on campus and nationally, with recent Diamond OA publishing initiatives to advocate for discovery metadata that worked well both for library workflows and user needs. These ideas illustrate the need for cross-stakeholder collaboration from OA publishing to discovery and align nicely with the closing words from our report:

Truly improving the discoverability of OA publications requires all of the stakeholders involved to consider the needs of others within the lifecycle.

Read the report to learn more about bridging the gap between the availability and discovery of OA publications. https://oc.lc/oa-discovery

The post New OCLC Research report on open access discovery launched appeared first on Hanging Together.

pincushion / Ed Summers

Websites go away. Everything goes away, so it would be kind of weird if websites didn’t too, right? But not all web content disappears at the same rate. Some parts of the web are more vulnerable than others. Some web content is harder for us to lose, because it is evidence of something happening, it tells a story that can’t be found elsewhere, or it’s an integral part of a memory practice that depends on it.

Web archiving is one way of working with this loss. When building web archives, web content is crawled and stored so that a “replay” application (like the Wayback Machine) can make the content accessible as a “reborn digital” resource (Brügger, 2018). But with web archives the people doing this work are typically not the same people who created the content, which can lead to ethical quandaries that are difficult to untangle (Summers, 2020).

Furthermore, as we’ve seen recently with the cyberattack on the British Library, the DDoS attacks on the Internet Archive, and lawsuits that threaten their existence, web archives themselves are also vulnerable single points of failure. Can web applications be built differently, so that they better allow our content to persist after the website itself is no more?

As part of the Modeling Sustainable Futures: Exploring Decentralized Digital Storage for Community-Based Archives project I’ve been helping Shift Collective think about how decentralized storage technologies could fit in with the sustainability of their Historypin platform. This work has been funded by the Filecoin Foundation for the Decentralized Web, so we have naturally been looking at how Filecoin and IPFS could form part of the technical answer here (Voss et al., 2023).

But perhaps a more significant question than what specific technology to use is how memory practices are changing to adapt to the medium of the web, and how much these changes can be guided in a direction that benefits the people who care about preserving their communities’ knowledge. We sometimes call these people librarians or archivists, but as the Records Continuum Model points out, many are involved in the work, including the individual users of websites who have invested their time, energy and labor in adding resources to them (McKemmish, Upward, & Reed, 2010).

For the last 15 years Historypin users have uploaded images, audio and video, and placed them as “pins” on a map. These pins can then be described, organized into collections, and further contextualized with metadata. Unsurprisingly, Historypin is a web application. It uses a server side application framework (Django), a database (MySQL), file storage (Google Cloud Storage), a client side JavaScript framework (Angular), and depends on multiple third party platforms like Youtube, Vimeo and Soundcloud for media hosting and playback.

What does it mean to preserve this assemblage? Historypin is a complex, running system, that is deeply intertwingled with the larger web. How could decentralized storage possibly help here? Can the complexity of the running software be reduced or removed? Can its network of links out to other platforms be removed without sacrificing the content itself?

Taking inspiration from recent work on Flickr Foundation’s Data Lifeboat, and some ideas from their technical lead Alex Chan, we’ve been prototyping a similar concept called a pincushion as a place to keep Historypin content safe, in a way that is functionally separate from the running web application. In an ideal local-first world, our web applications wouldn’t be so dependent on being constantly connected to the Internet, and the platforms that live and die there. But until we get there, having a local-last option is critically important.

The basic idea is that users should be able to download and view their data without losing the context they have added. We want a pincushion to represent a user’s collections, pins, images, videos, audio, tags, locations, comments…and we want users to be able to view this content when Historypin is no longer online, or even when the user isn’t online. Maybe the pincushion is discovered on an old thumbdrive in a shoebox under the bed.

This means that the resources being served dynamically by the Historypin application need to be serialized as files, and specifically as files that can be viewed directly in a browser: HTML, CSS, JavaScript, JPEG, PNG, MP3, MP4, JSON. Once a user’s content can be represented as a set of static files, it can easily be distributed and copied, and opportunities for replicating it using technologies like IPFS become much more realistic.

pincushion is a small Python command line tool which talks to the Historypin API to build a static website of the user’s content. It’s not realistic to expect users to install and use pincushion themselves, although they can if they want. Instead we expect that pincushion, or something like it, will ultimately run as part of Historypin’s system deployment, and will generate archives on demand when a user requests one.
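
To make that concrete, here is a minimal sketch of the kind of loop such a tool runs: fetch a user’s pins from an API and write each one out as a static page with relative links. The endpoint, JSON fields, and markup below are assumptions for illustration, not the real Historypin API or the actual pincushion code.

  # Hypothetical sketch only: the API base URL and JSON field names are
  # assumptions, not the real Historypin API.
  import html
  import pathlib

  import requests

  API = "https://www.historypin.org/en/api"  # assumed base URL

  def build_archive(user_id: str, out_dir: str = "archive") -> None:
      out = pathlib.Path(out_dir)
      (out / "pins").mkdir(parents=True, exist_ok=True)

      # Assumed endpoint shape; the real API will differ.
      pins = requests.get(f"{API}/users/{user_id}/pins.json", timeout=30).json()

      items = []
      for pin in pins:
          title = html.escape(pin.get("title", "Untitled"))
          body = html.escape(pin.get("description", ""))
          page = f"<html><body><h1>{title}</h1><p>{body}</p></body></html>"
          (out / "pins" / f"{pin['id']}.html").write_text(page, encoding="utf-8")
          # Relative links keep the archive working straight off the filesystem.
          items.append(f'<li><a href="pins/{pin["id"]}.html">{title}</a></li>')

      index = "<html><body><ul>" + "".join(items) + "</ul></body></html>"
      (out / "index.html").write_text(index, encoding="utf-8")

Run as part of Historypin’s deployment, a loop like this could be triggered whenever a user requests an archive of their content.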

At this point pincushion is a working prototype, but already a few design principles present themselves:

  1. Web v1.0: A pincushion is just HTML, CSS and media files. No JavaScript framework or asset bundling is used. Anchor tags with relative paths are used to navigate between pages, all of which are static. These pages work when you load them locally from your filesystem, when you are disconnected from the Internet, or when the pages are mounted on the web somewhere…and also from IPFS.
  2. Bet on the Browser: A pincushion archive relies on modern browsers’ native support for video and audio files. The pincushion utility uses yt-dlp at build time to extract media from platforms like Youtube, Vimeo and Soundcloud and persist it as static MP4 or MP3 files (a sketch of this build-time step follows the list below). Perhaps the browser isn’t going to last forever, but so far it has proven to be remarkably backwards compatible as the web has evolved. If the browser goes away, then it’s unlikely we’ll know what HTML, CSS and image files are anymore. Preserving web content depends on evolving and maintaining the browser.
  3. Progressive Enhancement: A pincushion is designed to be viewed locally in your browser by opening an index.html from your file system. You can even do this when you aren’t connected to the Internet. But since a map lets you zoom and pan to any region of the Earth, it’s pretty much impossible to ship complete map data for offline use. So some functionality, like viewing a pin on a map, is only available when the browser is “online”.
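
For the second principle, the build-time media step looks roughly like the sketch below. It uses yt-dlp’s Python interface; the format options and the media_url placeholder are illustrative, not pincushion’s actual configuration.

  # Sketch of extracting a pin's media at build time with yt-dlp.
  # media_url stands in for a pin's Youtube/Vimeo/Soundcloud link.
  import yt_dlp

  def save_media(media_url: str, out_dir: str = "archive/media") -> None:
      options = {
          # Prefer an MP4 the browser can play natively, else the best available.
          "format": "mp4/best",
          # Write files like archive/media/<id>.mp4
          "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
      }
      with yt_dlp.YoutubeDL(options) as ydl:
          ydl.download([media_url])

The resulting local file can then be referenced from the pin’s HTML with an ordinary <video> or <audio> tag, so playback depends only on the browser.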

These pincushion archives can be gigabytes in size, so I don’t want to link to one right here. But perhaps a few screenshots can help give a sense of how this works. Let’s take a look at the archive belonging to Jon Voss, one of Historypin’s founders:

The “homepage” displaying Jon’s collections
A specific collection showing a set of pins
A video pin in a collection
Viewing the pin next to other pins on a map
Other pins tagged with “mission”

So pretty simple stuff right? Intentionally so. In fact the archives load fine off of these:

Thumbdrives with pincushion archives on them for a workshop.

The truth is that this idea of making snapshots of your data available for download isn’t particularly new. Data Portability has been around as an aspirational and sometimes realizable goal for some time. Since 2018 the EU’s General Data Protection Regulation (GDPR) has made it a requirement for platforms operating in the EU to allow their data to be downloaded. This has raised the level of service for everyone. Thanks EU!

Before the GDPR, Twitter set itself apart with a fully functioning local web application, codenamed Grailbird, for viewing a user’s tweets. Similarly, work by Hannah Donovan on the Vine archive, and before that on the This Is My Jam archive (which sadly seems offline now), provided early examples of how web applications could be preserved in a read-only state (Summers & Wickner, 2019).

However, just because you can download the data doesn’t mean it’s easy to use. Some of these archives are only JSON or CSV data files with minimal documentation. Others add only a teensy bit of window dressing that lets you browse to the data files, but doesn’t really let you look at the actual items. Sometimes media files are still URLs out on the live web.

The pincushion tool is a working prototype that will hopefully guide how user data is provided going forward. But we are looking to the Flickr Data Lifeboat project to see if there are any emerging practices for how to create these archive downloads. A few things that we are thinking about:

  1. It would be great to have a client-side search option using Pagefind or something like it.
  2. Can we enhance our HTML files with RDFa or Microdata to express metadata in a machine-readable way? (A sketch of what that markup could look like follows this list.)
  3. What types of structural metadata, such as a manifest, should we include to indicate the completeness and validity of the data?
  4. To what degree does it make sense to include other people’s content in an archive, for example someone’s comments on your pins, or pins that have been added to your collection?
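
On the second question, a pin page annotated with schema.org Microdata might look something like the sketch below, generated here in Python to match the rest of the tooling. The choice of the CreativeWork type and these particular properties is an assumption about how a pin could be described, not a settled mapping.

  # Sketch: wrap a pin's metadata in schema.org Microdata attributes.
  # The CreativeWork type and property names are illustrative choices.
  import html

  def pin_microdata(title: str, description: str, date_created: str) -> str:
      return (
          '<article itemscope itemtype="https://schema.org/CreativeWork">\n'
          f'  <h1 itemprop="name">{html.escape(title)}</h1>\n'
          f'  <p itemprop="description">{html.escape(description)}</p>\n'
          f'  <time itemprop="dateCreated" datetime="{html.escape(date_created)}">'
          f'{html.escape(date_created)}</time>\n'
          '</article>'
      )

Markup like this stays invisible to readers but would let harvesters and future tools recover the title, description, and date without having to parse the page layout.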

References

Brügger, N. (2018). The archived web. MIT Press.
McKemmish, S., Upward, F., & Reed, B. (2010). Records continuum model. In M. Bates & M. N. Maack (Eds.), Encyclopedia of library and information sciences. Taylor & Francis.
Summers, E. (2020). Appraisal talk in web archives. Archivaria, 89. Retrieved from https://archivaria.ca/index.php/archivaria/article/view/13733
Summers, E., & Wickner, A. (2019). Archival Circulation on the Web: The Vine-Tweets Dataset. Journal of Cultural Analytics, 4(2). Retrieved from https://culturalanalytics.org/article/11048-archival-circulation-on-the-web-the-vine-tweets-dataset
Voss, J., Johnson, L., Jules, B., Collier, Z., Brown-Hinds, P., Castle, B., … Summers, E. (2023). Modeling sustainable futures: Proposing a risk assessment and harm reduction model for community-based archives using decentralized digital storage (Shift Collective report, December 2023, p. 25). New Orleans: Shift Collective. Retrieved from https://inkdroid.org/papers/shift-ffdw-2023.pdf

Come Join the 2024 Halloween Hunt! / LibraryThing (Thingology)

It’s October, and that means the return of our annual Halloween Hunt!

We’ve scattered a hauntourage of ghosts around the site, and it’s up to you to try and find them all.

  • Decipher the clues and visit the corresponding LibraryThing pages to find a ghost. Each clue points to a specific page on LibraryThing. Remember, they are not necessarily work pages!
  • If there’s a ghost on a page, you’ll see a banner at the top of the page.
  • You have just two weeks to find all the ghosts (until 11:59pm EDT, Thursday October 31st).
  • Come brag about your hauntourage of ghosts (and get hints) on Talk.

Win prizes:

  • Any member who finds at least two ghosts will be awarded a ghost Badge.
  • Members who find all 12 ghosts will be entered into a drawing for one of five LibraryThing (or TinyCat) prizes. We’ll announce winners at the end of the hunt.

P.S. Thanks to conceptDawg for the ghostly flamingo illustration!

Submitting a Notable Nomination: Suggestions from the Excellence Award Working Group / Digital Library Federation

The National Digital Stewardship Alliance (NDSA) is an organization with a diverse international membership sharing a commitment to digital stewardship and preservation. Its Excellence Awards Working Group (EAWG) is just as diverse and just as committed. Since 2012 this team has come together to select awardees who have demonstrated significant engagement with the theory and practice of long-term digital preservation stewardship at a level of national or international importance. EAWG members understand the importance of innovation and risk-taking in developing successful digital preservation tools and activities. This means that excellent digital stewardship can take many forms; therefore, eligibility for these awards has been left purposely broad.

I started as a member of the EAWG in 2019 and took part in discussions that led to the group’s move to presenting awards biennially in the odd-numbered years, to interleave them with the Digital Preservation Coalition’s Digital Preservation Awards. I have been co-chairing the group since January 2023, and, although the timing for awards may have changed, our standards have not. Any person, any institution, or any project meeting the criteria for any of the Excellence Awards’ six categories can be nominated. Neither nominators nor nominees need to be NDSA members or to be affiliated with member institutions. Self-nomination is accepted and encouraged, as are submissions reflecting responses to the needs or accomplishments of historically marginalized and underrepresented communities. It is truly inspiring to receive the nominations each year and learn about exciting work that is happening in the field of digital stewardship and preservation that we may never have known about otherwise.

Screenshot of spreadsheet for reviewing nominations. Basic spreadsheet shared by Excellence Awards Working Group members to review, discuss, and select awardees.

Award categories are: Individual, Educator, Future Steward, Organization, Project, and Sustainability. The criteria for each category specified on the EAWG webpage will help nominators select the “big bucket” their nominations will best fit, and every nomination must support the specific contributions named with evidence of their significance. Yet individual nominations focus on individual efforts. So, what can a nominator include to encourage EAWG members to recognize the importance of the nominee’s contributions? Let’s look at a few things that can help a nomination stand out.

 

  • Firsts
    • Efforts producing—or even on their way to producing—something absolutely fresh for the field of digital stewardship are worth nominating. This could be work to produce new tools, connections, workflows, methods, strategies, and more. Nominations for the new developments could offer information showing such aspects as: how this output is new; why it is notably original; what its impact or expected impact will be; and what potential it will have for widespread use. Past nominations have included phrases such as “facilitate the creation of a field that is easier, kinder, smarter, and faster,” “establish tangible solutions to put into practice,” “drawing on the collective experience of those in the field,” and “open resources that have been created and shared.”
  • A New Angle on the Known
    • Another perspective on fresh outputs is that of rethinking the known. This work could offer updated preservation formats, updated tools, or even an enhancement for providing access or improving discoverability. Nominations for such work could offer information evidencing: how this update is an improvement; why it is important to the field; what benefit it will provide; and how wide a range of digital stewards can implement it. Nominations for this type of work have included phrases like: “re-thinking this for the next generation,” “ensuring the outputs were shared with the greater community and not created within an academic silo,” “advance future generations of digital stewards,” and “enhancing tools and standards our field has used for decades.”
  • Hot Topics
    • Significant work being done in areas of high interest to the digital stewardship and preservation communities is certainly worth nominating. Recently, such areas of interest have included DEI initiatives, study on the environmental impact of digital stewardship, and the use of artificial intelligence. Nominations reflecting efforts in such areas have incorporated aspects including: multidisciplinary connections, research and training methodologies, the promotion of integrating diverse perspectives, and strategies to increase awareness of a specific digital preservation challenge. Such efforts have been described as “uplifting while educating,” “improving experience for new digital preservationists through work on documentation, information-sharing, and tools development,” and “actively seeks out venues to spread the message.”
  • Widespread Impact
    • Another type of work worthy of nominating is that which will bring a positive impact to a significant portion of the field of digital stewardship. This impact will often include the characteristics of recognized reusability or adaptability and could be seen via open access to code, guides to a topic or practice, or policies that were developed. It could possibly be achieved through outreach activities or collaborations. Nominations describing such work have noted details such as: “demystifying often-challenging material required for working in digital preservation,” “bolsters others offering leadership and growth opportunities,” “informs digital preservation best practices,” “shaped the design and implementation of open-source software,” and “engaged with the preservation community as speakers, writers, and collaborators.”

These are just a few suggestions on nominating your colleagues and their work. There are certainly more areas, perspectives, and outputs that could be recognized. For more ideas, links to announcements for past winners can be found at the bottom of the Excellence Awards Working Group webpage. Remember, there is no perfect nomination expected by the EAWG. All submissions are received, reviewed, and discussed by all group members equally. Working group members realize that this is an opportunity to celebrate the achievements of our colleagues, and the selection has never been easy. Yet during my time with the group, we have ensured that no final selection has been solidified without the unanimous support of the members.

The EAWG will be seeking nominations again next year. Until then, we will be offering other blogs and video clips to help digital stewards and preservationists better understand our work. We also hope this information will encourage them to nominate their colleagues or themselves. We look forward to your submissions! 

Written by Kari May, Excellence Awards Working Group, Co-Chair

 

The post Submitting a Notable Nomination: Suggestions from the Excellence Award Working Group appeared first on DLF.