Having won worldly success, Sam and Fran Dodsworth pursue the dream many couples have: to retire early and travel the world together. It doesn’t work out as they’d hoped. Sinclair Lewis had a similar experience in reverse with what Martin Ausmus called his “most sympathetic yet most savage” novel: he began it while getting a divorce from his first wife, and won the Nobel Prize, pinnacle of critical success, after it came out. The public domain wins Dodsworth in 12 days. #PublicDomainDayCountdown
As we bid farewell to 2024, it’s time to look back on the year that is slowly coming to an end and celebrate the incredible work done across the Open Knowledge Network. This year has been a journey of innovation, resilience, and collaboration, demonstrating the power of openness in fostering global change. I feel very lucky to share a space with all the extremely talented people who make up our Network, and I particularly appreciate our last call of the year, where we share the main achievements that made us proud in 2024 and the exciting plans we have for the year to come.
The Network face-to-face meeting in Katowice, Poland—a powerful moment of connection despite the absence of some members due to visa restrictions and other challenges.
The early testing of Open Data Editor with Network members in November and December.
The Open Knowledge Network’s achievements reflect a collective commitment to openness, innovation, and inclusivity. Across the globe, Network members are addressing pressing issues such as climate change, digital access, and governance with bold, forward-thinking projects. The consistent theme across all regions is the power of collaboration—a cornerstone of our shared vision.
Here are some of the achievements and plans for next year that were shared by other Network members, in alphabetical order:
Armenia: This year, Armenia hosted a successful ODD event, with plans for next year focused on data visualization, air quality, and GLAM (Galleries, Libraries, Archives, and Museums). Their ODD 2025 will spotlight data journalism and cultural heritage.
Bangladesh: Open Knowledge Bangladesh reactivated this year amid political and internet challenges. They engaged young people and focused on Wikimedia-related activities. Their 2025 goals include promoting a national open access policy, expanding the national open data portal, and advancing GLAM projects.
Estonia: Estonia contributed to national public information discussions in 2024 and aims to enhance collaboration with other Network members in 2025.
Ghana: Ghana’s chapter formalized its structure, gained nonprofit certification, and spearheaded projects like Open Goes COP and the Wiki Green Conference. Looking ahead, they aim to build effective communities, sustain the movement, and enhance capacity-building efforts.
Guatemala: This first year in the Network was one of learning for Guatemala, particularly in navigating governmental collaboration. They initiated an NGO to sustain efforts and are awaiting approval to develop digital infrastructure in three municipalities.
Japan: Although less active as a group this year, Open Knowledge Japan made significant progress in discussions about rebuilding their team and planning activities. Open Data Day (ODD) remains a huge success in Japan, a testament to the community’s self-sustaining enthusiasm.
Macedonia: This year was about understanding Open Knowledge operations. Next year, they plan to work on a data visualization project for a local municipality while advocating for open data policies.
Mexico: Mexico’s ODD welcomed 200+ participants, alongside the Gobernantes.info project, which opens public interest data on elected officials. They also developed a spreadsheet course and are collaborating on a mapping course for the humanitarian sector. Upcoming projects include open data on migration within LATAM.
Nepal: This year, Nepal launched the Integrated Data Management System (IDMS), a tool to build open data portals for local governments. They also made strides in reaching rural communities with data initiatives. In 2025, they plan to advance IDMS, refresh the Nepali open data portal, and promote “The Tech We Want” (TTWW) at the OGP Local Meeting in the Philippines. Their wish? Greater collaboration across the Network.
Nigeria: In 2024, Nigeria focused on open data about climate change and its impacts on women, producing a podcast series on the topic. In 2025, they hope to explore how open data and AI can complement each other.
Russia: Despite challenges, activities in Russia persisted, mostly remotely. The focus has been on GLAM, open access, and research-related open data. Plans are underway to create a research-focused open data portal, with ODD 2025 planned as a remote event.
Zambia: This year marked Zambia’s introduction to open knowledge concepts. They’re eager to advance their understanding and involvement in 2025.
Our CEO, Renata Avila, gave an overview of the main highlights from the Foundation’s work in 2024 and our plans and wishes for 2025.
As we step into 2025, The Tech We Want will guide the Open Knowledge Foundation’s strategic focus. We will organise our work around four main pillars:
Defining Ethical Tech: Shaping a framework for open, equitable technology through community collaboration and high-impact advocacy.
Building Open Tools: Expanding tools like the Open Data Editor to foster transparency and utility in data usage.
Innovating Governance Models: Creating sustainable governance mechanisms to support long-term community engagement.
Leading by Example: Adopting the open technologies we promote within our internal operations.
We cannot imagine doing any of this without collaborating with the Network, of course. The power of collaboration is a cornerstone of our vision, and we stand by our motto: better together than alone.
As we close the year and look to the next, we extend our deepest gratitude to every Network member. Together, we will continue to shape a future that is fair, sustainable, and open for all!
2024 marked the twentieth anniversary of the Open Knowledge Foundation (OKFN). The organisation, along with other actors in the movement, found itself actively questioning its role in the years to come. When our organisation was founded, fewer than ten million people were connected to the Internet. Yet a far greater number had digital artefacts in their hands, enabling the creation of programmes and systems and the dissemination of knowledge on a massive scale. Millions already had access to photocopiers, personal computers, CD recorders, floppy disks and control boards. So when connectivity arrived, there were already skilled individuals and communities with an emancipated relationship to digital technologies, and collective, collaborative, transnational projects were flourishing in many places, including ours.
Our initial approach was highly experimental, pioneering many areas in which neither the state nor businesses were yet deeply invested, such as open data systems or a flexible legal architecture for sharing across and beyond standard practices. We have succeeded in shifting values and institutional practices, with hundreds of governments and institutions adopting our legal, technical and social tools and approaches that enable distributed power, greater accountability and civic engagement, and appreciation of a shared culture.
But two decades later, the landscape is very different. We have four times more internet-connected devices collecting data points than people on the planet. Connected devices are controlled by very few, powerful and opaque corporations. Despite increased connectivity, inequalities have risen sharply; wars and climate catastrophes are daily news as the anti-rights, pro-war agenda of the powerful advances rapidly.
We have also changed as a society; our social dynamics with technology are far different from the civic, creative and purposeful first moment of two decades ago. So what is the role of organisations like ours today? How can we focus our efforts on profound change and greater impact, rather than just applying aspirin to today’s problems?
This year, we started answering these questions through The Tech We Want, a new initiative to articulate a positive vision for how and why we build technology, as well as its governance mechanisms, in a participatory way, as we did this year with the Open Data Editor. The tech we want is open, long-lasting, resilient and affordable, good enough to solve people’s problems, sustainable and democratically governed.
The tech we want is a means, a vehicle to open up knowledge, our purpose, to help people understand our world, unlock the latest scientific breakthroughs and tackle global and local challenges, expose inefficiencies, challenge inequality and hold governments and corporations to account. It is a vehicle for sharing power and inspiring innovation and a shared culture.
That’s where we will focus most of our efforts in 2025, working with our communities and allies, present and active on every continent.
We will continue to fight for a knowledge society rather than a surveillance society, one that benefits the many rather than the few and is built on the principles of collaboration rather than control, empowerment rather than exploitation, and sharing rather than monopoly. As we do so, we hope that through the Tech We Want we can count on you, our communities, to continue to break down the barriers to change.
“By… effecting the purposes of governmental supervision by its own internal machinery, the New York Stock Exchange has justified its existence, earned and retained the confidence of the public, and proved itself the most reliable and efficient market place in the world.” So said Robert Irving Warshaw’s The Story of Wall Street, published days before a historic market crash. It goes public domain in 13 days, maybe just in time for a new round of deregulation. #PublicDomainDayCountdown
In the first week of December, we had the opportunity and the pleasure to participate in the Open America event in Brazil, which brings together prestigious international meetings dedicated to the research, publication and use of open data on topics such as transparency, access to information, open government, civic technologies, data journalism, digital government, accountability and equity.
The event was a great success, with hundreds of people in attendance and, we dare say, representatives from every state in the Americas. We are also proud that one of our chapters, Open Knowledge Brazil, was a co-organiser of such an event.
One of the main features of the event was its multidisciplinary nature. It brought together people from civil society, journalism and government. In addition, the climate crisis and the strengthening of democracy were two of the main axes of discussion and debate.
The agenda for the event was huge, so there was a lot to choose from. To continue our Open Goes COP line of work, we decided to prioritise talks related to the climate crisis. Some of the talks were: Climate Change, Just Energy Transition and the Power of Open Data in Building Resilient Societies, Resilient Cities: Collaboration, Data and Openness in the Face of Climate Change, and Data and Natural Disasters. From all of them, we drew the following lessons (which are very much in line with our experiences and actions):
There is a huge lack of tools and capacity.
It is essential to start generating data for disaster response.
An important role for civil society is to provide capacity and perspective to governments in times of crisis.
There is a huge gap (in terms of communication) between activists and citizens. The same is true between government and citizens; government communication needs to be transformed.
Moments of crisis generate voluntary initiatives that inform the state; the challenge now is how to institutionalise such efforts.
The need to include the vision and voices of young people in the movement.
Interdisciplinarity and communication as transversal axes: the role of the Open Data Editor
If we have to highlight two transversal axes of the whole event, they are:
The need for interdisciplinary work
Giving priority to communication efforts (something that was also discussed a lot in the tracks on strengthening democracy).
My presentation of the Open Data Editor at the ‘Open Data Publishing Tools’ table on Thursday 5th December
These axes gave us the starting point for our presentation of the Open Data Editor, as it is a tool born to improve the publication of data in non-technical communities. Interdisciplinarity is a key element in the development of the tool, which we put into practice with our two pilots in organisations whose field of work is not open data. In addition, the second focus of the Open Data Editor is the translation of technical language for non-technical audiences and the publication of a course for all audiences on principles and best practices in data publishing.
We are very happy with the participation and the reception of the tool in the community; we received very interesting feedback and contacts from communities and small organisations interested in the pilot programmes. We have also been approached by a number of activists asking for opportunities to work with the tool.
We are happy that the work we have been doing this year addresses, and contributes to solving, the main needs of a vibrant community like America Aberta. The event left us with many lessons learned and many expectations for the next steps of the Open Data Editor.
This is a dataset that I hydrated in April 2023; it was created by Zachary Maiorana, Pablo Morales Henry, and Jennifer Weintraub. It is the “#metoo Digital Media Collection - Hashtag: whyididntreport”, which is part of the Schlesinger Library #metoo Digital Media Collection. The original dataset contains 852,638 Tweet IDs, and I was able to hydrate 556,253 tweets, giving a hydration rate of 65.24%. The hydrated dataset covers October 15, 2017 through May 20, 2020.
Using whyididntreport-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("whyididntreport-user-info"), and pandas:
#whyididntreport Top Tweeters
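The top-tweeters chart is generated from the user-info export. Roughly, the counting step in pandas looks like the sketch below; the "screen_name" column name is an assumption about what the twut export contains, not a confirmed detail of the actual script.

```python
import pandas as pd

# Minimal sketch: count the most prolific accounts in the user-info export.
# Assumes one row per tweet and a "screen_name" column; adjust names to match
# the actual CSV produced by twut's userInfo export.
df = pd.read_csv("whyididntreport-user-info.csv")
top_tweeters = df["screen_name"].value_counts().head(10)
print(top_tweeters)
```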
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "whyididntreport.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
I waited over 20 years to report my sexual abuser. Because I was 14. Because it was my hero. Because it was my priest. Because I thought I'd be expelled. Because I feared no one would believe me. Because I thought suicide was easier than telling 1 person #WhyIDidntReport
Using whyididntreport-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("whyididntreport-hashtags"), and pandas:
#whyididntreport hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "whyididntreport.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
A couple years ago I created a juxta (collage) of the images from this dataset. It features 26,950 images, and you can check it out here.
Emotion score
Using whyididntreport-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("whyididntreport-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
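For readers curious how such emotion scores can be produced, here is a rough sketch, not the exact script used for these charts. It assumes the text export has a single "text" column and that the Hugging Face transformers library is installed alongside Polars.

```python
import polars as pl
from transformers import pipeline

# Load an off-the-shelf emotion classifier from the Hugging Face Hub.
classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)

# Assumes the exported CSV has a "text" column; sample to keep the sketch quick.
df = pl.read_csv("whyididntreport-text.csv").drop_nulls("text").head(1000)
results = classifier(df["text"].to_list(), truncation=True)

# Attach the top predicted emotion label to each tweet and tally them.
scored = df.with_columns(pl.Series("emotion", [r["label"] for r in results]))
print(scored["emotion"].value_counts())
```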
This is a dataset that I hydrated in April 2023; it was created by Zachary Maiorana, Pablo Morales Henry, and Jennifer Weintraub. It is the “#metoo Digital Media Collection - Hashtag: believesurvivors”, which is part of the Schlesinger Library #metoo Digital Media Collection. The original dataset contains 1,482,343 Tweet IDs, and I was able to hydrate 990,776 tweets, giving a hydration rate of 66.84%. The hydrated dataset covers October 15, 2017 through May 23, 2020.
Using believesurvivors-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believesurvivors-user-info"), and pandas:
#BelieveSurvivors Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "believesurvivors.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
I was raped at Yale. I was groped at parties in DKE’s house—#Kavanaugh’s fraternity at Yale—and was told as a freshman to avoid their “rape basement.” Multiple dear friends were raped by Yale DKE brothers & by boys from elite prep schools.
Using believesurvivors-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believesurvivors-hashtags"), and pandas:
#BelieveSurvivors hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "believesurvivors.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
A couple years ago I created a juxta (collage) of the images from this dataset. It features 81,265 images, and you can check it out here.
Emotion score
Using believesurvivors-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believesurvivors-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
This hunt is meant to celebrate the season of light, and the holidays it brings. We wish all our members a Merry Christmas, Happy Hanukkah, and an entertaining hunt!
We’ve scattered a mint of candles around the site. You’ll solve the clues below to find the candles and gather them all together.
Decipher the clues and visit the corresponding LibraryThing pages to find some candles. Each clue points to a specific page on LibraryThing. Remember, they are not necessarily work pages!
If there’s a candle on a page, you’ll see a banner at the top of the page.
You have almost three weeks to find all the candles (until 11:59pm EDT, Monday January 6th).
Come brag about your mint of candles (and get hints) on Talk.
Win prizes:
Any member who finds at least two candles will be awarded a candle Badge.
Members who find all 15 candles will be entered into a drawing for one of five LibraryThing (or TinyCat) prizes. We’ll announce winners at the end of the hunt.
P.S. Thanks to conceptDawg for the candle illustration! ConceptDawg has made all of our treasure hunt graphics in the last couple of years. We like them, and hope you do, too!
When George Burns first teamed up with Gracie Allen, he wanted to be the comic while she played the straight part. They soon found they got more laughs the other way around. Their first film, Lambchops (now watchable online), adapts one of their vaudeville routines in a fourth-wall-breaking short that’s now in the National Film Registry. The couple enjoyed a long career in film, radio, and TV. With this film, they’ll be together again in the public domain in 14 days. #PublicDomainDayCountdown
The dataset was collected with Documenting the Now’s twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 691,908 Tweets with the search term OscarsSoWhite from January 15, 2015 through April 8, 2023.
Using oscarssowhite-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("oscarssowhite-user-info"), and pandas:
#OscarsSoWhite Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Using oscarssowhite-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("oscarssowhite-hashtags"), and pandas:
#OscarsSoWhite hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
Using oscarssowhite-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("oscarssowhite-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
This is a dataset that I hydrated in April 2023; it was created by Zachary Maiorana, Pablo Morales Henry, and Jennifer Weintraub. It is the “#metoo Digital Media Collection - Hashtag: believewomen”, which is part of the Schlesinger Library #metoo Digital Media Collection. The original dataset contains 691,744 Tweet IDs, and I was able to hydrate 430,512 tweets, giving a hydration rate of 62.24%. The hydrated dataset covers October 15, 2017 through May 25, 2020.
Using believewomen-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believewomen-user-info"), and pandas:
#BelieveWomen Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "believewomen.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Chuck Grassley is up in 2022 Lindsey Graham is up in 2020 John Cornyn is up in 2020 Mike Lee is up in 2022 Ted Cruz is up this year Ben Sasse is up in 2020 Mike Crapo is up in 2022 Thom Tillis is up in 2020 John Kennedy is up in 2022
That was my Mom. Sometimes the people we love do things that hurt us without realizing it. Let’s turn this around. I respect and #BelieveWomen . I never have and never will support #HimToo . I’m a proud Navy vet, Cat Dad and Ally. Also, Twitter, your meme game is on point. pic.twitter.com/yZFkEjyB6L
Our government has today reaffirmed their stance that they don’t give a shit about women. We men need to remain vocal af about the fact that we do indeed care about and recognize and respect you women and that we will continue to fight for decency for all Americans. #BelieveWomen
Using believewomen-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believewomen-hashtags"), and pandas:
#BelieveWomen hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "believewomen.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
A couple years ago I created a juxta (collage) of the images from this dataset. It features 28,594 images, and you can check it out here.
Emotion score
Using believewomen-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believewomen-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
Maurice Ravel abandoned an orchestration of a Spanish composer’s work when he found out it had copyright complications. He instead started developing a theme based on Spanish dance music that he found had an “insistent quality”. With its much-repeated rhythm and theme, Boléro became one of Ravel’s best-known works. In 15 days it joins the US public domain in several versions published in 1929, for piano solo, piano duet, symphonic orchestra, and jazz orchestra. #PublicDomainDayCountdown
In each case, Nvidia is the best performing stock, and it is the only stock to appear in all four periods. Sounds great, doesn't it? Why wouldn't you just hold NVDA all the time and be guaranteed to beat the market?
But follow me below the fold for more detail from someone who has been long NVDA for more than three decades.
Alas, this infographic is deeply misleading because they cherry-picked their data. Nvidia's stock price is extraordinarily volatile. The log plot of NVDA shows that on average over its history every three years it suffers a drop of between 45% and 80%. Fortunately, over the same time it has had much larger rises. Thus when discussing the return for being long NVDA for a period the start and end dates matter a lot.
Dec (start) | Close | Dec (end) | Close | NVDA | S&P
2004 | 0.19 | 2009 | 0.38 | +100% | -9.1%
2009 | 0.38 | 2014 | 0.48 | +26% | +85.8%
2014 | 0.48 | 2019 | 5.91 | +1131% | +61.7%
2019 | 5.91 | 2024 | 134.25 | +2172% | +87.6%
This table shows the performance of NVDA and the S&P 500 over each of the 5-year periods in the infographic. Over three of the 5-year periods NVDA out-performed the S&P 500, and over the remaining one it under-performed. By choosing appropriate starting points it is easy to find other 5-year periods when NVDA under-performed the S&P 500. For example, from 8/2007 to 8/2012 NVDA returned -63.7% but the S&P returned -5.6%. Or from 11/2001 to 11/2006 NVDA returned +10.7% but the S&P returned +23.5%.
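Each return in the table is simply the percentage change between the two December closes. For example, the 2019-2024 NVDA entry:

\[
\frac{134.25 - 5.91}{5.91} \times 100\% \approx +2172\%
\]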
Although the infographic suggests that the huge out-performance started 20 years ago and has been decreasing, this is misleading for two reasons. First, the infographic's numbers are for the whole 20-year period, not for each 5-year period. Second, they are dominated by the rise in the most recent 5-year period. The linear plot of NVDA makes it clear that, had the most recent 5-year period been like any of the others, Nvidia would not have been in the infographic at all.
None of this is financial advice. Nevertheless, Nvidia is a great company, and it is possible to make money in its stock. There are two ways to do it: trading on the stock's volatility, or buying low and being prepared to hold it for many years.
Remember, past performance is no guarantee of future performance. If NVDA were to repeat the past 5-year return over the next 5 years, it would be around $3,100. To maintain its current P/E of 52.85 it would need annual revenue of around $2.6T.
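The ~$2.6T revenue figure can be reproduced with back-of-the-envelope arithmetic, assuming a share count of roughly 24.5 billion and a net margin of roughly 55% (assumptions for illustration, not figures from the post):

\[
\text{EPS needed} = \frac{\$3{,}100}{52.85} \approx \$58.7
\]
\[
\text{net income} \approx \$58.7 \times 24.5\ \text{billion shares} \approx \$1.44\ \text{trillion}
\]
\[
\text{revenue} \approx \frac{\$1.44\ \text{trillion}}{0.55} \approx \$2.6\ \text{trillion}
\]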
Over the course of the past year, the OCLC Research Library Partnership (RLP) Metadata Managers Focus Group has delved into challenges related to staffing, succession planning, and thinking about metadata changes. Metadata Managers planning group member Chloe Misorski (Ingalls Library and Museum Archives, Cleveland Museum of Art) highlighted that these challenges are particularly pronounced for staff managing specialized collections and archives, who often operate with smaller workforces and budgets than larger, general collections.
To gain a deeper understanding of how these collection managers are navigating emerging next-generation metadata environments, we invited RLP cataloging and metadata colleagues in museum libraries, independent research libraries, art libraries, and special collections or archives within larger campus networks to join the conversation. These are the metadata and cataloging colleagues of those who may be participating in broader discussions led by my RLP colleague Chela Weber as part of the Archives and Special Collections Leadership Roundtable.
We asked our participants to reflect on three prompts:
How are you preparing for next-generation metadata and linked data?
How are you developing new metadata workflows?
What other factors are impacting your current metadata operations and planning?
Our discussions revealed a sense of apprehension stemming from the ongoing tension between the unique needs required of managing special collections and archives and the limited resources available, particularly in areas of cataloging and metadata management.
Books and Scholars’ Accoutrements, late 1800s. Yi Taek-gyun (Korean, 1808-after 1883). Courtesy The Cleveland Museum of Art, Leonard C. Hanna Jr. Fund 2011.37
Backlogs
Several participants expressed that their most pressing challenge is addressing backlogs that predate the pandemic and have grown in the subsequent years due to limited cataloging resources. Some participants noted that they were only now back to previous staffing levels and that their primary focus was on tackling these backlogs using existing practices rather than looking ahead to next-generation metadata practices.
finding time to look to the future is difficult when keeping up with the present is so challenging
Workforce
Many of our participants are still dealing with a wave of retirements and organizational changes. These impacts are widespread but particularly severe in areas requiring specialized knowledge for cataloging unique materials. Staff reductions and retirements have exacerbated backlogs, and new hires often only slow their growth rather than reduce them.
Larger organizations have consolidated cataloging units, combining general and special cataloging staff. This has increased the workload for managers, who must re-imagine unit operations and provide growth opportunities for staff. Cross-training generalists to work with special collections is one strategy, but it brings challenges in balancing expertise levels across teams.
Managers often secure term-limited positions for specific projects, but training these catalogers is time-consuming, and they frequently leave when their terms end. This lack of continuity increases the burden of developing good documentation for cataloging practices. Recruiting for term-limited positions is also difficult due to the unique needs of special libraries, and descriptions can stagnate after funding or staff departures.
Workflows
“Good enough” workflows: In the face of too much work and too few resources, many are working to define “good enough” workflows and standards for cataloging. Where, in the past, records for special materials may have been held back until they were “perfect,” resource limitations mean thinking about how to move records of sufficient quality forward.
Identifying efficiencies: Others are taking a hard look at their workflows and identifying opportunities to find efficiencies. Staff at Princeton University Special Collections shared how they were finding efficiencies by developing a “MARC Factory” that takes spreadsheets provided by book dealers and converts them to MARC records “good enough” to bring into other cataloging workflows. This is also an interesting example of what we’ve been calling “social interoperability” because staff in cataloging, acquisitions, and book dealers participated in crafting a workflow that worked for each of them.
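As a rough illustration of what such a spreadsheet-to-MARC step can look like (this is a generic sketch using pymarc 5 or later, not Princeton's actual workflow, and the column names are assumed), the conversion might be as simple as:

```python
# Illustrative sketch of a dealer-spreadsheet-to-MARC step.
# Assumes a CSV with "title", "author", and "year" columns, and pymarc >= 5.
import csv
from pymarc import Record, Field, Subfield

def row_to_record(row: dict) -> Record:
    record = Record()
    record.add_field(
        Field(tag="100", indicators=["1", " "],
              subfields=[Subfield(code="a", value=row["author"])]),
        Field(tag="245", indicators=["0", "0"],
              subfields=[Subfield(code="a", value=row["title"])]),
        Field(tag="264", indicators=[" ", "1"],
              subfields=[Subfield(code="c", value=row["year"])]),
    )
    return record

# Convert every row of the dealer list into a minimal "good enough" MARC record.
with open("dealer-list.csv", newline="") as f, open("dealer-list.mrc", "wb") as out:
    for row in csv.DictReader(f):
        out.write(row_to_record(row).as_marc())
```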
Systems
Adding to existing challenges, several libraries are in the midst of system migrations. Even the smoothest implementations can cause additional disruptions, exacerbating issues related to staff shortages, backlogs, and reorganizations. Staff often need to freeze work, learn the intricacies of the new system, and rethink previous workflows.
Some libraries that invested in bespoke or open-source systems (OSS) to handle their special materials are finding it difficult to maintain them in the face of reduced resources. These systems are frequently built on technology stacks that are continually changing and therefore need close attention to prevent security breaches. Maintaining bespoke systems may require gaining buy-in from leadership competing for limited resources, even if they share the same goals. Consequently, libraries are seeking commercial-off-the-shelf (COTS) solutions that require less maintenance. At the same time, several participants mentioned how Yale University’s LUX platform (a multi-system open-source integration) is providing leadership in this area, even if their own institutions cannot build or sustain a similar platform with internal resources.
Abandoning a home-grown solution may come with other costs—if metadata in the source system is highly heterogeneous and not standards-based, it may need to be cleaned up to be migrated into new systems with less tolerance for creative descriptions. Even when moving high-quality metadata, work is required to ensure that the migration happens smoothly in ways that activate beneficial new features in the target system.
Despite the challenges of staffing, backlogs, and reparative metadata work, many participants noted that they continue to pay attention to developments around linked data by attending webinars, creating test accounts, and exploring new tools.
As the market for library linked data tools is still emerging, many are taking a wait-and-see approach. When it is challenging to maintain existing systems and services, discussion participants find it difficult to consider extending workflows and financial obligations. Others continue to use home-grown solutions, especially for managing entities for digital and cultural collections that are not dependent on MARC-based workflows. For example, the University of Nevada, Las Vegas, has developed a portal that allows entities identified with multiple URIs to connect across LC/NAF, local data, and external resources.
For several participants, artificial intelligence seems like a distant solution, especially for special materials. There is skepticism about whether existing chatbot services can produce good descriptions. Despite an interest in “good enough” records, AI-generated descriptions may not be worth developing the workflows needed to validate and remediate issues, especially for descriptions of special materials. Concerns are amplified for institutions with large archival collections of analog items. While the Library of Congress’s experiment on the use of AI for cataloging ebook backlogs is promising, it doesn’t overcome the hurdles faced by staff in archives and special collections.
From the view of our participants, AI is drawing attention away from the day-to-day realities and complexities of cataloging workflows. While it may prove useful in the future, current applications of AI still need the guidance and expertise of catalogers who are knowledgeable about special materials.
Next steps
Multiple participants noted the value they’d found in Total Cost of Stewardship: Responsible Collection Building in Archives and Special Collections. Chela and I discussed how we might bring this resource to the attention of Metadata Managers and what it might help us to do based on the challenges reported in this session. We are currently planning a future round-robin as a follow-up.
The Metadata Managers Focus Group will also take a closer look at emerging next-generation metadata workflows, starting with Activating URIs in linky MARC in January 2025 (see the RLP Events calendar for dates and times). These sessions will be tied to follow-up conversations with OCLC colleagues who are building the future of cataloging. We hope to explore emerging use cases for how OCLC is bridging existing expertise and workflows in ways that meet libraries where they are today, whether that’s in new editing environments or through a suite of APIs that enable the creation and curation of linked data entities and descriptive relationships.
The dataset was collected with Documenting the Now’s twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 977,677 Tweets with the search terms MMIWG2S or MMIWG from February 24, 2010 through April 8, 2023.
Using mmiwg2s-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("mmiwg2s-user-info"), and pandas:
#MMIWG2S Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Indian Act. Chinese Head Tax. Legalized slavery for years. Africville. Japanese internment camps. Sixties Scoop. Komagata Maru. Order-in-Council P.C. 1324 aka the ban on Black immigration, 1911. Residential Schools. State execution of Louis Riel. Oka. Carding. MMIWG2S….. https://t.co/e5XT6Q7X29
— Alicia Elliott - AND THEN SHE FELL out now! (@WordsandGuitar) June 2, 2020
10,910
26 year old Chantel Moore, a mother of a 5 year old girl, becomes the latest murdered indigenous woman in Canada. Police shot her 5 times in response to a call that she was being harassed. #MMIW #MMIWG https://t.co/ZBf67CD64Y
Today is Red Dress Day. It is a day when we honour missing and murdered Indigenous women, girls & gender diverse people. We remember & uplift them. We support their families. We recognize this disproportionate violence as a national tragedy. We advocate for change. #MMIWG2S pic.twitter.com/rirCXL9qh8
Using mmiwg2s-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("mmiwg2s-hashtags"), and pandas:
#MMIWG2S hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
Using mmiwg2s-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("mmiwg2s-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
In 1929 a Belgian reporter began a series of global adventures in the pages of Le Petit Vingtième. In 16 days Tintin starts a new journey into the public domain. But those wanting to meet him there must brave pitfalls around issues like where he’s public domain (only in the US for now), what’s reusable (only what’s in the 1929 French-language strips), and even whether a controversial precedent might keep him copyrighted in parts of the US. Great snakes- er, Sapristi! #PublicDomainDayCountdown
Tell us a little bit about Zibby Books. When and how did it get started, and what does it publish? You describe yourself on your site as “woman-led”—what significance does that have, in terms of your ethos?
It started after I got to know many authors through my podcast and realized how disappointed so many were with their publishing journey. I wanted to make it better! Woman-led is important in that our team is almost all women, as are the authors we publish!
What role(s) do you play at Zibby Books, in addition to founder and CEO? Do you take a hand in editing? What do you look for in the books you want to publish?
I get into the weeds on select titles but in general, I decide on all the acquisitions, I help with marketing and everything related to packaging, and provide oversight on all. Anne Messitte runs the show!
What are some of your favorites, of the books you’ve published so far, and why?
Between your podcast, your bookshop, your publishing company, and your writing—not to mention raising four children!—you have a lot on your plate. How do you make time for it all? What insights does being involved in so many different areas of the book world—as writer, as publisher, as blogger and influencer—give you, when it comes to each role? Are there challenges in being on “all sides” of the process?
I work pretty much nonstop because I’m obsessed with what I do! There are fewer challenges and more joys at being on all sides. I love seeing how the machine works and assessing if I can improve it! I make time by being very intentional with my schedule, cutting off work to go pick up my kids and take them to activities and all that, and having a fabulous team.
You pulled Zibby Books out as a sponsor of the National Book Awards last year, after learning that the authors were planning an organized protest of Israeli actions in response to the October 7th terrorist attack. Can you talk a little bit about the antisemitism you have seen in the book world, since the October 7th attack?
It would fill a book. The literary industry has really taken a hit which is why I continue to speak up and advocate for change.
What’s on the horizon for you, and for Zibby Books? What can we look forward to reading (or listening to) next?
So many great books! A novel by NYT bestselling author John Kenney, a novel by UK bestseller Jane Costello, a debut novel from Nanda Reddy, and an essay collection by podcaster Amy Wilson. ALL SO GOOD!
Tell us about your own personal library. What’s on your shelves?
All the books coming out in the next five months!!!
What have you been reading lately, and what would you recommend to other readers?
Welcome to the December 2024 issue of Information Technology and Libraries. In addition to highlights from the current issue, our letter from the editor provides updates about our revised editorial workflow and the addition of 93 previously unavailable issues of this journal and its predecessor, the Journal of Library Automation.
This column presents a case study exploring innovative approaches to digital librarianship within a distance learning Higher Education institution based in the UK. Key initiatives included asynchronous information literacy instruction, Python scripts for auditing course materials for broken links and copyright compliance, and management of physical extracts via a digital content store. It examines the challenges of building an online library service, balancing learner-centric practice with efficiency and cost-effectiveness. The study analyses the work done and presents future initiatives, offering insights and sharing practices for solo or small team librarians navigating the evolving landscape of both distance and face-to-face education.
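As a rough illustration of the kind of link-auditing script the column describes (the column itself does not include code, so the file name and approach here are assumptions), a minimal check might look like this:

```python
# Rough sketch of a broken-link audit; not the script described in the column.
# Assumes a plain text file with one course-reading URL per line.
import requests

def check(url: str, timeout: int = 10) -> str:
    try:
        resp = requests.head(url, allow_redirects=True, timeout=timeout)
        if resp.status_code == 405:  # some servers reject HEAD; retry with GET
            resp = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return "OK" if resp.status_code < 400 else f"BROKEN ({resp.status_code})"
    except requests.RequestException as exc:
        return f"ERROR ({exc.__class__.__name__})"

with open("course-links.txt") as f:
    for url in (line.strip() for line in f if line.strip()):
        print(f"{check(url)}\t{url}")
```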
This essay recounts the development of the “One County, One Calendar” initiative at Kilgore Memorial Library in York, Nebraska. What began as a simple solution for managing the library’s meeting room bookings evolved into a county-wide collaboration aimed at improving communication about local events. By working with the York Chamber of Commerce, the York County Development Corporation, and other county leaders, the library created a shared calendar system that now serves all of York County. This project exemplifies how libraries can play a pivotal role in fostering collaboration and solving community challenges, positioning public libraries as essential facilitators of information and engagement.
In April 2024, the Department of Justice finalized a rule updating regulations for Title II of the Americans with Disabilities Act (ADA), which requires that all state and local governments make their services, programs, and activities accessible, including those that are offered online and in mobile apps. The final rule dictates that public entities’ web content meet the technical standards of the Web Content Accessibility Guidelines (WCAG) Version 2.1, level AA, an industry standard since its creation in 2018.
Libraries that receive federal funding will be required to follow this rule for any web content they create, including LibGuides. Springshare’s LibGuide platform is one of the most widely used among libraries for web content creation, from complete websites to pedagogical and research guides. While Springshare may develop plans to make sure its clients are in compliance with this new rule, there are more important questions that LibGuide creators need to consider to move beyond the bare minimum of following the rule. The authors explain what WCAG 2.1 AA compliance requires, how LibGuide authors can use accessibility principles to ensure compliance, and offer available tools to check existing guides, as well as discuss alternatives to LibGuides.
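As one small, concrete illustration (not one of the specific tools the authors discuss), a short script can flag a frequent WCAG failure: images published without any alt attribute. The URL below is a placeholder, and a check like this supplements rather than replaces a full WCAG 2.1 AA audit.

```python
# Illustrative check for one common WCAG failure: <img> elements with no alt attribute.
# An empty alt="" is allowed for purely decorative images, so only missing attributes are flagged.
import requests
from bs4 import BeautifulSoup

def images_missing_alt(url: str) -> list[str]:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [
        img.get("src", "(no src)")
        for img in soup.find_all("img")
        if img.get("alt") is None
    ]

# Placeholder URL; point this at a real guide to audit it.
for src in images_missing_alt("https://example.org/your-libguide"):
    print("Missing alt text:", src)
```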
Archival repositories must be strategic and selective in deciding what collections they will acquire and steward. Careful collection stewards balance many factors, including ongoing resource needs and future research use. They ensure new acquisitions build upon existing topical strengths in the repository’s holdings and reassess these existing strengths regularly through multiple lenses. In this study, we examine the suitability of text analysis as a method for analyzing collection scope strengths across a repository’s physical archival holdings. We apply a tool for text analysis called Leximancer to analyze a corpus of archival finding aids to explore topical coverage. Leximancer results were highly aligned with the baseline subject heading analysis that we performed, but the concepts, themes, and co-occurring topic pairs surfaced by Leximancer suggest areas of collection strength and potential focus for new acquisitions. We discuss the potential applications of text analysis for internal library use including collection development, as well as potential implications for wider description, discovery, and access. Text analysis can accurately surface topical strengths and directly lead to insights that can inform future acquisition decisions and archival collection development policies.
While many languages are used for data manipulation, it is unusual to see JavaScript put to this task. This paper describes a novel application built to manipulate catalog patron data using only JavaScript running in a browser. Further, it describes the approach of building and deploying “strongly single page web applications,” a more extreme version of single page applications that have been condensed into a single HTML file. The paper discusses the application itself, how it is used, and the way that possessing web development and coding skills in an organization’s systems department can help it flexibly respond to challenges using such novel solutions.
This study is an in-depth look at the use of the LibGuides search function by college students. We sought to better understand the mental models with which they approach these searches and to improve their user experience by reducing the percentage of searches with no results. We used two research methods: usability testing, which involved 15 students in two rounds, and analysis of search terms and search sessions logged during three different weeks. Interface changes were made after the first round of usability testing and our analysis of the first week of search data. Additional changes were made after the second round of usability testing and analysis of the second week of search data.
The usability tests highlighted a mismatch between the LibGuides search behavior and the expectations of student users. Results from both rounds of testing were very similar. The search analysis showed that the level of no-result searches was slightly lower after the interface changes, with most of the improvement seen in Databases A-Z searches. Within the failed searches, we saw a reduction in the use of topic keywords but no improvement in the other causes we studied. The most significant change we observed was a drop in the level of search activity.
This research provides insights that are specific to the LibGuides platform—about the underlying expectations that students bring to it, how they search it, and the reasons why their searches do and do not produce results. We also identify possible system improvements for both academic libraries and Springshare that could contribute to an improved search experience for student users.
As one of the world’s largest archipelagic nations, Indonesia faces a major challenge in distributing its wealth, including basic infrastructure such as electricity and internet, to some of its most remote islands. Because of their remote location, most people in these areas live in poverty and have poor-quality education. To help solve this problem, an offline digital library was created based on a Raspberry Pi “minicomputer.” With the ability to store more than 2,500 educational movies and e-books for offline viewing, the device has proven to be reliable even in areas with unpredictable power.
This paper considers our library’s attempt at applying a “laissez-faire leadership” model to technical committee work. Since its introduction in the 1990s, scholarship on laissez-faire leadership has historically viewed the concept very negatively. However, we argue here that many of these perspectives are straw man arguments that do not adequately consider the possibilities of a laissez-faire model. Following some dissenting voices in the literature, we would like to reclaim the laissez-faire model as a way to facilitate library technical work under certain very specific circumstances. This paper will describe the organizational context where these laissez-faire methods worked for us. Our conclusion is that this approach can promote autonomy, responsibility, and productivity. We feel that this reevaluation of this concept can provide an important framework for self-organization when doing technical work.
This paper examines the use and subsequent trajectory of academic library technologies due to the impact of the COVID-19 pandemic. Taking a broad view of technologies, the systems and services discussed will center around resource use because COVID restrictions shuttered many in-person technologies. The two academic libraries compared in this study show a similar pattern of use and signal growth of certain platforms and technologies for the future.
The book has made less of a mark in the US. That could change in 17 days, when it joins the public domain here, years before it does most anywhere else. #PublicDomainDayCountdown
A couple of years ago, when I was looking for a Goodreads alternative, one of my coworkers recommended Literal.club. However, I’ve found it a bit buggy (it doesn’t always properly mark something read, search can’t find the exact book I’m looking for even though I know it exists in the system, that sort of thing). I … Continue reading "Moving from Literal (not Goodreads) to The Story Graph"
The dataset was collected with Documenting the Now’s twarc, using twarc2, via the Academic Access v2 Endpoint, with help from a few great colleagues and friends in the DocNow Slack. It contains 3,501,447 Tweets with the search terms #roevwade, #roevswade, #DobbsvJackson, #dobbsdecision, #Dobbs, and #AbortionBan from May 1, 2022 through June 16, 2022.
Using roevwade-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("roevwade-user-info"), and pandas:
Roe v. Wade Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
California, Colorado, Connecticut, Delaware, Hawaii, Illinois, Maine, Maryland, Massachusetts, Nevada, New Jersey, New York, Oregon, Rhode Island, Vermont and Washington. DC all protect abortion rights by state law these are the safe states to access abortion now #RoeVsWade
Using roevwade-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("roevwade-hashtags"), and pandas:
Roe v. Wade hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
A couple years ago I created a juxta (collage) of the images from this dataset. It features 510,807 images, and you can check it out here.
Emotion score
Using roevwade-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("roevwade-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
The dataset was collected with Documenting the Now’s twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 16,458,701 Tweets with the search term onpoli from July 14, 2009 through December 31, 2022. I have a brief post about this dataset here as well, which includes a different Tweet Volume chart highlighting election cycles.
Using onpoli-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("onpoli-user-info"), and pandas:
onpoli Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Fellow Ontario drivers - please like and retweet if you would rather pay $120 a year to renew your license plate than lose aprox $1-billion A YEAR in govt spending on public services like healthcare and education. #onpoli #onted
Using onpoli-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("onpoli-hashtags"), and pandas:
onpoli hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
Using onpoli-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("onpoli-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
The dataset was collected with Documenting the Now’s twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 2,277,298 Tweets with the search term wwg1wga from February 24, 2018 through October 9, 2022.
Using wwg1wga-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("wwg1wga-user-info"), and pandas:
wwg1wga Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "wwg1wga.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Report from the battle: the fascists are regaining control of their hashtags and are now confused about K-Pop, Anime, and Swifties. The Anonymous #OpFanCam division is now coming back for round two.
Using wwg1wga-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("wwg1wga-hashtags"), and pandas:
wwg1wga hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "wwg1wga.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
A couple years ago I created a juxta (collage) of the images from this dataset. You can check it out here.
Emotion score
Using wwg1wga-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("wwg1wga-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
The dataset was collected with Documenting the Now’s twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 7,721,050 Tweets with the search term elon from June 11, 2022 through October 28, 2022.
This dataset is also flattened line-oriented JSON data from the Twitter v2 API, which required yet another round of updates to twut. It was something I knew was on my agenda a few years ago, but I never had that special combination of a use case and time on hand to get it done. Now, it’s done: twut-1.1.0!
Using elon-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("elon-user-info"), and pandas:
elon Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "elon.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Elon Musk told the United Nations he would give them $6 billion to end world hunger if they showed him a detailed plan of how they would use the money. They called his bluff and gave him their plan— and then they never got the money. Now he’s buying Twitter for $45 billion.
— No Lie with Brian Tyler Cohen (@NoLieWithBTC) April 25, 2022
79,814
What a hypocrite. Elon Musk has received billions in corporate welfare from U.S. taxpayers. Now he wants to stop 30 million Americans who lost jobs from receiving $600 a week in unemployment benefits, while his wealth has gone up by $46.7 billion over the past 4 months. Pathetic. https://t.co/hECaTul3ZI
The amount Elon Musk just paid for Twitter ($44 billion) is nearly equal to Biden’s proposed climate budget ($44.9 billion), in case anyone's wondering how seriously we’re taking the climate crisis
Using elon-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("elon-hashtags"), and pandas:
elon hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "elon.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
Using elon-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("elon-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
The dataset was collected with Documenting the Now's twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 2,094,597 Tweets with the search terms (#WeAreCanadian) OR (#FreedomConvoy) OR (#FreedomConvoy2022) OR (#TrudeauMustGo) OR (#TrudeauMustGoNow) OR (#LiberalHypocrisy) OR (#LiberalsMustGo) OR (#WeTheFringe) OR (#Canadafirst OR (#ottawaconvoy) OR (#truckerconvoy) OR (#convoy) OR (#canadaproud) OR (#canadianpatriot) OR (#freedomprotest) OR (#nomoremandates) OR (#canadatruckprotest) OR (#operationbearhug) OR (#canadiantruckers) OR (#ramranch) OR (#FluTruxKlan) OR (#ramranch) OR (#ramranchresistance) from January 20, 2022 through February 25, 2022.
Using convoy-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("convoy-user-info"), and pandas:
Canada Convoy Protest Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "convoy.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Using convoy-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("convoy-hashtags"), and pandas:
Canada Convoy Protest hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "convoy.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
A couple years ago I created a juxta (collage) of the images from this dataset. You can check it out here.
Emotion score
Using convoy-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("convoy-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
Emotion Distribution in Canada Convoy Protest Tweets
The dataset was collected with Documenting the Now's twarc, using twarc2, via the Academic Access v2 Endpoint. It contains 1,546,162 Tweets with the search terms (NoMoreLockdowns) OR (NoVaccinePassportsAnywhere) OR (NoVaccineMandates) from January 7, 2011 through April 13, 2023.
Using antivax-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("antivax-user-info"), and pandas:
Anti-Vaccine/Lockdown Top Tweeters
Retweets
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "antivax.jsonl"
val df = spark.read.json(tweets)
df.mostRetweeted.show(10, false)
Please help me understand this. I am double vaccinated. Last month I was ill with COVID-19. I passed it onto my family. Therefore HOW WILL A VACCINE PASSPORT PROTECT SOCIETY??????? #NoVaccinePassportsAnywhere
Using antivax-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("antivax-hashtags"), and pandas:
Anti-Vaccine/Lockdown hashtags
Top URLs
Using the full line-oriented JSON dataset and twut:
import io.archivesunleashed._
val tweets = "antivax.jsonl"
val df = spark.read.json(tweets)
df.urls
.groupBy("url")
.count()
.orderBy(col("count").desc)
.show(10, false)
#NoMoreLockdowns #NoVaccinePassportsAnywhere #NoVaccineMandates Top Media
Emotion score
Using antivax-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("antivax-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
Emotion Distribution in #NoMoreLockdowns #NoVaccinePassportsAnywhere #NoVaccineMandates Tweets
Alfred Hitchcock started making Blackmail as a silent film, but after getting access to the new movie sound technology, he ended up releasing two versions. The sound version was one of the first “talkies” released in Britain. Some film aficionados prefer the silent version, which often uses different footage.
In 2021, the University of Calgary Libraries launched a multilingual reference chatbot by leveraging a commercial product that combines a large language model (LLM) with retrieval-augmented generation (RAG) technology. The chatbot is trained on the library’s own web content, including LibGuides and operating hours, and is accessed from the library’s website.
In a Works in Progress webinar hosted by the OCLC Research Library Partnership (RLP) on 20 November 2024, University of Calgary Library staff discussed the creation and implementation of the AI reference chatbot and shared lessons learned. The presenters were Kim Groome, Information Specialist; Leeanne Morrow, Associate University Librarian, Student Learning and Engagement; and Paul Pival, Research Librarian–Data Analytics.
This blog post provides a summary of the webinar’s key points, but for a deeper dive, you can watch the full recording here:
Project genesis
Like many research libraries, the University of Calgary Libraries has offered live chat to users since the early 2010s, with information specialists staffing the service daily from 9 a.m. to 5 p.m. The pandemic catalyzed discussions about an AI chatbot, with many factors driving the conversation:
Surging demand for chat services. Chat usage spiked dramatically during the pandemic. While the library typically handled 500–900 live chats per month in 2019, this number skyrocketed to 3,077 in September 2020.
Staffing constraints. The increased volume required additional staff and staff time to keep up with demand.
Limited service hours. Staffed by humans, live chat was available during extended business hours, but this still left students without support during the late evenings or early mornings.
Improved convenience. Even students visiting the library in person utilized the chat reference service. It was convenient and helped them maintain their study space during peak hours.
Automation potential for many questions. An analysis of live chat questions revealed a significant percentage of questions that were well-suited for automated responses.
Alignment with institutional priorities. Implementing an AI chatbot aligned with the university’s commitment to student-centered initiatives and its strategic focus on enhancing student success.
The library team looked across the Canadian library ecosystem for examples but found limited adoption among other libraries.[i] Instead, the library found that at UCalgary, the Office of the Registrar had already implemented a chatbot named “Rex,” leveraging technology provided by Ivy.ai. By building on this preexisting campus project, the library accelerated its own chatbot initiative, benefiting from shared resources and institutional experience.
Implementation
Assessing the usefulness of an AI chatbot
Initial work included conducting an analysis of past reference chat questions to evaluate the automation potential of an AI chatbot. Kim Groome described exporting approximately 3,000 chat interactions recorded over a one-month period during the pandemic and coding the questions into themes like study/workspace, printing, and borrowing requests. Through this analysis, which took approximately 30 hours, the library determined that 14-24% of reference chat inquiries were directional (e.g., "where is …?") and could potentially be handled by a chatbot. The coding was performed using Excel; use of Python by experienced coders could further expedite the work.
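The presenters didn't share a script, but a keyword-based first pass over an export like that is easy to sketch in pandas. The file name, column name, and theme keywords below are illustrative assumptions, not UCalgary's actual coding scheme:

import pandas as pd

# Illustrative sketch only: a keyword-based first pass at coding exported chat
# questions into themes, to speed up the kind of manual review described above.
chats = pd.read_csv("chat_transcripts_export.csv")   # hypothetical export file

themes = {
    "study/workspace": ["study room", "workspace", "quiet space", "book a room"],
    "printing": ["print", "scanner", "photocopy"],
    "borrowing": ["borrow", "renew", "hold", "checkout", "return"],
}

def code_question(text: str) -> str:
    """Return the first theme whose keywords appear in the question text."""
    lowered = str(text).lower()
    for theme, keywords in themes.items():
        if any(keyword in lowered for keyword in keywords):
            return theme
    return "needs manual review"

chats["theme"] = chats["question"].apply(code_question)   # "question" column is assumed
print(chats["theme"].value_counts(normalize=True))        # share of questions per theme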
Training and testing
After identifying a core set of common questions that could be effectively addressed by the chatbot, an eight-person library team began training and testing in April 2021, working with the vendor to increase consistency and quality of the chatbot answers. Testing was extended to other library staff members in July 2021. Recognizing the potentially infinite scope of user questions, the team avoided scope creep by initially focusing on a defined set of about fifty questions identified during their analysis.
Go-live
T-Rex chatbot avatar. Courtesy of University of Calgary Library.
The library's chatbot launched on 16 August 2021, branded as "T-Rex" to differentiate it from the preexisting "Rex" chatbot offered by the registrar (Rex is the official mascot of the UCalgary Dinos teams). Today T-Rex is one of six chatbots on the UCalgary campus, each operating on its own knowledge base and answering questions 24/7.
Continuous improvement and maturity
Quality monitoring
Kim described how the library team continually assesses the quality of chatbot responses using anonymous weekly reports. The team rates the bot’s answers on a 1 to 5 scale, where 5 represents a perfect response.
Examples of the rating process
Participants asked many questions we were unable to address during the webinar due to time constraints. Following the webinar, Kim offered these examples to address questions about the rating process:
5/5 response: If a patron asks, "Do you have databases for nursing?", the bot provides an accurate answer, earning a perfect score.
4/5 – 5/5 response: If a patron asks for a specific nursing database, such as, "Where can I find CINAHL?" (even when it is spelled incorrectly in any of six ways), the bot delivers an excellent response.
2/5 response: But if the patron asks a question like, "I want to find articles on the social effect of the opioid crisis; what database would I use?", the bot will struggle. Even though the bot didn't answer the question, the team would still score it at 2/5, because the specific topic is not on the website and is therefore not in the "bot brain." Since the chatbot hasn't been trained on the topic, it cannot answer the question. If the bot picks up on the phrase "find articles" and offers a response like "to find articles, please enter the topic in the library search box…", the response would be rated 3/5 or even 4/5.
Developing custom responses
The team monitored user questions following go-live, identifying questions that would benefit from customized answers. For example,
A frequent user query about access to the Harvard Business Review couldn’t initially be answered because access to the resource was embedded in a search tool outside the RAG’s scope.
Misspellings were also common, such as for resources like PsycInfo.
If any question was asked more than three times weekly in distinct transactions, the team would create a custom response to address it. Over the first year, the team spent about 5 hours each week creating 10-15 custom responses to these questions, incrementally improving the chatbot. For misspellings like the PsycInfo example, the team incorporated common misspellings like Psychinfo, pyscinfo, and psychinfo.
Monitoring the chat is important for identifying and immediately correcting any wrong responses. For example, a patron once asked, "Can I return a book that has already been declared lost," and the bot responded, "No, you cannot return a library book that was recently declared lost." This is obviously incorrect, and the response occurred because of information missing from the library website. But the issue is also complex, with at least fifteen different circumstances surrounding a lost item; adding a series of complex scenarios to the website was an imperfect solution. Instead, the team created a rule where any question that includes words like "lost" and "book" receives a customized qualifying statement: "If you need to contact library staff about a lost book, or the lost book charge, please email: <email address>." Similarly, for questions that mention the term "recall," the bot will respond, "Recalls are a very special circumstance. Here is an FAQ for more information."
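Ivy.ai's rule configuration wasn't shown in the webinar, but the trigger logic itself is simple. Here is a minimal sketch of the idea, with the rule table and function name invented for illustration and the canned wording paraphrased from the talk:

import re

# Sketch of the keyword-trigger idea described above, not Ivy.ai's actual rule
# engine: if a question contains certain word combinations, append a qualifying
# statement instead of letting the bot answer unaided.
RULES = [
    ({"lost", "book"},
     "If you need to contact library staff about a lost book, or the lost book "
     "charge, please email: <email address>."),
    ({"recall"},
     "Recalls are a very special circumstance. Here is an FAQ for more information."),
]

def qualifying_statements(question: str) -> list[str]:
    """Return canned statements whose trigger words all appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return [statement for triggers, statement in RULES if triggers <= words]

print(qualifying_statements("Can I return a book that has already been declared lost?"))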
Maturity
Screenshot of the T-Rex chatbot offered by the University of Calgary Library
The chatbot was further improved in February 2023 when the vendor added a new GPT layer, enabling the tool to generate its own responses to complement the existing custom responses. Today the chatbot offers fast, consistent, 24/7 support that is accessible to a wide range of users and is WCAG 2.1 AA compliant. It knows over 2 million words (each form of a word counts separately; renew and renewing, for example, are two words) and has over 1,000 custom responses. T-Rex is very accurate for directional questions, and 50% of all questions receive a rating of at least 4/5.
T-Rex has exceeded expectations. Before launch, the implementation team estimated that the chatbot could answer 14-24% of reference chat questions; today it answers about 50% of all questions with a rating of at least 4/5, deflecting half of all questions from live reference chat. The impact has been significant: 1.5 FTE of staff time has been redirected to more strategic, higher-level tasks. As a result, the library has reduced staffed desk hours, instead encouraging users to rely on the 24/7 chatbot for immediate assistance. There have been no staff reductions, just higher productivity.
Now that the chatbot is mature, it takes only about one hour per week to supervise and monitor the chatbot, primarily to confirm that it continues to work as expected. Updates, such as changes to library URLs, are efficiently managed using a simple Excel spreadsheet.
The implementation of T-Rex was the library’s first AI effort. More recently, the library has collaboratively established the Centre for Artificial Intelligence Ethics, Literacy and Integrity (CAELI). Located within a campus branch library, CAELI supports student success by fostering strong digital and information literacy skills among UCalgary students.
Lessons learned
The UCalgary team shared several key insights from the project:
Use library web pages as the system of record. One of the very first lessons learned after go-live was that the chatbot would be unable to answer a question if the library didn’t have a webpage or FAQ that addressed the topic. While it could be tempting to update the chatbot’s responses directly, Kim advised against this approach because it would create duplicate maintenance points. Instead, she urged participants to consider the website as the system of record for chatbot content.
Leverage a team-based approach. Implementing the chatbot with a team-based approach increased resilience and reduced points of failure for the project.
Identify and respond to user expectations. Users preferred answers that connected them directly to the source they were looking for, rather than being directed to a webpage that required further navigation. Over time, the team refined responses to reduce the number of clicks required to reach specific information.
Expect non-library questions. The team discovered that users would ask the chatbot many questions that the library RAG was unable to answer, such as, “When can I register for the spring semester?” In many cases, the bot can direct the user to one of the other relevant chatbots on campus (registrar, admissions, financial aid, career services, etc.) for appropriate answers. This is a significant benefit of an enterprise approach to adopting chatbot technology.
Think creatively about addressing non-library questions. The Calgary library recognized its role in supporting academic integrity, and it analyzed the chatbot data to learn more about the types of academic integrity questions students were asking. The library found that students were asking questions about reference styles, citation managers, plagiarism and detection software, and academic policies. These questions often arose late at night when live support was unavailable. In collaboration with the campus academic integrity coordinator, the library developed custom responses and added relevant campus content to its website, enhancing the chatbot’s ability to support student success.
Anticipate that there will be non-adopters. Some people prefer to interact directly with other humans and are unlikely to adopt chatbot technology. About 12-15% of library chatbot users still ask to speak to a human, even in cases where the chatbot could likely answer their question. Users can click through to “Connect to a Person” directly from T-Rex during regular service hours.
Library use of AI chatbots
To understand webinar participants’ own use of and experiences with chatbots, we polled attendees during the presentation. Their responses provide anecdotal insights about library adoption of chatbots.
While webinar participants were clearly interested in chatbots, they weren’t necessarily strong users. Only about 40% of participants reported using chatbots on a daily or weekly basis; 27% reported never using GenAI chatbots.
RLP affiliate responses to poll about chatbot usage
Relatedly, nearly 50% of participants reported that they didn’t enjoy interacting with GenAI chatbots, although nearly as many had mixed feelings.
RLP affiliate responses to poll about enjoyment of chatbot interactions
Implementation of library chatbots
This webinar clearly filled a need for our RLP participants: our polling showed that few of their libraries had implemented an AI chatbot, but nearly 50% were considering it.
RLP affiliate responses to poll about library adoption of AI reference chatbots
Is your library implementing an AI chatbot? Share a comment below or send me an email. I’m eager to learn more. Special thanks to the UCalgary Library team for generously sharing their experiences and insights so we can all learn from their innovative work.
[i] Julia Guy et al., “Reference Chatbots in Canadian Academic Libraries,” Information Technology and Libraries 42, no. 4 (December 18, 2023), https://doi.org/10.5860/ital.v42i4.16511.
On December 6, 2024, the NDSA Leadership voted unanimously to welcome its three most recent applicants into the membership.
Webrecorder LLC
Museum of Glass
George Washington’s Mount Vernon
Each new member brings a host of skills and experience to our group.
Webrecorder focuses on high-fidelity browser-based archiving, creating highly accurate archives of websites that can be replayed later with video, maps, 3D models, and other interactive elements fully intact. They support multiple levels of user access to their tools, including open-source self-hosting, browser extensions, and paid services. Their engagement reaches across the globe, providing web archiving tools to grassroots organizations, data/science labs, journalists, archivists, GLAMs, and national organizations.
Museum of Glass (MOG) is a premier contemporary art museum dedicated to glass and glassmaking, home to the West Coast's largest and most active museum glass studio. MOG has an extensive collection of digital content dating back to its founding in 2002, including documentation of the more than 2,000 pieces of art in the permanent collection and born-digital materials encompassing images, audio, and video recordings created over the past twenty years. These born-digital materials are critical documentation of the more than 750 visiting artists in MOG's Hot Shop, related Hot Shop programming, and the exhibition and programming history of the Museum. A key part of this collection is an extensive amount of streaming video from the Hot Shop, where MOG broadcasts live five days a week. In recent years, MOG has committed significant resources to establishing its digital preservation program, including hiring the organization's first digital archivist and archival assistant.
George Washington's Mount Vernon is committed to building a robust digital preservation program and has taken significant steps in this direction over the past few years. They engaged a consulting firm to conduct an institution-wide digital archives assessment and hired a full-time digital records archivist. This role has been instrumental in developing their digital preservation policy and workflows, which ensure the preservation of born-digital materials, including institutional records, audiovisual materials, digital humanities projects, and web archiving. In the near future, they plan to expand their efforts to include 3D models and digital reproductions.
Each organization participates in one or more of the various interest and working groups – so keep an eye out for them on your calls and be sure to give them a shout-out. Please join me in welcoming our new members. You can review our list of members here.
I have no kind words about what Twitter has become. I abandoned the platform around 2018, keeping my account only to harvest data. Once the electioneering oligarch cut off academic access, there was no reason for me to stick around. During the brief period I had academic access, I collected as much data as possible each month, planning ahead for future research projects and collaborations.
Currently, I'm working with the DigFemNet team, and I believe we have some use cases for a few of these datasets. I'm planning to write all, or at least most, of them up, similar to what I've done in the past. Unfortunately, the whole hydration process is no longer possible, so following along isn't really an option any more. ☹️
To start this “series”, I’m revisiting a dataset I generated a few years ago as a side project for members of the DigFemNet team during their Archives Unleashed Cohort work. This dataset was generated from tweets with the following hashtags:
#healthcanada
#NACI
#fordnation
#medicalfreedom
#covid19
#covid19vaccines
#protectourfamilies
#protectyourchildren
#holdtheline
The dataset of Tweet IDs is available here, though I’m not sure how useful it is anymore. ☹️
On another note, I took some time to completely refactor twut. A new 1.0.0 release is now available!
Using 105683SP3QFISO4-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("105683SP3QFISO4-user-info"), and pandas:
This is the face of someone who just spent 9 hours in personal protective equipment moving critically ill Covid19 patients around London.
I feel broken - and we are only at the start. I am begging people, please please do social distancing and self isolation #covid19 pic.twitter.com/hs0RQdvsn3
Hey, so, I got #Covid19 in March. I’ve been sick for over 3 months w/ severe respiratory, cardiovascular & neurological symptoms. I still have a fever. I’ve been incapacitated for nearly a season of my life. It's not enough to not die. You don’t want to live thru this, either. 1/
Using 105683SP3QFISO4-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("105683SP3QFISO4-hashtags"), and pandas:
#healthcanada #NACI #fordnation #medicalfreedom #covid19 #covid19vaccines #protectourfamilies #protectyourchildren #holdtheline Top Video
Emotion score
Using 105683SP3QFISO4-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("105683SP3QFISO4-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
Emotion Distribution in #healthcanada #NACI #fordnation #medicalfreedom #covid19 #covid19vaccines #protectourfamilies #protectyourchildren #holdtheline Tweets
Cambridge, MA – Harvard Law School’s Library Innovation Lab (LIL) is proud to announce the official launch of the Institutional Data Initiative (IDI), a groundbreaking program helping libraries, government agencies, and other knowledge institutions share digital collections with their patrons while improving the accuracy and reliability of AI tools for all.
First developed by Greg Leppert at the Library Innovation Lab, and now led by Leppert as Executive Director, IDI seeks to redefine the creation and stewardship of the knowledge and datasets that define AI research.
Following in the footsteps of LIL’s Caselaw Access Project, the foundational dataset of IDI will be nearly one million public domain books, created thanks to the wide-ranging resources and expertise of the Harvard Library system. By prioritizing the assembly and release of open access public domain materials, as well as using principles developed at LIL for approaching large scale data with a library lens, IDI bridges the gap between model makers and knowledge institutions.
LIL Director Jack Cushman commented, “One of the goals of our lab is to incubate bold ideas and give them the resources to grow. IDI is a wonderful example of that. Our mission to bring library principles to technological frontiers is embedded in these efforts, and we are thrilled to see industry leaders and cultural heritage organizations support IDI’s work and promise. We look forward to supporting and collaborating with IDI, working to diversify and expand access to cultural heritage materials that help everyone.”
Jonathan Zittrain, IDI and LIL faculty director, said “Libraries and other stewards of humanity’s aggregated knowledge can think in terms of centuries — preserving it and providing access both for well-known uses and for aims completely unanticipated in ancient times or even recently. IDI’s aim is to address newly-energized interest in otherwise-obscure and sometimes-forgotten texts in ways that keep knowledge institutions’, and society’s, values front and center. That means working towards access for all for public domain works that have remained fenced — access both for the human eye and for imaginative machine processing.”
The Library Innovation Lab at Harvard Law School brings library principles to technological frontiers. They explore the future of libraries and the creation, dissemination, and preservation of knowledge. Through innovative projects, platforms, and partnerships, the Lab aims to advance access to information and foster collaboration across disciplines.
Contact:
Jack Cushman
Director, Harvard Library Innovation Lab
lil@law.harvard.edu
Margery Allingham’s detective Albert Campion may be easy to miss at first sight. He’s not as familiar to American readers as sleuths like Lord Peter Wimsey (who he resembles in some respects). In The Crime at Black Dudley (1929), he’s one of several mysterious secondary characters, and not the one who solves the murder.
In light of the rule change under Title II of the Americans with Disabilities Act (ADA), which now requires all web content and mobile applications provided by state and local governments, including public higher education institutions, to be accessible to people with disabilities, and with Disability Awareness Month recently behind us, many new and important resources are available to libraries, including the Neurodiversity Design System. These design principles are specifically tailored to learning management systems, though they hold a bevy of gems for online information sharing and design choices. Topics covered in the system may already be known to designers as good practice, though many do not realize that designing for universal understanding is always the best design option. Real-world examples are provided for how to format numbers (font, size, and spacing), buttons, links and input fields, and animations. The site also provides helpful definitions, legal requirements, and user personas to connect design choices with the user experience.
What I liked most about this toolkit is its user experience! The website is well organized, with a lot of information that is easy to navigate. Structure in information organization is a key wayfinding tool that allows anyone, familiar or unfamiliar with a website, to go right to their intended endpoint. What's most valuable about this toolkit for non-designers is the "under the hood" look at users. When you consider the types of disabilities that can affect a person's experience navigating information online, you become immediately aware that you know people who have trouble navigating online. People in your life experience challenges, and most of them go unnoticed or unvoiced. When we design for all types of people, these challenges fall away, making navigation successful and saving tons of stress and time. Contributed by Lesley A. Langa.
Reparative language in finding aids at Haverford College
A recent post in Descriptive Notes, the blog of the Society of American Archivists Description Section, details an effort at Haverford College (OCLC Symbol: HVC) to review legacy finding aids for racist or harmful language and to revise language accordingly. With over 1,400 collections, this was a daunting undertaking but was supported by both the library’s strategic plan and campus funding to support DEI efforts. The blog post, “Undertaking a Reparative Language Project at Haverford College,” details the many steps in the project to review and remediate descriptions.
This article highlights both the scale of the problem of legacy harmful descriptive practices and the solutions, which have involved tapping into available funding, bringing in contractors (either early career professionals or MLIS students), and later involving Haverford undergraduate student workers. The project drew on knowledge gleaned from previous efforts, such as the Archives for Black Lives in Philadelphia project and work done by Duke and Yale. As of now, over 1,500 finding aids have been reviewed at least once for harmful language. Future collections will benefit from revisions to processing guidelines, and the project has generated other tools, such as a spreadsheet recording the names that groups have identified for themselves. Ultimately, this project reflects some truths about reparative descriptive work: it benefits from focus driven by local priorities and takes an iterative approach. Readers will find more projects to inspire them in a series of posts in Descriptive Notes focused on inclusive description. Contributed by Merrilee Proffitt.
National Day of Racial Healing, 21 January 2025
Since it was conceived in 2017 by the W.K. Kellogg Foundation (WKKF), the National Day of Racial Healing (NDORH) has been “a time to contemplate our shared values and create the blueprint together for #HowWeHeal from the effects of racism.” Observed annually in the United States on the Tuesday after Martin Luther King, Jr. Day, NDORH “is an opportunity to bring ALL people together and inspire collective action to build common ground for a more just and equitable world.” Especially in these days of divisiveness across so many human fault lines, OCLC WebJunction’s “2025 National Day of Racial Healing” article is both timely and valuable, providing ideas and inspirations for events, activities, and actions that libraries and their communities might plan for 21 January 2025. Included are details of events and programs organized in recent years by half a dozen public libraries around the United States, the “Read to Dream” list compiled by King’s family and others, and links to numerous other resources.
The work toward healing across all dividing lines must go on constantly, of course, not only on a single day. WKKF defines racial healing as: “… the experience shared by people when they speak openly and hear the truth about past wrongs and the negative impacts created by individual and systemic racism. Racial healing helps to build trust among people and restores communities to wholeness, so they can work together on changing current systems and structures so that they affirm the inherent value of all people.” WKKF has been cooperating with libraries and other educational institutions on racial equity and related issues as part of its Truth, Racial Healing & Transformation efforts. The American Association of Colleges and Universities and the American Library Association are prominently among its partners. Contributed by Jay Weitz.
Poor autocomplete design can prevent customers from making purchases. These 12 best practices will create a smooth user experience and increase conversions.