Planet Code4Lib

To gain the world, and lose one’s spouse / John Mark Ockerbloom

Having won worldly success, Sam and Fran Dodsworth pursue the dream many couples have to retire early and travel the world together. It doesn’t work out as they’d hoped. Sinclair Lewis had a similar experience in reverse with what Martin Ausmus called his “most sympathetic yet most savage” novel: he began it while getting a divorce from his first wife, and won the Nobel Prize, pinnacle of critical success, after it came out. The public domain wins Dodsworth in 12 days.

Looking back on 2024 and looking ahead to 2025 with the Open Knowledge Network / Open Knowledge Foundation

As we bid farewell to 2024, it’s time to look back on the year that is drawing to a close and celebrate the incredible work done across the Open Knowledge Network. This year has been a journey of innovation, resilience, and collaboration, demonstrating the power of openness in fostering global change. I feel very lucky to share a space with the extremely talented people who make up our Network, and I particularly appreciate our last call of the year, where we share the achievements that made us proud in 2024 and our exciting plans for the year to come.

For me, the highlights of the year were:

  1. The launch of the Network Regional Hubs in April was a pivotal step toward fostering localized collaboration and leadership.
  2. The Network face-to-face meeting in Katowice, Poland—a powerful moment of connection despite the absence of some members due to visa restrictions and other challenges. 
  3. The Tech We Want Summit – with the participation of so many Network members!
  4. The early testing of Open Data Editor with Network members in November and December.

The Open Knowledge Network’s achievements reflect a collective commitment to openness, innovation, and inclusivity. Across the globe, Network members are addressing pressing issues such as climate change, digital access, and governance with bold, forward-thinking projects. The consistent theme across all regions is the power of collaboration—a cornerstone of our shared vision.

Here are some of the achievements and plans for next year that were shared by other Network members, in alphabetical order:

🇦🇲 Armenia: This year, Armenia hosted a successful ODD event, with plans for next year focused on data visualization, air quality, and GLAM (Galleries, Libraries, Archives, and Museums). Their ODD 2025 will spotlight data journalism and cultural heritage.

🇧🇩 Bangladesh: Open Knowledge Bangladesh reactivated this year amid political and internet challenges. They engaged young people and focused on Wikimedia-related activities. Their 2025 goals include promoting a national open access policy, expanding the national open data portal, and advancing GLAM projects.

🇪🇪 Estonia: Estonia contributed to national public information discussions in 2024 and aims to enhance collaboration with other Network members in 2025.

🇬🇭 Ghana: Ghana’s chapter formalized its structure, gained nonprofit certification, and spearheaded projects like Open Goes COP and the Wiki Green Conference. Looking ahead, they aim to build effective communities, sustain the movement, and enhance capacity-building efforts.

🇬🇹 Guatemala: This first year in the Network was one of learning for Guatemala, particularly in navigating governmental collaboration. They initiated an NGO to sustain efforts and are awaiting approval to develop digital infrastructure in three municipalities.

🇯🇵 Japan: Although less active as a group this year, Open Knowledge Japan made significant progress in discussions about rebuilding their team and planning activities. Open Data Day (ODD) remains a huge success in Japan, a testament to the community’s self-sustaining enthusiasm.

🇲🇰 Macedonia: This year was about understanding Open Knowledge operations. Next year, they plan to work on a data visualization project for a local municipality while advocating for open data policies.

🇲🇽 Mexico: Mexico’s ODD welcomed 200+ participants, alongside the Gobernantes.info project, which opens public interest data on elected officials. They also developed a spreadsheet course and are collaborating on a mapping course for the humanitarian sector. Upcoming projects include open data on migration within LATAM.

🇳🇵 Nepal: This year, Nepal launched the Integrated Data Management System (IDMS), a tool to build open data portals for local governments. They also made strides in reaching rural communities with data initiatives. In 2025, they plan to advance IDMS, refresh the Nepali open data portal, and promote “The Tech We Want” (TTWW) at the OGP Local Meeting in the Philippines. Their wish? Greater collaboration across the Network.

🇳🇬 Nigeria: In 2024, Nigeria focused on open data about climate change and its impacts on women, producing a podcast series on the topic. In 2025, they hope to explore how open data and AI can complement each other.

🇷🇺 Russia: Despite challenges, activities in Russia persisted, mostly remotely. The focus has been on GLAM, open access, and research-related open data. Plans are underway to create a research-focused open data portal, with ODD 2025 planned as a remote event.

🇿🇲 Zambia: This year marked Zambia’s introduction to open knowledge concepts. They’re eager to advance their understanding and involvement in 2025.

Our CEO, Renata Avila, gave an overview of the main highlights from the Foundation’s work in 2024 and our plans and wishes for 2025. 

As we step into 2025, The Tech We Want will guide the Open Knowledge Foundation’s strategic focus. We will organise our work around four main pillars:

  1. Defining Ethical Tech: Shaping a framework for open, equitable technology through community collaboration and high-impact advocacy.
  2. Building Open Tools: Expanding tools like the Open Data Editor to foster transparency and utility in data usage.
  3. Innovating Governance Models: Creating sustainable governance mechanisms to support long-term community engagement.
  4. Leading by Example: Adopting the open technologies we promote within our internal operations.

We cannot imagine doing any of this, of course, without collaborating with the Network. The power of collaboration is a cornerstone of our vision, and we stand by our motto: better together than alone.

As we close the year and look to the next, we extend our deepest gratitude to every Network member. Together, we will continue to shape a future that is fair, sustainable, and open for all!

Here’s to a transformative and impactful 2025!

Towards a knowledge society through the technologies we want / Open Knowledge Foundation

CEO Reflection Ahead of 2025

2024 marked the twentieth anniversary of the Open Knowledge Foundation (OKFN). The organisation, along with other actors in the movement, found itself actively questioning its role in the years to come. When our organisation was founded, fewer than ten million people were connected to the Internet. Yet a far greater number had digital artefacts in their hands, enabling the creation of programmes and systems and the dissemination of knowledge on a massive scale. Millions already had access to photocopiers, personal computers, CD recorders, floppy disks and control boards. So when connectivity arrived, there were already skilled individuals and communities with an emancipated relationship to digital technologies, and collective, collaborative, transnational projects were flourishing in many places, including ours.

Our initial approach was highly experimental, pioneering many areas in which neither the state nor businesses were yet deeply invested, such as open data systems or a flexible legal architecture for sharing across and beyond standard practices. We have succeeded in shifting values and institutional practices, with hundreds of governments and institutions adopting our legal, technical and social tools and approaches that enable distributed power, greater accountability and civic engagement, and appreciation of a shared culture. 

But two decades later, the landscape is very different. We have four times more internet-connected devices collecting data points than people on the planet. Connected devices are controlled by a very few powerful and opaque corporations. Despite increased connectivity, inequalities have risen sharply; wars and climate catastrophes are daily news as the anti-rights, pro-war agenda of the powerful advances rapidly.

We have also changed as a society: our social dynamics with technology are far different from the civic, creative and purposeful first moment two decades ago. So what is the role of organisations like ours today? How can we focus our efforts to increase impact and achieve profound change, rather than just applying aspirin to today’s problems?

This year, we started answering these questions through The Tech We Want, a new initiative to articulate a positive vision for how and why we build technology, as well as its governance mechanisms, in a participatory way, as we did this year with the Open Data Editor. The tech we want is open, long-lasting, resilient and affordable, good enough to solve people’s problems, sustainable and democratically governed. 

The tech we want is a means, a vehicle to open up knowledge, our purpose, to help people understand our world, unlock the latest scientific breakthroughs and tackle global and local challenges, expose inefficiencies, challenge inequality and hold governments and corporations to account. It is a vehicle for sharing power and inspiring innovation and a shared culture. 

That’s where we will focus most of our efforts in 2025, working with our communities and allies, present and active on every continent. 

We will continue to fight for a knowledge society rather than a surveillance society, one that benefits the many rather than the few and is built on the principles of collaboration rather than control, empowerment rather than exploitation, and sharing rather than monopoly. As we do so, we hope that through the Tech We Want we can count on you, our communities, to continue to break down the barriers to change.

This time is different? / John Mark Ockerbloom

“By… effecting the purposes of governmental supervision by its own internal machinery, the New York Stock Exchange has justified its existence, earned and retained the confidence of the public, and proved itself the most reliable and efficient market place in the world.” So said Robert Irving Warshaw’s The Story of Wall Street, published days before a historic market crash. It goes public domain in 13 days, maybe just in time for a new round of deregulation.

Our impressions after a huge Open America event / Open Knowledge Foundation

In the first week of December, we had the opportunity and the pleasure to participate in the Open America event in Brazil, which brings together prestigious international meetings dedicated to the research, publication and use of open data on topics such as transparency, access to information, open government, civic technologies, data journalism, digital government, accountability and equity.

The event was a great success, with hundreds of people in attendance and, we dare say, representatives from every state in the Americas. We are also proud that one of our chapters, Open Knowledge Brazil, was a co-organiser of such an event.

One of the main features of the event was its multidisciplinary nature. It brought together people from civil society, journalism and government. In addition, the climate crisis and the strengthening of democracy were two of the main axes of discussion and debate.

The agenda for the event was huge, so there was a lot to choose from. To continue our Open Goes COP line of work, we decided to prioritise talks related to the climate crisis. Some of the talks were: Climate Change, Just Energy Transition and the Power of Open Data in Building Resilient Societies, Resilient Cities: Collaboration, Data and Openness in the Face of Climate Change, and Data and Natural Disasters. From all of them, we drew the following lessons (which are very much in line with our experiences and actions):

  • There is a huge lack of tools and capacity.
  • It is essential to start generating data for disaster response.
  • An important role for civil society is to provide capacity and perspective to governments in times of crisis.
  • There is a huge gap (in terms of communication) between activists and citizens. The same is true between government and citizens; government communication needs to be transformed.
  • Moments of crisis generate voluntary initiatives that inform the state; the challenge now is how to institutionalise such efforts.
  • The vision and voices of young people need to be included in the movement.

Interdisciplinarity and communication as transversal axes: the role of the Open Data Editor

If we have to highlight two transversal axes of the whole event, they are:

  1. The need for interdisciplinary work
  2. Giving priority to communication efforts (something that was also discussed a lot in the tracks on strengthening democracy).

My presentation of the Open Data Editor at the ‘Open Data Publishing Tools’ table on Thursday 5th December

These axes gave us the starting point for our presentation of the Open Data Editor, as it is a tool born to improve the publication of data in non-technical communities. Interdisciplinarity is a key element in the development of the tool, which we put into practice with our two pilots in organisations whose field of work is not open data. In addition, the second focus of the Open Data Editor is the translation of technical language for non-technical audiences and the publication of a course for all audiences on principles and best practices in data publishing.

We are very happy with the participation and the reception of the tool in the community. We received very interesting feedback, along with contacts from communities and small organisations interested in the pilot programmes. We have also been approached by a number of activists asking for opportunities to work with the tool.

We are happy that the work we have been doing this year meets the main needs of a vibrant community like America Aberta and contributes to solving them. The event left us with many lessons learned and high expectations for the next steps of the Open Data Editor.

Exploring #whyididntreport Twitter Data / Nick Ruest

Overview

This is a dataset that I hydrated in April 2023; it was created by Zachary Maiorana, Pablo Morales Henry, and Jennifer Weintraub. It is the “#metoo Digital Media Collection - Hashtag: whyididntreport”, part of the Schlesinger Library #metoo Digital Media Collection. The original dataset contains 852,638 Tweet IDs, of which I was able to hydrate 556,253 tweets, giving a hydration rate of 65.24%. The hydrated dataset covers October 15, 2017 through May 20, 2020.
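
As a quick check, here is a minimal sketch (my own, not part of the original workflow) that counts the hydrated tweets in the line-oriented JSON and recomputes the hydration rate against the original 852,638 IDs:

import io.archivesunleashed._

// Count the hydrated tweets and recompute the hydration rate
// against the 852,638 tweet IDs in the original dataset.
val df = spark.read.json("whyididntreport.jsonl")
val rate = df.count().toDouble / 852638 * 100
println(f"Hydration rate: $rate%.2f%%")  // ≈ 65.24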

#whyididntreport Tweet Volume
#whyididntreport wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "whyididntreport.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+------+
|lang| count|
+----+------+
|  en|524260|
| qme|  7126|
| qht|  6926|
|  fr|  5752|
|  pt|  2642|
|  es|  1938|
| qam|  1255|
|  ja|  1227|
| und|   986|
| art|   879|
+----+------+

Top tweeters

Using whyididntreport-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("whyididntreport-user-info"), and pandas:

#whyididntreport Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "whyididntreport.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1043256667285147648|44908            |
|1248064755841077251|12922            |
|1043189716970078208|10336            |
|1043169010030927872|9986             |
|1043186317381783554|9972             |
|1043626009239543808|9718             |
|1043219695548149760|8938             |
|1043255247521472512|7909             |
|1043941416131383301|7783             |
|1043177356142305280|7099             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet (a small Spark sketch after the list shows one way to build these URLs). Here are the top three:

  1. 44,908

  2. 12,922

  3. 10,336
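
A minimal sketch, assuming the df and twut setup from the block above, of how those status URLs could be assembled directly in Spark:

import org.apache.spark.sql.functions._

// Prepend Twitter’s status-URL prefix to the top retweeted tweet IDs.
df.mostRetweeted
  .limit(3)
  .withColumn("url", concat(lit("https://twitter.com/i/status/"), col("tweet_id")))
  .select("url", "max_retweet_count")
  .show(false)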

Top Hashtags

Using whyididntreport-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("whyididntreport-hashtags"), and pandas:

#whyididntreport hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "whyididntreport.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+---------------------------------------------------------------------------------------------+-----+
|url                                                                                          |count|
+---------------------------------------------------------------------------------------------+-----+
|https://twitter.com/ThomasARoberts/status/1043256667285147648                                |1895 |
|https://twitter.com/VisceralBliss17/status/1043626009239543808                               |1847 |
|https://www.msnbc.com/the-last-word/watch/-whyididntreport-victims-answer-trump-1326545987871|1388 |
|https://t.co/HFJbX84at0                                                                      |1328 |
|https://on.msnbc.com/2I6oqUg                                                                 |894  |
|https://t.co/wk92PgPSSb                                                                      |893  |
|https://twitter.com/MPRnews/status/1043229818496507904                                       |825  |
|https://t.co/ygmEKrnldi                                                                      |757  |
|https://twitter.com/realDonaldTrump/status/1043126336473055235                               |646  |
|https://t.co/dVoM9aLh0b                                                                      |621  |
+---------------------------------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "whyididntreport.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+------------------------------------------------------------------------------------------------+-----+
|media_url                                                                                       |count|
+------------------------------------------------------------------------------------------------+-----+
|https://pbs.twimg.com/media/EVIERuPX0AAMFTc.jpg                                                 |7235 |
|https://pbs.twimg.com/media/DnosuWHUYAAwhJH.jpg                                                 |1619 |
|https://video.twimg.com/amplify_video/1047611902355017730/vid/240x240/FD3Rr59ojedN5NOG.mp4?tag=8|1211 |
|https://video.twimg.com/amplify_video/1047611902355017730/vid/720x720/y78h9KVYsZudlHdi.mp4?tag=8|1211 |
|https://video.twimg.com/amplify_video/1047611902355017730/vid/480x480/dzgOwKXHO27tQw-b.mp4?tag=8|1211 |
|https://video.twimg.com/amplify_video/1047611902355017730/pl/trDxglqrWal4E_c1.m3u8?tag=8        |1211 |
|https://pbs.twimg.com/media/Dnqbqp-U4AY4OEi.jpg                                                 |768  |
|https://pbs.twimg.com/media/Dnpfn0LU4AAhOrZ.jpg                                                 |621  |
|https://pbs.twimg.com/media/Dno_JbnU4AA7MMh.jpg                                                 |499  |
|https://pbs.twimg.com/media/DoMDgT5XUAEwp2o.jpg                                                 |309  |
+------------------------------------------------------------------------------------------------+-----+

A couple years ago I created a juxta (collage) of the images from this dataset. It features 26,950 images, and you can check it out here.

Emotion score

Using whyididntreport-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("whyididntreport-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #whyididntreport Tweets

Exploring #BelieveSurvivors Twitter Data / Nick Ruest

Overview

This is a dataset that I hydrated in April 2023; it was created by Zachary Maiorana, Pablo Morales Henry, and Jennifer Weintraub. It is the “#metoo Digital Media Collection - Hashtag: believesurvivors”, part of the Schlesinger Library #metoo Digital Media Collection. The original dataset contains 1,482,343 Tweet IDs, of which I was able to hydrate 990,776 tweets, giving a hydration rate of 66.84%. The hydrated dataset covers October 15, 2017 through May 23, 2020.

#BelieveSurvivors Tweet Volume
#BelieveSurvivors wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believesurvivors.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+------+
|lang| count|
+----+------+
|  en|922460|
| qme| 29683|
| qht| 22258|
| und|  4666|
| art|  2278|
| qam|  1579|
|  lv|  1163|
|  es|  1048|
|  fr|   974|
| qst|   945|
+----+------+

Top tweeters

Using believesurvivors-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believesurvivors-user-info"), and pandas:

#BelieveSurvivors Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believesurvivors.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1043937966358433793|13986            |
|1044400591429152768|11536            |
|1047935356673437697|11496            |
|1045485744280596480|10006            |
|1044740038989418498|8962             |
|1045484872398188549|8182             |
|1043535858010210304|7859             |
|1047860627933421568|7565             |
|1045390508435152899|7124             |
|1047932623014825991|7093             |
+-------------------+-----------------+   

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 13,986

  2. 11,536

  3. 11,496

Top Hashtags

Using believesurvivors-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believesurvivors-hashtags"), and pandas:

#BelieveSurvivors hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believesurvivors.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|url                                                                                                                                                                    |count|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|https://twitter.com/KamalaHarris/status/1044285021635432451                                                                                                            |5175 |
|https://t.co/3vWhlGZSdR                                                                                                                                                |3931 |
|https://twitter.com/VoteChoice/status/1047932623014825991                                                                                                              |3317 |
|https://twitter.com/NBCNews/status/1047281394098028544                                                                                                                 |2823 |
|https://t.co/nKRTYgTH6T                                                                                                                                                |2763 |
|https://www.marieclaire.com/politics/a25308585/christine-blasey-ford-gofundme/?utm_source=twitter&utm_campaign=socialflowTWMAR&utm_medium=social-media&src=socialflowTW|2473 |
|https://t.co/zqVhJfsP17                                                                                                                                                |2473 |
|https://twitter.com/PPact/status/1055553082585702400                                                                                                                   |2025 |
|https://twitter.com/MoveOn/status/1044273007542390785                                                                                                                  |1865 |
|https://twitter.com/HelenBrosnan/status/1044245200015691776                                                                                                            |1746 |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believesurvivors.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+---------------------------------------------------------------------------------------------------+-----+
|media_url                                                                                          |count|
+---------------------------------------------------------------------------------------------------+-----+
|https://pbs.twimg.com/media/DMhkadMWkAA5kJr.jpg                                                    |2234 |
|https://pbs.twimg.com/tweet_video_thumb/Dn4BZw1WkAE7th5.jpg                                        |1246 |
|https://video.twimg.com/ext_tw_video/1044269645388304384/pu/pl/6p1SwPigFegfiavP.m3u8?tag=5         |1192 |
|https://video.twimg.com/ext_tw_video/1044269645388304384/pu/vid/360x640/8wWapr9Jf5yAe6bk.mp4?tag=5 |1192 |
|https://video.twimg.com/ext_tw_video/1044269645388304384/pu/vid/180x320/vTCPxygCceADwD4i.mp4?tag=5 |1192 |
|https://video.twimg.com/ext_tw_video/1044269645388304384/pu/vid/720x1280/1HjLDlp_gnVKlmBX.mp4?tag=5|1192 |
|https://pbs.twimg.com/media/DoIzXIdUUAABbEY.jpg                                                    |1117 |
|https://pbs.twimg.com/media/Dn4nGo3UUAEWDHx.jpg                                                    |1052 |
|https://pbs.twimg.com/media/Dn4BTfOU4AAbBYO.jpg                                                    |1022 |
|https://pbs.twimg.com/media/DoG8qY6XoAELfFw.jpg                                                    |1006 |
+---------------------------------------------------------------------------------------------------+-----+

A couple years ago I created a juxta (collage) of the images from this dataset. It features 81,265 images, and you can check it out here.

Emotion score

Using believesurvivors-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believesurvivors-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #BelieveSurvivors Tweets

Come Join the 2024 Winter Holiday Hunt / LibraryThing (Thingology)

It’s December, and we’re hosting a special Winter Holiday Hunt!

This hunt is meant to celebrate the season of light, and the holidays it brings. We wish all our members a Merry Christmas, Happy Hanukkah, and an entertaining hunt!

We’ve scattered a mint of candles around the site. You’ll solve the clues below to find the candles and gather them all together.

  • Decipher the clues and visit the corresponding LibraryThing pages to find some candles. Each clue points to a specific page on LibraryThing. Remember, they are not necessarily work pages!
  • If there’s a candle on a page, you’ll see a banner at the top of the page.
  • You have almost three weeks to find all the candles (until 11:59pm EST, Monday January 6th).
  • Come brag about your mint of candles (and get hints) on Talk.

Win prizes:

  • Any member who finds at least two candles will be awarded a candle Badge.
  • Members who find all 15 candles will be entered into a drawing for one of five LibraryThing (or TinyCat) prizes. We’ll announce winners at the end of the hunt.

P.S. Thanks to conceptDawg for the candle illustration! ConceptDawg has made all of our treasure hunt graphics in the last couple of years. We like them, and hope you do, too!

“Yes, go ahead, be funny” / John Mark Ockerbloom

When George Burns first teamed up with Gracie Allen, he wanted to be the comic while she played the straight part. They soon found they got more laughs the other way around. Their first film, Lambchops (now watchable online) adapts one of their vaudeville routines in a fourth-wall-breaking short that’s now in the National Film Registry. The couple enjoyed a long career in film, radio, and TV. With this film, they’ll be together again in the public domain in 14 days.

Exploring #OscarsSoWhite Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc (twarc2) via the Academic Access v2 endpoint. It contains 691,908 Tweets with the search term OscarsSoWhite from January 15, 2015 through April 8, 2023.

#OscarsSoWhite Tweet Volume
#OscarsSoWhite wordcloud
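
As a rough sketch (mine, not from the post; it assumes the v2 created_at field is ISO-8601), the daily volumes behind a plot like the one above could be tallied in Spark:

import org.apache.spark.sql.functions._

val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)

// Bucket tweets by calendar day and count them.
df.withColumn("day", to_date(col("created_at")))
  .groupBy("day")
  .count()
  .orderBy("day")
  .show(10)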

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+------+
|lang| count|
+----+------+
|  en|633423|
| qme| 16667|
|  es| 13721|
|  fr|  5532|
| qht|  4444|
|  pt|  3371|
| und|  3309|
|  de|  1696|
|  ja|  1433|
|  it|   848|
+----+------+

Top tweeters

Using oscarssowhite-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("oscarssowhite-user-info"), and pandas:

#OscarsSoWhite Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|687769539220770816 |13094            |
|569190885822308352 |12044            |
|687675158014902272 |7944             |
|695662884341284865 |6299             |
|1226648333080358913|5801             |
|687658331175952384 |4882             |
|555932035744026624 |4027             |
|691848301159804929 |3878             |
|704040698459381760 |3578             |
|836089194430746624 |3163             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 13,094

  2. 12,044

  3. 7,944 “Sorry, that post has been deleted.”

Top Hashtags

Using oscarssowhite-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("oscarssowhite-hashtags"), and pandas:

#OscarsSoWhite hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+------------------------------------------------------------------+-----+
|url                                                               |count|
+------------------------------------------------------------------+-----+
|https://amp.twimg.com/v/abb8cc7c-6f05-4163-9e39-e641c3d17194      |10454|
|https://t.co/TDl53Usnr0                                           |8251 |
|https://twitter.com/dlberes/status/687647501671907328             |5770 |
|https://t.co/3ZM9y09R7b                                           |5751 |
|https://twitter.com/meakoopa/status/695662884341284865/photo/1    |4529 |
|https://t.co/kpqGIxyTnF                                           |4525 |
|https://twitter.com/laurjbrown/status/742191565582569472/photo/1  |2536 |
|https://t.co/4KKUkL4Dis                                           |2524 |
|https://twitter.com/lifesAlyric/status/1162159262178693123/video/1|2043 |
|https://t.co/cqyYHNYkk0                                           |2042 |
+------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "oscarssowhite.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/Cad9tQRXIAAFpx3.jpg|4529 |
|https://pbs.twimg.com/media/B7Zp_lUIIAAS4hl.jpg|1645 |
|https://pbs.twimg.com/media/CcWZ65SXEAEx6W5.jpg|1601 |
|https://pbs.twimg.com/media/CYypO8kUoAIMOTk.jpg|900  |
|https://pbs.twimg.com/media/CcWDUbVXEAAdB0c.jpg|823  |
|https://pbs.twimg.com/media/CZP8DQEU8AA-QsT.jpg|736  |
|https://pbs.twimg.com/media/CYr_1CIVAAAkL86.jpg|689  |
|https://pbs.twimg.com/media/CYtvBgaWMAAbT0q.png|657  |
|https://pbs.twimg.com/media/B7eyhzNIMAA_QKy.jpg|645  |
|https://pbs.twimg.com/media/CkzLSCjWYAAUHz8.jpg|633  |
+-----------------------------------------------+-----+
#OscarsSoWhite Top Media

Emotion score

Using oscarssowhite-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("oscarssowhite-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #OscarsSoWhite Tweets

Exploring #BelieveWomen Twitter Data / Nick Ruest

Overview

This is a dataset that I hydrated in April 2023; it was created by Zachary Maiorana, Pablo Morales Henry, and Jennifer Weintraub. It is the “#metoo Digital Media Collection - Hashtag: believewomen”, part of the Schlesinger Library #metoo Digital Media Collection. The original dataset contains 691,744 Tweet IDs, of which I was able to hydrate 430,512 tweets, giving a hydration rate of 62.24%. The hydrated dataset covers October 15, 2017 through May 25, 2020.

#BelieveWomen Tweet Volume
#BelieveWomen wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believewomen.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+------+
|lang| count|
+----+------+
|  en|385257|
| qme| 27368|
| qht| 11601|
| und|  2416|
|  es|   696|
| art|   484|
| qam|   362|
|  fr|   343|
| qst|   246|
|  de|   162|
+----+------+

Top tweeters

Using believewomen-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believewomen-user-info"), and pandas:

#BelieveWomen Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believewomen.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+                                         
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1045672256796577793|31746            |
|1049518834972020737|23250            |
|1048676385684840448|14427            |
|1116765946092371970|9901             |
|930268728620486656 |6366             |
|1053475698692841473|6175             |
|1045777530517684224|4334             |
|1254993099115266049|4284             |
|1247237724710604800|3555             |
|1042864639065784320|3427             |
+-------------------+-----------------+   

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 31,746

  2. 23,250

  3. 14,427

Top Hashtags

Using believewomen-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believewomen-hashtags"), and pandas:

#BelieveWomen hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believewomen.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+---------------------------------------------------------------------+-----+
|url                                                                  |count|
+---------------------------------------------------------------------+-----+
|https://twitter.com/TwIzTeD_MiNd88/status/1053261293099143168/video/1|5453 |
|https://t.co/yOcjL40cWo                                              |5374 |
|https://twitter.com/variety/status/930266171701678081                |4473 |
|https://t.co/I8kW14HhAl                                              |4452 |
|https://twitter.com/Thatwasmymom/status/1049518834972020737          |3419 |
|https://twitter.com/juliealdermanb/status/1045672256796577793        |2707 |
|https://twitter.com/sophiabush/status/930268728620486656             |2138 |
|https://t.co/gYSwIX6Nrc                                              |1962 |
|https://t.co/wFw3e7VGD3                                              |1003 |
|https://twitter.com/RealJamesWoods/status/1053475698692841473        |995  |
+---------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "believewomen.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------------------+-----+
|media_url                                                  |count|
+-----------------------------------------------------------+-----+
|https://pbs.twimg.com/media/Do2d4I0X0AUI1Bq.jpg            |1428 |
|https://pbs.twimg.com/media/DnVQ2RcXgAIptdn.jpg            |1312 |
|https://pbs.twimg.com/media/EFe07peWoAAibsL.jpg            |522  |
|https://pbs.twimg.com/media/Do2TVS0W4AAZ7tP.jpg            |310  |
|https://pbs.twimg.com/media/Do2TVS8XgAEDPby.jpg            |310  |
|https://pbs.twimg.com/media/Do9AvpbXUAAjvRs.jpg            |277  |
|https://pbs.twimg.com/media/Dn4BBRhU4AAfNJA.jpg            |250  |
|https://pbs.twimg.com/media/Dn4zieuVsAAsmuZ.jpg            |229  |
|https://pbs.twimg.com/media/DoHiEZtU0AARoKP.jpg            |210  |
|https://pbs.twimg.com/tweet_video_thumb/DppwFr6UwAATOIK.jpg|186  |
+-----------------------------------------------------------+-----+

A couple years ago I created a juxta (collage) of the images from this dataset. It features 28,594 images, and you can check it out here.

Emotion score

Using believewomen-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("believewomen-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #BelieveWomen Tweets

Variations on an insistent theme / John Mark Ockerbloom

Maurice Ravel abandoned an orchestration of a Spanish composer’s work when he found out it had copyright complications. He instead started developing a theme based on Spanish dance music that he found had an “insistent quality”. With its much-repeated rhythm and theme, Boléro became one of Ravel’s best-known works. In 15 days it joins the US public domain in several versions published in 1929, for piano solo, piano duet, symphonic orchestra, and jazz orchestra.

Cherry-picking / David Rosenthal

Source
Via Barry Ritholtz we find this infographic, entitled Top Performing S&P 500 Stocks, showing the best total return over the past 5, 10, 15 and 20 years. Ritholtz sourced it from Ranked: The Top Performing S&P 500 Stocks in the Last Two Decades by Marcus Lu, with graphic design by Miranda Smith.

In each case, Nvidia is the best performing stock, and it is the only stock to appear in all four periods. Sounds great, doesn't it? Why wouldn't you just hold NVDA all the time and be guaranteed to beat the market?

But follow me below the fold for more detail from someone who has been long NVDA for more than three decades.

Source
Alas, this infographic is deeply misleading, because its creators cherry-picked their data. Nvidia’s stock price is extraordinarily volatile. The log plot of NVDA shows that, on average over its history, every three years it suffers a drop of between 45% and 80%. Fortunately, over the same time it has had much larger rises. Thus when discussing the return from being long NVDA over a period, the start and end dates matter a lot.

+------+--------+------+--------+-------------+------------+
| Dec  | Close  | Dec  | Close  | NVDA return | S&P return |
+------+--------+------+--------+-------------+------------+
| 2004 | 0.19   | 2009 | 0.38   | +100%       | -9.1%      |
| 2009 | 0.38   | 2014 | 0.48   | +26%        | +85.8%     |
| 2014 | 0.48   | 2019 | 5.91   | +1131%      | +61.7%     |
| 2019 | 5.91   | 2024 | 134.25 | +2172%      | +87.6%     |
+------+--------+------+--------+-------------+------------+
This table shows the performance of NVDA and the S&P 500 over each of the 5-year periods in the infographic. Over three of the 5-year periods NVDA out-performed the S&P 500, and over the remaining one it under-performed. By choosing appropriate starting points it is easy to find other 5-year periods when NVDA under-performed the S&P 500. For example, from 8/2007 to 8/2012 NVDA returned -63.7% but the S&P returned -5.6%. Or from 11/2001 to 11/2006 NVDA returned +10.7% but the S&P returned +23.5%.
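
As a minimal sketch (my illustration, using the December closes from the table above), each period return is simply end / start - 1:

// Recompute each 5-year NVDA return from the December closing prices.
val periods = Seq(
  (2004, 0.19, 2009, 0.38),
  (2009, 0.38, 2014, 0.48),
  (2014, 0.48, 2019, 5.91),
  (2019, 5.91, 2024, 134.25)
)
periods.foreach { case (y0, start, y1, end) =>
  val pct = (end / start - 1) * 100
  println(f"$y0%d-$y1%d: $pct%+.1f%%")  // +100.0%, +26.3%, +1131.3%, +2171.6%
}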

Source
Although the infographic suggests that the huge out-performance started 20 years ago and has been decreasing, this is misleading for two reasons. First, the infographic's numbers are for the whole 20-year period, not for each 5-year period. Second, they are dominated by the rise in the most recent 5-year period. The linear plot of NVDA makes it clear that, had the most recent 5-year period been like any of the others, Nvidia would not have been in the infographic at all.

None of this is financial advice. Nevertheless, Nvidia is a great company, and it is possible to make money in its stock. There are two ways to do it: trading on the stock’s volatility, or buying low and being prepared to hold it for many years.

Remember, past performance is no guarantee of future performance. If NVDA were to repeat the past 5-year return over the next 5 years, it would be around $3,100. To maintain its current P/E of 52.85 it would need annual revenue of around $2.6T.
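
The $3,100 figure follows from the same arithmetic; a quick sketch of that projection (again mine, not the author’s):

// If the next 5 years repeated the 2019-2024 multiple (134.25 / 5.91 ≈ 22.7x):
val projected = 134.25 * (134.25 / 5.91)
println(f"$$$projected%.0f")  // ≈ $3049, i.e. "around $3,100"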

Keeping up with next-generation metadata in archives and special collections / HangingTogether

Over the course of the past year, the OCLC Research Library Partnership (RLP) Metadata Managers Focus Group has delved into challenges related to staffing, succession planning, and thinking about metadata changes. Metadata Managers planning group member Chloe Misorski (Ingalls Library and Museum Archives, Cleveland Museum of Art) highlighted that these challenges are particularly pronounced for staff managing specialized collections and archives, who often operate with smaller workforces and budgets than larger, general collections.

To gain a deeper understanding of how these collection managers are navigating emerging next-generation metadata environments, we invited RLP cataloging and metadata colleagues in museum libraries, independent research libraries, art libraries, and special collections or archives within larger campus networks to join the conversation. These are the metadata and cataloging colleagues of those who may be participating in broader discussions led by my RLP colleague Chela Weber as part of the Archives and Special Collections Leadership Roundtable.

We asked our participants to reflect on three prompts:

  • How are you preparing for next-generation metadata and linked data?
  • How are you developing new metadata workflows?
  • What other factors are impacting your current metadata operations and planning?

Our discussions revealed a sense of apprehension stemming from the ongoing tension between the unique needs required of managing special collections and archives and the limited resources available, particularly in areas of cataloging and metadata management.

Ten-panel folding screen constructed of ink and silk depicting books and writing implements arranged on shelves. Books and Scholars’ Accoutrements, late 1800s. Yi Taek-gyun (Korean, 1808-after 1883). Courtesy The Cleveland Museum of Art, Leonard C. Hanna Jr. Fund 2011.37

Backlogs

Several participants expressed that their most pressing challenge is addressing backlogs that predate the pandemic and have grown in the subsequent years due to limited cataloging resources. Some participants noted that they were only now back to previous staffing levels and that their primary focus was on tackling these backlogs using existing practices rather than looking ahead to next-generation metadata practices.

finding time to look to the future is difficult when keeping up with the present is so challenging

Workforce

Many of our participants are still dealing with a wave of retirements and organizational changes. These impacts are widespread but particularly severe in areas requiring specialized knowledge for cataloging unique materials. Staff reductions and retirements have exacerbated backlogs, and new hires often only slow backlog growth rather than reduce it.

Larger organizations have consolidated cataloging units, combining general and special cataloging staff. This has increased the workload for managers, who must re-imagine unit operations and provide growth opportunities for staff. Cross-training generalists to work with special collections is one strategy, but it brings challenges in balancing expertise levels across teams.

Managers often secure term-limited positions for specific projects, but training these catalogers is time-consuming, and they frequently leave when their terms end. This lack of continuity places a heavy burden on developing good documentation of cataloging practices. Recruiting for term-limited positions is also difficult given the unique needs of special libraries, and description work often stagnates once funding ends or staff depart.

Workflows

“Good enough” workflows: In the face of too much work and too few resources, many are working to define “good enough” workflows and standards for cataloging. Where, in the past, records for special materials may have been held back until they were “perfect,” resource limitations mean thinking about how to move records of sufficient quality forward.

Identifying efficiencies: Others are taking a hard look at their workflows and identifying opportunities for efficiencies. Staff at Princeton University Special Collections shared how they found efficiencies by developing a “MARC Factory” that takes spreadsheets provided by book dealers and converts them to MARC records “good enough” to bring into other cataloging workflows. This is also an interesting example of what we’ve been calling “social interoperability,” because cataloging staff, acquisitions staff, and book dealers all participated in crafting a workflow that worked for each of them.

Systems

Adding to existing challenges, several libraries are in the midst of system migrations. Even the smoothest implementations can cause additional disruptions, exacerbating issues related to staff shortages, backlogs, and reorganizations. Staff often need to freeze work, learn the intricacies of the new system, and rethink previous workflows.

Some libraries that invested in bespoke or open-source systems (OSS) to handle their special materials are finding it difficult to maintain them in the face of reduced resources. These systems are frequently built on technology stacks that are continually changing and therefore need close attention to prevent security breaches. Maintaining bespoke systems may require gaining buy-in from leadership, who are weighing competing demands on limited resources even when they share the same goals. Consequently, libraries seek commercial off-the-shelf (COTS) solutions requiring less maintenance. At the same time, several participants mentioned how Yale University’s LUX platform (a multi-system open-source integration) is providing leadership in this area, even if their own institutions cannot build or sustain a similar platform with internal resources.

Abandoning a home-grown solution may come with other costs—if metadata in the source system is highly heterogeneous and not standards-based, it may need to be cleaned up to be migrated into new systems with less tolerance for creative descriptions. Even when moving high-quality metadata, work is required to ensure that the migration happens smoothly in ways that activate beneficial new features in the target system.

Next-generation metadata & artificial intelligence

Despite the challenges of staffing, backlogs, and reparative metadata work, many participants noted that they continue to pay attention to developments around linked data by attending webinars, creating test accounts, and exploring new tools.

As the market for library linked data tools is still emerging, many are taking a wait-and-see approach. When it is challenging to maintain existing systems and services, discussion participants find it difficult to consider extending workflows and financial obligations. Others continue to use home-grown solutions, especially for managing entities for digital and cultural collections that are not dependent on MARC-based workflows. For example, the University of Nevada, Las Vegas, has developed a portal that allows entities identified with multiple URIs to connect across LC/NAF, local data, and external resources.

For several participants, artificial intelligence seems like a distant solution, especially for special materials. There is skepticism about whether existing chatbot services can produce good descriptions. Despite an interest in “good enough” records, AI-generated records may not be worth developing the workflows needed to validate and remediate issues, especially for descriptions of special materials. Concerns are amplified for institutions with large archival collections of analog items. While the Library of Congress’s experiment on the use of AI for cataloging ebook backlogs is promising, it doesn’t overcome the hurdles faced by staff in archives and special collections.

From the view of our participants, AI is drawing attention away from the day-to-day realities and complexities of cataloging workflows. While it may prove useful in the future, current applications of AI still need the guidance and expertise of catalogers who are knowledgeable about special materials.

Next steps

Multiple participants noted the value they’d found in Total Cost of Stewardship: Responsible Collection Building in Archives and Special Collections. Chela and I discussed how we might bring this resource to the attention of Metadata Managers and what it might help us to do based on the challenges reported in this session. We are currently planning a future round-robin as a follow-up.

The Metadata Managers Focus Group will also take a closer look at emerging next-generation metadata workflows, starting with Activating URIs in linky MARC in January 2025 (see the RLP Events calendar for dates and times). These sessions will be tied to follow-up conversations with OCLC colleagues who are building the future of cataloging. We hope to explore emerging use cases for how OCLC is bridging existing expertise and workflows in ways that meet libraries where they are today – whether that’s in new editing environments or through a suite of APIs that enable the creation and curation of linked data entities and descriptive relationships.

Exploring #MMIWG2S Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc (twarc2) via the Academic Access v2 endpoint. It contains 977,677 Tweets with the search terms MMIWG2S or MMIWG from February 24, 2010 through April 8, 2023.

#MMIWG2S Tweet Volume
#MMIWG2S wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+------+
|lang| count|
+----+------+
|  en|878634|
| qme| 70528|
| qht|  9211|
|  fr|  4605|
|  ht|  2700|
| zxx|  2685|
| und|  2555|
|  et|  1249|
|  es|  1006|
|  ro|   700|
+----+------+

Top tweeters

Using mmiwg2s-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("mmiwg2s-user-info"), and pandas:

#MMIWG2S Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1267942790413144067|15018            |
|1268702243429052416|10910            |
|1389899190105518085|8282             |
|1268958782446383107|6557             |
|1134518401311924224|6112             |
|1257822482423328770|4195             |
|1257696242760589312|3760             |
|1269113291202154496|3093             |
|1522197758484037632|2982             |
|1600620899706994688|2866             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 15,018

  2. 10,910

  3. 8,282

Top Hashtags

Using mmiwg2s-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("mmiwg2s-hashtags"), and pandas:

#MMIWG2S hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+------------------------------------------------------------------+-----+      
|url                                                               |count|
+------------------------------------------------------------------+-----+
|http://durangocide.com                                            |7724 |
|https://www.mmiwg-ffada.ca/final-report/                          |2975 |
|https://t.co/oD5KrQzhdK                                           |2712 |
|http://Durangocide.com                                            |2678 |
|https://www.thepetitionsite.com/tell-a-friend/62988149p           |1304 |
|https://t.co/GBOxO7H1cc                                           |1261 |
|https://t.co/2OJyhERDuW                                           |1032 |
|https://twitter.com/LakotaMan1/status/1491284311789957120/photo/1 |1032 |
|https://twitter.com/MarkRuffalo/status/1389962831949402115/photo/1|855  |
|https://t.co/pupWf8PSGt                                           |855  |
+------------------------------------------------------------------+-----+
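Note that http://durangocide.com and http://Durangocide.com are tallied separately above. Hostnames are case-insensitive, so one rough way to merge such variants is to lowercase before grouping (URL paths are case-sensitive in general, and t.co short links would still need expanding, so treat this as an approximation):

import io.archivesunleashed._
import org.apache.spark.sql.functions.{col, lower}

val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)

// Lowercasing merges case variants like durangocide.com/Durangocide.com.
df.urls
  .groupBy(lower(col("url")).as("url"))
  .count()
  .orderBy(col("count").desc)
  .show(10, false)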

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "mmiwg2s.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/FLIbLPMUUAEpb2Q.jpg|1032 |
|https://pbs.twimg.com/media/E0oj1EAWUAIrvDC.jpg|855  |
|https://pbs.twimg.com/media/FILsqOLXMAc9yQK.jpg|666  |
|https://pbs.twimg.com/media/E0o5c_xWQAodkAF.jpg|526  |
|https://pbs.twimg.com/media/FRYqfFQXoAABzkZ.jpg|441  |
|https://pbs.twimg.com/media/FjDobs_XgAIY6vx.jpg|402  |
|https://pbs.twimg.com/media/E_2JNbuUUAkhVj0.jpg|397  |
|https://pbs.twimg.com/media/FR_RJIKX0AEMPVm.jpg|372  |
|https://pbs.twimg.com/media/E_57eAJUUAURadM.jpg|364  |
|https://pbs.twimg.com/media/FaTQKsOXgBAadf8.jpg|342  |
+-----------------------------------------------+-----+
#MMIWG2S Top Media

Emotion score

Using mmiwg2s-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("mmiwg2s-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #MMIWG2S Tweets

Tintin au pays du domaine public / John Mark Ockerbloom

In 1929 a Belgian reporter began a series of global adventures in the pages of Le Petit Vingtième. In 16 days Tintin starts a new journey into the public domain. But those wanting to meet him there must brave pitfalls around issues like where he’s public domain (only in the US for now), what’s reusable (only what’s in the 1929 French-language strips), and even whether a controversial precedent might keep him copyrighted in parts of the US. Great snakes- er, Sapristi!

Publisher Interview: Zibby Books / LibraryThing (Thingology)

Zibby Owens

LibraryThing is pleased to present our second annual Independent Publisher interview. We sat down this month with Zibby Owens, founder and CEO of Zibby Media, the parent company of the Moms Don’t Have Time to Read Books podcast, Zibby’s Bookshop in Santa Monica, CA, and the boutique publishing company, Zibby Books, which has offered giveaways through our very own Early Reviewers program. A bestselling author, Owens has penned the novel Blank, the memoir Bookends: A Memoir of Love, Loss and Literature, and the children’s book Princess Charming. She is also the editor of three anthologies, Moms Don’t Have Time to Have Kids, Moms Don’t Have Time To: A Quarantine Anthology, and most recently, On Being Jewish Now: Reflections from Authors and Advocates. Owens has contributed to publications such as Vogue and Oprah Daily; appeared on CNN, CBS and other media outlets; and been described as “NYC’s most powerful book-fluencer.” She sat down with Abigail to answer some questions about her work, and about Zibby Books.

Tell us a little bit about Zibby Books. When and how did it get started, and what does it publish? You describe yourself on your site as “woman-led”—what significance does that have, in terms of your ethos?

It started after I got to know many authors through my podcast and realized how disappointed so many were with their publishing journey. I wanted to make it better! Woman-led is important in that our team is almost all women, as are the authors we publish!

What role(s) do you play at Zibby Books, in addition to founder and CEO? Do you take a hand in editing? What do you look for in the books you want to publish?

I get into the weeds on select titles but in general, I decide on all the acquisitions, I help with marketing and everything related to packaging, and provide oversight on all. Anne Messitte runs the show!

What are some of your favorites, of the books you’ve published so far, and why?

I can’t really pick but I’ll say some of our most successful have been THE LAST LOVE NOTE and Pictures of You by Emma Grey, Here After: A Memoir by Amy Lin, and Everyone But Myself: A Memoir by Julie Chavez. They’re all about helping us get through something: grief, motherhood, work stress… and giving a hopeful outcome or attitude.

Between your podcast, your bookshop, your publishing company, and your writing—not to mention raising four children!—you have a lot on your plate. How do you make time for it all? What insights does being involved in so many different areas of the book world—as writer, as publisher, as blogger and influencer—give you, when it comes to each role? Are there challenges in being on “all sides” of the process?

I work pretty much nonstop because I’m obsessed with what I do! There are fewer challenges and more joys at being on all sides. I love seeing how the machine works and assessing if I can improve it! I make time by being very intentional with my schedule, cutting off work to go pick up my kids and take them to activities and all that, and having a fabulous team.

You pulled Zibby Books out as a sponsor of the National Book Awards last year, after learning that the authors were planning an organized protest of Israeli actions in response to the October 7th terrorist attack. Can you talk a little bit about the antisemitism you have seen in the book world, since the October 7th attack?

It would fill a book. The literary industry has really taken a hit which is why I continue to speak up and advocate for change.

What’s on the horizon for you, and for Zibby Books? What can we look forward to reading (or listening to) next?

So many great books! A novel by NYT bestselling author John Kenney, a novel by UK bestseller Jane Costello, a debut novel from Nanda Reddy, and an essay collection by podcaster Amy Wilson. ALL SO GOOD!

Tell us about your own personal library. What’s on your shelves?

All the books coming out in the next five months!!!

What have you been reading lately, and what would you recommend to other readers?

I just finished Fredrik Backman’s upcoming novel My Friends. One of my all-time favorites.

Letter from the Editors / Information Technology and Libraries

Welcome to the December 2024 issue of Information Technology and Libraries. In addition to highlights from the current issue, our letter from the editors provides updates about our revised editorial workflow and the addition of 93 previously unavailable issues of this journal and our predecessor, the Journal of Library Automation.

Beyond the Bookshelf / Information Technology and Libraries

This column presents a case study exploring innovative approaches to digital librarianship within a distance learning Higher Education institution based in the UK. Key initiatives included asynchronous information literacy instruction, Python scripts for auditing course materials for broken links and copyright compliance, and management of physical extracts via a digital content store. It examines the challenges of building an online library service, balancing learner-centric practice with efficiency and cost-effectiveness. The study analyses the work done and presents future initiatives, offering insights and sharing practices for solo or small team librarians navigating the evolving landscape of both distance and face-to-face education.

How Kilgore Memorial Library Fostered County-Wide Collaboration through One Shared Calendar / Information Technology and Libraries

This essay recounts the development of the “One County, One Calendar” initiative at Kilgore Memorial Library in York, Nebraska. What began as a simple solution for managing the library’s meeting room bookings evolved into a county-wide collaboration aimed at improving communication about local events. By working with the York Chamber of Commerce, the York County Development Corporation, and other county leaders, the library created a shared calendar system that now serves all of York County. This project exemplifies how libraries can play a pivotal role in fostering collaboration and solving community challenges, positioning public libraries as essential facilitators of information and engagement.

Beyond the Minimum / Information Technology and Libraries

In April 2024, the Department of Justice finalized a rule updating regulations for Title II of the Americans with Disabilities Act (ADA), which requires that all state and local governments make their services, programs, and activities accessible, including those that are offered online and in mobile apps. The final rule dictates that public entities’ web content meet the technical standards of the Web Content Accessibility Guidelines (WCAG) Version 2.1, level AA, an industry standard since its creation in 2018.

Libraries that receive federal funding will be required to follow this rule for any web content they create, including LibGuides. Springshare’s LibGuide platform is one of the most widely used among libraries for web content creation, from complete websites to pedagogical and research guides. While Springshare may develop plans to make sure its clients are in compliance with this new rule, there are more important questions that LibGuide creators need to consider to move beyond the bare minimum of following the rule. The authors explain what WCAG 2.1 AA compliance requires, how LibGuide authors can use accessibility principles to ensure compliance, and offer available tools to check existing guides, as well as discuss alternatives to LibGuides.

Text Analysis of Archival Finding Aids / Information Technology and Libraries

Archival repositories must be strategic and selective in deciding what collections they will acquire and steward. Careful collection stewards balance many factors, including ongoing resource needs and future research use. They ensure new acquisitions build upon existing topical strengths in the repository’s holdings and reassess these existing strengths regularly through multiple lenses. In this study, we examine the suitability of text analysis as a method for analyzing collection scope strengths across a repository’s physical archival holdings. We apply a tool for text analysis called Leximancer to analyze a corpus of archival finding aids to explore topical coverage. Leximancer results were highly aligned with the baseline subject heading analysis that we performed, but the concepts, themes, and co-occurring topic pairs surfaced by Leximancer suggest areas of collection strength and potential focus for new acquisitions. We discuss the potential applications of text analysis for internal library use including collection development, as well as potential implications for wider description, discovery, and access. Text analysis can accurately surface topical strengths and directly lead to insights that can inform future acquisition decisions and archival collection development policies.

Pure Client-Side Data Wrangling / Information Technology and Libraries

While many languages are used for data manipulation, it is unusual to see JavaScript put to this task. This paper describes a novel application built to manipulate catalog patron data using only JavaScript running in a browser. Further, it describes the approach of building and deploying “strongly single page web applications,” a more extreme version of single page applications that have been condensed into a single HTML file. The paper discusses the application itself, how it is used, and the way that possessing web development and coding skills in an organization’s systems department can help it flexibly respond to challenges using such novel solutions.

Improving the Student Search Experience in LibGuides / Information Technology and Libraries

This study is an in-depth look at the use of the LibGuides search function by college students. We sought to better understand the mental models with which they approach these searches and to improve their user experience by reducing the percentage of searches with no results. We used two research methods: usability testing, which involved 15 students in two rounds, and analysis of search terms and search sessions logged during three different weeks. Interface changes were made after the first round of usability testing and our analysis of the first week of search data. Additional changes were made after the second round of usability testing and analysis of the second week of search data.

The usability tests highlighted a mismatch between the LibGuides search behavior and the expectations of student users. Results from both rounds of testing were very similar. The search analysis showed that the level of no-result searches was slightly lower after the interface changes, with most of the improvement seen in Databases A-Z searches. Within the failed searches, we saw a reduction in the use of topic keywords but no improvement in the other causes we studied. The most significant change we observed was a drop in the level of search activity.

This research provides insights that are specific to the LibGuides platform—about the underlying expectations that students bring to it, how they search it, and the reasons why their searches do and do not produce results. We also identify possible system improvements for both academic libraries and Springshare that could contribute to an improved search experience for student users.

Raspberry Pi-Based Offline Digital Library for Indonesian Villages Without Stable Power and Internet Access / Information Technology and Libraries

As one of the world’s largest archipelagic nations, Indonesia faces a major challenge in distributing its wealth, including basic infrastructure such as electricity and internet, to some of its most remote islands. Due to its remote location, most people in this area live in poverty and have poor-quality education. To help solve this problem, an offline digital library was created based on a Raspberry Pi “minicomputer.” With the ability to store more than 2,500 educational movies and e-books for offline viewing, the device has proven to be reliable even in areas with unpredictable power.

Coding with Those Who Show Up / Information Technology and Libraries

This paper considers our library’s attempt at applying a “laissez-faire leadership” model to technical committee work. Since its introduction in the 1990s, scholarship on laissez-faire leadership has historically viewed the concept very negatively. However, we argue here that many of these perspectives are straw man arguments that do not adequately consider the possibilities of a laissez-faire model. Following some dissenting voices in the literature, we would like to reclaim the laissez-faire model as a way to facilitate library technical work under certain very specific circumstances. This paper will describe the organizational context where these laissez-faire methods worked for us. Our conclusion is that this approach can promote autonomy, responsibility, and productivity. We feel that this reevaluation of this concept can provide an important framework for self-organization when doing technical work.

An Examination of Academic Library Platforms and Systems during COVID-19 / Information Technology and Libraries

This paper examines the use and subsequent trajectory of academic library technologies due to the impact of the COVID-19 pandemic. Taking a broad view of technologies, the systems and services discussed will center around resource use because COVID restrictions shuttered many in-person technologies. The two academic libraries compared in this study show a similar pattern of use and signal growth of certain platforms and technologies for the future.

“I dipped into the book – and got hooked” / John Mark Ockerbloom

Three people unsatisfied with their lives leave home, meet a falling-apart performing troupe, and reinvent themselves as The Good Companions. J.B. Priestley’s long comic story of a now-bygone Yorkshire has had devoted British fans for many years, and inspired film, radio, TV, and theatrical adaptations in the UK.

The book has made less of a mark in the US. That could change in 17 days, when it joins the public domain here, years before it does most anywhere else.

Exploring Roe v. Wade Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc using version twarc2 via the Academic Access v2 Endpoint, and help from a few great colleagues and friends in the DocNow Slack. It contains 3,501,447 Tweets with the search terms #roevwade, #roevswade, #DobbsvJackson, #dobbsdecision, #Dobbs, and #AbortionBan from May 1, 2022 through June 16, 2022.
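With several search terms folded into one dataset, it can be useful to see roughly how much each term contributes. A sketch using twut’s text output (assuming its column is named text; note that a substring match for #dobbs also catches #dobbsdecision):

import io.archivesunleashed._
import org.apache.spark.sql.functions.{col, lower}

val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)

// Rough per-term counts via substring matching on lowercased text.
val texts = df.text.select(lower(col("text")).as("text")).cache()
Seq("#roevwade", "#dobbs", "#abortionban").foreach { term =>
  println(s"$term: " + texts.filter(col("text").contains(term)).count())
}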

Roe v. Wade Tweet Volume
Roe v. Wade wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+
|lang|  count|
+----+-------+
|  en|3110315|
| qme| 102365|
|  fr|  71789|
|  es|  56680|
| qht|  32451|
|  th|  31636|
|  de|  24272|
|  it|  20295|
| und|  16855|
|  nl|   7387|
+----+-------+

Top tweeters

Using roevwade-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("roevwade-user-info"), and pandas:

Roe v. Wade Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1540346175282356225|99281            |
|1540363850876346371|89395            |
|1521310392311558144|52535            |
|1540548661129539584|44707            |
|1521303464327929857|33426            |
|1521301933897756677|27944            |
|1521585413932163081|26350            |
|1540525520118599681|25218            |
|1521295778706206726|22717            |
|1540373700410630145|22010            |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 99,281

  2. 89,395

  3. 52,535

Top Hashtags

Using roevwade-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("roevwade-hashtags"), and pandas:

Roe v. Wade hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+-------------------------------------------------------------------+-----+
|url                                                                |count|
+-------------------------------------------------------------------+-----+
|https://twitter.com/SaintHoax/status/1540363850876346371/video/1   |80975|
|https://t.co/4UbDcjknLd                                            |78131|
|https://twitter.com/vicsepulveda/status/1540402447817920512/photo/1|10892|
|https://t.co/eh2G1o6Tzq                                            |10891|
|https://twitter.com/CH005Y/status/1521676825063264257/video/1      |8386 |
|https://t.co/OGSXlC5seX                                            |7323 |
|http://liar.com                                                    |5617 |
|http://Robme.org                                                   |5539 |
|https://t.co/nUDpQHycfL                                            |5104 |
|https://t.co/VgHuvFgyXh                                            |5033 |
+-------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "roevwade.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/FWCb3EeUYAA1eOi.jpg|10892|
|https://pbs.twimg.com/media/FWCU112X0AAkLDj.jpg|3512 |
|https://pbs.twimg.com/media/FWBkV_IXwAIIPwz.png|3088 |
|https://pbs.twimg.com/media/FWBhiQWWAAE-5g4.jpg|2808 |
|https://pbs.twimg.com/media/FWEd00RXoAAk4MC.jpg|2581 |
|https://pbs.twimg.com/media/FWBmkP1XoAA618s.jpg|2561 |
|https://pbs.twimg.com/media/FR6ygX8X0AIgyVx.jpg|2269 |
|https://pbs.twimg.com/media/FWCV09nXgAA7hDB.jpg|2167 |
|https://pbs.twimg.com/media/FWBotT0XoAEYrYO.jpg|2016 |
|https://pbs.twimg.com/media/FWBkzwfUIAE1rEx.jpg|2009 |
+-----------------------------------------------+-----+

A couple years ago I created a juxta (collage) of the images from this dataset. It features 510,807 images, and you can check it out here.

Emotion score

Using roevwade-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("roevwade-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in Roe v. Wade Tweets

Exploring #onpoli Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc using version twarc2 via the Academic Access v2 Endpoint. It contains 16,458,701 Tweets with the search term onpoli from July 14, 2009 through December 31, 2022. I have a brief post about this dataset here as well, which includes a different Tweet Volume chart highlighting election cycles.
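The counts behind a volume chart like that can be reproduced by bucketing tweets per day; this sketch assumes the flattened v2 records keep their ISO-8601 created_at field:

import io.archivesunleashed._
import org.apache.spark.sql.functions.col

val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)

// Daily tweet counts; "created_at" is assumed to be an ISO-8601 string.
df.select(col("created_at").cast("date").as("day"))
  .groupBy("day")
  .count()
  .orderBy("day")
  .show(10)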

#onpoli Tweet Volume
#onpoli wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+--------+
|lang|   count|
+----+--------+
|  en|15810481|
| qht|  203116|
| qme|  181824|
|  fr|  140310|
| und|   83849|
|  es|    7561|
|  in|    3628|
|  ca|    3341|
|  ro|    2946|
|  ht|    2892|
+----+--------+

Top tweeters

Using onpoli-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("onpoli-user-info"), and pandas:

onpoli Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1383556131323146249|9109             |
|1496263180267446273|4357             |
|1532021002665963521|4343             |
|1547279824648953856|3934             |
|1518734995438948353|3566             |
|1491087746223702021|3372             |
|1571604171740028930|3290             |
|1119263668456214528|3273             |
|1268567604152733698|3198             |
|1587548059000770560|3167             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 9,109

  2. 4,357

  3. 4,343

Top Hashtags

Using onpoli-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("onpoli-hashtags"), and pandas:

onpoli hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+------------------------------------------------------------------+-----+
|url                                                               |count|
+------------------------------------------------------------------+-----+
|http://facebook.com/OfficialJReno                                 |4664 |
|https://twitter.com/GenzelFamily/status/821769877681799169/photo/1|4661 |
|http://www.onpoli.org                                             |3258 |
|https://t.co/W445Fw3EKj                                           |2752 |
|https://t.co/Qw52iJT1TG                                           |2750 |
|https://t.co/qNkOMGm4dG                                           |2514 |
|http://virusfacts.ca/tw                                           |2446 |
|https://support.twitter.com/articles/20169199                     |2330 |
|http://Looniepolitics.com                                         |2245 |
|http://looniepolitics.com                                         |2210 |
+------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "onpoli.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/C2eDVsGUQAA07Kz.jpg|4661 |
|https://pbs.twimg.com/media/Fgq1k2mWAAEtJVY.png|1996 |
|https://pbs.twimg.com/media/CfpftDWWsAAu5yV.jpg|1796 |
|https://pbs.twimg.com/media/FL6Yx7ZXwAEaMvJ.jpg|1695 |
|https://pbs.twimg.com/media/FgrmiL6WAAMv8Sz.jpg|1575 |
|https://pbs.twimg.com/media/Eyy7aGqW8AEUca1.jpg|1263 |
|https://pbs.twimg.com/media/FQK0dqdWQAMNXPw.jpg|1217 |
|https://pbs.twimg.com/media/EzSjJj-VIAM8Wer.jpg|1146 |
|https://pbs.twimg.com/media/FWSNWWoXgAAc5PK.jpg|1038 |
|https://pbs.twimg.com/media/EtO5N49XMAIARkL.jpg|992  |
+-----------------------------------------------+-----+
#onpoli Top Media

Emotion score

Using onpoli-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("onpoli-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #onpoli Tweets

The king of all, Sir Duke / John Mark Ockerbloom

Duke Ellington led jazz bands and orchestras for over 50 years, and wrote or had a hand in writing over 1000 pieces of music. He began making records in 1924, and many other musicians have also recorded his work since. Van and Schenck sang his “Choo Choo (Gotta Hurry Home)” as a novelty song on a record released in 1924. Later that year, Ellington’s Washingtonians released a jazz instrumental version. Both records finally come home to the public domain in 18 days.

A turning point for Faulkner / John Mark Ockerbloom

William Faulkner struggled going into 1929. His sprawling novel on fading southern aristocrats had been rejected by eleven publishers, and his novel in progress, a nonlinear stream-of-consciousness narrative, would be hard to sell. After severe cuts, the first novel became Sartoris, launching his Yoknapatawpha County saga. The second, The Sound and the Fury, was cited as his “breakthrough” in the award of his Nobel Prize. Both books reach the public domain in 19 days.

Exploring WWG1WGA Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc using version twarc2 via the Academic Access v2 Endpoint. It contains 2,277,298 Tweets with the search term wwg1wga from February 24, 2018 through October 9, 2022.

WWG1WGA Tweet Volume
WWG1WGA wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "wwg1wga.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+
|lang|  count|
+----+-------+
|  en|1728918|
| qme| 217208|
| qht| 114168|
| und|  70492|
|  pt|  32640|
|  ja|  31177|
|  de|  17161|
|  es|  16301|
|  fr|  10350|
|  nl|   7591|
+----+-------+

Top tweeters

Using wwg1wga-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("wwg1wga-user-info"), and pandas:

wwg1wga Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "wwg1wga.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+                                         
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1269391807173079040|5751             |
|1243635945045479426|5305             |
|1268768470524522496|5040             |
|1275001760843730945|4726             |
|1249824937369575426|3837             |
|1268576616478801920|2873             |
|1010680758879453184|2788             |
|1268694691798638593|2669             |
|1111385986804510721|2518             |
|1266780876979015680|2486             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 5,751

  2. 5,305

  3. 5,040

Top Hashtags

Using wwg1wga-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("wwg1wga-hashtags"), and pandas:

wwg1wga hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "wwg1wga.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+----------------------------------------------------------------------+-----+  
|url                                                                   |count|
+----------------------------------------------------------------------+-----+
|https://twitter.com/sandyleevincent/status/1269025005221883914/video/1|5216 |
|https://t.co/7jRNbDbnYa                                               |5214 |
|https://twitter.com/SuperEliteTexan/status/1103403645951893506        |1853 |
|https://t.co/UqARLtZTTW                                               |1846 |
|https://qanon.pub/                                                    |1807 |
|https://twitter.com/JipeFIN/status/1198297777677176832/photo/1        |1619 |
|https://t.co/hEqgxEKClS                                               |1619 |
|https://support.twitter.com/articles/20169199                         |1576 |
|https://t.co/Kh05juJtiI                                               |1519 |
|http://qmap.pub                                                       |1500 |
+----------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "wwg1wga.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/EKE1do0WwAEt6_X.jpg|1619 |
|https://pbs.twimg.com/media/DjuiNHpV4AAgBNZ.jpg|1364 |
|https://pbs.twimg.com/media/D2xfCrZX0AAy-Er.jpg|407  |
|https://pbs.twimg.com/media/D0YP4hNW0AAsZOh.jpg|379  |
|https://pbs.twimg.com/media/DjP7hKJVAAIEDkI.jpg|349  |
|https://pbs.twimg.com/media/EX5QPSPWAAA2erd.jpg|317  |
|https://pbs.twimg.com/media/EWlg5c5X0AAJssh.jpg|312  |
|https://pbs.twimg.com/media/D2xdPizWwAEGaa3.jpg|302  |
|https://pbs.twimg.com/media/D2xVNZpWoAEuAUS.jpg|274  |
|https://pbs.twimg.com/media/D2xY5slXQAAvg_n.jpg|274  |
+-----------------------------------------------+-----+

A couple years ago I created a juxta (collage) of the images from this dataset. You can check it out here.

Emotion score

Using wwg1wga-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("wwg1wga-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in WWG1WGA Tweets

Exploring Elon Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc using version twarc2 via the Academic Access v2 Endpoint. It contains 7,721,050 Tweets with the search term elon from June 11, 2022 through October 28, 2022.

This dataset is also flattened line-oriented JSON data from the Twitter v2 API, which required yet another round of updates to twut. It was something I knew was on my agenda a few years ago, but I never had that special combination of a use case and time on hand to get it done. Now, it’s done: twut-1.1.0!
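A quick way to see what that flattened v2 layout looks like, and what the updated twut has to work with, is to print the inferred schema:

import io.archivesunleashed._

val tweets = "elon.jsonl"
val df = spark.read.json(tweets)

// Spark infers the flattened v2 field layout from the JSONL itself.
df.printSchema()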

elon Tweet Volume
elon wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "elon.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+                                                                  
|lang|  count|
+----+-------+
|  en|6091601|
|  es| 445054|
|  pt| 253698|
|  fr| 237101|
|  tr| 154485|
|  de|  92694|
|  in|  55520|
|  th|  49253|
| zxx|  40027|
|  et|  35774|
+----+-------+

Top tweeters

Using elon-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("elon-user-info"), and pandas:

elon Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "elon.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+                                         
|           tweet_id|max_retweet_count|
+-------------------+-----------------+
|1518658761979842560|           112866|
|1286809839499464704|            79814|
|1518972397277396992|            74649|
|1527772289533501440|            67668|
|1519040404087320578|            60573|
|1514575288319070214|            60440|
|1522723199993233408|            57153|
|1519012937704394752|            55310|
|1518646196952350726|            53564|
|1259206584825233409|            53015|
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 112,866

  2. 79,814

  3. 74,649

Top Hashtags

Using elon-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("elon-hashtags"), and pandas:

elon hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "elon.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+---------------------------------------------------------------------------------------------------------+-----+
|url                                                                                                      |count|
+---------------------------------------------------------------------------------------------------------+-----+
|https://t.co/31rRgKk2qP                                                                                  |37150|
|https://twitter.com/SaeedDiCaprio/status/1579090020019097602/photo/1                                     |37150|
|https://twitter.com/Wavytweets_/status/1542431192200482822/photo/1                                       |29744|
|https://t.co/PzyjcuYyl6                                                                                  |29740|
|https://www.theguardian.com/technology/2021/nov/08/tesla-shares-fall-elon-musk-twitter-poll-sell-off-plan|25077|
|https://twitter.com/mcphils_/status/1552176159475249153/photo/1                                          |20531|
|https://t.co/iry3EoAe3T                                                                                  |20512|
|https://twitter.com/HighyieldHarry/status/1545759072942788616/photo/1                                    |14032|
|https://t.co/TcMLs7WcrO                                                                                  |14030|
|https://twitter.com/TechEmails/status/1575588277700026368/photo/1                                        |12717|
+---------------------------------------------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "elon.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/FeoOAFhXkAESzAB.jpg|18576|
|https://pbs.twimg.com/media/FeoOAFiXwAQJeZy.jpg|18576|
|https://pbs.twimg.com/media/FWfQ_aGX0AA065Y.jpg|14874|
|https://pbs.twimg.com/media/FWfQ_aMXwAIHdnA.jpg|14874|
|https://pbs.twimg.com/media/FXOjrz5WAAADK0d.jpg|14035|
|https://pbs.twimg.com/media/FZ_D0tPWAAAT_tY.jpg|12710|
|https://pbs.twimg.com/media/Fa890cyWAAAYU7B.jpg|6690 |
|https://pbs.twimg.com/media/FefsemyXgAAbcef.jpg|6539 |
|https://pbs.twimg.com/media/FaZmMmVakAAtDCz.jpg|5942 |
|https://pbs.twimg.com/media/FeYPCF1XkAE4AfU.jpg|5679 |
+-----------------------------------------------+-----+

elon Top Media

Emotion score

Using elon-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("elon-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in elon Tweets

Exploring Canada Convoy Protest Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc using version twarc2 via the Academic Access v2 Endpoint. It contains 2,094,597 Tweets with the search terms (#WeAreCanadian) OR (#FreedomConvoy) OR (#FreedomConvoy2022) OR (#TrudeauMustGo) OR (#TrudeauMustGoNow) OR (#LiberalHypocrisy) OR (#LiberalsMustGo) OR (#WeTheFringe) OR (#Canadafirst) OR (#ottawaconvoy) OR (#truckerconvoy) OR (#convoy) OR (#canadaproud) OR (#canadianpatriot) OR (#freedomprotest) OR (#nomoremandates) OR (#canadatruckprotest) OR (#operationbearhug) OR (#canadiantruckers) OR (#ramranch) OR (#FluTruxKlan) OR (#ramranch) OR (#ramranchresistance) from January 20, 2022 through February 25, 2022.

Canada Convoy Protest Tweet Volume
Canada Convoy Protest wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "convoy.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+
|lang|  count|
+----+-------+
|  en|1684290|
|  fr| 120700|
| qme|  72078|
| qht|  69874|
|  nl|  30816|
| und|  27479|
|  es|  26820|
|  de|  21885|
|  it|  15090|
|  pt|   3802|
+----+-------+

Top tweeters

Using convoy-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("convoy-user-info"), and pandas:

Canada Convoy Protest Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "convoy.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1494486976963260449|16444            |
|1491875712831311873|11418            |
|1495191350962049025|10945            |
|1490687379501367298|10106            |
|1487857084578287621|7932             |
|1492795894902804481|7584             |
|1487522340032421892|7371             |
|1486395719502241800|7175             |
|1494885897577271299|7073             |
|1487174607098593284|6910             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 16,444

  2. 11,418

  3. 10,945

Top Hashtags

Using convoy-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("convoy-hashtags"), and pandas:

Canada Convoy Protest hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "convoy.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+---------------------------------------------------------------------+-----+
|url                                                                  |count|
+---------------------------------------------------------------------+-----+
|https://twitter.com/joe_warmington/status/1494486976963260449/photo/1|30584|
|https://t.co/cK9CadOPaR                                              |30542|
|http://ConvoyReports.com                                             |8827 |
|https://twitter.com/SlowToWrite/status/1487522340032421892/photo/1   |6852 |
|https://t.co/og4bfU2vFF                                              |6851 |
|https://t.co/VfgKKYKzKr                                              |6140 |
|https://twitter.com/SalmanSima/status/1493057226965192704/video/1    |4947 |
|https://t.co/OEjzle2Rvz                                              |4945 |
|http://liar.com                                                      |3256 |
|https://t.co/nUDpQHycfL                                              |2780 |
+---------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "convoy.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/FL17-k4XMAEVRuy.jpg|15294|
|https://pbs.twimg.com/media/FL17-ycXsBU4UQe.jpg|15294|
|https://pbs.twimg.com/media/FKS9pr-WQAQ5olR.jpg|6854 |
|https://pbs.twimg.com/media/FKhSYaGXsAEOUm7.jpg|2660 |
|https://pbs.twimg.com/media/FLzfEr2WQAgbv9c.jpg|1947 |
|https://pbs.twimg.com/media/FKTr_gRX0AER7ko.jpg|1820 |
|https://pbs.twimg.com/media/FLMl95xUUAAB7NF.jpg|1624 |
|https://pbs.twimg.com/media/FLGmDCgXMAEjuPi.jpg|1473 |
|https://pbs.twimg.com/media/FKLazkpWYAg2OCj.jpg|1343 |
|https://pbs.twimg.com/media/FKNJ4raX0AQ7h5d.jpg|1261 |
+-----------------------------------------------+-----+

A couple years ago I created a juxta (collage) of the images from this dataset. You can check it out here.

Emotion score

Using convoy-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("convoy-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in Canada Convoy Protest Tweets

Exploring Anti-Vaccine/Lockdown Twitter Data / Nick Ruest

Overview

The dataset was collected with Documenting the Now’s twarc using version twarc2 via the Academic Access v2 Endpoint. It contains 1,546,162 Tweets with the search terms (NoMoreLockdowns) OR (NoVaccinePassportsAnywhere) OR (NoVaccineMandates) from January 7, 2011 through April 13, 2023.

Anti-Vaccine/Lockdown Tweet Volume
Anti-Vaccine/Lockdown wordcloud

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "antivax.jsonl"
val df = spark.read.json(tweets)

df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+
|lang|  count|
+----+-------+
|  en|1216637|
| qht| 102553|
| qme|  90370|
|  de|  60201|
|  fr|  11197|
|  es|  10264|
|  pt|   8909|
|  it|   8714|
|  hi|   7481|
| und|   7245|
+----+-------+

Top tweeters

Using antivax-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("antivax-user-info"), and pandas:

Anti-Vaccine/Lockdown Top Tweeters

Retweets

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "antivax.jsonl"
val df = spark.read.json(tweets)

df.mostRetweeted.show(10, false)
+-------------------+-----------------+
|tweet_id           |max_retweet_count|
+-------------------+-----------------+
|1479540884362469376|4931             |
|1468582283317854211|4913             |
|1479623323244023810|4726             |
|1479536586417180675|4569             |
|1483879806626406410|3786             |
|1433342429768396802|3140             |
|1445857921753960455|2864             |
|1459587108251832330|2708             |
|1481049128595513349|2707             |
|1470179658507460612|2634             |
+-------------------+-----------------+

From there, we can append the tweet ID to https://twitter.com/i/status/ to see the tweet. Here are the top three:

  1. 4,931

  2. 4,913

  3. 4,726

Top Hashtags

Using antivax-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("antivax-hashtags"), and pandas:

Anti-Vaccine/Lockdown hashtags

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "antivax.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+---------------------------------------------------------------------------------------------+-----+
|url                                                                                          |count|
+---------------------------------------------------------------------------------------------+-----+
|https://petition.parliament.uk/petitions/599841                                              |1610 |
|https://support.twitter.com/articles/20169199                                                |1588 |
|https://twitter.com/farnsworth_adam/status/1405932322717900800/photo/1                       |1518 |
|https://t.co/EtCSiyasWZ                                                                      |1518 |
|https://t.co/R02kKQul5o                                                                      |1499 |
|https://twitter.com/MaximeBernier/status/1479878675671752706/video/1                         |1414 |
|https://t.co/LemVeHrJhL                                                                      |1412 |
|https://twitter.com/emmakennytv/status/1324028940323819521/video/1                           |1343 |
|https://t.co/lCOppA2t93                                                                      |1342 |
|https://frenchdailynews.com/society/3732-60000-scientists-call-for-an-end-to-mass-vaccination|1110 |
+---------------------------------------------------------------------------------------------+-----+

Top media urls

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "antivax.jsonl"
val df = spark.read.json(tweets)
  
df.mediaUrls
  .filter(col("media_url").isNotNull)
  .groupBy("media_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)  
+-----------------------------------------------+-----+
|media_url                                      |count|
+-----------------------------------------------+-----+
|https://pbs.twimg.com/media/E4Lf_qSX0AgsRKm.jpg|1518 |
|https://pbs.twimg.com/media/FGl85g-XMAMmLPm.jpg|876  |
|https://pbs.twimg.com/media/E_HBY0yXMAYvMob.jpg|831  |
|https://pbs.twimg.com/media/FGVF4YrWYAwk-hq.jpg|813  |
|https://pbs.twimg.com/media/FBYwQXwWYAMrxyG.jpg|764  |
|https://pbs.twimg.com/media/FIn0m1BX0AMpGs9.jpg|736  |
|https://pbs.twimg.com/media/FIhu_JEXwAAjMtf.jpg|700  |
|https://pbs.twimg.com/media/FJLIdBFXMAYCBCe.jpg|696  |
|https://pbs.twimg.com/media/FbOC6z4UcAMYOnC.jpg|632  |
|https://pbs.twimg.com/media/EtzRWBYWgAMyFF-.jpg|613  |
+-----------------------------------------------+-----+

#NoMoreLockdowns #NoVaccinePassportsAnywhere #NoVaccineMandates Top Media

Emotion score

Using antivax-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("antivax-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:

Emotion Distribution in #NoMoreLockdowns #NoVaccinePassportsAnywhere #NoVaccineMandates Tweets

Hitchcock thrills with sound and silence / John Mark Ockerbloom

Alfred Hitchcock started making Blackmail as a silent film, but after getting access to the new movie sound technology, he ended up releasing two versions. The sound version was one of the first “talkies” released in Britain. Some film aficionados prefer the silent version, which often uses different footage.

Blackmail‘s US copyright, once lost here due to formalities requirements, was restored in the 1990s. The film returns to the public domain here in 20 days.

Implementing an AI reference chatbot at the University of Calgary Library / HangingTogether

In 2021, the University of Calgary Libraries launched a multilingual reference chatbot by leveraging a commercial product that combines a large language model (LLM) with retrieval-augmented generation (RAG) technology. The chatbot is trained on the library’s own web content, including LibGuides and operating hours, and is accessed from the library’s website.  

In a Works in Progress webinar hosted by the OCLC Research Library Partnership (RLP) on 20 November 2024, University of Calgary Library staff discussed the creation and implementation of the AI reference chatbot and shared lessons learned. Kim Groome, Information Specialist; Leeanne Morrow, Associate University Librarian, Student Learning and Engagement; and Paul Pival, Research Librarian–Data Analytics presented. 

This blog post provides a summary of the webinar’s key points, but for a deeper dive, you can watch the full recording here.

Project genesis 

Like many research libraries, the University of Calgary Libraries has offered live chat to users since the early 2010s, with information specialists staffing the service daily from 9 a.m. to 5 p.m. The pandemic catalyzed discussions about an AI chatbot, with many factors driving the conversation:  

  • Surging demand for chat services. Chat usage spiked dramatically during the pandemic. While the library typically handled 500–900 live chats per month in 2019, this number skyrocketed to 3,077 in September 2020. 
  • Staffing constraints. The increased volume required additional staff and staff time to keep up with demand. 
  • Limited service hours. Staffed by humans, live chat was available during extended business hours, but this still left students without support during the late evenings or early mornings. 
  • Improved convenience. Even students visiting the library in person utilized the chat reference service. It was convenient and helped them maintain their study space during peak hours.  
  • Automation potential for many questions. An analysis of live chat questions revealed a significant percentage of questions that were well-suited for automated responses. 
  • Alignment with institutional priorities. Implementing an AI chatbot aligned with the university’s commitment to student-centered initiatives and its strategic focus on enhancing student success. 

The library team looked across the Canadian library ecosystem for examples but found limited adoption among other libraries.[i] Instead, the library found that at UCalgary, the Office of the Registrar had already implemented a chatbot named “Rex,” leveraging technology provided by Ivy.ai. By building on this preexisting campus project, the library accelerated its own chatbot initiative, benefiting from shared resources and institutional experience.  

Implementation 

Assessing the usefulness of an AI chatbot 

Initial work included conducting an analysis of past reference chat questions to evaluate the automation potential of an AI chatbot. Kim Groome described exporting approximately 3,000 chat interactions recorded over a one-month period during the pandemic and coding the questions into themes like study/workspace, printing, and borrowing requests. Through this analysis, which took approximately 30 hours, the library determined that 14-24% of reference chat inquiries were directional (e.g., “where is …?”) and could potentially be handled by a chatbot. The coding was performed using Excel; scripting by experienced coders could further expedite the work.
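A keyword-based first pass at that kind of coding is straightforward to script. Here is a toy sketch of the idea (the themes and keywords are invented for illustration, and this is not the library’s actual process):

// Toy theme coder: tag each chat question with the first matching theme.
val themes = Seq(
  "study/workspace" -> Seq("study room", "workspace", "quiet floor"),
  "printing"        -> Seq("print", "printer", "scan"),
  "borrowing"       -> Seq("borrow", "renew", "hold")
)

def codeQuestion(question: String): String = {
  val lower = question.toLowerCase
  themes.collectFirst {
    case (theme, keywords) if keywords.exists(lower.contains) => theme
  }.getOrElse("other")
}

println(codeQuestion("How do I renew a book?")) // prints: borrowing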

Training and testing 

After identifying a core set of common questions that could be effectively addressed by the chatbot, an eight-person library team began training and testing in April 2021, working with the vendor to increase consistency and quality of the chatbot answers. Testing was extended to other library staff members in July 2021. Recognizing the potentially infinite scope of user questions, the team avoided scope creep by initially focusing on a defined set of about fifty questions identified during their analysis.  

Go-live 

T-Rex chatbot avatar. Courtesy of University of Calgary Library.

The library’s chatbot launched on 16 August 2021, branded as “T-Rex” to differentiate it from the preexisting “Rex” chatbot offered by the registrar (Rex is the official mascot of the UCalgary Dinos teams). Today T-Rex is one of six chatbots on the UCalgary campus, each operating on a separate knowledge base and answering questions 24/7. 

Continuous improvement and maturity 

Quality monitoring  

Kim described how the library team continually assesses the quality of chatbot responses using anonymous weekly reports. The team rates the bot’s answers on a 1 to 5 scale, where 5 represents a perfect response.  

Examples of the rating process 

Participants asked many questions we were unable to address during the webinar due to time constraints. Following the webinar, Kim offered these examples to address questions about the rating process: 

  • 5/5 response: If a patron asks, “Do you have databases for nursing?”, the bot provides an accurate answer, earning a perfect score.  
  • 4/5 – 5/5 response: If a patron asks for a specific nursing database, such as, “Where can I find CINAHL?” (even if spelled incorrectly in six different ways), the bot delivers an excellent response.  
  • 2/5 response: But if the patron asks a question like, “I want to find articles on the social effect of the opioid crisis what database would I use,” the bot will struggle. Even though the bot didn’t answer the question, the team would still score it at 2/5 because the specific topic is not on the website—and therefore not in the “bot brain.” Since the chatbot hasn’t been trained on the topic, it cannot answer the question. If the bot interprets the phrase “find articles” and offers a response like “to find articles, please enter the topic in the library search box…”, then the response would be rated at 3/5 or even 4/5.

Developing custom responses 

The team monitored user questions following go-live, identifying questions that would benefit from customized answers. For example: 

  • A frequent user query about access to the Harvard Business Review couldn't initially be answered because the resource was embedded in a search tool outside the RAG's scope. 
  • Misspellings of resource names, such as PsycInfo, were also common.  

If any question was asked more than three times in a week in distinct transactions, the team created a custom response to address it. Over the first year, the team spent about 5 hours each week creating 10-15 custom responses to these questions, incrementally improving the chatbot. For misspellings like the PsycInfo example, the team incorporated common variants such as Psychinfo, pyscinfo, and psychinfo. 
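In spirit, these custom responses act like an alias table. A hypothetical sketch of the pattern (Ivy.ai implements this inside the vendor platform, so every name here is an assumption):

# Hypothetical illustration of the misspelling-alias pattern; the real
# mapping lives inside the Ivy.ai platform, not in library code.
ALIASES = {
    "psychinfo": "PsycInfo",
    "pyscinfo": "PsycInfo",
    "psycinfo": "PsycInfo",
    "cinhal": "CINAHL",
    "cinahl": "CINAHL",
}

def normalize_resource(token: str) -> str:
    """Map a (possibly misspelled) resource name to its canonical form."""
    return ALIASES.get(token.lower(), token)

print(normalize_resource("Psychinfo"))  # -> PsycInfo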

Monitoring the chat is important for identifying and immediately correcting any wrong responses. For example, a patron once asked, “Can I return a book that has already been declared lost,” and the bot responded, “No, you cannot return a library book that was recently declared lost.” This is obviously incorrect, and the response occurred because of information missing from the library website. But the issue is also complex, with at least fifteen different circumstances surrounding a lost item; adding a series of complex scenarios to the website was an imperfect solution. Instead, the team created a rule where any question that includes words like “lost” and “book” receives a customized qualifying statement: “If you need to contact library staff about a lost book, or the lost book charge, please email: <email address>.” Similarly, for questions that mention the term “recall,” the bot will respond, “Recalls are a very special circumstance. Here is an FAQ for more information.” 
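A minimal sketch of this keyword-rule pattern follows; the rule table mirrors the post's “lost”+“book” and “recall” examples, but the matching logic is an assumption, since the vendor tool implements it internally:

# Hypothetical sketch of keyword-triggered qualifying statements.
import re

RULES = [
    ({"lost", "book"},
     "If you need to contact library staff about a lost book, or the "
     "lost book charge, please email: <email address>."),
    ({"recall"},
     "Recalls are a very special circumstance. Here is an FAQ for more information."),
]

def qualifying_statement(question: str) -> str | None:
    """Return a canned statement when all trigger words appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for keywords, statement in RULES:
        if keywords <= words:  # all trigger words present
            return statement
    return None

print(qualifying_statement("Can I return a book that has already been declared lost"))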

Maturity 

Screenshot of the T-Rex chatbot offered by the University of Calgary Library

The chatbot was further improved in February 2023 when the vendor added a new GPT layer, enabling the tool to generate its own responses to complement existing custom responses. Today the chatbot offers fast, consistent, 24/7 support that is accessible to a wide range of users and is WCAG 2.1 AA compliant. It knows over 2 million words (each form of a word counts as a new word; e.g., renew and renewing are two separate words) and has over 1,000 custom responses. T-Rex is very accurate for directional questions, and 50% of all questions receive a rating of at least 4/5.  

T-Rex has exceeded expectations. Before launch, the implementation team estimated that the chatbot could answer 14-24% of reference chat questions; today it answers about 50% of all questions with a rating of at least 4/5, deflecting half of all questions from live reference chat. The impact has been significant: 1.5 FTE of staff time has been redirected to more strategic, higher-level tasks. As a result, the library has reduced staffed desk hours, instead encouraging users to rely on the 24/7 chatbot for immediate assistance. There have been no staff reductions, just higher productivity.  

Now that the chatbot is mature, it takes only about one hour per week to supervise and monitor the chatbot, primarily to confirm that it continues to work as expected. Updates, such as changes to library URLs, are efficiently managed using a simple Excel spreadsheet.  

The implementation of T-Rex was the library’s first AI effort. More recently, the library has collaboratively established the Centre for Artificial Intelligence Ethics, Literacy and Integrity (CAELI). Located within a campus branch library, CAELI supports student success by fostering strong digital and information literacy skills among UCalgary students. 

Lessons learned 

The UCalgary team shared several key insights from the project:  

  • Use library web pages as the system of record. One of the very first lessons learned after go-live was that the chatbot would be unable to answer a question if the library didn’t have a webpage or FAQ that addressed the topic. While it could be tempting to update the chatbot’s responses directly, Kim advised against this approach because it would create duplicate maintenance points. Instead, she urged participants to consider the website as the system of record for chatbot content.  
  • Leverage a team-based approach. Implementing the chatbot with a team-based approach increased resilience and reduced points of failure for the project.  
  • Identify and respond to user expectations. Users preferred answers that connected them directly to the source they were looking for, rather than being directed to a webpage that required further navigation. Over time, the team refined responses to reduce the number of clicks required to reach specific information. 
  • Expect non-library questions. The team discovered that users would ask the chatbot many questions that the library RAG was unable to answer, such as, “When can I register for the spring semester?” In many cases, the bot can direct the user to one of the other relevant chatbots on campus (registrar, admissions, financial aid, career services, etc.) for appropriate answers. This is a significant benefit of an enterprise approach to adopting chatbot technology.  
  • Think creatively about addressing non-library questions. The Calgary library recognized its role in supporting academic integrity, and it analyzed the chatbot data to learn more about the types of academic integrity questions students were asking. The library found that students were asking questions about reference styles, citation managers, plagiarism and detection software, and academic policies. These questions often arose late at night when live support was unavailable. In collaboration with the campus academic integrity coordinator, the library developed custom responses and added relevant campus content to its website, enhancing the chatbot’s ability to support student success.  
  • Anticipate that there will be non-adopters. Some people prefer to interact directly with other humans and are unlikely to adopt chatbot technology. About 12-15% of library chatbot users still ask to speak to a human, even in cases where the chatbot could likely answer their question. Users can click through to “Connect to a Person” directly from T-Rex during regular service hours.  

Library use of AI chatbots 

To understand webinar participants’ own use of and experiences with chatbots, we polled attendees during the presentation. Their responses provide anecdotal insights about library adoption of chatbots.  

While webinar participants were clearly interested in chatbots, they weren’t necessarily strong users. Only about 40% of participants reported using chatbots on a daily or weekly basis; 27% reported never using GenAI chatbots.  

RLP affiliate responses to poll about chatbot usage

Relatedly, nearly 50% of participants reported that they didn’t enjoy interacting with GenAI chatbots, although nearly as many had mixed feelings.  

RLP affiliate responses to poll about enjoyment of chatbot interactions

Implementation of library chatbots  

Polling also suggested why this webinar resonated with RLP participants: few of their libraries had implemented an AI chatbot, but nearly 50% were considering it.

RLP affiliate responses to poll about library adoption of AI reference chatbots

Is your library implementing an AI chatbot? Share a comment below or send me an email. I’m eager to learn more. Special thanks to the UCalgary Library team for generously sharing their experiences and insights so we can all learn from their innovative work. 


[i] Julia Guy et al., “Reference Chatbots in Canadian Academic Libraries,” Information Technology and Libraries 42, no. 4 (December 18, 2023), https://doi.org/10.5860/ital.v42i4.16511.

The post Implementing an AI reference chatbot at the University of Calgary Library appeared first on Hanging Together.

NDSA Welcomes Three New Members in Quarter Four of 2024 / Digital Library Federation

As of December 6, 2024, the NDSA Leadership unanimously voted to welcome its three most recent applicants into the membership.

  • Webrecorder LLC
  • Museum of Glass
  • George Washington’s Mount Vernon

Each new member brings a host of skills and experience to our group. 

Webrecorder focuses on high-fidelity browser-based archiving, creating highly accurate archives of websites that can be replayed later with video, maps, 3D models, and other interactive elements fully intact. They support multiple levels of user access to their tools, including open-source self-hosting, browser extensions, and paid services. Their engagement reaches across the globe, providing web archiving tools to grassroots organizations, data/science labs, journalists, archivists, GLAMs, and national organizations. 

Museum of Glass (MOG) is a premier contemporary art museum dedicated to glass and glassmaking in the West Coast’s largest and most active museum glass studio. MOG has an extensive collection of digital content dating back to its founding in 2002, including documentation of the over 2,000 pieces of art in the permanent collection and born-digital materials encompassing images, audio, and video recordings created over the past twenty years. These born-digital materials are critical documentation of the over 750 visiting artists in the MOG’s Hot Shop, related Hot Shop programming, and the exhibition and programming history of the Museum. A key part of this collection is an extensive amount of streaming video from the Hot Shop, where they broadcast live five days a week. In recent years, MOG has committed significant resources to establish its digital preservation program, including the hiring of the organization’s first digital archivist and archival assistant.

George Washington’s Mount Vernon is committed to building a robust digital preservation program and has taken significant steps in this direction over the past few years. They engaged a consulting firm to conduct an institution-wide digital archives assessment and hired a full-time digital records archivist. This role has been instrumental in developing their digital preservation policy and workflows, which ensure the preservation of born-digital materials, including institutional records, audiovisual materials, digital humanities projects, and web archiving. In the near future, they plan to expand their efforts to include 3D models and digital reproductions. 

Each organization participates in one or more of the various interest and working groups – so keep an eye out for them on your calls and be sure to give them a shout-out. Please join me in welcoming our new members. You can review our list of members here.


The post NDSA Welcomes Three New Members in Quarter Four of 2024 appeared first on DLF.

Reflection: The first half of my seventh year at GitLab and digging into Strategy & Operations / Cynthia Ng

Hard to believe it’s already just over a year in the “new” role and 6.5 years at GitLab. If you’re interested in an explanation of my current role/team and what I was working on in the first 6 or so months, check out my previous reflection post. Handbook project I talked about managing the GitLab … Continue reading "Reflection: The first half of my seventh year at GitLab and digging into Strategy & Operations"

Exploring #healthcanada #NACI #fordnation #medicalfreedom #covid19 #covid19vaccines #protectourfamilies #protectyourchildren #holdtheline Twitter Data / Nick Ruest

Introduction

I have no kind words about what Twitter has become. I abandoned the platform around 2018, keeping my account only to harvest data. Once the electioneering oligarch cut off academic access, there was no reason for me to stick around. During the brief period I had academic access, I collected as much data as possible each month, planning ahead for future research projects and collaborations.

Currently, I’m working with the DigFemNet team, and I believe we have use cases for a few of these datasets. I plan to write all, or at least most, of them up, similar to what I’ve done in the past. Unfortunately, hydration is no longer possible, so following along isn’t really an option any more. ☹️

To start this “series”, I’m revisiting a dataset I generated a few years ago as a side project for members of the DigFemNet team during their Archives Unleashed Cohort work. This dataset was generated from tweets with the following hashtags:

  • #healthcanada
  • #NACI
  • #fordnation
  • #medicalfreedom
  • #covid19
  • #covid19vaccines
  • #protectourfamilies
  • #protectyourchildren
  • #holdtheline

The dataset of Tweet IDs is available here, though I’m not sure how useful it is anymore. ☹️

On another note, I took some time to completely refactor twut. A new 1.0.0 release is now available!

Overview

The dataset was collected with Documenting the Now’s twarc. It contains 2,075,645 Tweet IDs.

Tweets were collected via the Standard Search API on:

  • November 18, 2021
  • November 21, 2021
  • November 26, 2021
  • December 1, 2021
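
For anyone reconstructing a similar collection, a minimal sketch with twarc's (v1) Python API, which wraps the Standard Search API, might look like the following; the credentials are placeholders and the query shown covers only part of the hashtag list:

# Hypothetical collection sketch with twarc v1; credentials and the
# exact query are assumptions, not the commands used for this dataset.
import json
from twarc import Twarc

t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

query = "#healthcanada OR #NACI OR #fordnation OR #medicalfreedom"
with open("tweets.jsonl", "w") as out:
    for tweet in t.search(query):
        out.write(json.dumps(tweet) + "\n")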
Tweet volume chart for the collected hashtags
Wordcloud of tweet text

Top languages

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

// Load the line-oriented JSON dataset into a Spark DataFrame.
val tweets = "healthcanada-naci-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline.jsonl"
val df = spark.read.json(tweets)

// Tally tweets by language code and show the ten most common.
df.language
  .groupBy("lang")
  .count()
  .orderBy(col("count").desc)
  .show(10)
+----+-------+
|lang|  count|
+----+-------+
|  en|1310156|
|  es| 491763|
|  fr| 305404|
|  de| 154572|
|  it|  68379|
| und|  53911|
|  th|  46685|
|  ja|  28955|
|  hi|  27453|
|  nl|  27152|
+----+-------+

Top tweeters

Using 105683SP3QFISO4-user-info.csv from df.userInfo.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("105683SP3QFISO4-user-info"), and pandas:
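
The pandas step itself is short; a hypothetical sketch (each row of the user-info CSV corresponds to one tweet, and the screen_name column is an assumption about twut's CSV schema):

# Hypothetical sketch: count tweets per account to find top tweeters.
import pandas as pd

users = pd.read_csv("105683SP3QFISO4-user-info.csv")
top_tweeters = users["screen_name"].value_counts().head(10)
print(top_tweeters)

The same value_counts pattern, applied to the hashtags CSV, produces the top-hashtags chart further below.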

Top tweeters chart

Retweets

Using retweets.py from twarc utilities, we can find the most retweeted tweets:

$ python twarc/utils/retweets.py healthcanada-naci-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline.jsonl | head
1238545438476730369,190293
1241447017945223169,114538
1279155358666305541,104996
1241952864374648832,86232
1276158510955401216,70520
1241057919128526856,70412
1241696164782669824,68030
1242302400762908685,66500
1317690911174905857,59605
1238567438851112960,59007

From there, we can append each tweet ID to https://twitter.com/i/status/ to view the tweet. Here are the top three:

  1. https://twitter.com/i/status/1238545438476730369 (190,293 retweets)

  2. https://twitter.com/i/status/1241447017945223169 (114,538 retweets)

  3. https://twitter.com/i/status/1279155358666305541 (104,996 retweets)

Top hashtags

Using 105683SP3QFISO4-hashtags.csv from df.hashtags.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("105683SP3QFISO4-hashtags"), and pandas:

Top hashtags chart

Top URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "healthcanada-naci-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline.jsonl"
val df = spark.read.json(tweets)

df.urls
  .groupBy("url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+-------------------------------------------------------------------------------+-----+
|url                                                                            |count|
+-------------------------------------------------------------------------------+-----+
|https://newsdigest.jp/pages/coronavirus/                                       |7819 |
|https://t.co/oEslL3ucOx                                                        |7639 |
|https://hmts.jp/corona                                                         |2827 |
|https://t.co/3cXQDN9NTJ                                                        |2769 |
|https://t.co/WVGVXAC0hg                                                        |1504 |
|https://thestandard.co/omicron/                                                |1504 |
|https://bnonews.com/index.php/2021/11/austria-lockdown-for-unvaccinated-people/|1421 |
|https://t.co/z405ov6SiR                                                        |1412 |
|https://t.co/TXlwtn39QO                                                        |828  |
|https://thestandard.co/covid-19-clusters-in-schools/                           |828  |
+-------------------------------------------------------------------------------+-----+

Top media URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "healthcanada-naci-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline.jsonl"
val df = spark.read.json(tweets)

df.mediaUrls
  .groupBy("image_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+----------------------------------------------------------------------------------------------------------+-----+
|image_url                                                                                                 |count|
+----------------------------------------------------------------------------------------------------------+-----+
|https://pbs.twimg.com/media/FFSgQoRWYAYlqfq.jpg                                                           |1303 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/vid/640x360/9k6Led-IswZX5O88.mp4?tag=12       |1297 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/pl/yjBhxgKQLcsKwPlU.m3u8?tag=12&container=fmp4|1297 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/vid/480x270/cMgq-mtXgWWJ0WZX.mp4?tag=12       |1297 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/vid/1280x720/gDpuQIEW0DSjQhpn.mp4?tag=12      |1297 |
|https://pbs.twimg.com/media/FEmnJrlXMAcvcrA.jpg                                                           |1135 |
|https://pbs.twimg.com/media/FFT5lKNXoAgMLEN.jpg                                                           |1118 |
|https://pbs.twimg.com/media/FEzUWZ0WUAoh4WN.jpg                                                           |925  |
|https://pbs.twimg.com/media/FEwDpM1XIAInE17.jpg                                                           |735  |
|https://pbs.twimg.com/media/FD5nGmaXoAE7ijl.jpg                                                           |704  |
+----------------------------------------------------------------------------------------------------------+-----+

Top media chart

Top video URLs

Using the full line-oriented JSON dataset and twut:

import io.archivesunleashed._

val tweets = "healthcanada-naci-fordnation-medicalfreedom-covid19-covid19vaccines-protectourfamilies-protectyourchildren-holdtheline.jsonl"
val df = spark.read.json(tweets)

df.videoUrls
  .groupBy("video_url")
  .count()
  .orderBy(col("count").desc)
  .show(10, false)
+----------------------------------------------------------------------------------------------------------+-----+
|video_url                                                                                                 |count|
+----------------------------------------------------------------------------------------------------------+-----+
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/vid/1280x720/gDpuQIEW0DSjQhpn.mp4?tag=12      |1297 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/vid/480x270/cMgq-mtXgWWJ0WZX.mp4?tag=12       |1297 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/pl/yjBhxgKQLcsKwPlU.m3u8?tag=12&container=fmp4|1297 |
|https://video.twimg.com/ext_tw_video/1463053406289936386/pu/vid/640x360/9k6Led-IswZX5O88.mp4?tag=12       |1297 |
|https://video.twimg.com/ext_tw_video/1462775639686189062/pu/vid/640x368/Jp7EF5jDU5l9OjcB.mp4?tag=12       |685  |
|https://video.twimg.com/ext_tw_video/1462775639686189062/pu/pl/9_jVFXJh8Ctp81tJ.m3u8?tag=12&container=fmp4|685  |
|https://video.twimg.com/ext_tw_video/1462775639686189062/pu/vid/468x270/U0sgvHnEHwJdHeVX.mp4?tag=12       |685  |
|https://video.twimg.com/ext_tw_video/1459384608282140674/pu/vid/512x640/dWtJCTKs7kKe9uSj.mp4?tag=12       |654  |
|https://video.twimg.com/ext_tw_video/1459384608282140674/pu/vid/320x400/xwplj9x8nDIgp12H.mp4?tag=12       |654  |
|https://video.twimg.com/ext_tw_video/1459384608282140674/pu/pl/7eh5unrY2hDiDHWD.m3u8?tag=12&container=fmp4|654  |
+----------------------------------------------------------------------------------------------------------+-----+

Top video chart

Emotion score

Using 105683SP3QFISO4-text.csv from df.text.coalesce(1).write.format("csv").option("header", "true").option("escape", "\"").option("encoding", "utf-8").save("105683SP3QFISO4-text"), Polars, and j-hartmann/emotion-english-distilroberta-base:
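
A minimal sketch of this step, assuming the twut export has a "text" column and scoring only a sample of tweets; the model name is as given in the post, everything else is illustrative:

# Hypothetical emotion-scoring sketch with Polars and transformers.
import polars as pl
from transformers import pipeline

texts = pl.read_csv("105683SP3QFISO4-text.csv")["text"].to_list()

classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",
)

# Label a sample of tweets with their dominant emotion, then tally.
results = classifier(texts[:1000], truncation=True)
emotions = pl.Series("emotion", [r["label"] for r in results])
print(emotions.value_counts())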

Emotion distribution chart

Library Innovation Lab Announces the Launch of Institutional Data Initiative to Expand Access to Knowledge / Harvard Library Innovation Lab

Library Innovation Lab and Institutional Data Initiative logos

Cambridge, MA – Harvard Law School’s Library Innovation Lab (LIL) is proud to announce the official launch of the Institutional Data Initiative (IDI), a groundbreaking program helping libraries, government agencies, and other knowledge institutions share digital collections with their patrons while improving the accuracy and reliability of AI tools for all.

First developed by Greg Leppert at the Library Innovation Lab, and now led by Leppert as Executive Director, IDI seeks to redefine the creation and stewardship of the knowledge and datasets that define AI research.

Following in the footsteps of LIL’s Caselaw Access Project, the foundational dataset of IDI will be nearly one million public domain books, created thanks to the wide-ranging resources and expertise of the Harvard Library system. By prioritizing the assembly and release of open access public domain materials, as well as using principles developed at LIL for approaching large scale data with a library lens, IDI bridges the gap between model makers and knowledge institutions.

IDI is the largest project to come out of LIL’s Democratizing Open Knowledge program, made possible by support from the Filecoin Foundation for the Decentralized Web.

LIL Director Jack Cushman commented, “One of the goals of our lab is to incubate bold ideas and give them the resources to grow. IDI is a wonderful example of that. Our mission to bring library principles to technological frontiers is embedded in these efforts, and we are thrilled to see industry leaders and cultural heritage organizations support IDI’s work and promise. We look forward to supporting and collaborating with IDI, working to diversify and expand access to cultural heritage materials that help everyone.”

Jonathan Zittrain, IDI and LIL faculty director, said, “Libraries and other stewards of humanity’s aggregated knowledge can think in terms of centuries — preserving it and providing access both for well-known uses and for aims completely unanticipated in ancient times or even recently. IDI’s aim is to address newly-energized interest in otherwise-obscure and sometimes-forgotten texts in ways that keep knowledge institutions’, and society’s, values front and center. That means working towards access for all for public domain works that have remained fenced — access both for the human eye and for imaginative machine processing.”


The Library Innovation Lab at Harvard Law School brings library principles to technological frontiers. They explore the future of libraries and the creation, dissemination, and preservation of knowledge. Through innovative projects, platforms, and partnerships, the Lab aims to advance access to information and foster collaboration across disciplines.

Contact: Jack Cushman
Director, Harvard Library Innovation Lab
lil@law.harvard.edu

The stealthy public domain arrival of a long-lived detective / John Mark Ockerbloom

Margery Allingham’s detective Albert Campion may be easy to miss at first sight. He’s not as familiar to American readers as sleuths like Lord Peter Wimsey (whom he resembles in some respects). In The Crime at Black Dudley (1929), he’s one of several mysterious secondary characters, and not the one who solves the murder.

Campion takes over the lead in later books, including 18 more by Allingham. Mike Ripley continues his adventures today. In 21 days, maybe you can too.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 10 December 2024 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by a team of OCLC contributors.

Designing accessible learning interfaces for neurodiverse users

Close up of a mobile device with the definition for the word design. Photo by Edho Pratama on Unsplash

In light of the rule change under Title II of the Americans with Disabilities Act (ADA), which now requires all web content and mobile applications provided by state and local governments, including public higher education institutions, to be accessible to people with disabilities, and following Disability Awareness Month, many new and important resources are available to libraries, including the Neurodiversity Design System. These design principles are specifically tailored to learning management systems, though they hold a bevy of gems for online information sharing and design choices. Topics covered in the system may already be known to designers as good practice, though many do not realize that designing for universal understanding is always the best design option. Real-world examples are provided for how to format numbers (font, size, and spacing), buttons, links and input fields, and animations. The site also provides helpful definitions, legal requirements, and user personas to connect the design choices with the user experience.

What I liked most about this toolkit is its user experience! The website is well-organized, with a lot of information that is easy to navigate. Structure in information organization is a key wayfinding tool that allows anyone, familiar or unfamiliar with a website, to go right to their intended endpoint. What's most valuable about this toolkit for non-designers is the “under the hood” look at users. When you consider the types of disabilities that can affect a person's experience navigating information online, you become immediately aware that you know people who have trouble navigating online. People in your life experience challenges, and most of them go unnoticed or unvoiced. When we design for all types of people, these challenges fall away, making navigation successful and saving tons of stress and time. Contributed by Lesley A. Langa.

Reparative language in finding aids at Haverford College

A recent post in Descriptive Notes, the blog of the Society of American Archivists Description Section, details an effort at Haverford College (OCLC Symbol: HVC) to review legacy finding aids for racist or harmful language and to revise that language accordingly. With over 1,400 collections, this was a daunting undertaking, but it was backed by both the library's strategic plan and campus funding for DEI efforts. The blog post, “Undertaking a Reparative Language Project at Haverford College,” details the many steps in the project to review and remediate descriptions.

This article highlights both the scale of the problem of legacy harmful descriptive practices and the solutions, which have involved tapping into available funding, bringing in contractors (either early career professionals or MLIS students), and later involving Haverford undergraduate student workers. The project drew on knowledge gleaned from previous efforts, such as the Archives for Black Lives in Philadelphia project and work done by Duke and Yale. As of now, over 1,500 finding aids have been reviewed at least once for harmful language. Future collections will benefit from revisions to processing guidelines, and the project has generated other tools, such as a spreadsheet recording the names that groups have identified for themselves. Ultimately, this project reflects some truths about reparative descriptive work: it benefits from focus driven by local priorities and takes an iterative approach. Readers will find inspiring projects in a series of posts in Descriptive Notes that focus on inclusive description. Contributed by Merrilee Proffitt.

National Day of Racial Healing, 21 January 2025

Since it was conceived in 2017 by the W.K. Kellogg Foundation (WKKF), the National Day of Racial Healing (NDORH) has been “a time to contemplate our shared values and create the blueprint together for #HowWeHeal from the effects of racism.” Observed annually in the United States on the Tuesday after Martin Luther King, Jr. Day, NDORH “is an opportunity to bring ALL people together and inspire collective action to build common ground for a more just and equitable world.” Especially in these days of divisiveness across so many human fault lines, OCLC WebJunction’s “2025 National Day of Racial Healing” article is both timely and valuable, providing ideas and inspirations for events, activities, and actions that libraries and their communities might plan for 21 January 2025. Included are details of events and programs organized in recent years by half a dozen public libraries around the United States, the “Read to Dream” list compiled by King’s family and others, and links to numerous other resources.

The work toward healing across all dividing lines must go on constantly, of course, not only on a single day. WKKF defines racial healing as: “… the experience shared by people when they speak openly and hear the truth about past wrongs and the negative impacts created by individual and systemic racism. Racial healing helps to build trust among people and restores communities to wholeness, so they can work together on changing current systems and structures so that they affirm the inherent value of all people.” WKKF has been cooperating with libraries and other educational institutions on racial equity and related issues as part of its Truth, Racial Healing & Transformation efforts. The American Association of Colleges and Universities and the American Library Association are prominently among its partners. Contributed by Jay Weitz.

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 10 December 2024 appeared first on Hanging Together.

Joining both prize-winning and banned-books online collections shortly / John Mark Ockerbloom

Laughing Boy, by white anthropologist Oliver La Farge, is about a troubled relationship between two Navajos, one raised traditionally and one sent to a white-run boarding school. The first Pulitzer Prize-winning novel about native Americans, it was later banned by a Long Island school district, in a case that students took all the way to a divided Supreme Court in 1982. Laughing Boy will be free to read without censorship or copyright restrictions in 22 days.