Planet Code4Lib

Indigenous teaching and learning librarian position at York University Libraries / William Denton

A new job posting just went up at York University Libraries for an Indigenous teaching and learning librarian (PDF), who will be in the Student Learning and Academic Success department. Applications are due by 01 February 2022.

York University Libraries (YUL) seeks a dynamic and innovative individual to collaborate on the advancements of York University Libraries’ portfolio in support of the teaching and learning community across campus and beyond with a focus on information literacy in both in-person and online teaching environments. The successful candidate will focus on incorporating Indigenous ways of knowing and knowledge systems and Indigenous pedagogy into information literacy practices, instruction, and initiatives. This continuing appointment position is open to those with some critical understanding of ACRL’s Framework for Information Literacy for Higher Education.

My Interviewing at York University Libraries page is a little out of date (setting aside the pandemic and the fact that we’ve been doing interviews online), but it still gives a good picture of York and the whole search process. If anyone has questions about the interview or about working at York, I’m glad to help or to recommend a colleague.

Hidden Certificate Authorities / David Rosenthal

The security of encrypted Web traffic depends upon a set of Certificate Authorities (CAs). Browsers and operating systems are configured with a list of CAs that they trust. The system is brittle, in the sense that if any of the multitude of CAs that your browser trusts is incompetent or malign, the security of all your traffic is imperiled. I've written several times on the topic of misbehaving CAs; there is a list of links at the end of the post.

In Web trust dies in darkness: Hidden Certificate Authorities undermine public crypto infrastructure, Thomas Claburn reports on an important paper, Rusted Anchors: A National Client-Side View of Hidden Root CAs in the Web PKI Ecosystem by Yiming Zhang et al. This paper looks at what happens when, by fair means or foul, unofficial entries are added to or replace the CAs in the official list that your browser trusts. Below the fold I discuss their findings.

The paper's abstract reads:
Web clients rely on public root stores maintained by operating systems or browsers, with hundreds of audited CAs as trust anchors. However, as reported by security incidents, hidden root CAs beyond the public root programs have been imported into local root stores, which allows adversaries to gain trust from web clients.

In this paper, we provide the first client-side, nation-wide view of hidden root CAs in the Web PKI ecosystem. Through cooperation with a leading browser vendor, we analyze certificate chains in web visits, together with their verification statuses, from volunteer users in 5 months. In total, over 1.17 million hidden root certificates are captured and they cause a profound impact from the angle of web clients and traffic. Further, we identify around 5 thousand organizations that hold hidden root certificates, including fake root CAs that impersonate large trusted ones. Finally, we highlight that the implementation of hidden root CAs and certificates is highly flawed, and issues such as weak keys and signature algorithms are prevalent.
Why are hidden root CAs a problem?
However, recent security incidents and studies show that the management of local root stores can be the "Achilles’ heel" of Web PKI security. By injecting self-built root certificates into local root stores, local software such as anti-virus and parent-control applications creates a man-in-the-middle proxy to filter SSL/TLS-encrypted traffic. This approach can also be used by government agencies or malware, in order to monitor web users’ online behaviors. For instance, reports in 2019 show that citizens in Kazakhstan were forced to import government-built CAs on their devices.
The authors instrumented a widely-used browser, tracking the certificates used to secure traffic with Web sites, and verifying them thus:
we use the Certificate Transparency (CT) database to verify if they are truly beyond public root stores. To monitor the certificate issuance procedure, since June 2016 all CAs in public root stores are required to submit all certificates that they sign to CT for future queries. As a result, hidden root certificates should not appear in CT databases.
I discuss the details of the Certificate Transparency system (RFC 6962) in this post. Briefly, an important property of the CA system is that all CAs are equally trusted, and can issue certificates on behalf of any Web site. But Web sites authorize only a single CA to issue their certificates. A rogue CA can allow impersonation of any Web site by issuing unauthorized certificates. The CT system consists of a set of public logs containing signed attestations that the certificate in question was issued by the CA that the Web site authorized to do so. Wikipedia notes that:
As of 2021, Certificate Transparency is mandatory for all publicly trusted TLS certificates
So the certificate of a hidden root CA is one that appears in the local list of trusted CAs but isn't in the CT databases. Because these hidden root CAs are in the list, they are trusted equally with the official CAs, and can sign certificates for any Web site they choose, thus opening the user to impersonation and interception of their traffic.
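In outline, the authors' check reduces to a set difference between the client's local trust store and the publicly logged roots. A minimal sketch (toy byte strings stand in for real DER certificates; a real check would walk full chains and query the CT logs):

```python
import hashlib

def sha256_fingerprint(cert_der):
    """Fingerprint a DER-encoded certificate, as root stores and CT logs do."""
    return hashlib.sha256(cert_der).hexdigest()

def find_hidden_roots(local_store, ct_known_roots):
    """A root trusted locally but absent from the public CT record is hidden."""
    return local_store - ct_known_roots

# Toy byte strings stand in for real DER certificates.
official = sha256_fingerprint(b"audited-public-root")
injected = sha256_fingerprint(b"self-built-proxy-root")

assert find_hidden_roots({official, injected}, {official}) == {injected}
```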

The authors make four major observations. First, hidden root certificates are a pervasive problem:
We identify 1.17 million hidden root certificates that have been imported into local root stores of web clients. Based on their subject information, we identify 5,005 certificate groups, and certificates in each group come from the same organization. The impact of hidden root CAs can be profound, as they are witnessed in 0.54% of all web connections, affecting 5.07 million web clients and putting security connections at risk of interception.
Second, although there can be authorized uses for hidden root certificates, many are clearly malign:
Besides self-built root CAs of enterprises and local software, we also uncover a large number of fake root CAs that impersonate trusted ones to evade detection. For example, they use homoglyphs to replace characters in authentic names (e.g., Verislgn with an “l” and NetWork with an upper-case “W”). While not discovered by previous works at scale, we show that fake root CAs are highly trusted by web clients and pose security threats to up to 2.7 million web clients.
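Spotting such look-alike names can be approximated by collapsing common homoglyph substitutions before comparing against known CA names. A toy sketch (the substitution table and trusted list here are illustrative, not taken from the paper):

```python
# Toy homoglyph table and trusted-name list -- illustrative, not from the paper.
HOMOGLYPHS = str.maketrans({"1": "l", "i": "l", "0": "o", "5": "s"})
TRUSTED_NAMES = {"verisign", "digicert", "network solutions"}

def skeleton(name):
    """Case-fold and collapse look-alike characters, so 'Verislgn' ~ 'VeriSign'."""
    return name.lower().translate(HOMOGLYPHS)

def impersonates(name, trusted=TRUSTED_NAMES):
    """Return the trusted name this one imitates, or None if it looks genuine."""
    for t in trusted:
        if skeleton(name) == skeleton(t) and name.lower() != t:
            return t
    return None

assert impersonates("Verislgn") == "verisign"   # the paper's example fake
assert impersonates("DigiCert") is None          # exact match is not a fake
```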
Third, in addition to the fundamental insecurity imposed by hidden root certificates, the overwhelming majority of them are so carelessly constructed that they insert additional vulnerabilities:
As for operational flaws, we find that the security status of hidden root CAs and certificates is worrisome: public key sharing, abuse of wildcards and long validity periods are prevalent. More than 87.3% of hidden root certificates and 99.9% of leaf certificates that they sign violate at least one X.509 standard requirement. In particular, 97% of leaf certificates issued by hidden CAs use weak keys, increasing their chances of being compromised.
Fourth, they find that the distribution of hidden root certificates is heavily skewed. In particular, malign roots are targeted:
Although on average 0.54% of daily web traffic is covered by hidden roots, the proportion per individual client varies widely. For more than 95% of clients, the percentage is less than 0.01%, while 0.28% of clients have more than 90% of their web visits impacted. To figure out why certain clients were impacted so heavily, we sampled 104 cases with more than 500 web visiting records and an affected rate of over 99% for further analysis. One may attribute this high percentage to interceptions from Local Software like proxies and packet filters, but we find this situation appeared on only 10.58% of clients (11 of 104). On the contrary, hidden roots from Fake Authentication (64 clients, 61.54% of 104) lead the pack.
Victims of malign roots are actively monitored:
By examining the traffic timestamps of those clients, we also find that, hidden roots from Fake Authentication would be constantly updated on the client-side, possibly to avoid detection. Specifically, 3 of the 104 cases had successively installed more than 20 hidden root certificates from the same issuer, and the average lifetime (the period they appeared in traffic logs, rather than the validity period) of each root did not exceed 1 day.
What can clients do to detect hidden root certificates? Clearly, if browsers require CT for the certificates they trust, then even if a hidden root CA is somehow added to their trust list, it will not be able to falsely indicate that a site is trusted or intercept its traffic. Doing so would, however, prevent some authorized applications, such as corporate employee monitoring and parental controls, from working without explicit action by the user to allow them.

The authors detect characteristics of hidden root certificates and the certificates they sign. For example:
As for public keys, 0.59% of hidden root certificates (20.73% of self-built CAs) use weak keys, while up to 97% of leaf certificates signed by them use weak public keys. By comparison, among leaf certificates signed by publicly trusted CAs, a study in 2013 found that nearly 90% already had a key strength of 2048-bit RSA or above.
Making browsers warn users when they detect suspicious certificates of these kinds would be beneficial, as would encouraging authorized interception applications not to introduce significant new vulnerabilities through careless implementation.
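A browser-side warning of the sort suggested here could start with a simple key-strength check. A minimal sketch, with thresholds taken from common CA/Browser Forum-style guidance (RSA below 2048 bits, EC below 256 bits) rather than from the paper:

```python
# Minimum acceptable key sizes, per common CA/Browser Forum-style guidance.
# Algorithms not listed are simply not flagged by this sketch.
MIN_KEY_BITS = {"RSA": 2048, "EC": 256}

def is_weak_key(algorithm, bits):
    """Flag a public key as weak if it falls below the threshold for its type."""
    return bits < MIN_KEY_BITS.get(algorithm, 0)

assert is_weak_key("RSA", 1024)       # common among hidden-CA leaf certificates
assert not is_weak_key("RSA", 2048)
assert not is_weak_key("EC", 256)
```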

Previous Posts

Welcome to OCLC, Kathryn Stine! / HangingTogether

We are very excited to introduce new team member, Kathryn Stine, Senior Program & Engagement Officer, Data Science & Next-generation Metadata. Kathryn comes to OCLC from the California Digital Library where she worked as Senior Product Manager for Digitization & Digital Content, managing the team that supports and coordinates the University of California Libraries’ engagement with HathiTrust and mass digitization initiatives. During the pandemic, Kathryn led a systemwide group of UC Libraries liaisons who were making use of the HathiTrust Emergency Temporary Access Service (ETAS) and ultimately conducted an assessment of how the service was used by UC faculty and students. ETAS ensured that UC’s students, faculty and staff had access to millions of digitized print items stewarded by HathiTrust while library buildings and physical collection access were disrupted by the pandemic.

Kathryn also played a key analysis and management role with the Zephir metadata management system, a core infrastructure component of HathiTrust developed, implemented, and maintained by the California Digital Library. Zephir processes, stores, and exports contributor metadata to other HathiTrust systems and processes in support of digital content ingest, access, and discovery. In addition to this work as part of CDL’s Discovery and Delivery team, Kathryn also contributed to work in the Collections as well as Publishing, Archives, and Digitization program areas.  

Kathryn holds a Master of Science in Information, Archives and Records Management as well as a Master of Fine Arts from the University of Michigan. In addition to her substantial experience at the California Digital Library, Kathryn has held positions at UC Berkeley and the University of Illinois, Chicago, through which she developed expertise in managing collections and the metadata that describe them in a range of contexts, including work with oral history media, visual resources, and archives and special collections.  

Kathryn says, “Metadata work involves creative and impactful problems to address – metadata is political, dynamic, and essential for knowledge work.” We couldn’t agree more and are happy to have Kathryn on the OCLC Research Library Partnership team, where she will engage with our Metadata Managers Focus Group. She will also work closely with team members in OCLC Technical Research and within our product teams.  

At the end of last November, we were happy / sad to say Sayōnara to long-time team member, Karen Smith-Yoshimura and now it is a real joy to welcome Kathryn, in whatever language works.  

The post Welcome to OCLC, Kathryn Stine! appeared first on Hanging Together.

Islandora Community Sprint / Islandora


Rosie LeFaive (UPEI), Nat Kanthan (UTSC), and Kirsta Stapelfeldt (UTSC) are helping to coordinate a community-sprint for Islandora. The sprint is running November 29th-December 15th. Details for how the sprint is organized and how to join and follow along are available in the global planning document. 

Visualizing Citation Networks / Ed Summers

Back in 2018 I wrote a small Python program called étudier which scraped citation data out of Google Scholar and presented the network as a GEXF file for use in Gephi. It also wrote the network data out as an HTML file that included a very rudimentary D3 visualization.

At the time I left a note in the README saying that I was looking for ideas on how to improve the D3 visualization. A year later I actually got a pull request from Thomas Anderson who had used another D3 visualization to modify the one that was hard coded into étudier.

Thomas’ pull request had some good ideas, like zooming and text labels, which were missing from the first version. But it also seemed to be hacked together quickly to serve a particular need, so it took me some time to get around to disentangling a few things, pruning unused code, adding some additional features, and merging it in. A few too many years later I got around to finishing it and the new D3 visualization is now out in v0.1.0.

Once you pip install --upgrade etudier you can collect a citation network either from Google Scholar search results, or by examining the citations to a particular publication. For example here’s how I collected the network for Sherry Ortner’s Theory in Anthropology since the Sixties:

$ etudier ',21&hl=en'

I’ve embedded the resulting output.html here using an <iframe>.

A few things to note about the visualization:

  • The size of each node is relative to the total number of citations to the publication (not just the ones that are visible in the graph).
  • The color of each node indicates which cluster it belongs to (generated with networkx).
  • The directed arrow indicates which publication is citing which.
  • Hovering over a node reveals its title and the titles of other nodes it is immediately connected to.
  • Click and hold on a node to foreground just connected nodes.
  • Double-click on a node to open a tab with the publication in it.
  • Drag nodes around to make the network easier to read.
  • Zoom in on the image to examine particular parts of the network.
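The clusters that drive the node colours are computed with networkx. As a dependency-free illustration of the simplest possible grouping, here is a union-find pass over citation edges that collects weakly connected components (networkx's modularity-based community detection is finer-grained than this; the publication names are made up):

```python
from collections import defaultdict

def weakly_connected_components(edges):
    """Group publications joined by any chain of citation links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for citing, cited in edges:
        parent[find(citing)] = find(cited)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return list(groups.values())

# Hypothetical citation edges: citing -> cited.
edges = [("Sewell 1992", "Ortner 1984"), ("Ortner 1984", "Geertz 1973"),
         ("Doe 2001", "Roe 1999")]
assert len(weakly_connected_components(edges)) == 2
```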

This is actually a pretty simple graph, since I ran it with étudier’s defaults, which are to collect just one level of citations and just one page of results at each level. This means étudier will look at the 10 citations on the page you supply, and will click into each publication using the cited by link to find the first 10 citations that cite it. Effectively the defaults will pull in up to 100 citations.
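That fan-out arithmetic can be sketched as a quick upper bound. This assumes --pages multiplies the results fetched at every level, which is my reading of the behaviour described above; check the étudier documentation for the exact semantics:

```python
def max_citations(depth=1, pages=1, per_page=10):
    """Back-of-the-envelope upper bound on the citation links collected:
    each level fans out by (pages * per_page)."""
    fan_out = pages * per_page
    return sum(fan_out ** (level + 1) for level in range(1, depth + 1))

assert max_citations() == 100          # the defaults: 10 seeds x 10 citers each
assert max_citations(depth=2) == 1100  # why --depth grows the crawl so quickly
```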

It’s worth reflecting on how important Google’s relevance ranking is in shaping the visualization since only some of the citations can be crawled, and the ones that are ranked higher have a higher chance of getting picked up.

You can get more using the --depth and --pages command line options. Just be careful especially with --depth since it can exponentially increase the number of results. Here’s what the same Ortner graph looks like with --depth 2:

You probably will need to zoom out to see the whole thing. Obviously this is pushing the limits of what you can do with D3 without customizing things more. I think the GEXF and GraphML files will probably be most useful if you’ve collected a really large network and want to control how it looks and prune things.

Tucked away in the HTML file is a JSON representation of the network, which could be repurposed for other things I guess. If you get a chance to use étudier please let me know. It might be fun to create a little gallery in the repository. It also could be useful to create a tool that loads the citation data into Zotero, or adds it to WikiCite. But that’s for another day.

Web Archives on, of, and off, the Web / Ed Summers

Last month Webrecorder announced a new effort to improve browser support for web archives by initiating three new streams of work: standardization, design research and browser integration. They are soliciting use cases for the Web Archive Collection Zipped (WACZ) format, which could be of interest if you use, create or publish web archives…or develop tools to support those activities.

Webrecorder’s next community call will include a discussion of these use cases as well as upcoming design research that is being run by New Design Congress. NDC specialize in thinking critically about design, especially with regard to how technical systems encode power, and how designs can be weaponized. I think this conversation could potentially be of interest to people who are working adjacently to the web archiving field, who want to better understand strategies for designing technology for the web we have, but don’t necessarily always want.

I’m helping out by doing a bit of technical writing to support this work and thought I would jot down some notes about why I’m excited to be involved, and why I think WACZ is an important development for web archives.

So what is WACZ and why do we need another standard for web archives? Before answering that let’s take a quick look at the web archiving standards that we already have.

Since 2009 WARC (ISO 28500) has become the canonical file format for saving content that has been collected from the web. In addition to persisting the payload content, WARC allows essential metadata to be recorded, such as the HTTP requests and response headers that document when and how the web resources were retrieved, as well as information about how the content was crawled. ISO 28500 kicked off a decade of innovation that has resulted in the emergence of non-profit and commercial web archiving services, as well as a host of crawling, indexing and playback tools.

In 2013, after three years of development, the Memento protocol was released as RFC 7089 at the IETF. Memento provides a uniform way to discover and retrieve previous versions of web resources using the web’s own protocol, HTTP. Memento is now supported in major web archive replay tools such as OpenWayback and PyWB as well as services such as the Internet Archive, Archive-It,,, and cultural heritage organizations around the world. Memento adoption has made it possible to develop services like MemGator that search across many archives to see which one might have a snapshot of a specific page, as well as software extensions that bring a versioned web to content management systems like MediaWiki.

More recently, the Web Archiving Systems API (WASAPI) specification was developed to allow customers of web archiving services like Archive-It and LOCKSS to itemize and download the WARC data that makes up their collections. This allows customers to automate the replication of their remote web archives data, for backup and access outside of the given services.

So, if we have standards for writing, accessing and replicating web archives what more do we need?

One constant that is running through these various standards is the infrastructure needed to implement them. Creating, storing, serving and maintaining WARC data with Memento and WASAPI usually requires the management of complex software and server infrastructure. In many ways web archives are similar to the brick and mortar institutions that preceded them, of which only “the most powerful, the richest elements in society have the greatest capacity to find documents, preserve them, and decide what is or is not available to the public” (Zinn, 1977). This was meant as a critique in 1977, and it remains valid today. But really it’s a simple observation of the resources that are often needed to create authoritative and persistent repositories of any kind.

The Webrecorder project is working to both broaden and deepen web archiving practice, by allowing everyday users of the web to create and share high fidelity archives of web content using their web browser. Initial work on WACZ v1.0 began during the development of and, which are client-side JavaScript applications for creating and sharing archived web content. That’s right, they run directly on your computer, using your browser, and don’t require servers or services running in the cloud (apart from the websites you are archiving).

You can think of a WACZ as a predictable package for collected WARC data that includes an index to that content, as well as metadata that describes what can be found in that particular collection. Using the well understood and widely deployed ZIP format means that WACZ files can be placed directly on the web as a single file, and archived web pages can be read from the archive on-demand without needing to retrieve the entire file, or by implementing a server side API like Memento.
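As an illustration of that predictable package, here is a minimal sketch of a WACZ-shaped ZIP built with the standard library. The file names follow the general layout described above, but the WACZ specification is normative on which files are required and how they are formed:

```python
import io
import json
import zipfile

# Illustrative layout only -- consult the WACZ spec for the required files.
buf = io.BytesIO()  # an on-disk .wacz file would work the same way
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as z:
    # WARC data is stored (not deflated) so byte ranges stay addressable.
    z.writestr("archive/data.warc.gz", b"")
    z.writestr("indexes/index.cdx.gz", b"")   # lookup index into the WARC
    z.writestr("pages/pages.jsonl",
               json.dumps({"url": "", "title": "Example"}) + "\n")
    z.writestr("datapackage.json", json.dumps({"profile": "data-package"}))

assert "datapackage.json" in zipfile.ZipFile(buf).namelist()
```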

WACZ, and WACZ-enabled tools, will be a game changer for sharing web archives because it makes web archive data into a media type for the web, where a WACZ file can be moved from place to place as a simple file, without requiring complex server-side cloud services to view and interact with it, just your browser.

It’s important to remember that games can change in unanticipated ways, and that this is an important moment to think critically about the use cases a technology like WACZ will be enabling. You can see some of these threats starting to get documented in the WACZ spec repository alongside the standard use cases. These threats are just as important to document as the desired use cases, perhaps they are even more consequential. Recognizing threats helps to delineate the positionality of a project like Webrecorder, and recognizes that specifications and their implementations are not neutral, just like the archives that they make possible.

However, it’s important to open up the conversation around WACZ because there are potentially other benefits to having a standard for packaging up web archive data that are not necessarily exclusive to the and applications. For example:

  1. Traditional web archives (perhaps even non-public ones) might want to make data exports available to their users.
  2. It might be useful to be able to package up archived web content so that it can be displayed in content management systems like Wordpress, Drupal or Omeka.
  3. A WACZ could be cryptographically signed to allow data to be delivered and made accessible for evidentiary purposes.
  4. Community archivists and other memory workers could collaborate on collections of web content from social media platforms that are made available on their collective’s website.
  5. Using a standard like Frictionless Data could allow WACZ metadata to be simple to create, use and reuse in different contexts as data.

Webrecorder are convening an initial conversation about this work at their November community call. I hope to see you there! If you’d rather jump right in and submit a use case you can use the GitHub issue tracker, which has a template to help you. Or, if you prefer, you can also send your idea to info [at]

Zinn, H. (1977). Secrecy, archives, and the public interest. The Midwestern Archivist, 2(2), 14–26.

Humane Ingenuity 42: Not So NFT / Dan Cohen

(Noah Kalina, Lumberland / 20180716)

Noah Kalina is a gifted photographer who has a commercial practice and also works as an artist. He is probably best known for his Everyday project, in which he has been taking a photograph of himself each day for the last two decades. I am more interested in his nature photography, which is uniformly gorgeous. Noah lives in Lumberland, in upstate New York, and his photos across the seasons — of a single tree or river bend — are evocative and engrossing.

I want to buy a print of one of these photographs, but I can’t, for reasons you can probably imagine, since it is 2021: these remarkable images are only available as NFTs. Thus far, as I write this newsletter, Noah has sold 16 Lumberland NFTs, for a total of 13 ETH (Ether cryptocurrency), which is about $55,000.

Good for him! I want to see Noah’s art supported, and if I can’t throw old-timey U.S. dollars at him in exchange for physical media, I’m glad that he is auctioning off certified links to JPEGs for something equally ethereal. May he convert his ETH to USD ASAP.

But this feeling is bittersweet. Is this how we are going to support the arts and culture in the future? Are books, for instance, going to have associated NFTs? (Seriously, don’t look now.) 

Noah’s extraordinary photography is not even in the same ballpark as most NFTs, which tend toward disposable doodles and garish digital art. And yet…they are now in the same cinematic universe, with the same cartoonish twists and turns. One of the Lumberland NFTs, which Noah sold just last week for 0.408 ETH ($1,729), was put back on the market for a quick flip. First, it was listed by its owner for the juvenile price of 420.69 ETH (a cool $1,778,845), before it was lowered to 10 ETH ($42,284).

Regardless of artistic merit, because the underlying technology of NFTs is so aggressively decentralized and opposed to traditional institutional, legal, and social forms of trust and value, to succeed they must rely instead on the cohesion that comes from an imagined community (of Bored Apes or VeeFriends); but since such communities often have weak ties, weakened further by online anonymity, they are currently only viable when supercharged by a speculative financial mania.

Noah Kalina may take beautiful photographs, but this is not a pretty picture.

[Further reading: Robin Sloan’s recently published jeremiad, “Notes on Web3,” provides a fuller humanistic rebuke to this creeping financialization of everything, and the creepy notion that all transactions will live on forever in a consumption ledger.]

The world without us: a map of the world with just green spaces and water, by Jonty Wareing:


(The map defaults to London, but you can go anywhere. Above, of course, is Boston.)

Last week in our library, Charlotte Wiman, a Northeastern grad student in paleohydrology, presented some fascinating research about the future of the Mississippi River on a quickly warming planet. She projected forward by looking backward, specifically by finding detailed descriptions of the river and its morphology in old books.


(Plate from Harold Fisk, Geological Investigation of the Alluvial Valley of the Lower Mississippi River, 1944.)

Taking measurements from the maps, cross sections, and diagrams within these books, Wiman and three colleagues were able to generate a hydrological model going back centuries, to a time in the Middle Ages when last there was a warming trend in the Americas. They then reversed the timeline of this model to see what the Mississippi will look like centuries in the future. Their unsettling conclusion: The mighty Mississippi will be much less mighty, with vastly increased evaporation along its entire pathway.

[Charlotte Wiman, Brynnydd Hamilton, Sylvia G. Dee, Samuel E. Muñoz, “Reduced Lower Mississippi River Discharge During the Medieval Era,” Geophysical Research Letters, 19 January 2021.]


Previously covered in Humane Ingenuity: the potent combination of human expertise and AI processing. A lingering question: how much “human” is needed? In a new paper on the identification of galaxy types, “Practical Galaxy Morphology Tools from Deep Supervised Representation Learning,” Mike Walmsley, Anna M. M. Scaife, et al. find that you don’t need much. Given a relatively small number of human-categorized shapes — just around 10 examples — machine learning tools can extract similarly shaped clusters from nearly a million examples with near 100% accuracy.

They have even built a little interface so you can find galaxy shapes yourself.

Meanwhile, back here on Earth: “For legible pages from World War I handwritten diaries held at the State Library of Victoria, AI services are able to correctly transcribe them at a level between 10% to 49% accuracy.” Not great! Understanding century-old cursive handwriting may end up being one of the hardest problems in AI/ML.

(Sofia Karim, Lita’s House – Gallows (ফাঁসির মঞ্চ) / I (detail), 2020, photographic drawing, from the new Infinitude exhibit at Northeastern University.)


Follow-up discussions in Spain: session on RIM and scholarly communication / HangingTogether

Thanks to Francesc García Grimau, OCLC, who kindly provided the Spanish version of this blog post. The English version is available here.

Figure 1: Screenshot of the project map from the Spanish round table in March 2021

Talking about next-generation metadata in the context of different application areas

During the Spanish-language round table on next-generation metadata (NGM), held last March, participants expressed the wish that OCLC help them organize follow-up discussions, such as this one, to continue analyzing the landscape of NGM projects and the conversation about collaboration in Spain.

At the first follow-up meeting, held in September, participants discussed possible next steps and agreed to hold a series of virtual round tables, in Spanish, focused on three NGM application areas:

  1. Research Information Management (RIM) and Scholarly Communication
  2. Cultural heritage
  3. The (non-academic) book supply chain

My colleague Francesc García Grimau, from the OCLC office in Spain, and I organized the session on RIM and Scholarly Communication on 3 November. A brief report of that session follows. First, however, let me put this session in historical perspective, to show that the conversation with Spanish research libraries about their involvement in RIM-related activities predates the NGM round tables.

There is a backstory: registering researchers in Spain

In the March session, we heard that some of Spain's leading university libraries are enriching their local author authority records with persistent identifiers (PIDs) and feeding external systems, such as their university's research portal or the ORCID database, with authority records and bibliographic data. These efforts recall the recommended practices published by an OCLC Research task group in 2014 under the title "Registering Researchers in Authority Files." That report was an early attempt to outline the then newly emerging, multi-stakeholder RIM landscape and to foreground the central role of the researcher identifier in RIM data flows. In parts of Spain, libraries took note of the recommended practices and invited us to take stock of their efforts during a workshop organized by CSUC (Consorci de Serveis Universitaris de Catalunya) in December 2019. Karen Smith-Yoshimura and I reported on the workshop on this blog. The demand for greater visibility of researcher data came from policy mandates (at the institutional or autonomous-community level) and from researchers themselves. The main concern was that all the behind-the-scenes work done by libraries needed to be better aligned and more visible in order to win the commitment and resources of university leadership. Even then, collaboration was needed to help address some of the common problems and to lead to guidelines, best practices, and more efficient workflows. And even then, two library collaboration networks were named as the most relevant in this context: Dialnet (the network centered on its portal of Hispanic scientific literature) and REBIUN (the Spanish Network of University Libraries).

Continuando la conversación: los esfuerzos relacionados con RIM de las bibliotecas en España

En cierto modo, retomamos el hilo durante nuestra sesión de NGM del 3 de noviembre sobre RIM y comunicación científica. Para esta sesión, invitamos a los principales actores y proyectos que nuestros participantes pensaron que debían estar representados en la mesa: entre ellos, bibliotecas universitarias de diferentes regiones y ciudades del país (p. ej., Barcelona, Islas Baleares, Valencia, Alicante, Murcia, Madrid, País Vasco y La Rioja), la Biblioteca Nacional de España y varios actores en el ámbito de la comunicación académica, como CrossRef, ORCID y DataCite.

Conceptos y herramientas analíticas para ayudar a apoyar nuestra conversación

Figura 2: Casos de uso representados en las narrativas del estudio de caso de EE. UU. (fuente: Seminario web Works in Progress)

Para comenzar, ofrecí una actualización de la investigación de OCLC sobre las prácticas de RIM y el creciente papel de las bibliotecas. Un nuevo informe, que detalla el trabajo dirigido por Rebecca Bryant sobre RIM en las instituciones estadounidenses, aún no se había publicado, pero pude dar un adelanto y detallar las definiciones y diferencias entre RIM, comunicación científica y gestión de datos de investigación (RDM). Las demarcaciones son fluidas y cambiantes, pero es especialmente útil distinguir entre los términos cuando se habla del propósito de un proyecto o sistema determinado. Al comparar diferentes proyectos o sistemas, esto ayuda a determinar en qué medida hay superposición y si la colaboración podría ayudar a lograr sinergias y eficiencias. Los seis casos de uso que pueden ser soportados por los sistemas RIM (ver Figura 2), como se detalla en el nuevo informe, y el marco del sistema RIM son herramientas conceptuales y analíticas útiles para este tipo de ejercicio. Aunque todavía era demasiado pronto para poder usarlos durante esta sesión, fue un primer paso para familiarizarse con ellos.

A continuación, nos centramos en tres iniciativas españolas que se consideran más prometedoras en términos de poder movilizar a la comunidad y/o lograr una acción más concertada: la iniciativa de REBIUN y los proyectos Dialnet y Hércules.

REBIUN: hacia un catálogo colectivo de identidades de investigadores

Almudena Cotoner (Universidad de las Islas Baleares) presentó el perfil de catalogación de RDA para la creación o enriquecimiento de los registros de autoridad del personal docente e investigador de las universidades españolas. El perfil fue publicado este año por el grupo de trabajo RDA de REBIUN. El objetivo es ofrecer una guía de catalogación para las bibliotecas que deseen preparar sus registros de autoridad basados en MARC para un entorno de datos enlazados. Con ese fin, el perfil promueve el uso de múltiples identificadores, URIs y URLs que apuntan a tantos estándares, fuentes de información, centros de datos y sitios de referencia como sea posible. A largo plazo, cuando la mayoría de las bibliotecas universitarias de España hayan adoptado e implementado el perfil, la visión es integrar todos los registros de autoridad locales en un único catálogo colectivo del personal académico e investigador en España. Este catálogo podría servir para múltiples propósitos: mayor visibilidad nacional e internacional de los investigadores españoles; acceso a sus resultados y publicaciones académicas; estadísticas sobre su presencia en hubs o plataformas externas, como VIAF, Google Scholar, ResearchGate, etc. Actualmente, el perfil está implementado por todas las bibliotecas universitarias del Grupo de Trabajo RDA.

Dialnet: desde el apoyo al descubrimiento académico hasta el apoyo a las necesidades RIM

Joaquín León (Universidad de La Rioja) presentó las últimas novedades de Dialnet. Como recordatorio: la base de datos Dialnet recopila datos sobre investigadores españoles, portugueses y latinoamericanos y sus resultados, independientemente de su país de afiliación. Joaquín nos habló de Dialnet Métricas: un nuevo servicio, aún en fase beta, que tiene como objetivo apoyar la evaluación tanto de revistas científicas como de investigadores midiendo su visibilidad, prestigio e impacto. Para ello, se está desarrollando un conjunto de indicadores y métricas de productividad e impacto, basados en datos de referencias y citas recopilados a lo largo de un amplio rango de años. Estas métricas están destinadas a complementar las métricas existentes de Scopus, Web of Science y Google Scholar, y a abordar las brechas comunes en estas plataformas, en particular la infrarrepresentación de las publicaciones en español y de las humanidades y ciencias sociales. Joaquín también mencionó Dialnet CRIS, que ofrece un software CRIS que puede interoperar fácilmente con subconjuntos institucionales de los datos disponibles en las bases de datos bibliográfica y de métricas de Dialnet. El objetivo es facilitar el acceso institucional a los datos y apoyar los flujos de trabajo institucionales de RIM. Actualmente, el CRIS de Dialnet está implantado en dos universidades españolas.

Hércules: una nueva iniciativa RIM con ambiciones nacionales

Reyes Hernández-Mora Martínez (Universidad de Murcia) presentó Hércules, un proyecto de 5,4 M€, promovido por la Conferencia de Rectores de las Universidades Españolas (CRUE) y cofinanciado por la UE. Veinte universidades españolas participan en la iniciativa Hércules. El objetivo es racionalizar la gestión RIM en España, con el fin de facilitar la gestión de los costes de investigación y la inversión pública, reforzar la difusión y transferencia del conocimiento científico, monitorizar el acceso abierto a los resultados de la investigación e identificar oportunidades de colaboración entre universidades. La racionalización se persigue mediante el desarrollo de un software CRIS y de una capa RIM semántica que se desplegará a nivel nacional. Para esta última, se está desarrollando una red de ontologías interconectada y de datos enlazados, basada en los estándares internacionales de RIM VIVO y CERIF. La arquitectura de la capa semántica está diseñada para soportar la sincronización, el enriquecimiento y la publicación como datos enlazados de los datos RIM puestos a disposición por los diferentes sistemas CRIS universitarios. El objetivo es construir una infraestructura que respalde la mayoría de los casos de uso y funciones de RIM necesarios a nivel institucional y nacional. Actualmente, los pilotos se están ejecutando en dos universidades de España.

Conectando los puntos… ¿o los proyectos?

Para los participantes, las presentaciones de los proyectos fueron muy interesantes y, para muchos, la presentación del proyecto Hércules fue una novedad. La similitud entre los proyectos de datos bibliotecarios y los proyectos más relacionados con RIM era evidente, así como la necesidad de vincularlos. Intentamos clasificar las tres iniciativas (y algunas otras inventariadas anteriormente) de una manera que pudiera revelar cómo esta conexión podría ser más útil: caracterizándolas por el tipo de datos en los que se centran (bibliográficos o bibliométricos) y el alcance de los datos cubiertos (institucional, regional, nacional, internacional):

Institucional
  • Autoridades y datos bibliográficos: Portal del Investigador de la UCM; Professors UB; Brújula UAL (hay muchos más portales de investigadores dirigidos por bibliotecas universitarias en España)
  • RIM y datos bibliométricos: Dialnet CRIS (p. ej., Portal Bibliométrico UCM); Hércules CRIS (p. ej., Universidad de Murcia) (existen muchos más sistemas CRIS en uso en las universidades españolas)

Regional
  • Autoridades y datos bibliográficos: Autores Baleares; Revistes Catalanes amb Accés Obert
  • RIM y datos bibliométricos: Portal de la Recerca de Catalunya; Sistema de Información Científica de Andalucía; …

Nacional
  • Autoridades y datos bibliográficos: Perfil de aplicación de RDA de REBIUN; futuro Catálogo Colectivo de autoridades de REBIUN; Portal de Dialnet Datos-BNE
  • RIM y datos bibliométricos: Dialnet Métricas; Hércules (capa semántica); …

Internacional
  • Autoridades y datos bibliográficos: Portal de Dialnet; Infraestructura de gestión compartida de entidades (SEMI) de OCLC
  • RIM y datos bibliométricos: Dialnet Métricas; …

Esta tabla puede ayudar a los proyectos y servicios existentes a decidir con qué otros proyectos/servicios conectar sus datos. Por ejemplo:

  • Interconectar datos verticalmente desde el ámbito institucional (Professors UB) con ámbitos más amplios (Portal de la Recerca de Catalunya, Dialnet e, internacionalmente, SEMI de OCLC), para la consistencia de los datos a través de los sistemas y el descubrimiento a mayor escala;
  • Interconectar modelos semánticos horizontalmente (p. e., la capa semántica de Hércules, las ontologías Dialnet y el perfil de catalogación para registros de autoridad en RDA de REBIUN), para facilitar la navegación a través de diferentes sistemas semánticos utilizando los mismos datos sobre los investigadores y sus resultados.


El tiempo voló y tuvimos que terminar el debate prematuramente. La mayoría de los participantes salieron de la sesión con muchas preguntas nuevas y sin respuesta: sobre los proyectos presentados, sobre las oportunidades y la conveniencia de conectarse y colaborar, y sobre los próximos pasos. Es evidente que se trata de un debate que debe continuar. Volveremos a contactar con los participantes con una propuesta de próximos pasos. ¡Y estamos preparando la próxima sesión sobre datos de Patrimonio Cultural, que promete ser una conversación completamente diferente!

The post Debates de seguimiento en España: sesión sobre RIM y comunicación científica appeared first on Hanging Together.

Follow-up discussion series in Spain: RIM and scholarly communications session / HangingTogether

(A Spanish translation is available here).

Figure 1: Capture of the projects map from the Spanish round table session in March 2021

Talking about next generation metadata in the context of different application areas

During the Spanish round table session on next generation metadata (NGM), last March, participants expressed the wish to see OCLC help them organize follow-up discussions to continue the landscape analysis of the NGM projects and the conversation on collaboration in Spain.

At the follow-up meeting, in September, participants discussed possible next steps and agreed to hold a series of virtual discussions, in Spanish, focusing on three application areas of NGM:

  1. Research Information Management (RIM) and Scholarly Communications
  2. Cultural Heritage
  3. The (non-scholarly) book supply chain

My colleague Francesc García Grimau from the OCLC office in Spain and I hosted the session on RIM and Scholarly Communications on the 3rd of November. Below you can read a short report of this session. However, let me first put this session into historical perspective, to show that the conversation with Spanish research libraries about their involvement in RIM-related activities pre-dates the NGM round tables.

There is a longer story to this: registering researchers in Spain

From the March session, we heard that some of the major university libraries in Spain are enriching their local name authority files with persistent identifiers (PIDs) and feeding external systems – such as the local university’s research portal or the ORCID database – with library authority file and bibliographic data. These efforts are reminiscent of the recommended practices published by the OCLC Research Task Group in 2014, under the title “Registering Researchers in Authority Files”. This report was an early attempt to outline the then newly emerging multi-stakeholder RIM landscape and to foreground the central role of the researcher identifier in RIM dataflows. In some parts of Spain, libraries took heed of the recommended practices and invited us to take stock of their efforts during a workshop hosted by CSUC (Consorci de Serveis Universitaris de Catalunya) in December 2019. Karen Smith-Yoshimura and I reported back on the workshop in this blog. The demand for greater visibility of researchers’ data came from policy mandates (at institutional or autonomous community level) and the researchers themselves. The main concern was that all the behind-the-scenes work carried out by libraries needed to be better aligned and more visible in order to get commitment and resources from university leadership. Already then, collaboration was called for to help address some of the common issues and lead to guidelines, best practices, and more efficient workflows. Already then, two library collaboration networks were named as most relevant in this context: Dialnet (the network centered around the Portal of Hispanic scientific literature) and REBIUN (the Network of Spanish University Libraries).

Continuing the conversation: libraries’ RIM-related efforts in Spain

In a way, we picked up the thread again during our November 3rd NGM session on RIM and scholarly communication. For this session, we invited the main stakeholders and projects that our participants thought needed to be represented at the table: these included university libraries from different regions and cities of the country (including Barcelona, the Balearic Islands, Valencia, Alicante, Murcia, Madrid, the Basque Country, and La Rioja), the National Library of Spain, and several players in the scholarly communications arena, such as CrossRef, ORCID, and DataCite.

Concepts and analytical tools to help support our conversation

Figure 2: Use cases represented in US case study narratives (source: Works in Progress Webinar)

To get us started, I gave an update on OCLC Research’s work on RIM practices and the increasing role of libraries. A new report, detailing work led by Rebecca Bryant on RIM at US institutions, was not yet out, but I could give a sneak preview and elaborate on the definitions of, and differences between, RIM, scholarly communications, and research data management (RDM). The demarcations are fluid and evolving, but it is especially helpful to distinguish between the terms when talking about the purpose of a given project or system. When comparing different projects or systems, it helps to determine to what extent there is overlap and whether collaboration might help achieve synergies and efficiencies. The six use cases that can be supported by RIM systems (see Figure 2), as detailed in the new report, and the RIM system framework are useful conceptual and analytical tools for this type of exercise. Even though it was too early to use them during this session, it was a first step toward becoming acquainted with them.

We then zeroed in on three Spanish initiatives that are considered most promising in terms of being able to mobilize community and/or achieve more concerted action: the REBIUN initiative, the Dialnet projects, and Hércules.

REBIUN: towards a union catalog of researcher identities

Almudena Cotoner (University of the Balearic Islands) presented the RDA application profile for the creation or enrichment of the authority records of teaching and research staff from Spanish universities. The profile was published this year by the RDA working group of REBIUN. The objective is to provide a cataloging guide for libraries that wish to make their MARC-based authority records ready for a linked data environment. To that end, the profile promotes the use of multiple identifiers, URIs, and URLs pointing to as many standards, information sources, data-hubs, and reference sites as possible. In the longer term, once most university libraries in Spain have adopted and implemented the profile, the vision is to integrate all the local authority files into a single union catalog of academic and research staff in Spain. This catalog could serve multiple purposes: greater national and international visibility of Spanish researchers; access to their scholarly outputs; statistics about their presence in external hubs or platforms, such as VIAF, Google Scholar, ResearchGate, etc. Currently, the profile is implemented by all university libraries of the RDA Working Group.

Dialnet: from supporting scholarly discovery to supporting RIM needs

Joaquín León (University of La Rioja) presented the latest Dialnet developments. As a reminder: the Dialnet database collects data about Spanish, Portuguese, and Latin American researchers and their outputs, irrespective of their country of affiliation. Joaquín told us about Dialnet Metrics: a new service, still in beta, which aims to support the evaluation of both scientific journals and researchers by measuring their visibility, prestige, and impact. To this end, it is developing a set of productivity and impact indicators and metrics, based on reference and citation data collected over a wide range of years. These metrics are meant to complement the existing metrics from Scopus, Web of Science, and Google Scholar, and to address gaps that are common in these platforms, in particular the under-representation of Spanish-language publications and of the humanities and social sciences. Joaquín also mentioned Dialnet CRIS, which offers CRIS software that can easily interoperate with institutional subsets of the data available from the underlying Dialnet bibliographic and metrics databases. The goal is to facilitate institutional access to the data and support institutional RIM workflows. Currently the Dialnet CRIS is implemented at two Spanish universities.

Hércules: a new RIM-initiative with national ambitions

Reyes Hernández-Mora Martínez (University of Murcia) presented Hércules, a €5.4M project promoted by the Conference of Rectors of Spanish Universities (CRUE) and co-financed by the EU. Twenty Spanish universities participate in the Hércules initiative. The objective is to rationalize RIM in Spain, in order to facilitate the management of research costs and public investment, strengthen the dissemination and transfer of scientific knowledge, monitor open access to research results, and identify collaboration opportunities between universities. Rationalization is pursued through the development of CRIS software and a semantic RIM layer to be deployed at the country level. For the latter, an interconnected, linked data network of ontologies is being developed, based on the international RIM standards VIVO and CERIF. The architecture of the semantic layer is designed to support the synchronization, enrichment, and publication as linked data of the RIM data made available by the different university CRIS systems. The aim is to build an infrastructure that will support most RIM use cases and functions necessary at the institutional and national levels. Currently, pilots are running at two universities in Spain.

Connecting the dots … or the projects?

For the participants, the project presentations were very interesting, and for many, the presentation of project Hércules was new. The similarity between the library data projects and the more RIM-related projects was evident, as was the need to link them up. We attempted to classify the three initiatives (and some others that were inventoried previously) in a way that could reveal how such connections might be most useful: by characterizing them by the type of data they focus on (bibliographic or bibliometric) and the scope of the data covered (institutional, regional, national, international):

Institutional
  • Library authority and bibliographic data: Portal del Investigador de la UCM; Professors UB; Brújula UAL (there are many more researcher portals run by university libraries in Spain)
  • RIM and bibliometric data: Dialnet CRIS (e.g., Portal Bibliométrico UCM); Hércules CRIS (e.g., at the University of Murcia) (there are many more CRIS systems in use at Spanish universities)

Regional
  • Library authority and bibliographic data: Autores Baleares; Revistes Catalanes amb Accés Obert
  • RIM and bibliometric data: Portal de la Recerca de Catalunya; Sistema de Información Científica de Andalucía; …

National
  • Library authority and bibliographic data: REBIUN RDA application profile; future Shared Catalog of Name Authorities of REBIUN; Dialnet Portal Datos-BNE
  • RIM and bibliometric data: Dialnet métricas; Hércules semantic layer; …

International
  • Library authority and bibliographic data: Dialnet Portal; OCLC Shared Entity Management Infrastructure (SEMI)
  • RIM and bibliometric data: Dialnet métricas; Hércules ontology; …

This table can help projects and existing services decide which other projects/services to connect their data with. For example:

  • Interconnecting data vertically from the institutional scale (Professors UB) to larger scales (Catalan Researcher Portal, Dialnet, and internationally with OCLC SEMI), for consistency of the data across systems and discovery at larger scales.
  • Interconnecting semantic models horizontally (e.g., the Hércules semantic layer, the Dialnet ontologies, and the REBIUN RDA-Application Profile for academics), to facilitate the navigation across different semantic systems using the same data about researchers and their outputs.

To be continued …

Time flew, and we had to end the discussion prematurely. Most participants must have left the session with many new and unanswered questions: about the projects presented, about the opportunities for and desirability of connecting and collaborating, and about next steps. This was clearly a discussion to be continued. We will return to the participants with a proposal for next steps. And we are preparing the next session on Cultural Heritage data, which promises to be a completely different conversation!

The post Follow-up discussion series in Spain: RIM and scholarly communications session appeared first on Hanging Together.

OpenAlex Domains / Ed Summers

OpenAlex is a database of metadata about scholarly publishing that just had a beta release yesterday. It replaces the discontinued Microsoft Academic Graph (MAG), and is made available by the OurResearch project as a 500 GB dataset of tabular (TSV) files, that appears to be exported from an Amazon Redshift database.

500 GB is a lot to download. I suspect the download could be made significantly smaller by compressing the data first. Fortunately OurResearch are planning to make the data available via an API, which should make it easier to work with for most tasks. But having the bulk data available is very useful for data integration, and for getting a picture of the dataset as a whole. It’s a few years old now, but Herrmannova & Knoth (2016) has a pretty good analysis of the metadata fields used in the original MAG dataset, especially how they compare to similar sources.

I’ve never looked at MAG before, but after glancing at the list of tables I thought it could be interesting to take a quick look at the URLs, since it’s a bit more manageable at 44 GB, and can be fetched easily from AWS:

aws s3 cp --request-payer s3://openalex/data_dump_v1/2021-10-11/mag/PaperUrls.txt ~/Data/OpenAlex/PaperUrls.txt

The table has the following columns:

  1. PaperId
  2. SourceType
  3. SourceUrl
  4. LanguageCode
  5. UrlForLandingPage
  6. UrlForPdf
  7. HostType
  8. Version
  9. License
  10. RepositoryInstitution
  11. OaiPmhId

Just eyeballing the data it appears that most columns are sparsely populated except for the first four. The original MAG dataset was built from Microsoft’s crawl of the web, which then used machine learning techniques to extract the citation data (Wang et al., 2020). Of course the web is a big place, so I thought it could be interesting to see what domains are present in the data. These domains tell an indirect story about how Microsoft crawled the web, and provide a picture of academic publishing on the web.

Once you download it, wc -l shows that there are 448,714,897 rows in PaperUrls.txt. Unless you are using Spark or something, you probably don’t want to pull all that into memory. Over in this notebook I simply read in the data line by line, extracted the domain, and counted them. tldextract is pretty handy for getting the registered domain:
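The counting loop might look something like this. It's a minimal sketch, not the notebook's exact code: the `count_domains` helper and the column position are assumptions based on the table layout above, and a crude stdlib fallback stands in when tldextract isn't installed.

```python
from collections import Counter
from urllib.parse import urlsplit

try:
    import tldextract  # maps a host like www.ncbi.nlm.nih.gov to nih.gov

    def registered_domain(url):
        return tldextract.extract(url).registered_domain
except ImportError:
    # crude fallback for illustration only: keep the last two host labels
    def registered_domain(url):
        host = urlsplit(url).hostname or ""
        parts = host.split(".")
        return ".".join(parts[-2:]) if len(parts) >= 2 else host

def count_domains(lines, url_col=2):
    """Tally registered domains from the SourceUrl column of tab-separated rows."""
    counts = Counter()
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > url_col and cols[url_col]:
            counts[registered_domain(cols[url_col])] += 1
    return counts

# Streaming the file line by line keeps memory flat even at 448M rows:
# with open("PaperUrls.txt") as f:
#     top25 = count_domains(f).most_common(25)
```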

This found 243,726 domains, of which the top 25 account for over half. Below is a chart of how these top 25 break down. You can click on “Other” to toggle it off/on to get more of a view.

I’m not sure if there are any big surprises here. The prominence of and point to the significant influence of biological sciences and government. It’s also interesting to see edging out major publishers like Wiley, IEEE, Taylor & Francis, and Sage. The domain counts dataset is available here if you want to take a look yourself.


Herrmannova, D., & Knoth, P. (2016). An Analysis of the Microsoft Academic Graph. D-Lib Magazine, 22(9/10).

Wang, K., Shen, Z., Huang, C., Wu, C.-H., Dong, Y., & Kanakia, A. (2020). Microsoft Academic Graph: When experts are not enough. Quantitative Science Studies, 1(1), 396–413.

The interoperability imperative: Separated by a common language? / HangingTogether

In a recent post, I described the interoperability imperative as a key element in developing robust, sustainable research support services. The interoperability imperative is the need to pay close attention to what happens at the boundaries between the key agents – systems, people, and institutions – that bring research support services to life. Key questions include how those agents interact, and what infrastructure – technical, social, or collaborative – is needed to catalyze and sustain those interactions.

When the agents in question are people, an important piece of interoperability infrastructure is a shared vocabulary. A shared vocabulary facilitates communication by promoting mutual understanding of key concepts and terminology across stakeholders, which helps drive convergence in expectations, planning, and priorities. Take away the shared vocabulary, however, and the results can be very different.

In our report Social Interoperability in Research Support: Cross-campus Partnerships and the University Research Enterprise, we emphasize that an important tactic in building social interoperability – the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding – is to speak your partner’s language. Frame your message using concepts and terminology that are understandable and compelling to your audience.

Think about a word like preservation, and how it can differ in meaning between, say, a librarian and an IT specialist. In fact, there are lots of words and phrases that can scuttle a good conversation between individuals with different professional backgrounds. This point was underscored during the recent joint OCLC-LIBER online workshop Building Strategic Relationships to Advance Open Scholarship at your Institution, based on the findings of the Social Interoperability report. This three-part workshop brought together a group of international participants to examine the challenges of working across the institution, identify strategies and tactics for cross-unit relationship building, and develop a plan for increasing their own social interoperability.

During the first session, we conducted an online poll, asking participants – nearly all of whom worked in an academic library – to name a word, phrase, or concept that they found problematic in conversations with colleagues in other campus units. The results are displayed in this word cloud:

As the word cloud shows, there is a whole glossary’s worth of terminology that can potentially trip up an exchange of ideas between a librarian and one of their colleagues in another part of the campus. Most frequently put forward as a source of confusion and misinterpretation is the seemingly commonplace word “data”. Interpretations of this term run the gamut from research data sets associated with published scholarly work, to any kind of information in digital form. “Open” is another troublesome word: when we say something is open, do we mean it is freely available without restriction, or are there terms and conditions attached? Is “open” being used improperly as a synonym for “public domain”, or vice versa? Not to mention the freighted meanings that come with use of terms like “open content”, “open source”, and “open access”.

There is a whole glossary’s worth of terminology that can potentially trip up an exchange of ideas between a librarian and one of their colleagues in another part of the campus

The term “impact” can sow confusion as well: different parts of the university – both academic and administrative – can have different ideas about what constitutes impact (if they can define it at all), how it can be accurately measured, and what benefits are realized from documenting it (e.g., scholarly prestige, university reputation). Shades of meaning around the word “archiving” have plagued many cross-campus conversations: to an archivist, archiving is a profession with accepted stewardship standards and practices; to a computer scientist, archiving may simply mean safeguarding bits in a long-term storage system. And to some, “metadata” is hand-crafted, deep descriptions supporting discovery and use, while to others, it is merely some system-generated information that assists in file management.

The profusion of words displayed in the picture underscores the importance of defining terms and concepts prior to engaging in conversations with colleagues in other parts of the campus. It also speaks to the broader need to dedicate attention to building social interoperability through the use of tactics like “speaking your partner’s language” to build bridges and promote mutual understanding. In the Social Interoperability report, we feature this tactic as one of several (see picture below) that contribute toward the overarching strategy of Securing Buy-in – the idea that in successful cross-campus partnerships, each partner should see a clear benefit from working together.

Think about advocating for something like open science: appealing to ideals and principles to get people on board can be useful, but it is often not enough – stakeholders need to see how they will advance their practical interests by embracing open science practices. In order to reach that understanding, open science advocates need to, among other things, adopt forms of expression that communicate key concepts – “open”, “impact”, “privacy”, etc. – in ways that avoid misunderstanding and misinterpretation by those they seek to persuade.

When talking about communication obstacles arising over words with different meanings to different people, or different words used by different people for the same concept, it seems almost obligatory to quote Oscar Wilde, writing in The Canterville Ghost, who famously observed that “we have really everything in common with America nowadays, except, of course, language.” But the insight behind the quote is vital for libraries to embrace as they communicate their expertise to other parts of the campus, demonstrate the importance of including librarians in cross-campus working groups, task forces, and other initiatives, and ultimately, highlight the value of the library as a campus partner. Consider this observation from the Social Interoperability report:

However, we also heard that there are sometimes senior leaders in research administration or campus ICT who do not always understand how or why the library should be a partner in research support activities, often because these leaders were “coming from the outside [academia] and really have no concept.” In these cases, libraries and their advocates on campus must effectively and regularly communicate their value and offerings.

Familiarizing yourself with how potential campus partners speak, and dedicating attention to mutual understanding of key concepts and terms, is an important tool for communicating effectively and building social interoperability.

What terms or concepts do you find troublesome in communicating with colleagues in other units around campus? Share them in the comments below.

Thanks to my colleagues Rebecca Bryant and Chela Weber for helpful advice on improving this post!

The post The interoperability imperative: Separated by a common language? appeared first on Hanging Together.

ARCH UX Testing: Designing for the Users / Archives Unleashed Project

Building new platforms, systems, and applications can be a daunting task, especially as we consider the real impact our development and design choices have on how users think, feel, and interact with our nascent product.

“Usability is about people and how they understand and use things, not about technology.” — Steve Krug

Earlier this spring, we shared our roadmap for building a robust cloud-based interface to support web archival research at scale and enhance access to web archives. Months of development have resulted in our fully functional prototype: ARCH (Archive Research Compute Hub). As we move towards an official launch in Spring 2022, we are working with users to ensure it meets all of their web archive analysis needs!

How does ARCH Work?

ARCH allows users to delve into the rich data within web archival collections for further research. Users can generate and download over a dozen datasets within the interface, including domain frequency statistics, hyperlink network graphs, extracted full-text, and metadata about binary objects within a collection. ARCH also provides several in-browser visualizations that present a glimpse into collection content.

The ARCH interface

The design process for ARCH has involved a variety of interconnected stages: sketching a wireframe, connecting back-end processes with a user interface design, and conducting a multi-stage user testing process to continually assess user sentiment as functionality and interface improvements are made.

Four stages of the ARCH design process

This user testing has been vital to understanding the user journey, intuitive workflows, and the varying expertise and research needs of those working with web archives at scale.

Stages of ARCH UX Testing

At its core, User Experience (UX) testing seeks to understand the impressions, experience, and feelings a user expresses while interacting with a product prototype. These insights are critical because they bring the creators and developers into closer alignment with their end-users.

Conducting UX Testing for ARCH has allowed our team to understand research behaviours and the user journey while assessing what works well, what challenges arise, and identifying needs that aren’t being met.

Testing encompassed six main stages:

  1. Define Objectives. To scope our UX testing, we determined evaluation criteria, methods, and testing protocols for each stage. The purpose of testing was to engage with selected Archives Unleashed / Archive-It users who would provide feedback and input on their initial ARCH impressions. Ultimately these insights surfaced issues that needed to be addressed regarding usability, workflow, and functionality.
  2. Recruit. We then identified and engaged with Archives Unleashed users and Archive-It partners. This stage also meant thematically grouping users based on their relationship to the Archives Unleashed Project (e.g. Concept Design Interviewees, Advisory Board members, and “Power Users”). There was also a conscious effort to ensure recruitment reflected a diverse range of institutional categories, as set by Archive-It (e.g. University & Colleges, National Institutions, Public Libraries & Local Governments, etc.).
  3. Test. We primarily tested ARCH through remote surveys, which collected qualitative and quantitative data to determine satisfaction on several key indicators including intuitiveness, ease of navigation, terminology, visualizations, processing time, and application to user research.
  4. Analyze. A variety of methods were used to analyze the qualitative and quantitative data collected. Descriptive statistics provided a tester profile identifying geographic and institutional representation, professional role, and comfort level with data analysis software and tools. A five-point Likert scale measured satisfaction with intuitiveness, navigation, terminology, visualizations, processing time, and application. For qualitative feedback, thematic coding was applied to comments in the areas of user interface design choices, workflow, language, documentation, outputs, feature requests, and errors encountered.
  5. Implement Findings. Results were shared with the team, and feedback was translated into GitHub tickets to provide action-based tasks for future development cycles. Implementing user suggestions also provides a base for future iterative UX testing rounds.
  6. Verify with Users. As a multi-stage UX testing process, each subsequent round of testing served as another opportunity to review and refine impressions of prior development and improvements, improving our accuracy and capacity to match user needs at each stage.

A Snapshot of ARCH UX Round 3 Results

Overview of ARCH UX Testing rounds

The most recent round of UX testing was conducted throughout August and September 2021, connecting with past concept design interviewees, our project advisory board members, and selected Archives Unleashed and Archive-It “power” users. Participants shared their experience and feedback through a survey.

Tester Profiles

Profiling participants revealed users were primarily from North America and representative of two main institution categories: colleges/universities and national libraries. In addition, testers can be categorized into four main professional roles: researcher/professor, librarian/archivist, technologist, and managerial/leadership.

The survey asked testers to rate their comfort level with data analysis software and tools, to help assess their experience with the technical aspects of analyzing data. The majority of participants reported high confidence, while 25% of respondents described themselves as only slightly comfortable, meaning they can use tools and software but need assistance.

Areas of Satisfaction

A second area of quantitative questioning used a five-point Likert scale to measure satisfaction with intuitiveness, navigation, terminology, visualizations, processing time, and application.

Heatmap of quantitative scores on satisfaction statements

Overall, testers reported a positive ARCH experience, noting the benefit of being able to access collections and conduct initial analysis. Combining a heat map visualization with averaged satisfaction scores made it easy to identify the highest- and lowest-scoring statements and to diagnose areas for improvement. Statements that drew neutral or disagreement scores were explored further in the open comments.
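The averaging behind a heat map like this is simple to sketch. The statement names and scores below are invented for illustration only; they are not the project's actual survey data.

```python
# Hypothetical five-point Likert responses (1 = strongly disagree, 5 = strongly agree),
# one list of tester scores per satisfaction statement.
responses = {
    "intuitiveness":   [5, 4, 4, 3, 5],
    "navigation":      [4, 4, 5, 4, 4],
    "terminology":     [3, 4, 2, 4, 3],
    "processing time": [5, 5, 4, 5, 4],
}

# Average score per statement: higher means stronger agreement/satisfaction.
averages = {stmt: sum(scores) / len(scores) for stmt, scores in responses.items()}

# Rank statements from lowest to highest average to surface areas for improvement.
for stmt, avg in sorted(averages.items(), key=lambda kv: kv[1]):
    print(f"{stmt:15s} {avg:.2f}")
```

With these invented numbers, "terminology" surfaces as the lowest-scoring statement, which is exactly the kind of signal that would direct attention to the open comments for that area.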

Constructive Feedback

Open-ended survey comments offered a chance for testers to express their thoughts, feelings, and experience of using ARCH using their own words.

Overall, users were impressed with the new interface, noting that the integration between Archives Unleashed and Archive-It provides a familiar and dedicated environment for working with web archives, and offers opportunities for new research use cases.

Testers also appreciated the quick processing speeds, the variety of dataset and output options, and that no technical set-up was required (e.g. running a Spark shell).

Analysis of 94 comments identified six themes that conveyed detailed suggestions for improvements and considerations. These areas included interface design choices, workflow, language, accessibility, documentation, and output usability. Participants also identified errors encountered.

The majority of suggestions related to improving the UI workflow and navigation, supplemented by changes to language, design choices, and visualization features, and by recommendations around accessibility and usability.

Implementing UX Feedback

Our team uses GitHub to manage development and version control of the source code for the project’s ARCH platform, and uses tickets to track improvements and the roadmap. As a result, 27 GitHub tickets were created from the open comments to provide direction for actionable items and team discussions.

Lessons Learned

Carrying out UX testing has afforded an opportunity to learn more about our users and carry lessons learned forward into future testing cycles.

Here are some of our insights:

  1. People are genuinely excited to help out! When you build a community, you also build a system of support, encouragement, and connection. In reaching out to individuals who are familiar with our project — both by following progress and using our tools — we found there was general enthusiasm and responsiveness to UX testing!
  2. Keep it simple for the user. Understanding that time commitments can be a big ask, we purposely kept our survey short and simple. We were conscious to ensure UX testing didn’t become a burdensome task, but rather an exciting opportunity for participants to impact platform development directly.
  3. Responses were higher for interviews than surveys. It was a (pleasant) surprise that participants were willing to schedule a longer call to discuss their thoughts and impressions. In comparison, the asynchronous feedback form had a 41% response rate, though it drew very thoughtful input. We had anticipated that the survey form would yield a higher response rate precisely because it was asynchronous, but the human interaction and the opportunity to discuss development with members of the product team likely felt more approachable and added a personal, appealing touch.
  4. No matter how much you prepare, there will always be glitches. As they say, prepare for everything! During our UX testing, we had some unforeseen technical issues, which took our prototype offline. As a team, we triaged this issue, with some members focusing on the technical aspect while others connected with testers. Although all errors were remedied, this did cause a disruption to our testing process. This experience did reinforce within our team that no matter what issues arise, communication is critical!

Next Steps

Our final two rounds of UX testing will be conducted in early 2022. Participants will have a chance to interact with the latest version of ARCH, incorporating changes from our most recent round of testing. These final rounds will also provide an opportunity to stress test ARCH, monitoring the back-end for areas that become overwhelmed or processes that fail.

We look forward to sharing ARCH with the public in the Spring of 2022!

ARCH UX Testing: Designing for the Users was originally published in Archives Unleashed on Medium, where people are continuing the conversation by highlighting and responding to this story.

The $65B Prize / David Rosenthal

Senator Everett Dirksen is famously alleged to have remarked "a billion here, a billion there, pretty soon you're talking real money". There is a set of Bitcoin wallets containing about a million Bitcoins that are believed to have been mined by Satoshi Nakamoto at the very start of the blockchain in 2009. They haven't moved since and, if you believe the bogus Bitcoin "price", are currently "worth" $65B. Even if you're skeptical of the "price", that is "real money". Below the fold, I explain how to grab these million Bitcoin and more for yourself.

In the trial of Kleiman vs. Wright, currently under way in the Southern District of Florida, both sides stipulated that Craig Wright is Satoshi Nakamoto and thus controls the million Bitcoin. Kleiman's estate argues that, since Wright claimed that Kleiman helped create Bitcoin, he is entitled to half of them. Except in the context of the trial, Wright's claim to be Satoshi Nakamoto is implausible. He has been challenged to show that he has the private keys for Nakamoto's wallets by moving some of their coins. He has repeatedly failed to do so, and has failed to respond to court orders regarding them.

For the purpose of this post I assume that Wright is not Satoshi Nakamoto, and that the lack of motion of the coins in Nakamoto's wallets means either that Nakamoto is no longer with us, or has lost the relevant keys.

The security of cryptocurrencies has two aspects, both threatened by the rise of quantum computing:
  • The security of the blockchain itself, which was the subject of my talk at the "Blockchain for Business" conference. In principle, quantum computers can out-perform mining ASICs at Proof-of-Work, allowing for 51% attacks on blockchains secured using PoW.
  • The security of the public-key encryption used to protect transactions. In principle, quantum computers can use Shor's algorithm to break the encryption currently in use, allowing them to steal the contents of wallets.
The abstract of 2019's Quantum attacks on Bitcoin, and how to protect against them by Divesh Aggarwal et al reads:
The key cryptographic protocols used to secure the internet and financial transactions of today are all susceptible to attack by the development of a sufficiently large quantum computer. One particular area at risk are cryptocurrencies, a market currently worth over 150 billion USD. We investigate the risk of Bitcoin, and other cryptocurrencies, to attacks by quantum computers. We find that the proof-of-work used by Bitcoin is relatively resistant to substantial speedup by quantum computers in the next 10 years, mainly because specialized ASIC miners are extremely fast compared to the estimated clock speed of near-term quantum computers. On the other hand, the elliptic curve signature scheme used by Bitcoin is much more at risk, and could be completely broken by a quantum computer as early as 2027, by the most optimistic estimates. We analyze an alternative proof-of-work called Momentum, based on finding collisions in a hash function, that is even more resistant to speedup by a quantum computer. We also review the available post-quantum signature schemes to see which one would best meet the security and efficiency requirements of blockchain applications.
Aggarwal et al continue to track the likely date for the signature scheme to be broken, currently projecting between 2029 and 2044.

So we have a decade or so before quantum computing threatens the blockchain security provided by Proof-of-Work. Since what Proof-of-Work does is make Sybil attacks uneconomic, and quantum computers will initially be very expensive compared to conventional ASICs, the threat would in practice be delayed beyond the point where they were faster, until the point where they were cheap enough to make Sybil attacks economically feasible.

But the reward for building a "sufficiently large quantum computer" to break elliptic curve signatures would be the content of the wallets whose signatures were broken. The expectation is that, before this happened, wallet owners would transfer their HODL-ings to new wallets using "post-quantum cryptography", rendering them immune from theft via quantum computing. Problem solved!

Not so fast! I am assuming that the keys for Nakamoto's wallets are inaccessible through death or loss. Thus Nakamoto cannot migrate the million Bitcoin they contain to wallets that use post-quantum cryptography. Thus the first person to control a "sufficiently large quantum computer" can break the encryption on Nakamoto's wallets and transfer the million Bitcoin to a post-quantum wallet they own. Who is to know that this wasn't Satoshi Nakamoto taking a sensible precaution? The miscreant can then enjoy the fruits of their labor by repaying the costs of development of their quantum computer, and buying the obligatory Lamborghini. These would take only a small fraction of the $65B, and would be seen as Nakamoto enjoying a well-deserved retirement.

But wait, there's more! Chainalysis estimates that about 20% of all Bitcoins have been "lost", or in other words are sitting in wallets whose keys are inaccessible. That is around another 3.6 million stranded Bitcoin, or at the current "price" about $234B. These coins need to be protected from theft by some public-spirited person with a "sufficiently large quantum computer" who can transfer them to post-quantum wallets he owns. The reward for being first to rescue Nakamoto's and the other stranded Bitcoin is actually not $65B but almost a third of a trillion dollars. Even by Dirksen's standards that is "real money". Certainly enough to accelerate the development of a "sufficiently large quantum computer" before 2029.

Introducing a RIM System Framework / HangingTogether

This blog post is jointly authored by Rebecca Bryant from OCLC Research and Jan Fransen from the University of Minnesota Libraries.

The recent OCLC Research report series Research Information Management in the United States provides a first-of-its-kind documentation of research information management (RIM) practices at US research universities, offering a thorough examination of RIM practices, goals, stakeholders, and system components. This effort builds upon previous research conducted by OCLC Research, including the 2018 report Practices and Patterns in Research Information Management: Findings from a Global Survey, prepared in collaboration with euroCRIS.

The reports document the history, use cases, scope, stakeholders, and administrative leadership at five case study institutions:

  • Penn State University
  • Texas A&M University
  • Virginia Tech
  • UCLA
  • University of Miami.

A major contribution of these reports is the introduction of a RIM system framework. This model visualizes the functional and technical components of a RIM system through subdivision of RIM processes into three discrete segments:

  • Data sources
  • Data processing (including storage)
  • Data consumers.

RIM System Framework

The RIM System Framework is intentionally shaped like an hourglass, representing the funnel of information into a core RIM system data store and then out again in service of institutional business uses. The three discrete sections are color coded.

RIM System Framework
Data Sources

Data sources sit at the top of the funnel and refer to the information that must be collected from outside the RIM system, from both external and internal sources. They may include things like:

  • Person names and job title(s)
  • Publications
  • Patents
  • Grants and projects
  • Equipment
  • Institutional units and their hierarchical relationships
  • Instructional history
  • Statements of impact.

The framework identifies three types of data sources:

  • Publication databases and indexes are used as a data source about research outputs. These sources may be freely available (e.g., PubMed) or databases licensed through the institution’s library (e.g., Scopus or Web of Science).
  • Local data sources may include human resources, sponsored research, and student information systems. They hold information about employees and their job titles, grants awarded by external funding agencies, and instructional history, including courses taught.
  • Some information does not reside in existing databases; this local knowledge, such as statements of impact, will require manual entry. Organizational relationships and unit hierarchies, perhaps surprisingly, often fall into the local knowledge category, as data about institutional unit hierarchies can be elusive, incomplete, heterogeneous, mutable—and often completely unlinked to the people affiliated with these units.
Data Processing

The data processing section of the RIM System Framework documents how the information from the data sources is captured, transformed, and stored for later use. This constitutes the center of the model, including not only the main RIM data store in the middle but also the processes above it—the publication harvester, ETL processes, and metadata editor—that enable the transfer, cleaning, and enrichment of metadata into the RIM data store. Below the data store, it also includes the data transfer methods used to export the data in support of the various RIM use cases.

  • A publication harvester allows the regular and automated updates of publications authored by researchers in a RIM system, drawing content from one or more publication databases such as PubMed, Scopus, or others.
  • ETL stands for Extract, Transform, Load, a general term for computing processes that take data from one source, clean and crosswalk it as needed, and add or merge it into a target database.
  • The metadata editor is the interface that allows users to create, read, update, and delete information. This includes processes like the claiming/disclaiming of publications suggested by the publication harvester, importing publications from publication databases not captured by the publication harvester, and adding and maintaining data available only as “local knowledge.”
  • The data store is the main database where the aggregated data is maintained. It may be part of a licensed product or might be a bespoke database developed and maintained by the institution.
  • In order to use the data stored in a RIM system, there must be data transfer methods for extracting it. These typically take the form of APIs, but some RIM systems also allow data analysts to query the database directly using SQL.
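The harvest, transform, and load steps those components perform can be sketched minimally. Every field name, crosswalk entry, and record below is hypothetical, invented for illustration rather than drawn from any particular RIM product or publication database.

```python
# Minimal ETL sketch: take a harvested publication record, crosswalk its fields
# to the RIM system's internal schema, and merge it into the data store.

# Extract: a record as a publication database might return it (hypothetical fields).
source_record = {"DOI": "10.1234/example", "ArticleTitle": "On RIM Systems", "Year": "2021"}

# Transform: crosswalk source field names to the internal schema and normalize types.
crosswalk = {"DOI": "doi", "ArticleTitle": "title", "Year": "year"}
transformed = {crosswalk[field]: value for field, value in source_record.items()}
transformed["year"] = int(transformed["year"])

# Load: merge into the data store keyed on DOI, so re-harvesting the same
# publication updates the existing record rather than duplicating it.
data_store = {}
data_store[transformed["doi"]] = transformed

print(data_store)
```

In a real system the "extract" step would be the publication harvester's API calls, the data store would be a database rather than a dict, and the transfer methods (APIs or direct SQL) would read from that store; the keyed merge here illustrates why deduplication matters when harvests run repeatedly.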
Data Consumers

Once the data has been collected and transformed, it can be used to support one or more of the six RIM use cases identified in the reports:

  • Faculty activity reporting
  • Public portals
  • Metadata reuse
  • Strategic reporting & decision support
  • Open access workflows
  • Compliance monitoring

More details about these use cases are offered in a previous blog and in the reports themselves.

Using the RIM System Framework

The report authors developed this framework as a necessary aid in comparatively understanding the functional and technical components documented at the five case study institutions. The model can help demonstrate the different institutional decisions—and the array of options available to RIM system implementers.

For example, here’s the framework used to describe Virginia Tech’s implementation of Symplectic Elements. Virginia Tech uses the Elements product to support many of the system components, including metadata harvesting, the data store, and (soon) a public portal.

RIM System Framework for Virginia Tech

RIM System Framework for Texas A&M

In comparison, Texas A&M likewise uses Elements but only as a metadata harvester, instead utilizing a MySQL database for the data store and the open source VIVO product for the Scholars@TAMU public portal front end.

Use the RIM System Framework at YOUR Institution

We invite you to take a closer look at the model as introduced and applied in the RIM in the US report series and consider how this model may apply to your local system(s). We’d love to hear from you about how this works—please share in the comments below.


Rebecca Bryant, PhD, serves as Senior Program Officer at OCLC Research, where she leads investigations into research support topics such as research information management (RIM). Janet (Jan) Fransen is the Service Lead for Research Information Management Systems at the University of Minnesota Libraries. In that role, she works across divisions and with campus partners to provide library systems and data that save researchers, students, and administrators time and highlight the societal and technological impacts of the University’s research. The most visible system in her portfolio is Experts@Minnesota.


The post Introducing a RIM System Framework appeared first on Hanging Together.

Announcing Member Contact Update Form and Update on Membership Status / Digital Library Federation

NDSA membership contacts are the main point of contact if questions arise about the NDSA membership. These people also receive emails about the annual Coordinating Committee election and any other official correspondence the NDSA Leadership may have. To help keep these contacts up to date, the NDSA has been working on developing a way for organizations to let us know when their Program Representative or Authorized Signatory contact needs to be updated. We have developed a form for you to fill out to provide us with correct information.  

To assist NDSA with managing memberships, contacts whose emails bounce more than once will be automatically removed from the membership contact list if their organization has other contacts on file. For organizations with only a single contact, an effort will be made to reach out when emails bounce; however, if no response is received, NDSA reserves the right to remove the organization from the current membership list until it is able to provide new contacts.

If you would like to know who your contacts are please reach out to ndsa.digipres [at] and we will be happy to provide you with that information.  

This information, instructions, and the form are also available on the Member Orientation webpage. There is also a link to the form on the Join the NDSA webpage.  

~ The NDSA Leadership Team

The post Announcing Member Contact Update Form and Update on Membership Status appeared first on DLF.

2021 Research Associates / Harvard Library Innovation Lab

Like most things at LIL, our visiting researcher program has taken many forms over the years. This year, despite our team being spread across the East and Midwest Coasts (shout out to Lake Michigan) we were thrilled to welcome five research associates to the virtual LILsphere, to explore their interests through the lens of our projects and mission.

In addition to joining us for our daily morning standups, RAs attended project meetings and brainstorming sessions, and had access to all of the resources the Harvard Library system has to offer. Their individual research was based on questions they had or ideas they wanted to explore in the realm of each of our three tentpole projects: the Caselaw Access Project, H2O, and

Each of our visitors tackled an exceptionally interesting corner of our work; some helped propel us forward in terms of platform functionality, while others prompted us to reconsider some of our base assumptions about our users. They produced everything from new software features to teaching materials, design briefs, and research documentation. Below are brief descriptions of their work and links to their individual outputs.

Rachel Auslander

Using technology to empower research and information access is a central tenet of the LIL mission. Another value we have as a group is that of collaboration. This summer, Rachel explored what it would mean to be able to fuse external datasets into CAP via metadata in a way that would bring context and texture to caselaw.

Her design brief, which will guide future LILers as they integrate these ideas into the CAP interface, can be viewed here.

Ashley Fan

We got double the fun from Ashley this summer! Initially, she was interested in working on collections of caselaw that would empower journalists on various beats to apply a legal lens to their writing. Using a new feature available from CAP Labs, Ashley put together a series of Chronolawgic timelines for three different beats: education, health, and environment.

You can read her post about all of these timelines and find links to them here.

Then, in true LIL fashion, Ashley found herself swept up in an interesting problem that happened to come up during her time with the team. The power of the CAP dataset is that it makes accessing caselaw exponentially easier, but caselaw, by nature, can contain sensitive content about individuals involved in specific cases. This tension often manifests itself in requests by those individuals to remove their information from our database of cases, and Ashley jumped in alongside our team to research and formalize a process for decision-making and action.

Follow this link to learn more about this question, and Ashley's research.

Andy Gu

The scope of possibilities surrounding the Caselaw Access Project is so vast, we're really just starting to see how it can change the way scholars look at and study the law. This summer, Andy worked to create further flexibility in our built-in visualization features and expand users' ability to explore trends, particularly in relation to an extremely important aspect of the law: inter-case citation.

In a series of blog posts, Andy sets out how he extended the Trends tool using the Cases endpoint of the API; a powerful application of a new feature; and the design work that was done to integrate these upgrades into the general search interface of CAP.

Adaeze Ibeanu

Undergraduate curricula were the focus of Adaeze's summer. Where and how is the law taught to students who aren't explicitly attending law school? Via a thorough survey of undergraduate curricula and conversations with students, Adaeze presented our team with a summary of legal teaching in an undergraduate setting, and took a deeper dive into legal teaching in the social and natural science fields. Her research explored the potential impact of legal texts and open educational resources in completely new settings.

Aadi Kulkarni

Since 2018, our team has been integrating primary legal documents, including caselaw and the U.S. Code, directly into H2O, our open casebook platform, to make the creation of legal teaching materials even more seamless and powerful. This summer, Aadi continued that work by exploring ways in which H2O could include state code in a casebook—extending content capabilities for all of our users. Along the way, Aadi learned a lot about open-source communities and the process of integrating public materials into our platform.

If you're interested in our visiting research opportunities, make sure to follow us on Twitter. You should also feel free to reach out to us at!

Fall / Ed Summers

The view through the window, by my desk, on an overcast November day.


Spring 2022 Graduate Research Assistantship 25-50% – Information Quality Lab – University of Illinois at Urbana-Champaign / Jodi Schneider

Start date – January 16, 2022

Description, Responsibilities, & Qualifications:
Mixed methods research assistant to Information Sciences faculty. The incumbent will join the Information Quality Lab under the direction of Dr. Jodi Schneider to work on a newly funded, three-year IMLS grant, Strengthening Public Libraries’ Information Literacy Service Through an Understanding of Knowledge Brokers’ Assessment of Technical and Scientific Information. This project will conduct mixed methods case studies—COVID-19 (year 1); climate change (year 2); and AI and labor (year 3)—to understand how knowledge brokers such as journalists, Wikipedia editors, activists/advocates, and public librarians assess and use scientific and technical information. Ultimately, the project will develop a conceptual model of sensemaking and information use. Starting in 2023, the team will co-develop services for knowledge brokers and the public, in collaboration with public library test partners. Results from the project will have implications for public access, information literacy, and understanding of science on policy-relevant topics.

Duties may include:

  • Synthesizing a collection of existing literature related to knowledge brokers.
  • Collecting a sample of about 250 public-facing documents and multimedia, including news (e.g., online print outlets), Wikipedia pages, membership-based online forums, documentaries, and data visualizations, that report, quote, or analyze scientific products (research papers, preprints, datasets, etc.).
  • Using topic modeling, argumentation analysis, and other document analysis techniques to analyze documents and multimedia.
  • Preparing for and conducting interviews with knowledge brokers (journalists, Wikipedia editors, activists/advocates, public librarians).
    • Developing an interview protocol to solicit information from journalists, Wikipedia editors, activists/advocates, public librarians, etc. to understand how they assess the quality of scientific and technical information.
    • Identifying COVID-19 knowledge brokers to interview, by using the document/multimedia collection, organizational directories, etc.
  • Qualitative analysis of interview transcripts (including correcting automatically generated interview transcripts).

Required Qualifications:

  • Excellent communication skills in written and spoken English
  • Excellent analytical/critical thinking skills and effective time management skills
  • Interest in topics such as misinformation, information diffusion, science/technology policy, etc.
  • Interest or experience in one or more methods such as: mixed methods, document analysis, altmetrics, semi-structured interviewing, critical incident technique, or qualitative data analysis

Preferred Qualifications:

  • Available for multiple semesters, including summer
  • Experience conducting and/or transcribing interviews
  • Experience with qualitative analysis software such as ATLAS.TI, NVivo, Taguette, RQDA, etc.
  • Experience as a journalist, Wikipedia editor, activist, advocate, public librarian, information conduit, or knowledge broker
  • Enrollment in the Master’s in Library and Information Science program or in a PhD program
  • Previous completion of one or more CITI Program ethics training modules
  • Experience in academic and/or scientific writing

Application Procedure: Interested candidates should send a cover letter and resume in a single PDF file named Lastname_IMLS_RA.pdf (e.g., Schneider_IMLS_RA.pdf) to

Review of applications will begin immediately. Applications will be accepted until the position is filled. All applications received by November 15, 2021 will receive full consideration.

Posted on the Assistantship Clearinghouse.

Graduate Hourly position – Information Quality Lab – University of Illinois at Urbana-Champaign / Jodi Schneider

Start date – ASAP

Description, Responsibilities, & Qualifications:
Mixed methods research assistant to Information Sciences faculty. The incumbent will join the Information Quality Lab under the direction of Dr. Jodi Schneider to work on a newly funded, three-year IMLS grant, Strengthening Public Libraries’ Information Literacy Service Through an Understanding of Knowledge Brokers’ Assessment of Technical and Scientific Information. This project will conduct mixed methods case studies (first topic: COVID-19) to understand how knowledge brokers such as journalists, Wikipedia editors, activists/advocates, and public librarians assess and use scientific and technical information. Ultimately, the project will develop a conceptual model about sensemaking and use of information. Starting in 2023, the team will co-develop services for knowledge brokers and the public, in collaboration with public library test partners. Results from the project will have implications for public access, information literacy, and understanding of science on policy-relevant topics.

This position may become a tuition waiver generating assistantship for the Spring 2022 semester for eligible Master’s and Doctoral students.

Initial duties will include:

  • Developing an interview protocol to solicit information from journalists, Wikipedia editors, activists/advocates, public librarians, etc. to understand how they assess the quality of scientific and technical information
  • Synthesizing a collection of existing literature related to knowledge brokers
  • Collecting a sample of about 250 public-facing documents and multimedia, including news (e.g., online print outlets), Wikipedia pages, membership-based online forums, documentaries, and data visualizations, that report, quote, or analyze scientific products (research papers, preprints, datasets, etc.)
  • Identifying COVID-19 knowledge brokers to interview, by using the document/multimedia collection, organizational directories, etc.

Future work will include:

  • Conducting interviews with knowledge brokers (journalists, Wikipedia editors, activists/advocates, public librarians)
  • Correcting automatically generated interview transcripts
  • Qualitative analysis of interview transcripts
  • Using topic modeling, argumentation analysis, and other document analysis techniques to analyze documents and multimedia
  • Case studies on climate change (year 2) and AI and labor (year 3)

Required Qualifications:

  • Excellent communication skills in written and spoken English
  • Excellent analytical/critical thinking skills and effective time management skills
  • Interest in topics such as: misinformation, information diffusion, science/technology policy
  • Interest or experience in one or more methods such as: mixed methods, document analysis, altmetrics, semi-structured interviewing, critical incident technique, or qualitative data analysis

Preferred Qualifications:

  • Available for multiple semesters, including summer
  • Experience conducting and/or transcribing interviews
  • Experience with qualitative analysis software such as ATLAS.TI, NVivo, Taguette, RQDA, etc.
  • Experience as a journalist, Wikipedia editor, activist, advocate, public librarian, information conduit, or knowledge broker
  • Enrollment in the Master’s in Library and Information Science program or in a PhD program
  • Previous completion of one or more CITI Program ethics training modules
  • Experience in academic and/or scientific writing

Compensation: minimum $18/hour for Master’s students or $20/hour for PhD students (negotiable commensurate with experience)

Application Procedure: Interested candidates should send a cover letter and resume in a single PDF file named Lastname_IMLS_hourly.pdf (e.g., Schneider_IMLS_hourly.pdf) to

Review of applications will begin immediately. Applications will be accepted until the position is filled. All applications received by November 15, 2021 will receive full consideration.

Posted on the University of Illinois Financial Aid Virtual Job Board and Handshake.

Presentations: Collection directions, pandemic effects, Belgium, Italy / Lorcan Dempsey

Rethinking professional assembly


A clear pandemic effect will be a rethinking of professional assembly. When does it make sense to hold a face-to-face event? When and how do you run hybrid events? And how many events will go online only? Of course, it is not just a choice of mode. Our sense of how to present online is different from how we might present in person. We are likely to see more polls and other forms of interaction, more ways of breaking events into effective chunks, and new kinds of experience influenced by gaming and other approaches. It will be important to think about social interaction in this context. Face-to-face events will also change, to emphasise those elements where they deliver special value.

Here are some pointers about delivering effective online events from my colleagues:

Virtual is Here to Stay: Making Online Sessions Fun and Focused
Learn strategies to plan, design and host live online meetings or presentations that will energize, inform, and encourage your audience.

Informatie aan Zee and Bibliostar

I was disappointed recently not to be able to travel to Ostende to present at Informatie aan Zee 2021, the conference of VVBAD, the professional association for the information sectors in Flanders.

I delivered this presentation online, talking about how the pandemic is influencing library orientation and strategies, and then focusing on collections. I had presented a modified version of the material shortly before at the Bibliostar conference in Milan, again regretting that I could not attend.

This presentation is pretty much delivered as it might have been live, but I did miss the audience, the questions, the interaction.

Presentation, video, interview

Here is the presentation:

This is a presentation presented at Informatie aan Zee in Belgium and in modified form at Bibliostar in Italy. It discusses some Pandemic Effects and how various collections trends are being affected.

Here is the video that was shown at Informatie aan Zee:

I was pleased to do an interview with Paul Buschmann for the VVBAD publication Meta: Tijdschrift voor bibliotheek & archief, 2021 (7) p.22-23.

Interview: Collection directions and pandemic effects

Related to the Bibliostar event, there was a translation of a recent blog entry on similar topics in the publication associated with the event. (A copy is also available here.)

La biblioteca piattaforma della conoscenza - AA. VV. - Ebook Editrice Bibliografica
Buy the ebook La biblioteca piattaforma della conoscenza by AA. VV., published by Editrice Bibliografica in the Il cantiere biblioteca series.

I Confess To Right-Clicker-Mentality / David Rosenthal

"Worth $532M"
Cory Doctorow and, separately, Matthew Gault and Jordan Pearson have fun with the latest meme about NFTs, "Right-Clicker-Mentality". (Tip of the hat to Barry Ritholtz)

Gault and Pearson explain the meme:
what is the “right-clicker mentality”? Quite literally, it is referring to one’s ability to right-click on any image they see online to bring up a menu and select the “save” option in order to save a copy of the image to their device. In this term we have a microcosm of the entire philosophical debate surrounding NFTs.
I join in below the fold.

They continue:
NFTs, or non-fungible tokens, are unique tokens on the blockchain ostensibly representing a receipt of ownership pointing to some (usually) digital thing, like a JPEG hosted on a server somewhere. To be an NFT collector is to philosophically buy into the idea that owning this string of numbers means you “own” a JPEG that lesser people simply right-click to save on their machines at any time.
I wrote in NFTs and Web Archiving about the tenuous relationship between an NFT and the thing it purports to "own":
the purchaser of an NFT is buying a supposedly immutable, non-fungible object that points to a URI pointing to another URI. In practice both are typically URLs. The token provides no assurance that either of these links resolves to content, or that the content they resolve to at any later time is what the purchaser believed at the time of purchase. There is no guarantee that the creator of the NFT had any copyright in, or other rights to, the content to which either of the links resolves at any particular time.
Gault and Pearson are less technical:
NFTs only hold value because everyone owning them and trading them agrees they hold value.

To right-clickers, the blockchain ledger where their receipt resides is a comforting technological myth that NFT owners point to to legitimate their claims of ownership of a JPEG. It’s a kind of slacktivism, a way to address the problem without risking anything. Right-clicking a JPEG, saving it, and displaying it back to the NFT owner is a way to point out the Emperor has no clothes. Meanwhile, the NFT fans make millions off their naked Emperor.
Cory Doctorow is more direct:
The creators of NFTs envisioned them as a kind of bragging right that described the relationship between a creator and a member of their audience. When you paid for an NFT, you recorded the fact that you had made a donation to the artist that was inspired by a specific work. That fact was indelibly recorded in a public ledger – the blockchain – so everyone could see it.

Instantly, the idea of supporting artists with NFTs was converted into a financial bubble. The point of an NFT wasn't to support an artist – it was to acquire a tradeable asset that would go up in value because the buyer thought they could unload it for even more.
The Economist dipped a toe in the water by selling an NFT of its cover about NFTs, and reports on the experience in The fun in non-fungible:
A scramble of bids forced the winner, who went by the alias @9x9x9, to make an offer of 99.9 ether—around $420,000. The proceeds, net of fees, taxes and transaction costs, will be donated to The Economist Educational Foundation, an independent charity we support.
It was in a good cause, so that's all good clean fun. But, on reflection, The Economist had three takeaways. First:
Despite the slick interface of NFT platforms, the process is a nightmare. It includes setting up a digital wallet, funding it to pay any fees associated with creating an NFT, creating the token and finding a way to convert the proceeds into conventional money in a bank account. For most legal and tax advisers this is all virgin territory. The process is expensive: we paid “gas”, a fancy word for fees, and other levies. In order to become mainstream, applications in decentralised finance will have to be as easy to use as an iPhone and cheaper than dealing with conventional financial intermediaries.
BTC transaction fees
The risks inherent in a system attempting to provide immutable, anonymous transactions make "easy to use" and "cheaper" a considerable stretch. Especially as "gas" is volatile, cheap when few want to transact and expensive when many do. I wrote about this problem in Blockchain: What's Not To Like?:
CryptoKitties average "price" per transaction spiked 465% between November 28 and December 12 as the game got popular, a major reason why it stopped being popular. The same phenomenon happened during Bitcoin's price spike around the same time.
The second problem is energy:
Our modest experiment created as many emissions as a seat on a long-haul flight. Most platforms are exploring how to lower their energy use. If NFTs are to be the Next Big Thing, they must innovate their way towards a carbon-neutral footprint.
NFTs use the Ethereum blockchain. As I explain in Alternatives To Proof-of-Work:
Ethereum, the second most important cryptocurrency, understood the need to replace PoW in 2013 and started work in 2014.
Skepticism about the schedule for ETH2 is well-warranted, as Julia Magas writes in When will Ethereum 2.0 fully launch? Roadmap promises speed, but history says otherwise:
Looking at how fast the relevant updates were implemented in the previous versions of Ethereum roadmaps, it turns out that the planned and real release dates are about a year apart, at the very minimum.
Switching to Proof-of-Stake would definitely reduce Ethereum's carbon footprint, at the cost of greatly increasing the system's attack surface and making it even less decentralized than at present, when two mining pools control the majority of the mining power. The fact that a highly-skilled team has worked on the transition for seven years and claim, however credibly, still to be a year away from done is a testimony to how hard a problem this second one is.

The Economist's last takeaway is:
A third concern is contract enforcement. We hope that this will not be an issue for our token, because the asset—a unique digital representation of a cover image already in wide circulation—will be used within decentralised finance, and there is no obvious incentive to misuse it. But for NFTs that refer to assets outside this self-contained world, such as a patent or a building, the property rights conferred by the NFT may conflict with other contracts, and courts may not recognise the digital agreement.
I'm with Cory Doctorow when he writes:
NFTs, which have blown up into a massive, fraud-ridden speculative bubble that is blazing through whole rain-forests' worth of carbon while transfering billions from suckers to con-artists. A bezzle, in other words.

What the %$&! is Research Information Management? / HangingTogether

This post is authored by the five co-authors of Research Information Management in the United States: Rebecca Bryant, Jan Fransen, Pablo de Castro, Brenna Helmstutler, and David Scherer.

In Europe, most faculty and university leaders have a ready grasp of the terms Current Research Information System (CRIS) or Research Information Management System (RIMS). These terms—and the systems they refer to—have been around for a while, are widespread across the continent, and play an important role in policy compliance monitoring and in reporting for external requirements from research funders relating to research assessment and open access, such as the United Kingdom’s Research Excellence Framework (REF). A European community of practice is led by euroCRIS.

But here in the United States, the situation is far less clear. The terms CRIS and RIMS, while occasionally used, are less frequent and poorly understood. And, unlike Europe, we don’t have a vendor-agnostic community of practice, despite the rapid adoption of research information management (RIM) infrastructures in US research institutions. 

Instead, we more often hear other terminology like:

  • Research Networking System (RNS)
  • Research Profiling System (RPS)
  • Expert Finder System (EFS)
  • Faculty Activity Reporting (FAR) system.

This multiplicity of terminology reflects a broader confusion about just what research information management is. In a new OCLC Research report series entitled Research Information Management in the United States, we offer the following definition to describe RIM practice:

Research Information Management (RIM) systems support the transparent aggregation, curation, and utilization of data about institutional research activities.

RIM systems support multiple uses

The reports describe six discrete use cases that can be supported by RIM systems. Even though different stakeholders may use different terms to describe different uses, we believe it’s essential to recognize that they all collect and use much the same information—particularly metadata about research staff and their publications and other research outputs produced within the institution.

Six RIM use cases identified in the Research Information Management in the United States report series

The similarities between these uses and the systems that support them are greater than their differences. In fact, a key observation of our study is that many different systems are used by different stakeholders within research institutions without often recognizing that all of these disparate RIM systems are part of a larger, umbrella Research Information Management (RIM) product category. We hope that our identification of these use cases will provide a frame of reference for institutions to examine and better understand their own complex practices, inviting increased collaboration, information sharing, decision-making, and institutional investment.

Furthermore, recognizing these similarities is a necessary step toward working across institutional silos and developing a cross-functional, vendor-agnostic community of practice in the United States.[1]

Case studies of RIM practices at US institutions

Our observations are based upon the close study of RIM practices at five US research institutions, selected for their diversity of practices, systems, products, and stakeholders:

  • Penn State University
  • Texas A&M University
  • Virginia Tech
  • UCLA 
  • University of Miami. 

During late 2020 and early 2021, we conducted 23 semi-structured interviews with 39 individuals at 8 institutions and combined this knowledge with a review of the literature, informed by our own past and current experiences as RIM system managers. The findings are divided into two separate reports: 

Part 1 – Findings and Recommendations provides much-needed context for understanding the RIM landscape in the United States by documenting use cases, detailing a RIM system framework, and offering concise recommendations for RIM stakeholders.
Part 2 – Case Studies provides a detailed narrative of the RIM practices at the five case study institutions, offering context about the history, goals, use cases, scope, stakeholders, and users at each institution.

The reports document the array of stakeholders in research information management, including the library, research office, provost and academic affairs units, faculty affairs, human resources, external relations (like advancement and corporate relations), IT, and, of course, the faculty and researchers themselves. Most institutions support multiple RIM uses, often with different systems (and with various degrees of interoperability), which can be seen at a glance here: 

Use cases practiced at five case study institutions

We encourage you to learn more by reading the reports, which, like all OCLC Research outputs, are openly available to the community. Please share with others in the community and watch for more blog posts about this project here on Hanging Together.

[1] This definition and the use cases were developed through our examination of the US RIM ecosystem, but we have also sought to develop descriptions that work beyond the borders of the US, to particularly be inclusive of practices in Europe and the rest of the world.

  • Rebecca Bryant, PhD, serves as Senior Program Officer at OCLC Research where she leads investigations into research support topics such as research information management (RIM).
  • Janet (Jan) Fransen is the Service Lead for Research Information Management Systems at University of Minnesota Libraries, particularly supporting Experts@Minnesota.
  • Pablo de Castro works as Open Access Advocacy Librarian at the University of Strathclyde in Glasgow and also serves as Secretary for the euroCRIS association to promote collaboration across the research information management community.
  • Brenna Helmstutler is the Librarian for the School of Information Studies at Syracuse University and works with other library team members to support Experts@Syracuse.
  • David Scherer is the Scholarly Communications and Research Curation Consultant with the University Libraries at Carnegie Mellon University where he serves as the Operational Lead for the CMU Elements Research Information Management (RIM) Initiative.

The post What the %$&! is Research Information Management? appeared first on Hanging Together.

Interface Upgrade | Integrating Queries into Search and Case View / Harvard Library Innovation Lab

With expanded feature capabilities, users may find writing these queries to be more difficult, especially as researchers increase the complexity of their investigations. To make usage easier, we have integrated the Trends query language into the Search and Case View features. From a search query, users can click the Trends button, upon which our servers will automatically convert an existing query into a Trends timeline.

Gif showing search results converted into a Trend timeline.

Additionally, users can now view the citation history of a particular case from that case's page by clicking the "View citation history in trends" button.

Gif showing ability to display citation history on a Trend timeline from an individual case

Our exploration of timeline generation for empirical legal scholarship has inspired us to reimagine how people reason about CAP's corpus of American caselaw. In the future, we hope to restructure the search page further and empower people to quickly ask complex questions about American caselaw over time.

We believe that citation-based analysis can significantly enrich our understanding of American caselaw, and we are excited to see how these tools can expose insights both in the law itself and in quantitative techniques for its exploration. If you have any ideas for how we can further expand on these features, please do not hesitate to reach out to us at

This is part of a series of posts by Andy Gu, a visiting researcher who joined the LIL team in summer 2021. We were inspired to build these features after recognizing the power of the Caselaw Access Project's case and citation data to analyze and explore caselaw. We hope that these features will make empirical study of caselaw both faster and more accessible for researchers.

New Feature | Flexible Citation Queries / Harvard Library Innovation Lab

Expand your ability to visualize citation practices with the latest support added to our Trends tool. Trends now supports flexible queries of how cases cite other cases in addition to the other ways in which cases can be filtered. By appending the name of any acceptable filter parameter to cites_to__{parameter name here}, users can retrieve all cases citing to cases matching said filter. The parameter name, like before, can be any parameter accepted by the Cases API.

For instance, the following query graphs the number of cases that cite to another case where Justice Cardozo wrote the majority opinion against the number of cases where Justice Brandeis wrote the majority opinion.

Figure 1 (comparison of majority opinion authors over time) query: api(cites_to__author_type=cardozo:majority), api(cites_to__author_type=brandeis:majority)

The cites_to__ feature provides users the power to flexibly reason about case citation patterns. For instance, if a user were interested in how the Supreme Court of California cited authority from its own jurisdiction in comparison to authority from other jurisdictions, they could write the following query:

Figure 2 (citations within jurisdiction versus outside) query: api(court=cal-1&cites_to__jurisdiction__exclude=cal), api(court=cal-1&cites_to__jurisdiction=cal)

This set of parameters can be integrated with any other parameters compatible with the Cases API. For instance, we can filter the above timeline only to citations of cases that mention the term 'technology':

Figure 3 (citations within jurisdiction versus outside, filtered by topic) query: api(court=cal-1&cites_to__jurisdiction__exclude=cal&cites_to__search=technology), api(court=cal-1&cites_to__jurisdiction=cal&cites_to__search=technology)

Users may also use the parameters within the api() tag to query the Cases API directly. A caveat to the cites_to__ feature is that if the number of cases that fulfill a cites_to__ condition is greater than 20,000 cases, our system will randomly select 20,000 cases within the filtered cases to match against. For more information about all the parameters we support, please feel free to consult our Cases API documentation here.
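Because these are ordinary query parameters, a citation filter can be assembled programmatically before calling the Cases API. A minimal Python sketch; the `build_cites_to_params` helper is invented here for illustration, and the parameter names are taken from the examples above:

```python
# Hypothetical helper: compose cites_to__ filters with ordinary Cases API
# parameters, producing a dict suitable for an HTTP client such as requests.
def build_cites_to_params(base=None, **cites_filters):
    params = dict(base or {})
    for name, value in cites_filters.items():
        # Each filter name is appended to the cites_to__ prefix,
        # e.g. jurisdiction__exclude -> cites_to__jurisdiction__exclude.
        params[f"cites_to__{name}"] = value
    return params

# Figure 3's first series: California Supreme Court cases citing
# out-of-jurisdiction cases that mention "technology".
params = build_cites_to_params(
    {"court": "cal-1"},
    jurisdiction__exclude="cal",
    search="technology",
)
print(params)
# {'court': 'cal-1', 'cites_to__jurisdiction__exclude': 'cal', 'cites_to__search': 'technology'}
```

The resulting dict could then be passed to an HTTP client, for example `requests.get("https://api.case.law/v1/cases/", params=params)`.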

If you're interested in exploring this data in a different way, make sure you've checked out Cite Grid.


Feature Update | Extension of Trend Search Capability / Harvard Library Innovation Lab

Today, we are announcing an update to the Caselaw Access Project (CAP) API and Trends tool to help users better investigate changes in the law over time. These new features enable users to easily generate timelines of cases and explore patterns in case citations. We hope that they can help researchers uncover new insights about American caselaw.

Previously, the project's Historical Trends tool permitted users to graph word and phrase frequencies in cases over time. For instance, the following graph displays the frequency of the terms 'lobster' and 'gold' over time in cases in Maine and California.

Figure 1 (historical trends results, displayed on a graph) query: me: lobster, cal: gold

We have extended the Trends tool so that users can generate timelines of cases for any parameter accepted by the Cases API endpoint. As a result, users can ask broad questions about the Caselaw Access Project's dataset and quickly retrieve timelines of cases that follow the queried pattern.

For instance, the following query presents timelines of cases which cite Mapp v. Ohio since 1961, split by jurisdiction.

Figure 2 (query results by jurisdiction, displayed on a graph) query: *: api(cites_to=367 U.S. 643)

The breadth of available filters drastically increases the number of possibilities for a researcher to explore case data. For example, we can take the author parameter in the Cases API to graph the number of cases where Justice Scalia wrote a dissenting opinion with the number of cases where Justice Scalia wrote a majority opinion. By clicking into the timeline, users can retrieve granular information about the qualifying cases.

Figure 3 (results filtered by author, displayed on a graph) query: api(author_type=scalia:dissent), api(author_type=scalia:majority)
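Once the qualifying cases have been retrieved (for instance via the Cases API), a yearly timeline like these can also be rebuilt locally. A rough Python sketch with pandas; the `decision_date` field name follows the Cases API, but the sample records here are invented:

```python
import pandas as pd

# Invented sample records standing in for Cases API results;
# real cases carry many more fields than shown here.
cases = [
    {"name": "A v. B", "decision_date": "1987-06-01"},
    {"name": "C v. D", "decision_date": "1987-11-12"},
    {"name": "E v. F", "decision_date": "1992-03-30"},
]

df = pd.DataFrame(cases)
df["year"] = pd.to_datetime(df["decision_date"]).dt.year
# Count cases per decision year, analogous to one Trends series.
timeline = df.groupby("year").size()
print(timeline.to_dict())
```

Plotting `timeline` (e.g. `timeline.plot()`) gives a rough local analogue of a single Trends series.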

The power of this flexible query language increases with each parameter supplied to the Trends query. If a user wanted to compare the frequency of Supreme Court cases where Justice Scalia dissented and Justice Breyer wrote the majority opinion with cases where Justice Breyer dissented and Justice Scalia wrote the majority opinion, they could draft the following search:

Figure 4 (graphed results of specific opinion author queries) query: api(author_type=scalia:dissent&author_type=breyer:majority&court=us), api(author_type=scalia:majority&author_type=breyer:dissent&court=us)

We have also updated our underlying database to allow users to reason over the citation patterns of individual opinions, in addition to the case itself. If a user wanted to see how many times Justice Scalia specifically cited Mapp v. Ohio in an opinion, we can do so with the following query:

Figure 5 (number of times a case was cited by a specific author over time, displayed on a graph) query: api(author__cites_to_id=1785580&author=scalia), api(author__cites_to_id=1785580&author=breyer)

We believe that these features will empower researchers to quickly conduct rich explorations of American caselaw, and we are excited to see how they can expose new insights about our corpus of cases. If you have any ideas for how we can further expand on these features, please do not hesitate to reach out to us at


It’s here, the 2021 NDSA Staffing Survey! / Digital Library Federation

Do you work at an organization that stewards digital content for long-term preservation? If so, we’d like to hear from you about staffing for digital preservation at your organization. The 2021 NDSA Staffing Survey is designed to gain insight into current and ideal staffing for digital preservation programs. 

The 2021 Staffing Survey is meant to be answered by individuals and there is no limit on the number of individual respondents per organization. Responses are sought from individuals worldwide with current digital preservation responsibilities at their organization, ranging from practitioners to department managers to senior leadership. You do not need to be an NDSA member to answer this survey.

Follow this link to access the survey. It is available until Friday, December 10, 2021 and is expected to take approximately 20-25 minutes to complete. To assist with completing the survey, a PDF preview of all of the survey questions can be viewed in advance by following this link.

The 2021 survey has undergone an extensive redesign from the earlier 2012 and 2017 iterations, prompted by findings from the last survey and changes in the field over the last decade. During this process, the Staffing Survey Working Group aimed to ensure that all participants would see themselves reflected in the answer choices. While the survey is not exhaustive, we believe it strikes a balance. However, we welcome all feedback about how future instances of the survey can be improved, and encourage participants to submit their comments at the survey’s end.

Interested in the results of previous NDSA staffing reports? The code books, data, and reports are available in the NDSA OSF.

If you have questions or concerns about this survey, please contact us and include “Staffing Survey” in the subject line.

Thank you for helping NDSA and our community define and advance digital preservation!

-The NDSA Staffing Survey Working Group

The post It’s here, the 2021 NDSA Staffing Survey! appeared first on DLF.

Exploring #elxn44 Twitter Data / Nick Ruest


This is my third time collecting tweets for a Canadian Federal Election, and it will most likely be my last. The changes to the Twitter API, including the Academic Research product track, twarc2, and the great Documenting the Now Slack, have considerably lowered the barrier to collecting and analyzing Twitter data. I’m proud of the work I’ve done over the last seven years collecting and analyzing tweets. I hope it provides a solid implementation pattern for others to build on in the future!

If you want to check out past Canadian election Twitter dataset analysis:

In this analysis, I’m going to provide a mix of examples for examining the overall dataset; using twarc utilities, twut, and pandas. The format of this post is pretty much the same as the last election post I did, much like the results of this election!


The dataset was collected with Documenting the Now’s twarc. It contains 2,075,645 tweet ids for #elxn44 tweets.

If you’d like to follow along, and work with the tweets, they can be “rehydrated” with Documenting the Now’s twarc, or Hydrator.

$ twarc hydrate elxn44-ids.txt > elxn44.jsonl

The dataset was created by collecting tweets from the Standard Search API on a cron job every five days from July 28, 2021 to November 01, 2021.

#elxn44 Tweet Volume
#elxn44 wordcloud

Top languages

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
// Read the hydrated tweets into a Spark DataFrame.
val tweetsDF = spark.read.json(tweets)
val languages = language(tweetsDF)
languages.show(10, false)

|lang|  count|
|  en|1939944|
|  fr|  72900|
| und|  56247|
|  es|   2067|
|  ht|    583|
|  in|    417|
|  tl|    373|
|  ro|    286|
|  ca|    256|
|  pt|    245|

Top tweeters

Using the elxn44-user-info.csv derivative, and pandas:

import pandas as pd
import altair as alt

userInfo = pd.read_csv("elxn44-user-info.csv")

# Tally tweets per account and keep the top ten.
tweeters = userInfo["screen_name"].value_counts().rename_axis("Username").reset_index(name="Count").head(10)

tweeter_chart = (
    alt.Chart(tweeters)
    .mark_bar()
    .encode(
        x=alt.X("Count:Q", axis=alt.Axis(title="Tweets")),
        y=alt.Y("Username:O", sort="-x", axis=alt.Axis(title="Username")))
)

# Label each bar with its count.
tweeter_values = tweeter_chart.mark_text(align="left", baseline="middle", dx=3).encode(text="Count:Q")

(tweeter_chart + tweeter_values).configure_axis(titleFontSize=20).configure_title(fontSize=35, font="Courier").properties(height=800, width=1600, title="#elxn44 Top Tweeters")
#elxn44 Top Tweeters


Using one of the twarc utilities, we can find the most retweeted tweets:

$ python twarc/utils/ elxn44.jsonl | head


From there, we can append the tweet ID to a tweet status URL to see the tweet. Here’s the top three:

  1. 4,930

  2. 3,763

  3. 3,645
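Since a tweet ID is all you need, building the link is a one-liner. A minimal Python sketch (the helper name is my own; the `i/web/status` URL form redirects to a tweet without needing the author’s screen name):

```python
def tweet_url(tweet_id):
    # The i/web/status form resolves to the tweet regardless of
    # the author's current screen name.
    return f"https://twitter.com/i/web/status/{tweet_id}"

print(tweet_url("20"))  # https://twitter.com/i/web/status/20
```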

Top Hashtags

Using the elxn44-hashtags.csv derivative, and pandas:

hashtags = pd.read_csv("elxn44-hashtags.csv")

# Tally hashtag occurrences and keep the top ten.
top_tags = hashtags["hashtag"].value_counts().rename_axis("Hashtag").reset_index(name="Count").head(10)

tags_chart = (
    alt.Chart(top_tags)
    .mark_bar()
    .encode(
        x=alt.X("Count:Q", axis=alt.Axis(title="Tweets")),
        y=alt.Y("Hashtag:O", sort="-x", axis=alt.Axis(title="Hashtag")))
)

# Label each bar with its count.
tags_values = tags_chart.mark_text(align="left", baseline="middle", dx=3).encode(text="Count:Q")

(tags_chart + tags_values).configure_axis(titleFontSize=20).configure_title(fontSize=35, font="Courier").properties(height=800, width=1600, title="#elxn44 Top Hashtags")
#elxn44 hashtags

Top URLs

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
// Read the hydrated tweets into a Spark DataFrame.
val tweetsDF = spark.read.json(tweets)
val urlsDF = urls(tweetsDF)
urlsDF.show(10, false)

[Top URLs table: the URL values were lost in extraction; the top ten counts were 1107, 1073, 672, 644, 491, 472, 467, 453, 451, and 448.]
#elxn44 Top URLs

Adding URLs to Internet Archive

Do you know about the Internet Archive’s handy Save Page Now API?

Well, you could submit all those URLs to the Internet Archive if you wanted to.

You have to be nice, and need to include a 5 second pause between each submission; otherwise your IP address will be blocked for 5 minutes!

Too Many Requests

We are limiting the number of URLs you can submit to be Archived to the Wayback Machine, using the Save Page Now features, to no more than 15 per minute.

If you submit more than that we will block Save Page Now requests from your IP number for 5 minutes. Please feel free to write to us if you have questions about this. Please include your IP address and any URLs in the email so we can provide you with better service.

You can use something like Raffaele Messuti’s example here.

Or, a one-liner if you prefer!

$ while read -r line; do curl -s -S "https://web.archive.org/save/$line" > /dev/null && echo "$line submitted to Internet Archive" && sleep 5; done < elxn44-urls.txt
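If shell isn’t your thing, the same loop can be sketched in Python. The `https://web.archive.org/save/` endpoint is the real Save Page Now URL, but the function names here are my own:

```python
import time
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"

def save_request(url):
    # Build the Save Page Now request URL for a target URL.
    return SAVE_ENDPOINT + url

def submit_all(urls, pause=5):
    # Submit each URL, sleeping 5 seconds to stay under the
    # 15-requests-per-minute Save Page Now limit.
    for url in urls:
        urllib.request.urlopen(save_request(url))
        print(f"{url} submitted to Internet Archive")
        time.sleep(pause)
```

Feed it the lines of elxn44-urls.txt and it does the same thing as the one-liner, just easier to extend with retries or logging.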

I wonder what the delta is this time between collecting organizations, the Internet Archive, and tweeted URLs?

Top media urls

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
// Read the hydrated tweets into a Spark DataFrame.
val tweetsDF = spark.read.json(tweets)
val mediaUrlsDF = mediaUrls(tweetsDF)
mediaUrlsDF.show(10, false)

[Top media URLs table: the image URLs were lost in extraction; the top ten counts were 4447, 1530, 1350, 1009, 1009, 1009, 1009, 908, 908, and 908.]
#elxn44 Top Media

Top video urls

Using the full dataset and twut:

import io.archivesunleashed._
import spark.implicits._

val tweets = "elxn44_search.jsonl"
// Read the hydrated tweets into a Spark DataFrame.
val tweetsDF = spark.read.json(tweets)
val videoUrlsDF = videoUrls(tweetsDF)
videoUrlsDF.show(10, false)

[Top video URLs table: the video URLs were lost in extraction; the top ten counts were 1009, 1009, 1009, 1009, 908, 908, 908, 908, 838, and 838.]
#elxn44 Top Video


We looked a bit at the media URLs above. Since we have a list (elxn44-media-urls.txt) of all of the media URLs, we can download them and get a high level overview of the entire collection with a couple different utilities.

Similar to the one-liner above, we can download all the images like so:

$ while read -r line; do wget "$line"; done < elxn44-media-urls.txt

You can speed up the process if you want with xargs or GNU Parallel.

$ cat elxn44-media-urls.txt | parallel --jobs 24 --gnu "wget '{}'"


Juxta is a really great tool that was created by Toke Eskildsen. I’ve written about it in the past, so I won’t go into an overview of it here.

That said, I’ve created a collage of the 212,621 images in the dataset. Click the image below to check it out.

If you’ve already hydrated the dataset, you can skip part of the download process by setting up a directory structure and naming files accordingly.

Set up the directory structure. You’ll need to have elxn44-ids.txt at the root of it, and hydrated.json.gz in the elxn44_downloads directory you’ll create below.

$ mkdir elxn44 elxn44_downloads
$ cp elxn44.jsonl elxn44_downloads/hydrated.json
$ gzip elxn44_downloads/hydrated.json

Your setup should look like:

├── elxn44
├── elxn44_downloads
│   ├── hydrated.json.gz
├── elxn44-ids.txt

From there you can fire up Juxta and wait while it downloads all the images, and creates the collage. Depending on the number of cores you have, you can make use of multiple threads by setting the THREADS variable. For example: THREADS=20.

$ THREADS=20 /path/to/juxta/ elxn44-ids.txt elxn44

Understanding the collage: you can follow along in the image chronologically. The top left corner of the image will be the earliest images in the dataset (July 28, 2021), and the bottom right corner will be the most recent images in the dataset (October 6, 2021). Zoom in and pan around! The images will link back to the tweet that they came from.

#elxn44 Juxta


Making Sure "Number Go Up" / David Rosenthal

Fake it till you make it is the way Silicon Valley works these days, as exemplified by Theranos, Uber, WeWork and many other role models. It is certainly the case with cryptocurrencies. Would you believe that an NFT of this image was worth $532M? How about nearly $1.1B? Most numbers that are quoted about cryptocurrencies are fake, in the sense that they are manipulated in order to fool the press, and thereby buy time until they become "too big to fail".

The credulous press reports make it look like the cryptocurrency market is much bigger and much more successful than it really is, further inflating the bubble. Below the fold, I provide a set of examples of the techniques that are used to fuel the mania.

Wash Trades

In regulated markets, wash trading is illegal, but the whole point of cryptocurrencies is to evade annoying regulations like this that prevent market manipulation. Nick Baker's An NFT Just Sold for $532 Million, But Didn’t Really Sell at All dissects a blatant example:
The process started Thursday at 6:13 p.m. New York time, when someone using an Ethereum address beginning with 0xef76 transferred the CryptoPunk to an address starting with 0x8e39.

About an hour and a half later, 0x8e39 sold the NFT to an address starting with 0x9b5a for 124,457 Ether -- equal to $532 million -- all of it borrowed from three sources, primarily Compound.

To pay for the trade, the buyer shipped the Ether tokens to the CryptoPunk’s smart contract, which transferred them to the seller -- normal stuff, a buyer settling up with a seller. But the seller then sent the 124,457 Ether back to the buyer, who repaid the loans.

And then the last step: the avatar was given back to the original address, 0xef76, and offered up for sale again for 250,000 Ether, or more than $1 billion.
How prevalent is wash trading at cryptocurrency exchanges? The top three results I got from Google are all papers from this year.

First, Wash trading at cryptocurrency exchanges by Guénolé Le Pennec, Ingo Fiedler and Lennart Ante:
Suspicious volume of >90% is detected for most investigated exchanges.
Cryptocurrency exchanges allegedly use wash trading to falsely signal their liquidity. We monitored twelve exchanges for metrics of web traffic and for their administered user funds. The exchanges were clustered in three distinct groups based on previous findings: (1) accurately-reporting exchanges, (2) exchanges that engaged in wash trading, (3) exchanges with mixed evidence of wash trading. A comparison of the reported to the predicted trading volume, calibrated on the accurately-reporting exchanges, suggests that group 2 exchanges exaggerate their true volume by a factor of 25 to 50, and exchanges of group 3 by a factor of 1.25 to 33.
Second, Crypto Wash Trading by Lin William Cong, Xi Li, Ke Tang and Yang Yang:
We introduce systematic tests exploiting robust statistical and behavioral patterns in trading to detect fake transactions on 29 cryptocurrency exchanges. Regulated exchanges feature patterns consistently observed in financial markets and nature; abnormal first-significant-digit distributions, size rounding, and transaction tail distributions on unregulated exchanges reveal rampant manipulations unlikely driven by strategy or exchange heterogeneity. We quantify the wash trading on each unregulated exchange, which averaged over 70% of the reported volume. We further document how these fabricated volumes (trillions of dollars annually) improve exchange ranking, temporarily distort prices, and relate to exchange characteristics (e.g., age and userbase), market conditions, and regulation.
Third, Detecting and Quantifying Wash Trading on Decentralized Cryptocurrency Exchanges by Friedhelm Victor and Andrea Marie Weintraud:
Cryptoassets such as cryptocurrencies and tokens are increasingly traded on decentralized exchanges. The advantage for users is that the funds are not in custody of a centralized external entity. However, these exchanges are prone to manipulative behavior. In this paper, we illustrate how wash trading activity can be identified on two of the first popular limit order book-based decentralized exchanges on the Ethereum blockchain, IDEX and EtherDelta. We identify a lower bound of accounts and trading structures that meet the legal definitions of wash trading, discovering that they are responsible for a wash trading volume in equivalent of 159 million U.S. Dollars. While self-trades and two-account structures are predominant, complex forms also occur. We quantify these activities, finding that on both exchanges, more than 30% of all traded tokens have been subject to wash trading activity. On EtherDelta, 10% of the tokens have almost exclusively been wash traded.
It looks like wash trading is quite a problem, which would make rankings of exchanges and volume numbers highly inflated, not to mention prices.


Three years ago Tao Li et al published a detailed analysis entitled Cryptocurrency Pump-and-Dump Schemes concluding that:
Pump-and-dump schemes (P&Ds) are pervasive in the cryptocurrency market. We find that P&Ds lead to short-term bubbles featuring dramatic increases in prices, volume, and volatility. Prices peak within minutes and quick reversals follow. The evidence we document, including price run-ups before P&Ds start, implies significant wealth transfers between insiders and outsiders. ... Using a difference-in-differences approach, we provide causal evidence that P&Ds are detrimental to the liquidity and price of cryptocurrencies.
They were pervasive then and they still are now. For example, David Gerard documents one on 16th December 2020 that took BTC to a new high over $20K:
We saw about 300 million Tethers being lined up on Binance and Huobi in the week previously. These were then deployed en masse.

You can see the pump starting at 13:38 UTC on 16 December. BTC was $20,420.00 on Coinbase at 13:45 UTC. Notice the very long candles, as bots set to sell at $20,000 sell directly into the pump.
A series of peaks followed, as the pumpers competed with bagholders finally taking their chance to cash out — including $21,323.97 at 21:54 UTC 16 December, $22,000.00 precisely at 2:42 UTC 17 December, and the peak as I write this, $23,750.00 precisely at 17:08 UTC 17 December.

This was exactly three years after the previous high of $19,783.06 on 17 December 2017.
And another pump on 6th October:
Someone bought 1.6 billion dollars’ worth of bitcoins in one lump on Wednesday 6 October in under five minutes, between 13:11 and 13:16 UTC. This pumped the Bitcoin price from about $50,000 to about $55,000.
Of course, they didn’t use dollars to buy the bitcoins — they used tethers to buy the coins on Binance, tethers that had been freshly created and deployed to the exchange a few days earlier.
And another on 15th October:
There was a similar tether-fueled pump, to a new all-time high, just before the CFTC settlement came out on 15 October. This pump continued for a few more days.

This new all-time high was followed by a flash-crash to below $10,000 on some exchanges. Bitfinex’ed blames trading bots being shut off. [Twitter] This also suggests that the un-pumped price of Bitcoin would be far lower than the present price in tethers.

When the price dipped from its unfeasibly-pumped peak, multiple major crypto exchanges coincidentally had simultaneous downtime.
Do these simultaneous outages look suspicious or not?

Transaction Rate

The press claim that Bitcoin is widely used because it processes around 270K transactions/day. But Igor Makarov and Antoinette Schoar write:
90% of transaction volume on the Bitcoin blockchain is not tied to economically meaningful activities but is the byproduct of the Bitcoin protocol design as well as the preference of many participants for anonymity ... exchanges play a central role in the Bitcoin system. They explain 75% of real Bitcoin volume
So it is really only processing around 27K "economically meaningful" transactions/day. And 75% of those are transactions between exchanges, so only 2.5% of the "transactions" are real blockchain-based transfers involving individuals. That's less than 5 per minute.
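As a quick sanity check on that arithmetic (my own back-of-the-envelope, using the percentages quoted above):

```python
reported_tx_per_day = 270_000
meaningful = reported_tx_per_day * 0.10   # 90% is not economically meaningful
individual = meaningful * (1 - 0.75)      # 75% of the remainder is exchange flow
per_minute = individual / (24 * 60)
print(int(meaningful), int(individual), round(per_minute, 1))  # 27000 6750 4.7
```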

Trading Volume

Naively, the press multiplies the inflated transaction rate by the inflated "price" resulting from the latest wash trades or pumps to claim that the value of Bitcoin's daily transactions is around $5B. Doing so values the 90% of transactions that are not "economically meaningful" as if they were meaningful, so this number is inflated by a factor of around 10.

This inflation is obvious if we look at the trading volume on major exchanges, which is currently in the region of $400M/day. Thus it represents around 8% of the claimed trading volume of Bitcoin. Interestingly, this approximately matches Makarov and Schoar's 75% of 10% of $5B as the exchange-based "trading volume" on the blockchain.
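That cross-check works out like this (again my own arithmetic, applying Makarov and Schoar's percentages to the figures in the text):

```python
claimed_volume = 5e9                   # claimed ~$5B/day of on-chain "value"
real_volume = claimed_volume * 0.10    # only ~10% is economically meaningful
exchange_based = real_volume * 0.75    # 75% of real volume is exchange-driven
print(exchange_based / 1e6)            # 375.0 (million USD/day, near the ~$400M observed)
```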

Market Cap

Similarly, the press claim that the "market cap" of Bitcoin is once again approaching a trillion dollars (as much as Tesla!) is arrived at by naively multiplying the inflated "price" by the number of Bitcoins that have been mined.

Back in January when Bitcoin's inflated "price" was only around $35K, Jemima Kelly took this claim to the woodshed in No, bitcoin is not “the ninth-most-valuable asset in the world”:
if you take the all-time-high of $37,751 and multiply that by the bitcoin supply (roughly 18.6m) you get to just over $665bn. And, if that were accurate and representative and if you could calculate bitcoin’s value in this way, that would place it just below Tesla and Alibaba in terms of its “market value”. (On Wednesday!)

The only problem is, as you might have already guessed, that’s not accurate or representative and you cannot calculate bitcoin’s value in that way.
Kelly points out that the whole idea of a "market cap" for a totally speculative asset is just wrong:
working out its “market cap” is a non-starter. As some of you might remember, it was originally designed to be a currency that could be used to buy actual things! And although it fails to meet all the criteria that would make it a currency, it does have one thing in common with it: its price is underpinned by sheer faith. The difference being that with fiat currencies, that faith is effectively placed in the governments of the nation states who issue them, whereas for bitcoin, the faith is placed in . . . the hope that other people will keep having the faith.
But even if it weren't, the number multiplying the "price" is just wrong too:
although 18.6m bitcoins have indeed been mined, far fewer can actually be said to be “in circulation” in any meaningful way.

For a start, it is estimated that about 20 per cent of bitcoins have been lost in various ways, never to be recovered. Then there are the so-called “whales” that hold most of the bitcoin, whose dominance of the market has risen in recent months. The top 2.8 per cent of bitcoin addresses now control 95 per cent of the supply (including many that haven’t moved any bitcoin for the past half-decade), and more than 63 per cent of the bitcoin supply hasn’t been moved for the past year, according to recent estimates.

What all this means is that real liquidity — the actual available supply of bitcoin — is very low indeed.
Which means that:
the idea that you can get out of your bitcoin position at any time and the market will stay intact is frankly a nonsense. And that’s why the bitcoin religion’s “HODL” mantra is so important to be upheld, of course.

Because if people start to sell, bad things might happen! And they sometimes do. The excellent crypto critic Trolly McTrollface ... pointed out on Twitter that on Saturday a sale of just 150 bitcoin resulted in a 10 per cent drop in the price.


Finally, it is important to note that the press reports "prices" in USD but the vast majority of trading in cryptocurrencies, as shown in the graph, is not in USD but in stablecoins, primarily Tether.

Is 1 USDT really the equivalent of 1 USD? Originally, Tether claimed that it has 1 USD in a bank account for every 1 USDT it issued, but that claim was abandoned a long time ago. There has never been an audit to determine what, exactly, is backing Tether.

Zeke Faux's Anyone Seen Tether’s Billions? is the latest in a series of exhaustive attempts to answer that question. He saw:
a document showing a detailed account of Tether Holdings’ reserves. It said they include billions of dollars of short-term loans to large Chinese companies—something money-market funds avoid. And that was before one of the country’s largest property developers, China Evergrande Group, started to collapse. I also learned that Tether had lent billions of dollars more to other crypto companies, with Bitcoin as collateral. One of them is Celsius Network Ltd., a giant quasi-bank for cryptocurrency investors, its founder Alex Mashinsky told me
Clearly, much of the backing is in "commercial paper", i.e. debts owed by other companies. Some of the companies are apparently Chinese real estate developers, desperate for credit. Some is Bitcoin and other cryptocurrencies as collateral for loans to cryptocurrency exchanges. David Gerard is suspicious:
There is no reason to assume the tethers sent to Binance in early October were not just sent as a loan and then the loan accounted as the backing reserve, i.e., Tether sending tethers to a crypto exchange for free — because, as the CFTC settlement notes, Tether has routinely done precisely that, for years. And the Bloomberg story confirms that they still do this in 2021.
As the graph shows, the issuance of USDT and the "price" of BTC are completely correlated. This is the "magic money pump" I outlined in Stablecoins. Newly issued USDT sent to an exchange will almost certainly be quickly used to buy cryptocurrency. This pumps the price of cryptocurrencies, including those forming part of Tether's reserve. This allows Tether to issue more USDT, which can be sent to an exchange, used to buy cryptocurrency, which pumps the price, which ... Rinse and repeat.

The Name / Ed Summers

Là se noue toute l’expérience classique du langage : le caractère réversible de l’analyse grammaticale qui est, d’un seul tenant, science et prescription, étude des mots et règle pour les bâtir, les utiliser, les réformer dans leur fonction représentative ; le nominalisme fondamental de la philosophie depuis Hobbes jusqu’à l’Idéologie, nominalisme qui n’est pas séparable d’une critique du langage et de toute cette méfiance à l’égard des mots généraux et abstraits qu’on trouve chez Malebranche, chez Berkeley, chez Condillac et chez Hume ; la grande utopie d’un langage parfaitement transparent où les choses elles-mêmes seraient nommées sans brouillage, soit par un système totalement arbitraire, mais exactement réfléchi (langue artificielle), soit par un langage si naturel qu’il traduirait la pensée comme le visage quand il exprime une passion (c’est de ce langage fait de signes immédiats que Rousseau a rêvé au premier de ses Dialogues). On peut dire que c’est le Nom qui organise tout le discours classique ; parler ou écrire, ce n’est pas dire les choses ou s’exprimer, ce n’est pas jouer avec le langage, c’est s’acheminer vers l’acte souverain de nomination, aller, à travers le langage, jusque vers le lieu où les choses et les mots se nouent en leur essence commune, et qui permet de leur donner un nom. (Foucault, 1966)


This is the nexus of the entire Classical experience of language: the reversible character of grammatical analysis, which is at one and the same time science and prescription, a study of words and a rule for constructing them, employing them, and remoulding them into their representative function; the fundamental nominalism of philosophy from Hobbes to Ideology, a nominalism that is inseparable from a critique of language and from all that mistrust with regard to general and abstract words that we find in Malebranche, Berkeley, Condillac, and Hume; the great utopia of a perfectly transparent language in which things themselves could be named without any penumbra of confusion, either by a totally arbitrary but precisely thought-out system (artificial language), or by a language so natural that it would translate thought like a face expressing a passion (it was this language of immediate sign that Rousseau dreamed of in the first of his Dialogues). One might say that it is the Name, that organizes all Classical discourse; to speak or to write is not to say things or to express oneself, it is not a matter of playing with language, it is to make one’s way towards the sovereign act of nomination, to move, through language, towards the place where things and words are conjoined in their common essence, and which makes it possible to give them a name. (Foucault, 1994)

After having read lots of Foucault’s later work (mostly his lectures that touch on governmentality), I’d never really taken the time to read the book that catapulted him to fame: The Order of Things, or by its original title, Les Mots et les Choses. How different are these titles? I personally think they made a mistake not using a more literal translation: Words and Things.

Since I’m in no particular rush I’ve been trying to revive the little French I learned in high school, by reading in English, but taking a look at the original French when I run across a section I really like. Even for a novice like me, the French has a different luminous quality–maybe that’s true of the language in general though…

One thing that struck me here when reading the original French is the translation of noue as nexus in the first sentence. The verb nouer is to start or tie a knot, whereas se nouer is the point at which the strands of a plot come together. I guess nexus works alright. But the translation totally misses the mirroring that happens between se noue in the first sentence and se nouent in the last sentence:

… to move, through language, towards the place where things and words are conjoined in their common essence and which makes it possible to give them a name.

… aller, à travers le langage, jusque vers le lieu où les choses et les mots se nouent en leur essence commune et qui permet de leur donner un nom.

Also lost in translation is the idea of their starting together: words and things.

Foucault, M. (1966). Les mots et les choses: une archéologie des sciences humaines. Paris: Gallimard.

Foucault, M. (1994). The Order of things: an archaeology of the human sciences. New York: Vintage Books.

Open Library in Every Language / Open Library

The Open Library catalog is used by patrons from across the globe, but its usage is dominated by English speakers (32% US, 9% India, 5% UK, 4% Canada). This is driven by four factors, which we’re working to change.

  1. International Holdings – It goes without saying that, in order to be an Open Library for the Internet™ our catalog needs to include book records and link to source material from more languages. We’re actively working with the acquisitions team within the Internet Archive to fight for greater diversification of our book holdings, including more languages and regions. If you are an international library or publisher, you may help us by sharing your catalog metadata and we’ll happily include these records on Open Library & provide back-links so patrons know where the metadata comes from.
  2. Search – In order for Open Library to be as useful as possible for diverse communities around the globe, our search engine has to show patrons the right books with appropriately translated titles. Managing a search engine for a service like Open Library is a full-time job; presently, this gargantuan task is spearheaded by Drini Cami. For historical and performance reasons, the Open Library search engine indexes on Works (collections of editions) as opposed to Editions. This limits our ability to tailor search results and show patrons book editions in their preferred language. This year we made progress on supporting Edition-level indexing, and “search for books in language” (one of our most requested features) will be on our roadmap for 2022.
  3. Marketing – Open Library is run by a small team of staff that you can count on one hand and our success depends on the efforts of volunteers who champion literacy and librarianship for their communities. We’re still learning which channels may be best to extend our offerings to patrons in regions which we’re currently under-serving. If you have an idea on how we can reach a new community, we’d love your advice and your help. Please send us your ideas using the “Communication & Outreach” link on our volunteer page.
  4. Translation & Localization – Making a website like Open Library accessible and usable to an international audience takes more than clicking “google translate”. For years Open Library has had a pipeline and process for adding translations.

Goal: 5 Languages

Our current goal is to fully localize the Open Library website into 5 languages. We currently have contributions for translations across 7 languages: Čeština, Deutsch, English, Español, Français, Hrvatski, and తెలుగు.

English, Spanish, French, and Croatian (Hrvatski) are the most up to date and you can try the website in those languages by clicking their respective links. Can you help us get one of these other languages across the finish line?

Why Contribute Now?

In the past, translators did not have an automatic way to receive feedback about whether they had contributed translations correctly. Translators would need to have a conversation with staff in order to get started, submit translations for review, and then a member of staff would report back if there was a mistake. This process had so much friction that it resulted in many incomplete translation submissions.

This year, Jim Champ, Drini Cami, and others in the community added automated validation so translators get near-real-time feedback about whether their translations have been submitted correctly. Now, submitting a translation is much simpler and only requires one to know the target language. Here’s how!

How it Works

All you need in order to contribute translations is a Github account. Translations can be contributed directly on the Github website by following the Translator’s Contributor’s Guide with no special software required to participate.

Want to Help Translate?

Let us know here:

Meet our Translators

Daniel – Spanish

Daniel Capilla lives in Málaga, Spain and has been contributing to Open Library since 2013. Daniel’s interest in contributing to Open Library was sparked by his joy of reading and all things library-and-book-related, as well as the satisfaction he gets from contributing to open source projects and knowing that everyone will be able to freely enjoy his contributions in the future. Dan has made significant contributions, including the first Spanish translation, and believes:

“The issue of the internationalization of the Open Library seems to me to be a fundamental issue for the project to have more acceptance, especially in non-English speaking countries. This is an issue on which there is still much to be done.”

Follow Daniel on Twitter: @dcapillae

Results of the 2021 Fixity Survey and Fixity Case Studies / Digital Library Federation

The 2021 Fixity Survey Working Group is pleased to announce the publication of the results of the 2021 Fixity Survey and corresponding Fixity Case Studies. The 40-question survey was completed by 116 respondents over a month-long period.

The Report documents the results of each question in the areas of 1) basic information about fixity practices, 2) how fixity is being used, 3) fixity in relation to cloud services, 4) fixity errors, and 5) general demographic information.  

Several key points can be made from studying the survey results, some of which are listed below with more details provided in the report. 

  • The results demonstrate just how important fixity information is to the digital preservation community, with over 96% of survey respondents confirming that they utilize fixity information within their organization and over 98% of these using checksums (sometimes alongside other types of fixity information). The primary reason fixity information is used by the community is to determine whether data has been altered over time.
  • Despite a clear consensus that the use of fixity information represents good practice, the results demonstrate huge variation in fixity practices across the community. There are a variety of practices reported across the survey questions, including at what point fixity information is verified, the frequency of checks, where fixity information is recorded, and the checksum algorithms in use. 
  • Receiving fixity information at the time of acquisition remains a challenge.
  • Though fixity checking lends itself well to automation, for many it remains a fairly manual process, with a majority of respondents using manually-run software to carry out this activity. 
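At its core, fixity checking amounts to recomputing a checksum for each file and comparing it against a previously recorded value. The Python sketch below (a generic illustration, not any specific tool reported in the survey) shows how such a check can be automated against a manifest of stored SHA-256 digests:

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Compute a SHA-256 checksum, reading the file in chunks so
    large preservation files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_fixity(manifest):
    """Compare each file against its stored checksum. The manifest maps
    file paths to previously recorded SHA-256 hex digests; any path
    whose current digest differs is returned as a failure."""
    failures = []
    for path, expected in manifest.items():
        if sha256_of(path) != expected:
            failures.append(path)
    return failures
```

Scheduling a script like this (e.g. via cron) turns the manual checking many respondents described into a routine automated audit.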

In addition to analyzing the results of the survey, the Fixity Survey Working Group conducted follow-up interviews with five organizations to explore fixity practices in more detail. Case studies from four of these organizations are currently included in the report and provide a rich illustration of how fixity is used within specific organizations, and build on some of the findings of the survey itself.

You can read and/or download the Report, Data files, and Codebook from the NDSA OSF site.  

Thank you to all of you who participated in the survey. We appreciate your time and effort spent in providing the information to us.  

Thank you to the members of the Fixity Survey Working group who worked on a tight schedule to complete this work in time for DigiPres.  

~ The 2021 Fixity Survey Working Group co-chairs

The post Results of the 2021 Fixity Survey and Fixity Case Studies appeared first on DLF.

Islandora Foundation Intern / Islandora

Islandora Foundation Intern kstapelfeldt Tue, 11/02/2021 - 15:30

Are you a professional who would like to participate in a vibrant international community?

Join the Islandora Foundation for eight months as our inaugural Foundation Intern. This position offers someone with excellent organizational skills and a willingness to learn the opportunity to interact with and support a global network of information professionals, developers, librarians, and institutions maintaining the digital repository software Islandora.

Islandora is a modern, state-of-the-art open source framework that supports building digital asset management systems, and is collaboratively developed by an international community. Bringing together the best of modern web technologies for content management and stewardship, Islandora empowers many types of institutions to author, preserve, and disseminate collections using global best practices and open standards.

As the Islandora communications intern, you will help coordinate existing communications in Islandora and participate in the development of a revised communication strategy, improved documentation, and the planning of an Islandora conference to be held in summer 2022. This position is anticipated to be entirely remote, although there may be onsite opportunities for work.

The successful incumbent would need to be willing to travel (pandemic permitting) for the conference in 2022, expected to be in a North American location, with expenses covered by the Foundation.

This is an excellent opportunity for a GLAM (galleries, libraries, archives, and museums) professional looking for strong networking and mentorship opportunities, as well as project management experience in the digital library and curation space, but all qualified applicants will be considered.

Compensation: An offer will be made to the successful applicant in line with an annual salary of $58,500 - $69,356.

Term: January - August 2022

Position Description

Reporting to the Secretary of the Islandora Board of Directors, the successful applicant will:
  • Author/circulate notes and other communications for key Islandora community groups.
  • Collaborate with active community membership groups to facilitate a renewal of Islandora communication and event planning.
  • Organize the development of a planning committee and the execution of an Islandora conference in the summer of 2022.
  • Organize the website and social media updates for Islandora.
  • Manage some supporting budgets/purchasing in collaboration with the bookkeeper and Islandora Foundation treasurer.

Required Qualifications

  • Demonstrated, strong written and spoken communication skills, including the ability to set and maintain agendas and notes.
  • Technical acumen and ability to learn new technologies.
  • Ability to manage projects.
  • Demonstrated, robust understanding of underlying principles in diversity, equity and inclusion and strategies that enhance equity.
  • Demonstrated ability to work and communicate effectively across diverse teams. Demonstrated ability to work independently with a high degree of effectiveness.
  • Demonstrated interest in GLAM (galleries, libraries, archives, and museums) environments and/or open source communities.
  • Knowledge/awareness of key communications technologies (Google Groups, Google Calendar, and Slack).

Desirable Qualifications

  • Post-secondary education in a relevant field (communications, information science/studies)
  • Previous experience working in communications, especially for non-profit organizations.
  • Demonstrated ability to leverage technology to solve problems.

Application Instructions

Candidates must submit the following by November 21st, 2021:

  • a statement of interest citing the position name
  • a full curriculum vitae
  • a list of three (3) references, including an email address for each

In addition, a statement on how the candidate will implement the principles of Equity, Diversity and Inclusion (EDI) in their career is to be included in the application package.

An application package including and addressing all of the elements listed above must be submitted electronically as a single PDF file to:

The Foundation wishes to thank all applicants for their interest. Applications shall be assessed by the Islandora Board, and a shortlist of candidates shall be interviewed. Only those applicants selected for the shortlist will be contacted. In accordance with Canadian immigration requirements, all qualified candidates are encouraged to apply; however, Canadian citizens and permanent residents will be given priority. The Islandora Foundation is committed to the principle of equity in employment. We encourage applications from racialized persons / persons of colour, women, Indigenous / Aboriginal People of North America, persons with disabilities, LGBTQ2S+ persons, and others who may contribute to the further diversification of ideas.

Questions about the position can also be directed to the Islandora Board of Directors via 

Fully funded PhD program in Information Sciences, University of Illinois at Urbana-Champaign, deadline December 1, 2021 / Jodi Schneider

Dr. Jodi Schneider’s Information Quality Lab invites applications for fully funded PhD students in Information Sciences at the School of Information Sciences (iSchool), University of Illinois at Urbana-Champaign.

Current areas of interest include:

  • scientific information and how it is used by researchers and the public
  • scholarly communication
  • controversies within science
  • potential sources of bias in scientific research
  • confidence in applying science to public policy

Candidates should have a Bachelor’s or Master’s degree in any field (e.g., mathematics, sciences, information sciences, philosophy, liberal arts). The most essential skills are strong critical thinking and excellent written and spoken English. Interest or experience in research, academic writing, and interdisciplinary inquiry is strongly preferred.

Students in the Information Quality Lab develop both domain expertise and technical skills. Examples of relevant domains include public policy, public health, libraries, journalism, publishing, citizen science, information services, and life sciences research. Examples of technical skills include knowledge representation, text and data analytics, news analytics, argumentation analysis, document analysis, qualitative analysis, user-centered design, and mixed methods.

Examples of current Information Quality Lab projects:
REDUCING THE INADVERTENT SPREAD OF RETRACTED SCIENCE: SHAPING A RESEARCH AND IMPLEMENTATION AGENDA (Alfred P. Sloan Foundation) – stakeholder-engaged research to understand the continued citation of retracted research, currently focusing on standards development and raising awareness of what various stakeholders across scholarly communication can do.

STRENGTHENING PUBLIC LIBRARIES’ INFORMATION LITERACY SERVICES THROUGH AN UNDERSTANDING OF KNOWLEDGE BROKERS’ ASSESSMENT OF TECHNICAL AND SCIENTIFIC INFORMATION (Institute of Museum and Library Services Early Career Development) – Scientific misinformation and pseudoscience have a significant impact on public deliberation. This project will conduct case studies on COVID-19, climate change, and artificial intelligence to understand how journalists, Wikipedia editors, activists, and public librarians broker knowledge to the public. We will develop actionable strategies for reducing public misinformation about scientific and technical information.

USING NETWORK ANALYSIS TO SUPPORT AND ASSESS CONFIDENCE IN RESEARCH SYNTHESIS (National Science Foundation CAREER) – developing and testing a novel framework to evaluate sets of expert literature for potential sources of bias and to allow evidence-seekers to swiftly determine the level of consensus within a body of literature and identify the risk factors which could impact the reliability of the research.

Dr. Jodi Schneider studies the science of science through the lens of arguments, evidence, and persuasion. She seeks to advance our understanding of scientific communication in order to develop tools and strategies to manage information overload in science, using mixed methods including semantic web technology (metadata/ontologies/etc.), network analysis, text mining and user-centered design. Her long-term research agenda analyzes controversies applying science to public policy; how knowledge brokers influence citizens; and whether controversies are sustained by citizens’ disparate interpretations of scientific evidence and its quality. Prior to joining the iSchool, Schneider served as a postdoctoral scholar at the National Library of Medicine, the University of Pittsburgh Department of Biomedical Informatics, and INRIA, the national French Computer Science Research Institute. She is an NSF CAREER awardee and holds an Institute of Museum and Library Services Early Career Development grant. Her past projects have been funded by the Alfred P. Sloan Foundation, the National Institutes of Health, Science Foundation Ireland, and the European Commission.

iSchool PhD students have backgrounds in a broad range of fields, including the social sciences, sciences, arts, humanities, computing, and artificial intelligence. Accepted students are guaranteed five years of funding in the form of research and teaching assistantships, which include tuition waivers and a stipend. Additional funding is available for conference travel.

Our PhD program in Information Science is the oldest existing LIS doctoral program in the U.S. with 270 graduates. Recent graduates are now faculty members at institutions such as the University of Michigan, University of Washington, University of Maryland, Drexel, and UCLA, professionals at Baidu, Google, Twitter, Uber and AbbVie, and academic library professionals at the Library of Congress, Princeton University, and the University of Chicago.

For more information about the application process, please visit:
Next application deadline: December 1, 2021
(This is an annual opportunity.)


For additional information about the iSchool PhD program, see

For questions about the program, please contact Prof. Michael Twidale, PhD Program Director, at

For questions about the Information Quality Lab, please contact Dr. Jodi Schneider.

Sponsorship and mentorship / Tara Robertson

screenshot of HBR with our article in the top spot

Mentors share advice, sponsors share opportunities. Both are important and useful, but sponsorship accelerates careers. Underrepresented people are over-mentored and under-sponsored.

In the last couple of weeks I’ve had several firsts:

It was exciting and humbling to have smart women I admire open doors I didn’t even know existed and support me to be successful. Alex saw this opportunity to write a piece on hybrid work and DEI measurement as a way for me to meet her editor Ania Wieckowski. Alex has been a guest on CBC radio many times and pitched the two of us to Michelle. Alex and I had a quick prep call to make sure we both knew the key points that the other wanted to make. Michelle mentioned the session went well to Janella, who phoned and asked if I could do an on camera interview. Even though I have the skills to do all of these things I hadn’t had the chance to do any of them before. I’m grateful and humbled to have had these opportunities.

How can you sponsor others? How have others sponsored you?

The post Sponsorship and mentorship appeared first on Tara Robertson Consulting.

DLF Digest: November 2021 – DLF Forum Edition / Digital Library Federation


DLF Digest: Join us online! DLF Forum, NDSA Digital Preservation, Learn@DLF

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation.

This month’s news:

  • Welcome to a special DLF Forum edition of the DLF Digest! The virtual 2021 DLF Forum takes place today, Monday, November 1, through Wednesday, November 3, followed on Thursday, November 4, by NDSA’s Digital Preservation 2021: Embracing Digitality. Our Learn@DLF workshops take place next Monday through Wednesday, November 8 through 10.
  • If you weren’t able to register for our events, there are still plenty of ways to keep up with the #DLFvillage over the next two weeks:
  • DLF’s Data and Digital Scholarship working group (DLFdds) is seeking nominations and self-nominations for two new co-conveners. Check out the call for participation here, and complete the nomination form by December 1.
  • CLIR offices are closed Monday-Friday, November 22-26, in observance of the Thanksgiving holiday.

This month’s open DLF group meetings:

NOTE: Some of November’s standing working group meetings may be rescheduled due to conflicts with the DLF Forum and affiliated events. Check the DLF Community Calendar for updates.

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Can’t find meeting call-in information? Email us.

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member institution. Learn more about our working groups and how to get involved on the DLF website. Interested in starting a new working group or reviving an older one? Need to schedule an upcoming working group call? Check out the DLF Organizer’s Toolkit to learn more about how Team DLF supports our working groups, and send us a message to let us know how we can help.

The post DLF Digest: November 2021 – DLF Forum Edition appeared first on DLF.