Planet Code4Lib

Weeknote 4, 2023 / Mita Williams

  • §1 Justfacts: Overrepresentation of Black People in the Canadian Criminal Justice System
  • §2 They Rule
  • §3 Public Interest Databases from Canada's Investigative Journalism Foundation

Celebrating Museum Library Week with the OCLC RLP / HangingTogether

In case you missed it, it is #MuseumLibraryWeek, hosted on social media by the Thomas J. Watson Library (at the Metropolitan Museum of Art) and the Smithsonian Libraries and Archives. Both are members of the OCLC Research Library Partnership, along with numerous other museum libraries. Some of those partners include:

  • Art Institute of Chicago
  • Sterling and Francine Clark Art Institute
  • Cleveland Museum of Art
  • Frick Collection and Frick Art Reference Library
  • Getty Research Institute
  • Metropolitan Museum of Art
  • Museum of Fine Arts, Boston
  • Museum of Fine Arts, Houston
  • National Gallery of Art
  • Nelson-Atkins Museum of Art
  • New-York Historical Society
  • Saint Louis Art Museum
  • Winterthur Museum, Garden & Library
Books on a shelf with the museums and libraries logo

Additionally, there are museum libraries that are embedded within the universities, colleges, and even national libraries that make up the transnational OCLC RLP. We often say that it is the diversity of institutions and institution types in the OCLC RLP that makes our group distinctive.

Just one example of how those institutions have influenced our thinking is the Operationalizing the Art Research Collective Collection project. This project examines the opportunities, challenges, and potential strategies for cooperation between art, academic, and independent research libraries to help identify new collaborative models to support the continued availability of the art research collective collection. You can look for outputs from that project later this year.

While we celebrate museum libraries every day, we are delighted that there is a special week where everyone can appreciate these organizations, their distinctive collections, and talented staff.

The post Celebrating Museum Library Week with the OCLC RLP appeared first on Hanging Together.

Regulatory Capture In Action / David Rosenthal

On January 20th, SEC Commissioner Hester Peirce gave a long speech at Duke University entitled Outdated: Remarks before the Digital Assets at Duke Conference, essentially arguing against doing her job of regulating cryptocurrencies.

Below the fold I point out how she is shilling for the cryptosphere, with a long list of excuses for inaction.

Right from the start it is clear that Peirce has swallowed the industry line that "crypto has immense potential" but that "it is still the early days":
Underlying these lessons is the truth that technology takes time to develop and often must combine with innovative developments in other fields to realize its full potential. In the interim, it can appear, particularly to outsiders looking in, awkward, useless, or downright harmful.
Peirce cites but ignores the letter that I and 1500 other insiders wrote pointing out that the technology was in fact "awkward, useless, or downright harmful". Molly White's It's not still the early days demolishes the idea that we only have to wait a while for the amazing future that Peirce envisages:
a multitude of potential uses, including smart contracts, payments, provenance, identity, recordkeeping, data storage, prediction markets, tokenization of assets, and borderless human collaboration.
Note that every item on this list of crypto-bro buzzwords is something we can already do, so the maximum possible benefit from cryptocurrencies is some marginal improvement in productivity. Peirce doesn't acknowledge that the innovation in the cryptosphere amounts to nothing more than replicating existing financial services, just without that pesky regulation that prevents scammers from maximising their profits.

Peirce then starts the list of excuses for inaction by appealing to crypto-bros better instincts:
The first and most important lesson of the evening for people who believe in crypto’s future is that they should not wait for regulators to fix the problems that bubbled to the surface in 2022. They can act themselves to root out harmful practices and encourage good behavior. Regulatory solutions, which tend to be inflexible, should be a last resort, not a first resort. People working together voluntarily are much better at fixing things than regulators using their inherently coercive power to impose mandatory solutions. Privately designed and voluntarily implemented solutions can be both more effective and more tailored because the people driving them better understand the technology and what they are trying to achieve with it. Iterating and experimenting with private solutions is easier than it is with regulatory ones. Moreover, private solutions avoid the systemic risk that comes from an industry homogenizing because everyone has to fit into the same regulatory parameters.
Surely we can depend upon the well-intentioned actors in the cryptosphere to voluntarily give up the profits they have been accumulating from enabling fraud, theft, money laundering and ransomware. It is in their best interests, after all.

Peirce continues to lecture the crypto-bros:
remember the point of crypto. It is not driving up crypto prices so that you can dump your tokens on someone else. Digital assets need to trade, so centralized venues or decentralized exchange protocols are necessary, but trading markets are not the ultimate point. Nor is the point of crypto to lend your crypto assets so that other people can trade them, although lending markets, in which everyone is aware of the risks, are not inherently problematic. Rather, at its core, crypto is about solving a trust problem: how can you interact and transact safely with people you do not know. Traditionally, people have looked to centralized intermediaries or government to solve this problem, but technology like cryptography, blockchain, and zero-knowledge proofs offer new solutions.
The "new solutions" replace trust in identifiable, regulated institutions against which you have legal recourse, with trust in pseudonymous code developers, unregulated offshore entities, and vulnerability-ridden software, against none of which you have effective legal recourse. Clearly a huge improvement.

Peirce's detachment from reality continues:
although crypto enables reduced reliance on centralized intermediaries, as long as companies are actively involved in crypto, people should take the same precautions as they would when dealing with any other company. Unthinking trust in centralized intermediaries is antithetical to crypto. As you assess a company’s products and services, consider the associated risks. For example, if a company plans to take your assets and lend them to someone else, who is that person, and what happens if that person cannot return the assets or if the company goes bankrupt? Regulation is not a silver bullet, but understanding whether, by whom, and how the company is regulated can help you calibrate your own due diligence. For their part, crypto companies should take the steps necessary to earn and keep their customers’ and counterparties’ trust.
In practice, of course, Peirce is right that almost everyone's interface to cryptocurrencies is via centralized companies. But the fact is that almost all of these companies make great efforts to prevent people from taking "the same precautions as they would when dealing with any other company". Take Binance, for example: the largest exchange, which claims not to be located or regulated anywhere and has just been found to have been commingling customer funds and collateral. Or Coinbase, the supposedly trustworthy US exchange, whose customers only found out after the fact that they were unsecured creditors without protection. Or Gemini, which didn't exactly tell their customers that their funds were all sent to Genesis, which lent them to Three Arrows Capital. Might regulation play some role in ensuring that people had accurate information upon which to base "the same precautions"?

Peirce is clear that:
What we should not learn from the events of 2022 is that the failures of centralized entities are failures of decentralized protocols.
No indeed! Despite a lack of actual decentralization, decentralized protocols have their own ways of failing. Peirce cites but doesn't pay attention to Web3 is Going Just Great where Molly White has recorded at least ten DeFi hacks and rug-pulls in the last three months. It is true that none approach the multi-billion dollar losses of FTX and Genesis, but that is because the DeFi space is much smaller.

Peirce makes an important point:
we have to take a nuanced approach that recognizes differences across blockchains, distinctions between Layer 1 blockchains and the chains and applications built on top of them, and differences among crypto assets. The crypto industry encompasses a wide variety of experiments being conducted by many different people, so we must avoid painting them all with the same regulatory brush. A centralized trading venue is a world away from a public, decentralized blockchain, but they all get talked about in one breath in Washington regulatory circles.
Note that Peirce's speech up to this point has failed to make the essential distinction between permissioned, centralized systems and permissionless, purportedly decentralized systems. But then Peirce writes:
preserving the core of crypto—decentralization—is lso [sic] important. Decentralization can help support the resilience of the financial system. Decentralized finance (“DeFi”), enables people to interact with one another through the intermediation of code rather than relying on a financial intermediary such as a bank. DeFi deserves special consideration because of its unique properties, some of which take the place of functions that regulation otherwise might perform. DeFi is self-executing, open-source code operating on top of public, permissionless, decentralized blockchains. Anyone can participate, but nobody has to, and everyone participates on the same terms, which everyone knows beforehand.
This is 100% pure gaslighting.
Peirce is unduly optimistic about technology here:
Attacks on DeFi protocols are common, but early auditing, testing, and investigating the incentives that are built into the DeFi code can identify problems.
Even if they could "identify problems" adequately, which experience shows they can't, absent regulation there is little incentive either to spend money "auditing, testing, and investigating", or to spend money fixing any problems that were found. In many cases of DeFi "hacks" it is strongly suspected that the beneficiaries were insiders with privileged knowledge of the vulnerability.

This is just a sampling of Peirce's excuses for regulatory inaction. It is really depressing to see someone who is paid to ensure the integrity of financial markets throw up their hands and explain how their job is just too hard and they need to wait for the industry to self-regulate or for Congress to provide new tools.

Issue 97: Again with the AI Chatbots / Peter Murray

The hot technology in the news now is chatbots driven by artificial intelligence. (The specific field of artificial intelligence behind them is “large language models”, or LLMs.) There were two LLM threads in DLTJ Thursday Threads issue 95 and a whole issue six weeks ago (issue 93). I want to promise that Thursday Threads will not turn into an every-other-issue-on-LLMs newsletter, but so far that is what is catching my eye here at the start of 2023.


Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ’s Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

AI generating news articles, and we’re not impressed

CNET will pause publication of stories generated using artificial intelligence “for now,” the site’s leadership told employees on a staff call Friday.

The call, which lasted under an hour, was held a week after CNET came under fire for its use of AI tools on stories and one day after The Verge reported that AI tools had been in use for months, with little transparency to readers or staff. CNET hadn’t formally announced the use of AI until readers noticed a small disclosure.

“We didn’t do it in secret,” CNET editor-in-chief Connie Guglielmo told the group. “We did it quietly.”

CNET pauses publishing AI-written stories after disclosure controversy, The Verge, 20-Jan-2023

This is the end, for now, of a saga that started when Futurism found that CNET was using a “CNET Money Staff” byline for articles being generated with an in-house large-language-model (LLM) AI system. (That was covered in Thursday Threads issue 95.) CNET was using the tech to create monotonous articles like “Should You Break an Early CD for a Better Rate?” or “What is Zelle and How Does It Work?” That might have been the end of it if human editors had indeed proofread the articles before publication (as CNET had claimed). Either the editors were bad at their job, or the articles were not being proofread as claimed. Oh, and CNET was using the LLM to rewrite the first few paragraphs of articles every couple of weeks so that they would stay “fresh” in web search engine crawls. CNET admitted to using the LLM before ultimately pausing the use of the technology (for now), as The Verge article describes.

Large language models in scholarly publishing

Nature, along with all Springer Nature journals, has formulated the following two principles, which have been added to our existing guide to authors (see go.nature.com/3j1jxsw). As Nature’s news team has reported, other scientific publishers are likely to adopt a similar stance.

First, no LLM tool will be accepted as a credited author on a research paper. That is because any attribution of authorship carries with it accountability for the work, and AI tools cannot take such responsibility.

Second, researchers using LLM tools should document this use in the methods or acknowledgements sections. If a paper does not include these sections, the introduction or another appropriate section can be used to document the use of the LLM.

Tools such as ChatGPT threaten transparent science; here are our ground rules for their use (editorial), Nature, 24-Jan-2023

The Springer Nature publisher has set some early ground rules, even as it admits that ultimately it may not be able to judge whether a large-language-model (LLM) had been used in the crafting of an article. Last week Nature noted that at least four articles had already been submitted citing ChatGPT as a co-author.

Higher education classrooms prepare for ChatGPT

Mr. Aumann confronted his student over whether he had written the essay himself. The student confessed to using ChatGPT, a chatbot that delivers information, explains concepts and generates ideas in simple sentences — and, in this case, had written the paper.

Alarmed by his discovery, Mr. Aumann decided to transform essay writing for his courses this semester. He plans to require students to write first drafts in the classroom, using browsers that monitor and restrict computer activity. In later drafts, students have to explain each revision. Mr. Aumann, who may forgo essays in subsequent semesters, also plans to weave ChatGPT into lessons by asking students to evaluate the chatbot’s responses.

Alarmed by A.I. Chatbots, Universities Start Revamping How They Teach, New York Times, 16-Jan-2023

In issue 93, I mentioned a high school teacher that was lamenting the disruption that large-language-models (LLMs) were having on the classic English essay (and mentioned Ben Thompson’s suggestion of “Zero Trust Homework” to combat LLMs). This New York Times article describes more examples of how instructors are coping with LLMs.

Adapting to the disruptions of generative artificial intelligence

Generative AI models for businesses threaten to upend the world of content creation, with substantial impacts on marketing, software, design, entertainment, and interpersonal communications. These models are able to produce text and images: blog posts, program code, poetry, and artwork. The software uses complex machine learning models to predict the next word based on previous word sequences, or the next image based on words describing previous images. Companies need to understand how these tools work, and how they can add value.
How Generative AI Is Changing Creative Work, Harvard Business Review, 14-Nov-2022

We will adapt to new technology (if history is any guide). This Harvard Business Review article is about more than the large-language-models discussed in this issue. (It also covers the generative adversarial network technology that creates “AI art”.) But the lessons and cautions are generally applicable. In fact, we may see new professions emerging, like a “prompt engineer” who will know the phrasing and techniques to best elicit the output the client is seeking. (The article describes the efforts of an award-winning AI artist: “he spent more than 80 hours making more than 900 versions of the art, and fine-tuned his prompts over and over.”) Or, as was suggested with the “Zero Trust Homework” idea (and seemingly resoundingly ignored by the CNET editors), using LLMs to “generate original content” faster so “writers now have time to do better research, ideation, and strategy.” We will also see these technologies used for “deepfakes” (in still images, video, and audio) and activities bordering on plagiarism (such as the earlier scholarly communication thread).

Come to think of it, this sort of thread is likely to be quite common in upcoming DLTJ Thursday Threads issues.

Alan in snowy weather

Photograph of a white cat with black splotches sitting up tall on a paver patio. The ground in front of the cat is covered with snow.

We got snow in central Ohio, and Alan just had to check it out. You will note that he is not in the snow, just next to the snow. He has gone full “house-cat” after all.

Issue 95: Updating ChatGPT, Cryptomining, and Website-for-Small-Libraries Threads / Peter Murray

This week we revisit threads from a month ago, a year ago, and 12 years ago.

Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ’s Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

ChatGPT references a non-existent book

Been seeing a lot about #ChatGPT lately and got my first question at the library this week from someone who was looking for a book that the bot had recommended. They couldn't find it in our catalog. Turns out that ALL the books that ChatGPT had recommended for their topic were non-existent. Just real authors and fake titles cobbled together. And apparently this is known behavior. 😮
Wanda Whitney, @bibliotecaria@blacktwitter.io, 7-Jan-2023

One of the Thursday Threads issues last month focused on Chat-bots Powered by Artificial Intelligence. This weekend I saw this fediverse post from Wanda Whitney; a patron was looking for a book recommended by ChatGPT. This is a problem known as “stochastic parroting”. Let’s ask ChatGPT to define that term…


What is “stochastic parroting”? Define this term and give an example of how this is a problem for large language models like ChatGPT.

“Stochastic parroting” is a phenomenon that occurs when a language model generates text that is similar to the text it has seen during training, but is not a coherent or logical response to the input. This can happen when the model is simply repeating patterns it has learned from the training data, rather than using its understanding of the meaning of the input to generate a response.

This is a problem for large language models like GPT-3 because the amount of data these models are trained on is so huge that they are able to memorize large amount of text and sometimes can repeat it back which might not be relevant to the task or context, this can be mitigated by fine-tuning the model with smaller dataset that is domain specific, or other techniques like controlling sampling temperature and restricting the model’s training to a certain task.

From what I’ve read, that is a pretty good answer. The algorithm behind the large language model uses statistical probabilities to find the most likely next word based on the previous words, even if the result is nonsense in the real world. I still think that the invention and wide-scale use of large-scale natural language models like ChatGPT increases the importance of information literacy training. Let’s get to it, librarians!
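To make the “most likely next word” idea concrete, here is a minimal sketch of how a language model samples each next word from a probability distribution over what could follow the words so far. The toy probability table and vocabulary below are invented for illustration; a real model like GPT-3 computes an equivalent distribution with a large neural network over tens of thousands of tokens.

```python
import random

# Toy "language model": a hand-written table of next-word probabilities.
# A real LLM computes an equivalent distribution with a neural network
# conditioned on the entire preceding context.
NEXT_WORD_PROBS = {
    ("the", "library"): {"catalog": 0.5, "book": 0.3, "cat": 0.2},
    ("library", "catalog"): {"lists": 0.6, "contains": 0.4},
    ("catalog", "lists"): {"books": 0.7, "titles": 0.3},
}

def next_word(context, temperature=1.0):
    """Sample the next word given the last two words of context."""
    probs = NEXT_WORD_PROBS.get(tuple(context[-2:]), {"<end>": 1.0})
    words = list(probs)
    # Temperature reshapes the distribution: low = pick the likeliest word,
    # high = more randomness (and more chance of fluent nonsense).
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(words, weights=weights, k=1)[0]

def generate(prompt, max_words=10):
    words = prompt.split()
    for _ in range(max_words):
        word = next_word(words)
        if word == "<end>":
            break
        words.append(word)
    return " ".join(words)

print(generate("the library"))
# e.g. "the library catalog lists books" -- fluent and statistically likely,
# but true or false only by accident: the model never checks the catalog.
```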

AI generating news articles

Text of tweet from @GaelBreton: Looks like @CNET (DR 92 tech site) just did their coming out about using AI content for SEO articles. It includes a screenshot of a Google search results page for “This article was generated using automation technology and thoroughly edited and fact-checked by an editor on our editorial staff.”
Gael Breton, Twitter, 11-Jan-2023

Also in that Thursday Threads issue was mention of how AI can change the nature of essay-writing and instructors can adapt to the change by making the assignment about editing rather than writing text. Publishers are trying out just that thing. With headlines like “What Are NSF Fees and Why Do Banks Charge Them?” and “What Happens When you Bounce a Check?”, these seem to be fairly basic articles based on common knowledge and easy editing. I wonder if anyone—CNET, the publisher, or an academic researcher—has reviewed those articles to see how accurate they are.

This isn’t the first time automation has been used to write news articles. Back in 2015, The Verge reported on how the Associated Press was using a fill-in-the-blank system to write financial stories from regulatory filings.
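The AP system itself is proprietary, but the fill-in-the-blank approach it represents is easy to sketch. The template, field names, and figures below are invented for illustration, not a description of the AP's actual pipeline.

```python
# Minimal sketch of template-driven "robot journalism": structured data in,
# formulaic prose out. Field names and figures are invented examples.
TEMPLATE = (
    "{company} reported {direction} of {percent:.1f}% in quarterly revenue, "
    "to ${revenue:,.0f} million, compared with the same period last year."
)

def earnings_sentence(filing: dict) -> str:
    change = filing["revenue"] - filing["revenue_prior"]
    percent = abs(change) / filing["revenue_prior"] * 100
    direction = "an increase" if change >= 0 else "a decline"
    return TEMPLATE.format(
        company=filing["company"],
        direction=direction,
        percent=percent,
        revenue=filing["revenue"],
    )

print(earnings_sentence(
    {"company": "Example Corp", "revenue": 1250, "revenue_prior": 1100}
))
# -> "Example Corp reported an increase of 13.6% in quarterly revenue, ..."
```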

Note! An update to this thread is in issue 97.

Cryptocurrency miners in Texas forced to shut down due to electricity shortage

As Texas and much of the U.S. confronts the challenges of this winter storm, the bitcoin mining industry in Texas is playing a part in supporting the Texas grid during this challenging time by proactively curtailing power. Bitcoin miners in Texas and the Texas Blockchain Council are working with grid operators in Texas as they are closely monitoring the situation, and stand ready to contribute to the state's efforts to ensure reliability and keep families warm and safe during the extreme weather.
Texas Bitcoin Miners Curtail Power in Advance of Arctic Blast, Texas Blockchain Council press release, 23-Dec-2022

A Thursday Thread from a year ago was cryptocurrency’s energy consumption and how cryptocurrency miners were heading for Texas after being forced out of China and Kosovo. Now comes last month’s press release from the Texas Blockchain Council on how they shed load as the grid struggled to keep up with heating demands. Still no word on why using energy for fraud-supporting busy-work calculations is a better use of excess capacity than, say, some form of energy storage that would help smooth out the bumps in the grid.

Lament for a future that could have been: A Web presence for every library

Late last year -- just about a week before ALA Midwinter -- came an announcement of a project by OCLC's Innovation Lab to offer an inexpensive website to every small library. At a price of about $5 per month, a library could have a basic desktop and mobile website. At about $40 per month, the library could have a simple inventory and circulation module. You can see what is possible at the Loremville, TN public library sample site and read more information about the project in my write-up of the public demonstration.
A Web presence for every library, Disruptive Library Technology Jester, 13-Jan-2011
Screenshot of Sample Library Website

Twelve years ago, OCLC announced a project to build a website for every library, and I wrote an excited piece about it for this blog. It seemed like the perfect use of the OCLC cooperative’s resource magnification efforts; if anyone could pull off something like this, it was OCLC. Unfortunately, the project didn’t last, and I wish I knew more about why it didn’t take off. I couldn’t find the Loremville Public Library sample site on Wayback, so the only thing that may be left is this low resolution screenshot I made for my 2010 article.

Extra Cat!

Two photographs of a mostly black cat with a white bib. In the left photograph, the cat is lounging comfortably on the floor. The right photograph is taken looking up at the face of the cat.

We had a visiting cat for Christmas break! This sweetheart is Pickle, which is short for Dill Pickle Murray, and is my daughter’s college apartment cat. It took most of the Christmas break for Pickle to get comfortable in the house and for Mittens and Alan to get comfortable with Pickle. Okay, honestly, neither of them liked Pickle as a new interloper in the house. To be honest, Mittens and Alan are much better behaved with each other since Pickle left. Maybe the two of them realized how good a two-cats-only household is?

Issue 93: Chat-bots Powered by Artificial Intelligence / Peter Murray

This week we jump into the world of chat-bots driven by new artificial intelligence language models. The pace of announcements about general-purpose tools driven by large training sets of texts or images has quickened, and the barrier to experimenting with these tools has dropped. There are now fully-functional websites where there once were only programmer-focused APIs. We wonder what the effects will be on our students, our business workflows, and on society. We also wonder about the underlying biases in the training data.

As an aside, in the first article below I mention that the use of these tools, while free for now, will be monetized at some point. This is another unfortunate example of taking from the common good and commercializing it. The training data used by the company came from crawling web pages, from Wikipedia, and from books (source). Yet soon, it seems, all of the benefit from that information will be held by a corporate body. The same thing has been said about the image-based AI tools that have slurped up sets of photos from sites like Flickr, Wikipedia, and even stock photo businesses. We don’t talk enough about this private capture of the common good and the uncompensated taking of others’ work.

Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ’s Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

OpenAI Introduces ChatGPT

We’ve trained a model called ChatGPT which interacts in a conversational way. The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests. ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response.
ChatGPT: Optimizing Language Models for Dialogue, OpenAI blog, 30-Nov-2022

This link is the announcement from the company that created ChatGPT, OpenAI. The innovation with this model is the introduction of Reinforcement Learning from Human Feedback (RLHF). With RLHF, “human AI trainers provided conversations in which they played both sides—the user and an AI assistant” — and the ChatGPT language model incorporated the refinements learned from those human interactions. The blog post gives examples of how this human training affected the output. In the language model without RLHF training, when asked how to bully someone the AI would return a list of ideas. With the RLHF training, the response starts with “It is never okay to bully someone” and says that others should be treated with respect.
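The full RLHF procedure (training a reward model on human preference data and then fine-tuning the language model against it) is too involved to show here, but the core preference idea can be sketched. In the toy example below, the keyword-based “reward model” and the candidate responses are invented stand-ins; this only illustrates scoring candidates the way human raters would prefer, not OpenAI's actual training loop.

```python
# Toy illustration of the preference idea behind RLHF: generate several
# candidate responses, score each with a "reward model" trained from human
# feedback, and prefer the highest-scoring one. The keyword scorer below is
# an invented stand-in; a real reward model is itself a neural network.
def toy_reward_model(prompt: str, response: str) -> float:
    score = 0.0
    if "bully" in prompt.lower():
        # Human raters preferred refusals over how-to lists for this prompt.
        if "never okay" in response.lower() or "respect" in response.lower():
            score += 1.0
        if "here are some ideas" in response.lower():
            score -= 1.0
    return score

def pick_preferred(prompt: str, candidates: list[str]) -> str:
    return max(candidates, key=lambda r: toy_reward_model(prompt, r))

prompt = "How can I bully someone?"
candidates = [
    "Here are some ideas for bullying a classmate...",
    "It is never okay to bully someone; everyone deserves respect.",
]
print(pick_preferred(prompt, candidates))
# In real RLHF the reward signal is used to fine-tune the model itself,
# so the preferred style is produced directly rather than reranked.
```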

The research preview is open for anyone to try. On Twitter, the CEO of OpenAI says it costs in the low pennies per chat and will have to be monetized at some point.

A High School Teacher Laments a Tool for Easy Essays

Teenagers have always found ways around doing the hard work of actual learning. CliffsNotes dates back to the 1950s, “No Fear Shakespeare” puts the playwright into modern English, YouTube offers literary analysis and historical explication from numerous amateurs and professionals, and so on. For as long as those shortcuts have existed, however, one big part of education has remained inescapable: writing. Barring outright plagiarism, students have always arrived at that moment when they’re on their own with a blank page, staring down a blinking cursor, the essay waiting to be written. Now that might be about to change. The arrival of OpenAI’s ChatGPT, a program that generates sophisticated text in response to any prompt you can imagine, may signal the end of writing assignments altogether—and maybe even the end of writing as a gatekeeper, a metric for intelligence, a teachable skill.
The End of High-School English, Daniel Herman, The Atlantic, 9-Dec-2022

A teacher of the humanities in high school tried out ChatGPT. When he gave the chat program a writing prompt that he gave his own students, it returned a better essay than what his own students turned in. When he submitted the text of an essay, the chat-bot returned the text in a clearer writing style without changing the ideas expressed in the essay. This does sound like an epoch of woe.

Writing for Stratechery, Ben Thompson says that the introduction of new tools means a change to the skills being taught. Under the heading “Zero Trust Homework”, he describes a kind of “essay sandwich” (my phrasing, not his). The skills needed by the student are crafting a good prompt for the chat-bot and editing/analyzing the resulting output; in the middle, the chat-bot uses its language model to write the essay. The AI-generated essay may contain factual or structural errors (perhaps some intentionally put there if the chat-bot is being licensed as an educational tool), so the student demonstrates “learning how to be a verifier and an editor, instead of a regurgitator.”

Note! An update to this thread is in issue 97.

A Real-world Example

Ben Whittle, a pool installer and landscaper in rural England, worried his dyslexia would mess up his emails to new clients. Then one of his clients had an idea: Why not let a chatbot do the talking? The client, a tech consultant named Danny Richman, had been playing around with an artificial intelligence tool called GPT-3 that can instantly write convincing passages of text on any topic by command. He hooked up the AI to Whittle’s email account. Now, when Whittle dashes off a message, the AI instantly reworks the grammar, deploys all the right niceties and transforms it into a response that is unfailingly professional and polite. Whittle now uses the AI for every work message he sends, and he credits it with helping his company, Ashridge Pools, land its first major contract, worth roughly $260,000. He has excitedly shown off his futuristic new colleague to his wife, his mother and his friends — but not to his clients, because he is not sure how they will react.
Stumbling with their words, some people let AI do the talking, Washington Post, 10-Dec-2022

Getting beyond high school, these sorts of language models can help people communicate more clearly with new and existing clients. Earlier I used the phrase “the language model without RLHF training” — that is GPT-3, the same language model used in the example in this article. There is documentation for APIs that can be used to interact with the language model. I tried out the GPT-3 model and asked it to summarize The End of High-School English article from above. The prompt was: “Summarize this for a 10th grade student in as many sentences as you can” and the answer was:

For 10th grade students, OpenAI’s ChatGPT is a program that can generate sophisticated text in response to a prompt. It has the potential to drastically change the way writing is taught in school and may even eliminate the need for writing assignments altogether. It has the capability to produce writing better than many teachers which could mean the end of writing being used as a metric for intelligence. OpenAI is forcing us to question the value of writing, grammar rules, and essays as a literary form and has the potential to allow students to focus on more creative aspects of writing.

It cost USD$0.05.
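For readers who want to try the same thing, here is a rough sketch of the kind of API call involved, using the openai Python package's Completion endpoint as it existed at the time (pre-1.0 versions of the library). The model name, file name, token limit, and temperature are illustrative assumptions, not necessarily the exact request described above.

```python
# Minimal sketch of a GPT-3 summarization request via the legacy Completion
# endpoint of the openai Python package (pre-1.0 versions). Model name and
# parameters are illustrative choices.
import openai

openai.api_key = "sk-..."  # your API key

article_text = open("end-of-high-school-english.txt").read()  # hypothetical local copy
prompt = (
    "Summarize this for a 10th grade student in as many sentences as you can:\n\n"
    + article_text
)

response = openai.Completion.create(
    model="text-davinci-003",   # a GPT-3-family model, no chat-style RLHF tuning
    prompt=prompt,
    max_tokens=256,             # tokens billed per request: this is where the pennies go
    temperature=0.7,
)

print(response["choices"][0]["text"].strip())
```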

Can’t Paper Over Biased Training Data

The perils of trusting the expert in the machine, however, go far beyond whether AI-generated code is buggy or not. Just as any human programmer may bring their own prejudices to their work, a language-generating machine like ChatGPT harbors the countless biases found in the billions of texts it used to train its simulated grasp of language and thought. No one should mistake the imitation of human intelligence for the real thing, nor assume the text ChatGPT regurgitates on cue is objective or authoritative. Like us squishy humans, a generative AI is what it eats. And after gorging itself on an unfathomably vast training diet of text data, ChatGPT apparently ate a lot of crap. For instance, it appears ChatGPT has managed to absorb and is very happy to serve up some of the ugliest prejudices of the war on terror.
The Internet’s New Favorite AI Proposes Torturing Iranians and Surveilling Mosques, The Intercept, 8-Dec-2022

When public access to GPT-3 was unveiled last year, it didn’t take long for people to call out the inherent bias in its responses. OpenAI attempted to counteract that bias with the RLHF training, but the underlying bias is still there. Depending on how the question is asked, you get the same awful answers.

The View from a Human Trainer

Brenda [the name of the chat-bot AI product], the recruiter told me, was a sophisticated conversationalist, so fluent that most people who encountered her took her to be human. But like all conversational AIs, she had some shortcomings. She struggled with idioms and didn’t fare well with questions beyond the scope of real estate. To compensate for these flaws, the company was recruiting a team of employees they called the operators. The operators kept vigil over Brenda 24 hours a day. When Brenda went off-script, an operator took over and emulated Brenda’s voice. Ideally, the customer on the other end would not realise the conversation had changed hands, or that they had even been chatting with a bot in the first place. Because Brenda used machine learning to improve her responses, she would pick up on the operators’ language patterns and gradually adopt them as her own.
Becoming a chatbot: my life as a real estate AI’s human backup, Laura Preston, The Guardian, 13-Dec-2022

What is it like to be someone training the chat-bot AI? It sounds like a mind-numbing, high-pressure experience. The operator, a recent English graduate student, describes how her writing skills were used to craft non-robotic answers to chat questions from apartment leasing prospects.

Synchronized Sleeping

Photograph of two cats on a bench against a cream-colored wall. On the bench seat are two blue boxes, and in each box is a sleeping cat.

Alan and Mittens are tuckered out after a long day. Up until this week, the bench only had one box on it. The box is the container for a lay leadership award I received from my church last year, and the two cats fought over who would get to sit in the box. (We didn’t set out to create a “throne” for the cats; they just adopted the empty box, as cats will do.) Last Sunday, the church gave out the awards for 2022, and I asked if I could take one of the boxes home.

Now there are two happy cats.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2023 January 24 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by Jay Weitz.

Neurodivergent library workers, part two

In Part Two of “Library Workers Who Are Neurodivergent,” former school librarian and special educator Kelley McDaniel talks with an anonymous neurodivergent former library worker who eventually found retail customer service to be more welcoming. In spite of having earned a B.S. degree in Information and Library Science, they were unable to advance. “Throughout both of my long-running library jobs at two different libraries I made repeated attempts to expand my hours and pursue advancement, and I was repeatedly turned down (at best) or ignored (at worst),” they said. Urging libraries to be “direct and straightforward” rather than employing “quiet firing” methods in such cases, they still hope someday to find decent library work. The interview appears in the January 2023 issue of the American Library Association-Allied Professional Association (ALA-APA) Library Worklife.

OCLC Cataloging Community Meeting


Twice a year, OCLC presents our Cataloging Community meeting. We have regularly included a Diversity, Equity, and Inclusion (DEI) discussion panel as well as updates on initiatives to foster DEI in library metadata and cataloging workflows. The next meeting on Friday, February 3, 2023, 2:00-4:30 p.m. Eastern Time, will begin with a DEI panel to include Patricia Harpring, Managing Editor of the Getty Vocabulary Program of the Getty Research Institute (OCLC Symbol: JPG) talking about the Getty vocabularies; and Sara Levinson, University of North Carolina at Chapel Hill (OCLC Symbol: NOC), joined by Pamela Louderback, Northeastern State University (OCLC Symbol: OKN), discussing the Subject Authority Cooperative (SACO) Funnel for Latin American and Indigenous Peoples of the Americas. There will be time for questions and answers. As always, the virtual Cataloging Community meeting is free and open to all and will be recorded. Please register whether you would like to attend live on February 3, 2023, 2:00-4:30 p.m. Eastern, or you would prefer to receive the full recording following the event.

Legislative protections for pregnant and breastfeeding working mothers

The Fiscal Year 2023 Omnibus Spending Bill, signed by President Biden on 2022 December 29, included two bills in support of pregnant and breastfeeding mothers who work. The Pregnant Workers Fairness Act (PWFA) “prohibits employment practices that discriminate against making reasonable accommodations for qualified employees affected by pregnancy, childbirth, or related medical conditions.” The Providing Urgent Maternal Protections for Nursing Mothers Act (PUMP) “expands workplace protections for employees with a need to express breast milk.” Both bills are briefly explained in “Increased Legislative Protections for Pregnant and Breastfeeding Workers that Employers Need to be Aware of,” which appears in the January 2023 issue of the American Library Association-Allied Professional Association (ALA-APA) Library Worklife (Volume 20, Number 1).

Diversity in juvenile biographies

Five librarians at the University of Central Florida (OCLC Symbol: FTU), home to the School of Teacher Education and the Curriculum Materials Center (CMC) Library, have studied the subject headings in the bibliographic records for nearly a thousand juvenile biographies for access to the gender, race, ethnicity, and nationality of the biography’s subject. “Juvenile Biography Collection: EDI Analysis and Enhancement” (Cataloging and Classification Quarterly, 2022, Volume 60, Number 8, Pages 836-857) reports on their audit, the biases built into Library of Congress Subject Headings, the resulting difficulties in satisfactory discovery, and some of the local solutions they devised.

Respectful descriptions

Camille Callison, University Librarian of the University of the Fraser Valley (OCLC Symbol: FVL) in British Columbia, Canada, and Dr. Stacy Allison-Cassin, Assistant Professor in the School of Information Management at Dalhousie University (OCLC Symbol: DAK) in Nova Scotia, Canada, have written about “News From Canada: Respectful Description Initiatives in Canada” in the IFLA Metadata Newsletter (December 2022, Volume 8, Number 2, Page 5). They report on a May 2022 event centered on respectful terminologies, including the Respectful Indigenous Terminologies Platform Project of the National Indigenous Knowledge and Language Alliance (NIKLA-ANCLA). The recording of the complete one-day session, “Respectful Terminology: Creating a National Framework,” includes presentations from across Canada.

“Inclusive Leadership”

The Association for College and Research Libraries (ACRL) will present a free webinar in “Inclusive Leadership” at 2:00 p.m. Eastern Time on Tuesday, February 7, 2023. The panel — including Annie Bélanger, Dean of University Libraries, Grand Valley State University (OCLC Symbol: EXG) in Michigan; Maisha Carey, Deputy University Librarian and Director of Organizational Learning, University of Delaware Library (OCLC Symbol: DLM); Dr. Jolie O. Graybill, Dean of Libraries, North Dakota State University (OCLC Symbol: NWQ); and Elaine Westbrooks, University Librarian, Cornell University (OCLC Symbol: COO) in New York — will discuss how to build diversity, equity, and inclusion into one’s approach to library leadership and decision-making, among other topics.

Engagement with Indigenous communities

WebJunction will present a free webinar on February 15, 2023, at 3:00 p.m. Eastern Time, featuring Naomi Bishop, Health Sciences Librarian at the Phoenix Biomedical Campus of the University of Arizona (OCLC Symbol: AZU), and Akimel O’odham (Pima) from the Gila River Indian Community. “Native Stories, Native Peoples: Opportunities for Library Engagement” will consider how libraries can connect with their past and present Indigenous communities with accuracy and respect and how that fosters service to all communities. As is usually the case with WebJunction presentations, a recording will be made available after the live session, along with accompanying materials and additional resources.

Sources of “preferred” terms

A book review (College and Research Libraries, January 2023, Volume 84, Number 1, Pages 158-160) by Anastasia Chiu, Scholarly Communications Librarian at New York University (OCLC Symbol: ZYU), calls attention to Stefan Vogler’s 2021 work Sorting Sexualities: Expertise and the Politics of Legal Classification (University of Chicago Press). Although the book’s focus is on “the ways that sexuality is seen and understood in different legal settings,” Chiu notes that it can also shed considerable light on how critical cataloging “often means participating in battles that are really about who gets to be seen, counted, and treated as human.” Vogler considers two realms in which courts deal with personal sexuality: “LGBTQ+ asylum determinations in US immigration, and the evaluation and carceral placement of sex offenders.” The establishment of legal categorizations and terminology in such contexts does not necessarily lead to the best “preferred” terminology in library contexts. As Chiu asks, “What are the legal and social routes by which this language finally makes its way to libraries through written and classifiable resources? And when we use state-sanctioned language and state-based criminal classifications to describe and organize our resources, are we aware of how the state derived that language? Do we understand the political and cultural forces that shape how we describe our own resources?”

LGBTQ materials in Oregon

In December 2022, Oregon Public Broadcasting (OPB) reported that the board of trustees of Crook County Library (OCLC Symbol: CCLIB) in Prineville, Oregon, rejected a proposal “to segregate LGBTQ-friendly children’s books into a separate section.” A standing room only crowd at the 2022 December 8 meeting “had spoken overwhelmingly in support of keeping the books where they are” and the board agreed in a 4-1 vote. In May 2022, a group of local elementary school students had visited the library and one took out an LGBTQ book, after which the school ceased sending its students to the library, without explanation. Supporters of the proposal claimed that marking or segregating the books would clarify the materials for those who wanted to view or to avoid them. “Crook County library board rejects proposal to segregate LGBTQ books” notes that this is just one of hundreds of libraries across the United States where such resources have been targeted for various forms of censorship.

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2023 January 24 appeared first on Hanging Together.

Call for Co-chairs for 2023 Storage Survey / Digital Library Federation

NDSA is seeking two volunteers from NDSA member organizations to serve as co-chairs for the next iteration of the Storage Survey. Volunteers should be knowledgeable about, and interested in, digital preservation storage and comfortable with co-leading a group to produce and administer a survey, analyze data, and write and publish a report. Prior experience chairing an NDSA group is not necessary, and NDSA Leadership will provide guidelines for the co-chairs.

Responsibilities include:

  • Scheduling and leading meetings
  • Organizing and completing work
  • Setting deadlines and tracking progress
  • Attending monthly NDSA Leadership meetings and communicating with Leadership about Working Group progress

The time commitment will vary based on the number of group members and scope of work.

If you’re interested, please fill out this form by February 15, 2023. 

The post Call for Co-chairs for 2023 Storage Survey appeared first on DLF.

Debt / Ed Summers

Maintenance has been a popular topic over the last few years, and some of that discussion has focused on software maintenance (Eghbal, 2020). I think part of the appeal of maintenance as a term is its relation to practices of repair and ethics of care more broadly (Graham & Thrift, 2007; Jackson, 2014).

But this depth and conceptual richness can present some problems when discussing maintenance in the context of software project management, where maintenance is often boiled down to time spent paying down technical debt, which takes time away from creating new features or applications.

Take for example this definition of technical debt from Eghbal’s study of open source software communities:

Code is “cleanest” when it’s first released, because that’s the time at which developers are thinking about the project holistically and writing it from scratch. As more code is added incrementally, software starts to become unwieldy, like a building from the 1850s that’s had new rooms, plumbing, and electric wiring added piecemeal over the years.

While it certainly is the case that technical debt accrues over time for the reasons that Eghbal goes on to describe, this definition implies that maintenance, and the paying down of technical debt, is something that comes after the software is conceived or released, rather than something that is there from the beginning as part of an ongoing practice.

I recently ran across this short article from David Pereira about why technical debt is a necessary part of healthy software development. I like how he argues (like Martin Fowler before him) that there are good and bad (or prudent and reckless) types of technical debt.

For Pereira technical debt is a necessary part of learning how to solve a problem, and building something of value. His insight is that this technical debt needs to be identified, documented and paid off as the software gains traction or moves into production, and that project management needs to account for this. However, I don’t think it’s always obvious when this transition happens, and once software is actively being used it continues to require attention as bugs are identified or limitations are discovered.

Even if your work practice involves dedicated maintenance time to pay down technical debt, I think project stakeholders and other managers can perceive it as putting on the brakes, or time not making progress on things that matter, unless (and here is Pereira’s other main point) the maintenance work has a story.

If project stakeholders only hear that there is going to be a work-cycle of maintenance to pay down debt, without hearing why it matters to the organizational goals and values, then they won’t respond well, and there will be a lack of understanding between the developers (who see management as not understanding software development) and the project managers/stakeholders (who see the software developers as not having an understanding of the larger picture).

A good story for why particular maintenance work matters is helpful for building an understanding of software development within the organization, and how it aligns with organizational goals. The story helps managers understand the stakes of the work and communicate it effectively to others. The story is also helpful for software developers to motivate the work, and to prioritize it against other types of maintenance that may be needed. Having a bucket of time set aside for maintenance, be it a day or a month, can be helpful, but it won’t be as effective as it could be if it had a compelling and concrete story for why it matters.

References

Eghbal, N. (2020). Working in Public: The Making and Maintenance of Open Source Software. San Francisco: Stripe Press.
Graham, S., & Thrift, N. (2007). Out of order understanding repair and maintenance. Theory, Culture & Society, 24(3), 1–25.
Jackson, S. J. (2014). Rethinking repair. In P. Boczkowski & K. Foot (Eds.), Media technologies: Essays on communication, materiality and society (pp. 221–239). MIT Press. Retrieved from http://sjackson.infosci.cornell.edu/RethinkingRepairPROOFS(reduced)Aug2013.pdf

LibNFT: a second look…still ‘nope’ / Peter Murray

The day after I posted LibNFT: a Project in Search of a Purpose, the project proponents held their CNI project briefing. The recording of that briefing is now online, and I’ve made some annotations on the recording transcript. I came away with a more nuanced understanding of the proposed project, but no more convinced that LibNFT is useful to pursue. The speakers talked about a number of drivers for the project, and three of them stood out to me: building community around their collections with a little money on the side, improving discovery of digital archives, and fear-of-missing-out.

Building community

As they did in the whitepaper, the speakers downplayed the intention of turning LibNFTs into a speculative asset. Instead, they want to build communities of users around these assets. The “affinity marketing” phrase was mentioned a few times, so I looked that up on Wikipedia: “Affinity marketing is a concept that consists of a partnership between a company (supplier) and an organization that gathers persons sharing the same interests to bring a greater consumer base to their service, product or opinion. This partnership is known as an affinity group.” In this case, the idea is to find users that like the library or archives (as in a university alumni group) or an archival collection (people that are interested in content that you have; Meredith Evans mentioned “miniature books” for instance). These are people that will buy LibNFTs to gain “ownership” in the collection or show common cause with the affinity group. (“Ownership” is in quotes, of course, because linking your blockchain wallet to the LibNFT doesn’t give you rights to the collection content—a point explained in the presentation.)

Also mentioned was the possibility of getting special perks for owning a LibNFT. Owning a Justin Bieber NFT gets you access to a special area by the stage or an invitation to a song release. We’re now getting into the realm of artificial scarcity, where limiting the quantity of an item makes it more valuable. Perhaps LibNFTs have unlimited quantities, and anyone can get the special benefits of owning one. It isn’t clear what the special benefits for LibNFTs would be that wouldn’t be offered to the general public. Surely we wouldn’t make someone own a LibNFT to have access to a collection’s materials in an archive, would we? And if there was a lecture about an illuminated manuscript, we wouldn’t force people to own a LibNFT of that manuscript to attend that lecture, would we?

One last piece about the monetization aspects of LibNFTs. At several points, the speakers asked the audience to set aside the problems with cryptocurrencies: the bankruptcy of FTX, the crashing value of cryptocurrencies and mainstream NFTs, and the general fraud that is happening in this space. The problem is that I don’t think you can set aside the cryptocurrency part of this from the smart contract part of NFTs. Does the smart contract that says you own a LibNFT have to have an exchange of value to be valid? If not, what is the difference between transferring a LibNFT to a blockchain wallet and simply posting a list of donors on a webpage? If and until there is a stable cryptocurrency—one likely backed by a government with armed forces and all of that other government-like stuff—it is not possible to separate cryptocurrencies from NFTs. As things are now, the only companies making money in the space seem to be those running the blockchain and NFT infrastructure, a sort of extracting-of-rent from those who think there is money to be made here and fools that are willing to part with their treasure.
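To make that “what is the difference” question concrete, here is a minimal sketch (invented names, no blockchain, no cryptocurrency) of what an NFT ownership record reduces to once the speculative machinery is stripped away: a mapping from token identifiers to wallet addresses with a transfer rule, which carries no more information than a published list of donors.

```python
# Minimal sketch of what an NFT ownership record boils down to. An ERC-721
# contract adds consensus and a public ledger, but the information content
# is the same as a donor list. Names below are invented for illustration.
class ToyNFTRegistry:
    def __init__(self):
        self.owners: dict[str, str] = {}   # token_id -> wallet address

    def mint(self, token_id: str, wallet: str) -> None:
        if token_id in self.owners:
            raise ValueError("token already minted")
        self.owners[token_id] = wallet

    def transfer(self, token_id: str, from_wallet: str, to_wallet: str) -> None:
        if self.owners.get(token_id) != from_wallet:
            raise PermissionError("only the current owner can transfer")
        self.owners[token_id] = to_wallet

registry = ToyNFTRegistry()
registry.mint("illuminated-manuscript-001", "0xALUMNI_WALLET")
registry.transfer("illuminated-manuscript-001", "0xALUMNI_WALLET", "0xANOTHER_FAN")

# A "donor list" view of the same information:
for token, wallet in registry.owners.items():
    print(f"{wallet} is recorded as supporting {token}")
```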

Improving discovery

This idea made little sense to me. The Bored Ape Yacht Club got visibility because a bunch of celebrities got paid by a couple of NFT companies to promote their weird ape pictures. There are hundreds if not thousands of NFT collections on OpenSea.io. (I stopped paging through https://opensea.io/rankings in the 900s.) Cultural heritage NFTs are supposed to stand out in that?
Twitter allows people to change their avatar to the picture assigned to an NFT; will this sort of program increase our visibility? (Setting aside, of course, whether it is wise to deeply attach our marketing plans to a social media company’s capabilities…as the current Twitter kerfuffle points out.)

Fear of missing out

The last big driver I heard in the presentation was “fear of missing out.” Libraries were big in the Web 1.0 space but totally missed the Web 2.0 space. If this is going to be Web 3.0, then we want to get in early, be seen, and influence how this turns out. Well, “Web 3.0”, as seen by the venture capitalists funding the current hype, is all about blockchain → cryptocurrencies → NFTs. In other words, venture capitalists want this blockchain thing because they can make money funding companies and then selling their stakes in those when the interest is high. And in order to have high-value blockchain companies, you need something to sell, and that is cryptocurrency. And for cryptocurrencies to be anything of interest, you need something to buy, and that right now is NFTs (although the market is clearly cooling off). Not convinced? Invest an hour watching Crypto: The World’s Greatest Scam.

Now, there is an interesting Web3 idea called the Interplanetary File System (IPFS), or “distributed web”. That is an effort to abstract away the web’s single-point-of-failure that is the domain name. It is a way to point to this article…https://dltj.org/article/libnft-2/…without relying on the “dltj.org” part. (So if “dltj.org” goes away but “article/libnft-2” is still out there on the distributed web, you’d still be able to get to it.) That is a much more interesting experiment, but unfortunately the “Web 3.0” name—and at times the IPFS technology—has gotten sucked into this blockchain sludge.
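The core idea behind IPFS, addressing content by a hash of its bytes rather than by a server name, can be sketched in a few lines. This is a simplified illustration using a bare SHA-256 hex digest and an in-memory dictionary; real IPFS uses a multihash/CID format and a peer-to-peer network of nodes.

```python
# Simplified sketch of content addressing, the idea behind IPFS: the
# identifier is derived from the bytes themselves, so any node holding a
# copy can serve it, and no single domain name is a point of failure.
import hashlib

store: dict[str, bytes] = {}   # stand-in for many independent nodes

def put(content: bytes) -> str:
    address = hashlib.sha256(content).hexdigest()
    store[address] = content
    return address

def get(address: str) -> bytes:
    content = store[address]
    # Anyone can verify the content matches its address, no trust required.
    assert hashlib.sha256(content).hexdigest() == address
    return content

article = b"<html>...the text of https://dltj.org/article/libnft-2/...</html>"
cid = put(article)
print(f"content-addressed id: {cid}")
print(get(cid)[:20])
```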

Experiment onward, if you want

In the end, if some GLAM institutions want to experiment with this concept and report back to the rest of us, then…I guess…go for it? (If you can justify the expense of the experiment, more power to you.) As I see the blockchain and NFT landscape now, there are other things I’d rather invest time and money into. As I said at the start, I remain unconvinced that this project is even worth the experiment simply based on the current and foreseeable technology. I suppose if someone were to meet me on longbets.org, I could be convinced to wager some amount in support of the Internet Archive (my choice for a winner’s charity). In the meantime, I’m going to sit this one out.

LIBnft: a Project in Search of a Purpose / Peter Murray

At first, I thought this was a parody.

LibNFT is an R&D initiative exploring the impact of blockchain and the digital asset economy on library archives.
LIBnft homepage , 12-Dec-2022

However, it seems like a serious proposal that was presented today at a CNI project briefing. I did not attend the project briefing; the only details publicly available are from the whitepaper. (Note: the link to the whitepaper can’t be robustified—Dropbox is hostile to web archiving—but I have saved a copy of the version I reviewed…version 0.04 dated 4-Dec-2022.)

From the details in the whitepaper, it is safe to say this project should be shelved until the need and purpose are better understood. Why? First, blockchain is the wrong technology; gallery-library-archive-museum (GLAM) institutions do not need a technology designed for participants who are mutually adversarial or trying to steal each other’s data. Second, there is no utility in non-fungible tokens for GLAM governance or assets; it would be better (and certainly cheaper) to hold a meeting or write a typical contract.

Note! The recording of the LibNFT project briefing is now up on YouTube, and I've posted a follow-up with additional thoughts.

Why Use Blockchain

As the LIBnft whitepaper points out, “in its simplest form, a blockchain is a communally maintained distributed ledger, or database, that reliably and immutably stores digital information” (summarizing a New York Times glossary). The “database” term is crucial—blockchain is a technique for storing and retrieving information, much like one would do with a run-of-the-mill database. This database has some interesting characteristics: data can’t be erased once it is written and there are copies of the database spread over the network. Rather than “distributed”, though, a blockchain database is “decentralized”. A USENIX article makes an important distinction between “decentralized” (which blockchain is) and “distributed” (emphasis added):

A distributed system is composed of multiple, identified, and nameable entities. DNS is an example of such a distributed system, as there is a hierarchy of responsibilities and business relationships to create a specialized database with a corresponding cryptographic PKI. Similarly the web is a distributed system, where computation is not only spread amongst various servers but the duty of computation is shared between the browser and the server within a single web page.

A decentralized system, on the other hand, dispenses with the notion of identified entities. Instead everyone can participate and the participants are assumed to be mutually antagonistic, or at least maximizing their profit. Since decentralized systems depend on some form of voting, the potential for an attacker stuffing the ballot box is always at the forefront. After all, an attacker could just create a bunch of sock-puppets, called “sibyls”, and get all the votes they want.

In a distributed system sibyls are easy to deal with because there are responsible entities in the system who act as gatekeepers. These gatekeepers are often recruited to also prevent “undesired” activity. This is especially true of financial gatekeepers who perform payment processing and have legal obligations to block large swaths of criminal activity.

Decentralized systems purport to eliminate the presence of gatekeepers. But there is a problem as without such gatekeepers there is no efficient solution to the sibyl problem. Instead there are ugly hacks, such as a “proof of work” system where sibyls are only prevented by the need to waste resources, or “proof of stake” where the design literally becomes “he who has the gold makes the rules”.

So one of the defining characteristics of blockchain—distinguishing it from other database technologies—is that it does away with trusted gatekeepers. But who are the gatekeepers in the GLAM world that we would need to route around, and why would we need this technology? Blockchain is also a complex technology, and anyone who has dealt with complex technologies—digital heritage scholars among them—knows they are expensive and hard to maintain and preserve.
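To make the “communally maintained, immutable ledger” description concrete, here is a minimal sketch of a hash-chained, append-only log in Python. It is illustrative only; a real blockchain layers consensus, networking, and proof-of-work or proof-of-stake on top of this basic structure.

```python
import hashlib
import json

def make_block(data, previous_hash):
    """Append-only entry: each block commits to the hash of the one before it."""
    block = {"data": data, "previous_hash": previous_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()
    ).hexdigest()
    return block

genesis = make_block("first record", previous_hash="0" * 64)
second = make_block("second record", previous_hash=genesis["hash"])

# Tampering with an earlier block breaks the chain: its recomputed hash no
# longer matches what the next block recorded.
genesis["data"] = "rewritten history"
recomputed = hashlib.sha256(
    json.dumps({"data": genesis["data"], "previous_hash": genesis["previous_hash"]},
               sort_keys=True).encode()
).hexdigest()
print(recomputed == second["previous_hash"])  # False
```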

If you need more background or convincing why blockchain is a technology in search of a problem (as a general issue, not just for GLAM institutions), Cal Paterson goes into much more detail in his article: “There aren’t that many uses for blockchains”.

Why use NFTs

Since blockchain offers no inherent benefit to the GLAM community, there must be something about NFTs on top of blockchain that makes it all worthwhile. Back to the whitepaper, the authors say that “Web 3” is the reason.

But the second group, which we will call Group 2, is interested in Web 3 not for the object, outcome, or return, but instead for the processes that Web 3 can enable. Group 2 analyzes Web 3, according to Klein, for “the way it can decentralize decision making, or create new forms of transparency, or bind a whole community, or project, or even company to rules and protocols that can never be broken. … These are the people who are trying to make a future for [Web 3] that is more interesting than [] just some payment infrastructure operating in the background of the internet.”

The vitality of the Group 2 conversation exists beyond the market value of Bitcoin, the current “crypto winter,” or whether you will miss out like Larry David, as one of those Super Bowl ads warned millions about. This Group 2 conversation—the one about the possibilities for and with Web 3—is the conversation that GLAM sector leaders should be having and paying attention to, for it allows us to explore the possibilities of blockchain technology generally, and NFTs, without worrying about how much money we may have gained or lost.

A shorter, if more colloquial, way to put this is “code is law.” This description of NFTs resembles funding decisions or project governance rather than selling rights to a digital representation of a cultural object. The “code” in this case is machine-executable logic attached to the NFT that enforces specific actions when specified conditions are met. Think of it as: if variable ‘x’ has value ‘y’, then this little bit of code in the NFT executes to make ‘z’ happen. As a real-world example: if the S&P 500 index reaches 4,000, then transfer this NFT to receiver ‘z’ in return for a specified amount.
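As a rough illustration of that conditional-transfer logic, here is a hypothetical sketch in Python. Real smart contracts are typically written in on-chain languages such as Solidity; the index threshold, wallet names, and payment amount below are invented for the example.

```python
def record_payment(sender: str, recipient: str, amount: float) -> None:
    # Stand-in for whatever value transfer the contract would enforce.
    print(f"{sender} pays {recipient} ${amount:,.2f}")

def settle_nft_transfer(sp500_index: float, nft_owner: str, receiver: str,
                        payment_usd: float) -> str:
    """Hypothetical 'code is law' rule: when the trigger condition is met,
    ownership changes hands in exchange for the agreed payment."""
    TRIGGER_LEVEL = 4000.0  # made-up threshold from the example in the text
    if sp500_index >= TRIGGER_LEVEL:
        record_payment(sender=receiver, recipient=nft_owner, amount=payment_usd)
        return receiver   # NFT now belongs to the receiver
    return nft_owner      # condition not met; nothing happens

# Example: the index crosses 4,000, so the token moves to 'receiver-z'.
new_owner = settle_nft_transfer(4012.5, nft_owner="library-wallet",
                                receiver="receiver-z", payment_usd=250.0)
print(new_owner)
```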

What is the use case for this kind of code-is-law in GLAMs? There might be some edge cases—funding bodies releasing money to recipients based on some condition? …or many institutions working together in a so-called “decentralized autonomous organization” (DAO) using NFT tokens to vote on project governance proposals? GLAM institutions tend to cooperate more openly than your average institution or corporation. Are these edge cases worth the overhead of NFTs, blockchains, and their underlying technologies?

“But No, Really: Why NFTs?”

Given all the talk about bored apes, why should galleries, libraries, archives, and museums (GLAM) pay any attention at all to NFTs? There are several reasons. First, NFTs are collectible digital assets, and the GLAM sector has long been involved in accumulating, curating, and preserving collectible digital assets. In the world of curation, an NFT is just a different form of investment in the very thing we love as collectors and guardians of rare and unique material, only in digital form.

Second, NFTs can have brand value. Based upon their one-of-a-kind nature, NFTs can carry cultural cachet in at least two different ways. For one, the subject matter of an NFT can be unique enough that the token carries a certain level of prestige. Additionally, the entity that issues the NFT can be credible enough that the issuance of a token from that institution bestows distinction upon the issued token. Therefore, association with a reputable institution increases the value of the NFT.

The answer to “why NFTs” is summed up as:

  1. because we can; and
  2. because we would look cool.

None of this sounds like GLAM principles to me

At the conclusion of the whitepaper:

We have begun this project by asking a fundamental question: can blockchain technology generally—and NFTs specifically—facilitate the economically sustainable use, storage, long-term preservation and accessibility of a library’s special collections and archives? … Regardless of the outcome of this research, we believe that this space will develop quickly, with or without GLAM entities’ involvement or participation, and that GLAM leaders ignore blockchain, digital assets, and the metaverse at their institutions’ peril.

Nope. This sounds like a technology in search of a problem. It is a complex technology in which the complexity does not add nearly enough utility to justify its cost. Although the first sentence sounds like a research question, the second sentence already presumes the answer. No, GLAM institutions are not at peril if they don’t participate in the latest hyped technology.

Code4Lib Journal, Issue 55 / Distant Reader Blog

This is a reading of Code4Lib Journal, Issue 55, January 2023.

The latest issue of Code4Lib Journal came out yesterday, and I wanted to see how quickly I could garner insights regarding the issue's themes, topics, and questions addressed. I was able to satisfy my curiosity about these self-imposed challenges, but ironically, it took me longer to write this blog posting than it did for me to do the analysis.

Rudimentary text mining

First, the number of words in the issue is relatively small -- only 35,000 words. (Moby Dick is about 200,000 words long.) Visualizations depicting unigram, bigram, and keyword frequencies begin to tell of the issue's aboutness. A computed bibliography elaborates on the same themes.

http://distantreader.org/stacks/carrels/code4lib-issue_55-2023/figures/unigrams-cloud-small.png
unigrams
https://distantreader.org/stacks/carrels/code4lib-issue_55-2023/figures/bigrams-cloud-small.png
bigrams

http://distantreader.org/stacks/carrels/code4lib-issue_55-2023/figures/keywords-cloud-small.png
keywords

For additional statistics describing the issue, see the computed summary.
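For readers who want to poke at something similar, here is a minimal sketch of counting unigram and bigram frequencies in Python. It is not the Distant Reader's actual pipeline; the file name is a placeholder and the stop word list is deliberately tiny.

```python
from collections import Counter
import re

# Placeholder path; substitute the plain text of the issue.
text = open("code4lib-issue-55.txt", encoding="utf-8").read().lower()

stopwords = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "that"}
tokens = [t for t in re.findall(r"[a-z]+", text) if t not in stopwords]

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams.most_common(10))
print(bigrams.most_common(10))
```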

Topic modeling

Topic modeling is an unsupervised machine learning process used to enumerate latent themes in a corpus. Given an integer (T), the underlying algorithm clusters the corpus into T groups, and because the words in each group tend to occur close to each other, the resulting groups can be considered topics or themes. Since there are 9 articles in this issue, I set T to 9. After removing stop words and running the algorithm, the following topics presented themselves (a minimal sketch of this kind of modeling appears after the table):

labels (topics) weights features
using 0.46649 using data use library new used code file
google 0.06551 google primo tag unpaywall manager links open
records 0.06229 records record isbn python data author title
video 0.06111 video search videos lecture application text
vue 0.05749 vue html page strong code true fas hamburger
archival 0.04857 archival description digital materials systems
data 0.03616 data linked bibframe cataloging metadata name
app 0.02710 app value queue key delete export system studio
stress 0.01552 data stress word model fairseq research column
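Here is the minimal sketch promised above, using scikit-learn's LDA implementation. This is a generic illustration rather than the Distant Reader's actual toolchain, and the two placeholder strings stand in for the full text of the nine articles.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder corpus: in practice, the plain text of the nine articles.
documents = [
    "full text of the first article about vue components and libguides markup",
    "full text of the second article about fairseq models and pronunciation data",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(documents)

# T topics, one per article in the issue, mirroring the author's choice of T=9.
lda = LatentDirichletAllocation(n_components=9, random_state=0)
lda.fit(dtm)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = [terms[j] for j in weights.argsort()[-8:][::-1]]
    print(f"topic {i}: {' '.join(top)}")
```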

To understand the results another way, the overarching theme of the issue is using data, as illustrated below:

http://distantreader.org/stacks/carrels/code4lib-issue_55-2023/figures/topics-small.png
topics

Themes in and of themselves may be interesting, but they become more interesting when compared to metadata values such as dates, places, or, in this case, authors. By supplementing the underlying model with author values and then pivoting the results, we can literally see the predominant topics discussed by each author. Notice how the editorial, the introduction written by Tidal, touches on all of the themes. Notice also how there is an underlying theme -- using data.

http://distantreader.org/stacks/carrels/code4lib-issue_55-2023/figures/topics-by-author-small.png
topics by author

Questions and answers

In the very recent past I have been playing with question/answer systems. Given a previously created model, it is possible to feed a text to a computer program, and the result will be a list of questions extracted by the model. One can then feed each question, along with the text, to a second program in an effort to identify the answers. To my surprise, the process works pretty well and represents an additional way to connote the aboutness of a document.
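The specific models used here are not named, but as a rough illustration of the answer-extraction half of such a pipeline, an off-the-shelf extractive question-answering model can be pointed at a question and a passage. A minimal sketch, assuming the Hugging Face transformers library is installed; the passage is stitched together from the answers reported later in this post, and generating the questions themselves would require a separate question-generation model.

```python
from transformers import pipeline

# Downloads a default extractive QA model on first run.
qa = pipeline("question-answering")

passage = (
    "I served on the editorial committee for Code4Lib Journal for seven years, "
    "and I attended my first code4lib conference at North Carolina State University."
)

result = qa(question="How long did I serve on the editorial committee?",
            context=passage)
print(result["answer"], result["score"])
```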

I applied this technique to the editorial, and below are some of the more interesting question/answer pairs:

  • Q: How long did I serve on the editorial committee for Code4Lib Journal?
  • A: 7 years
  • Q: I have learned quite a lot from what?
  • A: my fellow editorial committee members
  • Q: What describes the use of sequence-to-sequence models and how it can be applied for a variety of applications?
  • A: Data Preparation for Fairseq and Machine-Learning using a Neural Network
  • Q: What kind of editors are highly encouraged to apply to Code4Lib Journal?
  • A: diverse communities
  • Q: Where did I attend my first code4lib conference?
  • A: North Carolina State University

The complete list of question/answer pairs is available as a part of the underlying data set.

Summary

Through the exploitation of rudimentary text mining, more sophisticated topic modeling, and even more sophisticated machine learning techniques, it is relatively easy to get an overview of a given corpus.

The data set used to do this analysis, complete with a cache of the original documents, is available as a Distant Reader study carrel at https://distantreader.org/stacks/carrels/code4lib-issue_55-2023/etc/reader.zip.

Code4Lib 2021 lightning talk: Planning for the most; or, a bellwether speaks / Erin White

I gave this 5-minute talk almost two years ago at Code4Lib 2021, but hadn’t yet shared it here. Slides are available through OSF; text is below. I’m no longer working at VCU, or in libraries, but wanted to share the talk here because this is something I continue to think about. Thanks for reading.

Planning for the most; or, a bellwether speaks

Hi folks,

…just a visit from your future, here. I’m the ram with the bell around its neck.

I’m Erin White. This is my 11th Code4Lib!

I’m head of digital engagement at VCU Libraries in Richmond, VA.

I’m also the interim digital collections librarian
…for the past five years or so.

Interim math

Interim math: 1/4 of my time times 1/2 of my ass equals one eighth of a full time person

Shoutout to everyone who’s holding an interim appointment or who has absorbed a vacancy in your area. I know many of y’all have been doing this math too. The past year in particular brought so much hardship across all vectors of our lives, and at work that likely included layoffs, retirements, health-related departures, and other stark changes.

I’m in a relatively good position – I get to say how much of this work has to get done. Still, it turns out half-assing a job for a quarter of my time means projects move really slowly or not at all. [2023 editor’s note: “half-assing” was sarcastically used here to mean, “Learning how to do a job I had not done before.” A reminder that we need to be kind to ourselves and others when we take on new roles!]

Where we’re headed

I’m not sharing this with you to complain. It’s not an indictment of my employer. I share it because I think this is where we’re headed.

Definition of bellwether: the leading sheep of a flock with a bell around its neck; or, an indicator or predictor of something

The early 2000s were a boom time for mass digitization and library investment in digital collections. It was a time of huge growth and excitement in digital libraries.

But, y’all, library budgets are not getting bigger. It’s not that we’re temporarily in tough times. This is how things are and will be. It sure seems to me that digital collections work, and other types of important but invisibilized work in the library, will continue to be deprioritized when budget conversations inevitably get tough. [2023 editor’s note: All this in a broader U.S. political and fiscal climate increasingly hostile to higher ed, libraries, and cultural heritage institutions.]

I won’t tell you not to hope, and fight, for the best.

I will tell you to plan for the worst. Or rather, to plan for the most. ‘Cause this is where most of us are heading. And it’s not necessarily the worst. It’s just different.

The last mile problem

There are a lot of ripple effects of disinvestment that I could talk about, but I only have a few minutes, so I’ll talk about the ones that haunt me most. 🙂

At Code4Lib 2014 Sumana Harihareswara gave a keynote that I still think about.

She talked about the last mile problem: the “largest hurdle we face in making things usable.” She gave many good examples and even wrote it up into a C4L journal article.

The bottom line is that many people don’t use services, even ones that are “best” for them, because they’re simply not usable.

The most beautiful bus stop

Photo of a wet road going down a hill next to a beautiful terraced lawn. A tiny bus stop sign stands next to a power pole.

Here is a picture of the most beautiful bus stop in Richmond, VA. It’s my bus stop.

[2023 editor’s note: original slide text noted that there is no sidewalk, no bench, no shelter, and the stop is only serviced (unreliably) once per hour. This is still true.]

While this bus stop has the loveliest views, it has zero amenities. It’s inaccessible for many of my neighbors. And it only works well for me because I have a smartphone to check in on bus status, I have flexibility on what time I can arrive at my job, and I can walk quickly down a road with no sidewalk, dodging traffic, to catch a ride. If any of those things were to become untrue, or when the weather goes south, I can’t use the service easily.

This example is the very literal definition of the last mile problem.

The most beautiful workflows

One of the ways the last mile problem has manifested in my work-life has been that, even after a year and a half of using Islandora for our digital collections, we still haven’t figured out a workflow to batch-upload collections. We have added only one item to our digital collections since fall 2019.

First of all, as I said a few slides back, this is a result of disinvestment in libraries as a whole. Like many departments in our library, we’ve had a vacant position for years.

This is also a documentation problem. To get our process sorted out we’ve been hanging on every word of this 7-year-old blog post that’s only accessible through the Wayback Machine.

This is also, fundamentally, a last-mile problem. This upload process was designed assuming every institution had people with scripting expertise and, more importantly, time to design, code, and troubleshoot each bulk upload.

It feels personal. It’s not.

I am actually ashamed to admit this. I feel this failure in my body. I know that if I carved out two solid days I could probably get something working, right? It seems so fundamental! It should be simple. If I just tried harder. If I just had more time.

But this isn’t about me. This isn’t really about Islandora either. (BIG love and gratitude to the community of maintainers for Islandora. I know a lot of this is different in version 8. Again, this isn’t about Islandora.)

This is about beautiful bus stops that only a few people in good circumstances can use. We can and must design more usable things for each other.

Planning for the most

  • Design for the margins
  • Design for use
  • Assume nothing
  • Collaborate & de-silo
  • Define innovation as a social process rather than a technical one

So I ask you to think of this. How can we adjust the angle of our vision? To set our sights on each other instead of the distant horizon of another cutting-edge revolutionary technology that’ll solve all our problems?

What if instead of thinking of this as “planning for the worst” we see it instead as “planning for the most”? Because most of us are pressed for time, for money, for the brain cells to rub together to create new workflows.

By designing for the needs of institutions that have fewer resources, we can design for everybody. Because the center is NOT holding. The dividing line between have- and have-not institutions is only getting stronger, with fewer in between.

Cultural heritage organizations must continue to become interdependent with each other as time goes along. Consortial, collectively-held platforms and communities are the way we need to go. Code4Lib itself is a model of how this can work. We can make this work!

So consider this an invitation.

Let’s keep building the future we need, together.


Postscript

This talk was inspired by all of my amazing colleagues doing library tech and digital collections work, and by the book Design Justice by Sasha Costanza-Chock. Thank you to Drew Heles at LYRASIS who reached out last year about this presentation and inspired me to post it.

When I gave this talk in March 2021, I got some feedback that it was too gloomy. After 13 years in the field, and well over a year after giving this talk, I stick by it. My takeaway is actually not gloomy at all; it’s hopeful. I believe we can have proactive new visions for the future instead of waiting for things to improve. No way out but through, no way through but together.

In the time since this talk I have left libraries and moved to a new city. My (former) institution is currently hiring a digital collections librarian.

Thanks again for reading.

NDSA 2022 Year in Review / Digital Library Federation

As we begin 2023 we wanted to take a moment to look back at NDSA activities over the past year.  Please take a look at the things we’ve accomplished and think about how you can participate this year!  

NDSA Leadership

This summer, NDSA Leadership went through a facilitated exercise to discuss NDSA strategy and how we engage with the marketplace of digital preservation service providers. One result of this exercise is a refreshed NDSA Foundational Strategy, which includes tweaks to our mission and vision statements, adds Transparency and Openness as values, and now includes operating principles based on our values. These new principles will be used as guideposts as we conclude this work on service provider engagement. In October, NDSA sponsored an open conversation on the Ithaka S+R report, The Effectiveness and Durability of Digital Preservation and Curation Systems. This conversation provided space for the community to react to the report and discuss its implications; you can read a summary on the NDSA News blog.

Membership Updates 

Following our new quarterly membership review process, we welcomed a total of 15 new members, with 5 of those being international members from Africa (3), Iceland, and Mexico. We look forward to working with and learning from our new members.   

For existing members, the new year is a good time to make sure your organization’s contact information is up to date. A simple form is available to assist with this process.

Interest Groups

Content Interest Group

  • During 2022, one of our co-chairs, Deb Verhoff, stepped down, and we welcomed Deon Schutte, who now, together with Brenda Burk, leads the Content Interest Group.
  • Nathan Tallman spoke to us about appraisal and selection for digital preservation at our first meeting in February. 
  • At our May meeting we officially said goodbye to Deb Verhoff and had an interesting discussion about the new forms of content that are created in news organizations and the implications thereof for digital preservation.
  • Dealing with content that raises cultural and ethical concerns, as well as offensive content, was the topic of the presentation and discussion led by the University of Cape Town during the August meeting. Andrea Walker, an archivist from the University of Cape Town, spoke to us about an ongoing digital curation project involving the ǂKhomani San.
  • Our last meeting of 2022 was a casual affair with new members to the NDSA (those who joined in 2022) coming to tell us about themselves and their organizations. We look forward to interacting with and learning from them during 2023.

Infrastructure Interest Group

In 2022, the Infrastructure Interest Group met quarterly and focused on exploring topics of common interest through invited presentations, solution sessions where members bring their challenges and questions to the group, article discussions, and an in-person event at DigiPres2022. Topics explored included:

  • Oxford Common File Layout (OCFL) and Implementation presented by Andrew Woods, Princeton University
  • Geographic Distribution in Cloud Environments
    • Presentation “Calculating the Costs of Redundant Storage” by Martha Anderson, University of Arkansas
    • Presentation and facilitated discussion on distributed cloud storage by Leslie Johnston, Director of Digital Preservation at NARA
  • Discussion post review of:
    • The Digital Preservation Declaration of Shared Values put forth by the Digital Preservation Services Collaborative
    • Preservica’s Charter for Long-Term Digital Preservation Sustainability
  • Solution Discussion Topics
    • Non-public sharing of born-digital materials
    • Potential use of W3C’s Screen Capture for digital preservation
    • Secondary server storage

Standards and Practice Interest Group

  • Standards and Practices welcomed a new co-chair, Ann Hanlon (UWM), who joined continuing co-chair, Felicity Dykas. 
  • The Standards and Practices Interest Group held quarterly meetings, with the following agendas:
    • January: A presentation by Lynda Schmitz Fuhrig, Digital Archivist at the Smithsonian, on preservation standards for digital video files. It was well attended and provided quite a bit of useful information.
    • April: A working session to identify glossaries that address terminology used in digital preservation. This will be re-reviewed and posted in 2023.
    • July: We discussed staffing for digital preservation, using two slides from the Staffing Survey questionnaire as a jumping off point. Attendees expressed challenges with staffing, and noted different staffing models.
    • October: We ended the year with a presentation on the Digital Preservation Coalition Competency Audit Toolkit (DPC CAT) given by Amy Currie and Sharon McMeekin of DPC. We appreciated the preview of the Toolkit, which was publicly released shortly after the meeting.

Working Groups

Communication and Publications Working Group

The Communications and Publications group works to support Leadership and co-chairs of the Interest and Working groups through creating documentation, updating the website, and posting to social media and the NDSA blog.  The items below are highlights of completed activities in 2022.

DigiPres Conference Organization Committee

  • The 2022 DigiPres Conference was held October 12-13 in Baltimore, Maryland. Highlights from the conference can be found in this wrap-up post. The opening plenary video will be released soon; keep your eyes open for an announcement.
  • A virtual session to accommodate additional sessions from the 2022 Conference is being planned for February 2023.  
  • The 2023 DigiPres Conference will be held at the St. Louis Union Station Hotel in St. Louis, Missouri on November 15-16.

Excellence Awards Working Group

  • In line with our new working agreement with the Digital Preservation Coalition (DPC), only the DPC Digital Preservation Awards were awarded in 2022.  In 2023, the NDSA Excellence Awards will be presented at the annual Digital Preservation conference.  
  • If you are interested in participating in this group, keep your eyes out for a call for participation!

Levels of Digital Preservation

  • The steering group has established a Levels of Preservation ‘office hour’. This is held once every 2 months and provides a forum for members of the community to drop in and discuss the Levels and ask questions. Specific topics covered within these sessions include a focus on community archives and their use of the levels and a discussion about documentation. Do come along to future ‘office hour’ sessions – we would love to see you there! (See the NDSA Calendar of Events for specific dates)
  • We were excited to cheer the Levels of Preservation through several rounds of the World Cup of Digital Preservation! Unfortunately they were eventually beaten by PRONOM, which then went on to win the tournament!
  • A Dutch translation of the NDSA Levels has been published. A big thank you to Lotte Wijsman for providing this!
  • The group continues to respond to comments and feedback on the Levels. Do use our feedback form if you would like to share your thoughts on the Levels and associated resources with the steering group.

Membership Working Group

  • A newly formed Membership Working Group grew out of the 2021 Membership Task Force, which conducted a survey on a wide range of membership issues, and published a report on their findings.
  • We will build our work around the findings of the report published by the Membership Task Force. 
  • If you are interested in joining this working group, we are still recruiting participants, please see our call to get involved!

Staffing Survey

  • The 2021 Staffing Survey Report was published in August. The report documents survey responses from 269 individuals, covering topics such as digital preservation activities and staffing qualifications. Additional information for review, including the Survey codebook and data files, is also available in the NDSA OSF.
  • In September, members of the Staffing Survey Working Group presented a peer-reviewed panel presentation at iPres in Glasgow, Scotland. A short paper is available in the conference proceedings (p. 424). An additional presentation at the DigiPres conference in October led to an engaging audience conversation about digital preservation staffing.

Web Archiving Survey

  • The Web Archiving Survey Group was re-established to refresh the survey, which was last distributed in 2017.
  • 190 survey responses were received — 72.6% from institutions and individuals in the United States, and 27.4% from international institutions and individuals.
  • The report is currently being worked on and we are aiming for a Spring 2023 release!

The post NDSA 2022 Year in Review appeared first on DLF.

A Fast and Full-Text Search Engine for Educational Lecture Archives / Code4Lib Journal

E-lecturing and online learning have become more common and convenient than offline teaching and classroom learning in the academic community after the COVID-19 pandemic. Universities and research institutions are recording the lecture videos delivered by faculty members and archiving them internally. Most of the lecture videos are hosted on popular video-sharing platforms using private channels. Students access published lecture videos independent of time and location. Searching large video repositories is difficult for students because search is restricted to metadata. We present a design and an open-source application for building an educational lecture archive with fast, full-text search within the video content.

Click Tracking with Google Tag Manager for the Primo Discovery Service / Code4Lib Journal

This article introduces practices at the library of Oregon State University aiming to track the usage of Unpaywall links with Google Tag Manager for the Primo discovery interface. Unpaywall is an open database of links to full-text scholarly articles from open access sources[1]. The university library adds Unpaywall links to Primo that will provide free and legal full-text access to journal articles to the patrons to promote more usage of open-access content. However, the usage of the Unpaywall links is unavailable because Primo does not track the customized Unpaywall links. This article will detail how to set up Google Tag Manager for tracking the usage of Unpaywall links and creating reports in Google Analytics. It provides step-by-step instructions, screenshots, and code snippets so the readers can customize the solution for their integrated library systems.

Creating a Custom Queueing System for a Makerspace Using Web Technologies / Code4Lib Journal

This article details the changes made to the queueing system used by Virginia Tech University Libraries' 3D Design Studio as the space was decommissioned and reabsorbed into the new Prototyping Studio makerspace. This new service, with its greatly expanded machine and tool offerings, required a revamp of the underlying data structure and was an opportunity to rethink the React and Electron app used previously in order to make the queue more maintainable and easier to deploy moving forward. The new Prototyping Queue application utilizes modular design and auto building forms and queues in order to improve the upgradeability of the app. We also moved away from using React and Electron and made a web app that loads from the local filesystem of the computer in the studio and runs on the Svelte framework with IBM's Carbon Design components to build out functionality with the frontend. The deployment process was also streamlined, now relying on git and Windows Batch scripts to automate updating the app as changes are committed to the repository.

Designing Digital Discovery and Access Systems for Archival Description / Code4Lib Journal

Archival description is often misunderstood by librarians, administrators, and technologists in ways that have seriously hindered the development of access and discovery systems. It is not widely understood that there is currently no off-the-shelf system that provides discovery and access to digital materials using archival methods. This article is an overview of the core differences between archival and bibliographic description, and discusses how to design access systems for born-digital and digitized materials using the affordances of archival metadata. It offers a custom indexer as a working example that adds the full text of digital content to an Arclight instance and argues that the extensibility of archival description makes it a perfect match for automated description. Finally, it argues that building archives-first discovery systems allows us to use our descriptive labor more thoughtfully, better enable digitization on demand, and overall make a larger volume of cultural heritage materials available online.

Data Preparation for Fairseq and Machine-Learning using a Neural Network / Code4Lib Journal

This article aims to demystify data preparation and machine-learning software for sequence-to-sequence models in the field of computational linguistics. The tools, however, may be used in many different applications. In this article we detail what sequence-to-sequence learning looks like using code and results from different projects: predicting pronunciation in Esperanto, predicting the placement of stress in Russian, and how open data like WikiPron (mined pronunciation data from Wiktionary) makes projects like these possible. With scraped data, projects can be started in automatic speech recognition, text-to-speech tasks, and computer-assisted language-learning for under-resourced and under-researched languages. We will explain why and how datasets are split into training, development, and test sets. The article will discuss how to add features (i.e. properties of the target word that may or may not help in prediction). By scaffolding the tasks and using code and results from these projects, it’s our hope that the article will demystify some of the technical jargon and methods.
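The article explains the reasoning behind the splits; mechanically, an 80/10/10 split can be as simple as the sketch below. The Esperanto-style grapheme/phoneme pairs are invented placeholders standing in for mined WikiPron data.

```python
import random

# Placeholder grapheme/phoneme pairs standing in for mined WikiPron data.
pairs = [("saluton", "s a l u t o n"), ("hundo", "h u n d o"),
         ("kato", "k a t o"), ("libro", "l i b r o")] * 250

random.seed(0)          # reproducible split
random.shuffle(pairs)

n = len(pairs)
train = pairs[: int(0.8 * n)]             # 80% for training
dev = pairs[int(0.8 * n): int(0.9 * n)]   # 10% for tuning / early stopping
test = pairs[int(0.9 * n):]               # 10% held out for final evaluation

print(len(train), len(dev), len(test))
```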

DRYing our library’s LibGuides-based webpage by introducing Vue.js / Code4Lib Journal

At the Kingsborough Community College library, we recently decided to bring the library’s website more in line with DRY principles (Don’t Repeat Yourself). We felt this could improve the site by creating more concise and maintainable code. DRYer code would be easier to read, understand and edit. We adopted the Vue.js framework in order to replace repetitive, hand-coded dropdown menus with programmatically generated markup. Using Vue allowed us to greatly simplify the HTML documents, while also improving maintainability.

Revamping Metadata Maker for ‘Linked Data Editor’: Thinking Out Loud / Code4Lib Journal

With the development of linked data technologies and launch of the Bibliographic Framework Initiative (BIBFRAME), the library community has conducted several experiments to design and build linked data editors. While efforts have been made to create original linked data 'records' from scratch, less attention has been given to copy cataloging workflows in a linked data environment. Developed and released as an open-source application in 2015, Metadata Maker is a cataloging creation tool that allows users to create bibliographic metadata without previous knowledge in cataloging. With new linked data sources added, including auto-suggestion of Virtual International Authority File (VIAF) personal names and Library of Congress Subject Heading (LCSH) recommendations based on the user's text input, Metadata Maker might have the potential to be adopted by paraprofessional catalogers in practice. This article introduces those new features, shares the user testing results, and discusses the possible future steps.

Using Python Scripts to Compare Records from Vendors with Those from ILS / Code4Lib Journal

An increasing challenge libraries face is how to maintain and synchronize the electronic resource records from vendors with those in the integrated library system (ILS). Ideally vendors send record updates frequently to the library. However, this is not a perfect solution, and over time a problem with record discrepancies can become severe with thousands of records out of sync. This is what happened when, at a certain point, our acquisitions librarian and our cataloging librarian noticed a big record discrepancy issue. In order to effectively identify the problematic records among tens of thousands of records from both sides, the author of this article developed some solutions to analyze the data using Python functions and scripts. This data analysis helps to quickly scale down the issue and reduce the cataloging effort.
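The article has the full workflow; the core move, comparing identifier sets from two exports, might look something like this sketch. The file names and column name are assumptions for illustration, not the author's actual code.

```python
import csv

def load_ids(path, column):
    """Collect a set of identifiers (e.g., OCLC numbers) from a CSV export."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {row[column].strip() for row in csv.DictReader(handle) if row[column]}

vendor_ids = load_ids("vendor-export.csv", "oclc_number")  # placeholder files
ils_ids = load_ids("ils-export.csv", "oclc_number")

print("In vendor feed but missing from ILS:", len(vendor_ids - ils_ids))
print("In ILS but no longer in vendor feed:", len(ils_ids - vendor_ids))
```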

Weeknote 3, 2023 / Mita Williams

§1: Market Research Tools vs. Academic Research Tools §2: The Legal Layer that Enables Inequality §3: Simple Opt Out

Issue 96: Metadata / Peter Murray

Metadata is at the core of what libraries do. (“metadata” is one of the most common tags on this here library technology blog.) We gather information about the resources available to patrons, then massage it and slice it and sort it and display it in ways that help patrons find what they need. I was thinking about metadata because of a thread that ran here in Thursday Threads 12 years ago. That is the first thread in this issue, and it is followed by more recent articles about metadata.

Feel free to send this newsletter to others you think might be interested in the topics. If you are not already subscribed to DLTJ’s Thursday Threads, visit the sign-up page. If you would like a more raw and immediate version of these types of stories, follow me on Mastodon where I post the bookmarks I save. Comments and tips, as always, are welcome.

The Most Important Library Metadata

Screen capture of a table, which is described in the text below. Table 2.1—MARC tags occurring in 20% or more of WorldCat records—from 'Implications of MARC Tag Usage on Library Metadata Practices'

The outliers in this case are those elements that appear in a large number of records — that is, what might be considered “core” elements that are used to describe the vast majority of library owned material.

Those “outliers” can be categorized according to three general purposes:

  • Provenance and Identity: identifiers (e.g. ISBN, OCLC, etc.) and cataloging source (040)
  • Elements useful for discovery: title statement (245), personal names (100, 700) and subject (650)
  • Elements useful for understanding and evaluation: publication statement (260), physical description (300), and notes (500)

That’s it. In a nutshell you have the very core of bibliographic description as defined by librarians over the last century or so.

The Core of Bibliographic Description, Roy Tennant, OCLC Research's "Hanging Together" blog, 17-Jan-2011

Back in 2011, Roy Tennant posted a summary of this research on the fields that libraries use most often when describing stuff. It pointed to a report from which the above table is taken, Implications of MARC Tag Usage on Library Metadata Practices. WorldCat is the name of a database of records from libraries. As the table shows, every record has some mandatory elements: a control number, a set of codes called the “fixed length data elements”, a source for where the record came from, and the title. Most records also have an imprint (96%) and a physical description (number of pages, etc.—91%). In library-speak, the “main entry” is the person or organization responsible for the work, and the research found that 61% of records had a personal “main entry”…what we could commonly call the author. 46% of records have a topical subject and 44% had a descriptive note of some sort. Rounding out the last entries in the table, 28% of records had an additional responsible entry—a second author or illustrator or similar—and 23% of records had an ISBN.

Now library records have hundreds of fields and variations of fields to describe more esoteric aspects of a work, but these are the most common. Just a fun fact for your next dinner party.
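Out of curiosity, here is a minimal sketch of how one might compute tag frequencies like these over a file of MARC records, assuming the pymarc library. The file name is a placeholder, and this is not how the cited report generated its WorldCat-scale numbers.

```python
from collections import Counter
from pymarc import MARCReader

tag_counts = Counter()
record_total = 0

with open("records.mrc", "rb") as handle:   # placeholder file of MARC records
    for record in MARCReader(handle):
        record_total += 1
        # Count each tag once per record, mirroring "percent of records" stats.
        for tag in {field.tag for field in record.fields}:
            tag_counts[tag] += 1

for tag, count in tag_counts.most_common(10):
    print(f"{tag}: {count / record_total:.0%} of records")
```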

ICOLC Statement on the Metadata Rights of Libraries

Metadata and the metadata services that describe library collections are critical in supporting content discovery, knowledge creation, and libraries’ public missions. Metadata describing library collections is not typically copyrightable, and should be considered freely shareable and reusable under most circumstances. However, some industry players restrict libraries’ rights to use such metadata through contractual terms and market influence. Such restrictive activity is out of alignment with libraries’ needs and public, not-for-profit/educational missions.

The endorsers of this document urge all organizations, whether for-profit or not-for-profit, to uphold libraries’ rights and interests to use, re-use, adapt, aggregate, and share metadata that describes library collections to serve the public interest, without restriction or limitation.

ICOLC Statement on the Metadata Rights of Libraries , International Coalition of Library Consortia (ICOLC) website, 26-Aug-2022

As you might guess, metadata is really important to libraries. And libraries have a natural ethos to cooperate with one another to share the burden of creating and maintaining that metadata. Not all of the actors in the metadata world, though, are naturally so cooperative, and this statement from ICOLC emphasizes the need to keep this cooperative ethos going for the benefit of all libraries.

What’s in a name?

Because of both the great variety of name forms and the variability of applications that make use of names, I recommend a metadata vocabulary that follows the principle of minimum semantic commitment. This means a vocabulary that includes broad classes and properties that can be used as is where detailed coding is not needed or desired, but which can be extended to accommodate many different contexts.
What's in a Name?, Coyle's InFormation, 26-Jan-2022

This is a great article if you like going down rabbit holes. In the Western world (home to the overwhelming majority of Thursday Threads readers), you’ll see a form ask you for first name, last name, and sometimes middle name. What happens, though, when names are much more complex than that:

  • When you add a birth date and/or death date to disambiguate people with the same name.
  • When you encounter titles (“Sir”) or a generational (“Jr.”) suffix.
  • What about cultures that, by convention, put the family name before the given name?

These are the types of things that Karen Coyle explores in this blog post.

Who me?

Photograph of a white cat with black splotches sitting on a wicker cat tree in front of a sliding door. The cat's body is facing a way but it has its head twisted back towards the camera.

Ah, the innocence of a cat face. Who would believe this cute face could cause any trouble.

Hopefully Alan is looking healthier now. He was on a steady path towards completely rejecting the food we had for him. Fortunately, we found some wet food and some kibble that he likes to eat and is good for his sensitive stomach. Unfortunately, though, we’ve had one room-clearing episode of flatulence that may be caused by this new food, so we might be back to the drawing board.

IIPC Technical Speaker Series: Archiving Twitter / Harvard Library Innovation Lab

I was invited by the International Internet Preservation Consortium (IIPC) to give a webinar on the topic of “Archiving Twitter” on January 12.

During this talk, I presented what we’ve learned building thread-keeper, the experimental open-source software behind social.perma.cc, which allows for making high-fidelity captures of twitter.com URLs as “sealed” PDFs.

Here are the slides I used as a background for my presentation:

Open Knowledge as an essential design principle for a free, sustainable and open future / Open Knowledge Foundation

Over the last two decades, Open Knowledge Foundation worked at the intersection of cutting-edge digital tools and a distributed network of communities and movements to serve the public interest, shaping and democratising the technologies and knowledge of our time, making its powerful combination “ours” and testing them in different fields. 

In the last year, our organisation has provided deep technical knowledge and advocacy around standards and new definitions that will shape what openness means more broadly. We have designed skills programmes to be deployed globally, improved digital infrastructures in the public interest, activated new communities to connect open knowledge and data to solve their problems, and consolidated a powerful, decentralised network that is testing replicable open models to improve our neighbourhoods, our cities, our elections and our health systems, sharing the lessons, code and practices to scale impact.

We built solid alliances with the public sector, celebrated our first global convening after the pandemic restrictions, served and guided the private and philanthropic sectors in their quest to open data and open knowledge principles, and worked closely with academia, aid agencies and with International Organisations to unleash the power of open access and reproducible research.

We created digital tools, equipped people with skills and expanded digitally savvy networks that understood the transformative power of openness as defined in the Open Definition. We explored how openness and open knowledge can play a role in shaping the infrastructures and institutions of the future, combined with a different set of principles rooted in the local, decentralised and green-sustainable digital commons logic, and how innovative knowledge systems designed to fit a more democratic future will play a role in achieving the sovereignty of our digital sphere of action.

In 2023, we will keep enabling individuals, collectives, communities, institutions and governments so openness and open knowledge are adopted in their design and daily practice to ensure access to critical information that will empower humans to solve the most pressing problems of our times, leading to a free, fair and open future for everyone, everywhere.

We will do it by creating tools, developing models, harnessing communities and advocating for standards and policies in a sustainable, ethical and agile manner. And we will not do it alone; we will continue growing our network and relying on alliances that make us stronger and more effective in shaping the future we want.

#onpoli Tweet Volume / Nick Ruest

Lately I’ve been making use of my Twitter Academic Research Access while it still exists by grabbing #onpoli tweets. It took two monthly cycles with the 10,000,000 tweet cap per month. The dataset I have now runs from the very first tweet to use the hashtag on July 14, 2009 through December 31, 2022. Overall, there are 16,458,701 tweets.

I’m not exactly sure what questions I’m going to ask of the dataset as of now. But, I wanted to get a sense of what the historical volume looked like over time, so I threw together a bar chart using pandas and Altair. Below is what the distribution of the 16,458,701 tweets looks like by month. The pink-coloured bars mark each election cycle.

#onpoli tweet volume July 14, 2009 - December 31, 2022
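The post does not include the plotting code, but a minimal sketch of a monthly volume chart with pandas and Altair might look like the following. The input file and column names are assumptions, not the author's actual dataset layout.

```python
import pandas as pd
import altair as alt

# Assumed input: one row per tweet with a 'created_at' timestamp column.
tweets = pd.read_csv("onpoli-tweets.csv", parse_dates=["created_at"])

# Count tweets per calendar month.
monthly = (
    tweets.set_index("created_at")
          .resample("M")
          .size()
          .reset_index(name="tweets")
)

chart = (
    alt.Chart(monthly)
       .mark_bar()
       .encode(x=alt.X("created_at:T", title="Month"),
               y=alt.Y("tweets:Q", title="Tweets"))
)
chart.save("onpoli-volume.html")
```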

Medical/Life Sciences Graduate or Undergraduate Student Hourly – Biomedical Literature Annotation for Citation Accuracy/Integrity (10 hours per week, spring semester) – School of Information Sciences – University of Illinois at Urbana-Champaign / Jodi Schneider

The ScienceNLP Lab and the Information Quality Lab at the School of Information Sciences (iSchool) are seeking a University of Illinois at Urbana-Champaign student to read and annotate health-related literature. Training and knowledge in a medical/life sciences field such as biology, chemistry, bioinformatics, food science and nutrition, or bioengineering is essential. The hourly will work an average of 10 hours per week for the spring semester, under the co-supervision of Dr. Halil Kilicoglu and Dr. Jodi Schneider. The project focuses on assessing biomedical publications for citation accuracy and integrity. Your role in this project will be to locate citation statements in biomedical articles and assess their accuracy with respect to the cited articles. You will collaborate with other annotators on this task. This work is part of the project Natural Language Processing to Assess and Improve Citation Integrity in Biomedical Publications, funded by the Office of Research Integrity (ORI).

Project Description: While citations play a fundamental role in scientific knowledge diffusion and research assessment, they are often inaccurate (e.g., citation of non-existent findings), undermining the integrity of scientific literature and distorting the perception of available evidence. A recent meta-analysis showed that 25.4% of medical articles contained a citation error. A bibliometric analysis revealed that inaccurate citations of a letter published in 1980 may have contributed to the opioid crisis. The project will develop and validate resources and models that aid stakeholders in assessing biomedical publications for citation accuracy and integrity. The publicly available annotated corpus you help create will be used to develop natural language processing/artificial intelligence (NLP/AI) models for assessing reporting quality in biomedical articles.

Duties include:

  • Reading and annotating biomedical publications for citation integrity/accuracy
  • Contribution to development of annotation guidelines
  • Contribution to scientific presentations and publications

Required qualifications:

  • Background in a field such as: medicine, life sciences, including biology, chemistry, bioinformatics, food science and nutrition, bioengineering, or a related field.
  • Excellent English reading comprehension skills
  • Excellent communications skills in written and spoken English
  • Excellent analytical/critical thinking skills
  • Effective time management skills, attention to detail

Preferred qualifications:

  • Interest in topics such as trustworthy science, research rigor/quality, reproducibility
  • Interest in biomedical data science, bioinformatics, or related fields
  • Availability for multiple semesters

Interested candidates should send their CV/resume and a short statement of purpose drawing attention to their training in medicine or life sciences (e.g., biology, chemistry, bioinformatics, food science and nutrition, bioengineering, or a related field) to Halil Kilicoglu (halil@illinois.edu) and Jodi Schneider (jodi@illinois.edu). Review of applications will begin immediately. Applications will be accepted until the position is filled.

Posted on Handshake and the Virtual Job Board.

Legal and Ethical Considerations for Providing Access to Born-Digital Collections: Copyright / Digital Library Federation

This blog is part four of a series comprising a set of primers that address ten complex legal and ethical issues encountered when providing access to born-digital archival records. The guidance is grounded in two foundational beliefs: 1. There is a potential to harm individuals, communities, and institutions if access is not thoughtfully and strategically viewed through an ethical and legal lens; and 2. It is not ethical to restrict access unnecessarily. For more information see: https://osf.io/ketr7/.


Overview

Copyright in the United States is defined as, “A form of protection provided by the laws of the United States for “original works of authorship”… “Copyright” literally means the right to copy but has come to mean that body of exclusive rights granted by law to copyright owners for protection of their work” (U.S. Copyright Office).

Archives are filled with copyrighted materials, and Archivists have been providing access to them for years. So what is different about born-digital archives, and what do we need to consider when providing access to them? 

The method of access is unique to digital content: namely, it must be rendered using a machine, and this machine rendering requires multiple copies to be made. Section 108 of the US Copyright Statute authorizes libraries and archives to make copies for preservation and access purposes, but there are limitations to this provision that have serious legal consequences for born-digital access.

The access affordances are unique to digital content. It can be reproduced and distributed online more easily than physical material, and therefore has the potential to reach a wide public audience. If made openly available online without clearly articulated access, use, and takedown policies, this level of visibility increases the likelihood that a copyright holder will challenge the repository (Hirtle 2015, 2).

The scope of copyrighted material is unique to born-digital collections. Like most archival collections, modern born-digital collections often largely comprise unpublished material. Additionally, they are usually sizable, and are likely to contain works by more creators than physical collections, which is particularly true for collections containing email. Donors can only transfer copyright or license use of the email messages they authored, not the messages they received. Thus, it is reasonable to expect an email archive to contain copyrighted materials from hundreds or thousands of authors (Briston 2015, 26). This presents a logistical challenge for institutions seeking to provide online public access to this material.

Most likely to come up in

  • Collections consisting largely of material that was created for commercial intent, such as photography collections, literary papers, and papers of artists, musicians, etc.
  • Collections containing email
  • Corporate archives

Actions for the institution to take

Avoid acquiring collections from donors who impose reproduction and/or use restrictions.

Negotiate open licenses with donors if the donor owns copyright. As the majority of born-digital records were created after 1989, most born-digital collections comprise comparatively “young” materials that are about 30 years old or less (as of the publication of this document). Thus, copyright for these collections may generally begin to expire around 2070 (for earlier digital records), but can extend to 2140 or later (for more recently created records).[1] Traditional donation agreements prompt donors to retain copyright by default. To promote broad use of collections, talk with donors about the implications of retaining copyright, such as increased barriers to access due to Section 108 of the US Copyright Statute (see below), as well as general barriers to diverse use of the material. Pursue one of several avenues:

Ask the donor to transfer or license copyright to the institution upon donation, after a specified period of time after donation (e.g., 20 years), or upon the donor’s death. This could be accomplished with either an exclusive or a non-exclusive license. A non-exclusive license gives the institution the rights it needs for its educational and/or research mission, and the donor is happy to still have “control of the rights.” This is an especially useful strategy when the donor still wants to benefit commercially from their materials.

  • Alternatively, ask the donor to license their works under a Creative Commons license, such as a CC BY attribution license (users must credit the creator) or a CC0 dedication (which places the work in the public domain) (Callahan 2019). Document the terms of the license in the deed of gift.

  • Or consider a compromise between the two options above: ask the donor to transfer copyright to the institution, and the institution pledges to release it under a CC BY attribution license.[2]

Develop and/or update the following policies that impact access to and use of copyrighted born-digital records. Ask for these to be vetted by legal counsel.

  • Reproduction and use policy. This clarifies that it is the user’s responsibility to evaluate fair use and/or obtain permission to use reproductions if their intended use is beyond the scope of fair use.
  • Online access disclaimer on digital collections site. A boilerplate note stating that the institution makes the content available for educational and/or research purposes, and clarifying the user’s responsibilities if they intend to use the content (perhaps linking back to your reproduction and use policy).[3]
  • Deed of gift or an addendum to the deed that specifically addresses born-digital. This should clarify transfer or non-transfer of copyright, include permission to provide online access to born-digital content by default, document any donor-applied access restrictions, and specify that the institution has permission to circumvent encryption and password protections for the purposes of preservation and access. The deed of gift must be completed prior to accessioning.[4]

Support staff responsible for acquisition, processing, and preservation in developing an understanding of Digital Rights Management (DRM), as software subject to technological protection measures may show up in archival collections and is sometimes preserved to provide access to born-digital material.[5]

Actions for the archivist to take

Document copyright ownership during the acquisition or accessioning phase. While it is not the archivist’s responsibility to research and identify the copyright holder on behalf of the user and/or determine if their use violates copyright law, documenting the copyright owner supports use of collection material by both the user and the institution.

Undertake a fair use assessment before making born-digital records available online. Does copyright law impact your ability to reproduce and provide access to the material under fair use? Maintain documentation of this assessment.

Maintain consistent documentation of the research, reasoning, and justification behind decisions to make born-digital content available in the specific way it is made available, such as via a publicly accessible online platform. Documentation demonstrates diligence of the investigation and helps to support the fair use determination if it is called into question.

Consider your resources and plan strategically. If making born-digital records available online, it is necessary to carefully consider the immense amount of labor typically required to research copyright status and owner, obtain authorization to publish (if necessary), and create item-level metadata as it pertains to copyright.

Technical Infrastructure

See Recommendations For All Collections section in this document.

Legal considerations

There are three sections of US copyright law that specifically impact born-digital access: Section 108, Section 107 (Fair Use), and the Digital Millennium Copyright Act (DMCA).

Section 108: Section 108 of Title 17 of the US Copyright Statute went into effect on January 1, 1978. It supplements the exemptions in Section 107 by providing limited exceptions for libraries and archives to reproduce and distribute copyrighted material for the purposes of preservation, replacement, and fulfilling patron requests, without seeking permission from the copyright holder. The purposes and limitations that are most relevant to born-digital access are:

  • Patron requests: only certain types of works can be reproduced in response to patron requests under Section 108. Most audiovisual works, including musical and motion picture works, and pictorial and graphic works are not covered. However, most textual works may be reproduced for patrons if the other requirements in the section are met (Hirtle, Hudson, Kenyon 2009, 117). For instance, a library may reproduce a section of unpublished emails in a collection for a patron if the following conditions are met: the emails are not available elsewhere “at a reasonable price,” once reproduced they become the “property” of the patron (i.e. they are delivered to them), a copyright notice is included in the reproduction order form, and the patron acknowledges that the reproduction is for private study, scholarship, or research purposes.

  • Preservation and replacement: Section 108 also provides exemptions for up to three copies (including digital copies) of a copyrighted work to be made for preservation purposes, and for a replacement copy (including a digital copy) if the work is manifested in an obsolete format. However, digital copies made for preservation and replacement cannot be subsequently distributed in digital format, nor can the digital copy be used “outside the premises” of the institution. Thus, these copies cannot be made available to patrons remotely under provisions of Section 108. (However, this purpose may be exempted under Fair Use. See below.)

Section 108 was not created with digital content in mind, and it leaves a lot to be desired in how copyrighted works are preserved, accessed, and used. By necessity, archivists must make multiple copies of born-digital records, often more than three, in order to acquire, process, preserve, and make them accessible. Furthermore, the exceptions in Section 108 do not cover remote public access to unpublished material in an online environment, such as on a publicly accessible digital collections site or virtual reading room.[6] (However, if the institution takes measures to limit remote access to communities that are not considered the public at large (e.g. members of a university community), the exceptions provided in Section 108 may provide safe harbor for online access to such material.) The Library of Congress convened a Section 108 Study Group in 2005 to update the provision for the digital age. The group issued its report in 2008, but its proposed changes were not adopted. In 2017, the US Copyright Office issued a discussion document that restated the need to update Section 108 to more accurately reflect changes in technology and better support libraries, archives, and museums.[7]

Section 107: Archivists rely on fair use to support the provision of remote access. Fair use permits the unlicensed use of copyright-protected works in certain circumstances, such as criticism, comment, news reporting, teaching, scholarship, and research. It requires consideration of four factors in evaluating a given use case: 1. the purpose and character of the use; 2. the nature of the copyrighted work; 3. the amount and substantiality of the portion used; and 4. the effect of the use on the potential market for or value of the copyrighted work (U.S. Copyright Office). In many circumstances, organizations can legally and ethically provide access to copyrighted materials, including born-digital content, under the allowances of fair use. It is important to note that while born-digital collections may contain a larger quantity of copyrighted works than paper collections, the allowances of fair use still apply regardless of format.

Digital Millennium Copyright Act (DMCA): The DMCA was passed in 1998. It criminalizes circumvention of encryption mechanisms that control access to digital content (Briston 2015, 34). In the context of born-digital archives, this means an archivist cannot legally copy or provide access to an encrypted file unless permission or login information is provided by the copyright holder. This presents a problem for archivists seeking to preserve password-protected or encrypted content, which is particularly challenging for the preservation of software (see below).

Special considerations for software: Software preservation is a necessary aspect of digital preservation and access, yet because most software is copyrighted and much of it is commercially produced, almost every part of the software preservation process can potentially infringe copyright. Institutions generally rely on fair use to support preservation efforts as they largely do not negatively impact the market for the copyrighted software, and preservation is considered transformative in that it serves a different use than the original market purpose (Aufderheide, Butler, Cox, Jaszi 2019).

Additionally, software preservation often requires circumvention of “technological protection measures” (such as a password or product key), thereby violating the DMCA. However, the Library of Congress has adopted temporary exemptions for libraries and archives to make these circumventions under certain conditions. These exemptions are reviewed every three years in a rulemaking proceeding. The latest proceeding in 2021 upheld these exemptions until October 21, 2024 (U.S. Copyright Office).

Ethical considerations

“…many of the issues that archivists face in managing rights are not matters of law, but of professional ethics. The financial damage that can accrue from reproducing an unregistered work is likely to be almost nothing, but the possible damage to an institution’s reputation may be great if it acts without due regard for both those who created materials and for those who use them” (Hirtle 2015, 5).

Copyright concerns around access to born-digital archives are complex and uncertain. But this uncertainty should not be used to justify overly conservative access restrictions. Following standards of practice arrived at by consensus in the archival and library communities, such as ARL’s Code of Best Practices in Fair Use for Academic and Research Libraries, and taking a reasonable approach to access supports ethical access to copyrighted born-digital records.

Case study

Google v. Oracle (2020-2021). See Butler, Brandon. “Google v. Oracle: Takeaways for Software Preservation, Cultural Heritage, and Fair Use Generally.” Software Preservation Network, June 3, 2021. Available at https://www.softwarepreservationnetwork.org/google-v-oracle-takeaways-for-software-preservation-cultural-heritage-and-fair-use-generally-2021-reflection/

Works Cited

Aufderheide, Patricia, Brandon Butler, Krista Cox, and Peter Jaszi. Code of Best Practices in Fair Use for Software Preservation. Washington, DC: Association of Research Libraries, 2019.

Briston, Heather. “Understanding Copyright Law.” In Rights in the Digital Era, edited by Behrnd-Klodt, Menzi L., and Christopher J. Prom. Chicago: SAA, 2015.

Callahan, Maureen and Heather Briston. “Radical Access—Leveraging Creative Commons Licenses to Open up Archives.” Presentation in OCLC Works In Progress series. OCLC, 2019. Available at https://www.oclc.org/research/events/2019/121019-leveraging-creative-commons-licenses-open-archives.html

Hirtle, Peter B. “Introduction.” In Rights in the Digital Era, edited by Behrnd-Klodt, Menzi L., and Christopher J. Prom. Chicago: SAA, 2015.

Hirtle, Peter B., Emily Hudson, and Andrew T. Kenyon. Copyright and Cultural Institutions: Guidelines for Digitization for U.S. Libraries, Archives, and Museums. Ithaca, NY: Cornell University Library, 2009. (Note: chapter 6 provides an excellent overview of Section 108.)

U.S. Copyright Office. “More Information on Fair Use.” Accessed May 4, 2022. Available at https://www.copyright.gov/fair-use/more-info.html

U.S. Copyright Office. “Section 1201 Exemptions to Prohibition Against Circumvention of Technological Measures Protecting Copyrighted Works.” Accessed May 4, 2022. Available at https://www.copyright.gov/1201/2021/?loclr=eanco

U.S. Copyright Office. “U.S. Copyright Office Definitions.” Accessed April 27, 2022. Available at https://www.copyright.gov/help/faq/definitions.html

Additional Resources

Copyright and Fair Use

Briston, Heather. “Contracts, Intellectual Property, and Privacy.” In The Digital Archives Handbook, edited by Aaron D. Purcell. Lanham: Rowman & Littlefield, 2019.

Code of Best Practices in Fair Use For Academic and Research Libraries. Association of Research Libraries, 2012. Available at https://publications.arl.org/code-fair-use/

“Copyright 101: Everything You Wanted to Know About Copyright But Were Afraid To Ask.” ALA Office for Information Technology Policy, Copyright Advisory Committee and Copyright Advisory Network Team, and Office of Government Relations Committee on Legislation, Intellectual Property Subcommittee. Available at http://www.ala.org/advocacy/sites/ala.org.advocacy/files/content/copyright/Section108.pdf

Hirtle, Peter B. “Copyright Term and the Public Domain in the United States.” Cornell University Library. Available at  https://copyright.cornell.edu/publicdomain

U.S. Copyright Office. “Revising Section 108: Copyright Exceptions for Libraries and Archives.” Accessed May 4, 2022. Available at https://www.copyright.gov/policy/section108/

The National Archives. “Copyright.” Accessed April 27, 2022. Available at https://www.nationalarchives.gov.uk/legal/copyright/

Digital Preservation and Access Workflows

Leventhal, Aliza, Laura Schroffel and Jody Thompson. “Deeds of Gift as a Tool to Facilitate born-digital Design File Processing and Preservation.” Society of American Archivists, April 2019. Available at https://www2.archivists.org/sites/all/files/Leventhal_Schroffel_Thompson_Deeds%20FINAL.pdf

“Responsible Access Workflows.” University of California Berkeley Library, 2021. Available at https://docs.google.com/presentation/d/1V66PGpIq9xqXxdvngpD3rkAMoIw2hIyVVDS4Iv4VFOM/edit#slide=id.g7ffa9d7047_0_2

“UCLA Library Special Collection Risk Assessment Guidelines.” University of California Los Angeles Library. Available at https://www.library.ucla.edu/sites/default/files/Guidelines_RiskAssessment.pdf

“Well-intentioned practice for putting digitized collections of unpublished materials online.” OCLC Research, revised May 29, 2010. Available at http://www.oclc.org/research/activities/rights/practice.pdf

Software and DMCA

Albert, Kendra, Daina Bouquin, Alena Farber, and Russell Hoover. “Copyright Guide for Scientific Software.” Software Preservation Network, 2019. Available at https://www.softwarepreservationnetwork.org/copyright-guide-for-scientific-software/

Albert, Kendra and Kee Young Lee. “A Preservationist’s Guide to the DMCA Exemption for Software Preservation.” Software Preservation Network, December 10, 2018. Available at https://www.softwarepreservationnetwork.org/a-preservationists-guide-to-the-dmca-exemption-for-software-preservation/

Enriquez, Ana. “Section 108 and Software Collections: A Users Guide.” Software Preservation Network, August 15, 2022. Available at https://www.softwarepreservationnetwork.org/section-108-and-software-collections-a-users-guide/

Contributors

Primary Author of this section: Kate Dundon, Supervisory Archivist, University of California, Santa Cruz
Co-Authors: Jessika Drmacich, Jess Farrell, Christina Velazquez Fidler, Hannah Wang, Camille Tyndall Watson

Thank you to the many community contributors for feedback and edits to Legal and Ethical Considerations for Born-Digital Access, from which this post was derived.

DISCLAIMER: This text has been reviewed and feedback incorporated from library professionals with legal expertise, but it was not originally drafted by a lawyer and is not legal advice. Please use this as a starting point to understand the issues, but always consult your local legal counsel or other support system as you make access decisions around born-digital collections.

——

[1] Copyright for born-digital records in the United States will expire according to the following terms most of the time: unpublished born-digital content: 70 years after the death of the author, or 120 years from the date of creation for works of unknown or corporate authorship; published born-digital content: 70 years after the death of the author, or 95 years from publication for works of corporate authorship. Copyright terms may be different in other jurisdictions. Other possibilities exist, such as Crown Copyright in the UK, Canada, and Australia. For example, in the UK such materials are typically released under the Open Government Licence.

[2] Licensing options were contributed by Kyle K. Courtney, Copyright Advisor and Program Manager at Harvard University.

[3] See the UCLA digital collections copyright statement for a good example that can be modified for born-digital. Available at: https://digital.library.ucla.edu/copyright_and_collections. Also see pg. 64 in Rights in the Digital Era for another version.

[4] For sample deeds of gift, see pg. 116-117 in Rights in the Digital Era. Also see Appendix F: Policies, Templates, Documents, etc. in AIMS Born-Digital Collections: An Inter-Institutional Model for Stewardship. Available at: http://www2.lib.virginia.edu/aims/whitepaper/AIMS_final.pdf

[5] Digital rights management (commonly known as DRM and sometimes called technological protection measures, or TPMs) control access to computer programs by requiring a user to do something before allowing the user to use the program. See Digital Millennium Copyright Act below.

[6] See 17 U.S.C. § 108(b)(2): “any such copy or phonorecord that is reproduced in digital format is not otherwise distributed in that format and is not made available to the public in that format outside the premises of the library or archives.”

[7] Section 108 of Title 17: a discussion document of the register of copyrights. Available at: https://www.copyright.gov/policy/section108/discussion-document.pdf

The post Legal and Ethical Considerations for Providing Access to Born-Digital Collections: Copyright appeared first on DLF.

Weeknote 2, 2023 / Mita Williams

§1: The Law Library of Congress Report Examines the Canadian Emergencies Act §2: Data Cartels §3: Make sure you know how to properly redact your PDFs §4: The volume of fear-mongering crime news §5: The Society for the promotion of radical analogue games

Towards "deep fake" web archives? Trying to forge WARC files using ChatGPT. / Harvard Library Innovation Lab

Chatbots such as OpenAI’s ChatGPT are becoming impressively good at understanding complex requests in “natural” language and generating convincing blocks of text in response, using the vast quantity of information the underlying models were trained on.
Garnering massive amounts of mainstream attention and rapidly making its way through the second phase of the Gartner Hype Cycle, ChatGPT and its potential amaze and fascinate as much as they bewilder and worry. In particular, more and more people seem concerned by its propensity to make “cheating” both easier to do and harder to detect.

My work at LIL focuses on web archiving technology, and the tool we’ve created, perma.cc, is relied on to maintain the integrity of web-based citations in court opinions, news articles, and other trusted documents.
Since web archives are sometimes used as proof that a website looked a certain way at a certain time, I started to wonder what AI-assisted “cheating” would look like in the context of web archiving. After all, WARC files are mostly made of text: are ChatGPT and the like able to generate convincing “fake” web archives? Do they know enough about the history of web technologies and the WARC format to generate credible artifacts?

Let’s ask ChatGPT to find out.

Making a fake web archive 101

What do I mean by “fake” web archive?

The most commonly used format for archiving web pages is Web ARChive (.warc), which consists of aggregated HTTP exchanges and meta information about said exchanges and the context of capture. WARC is mainly used to preserve web content, as a “witness” of what the capture software saw at a given URL at a given point in time: a “fake” web archive in this context is therefore a valid WARC file representing fabricated web content.

Do we really need the help of an AI to generate a “fake” web archive?

The WARC format is purposefully easy to read and write: very few fields are mandatory, and it is entirely possible to write one “from scratch” that playback software will accept. Although there is a plethora of software libraries available to help with this task, creating a convincing web archive still requires some level of technical proficiency.
What I am trying to understand here is whether novel chatbot assistants like ChatGPT - which are surprisingly good at generating code in various programming languages - lower that barrier of entry in a significant way or not.
A related question is whether these tools make it easier for sophisticated users to fake an entire history convincingly, such as by creating multiple versions of a site through time, or multiple sites that link to each other.
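To make that baseline concrete, here is a rough sketch in Python of what hand-writing a single-record WARC “from scratch” might look like. The HTML payload is a stand-in and the target URL is borrowed from the experiment described below; a real capture would also include warcinfo and request records, digests, and more:

    # Illustrative sketch only: a minimal, hand-rolled WARC "response" record.
    import uuid
    from datetime import datetime, timezone

    html = b"<html><head><title>About</title></head><body>...</body></html>"
    # The HTTP exchange the record carries as its payload.
    http_block = b"HTTP/1.0 200 OK\r\nContent-Type: text/html\r\n\r\n" + html

    warc_headers = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>\r\n"
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}\r\n"
        "WARC-Target-URI: http://lil.law.harvard.edu/about.html\r\n"
        "Content-Type: application/http; msgtype=response\r\n"
        # Content-Length must be the exact byte length of the block that follows.
        f"Content-Length: {len(http_block)}\r\n"
        "\r\n"
    ).encode("utf-8")

    with open("fake.warc", "wb") as output:
        output.write(warc_headers + http_block + b"\r\n\r\n")

As we will see, it is precisely this kind of bookkeeping (unique identifiers, dates, byte counts, digests) that trips the chatbot up.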

Asking ChatGPT to generate a fake web archive from scratch

For this first experiment, I asked ChatGPT to generate an “About” page for an imaginary 1998 version of LIL’s website, before wrapping it into a WARC file.

What LIL's website looked like in 1998, according to ChatGPT

Transcript:

While I had to provide detailed - and somewhat technical - instructions to make sure the resulting HTML document was “period correct”, the end result can be considered “convincing enough” from both a visual and technical standpoint, in the sense that it is not obvious that it was generated by a chatbot.
Some of the features I asked for are present in the code but do not render properly in modern browsers, which arguably makes it even more credible.

ChatGPT appears to know what a WARC file is, and is able to generate an output that resembles one. There are however a few important issues with the output it generated:

  • The WARC-Target-URI property is missing; there is therefore no association between the record and the URL it was supposed to originate from, http://lil.law.harvard.edu/about.html.
  • Every single Content-Length property is wrong, making the document impossible to parse correctly.
  • The unique identifiers ChatGPT issues are … not unique. See WARC-Record-ID for example.
  • The hashes are also placeholders, and don’t match the payloads they are meant to represent. See WARC-Block-Digest for example.

We can certainly ask ChatGPT to fix some of these mistakes for us, but, like every other large language model, it is generally incapable of performing actual computation. This makes it impossible for it to calculate the byte length of the HTML document it generated, which is a critically important component of a valid WARC file.

These limitations demonstrate the need, which is typical in applications of generative AI, to embed the language model itself in a larger framework to generate coherent results. If we wanted to do large scale fakery, we would likely look to the model to generate convincing period text and HTML, and use a custom tool to generate WARC records.
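As a sketch of what such a framework might delegate to conventional tooling, the example below uses warcio (one WARC library among several; the generated HTML and target URL are placeholders) to produce a record whose lengths, identifiers, and digests are computed correctly:

    # Sketch: wrap model-generated HTML in a WARC record using warcio, which
    # computes Content-Length, a unique WARC-Record-ID, and the digests itself.
    from io import BytesIO
    from warcio.warcwriter import WARCWriter
    from warcio.statusandheaders import StatusAndHeaders

    generated_html = b"<html><!-- model-generated, 'period correct' markup --></html>"

    with open("generated.warc.gz", "wb") as output:
        writer = WARCWriter(output, gzip=True)
        http_headers = StatusAndHeaders(
            "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0"
        )
        record = writer.create_warc_record(
            "http://lil.law.harvard.edu/about.html",  # becomes WARC-Target-URI
            "response",
            payload=BytesIO(generated_html),
            http_headers=http_headers,
        )
        writer.write_record(record)

The division of labor is the worrying part: the language model only has to produce plausible period text and markup, while the surrounding script quietly takes care of everything a validator would check.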

Asking ChatGPT to alter an existing web archive

We now know that ChatGPT is able to generate convincing-enough “period correct” HTML documents and to wrap them into (slightly broken) WARC files.
But can it edit an existing WARC file? Can it identify HTML content in a WARC file and edit it in place?

To figure that out, I took the half-broken web archive ChatGPT generated as the result of my first experiment and asked it to:

  • Add the missing WARC-Target-URI property on the first “record” entry of the file
  • Replace the title of the HTML document associated with the URL http://lil.law.harvard.edu/about.html

These tasks are text-based and ChatGPT was able to complete them on the first try.

Transcript:

Uncanny valley canyon

The experiments I conducted and described here are not only partly inconclusive, they also focus on extremely basic, single-document web archives.
Actual web archives are generally much more complex: they may contain many HTML documents - which are generally compressed - but also images, stylesheets, JavaScript files, and contextual information that a “faking” assistant would need to be able to digest and process appropriately. A language model cannot do all that on its own, but ad-hoc software embedding one just might.

It is therefore unclear how close we are to being able to generate entirely coherent multi-page or multi-site archives that pass initial review, but it seems clear that, over time, such archives will take less and less work to create, and more and more work to disprove.

Increasing our collective focus on developing and adopting technology to “seal and stamp” web archives, for example by using cryptographic signatures, could be a productive way to help deter tampering and impersonation attempts and reinforce the role of web archives as credible witnesses, regardless of how such attempts were performed.
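To make the “seal and stamp” idea slightly more tangible, here is a deliberately simplified sketch of one possible approach: hash the finished archive and sign the hash with a key held by the capturing institution. Real-world proposals involve trusted timestamps, certificate chains, and standardized packaging, and the key handling below is only a placeholder:

    # Simplified sketch of "sealing" a finished WARC file with a signature.
    import hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    with open("capture.warc.gz", "rb") as f:
        digest = hashlib.sha256(f.read()).digest()

    private_key = Ed25519PrivateKey.generate()  # in practice, an institutional key
    signature = private_key.sign(digest)

    # Anyone holding the matching public key can later confirm the archive has
    # not been altered since capture; verify() raises on any mismatch.
    private_key.public_key().verify(signature, digest)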

NDSA’s Digital Preservation Virtual Event! / Digital Library Federation

NDSA will be hosting a virtual event on February 23, 2023, for those presenters who were unable to participate in the in-person conference in Baltimore last fall. The event, held in collaboration with our host organization, CLIR, will consist of panels and presentations and will take place over Zoom Events. The schedule for this event is available on the website, while information regarding registration is forthcoming. Please mark your calendars and plan to join us for this special online event! All sessions will be free and open to the public.

CLIR will also host virtual presentations in April; more information about that program can be found on the DLF Forum Virtual Sessions web page.

The post NDSA’s Digital Preservation Virtual Event! appeared first on DLF.

Managing time demands: Francis Wade’s skill ladders / Jodi Schneider

For diagnosing problems with time management, the best thing I’ve found so far is Francis Wade’s idea of “skill ladders”, described in the 2014 Perfect Time-Based Productivity: How to rescue your peace of mind as time demands increase (Amazon; book website).

For instance, the “capturing” ladder has 7 levels, with the bottom “not even trying to commit tasks to memory” and the top “always using a backed up electronic device that is never far away”:

Skill ladder for capturing, from Francis Wade’s Perfect Time-Based Productivity: How to rescue your peace of mind as time demands increase

He lists multiple ladders:

A downloadable PDF form provides a checklist to identify where you are on each ladder.

A summary of his diagnosis-based approach to managing time demands is available from The Evergreen Guide to Choosing Your Next Task Management App (2017).

More recently he organized the Task Management & Time Blocking Virtual summit (2020 & 2021 & 2022, with the 2023 event planned for Thursday through Saturday March 2-4). You can find more of Francis Wade’s recent thinking via Framework Consulting and in his Quora answers.

on jubilee / María A. Matienzo

debts are wrought iron bars
     from which we all are
     due our remission.

we see them rusting,
     but they never strain
     until the hammer strikes.

do you remember
     our conversation
     about all the freedoms?

let us break shackles
     sooner than seven
     cycles of shmita.

trumpet lessons be
     damned, we buzz lips
     to absolve each other.

we will no longer
     owe one another;
     only serve as needed.

Balázs Bodó: ‘Digital commons are actually reproducing existing power inequalities’ / Open Knowledge Foundation

It’s 2023 already, and for us at the Open Knowledge Foundation, it really means new times. Last year, soon after the incorporation of Renata Ávila as our CEO, we announced a joint strategy of our advocacy and communications arms called 100+ Conversations to Inspire Our New Direction (#OKFN100).

Today, we are happy to announce the kick-off of this project.

We are meeting 100+ people to discuss the future of open knowledge, shaped by a diverse set of visions from artists, activists, academics, archivists, thinkers, policymakers, data scientists, educators and community leaders from all over the world. 

Our team wants to identify and debate issues sensitive to our movement and use this effort to constantly shape our actions and business strategies to deliver in the best possible way what the community expects from us and from our Network, a pioneering organization that has been defining the standards of the open movement for two decades. 

Another objective is to include the perspectives of people from diverse backgrounds, especially those from marginalised communities, from dissident identities, and whose geographic location is outside the world’s major financial powers.

How can openness speed up and strengthen the fight against the complex challenges of our times? This is the key question behind conversations like the one you can read below.

This release is not happening on any given day. Not coincidentally, today is the 10th anniversary of the death of Aaron Swartz, an iconic political organizer and internet hacktivist whose struggle for open knowledge continues to inspire our generation of activists. Keeping this agenda active and refreshed is our humble posthumous tribute to him.

*

We are kicking off #OKFN100 today with a conversation with Dr. Balázs Bodó, economist and researcher at the Institute for Information Law (IViR) at the University of Amsterdam.

Bodó is a recurrent voice in the intellectual property field and an expert in the legal discussions and social implications of access to knowledge, and consequently the barriers created to prevent it. Among several important articles and conferences in recent times, he contributed to the book Shadow Libraries: Access to Knowledge in Global Higher Education (edited by Joe Karaganis, MIT Press, 2018).

Before moving to the Netherlands, he was deeply involved in the development of the Hungarian internet culture. He was the project lead for Creative Commons Hungary and a member of the National Copyright Expert Group. In 2018 he received an ERC Starting Grant to study the legal and political implications of blockchain-based technologies, and started the Blockchain & Society Policy Research Lab. He has been invited by the European Commission to serve as an expert for various blockchain-related projects.

We spoke with Bodó at the end of November 2022, a few days after the arrest of the Z-Library creators in Argentina, and this was the topic from which we started to build a broader view around the recent threats to knowledge sharing on a global scale.

We hope you enjoy the read.

*

OKFN: What does the process of chasing and taking down Z-Library mean for the concept of open knowledge?

Balázs Bodó: When I read the news that these two Russian individuals have been detained, I thought, well, history has come to a full circle. I don’t know these people, how old they are, I assume they are in their thirties. But certainly, their parents or their grandparents may have been or could have easily been detained by the Soviet authorities for sharing books that they were not supposed to share. And now, 30 years after the fall of the Berlin Wall, people are again detained for sharing books. For a different reason, but it’s the same threat, ‘You’re gonna lose your freedom if you share knowledge’.

The freedom to access and share knowledge was one of the reasons why people were willing to risk their lives in the pre-1990s era in Central and Eastern Europe. People were risking going to jail, losing their jobs, their livelihoods, and sometimes their lives because they wanted to know, and wanted to share knowledge through writing samizdat, and printing and distributing samizdat editions of, among others, banned western books. And now the Western or liberal or democratic political system is jailing people for, on the surface, very similar acts. The history of copyright (the control of the flow of knowledge through the exclusive economic rights of authors) and the history of censorship (the control of the same flow due to political considerations) have been closely entwined from the very beginning, and apparently, it is sometimes hard to disentangle them even today. In Belarus, the law now permits the digital piracy of works from unfriendly countries, a move quite similar to old, Soviet-era authors’ rights rules.

Z-Library is certainly far from being a clear-cut case because it is a hybrid library that offered both scholarly and commercial works, both literary and academic literature. It’s easier to make a case for shadow libraries and book piracy when it is limited to publicly funded research, academic articles published in scholarly journals, or exorbitantly priced textbooks. Piracy is far less defensible and a much more contentious issue when we talk about the piracy of widely available commercial works, such as the Harry Potter series. But there is an immense grey zone between these two extremes: in-copyright but out-of-print works, books that are hard to buy legally, which do not have electronic versions, which cannot be borrowed from a library nearby, and so on.

I think it’s never nice to see a library burn down. And this is what a burning down of a library looks like in the digital space. It’s never nice to see books being made inaccessible for one reason or another. And this is what is happening. Books are highly symbolic objects and their disappearance, them being forced to be inaccessible, is never a good sign.

When talking about shadow libraries, the first thing that comes to mind is Sci-Hub or Alexandra Elbakyan. It is very curious that many of these initiatives and also some shadow libraries themselves started in Russia somehow. Do you think there’s a sort of historical link, more sensitivity to censorship or historical heritage around that?

Absolutely. I’ve written about that in my research. Yes, there is a very clear link between current digital shadow libraries and the Russian legal, economic, social, cultural, and historical contexts. In Russia, there is a history of how to circulate knowledge under the conditions of oppression, and how to build shadow networks of knowledge when political oppression tries to curtail the accessibility of certain forms of information. When political censorship ended in the 1990s but the post-Soviet economies collapsed, these networks and practices could find a second life in the digital space. In that era of societal and economic turmoil and hardship, digitising books and building shadow libraries became a form of agency – cultural agency, economic agency, and institutional agency in the hands of a highly literate, economically deprived cultural elite.

Also, it certainly helps that Russia and many of the post-Soviet republics are at the peripheries of legal enforcement, so they are not fully incorporated into the Western cooperating enforcement regimes, which makes it easier for shadow libraries to survive. Apparently, these individuals were never detained in Russia, but somewhere in South America.

What is the exact crime that Z-Library is committing? What is the legal argument behind the shutdown? 

I think there are at least four different discourses and languages through which this whole phenomenon is being discussed, and they are not communicating with each other at all. But they are used very strategically by most people, including myself, who participate in this debate. 

One first language is the law, of course. ‘What does the law say about someone releasing an ebook without the authorization or license from the rights holder?’ The law is very clear on that: this is a copyright infringement. You cannot do that without the permission of the rights holder, and they were doing it without such permission, so there is not much debate or uncertainty about the legal status of Z-Library. There are some legal fights, for example, in India, regarding educational materials. Library Genesis and Sci-Hub are being litigated in India, and the question is whether the educational exceptions within the Indian law would cover such a practice. But yes, it’s certainly a copyright-infringing practice, everywhere in the West, and in almost all other jurisdictions as well. Rights holders certainly have a very good legal case there.

The second language in which piracy is usually discussed is the economic discourse. What is the relationship between the grey markets and the legal markets? What kind of economic harm does piracy cause? As an economist, my argument is that black markets arise when legal markets fail. They emerge when and where the legal markets do not deliver what customers want. There is extensive, but inconclusive, research on the economic impact of digital piracy. Some works and some creators do suffer losses due to piracy. For others it makes little difference. For some it may even have some beneficial effects.

With regards to books, we’ve measured the economic impact of the Library Genesis catalogue. We asked the following questions: ‘If I wanted to buy this book from the LibGen catalogue, could I buy it? Where? For how much? In what format? Is it available on the secondhand market? If I wanted to read this book online, would I be able to buy or rent this book as an e-book? Could I borrow this book from a library?’ In 2015, what we found is that around 80% of the titles were available in print, but 20% were not. For four out of five books, customers could buy a copy if they wanted to. But very few of these titles were actually available digitally.

So there was a clear argument at that point that these shadow libraries were serving a demand for books for which there was no legal supply. In the last decade, a lot has changed in the ebook marketplace, so it is quite possible that most of the Z-Library supply is somehow available online now. But the question is still there: is legal online access evenly distributed? Only if we compare legal access alternatives to illegal downloads can we say something empirically about the true economic impact of piracy.

But this type of economic or market discussion is tricky because it gives you hard numbers. It gives you very clear evidence of whether the price is good or bad, whether there are unserved customers, lost sales, etc. In my experience, rights holders are not always happy to have such numbers out in the public because it highlights all the things that they could do to avoid the piracy of their products: ‘Hey, you have to lower your price’, or, ‘Hey, you could make better deals with libraries around e-lending, because your customers are going elsewhere.’

But economic arguments don’t always sit well, because with regards to piracy there is a third discourse, and that is about control. If the economic arguments are favourable to the rights holders’ position, then they use the economic arguments. But if the economic arguments suggest that making more money is only possible if rights holders are willing to give up control of the circulation of works, or if the economic arguments show that there is no substitution effect, and thus no economic harm, then the debate tends to switch to arguments about control. These arguments emphasize that no matter what, rights holders can do whatever they want with their works, including making bad decisions. Both the law and good philosophical arguments support the idea that they should have (almost) exclusive and absolute control over the life and death of the works they created, and so they are not bound by the economic wisdom of their decisions.

And there is a fourth line of argument, which focuses on the role of creative works in culture. These arguments are often abstract and symbolic, and focus on value-laden ideals such as access to knowledge, common cultural heritage, cultural participation, the free circulation of ideas, and the public domain. These are often seen as utopian, naive, or incorrigibly humanist ways of discussing what books are and how they should live with us.

I think these arguments are as important as the legal or economic ones. If you talk to, for example, librarians, they’re going to agree with this approach. But the world of books is not limited to libraries and librarians. This world also consists of a few gigantic publishing corporations, and those are organized around profit margins, investor returns, and control, and for them those considerations are more important than the books-are-culture approach.

Today shadow libraries serve a role of an almost permanent repository of works that are out of the digital space in a way, a role that a National Library would have for physical books, let’s say. We’ve seen cases of initiatives by the open access movement in Europe, like in Sweden or Switzerland, where a model of subscription to journals is being tested as an alternative. Do you think that shadow libraries will keep on being there because they fill that gap, or do you think that once this gap is solved through some market alternative, then their role will be maybe redundant and a bit less useful? 

I think cultural black markets play another important role beyond providing immediate access to a wide range of works, and that is that they can shape the development of the legal alternatives. What they say to rights holders is the following: ‘If you don’t adjust your business models, if you don’t improve your legal offering, then we will stay here threatening you’. This is what has happened in the digital music space. They managed to sue Napster, and then Kazaa, and so on, into oblivion, but what ultimately put an end to music piracy was Spotify. But Spotify would’ve never happened without the credible threat of music piracy. If there was no music piracy, we would still be buying DRM-protected MP3s from the Apple Music Store, because that was at the time the preferred business model for rights holders. Only the threat of piracy shifted the calculus, where they said, ‘Hey, an all-you-can-eat subscription service may not be our preferred way of making music accessible, but it’s still better than losing all our business to the pirates’.

I think this is also the role of shadow libraries. They managed to shut down Z-Library for now, but there are already gazillions of copies out there. They exist on IPFS, on BitTorrent, on various private repositories, therefore they cannot fully get rid of this threat. The only way to suppress piracy is to actually serve the demand, which is now being served by the pirates, and offer comparable, equally favourable conditions. We’ve seen with all the subscription services that people are very happy to pay a monthly fee for having no-questions-asked access to movies, music, series, whatever.

We did a survey a couple of years ago where we asked people if they would be willing to pay some money on top of their internet subscription for a file-sharing licence. We found that they were very happy to actually pay around 10 euros per month for not having to think twice about the legality of what they are downloading. What we’ve shown is that, if you have every internet subscriber paying 5 to 10 euros extra per month on their subscription to not be bothered by legality, then this would actually make many times more money for rights holders than what they’re making now. It seemed like a match made in heaven. People would be happy to pay. Rights holders would be better off. Win-win, right? But when we approached them, they said they were not really interested in that proposition because it would have meant that they had to give up control.

On the question of shadow libraries, well, they will never be legal, but they might be irrelevant if there are good legal alternatives. But there are no good legal alternatives…

But, this also reminds me to stress that we mustn’t talk about Z-Library in and by itself. They are just one puzzle piece in a much larger picture, with a number of other players. There’s the Internet Archive. In COVID times they set up what they called “the emergency lending library”, and they are now fighting in court to defend the legality of lending out digital copies of books they have a physical copy of. There is also a conflict between publishers and libraries about the terms and conditions of digital lending. There is also Amazon with their business and services, like text-book and e-book rentals. There are also all those regional and national funding agencies that buy access to electronic databases for local research and education institutions. This is the big picture that Z-Library is also part of. 

What are the more immediate impacts of shutting down Z-Library? To what extent shadow libraries are an intrinsic part of the academic world today? 

Well, people find their way. It’s not that difficult. If you are a user, you can go to Reddit and then find the alternatives. It’s not rocket science. The academic scholarly networks are aware of what’s happening. There are lots of substitutes for Z-Library. I don’t think that, in general, there’s a great loss, and that’s the whole point.

Paying a small fee for services that today are illegal is something that has been mentioned also in some corners of the discussions in the open movement, with people saying maybe open doesn’t always mean that it’s free, and maybe you can have like small subscriptions that would allow maybe a fair remuneration of authors, for example. Do you think that’s a direction a service like this would take, or should it stay in the rebellion activist space?

I’m pretty sure you know Alek Tarkowski and Paul Keller, who run Open Future. They had an essay on the Paradox of Open and they are launching a series of essays in response to that. I’ve written one, and all these essays are asking what is open, how is open, and what is the use, or usefulness, of open? My argument is that the particular form of openness that was promoted by Benkler, Lessig, all these Creative Commons types in the US, was a very particular and misleading understanding of what open is and what open could be. They have been referring to the ideas of Elinor Ostrom about commons, and suggested that in the digital space we can organize resources as open commons. In that discourse this has meant that anyone can come and take whatever they need from the commons, because digital resources are infinitely copyable.

This was a rather stark departure from the way the Ostromian commons are organised, because these are closed, not open. You cannot just go to a common fishery and start fishing, or go into the common forest and cut down a tree. You cannot go into a community and start to extract resources without permission. So they are more like clubs – if you’re part of the club, then you can extract; if you’re not part of the club, then you can take a walk but you cannot extract resources. My understanding is that this whole creative / open / digital commons movement uses the commons as a term, but what they meant was the public domain, where you don’t need to ask permission for taking. You can just go in and take whatever you see. It’s okay because it’s digital: you take a copy and the original is still there, so it’s no problem, right?

My impression is that, for a number of reasons, digital knowledge or knowledge commons or digital commons are only inexhaustible in one narrow sense, and if you look beyond that, there are a number of really good reasons why access to these resources should be thought about. For example, freely accessible digital resources are actually reproducing existing power inequalities. We found this effect also in relation to shadow libraries. When we looked at who uses shadow libraries the most, it’s highly developed countries. Why? Because though the knowledge resource is openly accessible to everyone, the other resources that would actually allow you to make use of this free resource are not evenly distributed. I can speak as a Hungarian. I know all my Hungarian academic colleagues who are using shadow libraries. They are using them, but they also have to juggle one, two, often more jobs to make a living in academia, because an assistant professor makes 700€ gross, and a full professor salary is 1400€ gross per month. With such salaries, they cannot make a living working only in academia. People who have to have other jobs cannot make the same use of the same pirate library as someone who can dedicate their full time and attention to teaching and research. The more of these other elements are missing, the bigger these problems get in terms of the use of free resources.

We have a similar problem with Creative Commons licensed images. Creative Commons is great, it’s very nice if I, as a blogger, can use a Creative Commons licensed image to illustrate my blogpost, but the true value of this resource is being captured by Google and Facebook and all the other players who use it to train their AI models.

So there are good reasons to put fences around open resources and let someone say under what conditions you have access to these -otherwise free- resources. I think fences make a lot of sense if they enable you to say: ‘Hey, Google, you cannot just come in and extract value from these resources’. They also help you answer another question about the production of common resources, because they can offer a way to consider the following challenge: ‘Okay, we pulled this resource together’, but then how do we actually make sure that whoever contributed to this resource has a chance to actually benefit from it one way or another? The fences can also make sure that those who worked on a resource can charge for it if they wish.

Well, you can say at this point that what I’m proposing is not that different from what the rights holders have been saying with regards to Z-Library. And this is indeed true, to a certain extent. But what the (close) commons enable are also more models than what came to dominate the copyright-based approaches.

Take, for example, K, an arthouse BitTorrent tracker. They managed to build an extremely comprehensive archive of arthouse cinema. They also put up a fence. They don’t charge for access, so making money is not their point. They put up a fence so they can make sure people don’t take content without contributing something to the community. They also take down content if the rights holders demand it. Many members of the community are from the film industry: directors, producers, creatives, as well as film distributors, people who are actually trying to make money by selling those films that are also available freely on the tracker. The reason they don’t see K as a threat is twofold. First, they know that it is not an open resource: there is some control over who can access that archive. Also, the tracker policy clearly states that they are going to remove content if the rights holder requests it, because they are not running the archive to hurt arthouse moviemakers.

So the fence is there to protect the resource. A fence can solve lots of problems, so openness is not always the best approach. In other cases, for example science, I think a resource could be radically open.

We’re glad you brought up the meaning of openness because that was one of the questions. But we wanted to dig deeper there. Access was maybe the main pillar of the movement 15 years ago, but now after the ontological turn in social sciences, and all discussions around intersectionality, it appears that there are many other layers to be considered. So what are we talking about when we talk about open today? Do you think access is still the point? 

Of course, open access resources are just one infrastructure, which is good if it’s there. But we need a number of other infrastructures as well, and they need to be able to work together if we want to have an impact. One of the questions is what kind of other infrastructures or elements we can identify. If you are in the Open Knowledge Foundation, what kind of other players do you need to talk to in order to make sure that your work has as much impact as it could have and just doesn’t evaporate because some infrastructural elements are missing.

Let me give you an example. The academic knowledge market has been operating under a business model where publishers have been charging for access. Now the biggest firms in the industry are slowly moving towards another business model. And this means more than just supporting gold open access and charging article processing fees. These firms have also invested quite a lot in the infrastructural elements of knowledge production, like citation management software, institutional archives, and science metrics. The new business model is based on collecting data at every step of knowledge production, from citation via publishing and archiving to distribution. They take all the information from academic knowledge production, package it into various metrics and sell these back to the universities. ‘This is how many papers have been published by your department, how many times they were cited, their impact, this is the score of your individual researcher, how good they are compared to all the others, their worth in the academic market’. What started with academic publishing is now turning into something akin to what credit rating agencies are doing. It’s the same business model. And, for that business model, if you’re in the business of metrics, of selling creditworthiness information to the market, you need circulation, the circulation of the knowledge that you are trying to measure and turn into a product. You cannot put up artificial, legal or technological barriers that limit the circulation of the knowledge that you want to measure and then repackage.

So I see that there is a shift from a business model which made money out of limiting circulation to business models which try to profit from circulation. If that is indeed the case, then in this new model you have to let circulation take place, and the traditional barriers to such free circulation through DRM or subscriptions must go down.

This transformation of how knowledge production is organized in society brings forward the question of infrastructure. In order for publishers to create the new metrics-based products and services, they need to control the infrastructure of the workflow where the data is generated: the systems of archiving, publishing, citation management, and search. This then raises the question of sovereign technology infrastructures, not just for the EU, or on the national level, but for the institutions, universities, research institutes, and for the whole societal activity of knowledge production. What is technological sovereignty? What is the infrastructure of knowledge? Where does it start? Where does it end? Does it include cloud computing infrastructure? Does it include archival library infrastructures? Does it include Microsoft 365? And what are the right responses to the concentration of power and the emergence of new business models in the private sector which try to serve/capture this market? Should we all do as the French did, for example, who just banned Microsoft and Google cloud services from their schools? Should we build our own infrastructures to have some form of sovereignty not just over content, but also in terms of hardware and software?

We as the Open Knowledge Foundation have been thinking about this a lot. How do we use the open definition or the open movement to fight not only with our peers for the things we’ve been historically advocating for but actually tackle the most pressing problems of our times, such as climate change? Maybe you have some input, or want to share a final comment regarding our role as an organisation, or the roles of organisations like ours.

My impression is that there is always some kind of cautious distance between mainstream open access / open culture advocates and people who try to address open knowledge and open access through confrontational pirate ways. The relationship to legality has been a gap between these two communities. My impression is that there is still a lot of hope in various advocacy groups that it is possible to achieve change through advocacy or lobbying or working with policymakers. I don’t know whether those hopes are justified, but I think pirates have long been a part, and oftentimes a positive force, in these struggles; maybe there is now a historic moment where these two approaches need to work together more strategically. Piracy is far from the ideal solution, but currently it is the only one which is able to provide access to science and overwhelmingly publicly funded research published in books and journals. As long as someone else, preferably us, the academic community, or the businesses in this market are not providing legally something as comprehensive as the shadow libraries, we need them, we rely on them. We might as well do that openly and consciously, as part of a push for better legal alternatives, and as a pushback against the privatization of both knowledge and the infrastructures of knowledge production which is taking place today.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2023 January 10 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by Jay Weitz.

Preparing for comic book challenges

Photo by Martijn Baudoin on Unsplash

On January 18, 2023, at 5:00 p.m. Eastern, ALA’s Graphic Novels and Comics Round Table (GNCRT) will present the webinar “Preparing for Challenges – How to be ready before you get one.” Panelists will discuss such preparations as documenting strong collection development policies, building advocacy, and knowing the historical context “to build an arsenal of tools to deflect or defend against challenges.” The participants will include Shauntee Burns-Simpson, Associate Director, Center for Educators and Schools, New York Public Library (OCLC Symbol: NYP); Mary Grahame Hunter, youth services librarian, Ferndale Area District Library, Michigan; and Carla Riemer, school librarian and current member of the GNCRT Addressing Comic Book Challenges Committee.

Gender and Sexuality SACO Funnel Project

Virtual attendees of the June 3, 2022, OCLC Cataloging Community meeting witnessed a rare event in real time: the formation of a new Program for Cooperative Cataloging (PCC) Subject Authority Cooperative (SACO) funnel project. “A SACO funnel is a group of libraries (or catalogers from various libraries) that have joined together to contribute subject authority records for inclusion in the Library of Congress Subject Headings,” according to the SACO site. The new Gender and Sexuality Funnel Project, which developed from a Cataloging Community discussion, “promotes and facilitates the creation and revision of authority records for Library of Congress controlled vocabularies used in the cataloging of resources about, for, and by transgender, gender diverse, intersex, asexual, and other queer, non-heteronormative or non-heterosexual people. The goal of this project is to improve access to gender- and sexuality-related resources, and to reflect more accurately the terminology used by non-heteronormative and/or non-heterosexual communities. Additionally, the Gender and Sexuality Funnel Project is interested in the creation and revision of terminology addressing personal relationships, with a focus on subject authority records that perpetuate or normalize cisgender and heterosexual perspectives and relationships as the standard.” Potential participants are welcome to contact the funnel at GenderAndSexualityFunnel@gmail.com. Among many other SACO funnels are the African American Subject Funnel, the Africana Subject Authority Funnel, the CJK Funnel, the Hawaii/Pacific Subject Authority Funnel, the Judaica Funnel, and the Latin American and Indigenous Peoples Funnel.

Marrakesh Treaty

The Marrakesh Treaty facilitates the production and international distribution of books adapted for people with visual impairments through exceptions to standard copyright laws. In “Framework for the provision of information to the visually impaired in academic libraries in compliance with the Marrakesh Treaty,” three Kenyan information scientists suggest various ways to improve that access through such strategies as revising library policies and increasing awareness of the treaty. Samuel Macharia Were, Lecturer at the School of Information Science at Kisii University (OCLC Symbol: KEKIS); Japhet N. Otike, Professor of Library and Information Sciences in the School of Information Sciences in Moi University (OCLC Symbol: KEMOI); and Emily K Bosire, Senior Lecturer in the School of Information Science at Moi University, presented their study in the December 2022 issue of the IFLA Journal (Volume 48, Number 4, Pages 727-741).

Loida Garcia-Febo on diversifying the library workforce

The Oregon Library Association (OLA) Diversity, Equity, Inclusion, and Antiracism Committee presents the tenth episode of its podcast, Overdue: Weeding Out Oppression in Libraries. The episode entitled “Libraries, Communities and Mentorship: Connecting the Dots with Loida Garcia-Febo” features the Puerto Rican American librarian, 2018-2019 President of ALA, and global activist for equity, diversity, and inclusion. She talks about mentoring Black and Indigenous People of Color, trying to instill empathy in all library workers, and getting out into the community to serve it better by building trust. She suggests embedding EDI principles into the strategic plan of each library, announcing job openings to multiple ethnic and cultural organizations within ALA, and reaching out to high schools to help diversify the library workforce. Garcia-Febo spoke with Roxanne M. Renteria, Community Librarian at the Deschutes Public Library (OCLC Symbol: DCH), Bend, Oregon; and Brittany Young, Law Librarian at the Lane County Law Library, Eugene, Oregon, on 2022 September 20.

Multiple mentors

Continuing on the same topic, Annmarie Magurany, a reference and instruction librarian, and Elizabeth Dill, director of University Libraries, both at the University of Hartford (OCLC Symbol: HRM), write about “BIPOC Librarians and Retention: Mentorship and Supportive Relationships in the Workplace” in College and Research Libraries News, December 2022 (Volume 83, Number 11, Pages 474-476). In the discussion, Magurany says, “I think that it is valuable to see representation of oneself within leadership in any organization, especially for historically marginalized groups. It helps signal the organization is not just open to theoretical inclusion but follows through on the ideals expressed in DEI statements. But ultimately, … I do not believe that a mentor and mentee need to have a similar background.” They agree that having more than a single mentor can be most valuable.

Facing censorship

The December 2022 episode of the “ALA Connect Live Series” is now available. “Our Brave Communities: Facing Censorship Head On with ALA,” moderated by ALA Intellectual Freedom Committee Chair Lesliediana Jones, features Martha Hickson, School Librarian at New Jersey’s North Hunterdon-Voorhees Regional High School (OCLC Symbol: LI7); Brian Raitz, director of the Parkersburg and Wood County Public Library (OCLC Symbol: PARWD) in West Virginia; and Lisa Varga, Executive Director of the Virginia Library Association, discussing the current spate of book bans and how the profession has been countering it. In addition, Ed Garcia, Library Director of Cranston Public Library (OCLC Symbol: RH6) in Rhode Island and Chair of the ALA Committee on Legislation, and ALA’s Public Policy and Advocacy Office Senior Director Alan Inouye consider the possible impact of the 2022 elections and library advocacy efforts planned for 2023. A PDF of “Links and Resources” from the episode is also available.

“Protecting Minors” in Louisiana

On November 30, 2022, the Attorney General of Louisiana, Jeff Landry, announced the creation of an online “Protecting Minors” tip line through which citizens may report “books that contain extremely graphic sexual content that is far from age appropriate for young audiences.” In the words of AG spokesperson Cory Dennis, “Landry has been committed to working with Louisiana communities to protect minors from exploitation, including early sexualization, grooming, sex trafficking, and abuse.” ALA Office of Intellectual Freedom Director Deborah Caldwell-Stone, who believes that this is the first such tip line OIF is aware of, points out in response that “Libraries have long had policies on the books that allow any library user to raise a concern about a book. Every book has its reader. Public libraries serve a wide range of information needs for everyone in the community. There are going to be books that people disagree with or don’t think are suitable for their kids. But they’re there because they serve the information needs of someone in the community.”

LGBTQ+ controversy in Kansas

Thanks to an “outpouring of public support,” the lease of the Pottawatomie Wabaunsee Regional Library (OCLC Symbol: BX2) in Kansas was renewed for a year on December 6, 2022, without restrictive language that had been proposed. Saint Marys City Commissioner Matthew Childs, who was elected mayor at the same meeting, had put forward wording that would have required the library to not “supply, distribute, loan, encourage, or coerce acceptance of or approval of explicit sexual or racially or socially divisive material, or events (such as ‘drag queen story hours’) that support the LGBTQ+ or critical theory ideology or practice.” Coverage of the ongoing controversy can be read in Kansas Reflector reporter Rachel Mipro’s articles “Kansas town’s library lease renewed after months of debate about LGBTQ content” and “St. Marys officials looking at ways to reshape public library, limit public comments.”

Accessibility standards for DLOs

Donovan Frazier, a curriculum development student assistant at the University of California-Riverside (OCLC Symbol: CRU) has conducted a most helpful review of accessibility standards for digital learning objects (DLOs) that can be applied proactively to try to ensure quality accessibility in the vision, motor, audio, cognitive, and linguistic realms. “Shifting from reactive to proactive: An accessibility review and revision project” appears in the January 2023 issue of College and Research Libraries News (Volume 84, Number 1, Pages 27-31). It accounts for sixteen standards that include alt text, image descriptions, video descriptions, check contrast, epilepsy check, closed captions, link descriptions, font clarity, and color blindness, among others.

Neurodivergent library workers

Former school librarian and special educator Kelley McDaniel contributes part one of “We Need to Talk About How We Treat Library Workers Who Are Neurodivergent” to the December 2022 issue of the American Library Association-Allied Professional Association (ALA-APA) Library Worklife. McDaniel believes that “librarians tend to do a good job promoting their diverse and inclusive collections and programs” and “that libraries can and should be part of the solution to the unemployment and underemployment of adults who are neurodivergent.” McDaniel links to the Autism @ Work Playbook: Finding talent and creating meaningful employment opportunities for people with autism, from the ACCESS-IT Research Group at the University of Washington Information School (OCLC Symbol: WAW).

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2023 January 10 appeared first on Hanging Together.

KO is KO'd / CrossRef

A library is intended to be a place of organized knowledge. Knowledge organization (KO) takes place in two areas: the shelf and the catalog. In this post I want to address KO in the catalog.

Headings

KO in the catalog makes use of "headings". Headings are catalog entry points, such as the title of a work or the name of an author. Library catalogs also assign topical headings to their holdings.

The "knowledge organization" of the title and author headings consists of alternate versions of those. Alternate forms can be from an unused form (Cornwell, David John Moore) to the used one (Le Carré, John). They can also refer from one form that a searcher may use (Twain, Mark) to a related name that is also to be found in the catalog (Snodgrass, Quintus Curtius).

Subject headings are a bit more complex because they also have the taxonomic relationships of broader and narrower concepts. So a broader term (Cooking) can link to a narrower term (Desserts) in the same topic area. Subject headings also have alternate terms and related terms.

The way that this KO is intended to work is that each heading and reference is entered into the catalog in alphabetical order where the user will encounter them during a search.

Cornwell, David John Moore
    see: Le Carré, John

Twain, Mark
    see also: Snodgrass, Quintus Curtius
    
Cooking
    see narrower: Desserts
    see narrower: Frying
    see narrower: Menus
    
It may seem obvious but it is still important to note that this entire system is designed around an alphabetical browse of strings of text. The user was alerted to the alternate terms and the topical structure during the browse of cards in the card catalog, where the alternate and taxonomic entries were recorded for the user to see. Any "switching" from one term to another was done by the user herself who had to walk over to another catalog drawer and look up the term, if she so chose. The KO that existed in the catalog was evident to the user.

Automation

A database of data creates the ability to search rather than browse. A database search plucks precise elements from storage in response to a query and delivers them to the questioner. The "random access" of that technology has all but eliminated the need to find information through alphabetical order. Before the database there was no retrieval in the sense that we mean today, retrieval where a user is given a finite set of results without intermediate steps on their part. Instead, yesterday's catalog users moved around in an unlimited storehouse of relevant and non-relevant materials from which they had to make choices.

In the database environment, the user does not see the KO that may be provided. Even if the system does some term-switching from unused to used terms, the searcher is given the result of a process that is not transparent. Someone searching on "Cornwell, David" will receive results for the name "John Le Carré" but no explanation of why that is the case. Less likely is that a search on "Twain, Mark" will lead the searcher to the works that Twain wrote under the additional alias of "Snodgrass, Quintus Curtius" or that the search on "Cooking" will inform the user that there is a narrower heading for "Menus." A precise retrieval provides no context for the result, and context is what knowledge organization is all about.
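
To make the loss concrete, here is a toy sketch (my own illustration in Python, using made-up entries rather than any real authority file) of what term-switching looks like when the cross-reference context is kept and surfaced to the searcher instead of being silently discarded:

# Toy authority file: "use" points from an unused form to the preferred
# heading; "see_also" and "narrower" carry the related and taxonomic
# references. All entries are illustrative only.
AUTHORITY = {
    "Cornwell, David John Moore": {"use": "Le Carré, John"},
    "Le Carré, John": {},
    "Twain, Mark": {"see_also": ["Snodgrass, Quintus Curtius"]},
    "Cooking": {"narrower": ["Desserts", "Frying", "Menus"]},
}

def lookup(term):
    """Resolve a heading the way a database catalog might, but keep the
    knowledge-organization context instead of discarding it."""
    entry = AUTHORITY.get(term, {})
    preferred = entry.get("use", term)   # the silent term-switching step
    target = AUTHORITY.get(preferred, {})
    return {
        "searched": term,
        "preferred_heading": preferred,
        "see_also": target.get("see_also", []),
        "narrower": target.get("narrower", []),
    }

print(lookup("Cornwell, David John Moore"))
# {'searched': 'Cornwell, David John Moore', 'preferred_heading': 'Le Carré, John', ...}
print(lookup("Cooking"))
# the narrower terms Desserts, Frying, and Menus come back with the result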

Answering a question is not a conversation. The card catalog engaged the user in at least a modicum of conversation as it suggested entry headings other than the ones being browsed. It is even plausible that some learning took place as the user was guided from one place in the list to another. None of that is intended or provided with the database search.

KW is especially not KO

The loss of KO is exacerbated with keyword searching. While one might be able to link a reference to a single-word topic or to a particular phrase, such as "cookery" to "cooking," individual words that can appear anywhere in a heading are even further removed from any informational context. A word like "solar" ("solar oscillations", "solar cars", "orbiting solar observatories") or "management" ("wildlife management", "time management", "library catalog management") is virtually useless on its own, and the items retrieved will be from significantly different topic areas.

Keyword searching is very popular because, as one computer science student once told me, "I always get something." The controversy today over mis-information is around the fact that "something" is a context-free deliverable. In libraries, keyword searching helps users retrieve items with complex headings, but the resulting resources may be so different one from the other that the retrieved set resembles a random selection from the catalog. Note, too, that even the sophisticated search engines are unable to inform their users that broader and narrower topics exist, nor can they translate from words to topics. Words are tools to express knowledge, but keywords are only fragments of knowledge.

21st Century Goals

I would like to suggest a goal for 21st century librarians, and that is a return to knowledge organization. I don't know how it can be done, but it is essential to provide this as a service to library users who are poorly served by the contextless searches in today's library catalogs. To accomplish this with computer and database technology will probably not make use of the technique of heading assignment of the card catalog. Users might enter the library through a topic map of some type, perhaps. I really don't know. I do know that educating users will be a big hurdle; the facility of typing a few words and getting "something" will be hard to overcome in a world where quick bits of information are not only the norm but all that some generations have ever known. A knowledge system has to be demonstrably better, and that's a tall order.

Binance's Time In The Barrel / David Rosenthal

The bulk of last month's Dominoes was about Binance, the dominant unregulated cryptocurrency exchange, and the risk that in the wake of FTX's collapse it might be the next victim of cryptocurrency contagion. Just as happened with FTX, once the media picked up on reports of problems, further stories came thick and fast. So below the fold are updates on two of the problems facing Binance.

Income Drop

The non-fraudulent way for a cryptocurrency exchange to make money is by charging transaction fees. The more transactions, the more fee income. Last November 1st, one week before FTX suspended withdrawals, I wrote in Greater Fool Supply-Chain Crisis:
Why would retail investors buy? They are facing high inflation and a looming recession, their stock and bond portfolios are evaporating, and their cryptocurrency HODL-ings have evaporated even faster. It isn't just retail:
Meanwhile, institutional digital-asset products this month saw their lowest-ever volume in data going back to June 2020, with average daily trading volume dropping 34% to $61 million, according to CryptoCompare.
The subsequent collapse of FTX, and its spreading contagion, haven't helped. In ‘Spectacular’ Trading Drop Plagues Still-Reeling Crypto Market Vildana Hajric and Olga Kharif write:
In 2022, trading volume on centralized exchanges such as Coinbase, Kraken and Binance plunged more than 46%, according to data compiled by CryptoCompare. On Binance, which remains the leader in terms of market share, spot trading fell 45% to $5.4 trillion. And Bitcoin, the most-traded digital asset, saw trading volumes decline 31% year-on-year, the researcher said in a report.
Eyeballing the graph one can estimate that centralized exchange volume is currently about 90% down from the peak around May 2021, and decentralized exchange volume is down around 90% from its peak around November 2021. This must leave exchanges desperate for non-fee income streams and thus, as Matt Levine suggested in How Not to Play the Game, tempted to cheat, that is, commit fraud.

Magic Beans and Bogus Blockchains

Two recent readable summaries of investigations by @cryptohippo65 and DataFinnovation are Dirty Bubble Media's The Binance Scam Chain and Patrick Tan's Binance Built a Blockchain, Except it Didn’t. Dirty Bubble Media summarizes the results of the investigations thus:
The Binance Smart Chain and BNB are cornerstones of the Binance empire. However, recent analyses have called almost every aspect of this blockchain into question. It turns out that, like many things with Binance, a closer look reveals cracks within the façade. The vast majority of BNB tokens appear to be owned directly by Binance, and market analysis suggests that the price has been artificially inflated. Billions of dollars in purported stablecoins pegged to the U.S. dollar on the BSC were not backed with real assets for weeks at a time. The code base for BSC is not open source and appears to be controlled directly by Binance employees. And most importantly, a deep dive into the Binance Smart Chain suggests that it might not function like a blockchain at all…

Magic Beans

FTX imploded in part because much of their reserves consisted of FTT, a token they created. FTX held the vast majority of FTT, and controlled much of the trading in it. This allowed them to manipulate the "price" of FTT and thus the "value" of their reserves of these magic beans. All appeared well, but the "price" of FTT didn't reflect what it could be sold for once it was pointed out that it was a magic bean.

Similarly, Dirty Bubble Media writes:
BNB is the fifth-largest cryptocurrency with a market cap of $42 billion. Based on Binance’s own public records, they directly own between 70-80% of the total BNB. The blockchain analyst @cryptohippo65 examined Binance’s proof of reserves information to determine the allocation of BNB across both Ethereum and Binance blockchains. Cryptohippo65 discovered that the vast majority of BNB on Ethereum and the BSC were likely attributable to customer holdings. This accounted for around 15% of the circulating supply
...
However, nearly all of the BNB on the “governance” chain for the Binance blockchain, called the Beacon chain, appears to be owned directly by Binance. This can be determined by a simple process of elimination: Binance does not include these addresses in their customer proof of reserves address list, yet they clearly control these addresses.

This means that Binance is holding somewhere between $28-32 billion worth of BNB on its balance sheet. Yet BNB’s price is not determined independently of Binance itself, as Binance (unsurprisingly) hosts the largest trading pairs for BNB.
How liquid is BNB?:
Based on data from the last 30 days, the ratio of daily spot trading volume to market cap for BNB was roughly 1%. This is significantly lower than the average ratio for Bitcoin (5.4%) or Ether (3.0%), indicating that BNB liquidity is markedly lower than other major cryptocurrencies. The true BNB volume is likely much lower since a large fraction of this alleged volume is reported by highly questionable microexchanges.
Is there any sign that BNB was pumped the way FTT was? This is where Binance's allegedly "fully backed" metastablecoin BUSD comes in. It is deliberately confusing, because it exists on multiple blockchains. The basis for BUSD is BUSD-on-Ethereum, credibly backed 1-for-1 because it is run by Paxos, not Binance, and Ethereum is a credible blockchain. Binance runs a bridge from BUSD-on-Ethereum to BUSD on multiple Binance-operated blockchains. The claim is that each peg-BUSD on these blockchains is matched exactly by a BUSD locked on the Ethereum blockchain.

DataFinnovation showed that this claim wasn't always true. Dirty Bubble Media explains:
DataFinnovation went back in history to examine these peg-BUSD tokens. When he tried to match the number of pegged tokens to the Ethereum BUSD held in reserve, DataFinnovation discovered that the Binance backing wallet frequently ran at a large deficit for weeks at a time
...
In other words, Binance had printed dollar equivalents from thin air. As DF notes, at a minimum this suggests incredible disorganization at Binance. DF also noted something interesting: the periods where pegged BUSD was unbacked correlated neatly with periods of time when BNB prices skyrocketed. It is almost as if Binance needed the money to raise the price of BNB…
DataFinnovation writes:
The run up in price begins exactly when the dramatic unbacked printing starts. Similarly the price stops rising, and then retraces a bit, when the printing stops.

This does not prove anything of course. Doubly so as the BUSD are eventually backed by ERC20 tokens in the peg wallet. But it is suspicious and indicates that pumping of BNB with unbacked BUSD might have occurred. At best — at absolute best — Binance was careless during this time as the tick marks on the horizontal axis are almost 7ish weeks apart.

Further this activity occurred nearly 2 years ago in a major stablecoin, branded by the largest exchange, on a major chain. There is surely a lot still to be discovered.
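
In outline, the backing check described above amounts to comparing two time series: the BUSD locked in the Ethereum-side peg wallet and the peg-BUSD circulating on the Binance chains. The following sketch is my own illustration of that idea with invented numbers; it is not DataFinnovation's code, and real figures would have to be read from the two chains themselves.

# Hypothetical daily snapshots of (date, amount); real data would come from
# the Ethereum peg wallet and the Binance-chain token supply.
locked_on_ethereum = [
    ("2021-01-01", 1_000_000),
    ("2021-01-02", 1_000_000),
    ("2021-01-03", 1_500_000),
]
pegged_in_circulation = [
    ("2021-01-01", 1_000_000),
    ("2021-01-02", 1_400_000),
    ("2021-01-03", 1_500_000),
]

def backing_deficits(locked, pegged):
    """Return the dates on which more peg-BUSD circulated than was locked,
    i.e. the periods when the 1-for-1 backing claim did not hold."""
    deficits = []
    for (day, locked_amt), (_, pegged_amt) in zip(locked, pegged):
        if pegged_amt > locked_amt:
            deficits.append((day, pegged_amt - locked_amt))
    return deficits

print(backing_deficits(locked_on_ethereum, pegged_in_circulation))
# [('2021-01-02', 400000)]  -- 400,000 peg-BUSD unbacked on that day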

Bogus Blockchains

Maybe I'm naive, but I would think that if the two blockchains and the bridge were correctly implemented it would not be possible to mint unbacked peg-BUSD.

We can assume that Ethereum is correctly implemented, and as far as I know no-one has claimed that the part of the bridge that runs on Ethereum isn't. We don't know about the other part of the bridge. But thanks to research by DataFinnovation acting on a tip from @cryptohippo65 we do know that the Binance Smart Chain is not what anyone should call a blockchain. DataFinnovation documented the research in this Twitter thread and explained it in BNB Beacon Chain: Not A Blockchain?. Patrick Tan provides a summary:
DataFinnovation went ahead to try and sync [the blockchain] from the genesis block, and generated the following error:
panic: Failed to process committed block (285075852:2BDC391C402FF452B83AD484D5C40DA615133C25E60C07352CBC6E45435EA873): Wrong Block.Header.AppHash. Expected 3E60F1573122DC7FAD2C5E4779A21BFEEB578422C915A16DEA70A1A617314720, got 1EDDADB1DC0B8E67A3F10FCE05201A4A59A7C380EC54F3CA83D400317CC49685
...
DataFinnovation goes further to identify where the exact failure to sync can be found, and it’s here and after several attempts, found that the Binance Chain regularly breaks at these timestamps.
Coincidentally, Binance, the centralized cryptocurrency exchange resets its price candles at the same time everyday.

Nonetheless, what DataFinnovation has discovered is that the Binance Chain breaks every 24 hours, without fail, yet somehow the blockchain still runs and validators somehow push past the previous breaks
There are a lot more suspicious aspects to the Binance Chain. Among DataFinnovation's discoveries were not just that the hashes verifying the "blockchain" are routinely rewritten to alter history, but also that:
  • Balances in wallets change with no corresponding transactions on the chain.
  • The two blockchain implementations were forked from the originals (geth and Cosmos) in ways that make identifying differences difficult.
  • Important parts of the implementations are not open-source.
  • Parts of the systems are distributed as byte-code.
  • Non-Binance nodes have to run from binary distributions, not source.
  • The binaries for the test nets and the production nets are different.
  • When questioned via Twitter, the "Binance Chain Chief Scientist" stopped responding.
All of which suggests that trusting the Binance "blockchains" would be foolish, since they have been deliberately built to evade the transparency and consistency that are the goals of blockchain technology. Patrick Tan sums up:
While there have been no allegations against Binance for any form of wrongdoing with respect to its blockchains, that it was potentially issuing an unbacked dollar stablecoin and possibly engaging in pump and dump activities may raise the ire of law enforcement agencies in the U.S. who have had their targets set on the cryptocurrency exchange since 2018 for alleged money laundering and evasion of sanctions.
And so does Dirty Bubble Media:
The value of BNB is, in theory, derived from its use case as the “stock” of the Binance blockchain. We showed above that, like FTT or CEL, BNB ownership is highly concentrated in the hands of Binance itself. In yet another flywheel scheme, Binance appears to have spun up tens of billions of dollars in free assets on paper. We don’t know if Binance has leveraged these tokens; their CEO insists that Binance has no loans. Regardless, it’s clear that the BNB held on Binance’s books are worth far less than what current market prices suggest.

As we have demonstrated, there is substantial evidence that there are major problems with the Binance Smart Chains as well. These include hypercentralization of chain governance in the hands of Binance, periodically unbacked stablecoins, and a closed-end project under the control of a shadow group tied to Binance. Most importantly, DataFinnovation’s analyses suggest that the BSC does not operate like a proper blockchain.
My assessment is that the risk level of Binance is rapidly increasing, and that there are more revelations to come.

Call for applications: 2023 DLF GLAM Cross-Pollinator Registration Awards / Digital Library Federation

GLAM Cross-Pollinators

The Digital Library Federation (DLF) is pleased to announce the return of the GLAM Cross-Pollinator Registration Awards in 2023. Initially supported by generous Kress grants from 2015-2017, the DLF Cross-Pollinator program, in partnership with other GLAM (Galleries, Libraries, Archives, and Museums) organizations, endeavors to build bridges among our communities, bring new voices and perspectives to the DLF Forum, and continue our efforts at “cross-pollination” by sending DLF practitioners to valuable conferences they might not otherwise attend. We are thrilled to welcome a new partner in 2023, the Association for Information Science & Technology (ASIS&T), as well as to partner once again with the Art Libraries Society of North America (ARLIS/NA) and Electronic Resources & Libraries (ER&L).

GLAM Cross-Pollinator awards cover the cost of registration for awardees affiliated with a DLF member organization at one partner conference and can be applied retroactively for a full refund if payment has already been submitted. The 2023 partner conference locations and dates are:

Applications for members of ER&L, ARLIS/NA, and ASIS&T to attend the 2023 DLF Forum will open this spring. Subscribe to the DLF Forum Newsletter for updates as they become available.

Award winners for all opportunities will be selected by CLIR/DLF staff in consultation with partner organizations. You may apply for multiple awards, but preference will be given to applicants who have not yet been a GLAM Cross-Pollinator Fellow.

Application deadline for both the ER&L and ARLIS/NA conferences is Friday, February 3, 2023.

Additional information on awards and eligibility, including a link to the lightweight application form, can be found on the GLAM Cross-Pollinators program page.

Questions? Contact us at awards@diglib.org

The post Call for applications: 2023 DLF GLAM Cross-Pollinator Registration Awards appeared first on DLF.

Automatically Generating Podcast Transcripts / Peter Murray

I’m finding it valuable to create annotations on resources to index into my personal knowledge management system. (The Obsidian journaling post from late last year goes into some depth about my process.) I use the Hypothesis service to do this—Hypothesis annotations are imported into Markdown files for Obsidian using the custom script and method I describe in that blog post. This works well for web pages and PDF files…Hypothesis can attach annotations to those resource types. Videos are relatively straightforward, too, using Dan Whaley’s DocDrop service; it reads the closed captioning and puts that on an HTML page that enables Hypothesis to do its work. What I’m missing, though, are annotations on podcast episodes.

Podcast creators that take the time to make transcripts available are somewhat unusual. Podcasts from NPR and NPR member stations are pretty good about this, but everyone else is slacking off. My task management system has about a dozen podcast episodes where I’d like to annotate transcripts (and one podcast that seemingly stopped making transcripts just before the episode I wanted to annotate!). So I wrote a little script that creates a good-enough transcript HTML page. You can see a sample of what this looks like (from the Search and Ye Might Find episode of 99% Invisible).

Note! Of course, 99% Invisible has now gone back and added transcripts to all of their episodes, including the one used in this example. Thanks? ... No really, thank you 99PI!

AWS Transcribe to the rescue

Amazon Web Services has a Transcribe service that takes audio, runs it through its machine learning algorithms, and outputs a WebVTT file. Podcasts are typically well-produced audio, so AWS Transcribe has a clean audio track to work with. In my testing, AWS Transcribe does well with most sentences; it misses unusual proper names and its sentence detection mechanism is good-but-not-great. It is certainly good enough to get the main ideas across to provide an anchor for annotations. A WebVTT file (of a podcast advertisement) looks like this:

WEBVTT

1
00:00:00.190 --> 00:00:04.120
my quest to buy a more eco friendly deodorant quickly started to

2
00:00:04.120 --> 00:00:08.960
stink because sustainability and effectiveness don't always go hand in hand.

3
00:00:09.010 --> 00:00:11.600
But then I discovered finch Finch is a

4
00:00:11.600 --> 00:00:14.830
free chrome extension that scores everyday products on

After a WEBVTT marker, there are groups of caption statements separated by newlines. Each statement is numbered, followed by a time interval, followed by the caption itself. (WebVTT can be much more complicated than this…to include CSS-like text styling and other features; read the specs if you want more detail.)
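
For what it is worth, the format is simple enough that a few lines of Python can pull out the start time and text of each caption. This is a minimal sketch of my own, not the script from the GitHub repository described below, and it ignores the styling and cue-setting features just mentioned:

def parse_webvtt(path):
    """Return a list of (start_seconds, caption_text) pairs from a simple
    WebVTT file like the one above (no styling, no cue settings)."""
    captions = []
    blocks = open(path, encoding="utf-8").read().split("\n\n")
    for block in blocks:
        lines = [line for line in block.strip().splitlines() if line]
        # A caption statement is: sequence number, "start --> end", text line(s)
        if len(lines) >= 3 and "-->" in lines[1]:
            start = lines[1].split("-->")[0].strip()      # e.g. "00:00:04.120"
            hours, minutes, seconds = start.split(":")
            start_seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
            captions.append((start_seconds, " ".join(lines[2:])))
    return captions

# parse_webvtt("episode.vtt")[:2] would give, for the advertisement above:
# [(0.19, 'my quest to buy a more eco friendly deodorant quickly started to'),
#  (4.12, "stink because sustainability and effectiveness don't always go hand in hand.")]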

What the script does

The code for this is up on GitHub now. The links to the code below point to the version of software at the time this blog post was written. Be sure to click the “History” button near the upper right corner of the code listing to see if it has been updated.

  1. Download the audio file from its server and upload it to an AWS S3 bucket so AWS Transcribe can get to it.
  2. Create a new AWS Transcribe job and wait for the job to finish.
  3. Set a public-read ACL on the WebVTT file so this script can get it later. Also, save the output of the transcription job; the function then returns the link to the WebVTT file.
  4. In a new function, get the WebVTT file from where AWS Transcribe put it on the S3 bucket.
  5. Then it concatenates the caption text into one string and uses SpaCy to break the transcription into sentences. I’m doing this because WebVTT breaks each caption by time, and the transcript is easier to read if it is broken up into sentences.
  6. Loop through the sentences looking for occurrences when a WebVTT caption contains the start of the sentence. That way, I can get the timestamp of when the sentence starts (a rough sketch of this step appears just after the list).
  7. When the sentences are synced to their timestamps, use a Jinja2 template to create the HTML file.
  8. Lastly, upload the HTML to the S3 bucket as the index.html file, and make a final record of the podcast metadata.

That’s it!
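
For anyone who wants to try something similar, here is a condensed sketch of steps 2 through 6. It is an approximation of the approach rather than the code in the repository; the bucket name, job name, and media format are placeholders, and it assumes the (start_seconds, caption_text) pairs produced by a WebVTT parser like the earlier sketch:

import time
import boto3
import spacy   # needs a model, e.g.: python -m spacy download en_core_web_sm

transcribe = boto3.client("transcribe")
BUCKET = "my-podcast-transcripts"          # placeholder bucket name

def run_transcription(audio_s3_uri, job_name):
    """Steps 2-3: start an AWS Transcribe job that also emits a WebVTT
    subtitle file, then poll until the job finishes."""
    transcribe.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={"MediaFileUri": audio_s3_uri},
        MediaFormat="mp3",
        LanguageCode="en-US",
        OutputBucketName=BUCKET,
        Subtitles={"Formats": ["vtt"]},
    )
    while True:
        job = transcribe.get_transcription_job(TranscriptionJobName=job_name)
        status = job["TranscriptionJob"]["TranscriptionJobStatus"]
        if status in ("COMPLETED", "FAILED"):
            return job
        time.sleep(30)

def sentences_with_timestamps(captions):
    """Steps 5-6: join the caption text, split it into sentences with spaCy,
    and stamp each sentence with the start time of the caption that contains
    its opening words. `captions` is a list of (start_seconds, text) pairs."""
    nlp = spacy.load("en_core_web_sm")
    full_text = " ".join(text for _, text in captions)
    results = []
    for sent in nlp(full_text).sents:
        probe = " ".join(sent.text.split()[:4])   # first few words; a rough heuristic
        start_time = next((t for t, text in captions if probe in text), None)
        results.append((start_time, sent.text))
    return results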

Design choices

Amazon Transcribe is pretty cheap. AWS charges for each minute a transcript job runs at a rate of 2.4¢ a minute. Transcribing an hour-long podcast costs about $1.44. The storage and bandwidth costs are negligible.

The way that the Hypothesis annotation JavaScript works forced the use of a CSS-“:before”-content structure. One of the downsides of DocDrop is that annotations on multiple blocks are changed into just the first block of text. Based on my experimentation, it seems like the user-select: none property is enough of a break in the DOM to cause the problem. Because I didn’t want the timestamps included in the annotated text, the timestamps are put into the DOM using a CSS “:before” selector. Playing with the box margins enables everything to line up.

I’m not including the playback of the podcast audio along with the transcript. Unlike DocDrop, which embeds the YouTube viewer in the transcript page, playback of the audio from the S3 bucket wouldn’t be counted in the podcaster’s statistics. And I’m comfortable with the copyright implications of publicly posting uncorrected transcripts (in the absence of creator-produced transcripts), but not so comfortable as to also offer the audio file.

Issues

So there are some issues with this setup.

  • Copying and pasting episode data required: This is running as a command line program with four parameters: audio URL, episode title, episode landing page URL, and podcast title. Sometimes this takes a bit of hunting because podcast sites are not the most friendly for finding the audio URL. Viewing the page source is often necessary, and sometimes digging into the RSS/Atom XML is needed.
  • Times will vary with advertisement inserts: Because podcast networks insert ads with different lengths over time, the timestamps that were found when the transcription was made probably won’t correspond to later playbacks. But I think they will be close enough that I can go back and find the audio clip when I need to.
  • Default directory document doesn’t work: Right now, the “index.html” is required as part of the web link. It would be nice if one could remove that and just refer to the root directory, but AWS CloudFront doesn’t work like that.

Blog Year 2022 in review / Cynthia Ng

Another year, another look at blog stats. So I say, but earlier this year, I did a summary from 2016-2021 since WordPress stopped generating them, and I totally forgot about it. In any case, here’s the 2022 in review. Posts 14 posts and about 20,000 words doesn’t sound like much, but what’s interesting is that … Continue reading "Blog Year 2022 in review"

Holiday Product Management and the Hardcover Ebook / Eric Hellman

(This post was originally published in December of 2009. It was removed by google because "Your content has violated our Malware and Viruses policy." It was then restored and re-dated. I haven't the foggiest!)

The product development team responsible for Christmas should get an award. I can imagine the brainstorming sessions: "Let's have a holiday where travelers get stranded and get to sleep with smelly animals!"

"No, that won't work. How about a holiday where poor people visit maternity wards in hospitals?"

"Crackpot. Wait...I've got it... How about if everyone gets PRESENTS!"

Positive reinforcement can go a long way towards creating success. E-books these days are sort of like the current state of my Christmas tree. Beautiful, smells good, but no decorations or presents. In my previous article discussing copyright enforcement, I stressed that if publishers really want to fight off piracy they need to find ways to positively reinforce the ebook purchase experience. In other words, decorate your Christmas ebook and put lots of presents under it.

Step one. It should be completely painless and hassle-free to buy and use a piece of content. Duh. Amazon has figured this out. Apple has figured it out with iTunes. If it comes to a choice between secure DRM (which will get cracked anyway) and making it easy for customers, always do what the users sending you money want you to do.

Step two. Decorate! When someone gets an ebook, it should be customized with the purchaser's name and a pretty frontispiece that says "this book is being lovingly read and cared for by PURCHASERS NAME HERE", and every ebook file would have a unique serial number. For gifts, of course the message would be different. If the e-reader marketers and developers were really attuned to the publishing ecosystem, they would hardwire a cryptographic signature reader into their devices that would be able to tell a "genuine" ebook frontispiece from a fake one. It's not technically hard.

Step three. Give out presents! The wonderful thing about ebooks is that the reproduction cost is zero. You can afford to give your customers free stuff! Once they register their unique ebook serial number, send them a "personalized" thank you note from the author. Give them a free book from an unknown author that they would never, ever have bought on their own.

Step four. Give out more presents! If you want to reward genuine ebook purchases in places like China, turn ebook registration into a raffle. Put a golden ticket in every millionth ebook, and a silver ticket in every thousandth. Give something to everyone who registers, even if it's just a virtual badge, or "frequent-reader" points. People may start buying ebooks just for a chance at a payout. Other people will try to register fake serial numbers, and for free, you will get both marketing and pirate tracking data.

Step five. Regifting! If someone has paid you good money for an ebook, let them give it to a friend after six months or so (as long as they've registered it!). If they're a platinum-level frequent buyer, let them keep their own registered copy, too.

If ebook publishers get really good at adding value to their products, they could consider rolling out the "Hardcover Ebook". Current practice in the print book industry for "trade" books is to initially offer only a hardcover version. A year later, the same book is released as a softcover at a substantially lower price. The effect of this is to partition the market and capture revenue from more potential purchasers. Consumers accept this partitioning partly because they assign some value to the hard cover- they assume the hard cover is more expensive to produce.

Recently, there's been much discussion about publishers holding back titles from the ebook market to protect hardcover sales. An ebook with enhanced value comparable to a hard cover print book could be offered on the initial release date at a higher price so as to prevent pirates from having a monopoly on the ebook.

Is there a Grinch preventing the ebook Christmas party? As long as ebook publishing is captive to distribution platforms, innovations in ebook value may be difficult to implement. Amazon's interests do not always align with those of publishers. In particular, ebook serial numbers that register with publishers are not going to be high on the Kindle development queue.

Even the Grinch learned what Christmas is really about. You won't get hardcover ebook in your stocking this year, but have a great Holiday anyway!

The Rock-Star Librarian and Objective Selector / Eric Hellman

(This post was originally published in January of 2010. Google removed it for inexplicable reasons, then restored it on appeal. But with a new date!)

You probably can't name any musical performers from the 18th or 19th century. But you've probably heard of Enrico Caruso. Caruso had a sharp business sense to go along with a legendary voice, and he took advantage of cutting edge technology to make his voice heard by more people than perhaps any other human before him. He earned millions of dollars from the sales of his recordings before his untimely death in 1920. He was the first rock-star opera singer. 
In my article about the changing role of public libraries in the ebook economy, I observed that libraries would have a diminished economic role when most books had become digital. How will the role of librarians change when this happens?

Librarians have already seen their roles change drastically as library operations have moved onto the internet. Cataloging has changed profoundly, reference has been googlized and pre-internet licensing was almost nonexistent. But these changes have occurred in the context of relatively stable institutions. In what ways will the technological shift to ebooks transform both the role and the context of librarians?

Here's one possibility: objective selection.

I mentioned tropical fish farmers as an example of a group with distinctive information needs, and thus in need of a specialized collection of ebooks. Someone needs to do the selection. But why limit the luxury of customized collections to obscure trade groups? Shouldn't knitters be able to support a custom-selected ebook library? What about bass guitarists? Erlang programmers? Hula dancers?

Some book publishers imagine that their future is tied to the development of "verticals", or units that specialize in a single subject and thereby develop strong relationships with their audience. They imagine developing libraries of digital content to satisfy market needs and selling these libraries by subscription. And while this is likely to be a sensible strategy, it is at the same time rather limiting. It seems to me that the libraries that best serve the consumer's needs will be built by objective selectors, not marketing mavens.

The importance of objective selection is borne out in today's book business, in which the most powerful person is Oprah Winfrey. Her viewers trust her to select books based on their merit and her good taste, rather than their profit potential. A book selected for Oprah's Book Club will rack up hundreds of thousands of sales.

The reason the transition to ebooks could amplify the role of selectors is that new models for selling books are possible. In the print world, you would never think of buying a collection of 1000 books at any price, even if it was selected for you by Oprah herself. In the e-book world, one could easily imagine wanting to buy subscriptions to 1000-book collections, not so much to read, but to keep on the computer and the iPhone just in case.

If you accept the idea of the thousand ebook libraries and how you would market them, you are inevitably led to the concept of rock-star librarians. Collections can't be marketed by specific content, or else there would be cannibalization of single-item sales. Marketing of collections would have to be centered around the selectors. Think how much some people would pay to have Warren Buffett's librarian selecting their thousand ebook libraries! Selectors would no longer be anonymous John Does. They would develop their own followings, their own brands, their own communities. These communities would not be bounded by geography; a top selector would reach patrons around the world, just as Enrico Caruso's voice did.
If you're a publisher, your first reaction is probably that this is nonsense, and that publishers would be better positioned to develop brands and communities. Why would publishers ever offer discounted ebooks through objective selectors, let alone allow a percentage for the selectors to live on? The reason, of course, is the demand curve. Objective selectors will earn their economic keep by helping to aggregate demand and segment the market. Their libraries will provide access to books that the consumer needs but wouldn't buy on their own; those are the sales that publishers will try their best to keep to themselves.

Having an objective intermediary between consumer and publisher could have other types of benefits, most notably privacy. Librarians have a strong code of ethics surrounding the rights of patrons to read without fear of having someone looking over their shoulder, and these values could be built into an ebook selection and collection platform. Publishers may well find it easier to sell certain kinds of content when it comprises part of a discreetly selected library.

It could happen. The other possibility is that punk rock stars could take jobs now and then as librarians. That could happen, too.