Planet Code4Lib

Automatically Generating Podcast Transcripts / Peter Murray

I’m finding it valuable to create annotations on resources to index into my personal knowledge management system. (The Obsidian journaling post from late last year goes into some depth about my process.) I use the Hypothesis service to do this—Hypothesis annotations are imported into Markdown files for Obsidian using the custom script and method I describe in that blog post. This works well for web pages and PDF files…Hypothesis can attach annotations to those resource types. Videos are relatively straightforward, too, using Dan Whaley’s DocDrop service; it reads the closed captioning and puts it on an HTML page that enables Hypothesis to do its work. What I’m missing, though, are annotations on podcast episodes.

Podcast creators that take the time to make transcripts available are somewhat unusual. Podcasts from NPR and NPR member stations are pretty good about this, but everyone else is slacking off. My task management system has about a dozen podcast episodes where I’d like to annotate transcripts (and one podcast that seemingly stopped making transcripts just before the episode I wanted to annotate!). So I wrote a little script that creates a good-enough transcript HTML page.

AWS Transcribe to the rescue

Amazon Web Services has a Transcribe service that takes audio, runs it through its machine learning algorithms, and outputs a WebVTT file. Podcasts are typically well-produced audio, so AWS Transcribe has a clean audio track to work with. In my testing, AWS Transcribe does well with most sentences; it misses unusual proper names and its sentence detection mechanism is good-but-not-great. It is certainly good enough to get the main ideas across to provide an anchor for annotations. A WebVTT file (of a podcast advertisement) looks like this:

WEBVTT

1
00:00:00.190 --> 00:00:04.120
my quest to buy a more eco friendly deodorant quickly started to

2
00:00:04.120 --> 00:00:08.960
stink because sustainability and effectiveness don't always go hand in hand.

3
00:00:09.010 --> 00:00:11.600
But then I discovered finch Finch is a

4
00:00:11.600 --> 00:00:14.830
free chrome extension that scores everyday products on

After the WEBVTT marker, there are groups of caption cues separated by blank lines. Each cue is numbered, followed by a time interval and then the caption text itself. (WebVTT can be much more complicated than this, including CSS-like text styling and other features; read the spec if you want more detail.)
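
For illustration, here is a minimal Python sketch (hypothetical, not the script’s actual code) that parses simple cues like those above into (start-time, text) pairs using only the standard library:

import re

# Matches a WebVTT timestamp such as 00:00:04.120
TIMESTAMP = re.compile(r"(\d+):(\d+):(\d+)\.(\d+)")

def parse_vtt(vtt_text):
    """Return a list of (start_seconds, caption_text) tuples from a simple WebVTT file."""
    cues = []
    # Cue blocks are separated by blank lines; the first block is the WEBVTT header
    for block in vtt_text.strip().split("\n\n")[1:]:
        lines = block.strip().splitlines()
        if len(lines) < 3:
            continue
        # lines[0] is the cue number, lines[1] the time interval, the rest is the caption text
        match = TIMESTAMP.match(lines[1])
        if not match:
            continue
        h, m, s, ms = (int(g) for g in match.groups())
        cues.append((h * 3600 + m * 60 + s + ms / 1000, " ".join(lines[2:])))
    return cues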

What the script does

The code for this is up on GitHub now. The links to the code below point to the version of software at the time this blog post was written. Be sure to click the “History” button near the upper right corner of the code listing to see if it has been updated.

  1. Download the audio file from its server and upload it to an AWS S3 bucket so AWS Transcribe can get to it.
  2. Create a new AWS Transcribe job and wait for the job to finish.
  3. Set a public-read ACL on the WebVTT file so this script can get it later. Also, save the output of the transcription job; the function then returns the link to the WebVTT file.
  4. In a new function, get the WebVTT file from where AWS Transcribe put it on the S3 bucket.
  5. Concatenate the caption text into one string and use SpaCy to break the transcription into sentences. I’m doing this because WebVTT captions are segmented by time, and the transcript is easier to read if it is broken up into sentences.
  6. Loop through the sentences, looking for the WebVTT caption that contains the start of each sentence. That way, I can get the timestamp of when each sentence starts (see the sketch below).
  7. Once the sentences are synced to timestamps, use a Jinja2 template to create the HTML file.
  8. Lastly, upload the HTML to the S3 bucket as the index.html file, and make a final record of the podcast metadata.

That’s it!
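
As a rough illustration of steps 5 and 6, here is a minimal, hypothetical sketch (not the repository’s code) of splitting the transcript into sentences with SpaCy and finding the timestamp of the caption in which each sentence begins:

import spacy

# Assumes the English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def sentences_with_timestamps(cues):
    """cues: list of (start_seconds, caption_text) tuples, e.g. parsed from the WebVTT file.

    Returns a list of (start_seconds, sentence) tuples, one per detected sentence.
    """
    transcript = " ".join(text for _, text in cues)
    results = []
    for sent in nlp(transcript).sents:
        sentence = sent.text.strip()
        opening = sentence[:20]  # the first few words of the sentence
        # Find the first caption whose text contains the opening of this sentence
        for start, text in cues:
            if opening in text:
                results.append((start, sentence))
                break
    return results

# Example with two captions from the advertisement above
cues = [
    (0.19, "my quest to buy a more eco friendly deodorant quickly started to"),
    (4.12, "stink because sustainability and effectiveness don't always go hand in hand."),
]
for start, sentence in sentences_with_timestamps(cues):
    print(f"{start:07.2f}  {sentence}")

A sentence whose opening words straddle a caption boundary won’t match in this sketch, which is one reason the resulting timestamps are only good enough for finding a clip, not frame-accurate.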

Design choices

Amazon Transcribe is pretty cheap. AWS charges 2.4¢ per minute of audio transcribed, so transcribing an hour-long podcast costs about $1.44. The storage and bandwidth costs are negligible.

The way the Hypothesis annotation JavaScript works forced the use of a CSS “:before” content structure. One of the downsides of DocDrop is that annotations spanning multiple blocks are collapsed to just the first block of text. Based on my experimentation, it seems like the user-select: none property is enough of a break in the DOM to cause the problem. Because I didn’t want the timestamps included in the annotated text, the timestamps are put into the DOM using a CSS “:before” selector. Playing with the box margins makes everything line up.

I’m not including the playback of the podcast audio along with the transcript. Unlike DocDrop, which embeds the YouTube viewer in the transcript page, playback of the audio from the S3 bucket wouldn’t be counted in the podcaster’s statistics. And I’m comfortable with the copyright implications of publicly posting uncorrected transcripts (in the absence of creator-produced transcripts), but not so comfortable as to also offer the audio file.

Issues

So there are some issues with this setup.

  • Copying and pasting episode data required: This is running as a command line program with four parameters: audio URL, episode title, episode landing page URL, and podcast title. Sometimes this takes a bit of hunting because podcast sites are not the most friendly for finding the audio URL. Viewing the page source is often necessary, and sometimes digging into the RSS/Atom XML is needed.
  • Times will vary with advertisement inserts: Because podcast networks insert ads with different lengths over time, the timestamps that were found when the transcription was made probably won’t correspond to later playbacks. But I think they will be close enough that I can go back and find the audio clip when I need to.
  • Default directory document doesn’t work: Right now, the “index.html” is required as part of the web link. It would be nice if one could remove that and just refer to the root directory, but AWS CloudFront doesn’t work like that.

Building community capacity for effective use of environmental data / Open Knowledge Foundation

As part of the Open Data Day 2022 small grants program, the Open Knowledge Foundation (OKF) supported 14 events. 5 organisations received small grants under the Environmental data category to host open data events and activities: Tanjona Association from Madagascar, Bolivia Tech Hub from Bolivia, Dream Factory Foundation from Botswana, Lekeh Development Foundation from Nigeria, and Fundación Datalat from Ecuador.

Category: Environmental data

Here are some highlights of the Environmental data events supported by Open Data Day 2022.

  1. Tanjona Association (Madagascar)

Tanjona Association, a Madagascar-based non-profit organisation, hosted a two-day workshop for young researchers from the University of Mahajanga, offering them basic training in GIS and spatial analysis for biodiversity conservation. The event was held on 26 and 27 July 2022 at the University of Mahajanga under the name “Hay Tech”. The objectives of the event were to familiarise young researchers from multiple backgrounds with GIS as a crucial conservation tool for decision-making, strategy, and planning; to train young researchers in critical thinking about conservation strategy and climate change action based on GIS outcomes; and to create a space for networking among researchers.

30 participants joined the workshop, which was facilitated by 3 mentors from the University of Mahajanga. The initial session started with setting up QGIS, with Herizo RADONIRINA helping participants. Kanto RAZANAJATOVO opened the workshop by welcoming participants, followed by self-introductions and an ice-breaking session. Then, Dr. Bernard ANDRIAMAHATANTSOA gave an introductory session on the importance of GIS in the environmental context and biodiversity conservation. The impact of climate change on biodiversity loss and the necessity of monitoring biodiversity were also pointed out.

The workshop content was beginner friendly so that everyone could follow it easily. Case studies and datasets from the Boeny region were given to participants as an application exercise, and the whole session was conducted in both French and Malagasy. The following modules were covered during the session:

  • Understanding the QGIS interface
  • Creating a map and importing data, scaling
  • Handling vector and raster files
  • Basic spatial analysis

Mentors led the participants to think about solutions to tackle climate change effects and biodiversity loss in Madagascar based on the outcomes of the GIS workshop session. In this session, participants were trained to think critically about the issues facing Madagascar’s biodiversity. Numerous ideas were collected from the participants: to support researchers and local/national government in undertaking in-depth studies of biodiversity and climate change; to adopt adequate policy and good governance to stop biodiversity losses; and to build synergies for multidisciplinary collaborations and commitments.

  2. Bolivia Tech Hub (Bolivia)

Bolivia Tech Hub organised Open Data Day Bolivia on Saturday, 18 June 2022, where participants paired up as teams for data exploration and collaborative work. During the 7-hour event, the participants selected 5 topics to organise the exploration. The topics were:

  • Environmental needs: water and soil
  • Environmental monitoring
  • Environmental biodiversity
  • Waste and protected area

34 participants (14 female and 20 male) from the cities of La Paz and El Alto joined the event. During the event, we learned that access to environmental data in Bolivia is not as readily available as we thought. Nevertheless, there’s a vibrant community with the need and desire to work on making such data accessible. We also learned that the networking part of the event was the participants’ favorite.

  3. Dream Factory Foundation (Botswana)

On 25 June 2022, Dream Factory Foundation hosted the first ever in-person workshop focusing on the importance of using open data sources to support smallholder farmers’ agricultural activities. The guest lecturer was Bashanganyi Magwape, who made the workshop not only informative but also practical by running an exercise in which attendees created an open data source using the WhatsApp Business API to input information they wanted about each other in real time.

One of the highlights of the event was the presence of Honourable Tselawa, the Councillor who was there on behalf of the Mayor of the City of Francistown. He remarked that this event was the first of its kind in the city and that he would take the learnings to parliament to advocate for currently closed river channels to be reopened for the benefit of local farmers. He said, “I have walked away from here with a new mandate to take to parliament”.

The event was attended by 31 guests, including government officials as well as farmers from across the country. The tagline for the day became: “Open Data is about we and not me”. We used a WhatsApp Chatbot data collection tool to show how we can practically contribute to open data, how easily farmers can contribute, and how we can benefit from the answers as well.

  4. Lekeh Development Foundation (Nigeria)

With a goal to train the Ogoni community in practical air quality monitoring for effective environmental management, Lekeh Development Foundation organised its Open Data Day event to train critical stakeholders on the use and deployment of air quality monitoring sensors, and on the need for advocacy to entrench a clean and healthy environment for the Ogoni people. Ogoni has become a symbol of environmental degradation, which has impacted air, water and soil. This has led to various problems, ranging from environmental and socio-cultural issues to health issues. The event was hosted on 28 June 2022.

40 participants were carefully selected from among climate defenders, environmental/human rights activists, members of coastal communities, and key stakeholders to be trained on air note devices and practical air quality monitoring for effective environmental management and advocacy. Presentations covered:

  • Air quality reading for advocacy, engagement, and campaign at the local, state, and federal levels.
  • Practical demonstration on how to take readings and use the air quality monitor.
  • Impact of air quality monitoring on health and wellness.

The event strengthened the capacity of the community to protect themselves and the environment in general by increasing their knowledge of environmental data collection. During the event, it was stressed many times that there is a deep relation between the well-being of a community and the well-being of the environment.

During her presentation, Nnenna Obike Oviebor noted that the air quality note is an apt tool that should be deployed in all areas of air pollution monitoring to get real data on the extent of air pollution. The data could serve as evidence for advocacy to bring about the needed action for the well-being of the community people. Participants appreciated the organiser and funder of the project for enlightening them on air quality management. Participants said there is a need for coastal Ogoni communities to monitor the air quality and have devices installed at strategic points, as oil spills and gas flaring are sadly becoming normal incidents.

LeBari Baridakara from the local government called for the air note devices to be donated to a climate change unit in the local authority for in-situ air quality monitoring and record-keeping at the local level. We hope to further engage 10 volunteers who will monitor the instruments for bi-weekly and monthly analysis of the data. The data will be compared with allowed limits, and the outcomes will be shared with the local authorities within Ogoni. This will hopefully help drive the needed action.

  5. Fundación Datalat (Ecuador)

Fundación Datalat organised a Mapathon to create maps with environmental open data. The Mapathon was a hybrid event, held on 29 July 2022, that promoted the use of environmental open data published by the Ministry of Environment related to forests, national parks, and conservation lands in Ecuador. A kit was provided to all participants, which included all the material needed to visualise maps offline using the data provided during the event.

The Mapathon was aimed at people between 18 and 35 years old, including students, activists, journalists, civil society, and other data and environment enthusiasts. It took place locally in Latacunga, Ecuador, and also had a virtual space through Zoom. The activities included a workshop explaining open data, environmental open data, offline mapping tools, and mapping instructions, followed by hands-on activities to produce maps with the provided materials.

The event was facilitated by the Datalat team, which included experts in geography, data analysis, visualisation, and communications and design. It also included personnel from the Ministry of Environment who guided the attendees in the use of the environmental data. The Mapathon lasted 4 hours and was carried out in three parts.

The participants used 5 open geographic datasets focused on the province of Cotopaxi. These databases were: a) National Protected Area System, b) Restoration priority area, c) Area Under Conservation from the Socio-Bosque Programme, d) Land Cover 2018, e) Deforestation period 2016-2018. In total, between online and on-site participants, the Mapathon produced 30 analog maps about environmental and conservation issues that allowed us to understand the perspective each participant had about these issues and environmental conflicts.

——–

Open Data Day is an annual celebration of open data all over the world, where we gather to reach out to new people and build new solutions to issues in our communities using open data.

For the 2022 edition of Open Data Day, Open Knowledge Foundation supported 14 events with small grants. Please find the details of all grant winners here.

Internet Archive Summer of Design 2022 / Open Library

Foreword by Mek

For several years and with much gratitude, the Internet Archive has participated in Google’s Summer of Code (GSoC). GSoC is a program run by Google that supports select open source projects, like the Open Library, by generously funding students to intern with them for a summer. Participation in GSoC is as selective for organizations as it is for students, and so in years when GSoC is not available to the Internet Archive, we try to fund our own Internet Archive Summer of Code (IASoC) paid fellowship opportunity.

GSoC and IASoC are traditionally limited to software engineering candidates, which has meant that engineering contributions on Open Library have often outpaced its design. This year, to help us take steps towards righting this balance, an exceedingly generous donor (who wishes to remain anonymous but who is no less greatly appreciated) funded our first ever Internet Archive Summer of Design fellowship, which was awarded to Hayoon Choi, a senior design student at CMU. In this post, we’re excited to introduce you to Hayoon, showcase the impact she’s made with the Open Library team through her design work this summer, and show how her contributions are helping lay the groundwork for future designers to make an impact on the Open Library project!

Introducing Hayoon Choi


Hello, my name is Hayoon Choi and this summer I worked as a UX designer with Open Library as part of the Internet Archive Summer of Code & Design fellowship program. I am a senior attending Carnegie Mellon University, majoring in Communication Design and minoring in HCI. I’m interested in learning more about creative storytelling and finding ways to incorporate motion design and interactions into digital designs. 

Problem

When I first joined the Open Library team, the team was facing three design challenges:

  1. There was no precedent or environment for rapidly prototyping designs
  2. There wasn’t a living design system, just an outdated & static design pattern library
  3. The website didn’t display well on mobile devices, which are used by an important constituency of patrons.

Approach

In order to solve these challenges, I was asked to lead two important tasks:

  1. Create a digital mockup of the existing book page (desktop and mobile) to enable rapid prototyping
  2. Propose a redesign of the book page optimized for mobile.

To achieve the first task, I studied the current design of the Open Library Book Page and prototyped the current layout for both mobile and desktop using Figma. In the process, I made sure every element of that Figma file is easily editable so that in the future, designers and developers can experiment with the design without having to code.

For the second task, we first scoped our work by setting our focus to be the set of content which appears above the fold — that is, the content which first loads and appears within the limited viewport of a mobile device. We wanted to make sure that when the page initially loads, our patrons are satisfied with the experience they receive.

Even before conducting interviews with patrons, there were easily identifiable design issues with the current mobile presentation:

  • Information hierarchy: some text was too big; certain information took up too much space; the placement of the book information made it hard to discover
  • Not mobile friendly: some images were displayed too small; it was hard to scroll through the related books; one feature relied on hovering, which is not available on mobile devices

To address these concerns, I worked with the Open Library community to gather feedback and designed dozens of iterations of the mobile book page using Figma. Based on that feedback, I learned which information was most necessary to present above the fold and chose to experiment with 6 elements:

  1. The primary Call To Action (CTA) buttons: how do I make them more highlighted?
  2. The Navigation Bar: which placement and styling are most convenient and effective?
  3. The Editions Table: how might we make it easier for patrons to discover which other book editions and languages may be available?
  4. Ratings & reviews: how do I encourage users to rate more and help them understand the book effectively with the review system?
  5. Sharing: how do I make it easier for users to share the book?
  6. The Information Hierarchy: how can we reorder content to better meet the diverse needs of our audience?

From these questions and feedback from the Open Library team, I was able to settle on five designs which seemed like effective possibilities for showcasing the differences in book cover size, sharing buttons, information display, and the rating and reviewing system that we wanted to test:

User Interviews & Mazes

With these five designs selected, I planned to run multivariate user testing to get feedback from actual users and to understand how I could more effectively make improvements to the design.

I believed that I would gather more participants if the user testing was done remotely since it would put less pressure on them. However, I wasn’t sure how I would do this until I discovered a tool called Maze.

Maze provides a way for patrons to interact with Figma mockups, complete certain tasks, answer questions, and leave feedback. While this is happening, Maze can record video sessions, keep track of where patrons are clicking, and provide valuable data about success rates on different tasks. I felt this service could be extremely useful and fitting for this project, so I went ahead and introduced Maze to the Open Library team. Thanks to a generous 3-month free partner coupon offered by Maze, I was able to create six Maze projects — one for each of our five new designs, as well as our current design as a control for our experiment. Each of these six links was connected to a banner that appeared across the Open Library website for a week. Each time the website was reloaded, the banner randomized the presented link so participants would be evenly distributed among the six Maze projects.

Although the Maze projects showed patrons different mobile screens, they enabled comparisons of functionality by asking patrons to answer the same pool of 7 questions and tasks:

  1. What was the color of the borrow button (after showing them the screen for five seconds)
  2. What key information is missing from this screen (while showing the above-the-fold screen)
  3. Share and rate this book
  4. Borrow the Spanish edition for this book
  5. Try to open a Spanish edition
  6. Review this book
  7. Try to open the preview of this book

In between these tasks, the participants were asked to rate how challenging these tasks were and to write their feelings or opinions.

In addition to Maze, which we hoped would help us scale our survey to reach a high volume of diverse participants, we also conducted two digital person-to-person user interviews over Zoom to get a more in-depth understanding of how patrons approach challenging tasks. Because Maze can only encode flows we program directly, these “in person” interviews gave us the ability to intervene and learn more when patrons became confused.

Results & Findings

About a week after releasing the Maze links on the website, we had a total of 760 participants providing feedback on our existing and proposed designs. Maze provided us with useful synthesis about how long it took participants to complete tasks and showed a heat map of where patrons were clicking (correctly or incorrectly) on their screens. These features were helpful when evaluating which designs would better serve our patrons. Here’s a list of findings I gathered from Maze:

The Sharing Feature:

Results suggest that the V1 design was the clearest to patrons for the task of sharing the book. It was surprising to learn that patrons, on average, spent the most time completing the task on this same design. Some patrons provided feedback which challenged our initial expectations about what they wanted to accomplish, reporting that they were opposed to sharing a book or that their preferred social network was not included in the list of options.

Giving a book a Star Rating:

One common reaction for all designs was that people expected that clicking on the book’s star ratings summary would take them to a screen where they could rate the book. It was surprising and revealing to learn that many patrons didn’t know how to rate books on our current book page design!

Leaving a Community Review

When participants were asked to leave a community review, some scrolled all the way down the screen instead of using the review navigation link which was placed above the fold. In design V4, using a Tag 🏷 icon for a review button confused many people who didn’t understand the relationship between book reviews and topic tags. In addition, the designs which tested combining community review tags and star ratings under a single “review” button were not effective at supporting patrons in the tasks of rating or reviewing books.

Borrowing Other Editions

Many of our new designs featured a new Read button with a not-yet-implemented drop down button. While it was not our intention, we found many people clicked the unimplemented borrow drop down with the expectation that this would let them switch between other available book editions, such as those in different languages. This task also taught us that a book page navigation bar at the top of the design was most effective at supporting patrons through this task. However, after successfully clicking the correct navigation button, patrons had a difficult time using the provided experience to find and borrow a Spanish edition within the editions table. Some patrons expected more obvious visual cues or a filtering system to more easily distinguish between available editions in different languages.

Synthesis

By synthesizing feedback across internal stakeholders, user interviews, and results from our six mazes, we arrived at a design proposal which provides patrons with several advantages over today’s existing design:

  • First and foremost, redesigned navigation at the very top of the book page
  • A prominent title & author section which showcases the book’s star ratings and invites the patron to share the book.
  • A large, clear book cover to orient patrons.
  • An actionable section which features a primary call to action of “Borrow”, a “Preview” link, and a visually de-emphasized “Want to Read” button. Tertiary options are provided for reviewing the book and jotting notes.
  • Below the fold, proposals for a re-designed experience for leaving reviews and browsing other editions.
(Before)
(After)

Reflections

I had a great time working with Open Library and learning more about the UX field. I enjoyed the process of identifying problems, iterating, and familiarizing myself with new tools. Throughout my fellowship, I got great feedback and support from everyone on the team, especially my mentor Mek. He helped me plan an efficient schedule while creating a comfortable working environment. Overall, I truly enjoyed my working experience here and I hope my design work will help patrons in the future!

About the Open Library Fellowship Program

The Internet Archive’s Open Library Fellowship is a flexible, self-designed independent study which pairs volunteers with mentors to lead development of a high-impact feature for OpenLibrary.org. Most fellowship programs last one to two months and are flexible, according to the preferences of contributors and availability of mentors. We typically choose fellows based on their exemplary and active participation, conduct, and performance within the Open Library community. The Open Library staff typically only accepts 1 or 2 fellows at a time to ensure participants receive plenty of support and mentor time. Occasionally, funding for fellowships is made possible through Google Summer of Code or Internet Archive Summer of Code & Design. If you’re interested in contributing as an Open Library Fellow and receiving mentorship, you can apply using this form or email openlibrary@archive.org for more information.

Responsible Disclosure Policies / David Rosenthal

Recently, Uber was completely pwned, apparently by an 18-year-old. Simon Sharwood's Uber reels from 'security incident' in which cloud systems seemingly hijacked provides some initial details:
Judging from screenshots leaked onto Twitter, though, an intruder has compromised Uber's AWS cloud account and its resources at the administrative level; gained admin control over the corporate Slack workspace as well as its Google G Suite account that has over 1PB of storage in use; has control over Uber's VMware vSphere deployment and virtual machines; access to internal finance data, such as corporate expenses; and more.
And in particular:
Even the US giant's HackerOne bug bounty account was seemingly compromised, and we note is now closed.

According to the malware librarians at VX Underground, the intruder was using the hijacked H1 account to post updates on bounty submissions to brag about the degree of their pwnage, claiming they have all kinds of superuser access within the ride-hailing app biz.

It also means the intruder has access to, and is said to have downloaded, Uber's security vulnerability reports.
Thus one of the results of the incident is the "irresponsible disclosure" of the set of vulnerabilities Uber knows about and, presumably, would eventually have fixed. "Responsible disclosure" policies have made significant improvements to overall cybersecurity in recent years, but developing and deploying fixes takes time. For responsible disclosure to be effective, the vulnerabilities must be kept secret while this happens.

Stewart Baker points out in Rethinking Responsible Disclosure for Cryptocurrency Security that these policies are hard to apply to cryptocurrency systems. Below the fold I discuss the details.

Baker summarizes "responsible disclosure":
There was a time when software producers treated independent security research as immoral and maybe illegal. But those days are mostly gone, thanks to rough agreement between the producers and the researchers on the rules of “responsible disclosure.” Under those rules, researchers disclose the bugs they find “responsibly”—that is, only to the company, and in time for it to quietly develop a patch before black hat hackers find and exploit the flaw. Responsible disclosure and patching greatly improves the security of computer systems, which is why most software companies now offer large “bounties” to researchers who find and report security flaws in their products.

That hasn’t exactly brought about a golden age of cybersecurity, but we’d be in much worse shape without the continuous improvements made possible by responsible disclosure.
Baker identifies two fundamental problems for cryptocurrencies:
First, many customers don’t have an ongoing relationship with the hardware and software providers that protect their funds—nor do they have an incentive to update security on a regular basis. Turning to a new security provider or using updated software creates risks; leaving everything the way it was feels safer. So users won’t be rushing to pay for and install new security patches.
Users have also been deluged with accounts of phishing and other scams involving updating or installing software, so are justifiably skeptical of "patch now" messages. In fact, most users don't even try to use cryptocurrency directly, but depend on exchanges. Thus their security depends upon that of their exchange. Exchanges have a long history of miserable security, stretching back eight years to Mt. Gox and beyond. The brave souls who do use cryptocurrency directly depend on the security of their wallet software, which again has a long history of vulnerabilities.

Next, Baker points to the ideology of decentralization as a problem:
That means that the company responsible for hardware or software security may have no way to identify who used its product, or to get the patch to those users. It also means that many wallets with security flaws will be publicly accessible, protected only by an elaborate password. Once word of the flaw leaks, the password can be reverse engineered by anyone, and the legitimate owners are likely to find themselves in a race to move their assets before the thieves do.
Molly White documents a recent example of both problems in Vulnerability discovered in vanity wallet generator puts millions of dollars at risk:
The 1inch Network disclosed a vulnerability that some of their contributors had found in Profanity, a tool used to create "vanity" wallet addresses by Ethereum users. Although most wallet addresses are fairly random-looking, some people use vanity address generators to land on a wallet address like 0xdeadbeef52aa79d383fd61266eaa68609b39038e (beginning with deadbeef), ... However, because of the way the Profanity tool generated addresses, researchers discovered that it was fairly easy to reverse the brute force method used to find the keys, allowing hackers to discover the private key for a wallet created with this method.

Attackers have already been exploiting the vulnerability, with one emptying $3.3 million from various vanity addresses. 1inch wrote in their blog post that "It’s not a simple task, but at this point it looks like tens of millions of dollars in cryptocurrency could be stolen, if not hundreds of millions."

The maintainer of the Profanity tool removed the code from Github as a result of the vulnerability. Someone had raised a concern about the potential for such an exploit in January, but it had gone unaddressed as the tool was not being actively maintained.
It is actually remarkable that it took seven months from the revelation of the potential vulnerability to its exploitation. And the exploits continue, as White reports in Wintermute hacked for $160 million:
Wintermute hasn't disclosed more about the attack, but it's possible that the hacker may have exploited the vulnerability in the vanity wallet address generator Profanity, which was disclosed five days prior. The crypto asset vault admin had a wallet address prefixed with 0x0000000, a vanity address that would have been susceptible to attack if it was created using the Profanity tool.
But everything is fine because the CEO says the company is "solvent with twice over that amount in equity left". Apparently losing one-third of your equity to a thief is no big deal in the cryptosphere.

Baker describes rapid exploitation of such vulnerabilities as "nearly guaranteed" because of the immediate financial reward, and provides two more examples from last month:
In one, hackers took nearly $200 million from Nomad, a blockchain “bridge” for converting and transferring cryptocurrencies. One user began exploiting a flaw in Nomad’s smart contract code. That tipped others to the exploit. Soon, a feeding frenzy broke out, quickly draining the bridge of all available funds. In the other incident, Solana, a cryptocurrency platform, saw hackers drain several million dollars from nearly 8,000 wallets, probably by compromising the security of their seed phrases, thus gaining control of the wallets.
Baker summarizes:
Together, these problems make responsible disclosure largely unworkable. It’s rarely possible to fill a security hole quietly. Rather, any patch is likely to be reverse engineered when it’s released and exploited in a frenzy of looting before it can be widely deployed. (This is not a new observation; the problem was pointed out in a 2020 ACM article that deserves more attention.)

If I’m right, this is a fundamental flaw in cryptocurrency security. It means that hacks and mass theft will be endemic, no matter how hard the industry works on security, because the responsible disclosure model for curing new security flaws simply won’t work.
Böhme et al Fig. 2
The 2020 paper Baker cites is Responsible Vulnerability Disclosure in Cryptocurrencies by Rainer Böhme, Lisa Eckey, Tyler Moore, Neha Narula, Tim Ruffing and Aviv Zohar. The authors describe the prevalence of vulnerabilities thus:
The cryptocurrency realm itself is a virtual "wild west," giving rise to myriad protocols each facing a high risk of bugs. Projects rely on complex distributed systems with deep cryptographic tools, often adopting protocols from the research frontier that have not been widely vetted. They are developed by individuals with varying level of competence (from enthusiastic amateurs to credentialed experts), some of whom have not developed or managed production-quality software before. Fierce competition between projects and companies in this area spurs rapid development, which often pushes developers to skip important steps necessary to secure their codebase. Applications are complex as they require the interaction between multiple software components (for example, wallets, exchanges, mining pools). The high prevalence of bugs is exacerbated by them being so readily monetizable. With market capitalizations often measured in the billions of dollars, exploits that steal coins are simultaneously lucrative to cybercriminals and damaging to users and other stakeholders. Another dimension of importance in cryptocurrencies is the privacy of users, whose transaction data is potentially viewable on shared ledgers in the blockchain systems on which they transact. Some cryptocurrencies employ advanced cryptographic techniques to protect user privacy, but their added complexity often introduces new flaws that threaten such protections.
Böhme et al describe two fundamental differences between the disclosure and patching process in normal software and cryptocurrencies. First, coordination:
the decentralized nature of cryptocurrencies, which must continuously reach system-wide consensus on a single history of valid transactions, demands coordination among a large majority of the ecosystem. While an individual can unilaterally decide whether and how to apply patches to her client software, the safe activation of a patch that changes the rules for validating transactions requires the participation of a large majority of system clients. Absent coordination, users who apply patches risk having their transactions ignored by the unpatched majority.

Consequently, design decisions such as which protocol to implement or how to fix a vulnerability must get support from most stakeholders to take effect. Yet no developer or maintainer naturally holds the role of coordinating bug fixing, let alone commands the authority to roll out updates against the will of other participants. Instead, loosely defined groups of maintainers usually assume this role informally.

This coordination challenge is aggravated by the fact that unlike "creative" competition often observed in the open source community (for example, Emacs versus vi), competition between cryptocurrency projects is often hostile. Presumably, this can be explained by the direct and measurable connection to the supporters' financial wealth and the often minor technical differences between coins. The latter is a result of widespread code reuse, which puts disclosers into the delicate position of deciding which among many competing projects to inform responsibly. Due to the lack of formally defined roles and responsibilities, it is moreover often difficult to identify who to notify within each project. Furthermore, even once a disclosure is made, one cannot assume the receiving side will act responsibly: information about vulnerabilities has reportedly been used to attack competing projects, influence investors, and can even be used by maintainers against their own users.
The second is controversy, which:
emerges from the widespread design goal of "code is law," that is, making code the final authority over the shared system state in order to avoid (presumably fallible) human intervention. To proponents, this approach should eliminate ambiguity about intention, but it inherently assumes bug-free code. When bugs are inevitably found, fixing them (or not) almost guarantees at least someone will be unhappy with the resolution. ... Moreover, situations may arise where it is impossible to fix a bug without losing system state, possibly resulting in the loss of users' account balances and consequently their coins. For example, if a weakness is discovered that allows anybody to efficiently compute private keys from data published on the blockchain, recovery becomes a race to move to new keys because the system can no longer tell authorized users and attackers apart. This is a particularly harmful consequence of building a system on cryptography without any safety net. The safer approach, taken by most commercial applications of cryptography but rejected in cryptocurrencies, places a third party in charge of resetting credentials or suspending the use of known weak credentials.
I discussed the forthcoming ability to "efficiently compute private keys" in The $65B Prize.

Böhme et al go on to detail seven episodes in which cryptocurrencies' vulnerabilities were exploited. In some cases disclosure was public and exploitation was rapid; in other cases the developers were informed privately. A pair of vulnerabilities in Bitcoin provides an example:
a developer from Bitcoin Cash disclosed a bug to Bitcoin (and other projects) anonymously. Prior to the Bitcoin Cash schism, an efficiency optimization in the Bitcoin codebase mistakenly dropped a necessary check. There were actually two issues: a denial-of-service bug and potential money creation. It was propagated into numerous cryptocurrencies and resided there for almost two years but was never exploited in Bitcoin. ... The Bitcoin developers notified the miners controlling the majority of Bitcoin's hashrate of the denial-of-service bug first, making sure they had upgraded so that neither bug could be exploited before making the disclosure public on the bitcoin-dev mailing list. They did not notify anyone of the inflation bug until the network had been upgraded.
The authors conclude with a set of worthy recommendations for improving the response to vulnerabilities, as Baker does also. But they all depend upon the existence of trusted parties to whom the vulnerability can be disclosed, and who are in a position to respond appropriately. In a truly decentralized, trustless system such parties cannot exist. None of the recommendations address the fundamental problem which, as I see it, is this:
  • Cryptocurrencies are supposed to be decentralized and trustless.
  • Their implementations will, like all software, have vulnerabilities.
  • There will be a delay between discovery of a vulnerability and the deployment of a fix to the majority of the network nodes.
  • If, during this delay, a bad actor finds out about the vulnerability, it will be exploited.
  • Thus if the vulnerability is not to be exploited its knowledge must be restricted to trusted developers who are able to ensure upgrades without revealing their true purpose (i.e. the vulnerability). This violates the goals of trustlessness and decentralization.
This problem is particularly severe in the case of upgradeable "smart contracts" with governance tokens. In order to patch a vulnerability, the holders of governance tokens must vote. This process:
  • Requires public disclosure of the reason for the patch.
  • Cannot be instantaneous.
If cryptocurrencies are not decentralized and trustless, what is their point? Users have simply switched from trusting visible, regulated, accountable institutions backed by the legal system, to invisible, unregulated, unaccountable parties effectively at war with the legal system. Why is this an improvement?

Announcing Incoming NDSA Coordinating Committee Members for 2023-2025 / Digital Library Federation

Please join me in welcoming the three newly elected Coordinating Committee members: Shira Peltzman, Deon Schutte, and Bethany Scott. Their terms begin January 1, 2023, and run through December 31, 2025.

Shira Peltzman is the Digital Archivist for UCLA Library Special Collections where she works with stakeholders on an enterprise-wide basis to preserve and make LSC’s born-digital material accessible to the widest possible audience. As a current member of the NDSA Staffing Survey Working Group, she has seen firsthand the importance of undertaking this work collectively and the impact that it has on the field. Shira is interested in serving as a member of the NDSA Coordinating Committee because she would like to help guide and coordinate this work to maximize the quality, relevance, consistency, and overall effectiveness of the publications that come out of all Interest and Working Groups.

Deon Schutte worked as a freelance typesetter in the educational publishing industry in South Africa for many years. In 2018 he completed his B.INF (Bachelor of Information Science) through the University of South Africa and his B.INF Honours in 2019. Deon is an MPhil (Master of Philosophy, specializing in Digital Curation) candidate at the University of Cape Town. His research interests are hermeneutics, heuristics, and sensemaking as cognitive processes that support the curation of archival arrangements. Deon serves as the Chair of the Association of Southern African Indexers and Bibliographers, and he is a Fellow of the South African Chefs Association. He works at Africa Media Online as the project manager of a team that is tasked with the organizing and arrangement, prior to digitisation, of the extensive personal archive of one of the prominent politicians of the anti-Apartheid struggle.

Bethany Scott is the Head of Preservation & Reformatting at the University of Houston Libraries. In this role she provides strategic leadership for the Libraries’ physical and digital preservation programs, and digitization and reformatting services for the Libraries and its patrons. Bethany also serves as Product Owner of the Libraries’ open-source digital access and preservation ecosystem, which incorporates Avalon, Hyrax, Archivematica, and ArchivesSpace. Her areas of expertise include digital preservation, born-digital archives, scanning and imaging, and reuse of archival metadata.

We are also grateful to the very talented, qualified individuals who participated in this election.

We are indebted to our outgoing Coordinating Committee members, Courtney Mumma, Dan Noonan, and Nathan Tallman, for their service and many contributions. To sustain a vibrant, robust community of practice, we rely on and deeply value the contributions of all members, including those who took part in voting.

 

Hannah Wang, Vice Chair

On behalf of the NDSA Coordinating Committee

The post Announcing Incoming NDSA Coordinating Committee Members for 2023-2025 appeared first on DLF.

Are Blockchains Decentralized? / David Rosenthal

Bitcoin pools 9/1/18
In April 2014, more than 8 years ago, I posted this comment:
Gradually, the economies of scale you need to make money mining Bitcoin are concentrating mining power in fewer and fewer hands. I believe this centralizing tendency is a fundamental problem for all incentive-compatible P2P networks. ... After all, the decentralized, distributed nature of Bitcoin was supposed to be its most attractive feature.
That October I expanded the comment into Economies of Scale in Peer-to-Peer Networks, in which I wrote:
The simplistic version of the problem is this:
  • The income to a participant in a P2P network of this kind should be linear in their contribution of resources to the network.
  • The costs a participant incurs by contributing resources to the network will be less than linear in their resource contribution, because of the economies of scale.
  • Thus the proportional profit margin a participant obtains will increase with increasing resource contribution.
  • Thus the effects described in Brian Arthur's Increasing Returns and Path Dependence in the Economy will apply, and the network will be dominated by a few, perhaps just one, large participant.
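
To make the quoted argument concrete, here is a brief formalization in LaTeX notation (my own sketch, with assumed functional forms rather than anything from the original post). Take a participant's resource contribution to be r, income linear in r, and cost sublinear in r because of economies of scale:

% income is linear in the resource contribution r
I(r) = a r, \quad a > 0
% costs grow sublinearly: economies of scale
C(r) = b r^{\gamma}, \quad b > 0,\ 0 < \gamma < 1
% so the proportional profit margin rises with r
m(r) = \frac{I(r) - C(r)}{I(r)} = 1 - \frac{b}{a}\, r^{\gamma - 1},
\qquad \frac{dm}{dr} = \frac{b\,(1-\gamma)}{a}\, r^{\gamma - 2} > 0

A margin that rises with scale is exactly the increasing-returns condition under which Arthur's analysis predicts domination by a few large participants.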
Ethereum miners 11/07/21
Ever since, I have been pointing out that the claim that permissionless blockchain networks are decentralized is gaslighting, and thus that the benefits decentralization is supposed to deliver are not obtained in practice. Real-world permissionless blockchains such as Bitcoin's and Ethereum's have remained centralized all this time.

Now, a DARPA-sponsored report entitled Are Blockchains Decentralized? by a large team from the Trail of Bits security company conforms to Betteridge's Law. They examine this and many other centralizing forces affecting a wide range of blockchain implementations and conclude that the answer to their question is "No". Below the fold I comment on each of their findings (in italic), then discuss Professor Angela Walch's analysis of the problems of using "decentralized" in a legal context.
The challenge with using a blockchain is that one has to either (a) accept its immutability and trust that its programmers did not introduce a bug, or (b) permit upgradeable contracts or off-chain code that share the same trust issues as a centralized approach.
Although the report is focused on Bitcoin's blockchain, it includes some of Ethereum's problems:
For example, Alice can submit a transaction to a contract and, before the transaction is mined, the contract could be upgraded to have completely different semantics. The transaction would be executed against the new contract. Upgradeable contract patterns have become incredibly popular in Ethereum as they allow developers to circumvent immutability to patch bugs after deployment. But they also allow developers to patch in backdoors that would allow them to abscond with a contract’s assets. The challenge with using a blockchain is that one has to either (a) accept its immutability and trust that the programmers did not introduce a bug, or (b) permit upgradeable contracts or off-chain code that share the same trust issues as a centralized approach.
Given that it is impossible to predict how long any given transaction will be delayed, the risk Alice runs can be significant. I first discussed the problems that "upgradeable" contracts both cure and create in 2018's DINO and IINO, linking to Udi Wertheimer's 2017 Bancor Unchained: All Your Token Are Belong To Us:
Bancor’s contracts are “upgradeable”, meaning they can replace them with new functionality, giving them more power, or removing power from themselves. They promise on some communications they will gradually remove their control over the system.
Bancor's contracts had administrative backdoors that allowed, for example, taking anyone's tokens. An attacker exploited them to take complete control of the contract and steal $23M.
Every widely used blockchain has a privileged set of entities that can modify the semantics of the blockchain to potentially change past transactions.
By which the authors mean the developers and maintainers of the software. They note that:
In some cases, the developers or maintainers of a blockchain intentionally modify its software to mutate the blockchain’s state to revert or mitigate an attack—this was Ethereum’s response to the 2016 DAO hack. But in most other cases, changes to a blockchain are an unintentional or unexpected consequence of another change. For example, Ethereum’s Constantinople hard fork reduced the gas costs of certain operations. However, some immutable contracts that were deployed before the hard fork relied on the old costs to prevent a certain class of attack called “reentrancy.” Constantinople’s semantic changes caused these once secure contracts to become vulnerable.
Report page 23
It will come as no surprise that the authors found a high degree of centralization in the various blockchains' client software:
We generated software bills of materials (SBOMs) and dependency graphs for the major clients for Bitcoin, Bitcoin Cash, Bitcoin Gold, Ethereum, Zcash, Iota, Dash, Dogecoin, Monero, and Litecoin. We then compared two dependency graphs based on the clients’ normalized edit distance.
The table shows that almost all the clients are at least 90% identical. This commonality makes various blockchains vulnerable to supply chain attacks on other blockchains:
While software bugs can lead to consensus errors, we demonstrated that overt software changes can also modify the state of the blockchain. Therefore, the core developers and maintainers of blockchain software are a centralized point of trust in the system, susceptible to targeted attack. There are currently four active contributors with access to modify the Bitcoin Core codebase, the compromise of any of whom would allow for arbitrary modification of the codebase. Recently, the lead developer of the $8 billion Polygon network, Jordi Baylina, was targeted in an attack with the Pegasus malware, which could have been used to steal his wallet or deployment credentials.
Many recent software supply chain attacks have used compromised developer credentials. Thus the security of the blockchain depends upon the operational security of the developers.
The number of entities sufficient to disrupt a blockchain is relatively low: four for Bitcoin, two for Ethereum, and less than a dozen for most PoS networks.
This is the so-called "Nakamoto coefficient". The authors make the same point that I have been making for the last eight years:
It is well known that Bitcoin is economically centralized: in 2020, 4.5% of Bitcoin holders controlled 85% of the currency. But what about Bitcoin’s systemic or authoritative centralization? As we saw in the last section, Bitcoin’s Nakamoto coefficient is four, because taking control of the four largest mining pools would provide a hashrate sufficient to execute a 51% attack. In January of 2021, the Nakamoto coefficient for Ethereum was only two. As of April 2022, it is three.
The authors explain this lack of decentralization:
Each mining pool operates its own, proprietary, centralized protocol and interacts with the public Bitcoin network only through a gateway node. In other words, there are really only a handful of nodes that participate in the consensus network on behalf of the majority of the network’s hashrate. Controlling those nodes provides the means to, at a minimum, deny service to their constituent hashrate.
They perform the same analysis for a set of Proof-of-Stake blockchains, which are naturally centralized because of the extreme Gini coefficients of cryptocurrencies:
Most PoS blockchain’s consensus protocols (Avalanche’s Snowflake, Solana’s Tower BFT, etc.) break down if the validators associated with at least one-third of the staked assets are malicious, effectively pausing the network. Therefore, the Nakamoto coefficient of most PoS blockchains is equal to the smallest number of validators that have collectively staked at least a third of all of the staked assets.
And they point out that for Ethereum2, the most consequential PoS blockchain, the Nakamoto coefficient is currently 12 because:
According to Nansen, the four biggest depositors have more than a third of the stake, and those depositors have 12 nodes.
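
The definition quoted above is easy to turn into code. Here is a minimal sketch, assuming only a list of per-validator stake amounts (illustrative, not from the report):

def nakamoto_coefficient(stakes, threshold=1/3):
    """Smallest number of validators whose combined stake reaches the threshold share."""
    total = sum(stakes)
    count = 0
    cumulative = 0.0
    # Greedily take the largest stakers first
    for stake in sorted(stakes, reverse=True):
        cumulative += stake
        count += 1
        if cumulative >= threshold * total:
            return count
    return count

# Example: the four biggest stakers jointly hold over a third of the total stake
print(nakamoto_coefficient([10, 10, 10, 9, 8, 8, 8, 8, 8, 8, 8, 8]))  # prints 4
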
There are many problems with Proof-of-Stake, but obvious ones include that the depositors are pseudonymous, and could thus actually be one person, that much of the currency is held by exchanges on behalf of their customers (see Justin Sun's takeover of the Steem blockchain), and that it is possible to borrow large amounts of cryptocurrencies. These issues were aptly illustrated by Molly White's report Solend DAO passes proposal to take over the account of a large holder with a position that poses systemic risk:
The proposal succeeded hours after it was proposed, with one whale providing 1 million votes out of the 1.15 million votes in favor.
The standard protocol for coordination within blockchain mining pools, Stratum, is unencrypted and, effectively, unauthenticated.
The Stratum protocol is how the mining pool operator divides up the work on its proposed block and assigns each fragment to pool members. The authors point out that:
We have discovered that, today, all of the mining pools we tested either assign a hard-coded password for all accounts or simply do not validate the password provided during authentication. For example, all ViaBTC accounts appear to be assigned the password “123.” Poolin seems not to validate authentication credentials at all. Slushpool explicitly instructs its users to ignore the password field as, “It is a legacy Stratum protocol parameter that has no use nowadays.” We discovered this by registering multiple accounts with the mining pools, and examining their server code, when available. These three mining pools alone account for roughly 25% of the Bitcoin hashrate.
The Stratum protocol was enhanced with passwords in order to mitigate a denial-of-service attack, but clearly the pools no longer care about that protection.
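To make the weakness concrete, here is roughly what a Stratum V1 login looks like on the wire: a cleartext, newline-delimited JSON-RPC message whose params list carries the worker name and the password that, per the quote above, pools either hard-code or ignore (the account and worker names here are made up):

import json

# A Stratum V1 authorization request as it travels, unencrypted, over TCP.
# The account/worker name is hypothetical; "123" echoes the hard-coded
# password the report found ViaBTC assigning to all accounts.
authorize = {
    "id": 2,
    "method": "mining.authorize",
    "params": ["exampleaccount.worker1", "123"],
}
print(json.dumps(authorize))

Anyone on the network path can read, replay, or tamper with such messages, which is why the unencrypted, unauthenticated protocol matters for the attacks discussed below.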
For a blockchain to be optimally distributed, there must be a so-called Sybil cost. There is currently no known way to implement Sybil costs in a permissionless blockchain like Bitcoin or Ethereum without employing a centralized trusted third party (TTP). Until a mechanism for enforcing Sybil costs without a TTP is discovered, it will be almost impossible for permissionless blockchains to achieve satisfactory decentralization.
As regards the economic forces driving centralization, in the section on Sybil and Eclipse Attacks: The “Other” 51%, the authors note that (Pg. 15):
A recent impossibility result for the decentralization of permissionless blockchains like Bitcoin and Ethereum was discovered by Kwon et al. It indicates that for a blockchain to be optimally distributed, there must be a so-called Sybil cost. That is, the cost of a single participant operating multiple nodes must be greater than the cost of operating one node.
In 2019's Impossibility of Full Decentralization in Permissionless Blockchains, Kwon et al formalize and extend the mechanism I described in 2014. My argument for centralization was that economies of scale meant that the cost per Sybil would decrease with N; their argument is that unless the cost per Sybil increases with N the system will not be decentralized. And, as they point out, no-one has any idea how to push back against economies of scale, much less make the cost per Sybil go up with N:

Because there is no central membership register, permissionless blockchains have to defend against Sybil attacks. But they also face three other related problems:
  • Maintaining the connectivity of the network as nodes join and leave without a central register of node addresses.
  • Maintaining the "mempool" of pending transactions as they are created, chosen by miners to include in blocks, and eventually finalized with no central database.
  • Maintaining the state of the blockchain as miners propose newly mined blocks to be appended to it with no central database.
They manage these tasks using a Gossip protocol (a minimal sketch in code appears after the list below); each node communicates with a set of neighbors and, if it receives information from one of them that it has not previously received, updates its state and forwards the message to the other neighbors. The number of a node's neighbors is called its "degree". Among the information a node communicates to its neighbors are:
  • The identities of its other neighbors. Thus a node wishing to join the network need only communicate with one member node, which will propagate its identity to the other nodes.
  • Any transactions it has received. Thus, similarly, a node wishing to transact need only communicate with its neighbors to update the "mempool".
  • Its idea of the head of the chain. Thus network-wide consensus on the longest chain is accelerated.
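A minimal sketch of this gossip mechanism (a toy simulation, not Bitcoin's actual P2P code) makes the flood-and-forward behavior concrete:

import random

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbors = set()   # the size of this set is the node's "degree"
        self.seen = set()        # transactions, blocks and addresses already received

    def receive(self, item):
        if item in self.seen:
            return               # already known: do not forward again
        self.seen.add(item)
        for peer in self.neighbors:
            peer.receive(item)   # forward only new information to each neighbor

# build a small random network: each node picks three peers
nodes = [Node(i) for i in range(20)]
for node in nodes:
    for peer in random.sample([n for n in nodes if n is not node], 3):
        node.neighbors.add(peer)
        peer.neighbors.add(node)

nodes[0].receive("tx-abc123")
reached = sum("tx-abc123" in n.seen for n in nodes)
print(f"{reached} of {len(nodes)} nodes saw the transaction")

If the neighbor graph splits into disconnected components, the same code shows the transaction reaching only the component it started in, which is exactly the partition risk described next.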
One problem with this technique is that it is possible for the network to partition into two disjoint sub-networks that do not communicate. In order to take part in maintaining connectivity, Bitcoin nodes have to have a public IP address so that they can accept incoming connections.
[Figure from report page 18]
By crawling the Bitcoin network and querying nodes for known peers, we can estimate the number of public Bitcoin nodes (i.e., nodes actively accepting incoming connections). From crawling the Bitcoin network throughout 2021, we estimate that the public Bitcoin nodes constitute only 6–11% of the total number of nodes. Therefore, the vast majority of Bitcoin nodes do not meaningfully contribute to the health of the Bitcoin network. We have extended the Barabási–Albert random graph model to capture the behavior of Bitcoin peering. This model suggests that at the current size of the Bitcoin network, at least 10% of nodes must be public to ensure that new nodes are able to maximize their number of peers (and, therefore, maximize the health and connectivity of the network). As the total number of nodes increases, this bound approaches 40%.
The authors observe that only 6-11% of Bitcoin nodes accept incoming connections, and assume that the others are behind network address translators and can only listen to the gossip protocol. For the purpose of maintaining connectivity the Bitcoin network is much smaller than it appears. The authors conclude:
A dense, possibly non-scale-free, subnetwork of Bitcoin nodes appears to be largely responsible for reaching consensus and communicating with miners—the vast majority of nodes do not meaningfully contribute to the health of the network.
Thus the target for attacks on the Bitcoin network is not the whole network, but only this subnetwork.
When nodes have an out-of-date or incorrect view of the network, this lowers the percentage of the hashrate necessary to execute a standard 51% attack. Moreover, only the nodes operated by mining pools need to be degraded to carry out such an attack. For example, during the first half of 2021 the actual cost of a 51% attack on Bitcoin was closer to 49% of the hashrate.
[Figure from report page 14]
Because it takes time for a gossip protocol to flood the whole network with updated information, the Bitcoin network targets a block time of around 10 minutes. As the graph on page 14 of the report shows, as delays become comparable to a block time, the proportion of nodes that are no longer effective in achieving consensus increases, thus decreasing the proportion of the hashrate needed for a 51% attack. Given the actual network topology, DDoS-style attacks on the "dense, possibly non-scale-free, subnetwork" could cause enough delay to materially assist such an attack. This is why the issues with gossip protocols are related to Sybil defense.
The vast majority of Bitcoin nodes appear to not participate in mining and node operators face no explicit penalty for dishonesty.
Since the vast majority of blocks are mined by the large pools, which each appear as a single node, this is inevitable. Miners in a pool that submit invalid blocks will be excluded and penalized by the pool.
Bitcoin traffic is unencrypted—any third party on the network route between nodes (e.g., ISPs, Wi-Fi access point operators, or governments) can observe and choose to drop any messages they wish.
And:
Of all Bitcoin traffic, 60% traverses just three ISPs.
As we see, for example, with bufferbloat, ISPs need not explicitly drop entire messages; they can introduce delays that cause packets to be dropped.
As of July 2021, about half of all public Bitcoin nodes were operating from IP addresses in German, French, and US ASes, the top four of which are hosting providers (Hetzner, OVH, Digital Ocean, and Amazon AWS). The country hosting the most nodes is the United States (roughly one-third), followed by Germany (one-quarter), France (10%), The Netherlands (5%), and China (3%). ... This is yet another potential surface on which to execute an eclipse attack, since the ISPs and hosting providers have the ability to arbitrarily degrade or deny service to any node. Traditional Border Gateway Protocol (BGP) routing attacks have also been identified as threats.

The underlying network infrastructure is particularly important for Bitcoin and its derivatives, since all Bitcoin protocol traffic is unencrypted. Unencrypted traffic is fine for transactional and block data, since they are cryptographically signed and, therefore, impervious to tampering. However, any third party on the network route between nodes (e.g., ISPs, Wi-Fi access point operators, or governments) can observe and choose to drop any messages they wish.
The effect of dropping messages and introducing delays is to reduce the threshold for a 51% attack.
Tor is now the largest network provider in Bitcoin, routing traffic for about half of Bitcoin’s nodes. Half of these nodes are routed through the Tor network, and the other half are reachable through .onion addresses. The next largest autonomous system (AS)—or network provider—is AS24940 from Germany, constituting only 10% of nodes. A malicious Tor exit node can modify or drop traffic similarly to an ISP.
Malicious Tor exit nodes are a long-running problem.
Of Bitcoin’s nodes, 21% were running an old version of the Bitcoin Core client that is known to be vulnerable in June of 2021.
The security of a blockchain depends on the software, which will inevitably have vulnerabilities, which will need timely patches.
The Ethereum ecosystem has a significant amount of code reuse: 90% of recently deployed Ethereum smart contracts are at least 56% similar to each other.
The potential for "smart contracts" acquiring vulnerabilities through their software supply chain is very significant, because once again it is highly centralized. The authors sampled:
1,586 smart contracts deployed to the Ethereum blockchain in October 2021, and compared their bytecode similarity, using Levenshtein distance as a metric. One would expect such a metric to underestimate the similarity between contracts, since it compares low-level bytecode that has already been transformed, organized, and optimized by the compiler, rather than the original high-level source code. This metric was chosen both to act as a lower bound on similarity and to enable comparison between contracts for which we do not have the original source code. We discovered that 90% of the Ethereum smart contracts were at least 56% similar to each other. About 7% were completely identical.
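A hedged sketch of this kind of pairwise comparison, using a textbook edit-distance implementation on made-up bytecode fragments (the report's actual methodology and thresholds may differ):

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """1.0 means identical bytecode; 0.0 means maximally different."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# hypothetical contract bytecode fragments (hex strings)
contract_a = "6080604052348015600f57600080fd5b50"
contract_b = "6080604052348015601057600080fd5b50"
print(f"{similarity(contract_a, contract_b):.1%} similar")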
However, the authors don't discuss the major cause of Ethereum's "decentralized apps" not being decentralized. Ethereum nodes need far more resources than a mobile device or a desktop browser can supply, but a mobile device or a desktop browser is where a "decentralized app" needs to run if it is going to interact with a human. So, as Moxie Marlinspike discovered:
companies have emerged that sell API access to an ethereum node they run as a service, along with providing analytics, enhanced APIs they’ve built on top of the default ethereum APIs, and access to historical transactions. Which sounds… familiar. At this point, there are basically two companies. Almost all dApps use either Infura or Alchemy in order to interact with the blockchain. In fact, even when you connect a wallet like MetaMask to a dApp, and the dApp interacts with the blockchain via your wallet, MetaMask is just making calls to Infura!
Law professor Angela Walch's Deconstructing ‘Decentralization’: Exploring the Core Claim of Crypto Systems examines the legal issues created by the false assertion that permissionless blockchains are decentralized. She takes off from here:
On June 14, 2018, William Hinman, Director of the SEC’s Division of Corporation Finance, seized the crypto world’s attention when he stated that “current offers and sales of Ether are not securities transactions” and linked this conclusion to the “sufficiently decentralized” structure of the Ethereum network.
She concludes:
Like many other descriptors of blockchain technology (e.g., immutable, trustless, reflects truth), the adjective ‘decentralized’ as an inevitable characteristic of blockchain technology proves to be an overstatement, and we know that making decisions based on overstatements rather than reality can lead to bad consequences.
Her argument is in four parts:
  1. Walch analyzes how people use the word "decentralized":
    For example, in Arizona’s statute that uses the term ‘decentralized’ to define ‘blockchain technology,’ there is no definition of ‘decentralized’ to be found. Most mainstream descriptions of blockchain technologies or cryptoassets state simply that blockchains are decentralized. End of story. Decentralized is just something that blockchains are. An inherent characteristic. An essential and identifying feature.
    She identifies two senses in which it is used:
    First, it is used to describe the network of computers (often referred to as ‘nodes’) that comprise a permissionless blockchain, as these systems operate through peer‐to‐peer connections between computers, rather than on a central server.
    ...
    The second way ‘decentralized’ is commonly used is to describe how power or agency works within permissionless blockchain systems. If there is not a single, central party keeping the record, that means that no single party has responsibility for it, and thus no single party is accountable for it.
    She then dissects Hinman's speech using these senses, showing how it conflates the two meanings, and how Hinman asserts that the Bitcoin and Ethereum blockchains are "sufficiently decentralized" without a clear definition of the term or detailed evidence to support the assertion.

  2. Walch observes that:
    It turns out I am far from alone in critiquing the use of ‘decentralized’ to describe blockchain systems. In fact, in the past few years, exploring the concept of “decentralization” has become a trend for thought leaders and academics in the crypto space. Venture capitalists, Ethereum creator Vitalik Buterin, and others have attempted to articulate what “decentralization” means.
    Despite these efforts, we are no nearer to an agreed definition. Among the difficulties is that these systems are layered, and the way they are centralized is different at each layer. Walch makes the following points:
    1. No One Knows What “Decentralization” Means because it cannot be measured. Although it is possible to estimate a "Nakamoto coefficient" assuming that the entities involved are independent, there is no way to know if this assumption is true. And in most cases this computation ignores the concentration of power in the software development team, who cannot be considered independent actors.
    2. Satoshi Didn’t Invent Decentralization. Walch means here that it has a long history in politics but, as Arvind Narayanan and Jeremy Clark show in Bitcoin's Academic Pedigree, it also has a long history in software engineering.
    3. Decentralized Does Not Equal Distributed. Walch notes that these terms are often misused in the discourse. Decentralized refers to multiple independent loci of power, whereas distributed refers to multiple coordinated loci of execution.
    4. Decentralization Exists on a Spectrum, which, since it cannot be measured, should be a given.
    5. Decentralization is Dynamic rather than Static, because both the technology and the way it is used evolve over time. Walch observes:
      The critical takeaway here is that any measurement of decentralization is obsolete immediately after it has been calculated. In a permissionless system, anyone can join, and no one has to stay, so the system’s composition is, in theory, always in flux.
    6. Decentralization is Aspirational, Not Actual, or as I have argued, the use of the term is essentially gaslighting.
    7. Decentralization Can Be Used to Hide Power or Enable Rule‐Breaking. The whole purpose of cryptocurrencies is to evade regulation, so Walch is too kind when she writes:
      the term ‘decentralized’ is being used to hide actions by participants in the system in a fog of supposedly “freely floating authority,” and we must be vigilant not to overlook pockets of authority and power within these systems.
      Walch fails to appreciate that, as Dave Troy explains, cryptocurrencies are rooted in the idea that government and thus regulation are evil.
    8. Calls to Action. Walch writes:
      The status quo usage of the terms “decentralized” and “decentralization” is deemed untenable by many commentators, and there are a variety of calls to action in the literature. ... The rationale behind these calls to action is that current usage of the term is creating misunderstandings about the capabilities of the technology. Further, it is clearly creating misunderstandings about how power works in these systems, with the potential for error in how law or regulation treats these systems and the people who act within them.
      Well, yes, but "creating misunderstandings about how power works" is the goal the crypto-bros are trying to achieve.

  3. Walch then provides:
    examples of actions within the Bitcoin and Ethereum blockchain systems that undermine claims that either system is particularly decentralized.
    These include two accounts of secretive actions by Bitcoin developers, Critical Bug Discovery and Fix in Bitcoin Software in Fall 2018 and Bitcoin’s March 2013 Hard Fork, and two accounts of similar secretive actions by Ethereum developers, Secret Meetings of Ethereum Core Developers in Fall 2018 and Ethereum’s July 2016 Hard Fork. These descriptions of the response of the systems to crises dovetail nicely with the DARPA report's more analytical view of developer power.

    In Hashing Power Concentration and 51% Attacks Walch briefly covers the concentration of mining power, but she omits many of the other aspects of concentration detailed in the DARPA report.

  4. Finally, Walch discusses the problems the issues she identifies (not to mention those in the DARPA report) pose for the use of "decentralized" as a legal concept in four areas:
    1. Decentralization’s Uncertain Meaning Makes It Ill‐Suited for A Legal Standard, in which Walch writes:
      it is relatively easy to count nodes in a network, but much harder to identify and understand how miners, nodes, and software developers interact in governing a blockchain. As Sarah Jamie Lewis, a privacy advocate and crypto systems expert, has explained, “We need to move beyond naïve conceptions of decentralization (like the % of nodes owned by an entity), and instead, holistically, understand how trust and power are given, distributed and interact...Hidden centralization is the curse of protocol design of our age. Many people have become very good at obfuscating and rationalizing away power concentration.”
      The system has more layers than just the miners and the developers, and each of them is centralized in a different, and hard-to-measure way. This obviously makes "decentralization" useless as a basis for regulation.
    2. Decentralization’s Dynamic Nature Complicates Its Use as a Legal Standard, in which Walch writes:
      if the measurement and determination of a decentralization level is done periodically to mark the moment when a particular legal status is achieved, then participants in blockchain systems (nodes, miners, developers) may game the standard by taking actions to move along the decentralization spectrum. If the prize is large (as non‐security status would be), then anything gameable (including a level of decentralization) will be gamed.
      Any technique for measuring the decentralization of a layer in the system will give answers that change over time, and that will be subject to being gamed. Again, this isn't a viable basis for regulation.
    3. If Actual Decentralization is Now Just a Dream, Wait Till It Comes True, in which Walch writes:
      In Part III, I provided examples of events in Bitcoin and Ethereum that belie claims that they are decentralized, while in Part II, I noted the largely aspirational nature of ‘decentralization’ in permissionless blockchains. If this is the case, it is premature to use ‘decentralization’ as a way to make legal decisions. However noble the goals are for a given blockchain system to reach decentralization nirvana, the law must deal with present‐day realities rather than hopes or dreams.
      It has been more than 13 years since Bitcoin launched and in all that time it has never been effectively decentralized at the layers of hash rate or developers. Kwon et al show that it will never be decentralized. So dealing with "present‐day realities" involves rejecting the idea that these blockchains are decentralized and focusing on applying existing laws to the loci of power in the system, such as the organizers of mining pools, the developers, and the exchanges.
    4. Decentralization Veils and Malleable Tokens, in which Walch finally gets to the most important point, laying out the function the claim of "decentralization" is intended to perform:
      the common meaning of ‘decentralized’ as applied to blockchain systems functions as a veil that covers over and prevents many from seeing the actions of key actors within the system. Hence, Hinman’s (and others’) inability to see the small groups of people who wield concentrated power in operating the blockchain protocol. In essence, if it’s decentralized, well, no particular people are doing things of consequence.

      Going further, if one believes that no particular people are doing things of consequence, and power is diffuse, then there is effectively no human agency within the system to hold accountable for anything.
      And:
      The consequence of casting a veil over the people’s actions is that they may not be held accountable for those actions – in effect, that a Veil of Decentralization functions as a liability shield akin to the famed corporate veil.

      Moreover, being protected by a Veil of Decentralization may even be better than what blockchain participants could get if they actually formed a limited liability entity together. In entities, people making significant decisions that affect others (like directors, officers, or managers) generally owe fiduciary duties, but, despite my urging, no one has yet decided to treat the core developers or significant miners of blockchain protocols as fiduciaries. What’s more, the Veil of Decentralization is helpful to participants in the blockchain because it provides a liability shield without making the blockchain system a legal person that could be sued. With a limited liability entity, the corporation or LLC provides the site of legal personhood, but with a decentralized blockchain system, there is no such site. Thus, if we misapply the term “decentralized,” people within “decentralized” blockchain systems get the benefit of limited liability without the cost of certain duties and responsibilities.
Even after describing how "decentralization" functions to evade accountability, Walch doesn't seem to appreciate that this was the entire goal of the cryptocurrency project from its very beginnings. Dave Troy, David Golumbia and Finn Brunton relate how cryptocurrency emerged from the swamps of libertarianism. The whole point of libertarianism is the avoidance of accountability for one's actions.

Update: 5th September 2022
Arijit Sarkar writes in Hetzner anti-crypto policies: A wake-up call for Ethereum’s future:
Hetzner, a private centralized cloud provider, stepped in on a discussion around running blockchain nodes, highlighting its terms of services that prohibit customers from using the services for crypto activities. However, the Ethereum community perceived the revelation as a threat to the ecosystem as Hetzner’s cloud services host nearly 16% of the Ethereum nodes
Note that 54% of the hash rate is hosted on Amazon Web Services.

DLF Forum Featured Sponsor: George Blood LP / Digital Library Federation

Featured post from 2022 DLF Forum sponsor George Blood LP. Learn more about this year’s sponsors on the Forum website.


Committed to Exceptional Quality and Service for Your Historic Media

For more than 20 years, the Open Archival Information System reference model has been the guide for long-term digital preservation. Likewise, NDSA’s Levels of Digital Preservation has helped institutions assess where they are and plan their next steps. For over 30 years, George Blood LP has been working with archives that are ready to move past planning and into the production stage. In that time, we have helped large and small institutions preserve over 1.25 million machine dependent audiovisual records on obsolete media. Our unique experience with antiquated data carriers helps us understand what can go wrong with digital storage over time. From metadata to checksums, we have worked to ensure that content, now preserved through digital surrogates, will remain accessible for future generations.

We offer:

  • door-to-door pick-up in Boston, New York City, Washington, DC and surrounding areas
  • climate-controlled storage with fire protection, PEM loggers, and secure access
  • laboratory for conservation treatments with fume hoods, temperature-controlled ovens, and every known format of tape cleaner
  • playback of over 200 audiovisual formats and over 50 data formats
  • world class Engineers with over 50 Grammy nominations and two dozen Grammy awards
  • full-time Quality Control staff checking every deliverable
  • library and archives professionals throughout the organization, including Collections Care and Project Management

 

Our ongoing in-house research and development explores factors affecting the entire lifecycle of preservation. We widely share our findings through conference presentations and scholarly publications. Be sure to attend George’s presentation on our study of the carbon footprint of checksum verification, presented with our colleagues from WGBH, at 11:15am on Thursday, October 13th.

To discuss how George Blood LP can help you with your collection survey or digitization needs, visit our team in the Baltimore Foyer of the conference hotel.

https://www.georgeblood.com


 

 

See what we’ve been doing lately

The post DLF Forum Featured Sponsor: George Blood LP appeared first on DLF.

New NDSA Code of Conduct Website / Digital Library Federation

A Code of Conduct webpage is now available, sharing information on NDSA’s Code of Conduct practices. The website links to the Code of Conduct itself and also provides information on how to report code of conduct violations.

In most NDSA online spaces, the quickest way to report any concerns or violations is to complete an anonymous form.  Other options include reaching out to Chairs of groups or someone in the Leadership group.  

In addition to this information being available on the website, it is also available at the top of all the Interest Group meeting notes as well as being a pinned post on all Slack channels. 

 

The post New NDSA Code of Conduct Website appeared first on DLF.

auto-archiver / Ed Summers

I spent a bit more time this weekend adding browsertrix-crawler to Bellingcat’s recently released auto-archiver utility. If you aren’t familiar with auto-archiver, it lets you control a web archiving utility using a Google Sheet, has specialized “archivers” for different platforms, and falls back to creating a snapshot using Internet Archive’s SavePageNow. It will also always create a screenshot.

My approach was to treat WACZ creation similarly to how screenshots are created for all archiver plugins that access web pages. The WACZ is generated and then uploaded to the cloud storage. The beauty of WACZ is that if the cloud storage is web accessible, you can view the web archive in ReplayWeb.page.
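For context, this is roughly the kind of browsertrix-crawler invocation involved, wrapped in Python the way an archiver might shell out to it; the page URL and collection name are made up, and auto-archiver's real integration may well look different:

import os
import subprocess

# Run browsertrix-crawler's Docker image against a single page and ask it to
# produce a WACZ; the output typically lands under ./crawls/collections/<name>/
subprocess.run([
    "docker", "run", "-v", f"{os.getcwd()}/crawls:/crawls/",
    "webrecorder/browsertrix-crawler", "crawl",
    "--url", "https://example.com/some-page",   # hypothetical page to archive
    "--generateWACZ",
    "--collection", "example-collection",
], check=True)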

The benefit of creating a snapshot with browsertrix-crawler, as opposed to Internet Archive’s SavePageNow, is that you are in possession of the created web archive data.

The WACZ contains the WARC content that would otherwise have been created at the Internet Archive. The WACZ also has a manifest that lets you verify that it is complete and hasn’t been tampered with. You can control where this web archive content goes and who can see it, whereas it needs to be public with SavePageNow.

Once there’s a way to easily configure browsertrix-crawler using auto-archiver’s config file, it should be possible to pass along a browser profile for archiving sites like Instagram that require a login. Also, as the details of how to sign WACZs are worked out, it should be possible for auto-archiver to cryptographically sign these WACZs so that they can be authenticated and verified.

It seems to be working OK, but needs a little more effort to make it easily configurable. You can see my resulting spreadsheet here, which has two additional columns:

  • WACZ: the location in cloud storage (in this case Amazon S3) where the WACZ was created.
  • ReplayWebPage: a link that opens the archive in ReplayWeb.page, built from the WACZ URL and the URL that was archived (see the sketch below)
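Here is a minimal sketch of how such a ReplayWeb.page link can be assembled, assuming the usual source query parameter and fragment; the bucket and archived URL below are hypothetical:

from urllib.parse import urlencode

def replayweb_link(wacz_url: str, archived_url: str) -> str:
    """Build a ReplayWeb.page link that loads a WACZ and opens one archived page."""
    query = urlencode({"source": wacz_url})
    fragment = urlencode({"view": "pages", "url": archived_url})
    return f"https://replayweb.page/?{query}#{fragment}"

print(replayweb_link(
    "https://example-bucket.s3.amazonaws.com/archives/example.wacz",  # hypothetical WACZ location
    "https://twitter.com/example/status/123",                         # hypothetical archived URL
))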

If you want to give it a try here is my branch.

5 ways coaching is game changing for women of color / Tara Robertson

[Photo: Tara standing in front of a pink mural in the golden hour]

Coaching is a powerful tool to help any leader accelerate their career, and it’s game changing in a different way for women of color who are often The Only. 

First, being The Only means you are succeeding in a system that was not designed for you. It is game changing to have a coach who helps you articulate your values, so that you move through the world in alignment and with clarity. It is game changing to be really clear on what you’re saying yes to and what you’re saying no to and setting your boundaries to align with your values and priorities.

Second, being The Only means you are subject to gaslighting in an organization. It is game changing to have a coach who believes and validates your experience. It is game changing to have a coach who will witness and celebrate your wins too.

Third, being The Only means you are literally alone which, not surprisingly, can be very lonely. It is game changing to have a coach walk beside you on your journey and be a confidential sounding board and thought partner.

Fourth, being The Only means being regularly underestimated by the people around you. It is game changing to have a coach who sees the most magnificent version of yourself and can hold that vision, even on the days that you can’t. It is game changing to have a coach who can help you draw out your inner wisdom.

Fifth, being a woman of color, especially a Black woman, means walking a very narrow tightrope at work. It is game changing to have a coach with whom you can be your whole self, not have to code switch, and have a supportive space to dream big on your terms.

Book a chemistry session to see if we’re a fit for each other.

The post 5 ways coaching is game changing for women of color appeared first on Tara Robertson Consulting.

LAPD PDFs / Ed Summers

I spent a few hours today using the Internet Archive Wayback Machine CDX API in a notebook to see how many snapshots of the Los Angeles Police Department website they’ve made (1,120,030), and specifically how many unique PDF documents there are in there (10,437).

It’s really quite beautiful how you can easily use the wayback module to drop the results of searching the API directly into a Pandas DataFrame for analysis:

import pandas
import wayback

wb = wayback.WaybackClient()
# search() yields CDX records; pandas collects them into a DataFrame
lapd = pandas.DataFrame(wb.search('lapdonline.org', scopeType='prefix'))

I did run into some inconsistencies between searching using a scopeType=domain versus scopeType=prefix.

It’s my understanding that using domain allows you to fetch the complete results for any subdomain of that domain, so assets.lapdonline.org in addition to www.lapdonline.org when searching for lapdonline.org.

But that didn’t seem to be the case (see this example). So I used scopeType=domain to discover the relevant hostnames and then looked for each individually with a scopeType=prefix query.
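For what it's worth, the PDF counting can be sketched against the lapd DataFrame built above; the column names (mime_type, digest) are assumptions that depend on the version of the wayback library:

# a sketch, not the notebook's actual code
pdfs = lapd[lapd["mime_type"] == "application/pdf"]       # assumes a mime_type column
print(len(lapd), "snapshots")
print(pdfs["digest"].nunique(), "unique PDF captures")    # assumes a digest (content hash) column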

Trip Report: NISO Plus Forum 2022 / Peter Murray

Earlier this week, NISO held its one-day NISO Plus Forum for 2022.


This was an in-person meeting that is intended to feed into the online conference in February 2023. Around 100 people from NISO’s membership groups—libraries, content providers, and service providers—attended to talk about metadata. The meeting was structured in World Café style and was moderated by Jonathan Clark. The broad topic of “metadata” was broken down into three parts:

  • Identifiers: what identifiers are missing or underutilized?
  • Exchange: what is the most significant barrier to seamless exchange?
  • Structure: what is impossible due to a lack of appropriate structures?

There were small table discussions for each part of no more than six people, with 15 minutes at a table before everyone got up and moved to a new table. After three rounds of 15 minutes, a scribe that stayed at the same table the whole time reported the major themes to the larger group. What makes this style interesting is that everyone’s experience is different. We agreed to use the Chatham House Rule; what is reported here is my interpretation of my table’s discussion and my take on the broader outcomes.

Identifiers

The most fascinating idea I discovered here was how much the metadata ecosystem relies on “Publication Date”. Not only do several parts use publication date as an anchor, but different understandings of the meaning of “publication date” cause many problems downstream. There is the online publication date, the physical publication date, and sometimes simply an unlabeled publication date. Some publishers have a practice of changing an online publication date to the physical issue date when the issue comes out. (Changing a field that others use as part of metadata to distinguish one item from another is never a good thing.)
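To make the ambiguity concrete, a record could carry each kind of date as its own explicitly labeled element rather than a single unlabeled "publication date"; a purely hypothetical sketch:

# hypothetical article metadata distinguishing the dates that get conflated
article = {
    "doi": "10.1234/example.5678",            # made-up identifier
    "online_publication_date": "2021-11-03",  # when the article first appeared online
    "print_publication_date": "2022-01-15",   # the date of the physical issue
}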

“Place of Publication” also has a lot of variability and inconsistency, even within a publisher. Institution identifiers were also a topic, particularly with the lack of hierarchy in the Research Organization Registry (ROR). Someone reported that ROR is working to address the problem, but right now there is not a good way to relate a department to its encompassing agency or organization.

I showed my professional age a bit by mentioning SICI—the Serial Item and Contribution Identifier. This is a compound identifier developed in the 1990s. Given a citation, you could construct a SICI that was a kind of key to the article. For instance,

Lynch, Clifford A. “The Integrity of Digital Information; Mechanics and Definitional Issues.” JASIS 45:10 (Dec. 1994) p. 737-44

…could be condensed into…

0002-8231(199412)45:10<737:TIODIM>2.3.TX;2-M

This standard didn’t last past the early 2000s, although a few people at my table mentioned that they saw examples of this identifier in their backfile as the publisher-specific suffix of a DOI.
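To make the structure of that string a little clearer, here is a rough sketch that assembles the example SICI from its pieces; the trailing control segment and the check character follow rules in Z39.56 that are glossed over here:

def sici(issn, chronology, volume, issue, start_page, title_code, control, check):
    """Assemble a SICI string from its components (simplified; no check-character math)."""
    item = f"{issn}({chronology}){volume}:{issue}"    # which issue of the serial
    contribution = f"<{start_page}:{title_code}>"     # which article within that issue
    return f"{item}{contribution}{control}-{check}"

# Lynch, "The Integrity of Digital Information...", JASIS 45:10 (Dec. 1994), p. 737
print(sici("0002-8231", "199412", 45, 10, 737, "TIODIM", control="2.3.TX;2", check="M"))
# -> 0002-8231(199412)45:10<737:TIODIM>2.3.TX;2-M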

Exchange

Among the metadata exchange topics, the one I found most interesting was diversity-equity-inclusion data points in an authoring workflow. With a desire to address inequity in a field, these data points would be gathered from many sources. This is sensitive data, so how can it be kept secure while ensuring the integrity of the data (for instance, catching when false data is dumped into the system)?

Structure

As we know, metadata is gathered, aggregated, mixed, and disseminated in ways that the originator can’t predict. A big problem when this happens is having ways to assert confidence in a data element. Take, for instance, the ORCID field for an author. Was that ORCID obtained when the author logged in with the Authenticated ORCID ID workflow? Was it manually keyed by an author (and subject to typos)? Did the software guess the ORCID based on name and institution affiliation? There can be a range of certainty that an ORCID ID is correct for a particular author. And—related to “Exchange”—how can this certainty be expressed to subsequent users of the metadata record?
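No standard mechanism for this exists as far as I know, but as a thought experiment, one way to carry that certainty along with the record is to pair each identifier with a provenance note and a confidence score. A purely hypothetical sketch:

from dataclasses import dataclass

@dataclass
class AssertedOrcid:
    orcid: str
    source: str        # how the value got into the record
    confidence: float  # 0.0 - 1.0, set by the system that captured it

# hypothetical confidence levels for the capture methods described above
authenticated = AssertedOrcid("0000-0002-1825-0097", "authenticated ORCID login", 0.99)
hand_keyed    = AssertedOrcid("0000-0002-1825-0097", "typed by the author", 0.80)
guessed       = AssertedOrcid("0000-0002-1825-0097", "matched on name and affiliation", 0.50)

for assertion in (authenticated, hand_keyed, guessed):
    print(f"{assertion.orcid}  ({assertion.source}, confidence {assertion.confidence:.2f})")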

The Top-Level Topics

One goal of the NISO Plus Forum was to gather topics for sessions at next February’s NISO Plus Conference. At the end of the day, there was one final table session where we were asked to propose a session for the conference: what is the topic? what questions would the session answer? who should attend and who should speak?

Reflecting the observation that metadata is much more than technical specifications, the proposed conference topics tended to want to explain to an organization’s management and end-users why carefully curated metadata is essential. The session would answer questions like “why is it important to fund robust metadata systems?” and “how do we measure return-on-investment for our metadata systems?” One person said that metadata needs its own public relations manager. Another sought accessible messaging on the importance of metadata to send to the people making decisions. Relatedly, how can researchers be convinced of the importance of identifiers like ORCID and ROR as they input data on grant applications and institutional repository deposit forms?

My Takeaways

Honestly, those were not the outcomes I expected as the top-level ideas from the Forum. As you can tell from my summary above, I thought we’d focus on discussions of a collective understanding of a set of “publication date” fields. Or think about how the producers and consumers of metadata can agree on a range of confidence for a particular metadata field. The end-of-the-day outcomes were very high-level and not focused on making the exchange and use of metadata better across the field.

That aside, it was a wonderfully engaging conversation all day long. NISO is on the right track to having focused meetings like this that put a value on activities that are best done in person. This was not an event full of prepared presentations or passive panel sessions. It made the best use of precious face-to-face time and gathered topics that would feed into the online conference.

Thank you to the American Geophysical Union in Washington, DC, for the use of their meeting space, Silverchair for their significant sponsorship, and all of the other event sponsors. I ended up with seven pages of dense notes to think about, so thank you, too, to all of the participants.

H2O usability study: do students want physical casebooks? / Harvard Library Innovation Lab

This summer, one of our research assistants, Seonghee Lee, ran a study among current law students that is helping us reconsider some longstanding assumptions about student reading preferences and informing future development of the H2O Open Casebook platform.

H2O was launched in an early form in 2012, and for years we worked under the assumption that most books written with H2O would eventually be read in a print format. We have put a lot of work into improving the export experience so that professors can create a book using the H2O platform, export it as a Word document, format it as they like, and distribute it to students as a low-cost, print-on-demand book or as a printable PDF. Our expectations of reading formats began to evolve as we heard authors start to ask for multimedia options like video in their H2O casebooks, but we still heard strong feedback from many professors that they needed a print option for their students.

However, as with so many things, 2020 may have changed what we thought we knew by resetting students' expectations and preferences for their learning materials. This summer we spoke to 21 current law students who had not used H2O, and more than half told us that they prefer digital casebooks over physical texts. Cost was an obvious factor—if a digital book costs less than a physical book, they want the digital book—but many also cited their use of digital notetaking and writing tools as well as the clunkiness and inconvenience of heavy, printed books in their backpacks. Nine of those 21 students talked to us over Zoom (the others completed a survey), and once we were able to show those nine how H2O worked, they all said they could see themselves reading and annotating H2O directly on the platform.

While these conversations cut against the common wisdom about student reading preferences, they align with anecdotes I've been hearing from students. When chatting with some current students at a library event at Harvard Law School earlier this week, most told me that when a professor assigned an H2O book they read it on the H2O platform, even when the professor had created a print option.

Of course, all of these conversations put together still add up to a small number of law students, and even if it is only a minority of students who prefer physical books, we want to make sure H2O is a platform that can meet those students' learning needs as well. We will continue to support professors who want to create printable versions of their H2O books for their students.

But these early conversations about how students prefer to read and to learn are forcing us to ask new questions, too. What tools and capabilities do we owe students who are reading H2O directly on the platform? How can we work with professors to better understand their students' reading preferences? What expertise can we learn from in designing a digital reading platform that is as effective (or better!) than physical reading?

Some early answers come directly from the usability study—students were most concerned with whether a digital reading platform like H2O has embedded annotation tools they can use to mark up cases and inform the outlines they make for their classes. While many students thought they could use the annotation features already built into H2O, this feedback may point to a separate, student-centric set of annotation tools in H2O down the line. For now, we've added some improved UI to better direct readers of H2O casebooks to the annotation tools already there.

Read a summary of Seonghee's work here, and if you have ideas for us, let us know at info@opencasebook.org.

Impossibilities / David Rosenthal

I'm starting to see a series of papers, each showing that some assertion crypto-bros make about the cryptocurrency ecosystem can't be true. I wrote about the first one I noticed in Ethereum Has Issues, but I have since seen several more. Below the fold I briefly review them; I'll update this post as I see more, to maintain a chronological list of these research results.

The list so far is:
  1. Blockchain Economics by Joseph Abadi and Markus Brunnermeier (18th June 2018) introduced the Blockchain Trilemma:
    The ideal qualities of any recordkeeping system are (i) correctness, (ii) decentralization, and (iii) cost efficiency. We point out a Blockchain Trilemma: no ledger can satisfy all three properties simultaneously.
  2. Bitcoin: A Natural Oligopoly by Nick Arnosti and S. Matthew Weinberg (21 November 2018) formalizes my observation in 2014's Economies of Scale in Peer-to-Peer Networks that economies of scale drive centralization (a quick numerical check of their bounds appears after this list):
    (a) ... if miner j has costs that are (e.g.) 20% lower than those of miner i, then miner j must control at least 20% of the total mining power. (b) In the presence of economies of scale (α>1), every market participant has a market share of at least 1−1/α, implying that the market features at most α/(α−1) miners in total.
  3. Impossibility of Full Decentralization in Permissionless Blockchains by Yujin Kwon et al (1st September 2019) provides a different formalization of the idea that economies of scale drive centralization by introducing the concept of the "Sybil cost":
    the blockchain system should be able to assign a positive Sybil cost, where the Sybil cost is defined as the difference between the cost for one participant running multiple nodes and the total cost for multiple participants each running one node.
    ...
    Considering the current gap between the rich and poor, this result implies that it is almost impossible for a system without Sybil costs to achieve good decentralization. In addition, because it is yet unknown how to assign a Sybil cost without relying on a TTP [Trusted Third Party] in blockchains, it also represents that currently, a contradiction between achieving good decentralization in the consensus protocol and not relying on a TTP exists.
  4. High-frequency trading on decentralized on-chain exchanges by L. Zhou, K. Qin, C. F. Torres, D. V. Le, and A. Gervais (29th September 2020) points out a problem facing "decentralized exchanges" (DEX):
    Our work, sheds light on a dilemma facing DEXs: if the default slippage is set too low, the DEX is not scalable (i.e. only supports few trades per block), if the default slippage is too high, adversaries can profit.
  5. Responsible Vulnerability Disclosure in Cryptocurrencies by Rainer Böhme, Lisa Eckey, Tyler Moore, Neha Narula, Tim Ruffing & Aviv Zohar (October 2020) see also Rethinking Responsible Disclosure for Cryptocurrency Security by Stewart Baker (8th September 2022). Both essentially argue that it is impossible for a trustless, decentralized system to prevent the inevitable vulnerabilities being exploited before they can be fixed. My restatement of their argument is:
    • Cryptocurrencies are supposed to be decentralized and trustless.
    • Their implementations will, like all software, have vulnerabilities.
    • There will be a delay between discovery of a vulnerability and the deployment of a fix to the majority of the network nodes.
    • If, during this delay, a bad actor finds out about the vulnerability, it will be exploited.
    • Thus if the vulnerability is not to be exploited its knowledge must be restricted to trusted developers who are able to upgrade vulnerable software without revealing their true purpose (i.e. the vulnerability). This violates the goals of trustlessness and decentralization.
    Both Böhme et al and Baker provide examples of the problem in practice.
  6. Irrationality, Extortion, or Trusted Third-parties: Why it is Impossible to Buy and Sell Physical Goods Securely on the Blockchain by Amir Kafshdar Goharshady (19th October 2021) reveals a major flaw in the use of cryptocurrencies as currencies rather than as gambling chips:
    Suppose that Alice plans to buy a physical good from Bob over a programmable Blockchain. Alice does not trust Bob, so she is not willing to pay before the good is delivered off-chain. Similarly, Bob does not trust Alice, so he is not willing to deliver the good before getting paid on-chain. Moreover, they are not inclined to use the services of a trusted third-party. Traditionally, such scenarios are handled by game-theoretic escrow smart contracts, such as BitHalo. In this work, we first show that the common method for this problem suffers from a major flaw which can be exploited by Bob in order to extort Alice. We also show that, unlike the case of auctions, this flaw cannot be addressed by a commitment-scheme-based approach. We then provide a much more general result: assuming that the two sides are rational actors and the smart contract language is Turing-complete, there is no escrow smart contract that can facilitate this exchange without either relying on third parties or enabling at least one side to extort the other.
  7. The Consequences of Scalable Blockchains on Datafinnovation's blog (1st April 2022) shows that implementing an Ethereum-like system whose performance in all cases is guaranteed to be faster than any single node in the network is equivalent to solving the great unsolved problem in the theory of computation, nicknamed P vs. NP. And thus that if it were implemented, the same technique could break all current cryptography, including that underlying Ethereum:
    What we are going to do here is pretty simple:
    1. Describe some thing a scalable blockchain could do.
    2. Prove that thing is NP-Complete.
    3. Show how, if you have such a blockchain, you can right now break hash functions and public-key cryptography and constructively prove P=NP.
    If you build this thing you can break nearly all the major protocols out there — blockchains, banks, SSL, RSA, nearly everything — right now.
    NB: it appears that the first application of computer science impossibility results to cryptocurrencies was in Ethereum's DAO Wars Soft Fork is a Potential DoS Vector by Tjaden Hess, River Keefer, and Emin Gün Sirer (28th June 2016), which applied the "halting problem" to "smart contracts" when analyzing possible defenses against DOS attacks on a "soft fork" of Ethereum proposed in response to "The DAO".
  8. Sharding Is Also NP-Complete by Datafinnovation (2nd April 2022) uses the same proof technique to show that sharding is subject to the same worst-case problem as scaling a single blockchain:
    The point of this post is not that sharding is useless. Sharding certainly helps sometimes. It might even help “on average.” But this is a hard problem. This leaves us with two choices:
    1. Scalable solutions which are prone to accidents
    2. Truly reliable scalability but P=NP etc
    What do I mean by accidents? Systems that fall over when they are overloaded. Whether that is exploding block times or proper crashing or whatever else is a software engineering question rather than a math one. But something bad. Mitigation is a requirement if you want a robust system because you can’t engineer around this challenge and still have the cryptography.
  9. Positive Risk-Free Interest Rates in Decentralized Finance by Ben Charoenwong, Robert M. Kirby, and Jonathan Reiter (14th April 2022) is summarized in Impossibility of DeFi Risk-Free Rates:
    This paper explores the idea of risk-free rates in trustless DeFi systems. The main result is that it is impossible, under a clearly stated set of conditions, to generate conventional risk-free rates.
    The paper uses a model:
    representing a large class of existing decentralized consensus algorithms [to show] that a positive risk-free rate is not possible. This places strong bounds on what decentralized financial products can be built and constrains the shape of future developments in DeFi. Among other limitations, our results reveal that markets in DeFi are incomplete.
    The paper was updated on 28th August 2022.
  10. Blockchain scalability and the fragmentation of crypto by Frederic Boissay et al (7th June 2022) formalizes and extends the argument I made in Fixed Supply, Variable Demand (3rd May 2022):
    To maintain a system of decentralised consensus on a blockchain, self-interested validators need to be rewarded for recording transactions. Achieving sufficiently high rewards requires the maximum number of transactions per block to be limited. As transactions near this limit, congestion increases the cost of transactions exponentially. While congestion and the associated high fees are needed to incentivise validators, users are induced to seek out alternative chains. This leads to a system of parallel blockchains that cannot harness network effects, raising concerns about the governance and safety of the entire system.
  11. Decentralized Stablecoin Design by Ben Charoenwong, Robert M. Kirby, and Jonathan Reiter (28th August 2022) uses the halting problem and Goharshady (2021) to investigate the stability of metastablecoins that are not fully backed by fiat held by a trusted party:
    Our methodology is as follows. First, we present a product with definitions taken from an economics context. This product may or may not have some desirable property. The question of whether it has the property is then formulated as a decision problem. Then, we use the theory of computation to reduce the problem using isomorphism and show that the reduced decision problem is undecidable based on the large extant literature on computer science. Finally, we then conclude that due to the undecidability, constructing such a product which provably satisfies the desirable property is impossible.
    They are careful not to over-sell their results:
    a caveat of our results is that the theoretical results pertain to the computational feasibility and provability of the property. It does not imply that a given design will not work for some (possibly very long) period of time. It simply means that we cannot know for sure that it will.
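As promised in item 2 above, here is a quick numerical check of Arnosti and Weinberg's bounds: with economies-of-scale parameter α, every miner's market share is at least 1−1/α, so there can be at most α/(α−1) miners:

# market-share lower bound 1 - 1/alpha and maximum miner count alpha/(alpha - 1)
for alpha in (1.05, 1.1, 1.25, 1.5, 2.0):
    min_share = 1 - 1 / alpha
    max_miners = alpha / (alpha - 1)
    print(f"alpha = {alpha:<4}: each miner holds >= {min_share:.1%}, so at most {max_miners:.0f} miners")

Even mild economies of scale (α = 1.1) cap the network at eleven miners; at α = 1.5 the cap is just three.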

Cryptocurrency-enabled Crime / David Rosenthal

Robin Wigglesworth's An anatomy of crypto-enabled cyber crime points to An Anatomy of Crypto-Enabled Cybercrimes by Lin William Cong, Campbell R. Harvey, Daniel Rabetti and Zong-Yu Wu. They write in their abstract that:
Assembling a diverse set of public, proprietary, and hand-collected data including dark web conversations in Russian, we conduct the first detailed anatomy of crypto-enabled cybercrimes and highlight relevant economic issues. Our analyses reveal that a few organized ransomware gangs dominate the space and have evolved into sophisticated firm-like operations with physical offices, franchising, and affiliation programs. Their techniques also have become more aggressive over time, entailing multiple layers of extortion and reputation management. Blanket restrictions on cryptocurrency usage may prove ineffective in tackling crypto-enabled cybercrime and hinder innovations. But blockchain transparency and digital footprints enable effective forensics for tracking, monitoring, and shutting down dominant cybercriminal organizations.
Wigglesworth comments:
Perhaps. But while it is true that blockchain transparency might enable arduous but effective analysis of crypto-enabled cyber crime, reading this report it’s hard not to think that the transparency remedy is theoretical, but the costs are real.
I have argued that the more "arduous but effective analysis" results in "tracking, monitoring, and shutting down" cybercriminals, the more they will use techniques such as privacy coins (Monero, Zcash) and mixers (Tornado Cash). Indeed, back in January Alexander Culafi reported that Ransomware actors increasingly demand payment in Monero:
In one example of this, DarkSide, the gang behind last year's Colonial Pipeline attack, accepted both Monero and Bitcoin but charged more for the latter because of traceability reasons. REvil, which gained prominence for last year's supply-chain attack against Kaseya, switched to accepting only Monero in 2021.
Below the fold I discuss both Cong et al's paper, and Erin Plante's $30 Million Seized: How the Cryptocurrency Community Is Making It Difficult for North Korean Hackers To Profit, an account of Chainalysis' "arduous but effective" efforts to recover some of the loot from the Axie Infinity theft.

Cong et al argue that:
A one-size-fits-all solution, such as restricting or banning cryptocurrency usage by individuals or organizations is problematic for three major reasons. First, this is not a national problem. Blockchains exist across multiple countries and harsh regulations in a particular country or jurisdiction have little or no effect outside that country. As we have seen from other global initiatives (e.g., carbon tax proposals), it is nearly impossible to get global agreement. Second, while an important problem, cryptocurrency plays a small role in the big picture of illegal payments. Physical cash is truly anonymous and, indeed, this may account for the fact that 80.2% of the value of U.S. currency is in $100 notes. It is rare the consumers use $100 bills and it is equally rare that retailers are willing to accept them. Third, and most importantly, expunging all cryptocurrency use in a country eliminates all of the benefits of the new technology. Even further, it puts the country at a potential competitive disadvantage. For example, a ban on crypto effectively eliminates both citizens and companies from participating in web3 innovation.
I would counter:
  1. The goal of cybercrime is not to amass cryptocurrency but fiat. Doing so involves organizations such as exchanges and banks that do respond to OFAC sanctions. The goal should be to ban the on- and off-ramps, making converting large amounts of cryptocurrency into fiat extremely difficult, risky and expensive.
  2. It is true that physical cash has excellent anonymity. But experts in illegal payments, such as drug smugglers, currently prefer cryptocurrency to cash as being more secure and more portable.
  3. This is the tell. Arguments in favor of cryptocurrencies always end up touting mythical future benefits such as "web3 innovation" to distract from the very large and very real negative externalities that they impose right now on everyone outside the crypto-bros in-group.
Nevertheless, the paper is the more interesting as not being the product of cryptocurrency skeptics.

Cong et al divide the crimes they study into two groups:
In the first, hackers exploit weaknesses in either centralized organizations such as crypto-exchanges or decentralized algorithms, using this to siphon out cryptocurrency. For example, Mt. Gox, a Japanese crypto-exchange, was the victim of multiple attacks—the last one in 2014 led to loss of almost 850,000 bitcoins ($17b at the time of writing). In these types of attacks, coins are transferred to a blockchain address. Given that these transactions and addresses do not require real names, the attackers are initially anonymous. Indeed, the exploit is available for anyone to see given that the ledger of all transactions is public here. While the original exploit is completely anonymous (assuming the address has not been used before), the exploiter needs to somehow “cash out.” Every further transaction from that address is also public, allowing for potential deployment of blockchain forensics to track down the attacker.
It is the fact that purchasing real goods with cryptocurrency is almost impossible in practice, and unsafe in theory, that forces cybercriminals to "cash out" to fiat. Thus the need for regulators to crack down on on- and off-ramps.

They describe the second group thus:
Beyond stealing cryptocurrency via exchange and protocol exploits, traditional cybercriminal activities are now also enabled with a new payment channel using the new technology—the second opportunity our research focuses on. The use of cryptocurrencies replaces potentially traceable wire transfers or the traditional suitcase of cash, and is popular for extortion. Criminal organizations also use cryptocurrencies to launder money. According to Europol, criminals in Europe laundered approximately $125b in currency in 2018 and more than $5.5 billion through cryptocurrencies. The increasing cryptocurrency adoption also facilitates many other forms cybercrimes.
Again, the authors undercut their argument against regulation by acknowledging the advantages cryptocurrencies have over "the traditional suitcase of cash". Although Cong et al briefly survey these two groups, they conclude that:
As of April 2022, Ransomware leads BTC payments with (42.5%), followed by Other (45.7%), and Bitcoin Tumbler (6.9%). If Other is excluded, Ransomware dominates cybercrime-related bitcoin activity with 86.7% of the total BTC payments.
...
In light of these issues, the remainder of the article delves deeper into the economics of ransomware, the most threatening and consequential form of crypto-enabled cybercrime, to provide insights relevant for digital asset owners and investors, as well as regulatory agencies and policymakers.
Their detailed analysis of ransomware groups' business models and operations is fascinating and well worth study. But here I want to focus on their proposal for how to combat the scourge: chain analysis. They write:
While addresses are anonymous initially, funds are often transferred from one address to another in order to “cash out.” All transactions are viewable and immutable - a key feature of blockchain technology. This opens the possibility of deploying forensic tools with a focus on tracking, monitoring, and identifying the crypto transactions attributed to criminals. Indeed, our research provides a glimpse of what is possible given the transparent nature of blockchains.
Erin Plante's $30 Million Seized: How the Cryptocurrency Community Is Making It Difficult for North Korean Hackers To Profit provides more than a "glimpse of what is possible", albeit not about ransomware but about the latest fashion in cryptocurrency theft:
One of the most troubling trends in crypto crime right now is the stunning rise in funds stolen from DeFi protocols, and in particular cross-chain bridges. Much of the value stolen from DeFi protocols can be attributed to bad actors affiliated with North Korea, especially elite hacking units like Lazarus Group. We estimate that so far in 2022, North Korea-linked groups have stolen approximately $1 billion of cryptocurrency from DeFi protocols.
Plante is celebrating Chainalysis' recent success:
With the help of law enforcement and leading organizations in the cryptocurrency industry, more than $30 million worth of cryptocurrency stolen by North Korean-linked hackers has been seized. This marks the first time ever that cryptocurrency stolen by a North Korean hacking group has been seized, and we’re confident it won’t be the last.
...
The seizures represent approximately 10% of the total funds stolen from Axie Infinity (accounting for price differences between time stolen and seized), and demonstrate that it is becoming more difficult for bad actors to successfully cash out their ill-gotten crypto gains. We have proven that with the right blockchain analysis tools, world-class investigators and compliance professionals can collaborate to stop even the most sophisticated hackers and launderers.
The details are interesting but it appears that this success was enabled by regulatory action:
However, the U.S. Treasury’s Office of Foreign Assets Control (OFAC) recently sanctioned Tornado Cash for its role in laundering over $455 million worth of cryptocurrency stolen from Axie Infinity. Since then, Lazarus Group has moved away from the popular Ethereum mixer, instead leveraging DeFi services to chain hop, or switch between several different kinds of cryptocurrencies in a single transaction.
Why did OFAC sanctions cause Lazarus Group to avoid Tornado Cash? It is clearly not because they were worried that sanctions would apply to them. They worried that the exchanges they need to use to "cash out" would be penalized for accepting coins trackable to one of Tornado Cash's sanctioned wallets. The exchanges need access to the global banking system to accept and distribute fiat, and that access would be at risk if they traded with a Tornado Cash wallet. Note that this would be a "strict liability" offence, so ignorance would be no excuse.

Not wishing to rain on Chainalysis' parade, but $30M is 3% of the $1B that Chainalysis estimates North Korean groups have stolen from DeFi so far this year, and 0.3% of the running total at Molly White's Web3 is going just great. Plante notes:
Much of the funds stolen from Axie Infinity remain unspent in cryptocurrency wallets under the hackers’ control. We look forward to continuing to work with the cryptocurrency ecosystem to prevent them and other illicit actors from cashing out their funds.
There is clearly a long way to go before claiming that it is "Difficult for North Korean Hackers To Profit", let alone cyber criminals more generally. Despite all the focus on the blockchain, it is clear that the key vulnerability of cyber criminals is their need eventually to convert cryptocurrency into fiat. This was, for example, the undoing of Ilya Lichtenstein and Heather Morgan. Increasing regulation and its enforcement on the cryptocurrency on- and off-ramps is essential.

DLF Forum Featured Sponsor: Quartex by Adam Matthew Digital / Digital Library Federation

Featured post from 2022 DLF Forum sponsor, Quartex by Adam Matthew Digital. Learn more about this year’s sponsors on the Forum website.


Adam Matthew: Primary sources for teaching and research/Quartex: create digital collections

At this year’s DLF Forum, learn how you can deliver multiple initiatives with just one digital collections platform – Quartex, from primary source publisher Adam Matthew Digital.

We know return on investment often determines budget and resource allocation. One of the ways in which our flexible and scalable solution achieves this is by accommodating multiple projects, whether delivered in-house or through external collaborations.

Syracuse University (SU) Libraries adopted Quartex with its core digital collections in mind but quickly realised it could achieve much more. In collaboration with its long term partner, the Institute for Veterans and Military Families (IVMF), the libraries created a separate digital collections site as a repository for the IVMF’s content. Hear from SU visiting librarian Grace Swinnerton at session M3: Developing Students’ Digital Scholarship Skills at this year’s DLF Forum.

And SU isn’t stopping there. Neither are the Peabody Essex Museum, MA, or Harris County Public Library, TX. Both have built multiple collections sites using Quartex, for internal purposes or to showcase discrete collections of digitised content.

In doing so, these institutions have created efficiencies and engaged new audiences. More crucially, they’re delivering greater return on investment than they originally expected.

Visit our team in-person at the DLF Forum to talk through your digital initiatives – you’ll find us at the Quartex booth in the Maryland Foyer. We look forward to sharing Quartex with you and answering your questions.

quartexcollections.com

Join us in Baltimore October 9-13, 2022; October 9: Learn@DLF; October 10-12: 2022 DLF Forum, October 12-13: NDSA Digital Preservation/Digitizing Hidden Collections symposium

The post DLF Forum Featured Sponsor: Quartex by Adam Matthew Digital appeared first on DLF.

Gifts / Ed Summers

I’m guessing this has been noted before, but it’s interesting to read this from Robin Wall Kimmerer about gift economies while also thinking about open source software:

From the viewpoint of a private property economy, the “gift” is deemed to be “free” because we obtain it free of charge, at no cost. But in the gift economy, gifts are not free. The essence of the gift is that it creates a set of relationships. The currency of a gift economy is, at its root, reciprocity. In Western thinking, private land is understood to be a “bundle of rights”, whereas in a gift economy property has a “bundle of responsibilities” attached. (Kimmerer, 2013, p. 28)

It really speaks to what is missing in the Free as in Beer / Free as in Freedom dualism when it comes to thinking about the sustainability of open source software. Open source software is actually not free: not because it requires your time, but because it requires you to enter into a set of sociotechnical relations with other people, and a new set of responsibilities. When open source software projects account for these relations and responsibilities, and practice them, they become more sustainable.

References

Kimmerer, R. W. (2013). Braiding sweetgrass: indigenous wisdom, scientific knowledge and the teachings of plants (First paperback edition). Minneapolis, Minn: Milkweed Editions.

Building a new banned books exhibit for a new era / John Mark Ockerbloom

When I first created Banned Books Online over 25 years ago, I wasn’t primarily worried about book censorship. I was worried about Internet censorship.

It was 1994, and the world at large was just getting to know the Internet, which not long before had been a network mostly limited to researchers, information technologists, and students and faculty at universities. After an undergraduate at my university wrote about pornography on the network (in a paper that would become the basis of a sensationalist Time magazine cover story the following year), our administration decided it needed to censor Usenet, a system of online discussion forums that then comprised the predominant social media of the online world. Specifically, Usenet forums discussing sex had to go. One of the administrators behind the decision was an English professor who in an interview praised James Joyce’s Ulysses, a groundbreaking novel still enjoyed and studied a century after its initial publication. But that publication had been banned for years, both in the US and elsewhere, due to the novel’s discussions and allusions to sex. I thought at the time, “if the Internet goes the way you want it to go here, no one will ever be able to publish Ulysses or anything like it online going forward.” Banned Books Online started as a way to reify that thought, and to demonstrate what we would lose in a heavily censored Internet.

I’m glad to say that, while there are still free speech battles to be fought on the Internet, pervasive censorship of the open Internet to something resembling the standards of broadcast TV (which some of us feared the mid-1990s panic might lead to) didn’t happen– at least not in the United States. But other forms of censorship have had a resurgence since then. That includes book censorship, which I largely treated in my exhibit as a relic of the past. “We used to ban books frequently years ago,” I implicitly argued in the exhibit, “and maybe there are still some isolated pockets of people trying to ban books even now. But we’ve moved beyond that today, and we enjoy a richer, freer culture because of that. So let’s not repeat our past mistakes on this new Internet”. That was the implication I had in mind.

But it’s become increasingly clear that my early optimism about the waning of book bans was misplaced. Book bans and ban attempts have surged in the US in recent years, particularly in schools (as PEN America documents) and libraries (as the American Library Association documents). As PEN America’s report notes, they’re now often not just isolated local affairs, but are driven by campaigns coordinated by nationally active advocacy groups. They’re often targeting books that feature LGBT viewpoints or that bring up issues of racism and injustice. They’re backed up by prominent politicians, some of whom proudly announce their intention to “ban critical race theory” (often defined vaguely enough to effectively mean “issues that make white people uncomfortable”) or to prohibit classroom discussion of sexual orientation beyond heterosexuality, or gender identities other than masculine and feminine ones assigned at birth.

In some cases, reports of attempts to ban specific famous titles in schools and libraries can increase their sales and visibility elsewhere. (Though in some cases, booksellers have also been threatened with prosecution for putting targeted titles on open shelves.) But broad ban lists, such as ones that cover hundreds of recent titles, can prevent many lesser-known books and authors from the chance to find their audience in the first place. They can discourage publishers from acquiring or releasing such books. They can also dissuade libraries from offering them, lest they get defunded if they carry books that some people don’t like.

My priorities for a banned books exhibit have changed accordingly. I want to draw attention to books under threat now, even when they’re not old enough to be in the public domain, and don’t have an authorized free online edition I can link to. I want to help people find copies they can read, in libraries and from booksellers, and I want to encourage support for those libraries and booksellers, so they hear from people who love the books and want to read them, and not just from those who want them gone. I want to show how and why books get targeted for bans both in the past and in the present, and understand the common themes that recur in these banning attempts, and in other manifestations of authoritarianism. And while library and school bans get the most press attention in the US, I also want to ensure that people don’t forget more pervasive book censorship in American prisons, and in other countries around the world. From a more technical standpoint, I’d also like to tap into the growth of linked open metadata to connect readers with information about books of interest to them, and with libraries that offer them.

So with that in mind, I’m now developing a new exhibit, Read Banned Books. I’m opening it to public preview on Banned Books Week 2022. It’s still under development, and I’ve just started to populate its collection, but it will grow over the course of this week and in the weeks that follow. The metadata and commentary in the collection will be shared on Github, and I hope it can be reused and applied in novel ways both by my site as it develops, and by others. While I don’t plan to try to make a comprehensive data set of all banned books and banning attempts, I do hope to highlight particularly important and interesting books and incidents, and to link to broader dossiers of censorship on other sites.

I invite you to check it out, and to let me know about useful things I can add to its knowledge base and functionality. And I hope you’ll be informed and active in resisting censorship and authoritarianism in this new era.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2022 September 20 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by Jay Weitz.

Book bans and Salman Rushdie

Deborah Caldwell-Stone, who directs the American Library Association Office for Intellectual Freedom, spoke with Dr. Melissa Harris-Perry, host of the New York Public Radio station WNYC program The Takeaway on August 15, 2022. In “How Efforts to Ban Books Impact Public Libraries,” they talked about protecting libraries from book bans, how resources concerning race and sexuality have been particular targets of book banners, and the August 12 attack on Satanic Verses author Salman Rushdie at the Chautauqua Institution in Western New York. Phil Morehart, who writes frequently about such topics as part of the ALA “I Love Libraries” initiative, wrote briefly about the Takeaway interview in “How Book Bans Impact Public Libraries” and suggested ways to protect the freedom to read. By the way, the Chautauqua Institution’s Smith Memorial Library (OCLC Symbol: 5XS) is one of the 38 member libraries of the Chautauqua-Cattaraugus Library System (OCLC Symbol: VXU).

“The Futility of Information Literacy & EDI”

Former academic librarian Sofia Leung, currently “working at the intersection of academic libraries and social justice education,” writes “The Futility of Information Literacy & EDI: Toward What?” in College and Research Libraries (Volume 83, Number 5, September 2022, pages 751-764). She compares the “one-shot information literacy session” with the “one-off EDI workshop,” finding them to be similarly “tools of settler colonialism and white supremacy.” Instead, Leung wants to open us to unaccustomed ways of being and of knowing. In particular, she suggests “Indigenous conceptions of knowledge,” which center relationality, allowing us “to be in reciprocity and mutuality with one another, with our students, with the communities excluded from institutions.”

Accessibility workshop

Bloomfield Township (Michigan) Public Library (BTPL) (OCLC Symbol: EVX) will present the seventh biennial “Adaptive Umbrella: An Accessibility Workshop” on October 6, 2022, 10:00 a.m. to 3:30 p.m. Eastern. Disability rights activist Emily Ladau will talk about “Demystifying Disability: What to Know, What to Say, and How to Be an Ally;” autistic librarian Adriana Lebrón White will speak on “Lived Experiences in Literature and the Importance of Authentic Disabled Perspectives;” and Head of Youth Services at BTPL, Jen Taggart, addresses “Accessibility Collection Development 101.” There will also be a panel discussion on “Accessibility in the Workplace.” The event is free and will be recorded and made available to those who have registered for two months after the workshop.

School libraries and intellectual freedom

Artwork courtesy of the American Library Association, www.ala.org

As part of Banned Books Week 2022 (September 18 to 24), the Alabama School Library Association (ASLA) hosts its third annual webinar on intellectual freedom, September 21 at 5:00 p.m. Eastern Time. Dr. Shannon Oltmann, from the University of Kentucky School of Information Science (OCLC Symbol: LSK) in Lexington, will answer questions regarding school libraries and intellectual freedom. Dr. Oltmann wrote Practicing Intellectual Freedom in Libraries (2019) and edited The Fight against Book Bans: Perspectives from the Field (2023).

ACRL webcasts on DEIA and social justice

The Association of College and Research Libraries has made several recent webcasts available for no fee on the ACRL YouTube channel. Recorded on June 15, 2022, by ACRL’s University Library Section Professional Development Committee, “Introducing Conversations About Diversity, Equity, Inclusion, and Accessibility to Personnel at a Mid-Sized Academic Library” is an hour-long discussion among four white women librarians about how to create an environment conducive to DEIA progress. From May 16, 2022, the ACRL Education and Behavioral Sciences Section and Digital Scholarship Section session “Data Visualization for Social Justice” considers how the presentation of data can promote social justice.

Cultural Proficiencies for Racial Equity

The Joint American Library Association/Association of Research Libraries Building Cultural Proficiencies for Racial Equity Framework Task Force has released Cultural Proficiencies for Racial Equity: A Framework. It is intended to be both a theoretical and practical “guide for developing personal, organizational, institutional, and systems-level knowledge and understanding of the nature of racism and its many manifestations.” It is hoped that the framework will “provide the grounding needed to effect change in thinking, behavior, and practice that will lead to better outcomes for racialized and minoritized populations.”

Public library diversity study

The Public Library Association (PLA) has issued the second in a rotating series of three national surveys that delve into the roles, services, and resources of public libraries. Among other things, Public Library Staff and Diversity Report: Results from the 2021 PLA Annual Survey documents staff recruitment and retention efforts, EDI activities, and how staff roles are evolving. On October 4, 2022, 2:00 p.m. Eastern Time, PLA will present a free participatory webinar on the results of the survey, as well as the challenges it poses.

“Ethical and Accessible Design in Libraries”

Lyndsay Wasko, a designer, illustrator, and online MLIS student at the University of Alberta (OCLC Symbol: UAC) in Edmonton, Canada, asks us to consider the interaction between design and library studies in her essay “Ethical and Accessible Design in Libraries” in Hack Library School. “At its core, design communicates through a hierarchy of information (sounds familiar right?) and strives to make the complex more digestible and memorable. As librarianship strives to improve information-seeking, it makes sense that these disciplines could and should cooperate.”

Middle school librarian stands up to censorship

Amanda Jones, a middle school librarian in Denham Springs, Louisiana, and president of the Louisiana Association of School Librarians, “exhausted with the insults hurled at educators and librarians over LGBTQ materials,” has sued two men who have “accused her of advocating to keep ‘pornographic’ materials in the parish library’s kids’ section.” Tyler Kingkade of NBC News reports that “In rare move, school librarian fights back in court against conservative activists.” Jones addressed a meeting of the board of the Livingston Parish Library (OCLC Symbol: LVGSN) in July, condemning censorship, even when well-intentioned. One of the men being sued, Michael Lunsford, is head of an activist group, Citizens for a New Louisiana. He spoke at the same meeting and recounts it from his perspective in “Livingston Parish Library Update.” Kingkade writes about the accusations against Jones from Lunsford and the second man, Ryan Thames, who runs the Facebook page “Bayou State of Mind,” that followed the meeting.

Welcoming transgender librarians and users

The nonprofit Infopeople will present a free hour-long webinar, “Practicing Inclusion: Welcoming Transgender Customers and Colleagues,” on October 11, 2022, at 3:00 p.m. Eastern Time. In the current political climate, the rights of transgender people are being challenged everywhere. Libraries want to welcome members of the transgender community, both as users and as colleagues. Beckett Czarnecki, the Equity, Diversity, and Inclusion Project Specialist for Denver Public Library (OCLC Symbol: DPL), will present the interactive webinar within a context of reframing gender to promote inclusion.

DEI audit using the ACRL Framework

After George Floyd’s murder in 2020, the University of the Pacific (OCLC Symbol: UOP), Stockton, California, undertook a diversity, equity, and inclusion audit of its book and musical score resources using the Association of College and Research Libraries‘ 2016 Framework for Information Literacy for Higher Education as a basis. In “Student learning and engagement in a DEI collection audit: Applying the ACRL Framework for Information Literacy” (College and Research Libraries News, Volume 83, Number 8, September 2022, pages 335-340), Veronica A. Wells, Michele Gibney, and Mickel Paris write about the audit, the incorporation of the ACRL Framework into the audit process, and the impact of hiring student interns to conduct the audit.

Library accessibility

The Training Committee of the Library Accessibility Alliance (LAA) will present Stephanie Rosen, Accessibility Strategist and Librarian for Disability Studies at the University of Michigan Library (OCLC Symbol: EYM), in the webinar “Disability Access and Climate in Libraries” on September 29, 2022, at 2:00 p.m. Eastern Time. The session “will expand participants’ awareness of disability, learn to promote accessibility, and consider how to contribute to a positive climate through their beliefs, behaviors, and communications.” LAA grew out of the Library E-Resource Accessibility Group, formed in 2015 by the Big Ten Academic Alliance (BTAA) (OCLC Symbol: YNT). In 2019, BTAA and the Association of Southeastern Research Libraries (ASERL) jointly created LAA, which was expanded in 2021 by adding the Greater Western Library Alliance (GWLA) and the Washington Research Library Consortium (WRLC) (OCLC Symbol: CAO). Registration for future webinars and videos of past presentations from LAA are available on its Events page.

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2022 September 20 appeared first on Hanging Together.

White House Statement On Cryptocurrency Regulation / David Rosenthal

The White House issued a statement entitled Following the President’s Executive Order, New Reports Outline Recommendations to Protect Consumers, Investors, Businesses, Financial Stability, National Security, and the Environment describing the state of the policy development process to which I contributed twice:
The nine reports submitted to the President to date, consistent with the EO’s deadlines, reflect the input and expertise of diverse stakeholders across government, industry, academia, and civil society. Together, they articulate a clear framework for responsible digital asset development and pave the way for further action at home and abroad. The reports call on agencies to promote innovation by kickstarting private-sector research and development and helping cutting-edge U.S. firms find footholds in global markets. At the same time, they call for measures to mitigate the downside risks, like increased enforcement of existing laws and the creation of commonsense efficiency standards for cryptocurrency mining. Recognizing the potential benefits and risks of a U.S. Central Bank Digital Currency (CBDC), the reports encourage the Federal Reserve to continue its ongoing CBDC research, experimentation, and evaluation and call for the creation of a Treasury-led interagency working group to support the Federal Reserve’s efforts.
Below the fold I describe some of the details of this "framework", which unfortunately continues to use the misleading "digital asset" framing.

The framework addresses seven areas:
  1. Protecting Consumers, Investors, and Businesses. This area involves directing regulatory agencies to "aggressively pursue investigations and enforcement actions against unlawful practices" and consumer protection agencies to "monitor consumer complaints and to enforce against unfair, deceptive, or abusive practices". Alas, it fails to come down against the cryptocurrency lobbyists pushing the CFTC to be the regulator instead of the SEC.
  2. Promoting Access to Safe, Affordable Financial Services. This area recognizes the need to compete with "digital assets" by "adoption of instant payment systems, like FedNow, by supporting the development and use of innovative technologies by payment providers to increase access to instant payments, and using instant payment systems for their own transactions". It is ridiculous that I can transfer money in the UK in minutes, but in the US it takes many days so the banks can feast on the float.
  3. Fostering Financial Stability. This area directs the Treasury to "work with financial institutions to bolster their capacity to identify and mitigate cyber vulnerabilities" and work internationally to identify systemic risks. Clearly, international cooperation is needed, especially to rein in Binance.
  4. Advancing Responsible Innovation. This area is the inevitable sop to the cryptocurrency industry's peddling of the innovation meme about a system which simply replicates existing financial products without the necessary regulation.
  5. Reinforcing Our Global Financial Leadership and Competitiveness. This area encourages agencies to work internationally to increase "collaboration with—and assistance to—partner agencies in foreign countries through global enforcement bodies". But, alas, it also directs the Commerce Department to "help cutting-edge U.S. financial technology and digital asset firms find a foothold in global markets for their products". Pro tip: you can't have it both ways.
  6. Fighting Illicit Finance. This area suggests needed legislative actions, and is based upon input from:
    Treasury, DOJ/FBI, DHS, and NSF drafted risk assessments to provide the Administration with a comprehensive view of digital assets’ illicit-finance risks. The CFPB, an independent agency, also voluntarily provided information to the Administration as to risks arising from digital assets. The risks that agencies highlight include, but are not limited to, money laundering; terrorist financing; hacks that result in losses of funds; and fragilities, common practices, and fast-changing technology that may present vulnerabilities for misuse.
    Sanctioning Tornado Cash is a good start, but in the end the miscreants need exchanges to "cash out", so taking action against exchanges that accept coins tainted by Tornado Cash is the next important step.
  7. Exploring a U.S. Central Bank Digital Currency (CBDC). This area directs the Treasury to "lead an interagency working group to consider the potential implications of a U.S. CBDC, leverage cross-government technical expertise, and share information with partners". The US doesn't actually need a CBDC of the kind they're considering. A combination of FedNow and reviving postal banking (dormant since 1967) would do the trick.
Regulation of cryptocurrencies in the US is coming, albeit too slowly. Much of the progress reported here is worthy, especially considering the vast resources lobbying to defeat or water it down.

Graduate Hourly Position: Metadata Quality Investigation / Jodi Schneider

Start Date: ASAP

Descriptions, Responsibilities, and Qualifications
This project offers an excellent opportunity for a University of Illinois Urbana-Champaign MSLIS student interested in metadata, data quality, database search, information retrieval and related topics. The incumbent will collect information about how well databases track retracted information, under the mentorship of Dr. Jodi Schneider, Assistant Professor and Director of the Information Quality Lab. The project will produce data analyses and reports to support a NISO Working Group in information gathering about how to improve metadata quality and display standards for retracted publications, in the Alfred P. Sloan Foundation grant “Reducing the Inadvertent Spread of Retracted Science II: Research and Development towards the Communication of Retractions, Removals, and Expressions of Concern Recommended Practice”.

We will first search multidisciplinary databases (Scopus and Web of Science) as well as other sources (e.g., Crossref, Retraction Watch) for retracted publications. Then, we will compile a list of known retracted publications across these sources. We will compare across sources to identify retracted publications that have inconsistent information about whether or not they are retracted. We will also calculate what percentage of retracted publications indexed in the source are correctly indexed as retracted. We will then investigate how retractions are indexed in specific domain databases, using established retraction type indexing in biomedicine (PubMed, PubMed Europe) and psychology (PsycINFO), and investigating how retracted publications are indexed in chemistry (CAS SciFinder) and engineering (IEEE Xplore). We will also manually check indexing on a small dataset in search engines such as Google Scholar and Semantic Scholar.

Duties include:

  • Searching databases
  • Collating publication data
  • Deduplicating publication data
  • Documenting all aspects of the projects
  • Producing project memos and reports

Required Qualifications:

  • Enrollment in the Master’s in Library and Information Science program at the University of Illinois at Urbana-Champaign
  • Interest in topics such as metadata, data quality, database search, etc.
  • Interest in quantitative research using publications as data
  • Detail orientation
  • Excellent communication skills in written and spoken English

Preferred Qualifications:

  • Available for continued work in spring 2023
  • Project management experience
  • Experience with quantitative data
  • Experience in database searching
  • Experience manipulating data using spreadsheet software (e.g., Excel) and/or scripting languages (e.g., R or Python)
  • Interest in reproducibility and open science
  • Interest or experience in writing research reports and/or publications

Compensation: paid as a graduate hourly through the University of Illinois, $20/hour for 10-15 hours a week.

Application Procedures: Interested candidates should send a cover letter and resume as a single PDF file named Lastname-metadata-hourly.pdf to Dr. Jodi Schneider at jodi@illinois.edu.

Review of applications will begin immediately. Applications will be accepted until the position is filled. All applications received by Sunday October 2, 2022, will receive full consideration.

Posted on Handshake and on the iSchool website

The First 500 Mistakes You Will Make While Streaming on Twitch.tv / Information Technology and Libraries

Three librarians at the Mitchell Park branch of the Palo Alto City Library detail two years of lessons learned while streaming a virtual event series on Twitch.tv for the first time. This series, titled Teach a Librarian How to Play Videogames, began at the start of the COVID-19 pandemic. We hope this article will inspire you to try something new with your library events, and encourage readers to learn from these mistakes and build off our success.

An Omeka S Repository for Place- and Land-Based Teaching and Learning / Information Technology and Libraries

Our small community college library developed a learning object repository to support a cross-institutional, land-based, multidisciplinary academic initiative using the open-source platform Omeka S. Drawing on critical, feminist, and open practices, we document the relational labor, dialogue, and tensions involved with this open education project. This case study shares our experience with tools and processes that may be helpful for other small-scale open education initiatives, including user-centered iterative design, copyright education, metadata design, and user-interface development in Omeka S.

Using Machine Learning and Natural Language Processing to Analyze Library Chat Reference Transcripts / Information Technology and Libraries

The use of artificial intelligence and machine learning has rapidly become a standard technology across all industries and businesses for gaining insight and predicting the future. In recent years, the library community has begun looking at ways to improve library services by applying AI and machine learning techniques to library data. Chat reference in libraries generates a large amount of data in the form of transcripts. This study uses machine learning and natural language processing methods to analyze one academic library’s chat transcripts over a period of eight years. The resulting machine learning model classifies chat questions as either reference or non-reference questions. The goal is for the model to predict the category of incoming questions so that they can be channeled to the appropriate library departments or staff.

Navigating Uncharted Waters / Information Technology and Libraries

In 2019, the University of Houston Libraries formed a Theses and Dissertations Digitization Task Force charged with digitizing and making more widely accessible the University’s collection of over 19,800 legacy theses and dissertations. Supported by funding from the John P. McGovern Foundation, this initiative has proven complex and multifaceted, and one that has engaged the task force in a broad range of activities, from purchasing digitization equipment and software to designing a phased, multiyear plan to execute its charge. This plan is structured around digitization preparation (phase one), development of procedures and workflows (phase two), and promotion and communication to the project’s targeted audiences (phase three). The plan contains step-by-step actions to conduct an environmental scan, inventory the theses and dissertations collections, purchase equipment, craft policies, establish procedures and workflows, and develop digital preservation and communication strategies, allowing the task force to achieve effective planning, workflow automation, progress tracking, and procedures documentation. The innovative and creative approaches undertaken by the Theses and Dissertations Digitization Task Force demonstrated collective intelligence resulting in scaled access and dissemination of the University’s research and scholarship that helps to enhance the University’s impact and reputation.

Library Management Practices in the Libraries of Pakistan / Information Technology and Libraries

Library and information science has been at an infant stage in Pakistan, primarily in resource management, description, discovery, and access. The reasons are many, including the lack of interest and use of modern tools, techniques, and best practices by librarians in Pakistan. Finding a solution to these challenges requires a comprehensive study that identifies the current state of libraries in Pakistan. This paper fills this gap in the literature by reviewing the relevant literature published between 2015 and 2021 and selected through a rigorous search and selection methodology. It also analyzes the websites of 82 libraries in Pakistan through a theoretical framework based on various aspects. The findings of this study include: Libraries in Pakistan need a transition from traditional and limited solutions to more advanced information and communication technology (ICT)-enabled, user-friendly, and state-of-the-art systems to produce dynamic, consumable, and sharable knowledge space. They must adopt social semantic cataloging to bring all the stakeholders on a single platform. A libraries consortium should be developed to link users to local, multilingual, and multicultural collections for improved knowledge production, recording, sharing, acquisition, and dissemination. These findings benefit Pakistani libraries, librarians, information science professionals, and researchers in other developing countries. To the best of our knowledge, this is the first study of its kind providing insights into the current state of libraries in Pakistan through the study of their websites using a rigorous theoretical framework and in the light of the latest relevant literature.

Perceived Quality of Reference Service with WhatsApp / Information Technology and Libraries

Academic libraries are experiencing significant changes and making efforts to deliver their service in the digital environment. Libraries are transforming from being places for reading to extensions of the classroom and learning spaces. Due to the globalized digital environment and intense competition, libraries are trying to improve their service quality through various evaluations. As reference service is crucial to users, this study explores user satisfaction towards the reference service through WhatsApp, a social media instant messenger, at a major university in Hong Kong and discusses the correlation between the satisfaction rating and three variables. Suggestions and recommendations are raised for future improvements. The study also sheds light on the usage of reference services through instant messaging in other academic libraries.

Measuring Library Broadband Networks to Address Knowledge Gaps and Data Caps / Information Technology and Libraries

In this paper, we present findings from a three-year research project funded by the US Institute of Museum and Library Services that examined how advanced broadband measurement capabilities can support the infrastructure and services needed to respond to the digital demands of public library users across the US. Previous studies have identified the ongoing broadband challenges of public libraries while also highlighting the increasing digital expectations of their patrons. However, few large-scale research efforts have collected automated, longitudinal measurement data on library broadband speeds and quality of service at a local, granular level inside public libraries over time, including when buildings are closed. This research seeks to address this gap in the literature through the following research question: How can public libraries utilize broadband measurement tools to develop a better understanding of the broadband speeds and quality of service that public libraries receive? In response, quantitative measurement data were gathered from an open-source broadband measurement system that was both developed for the research and deployed at 30 public libraries across the US. Findings from our analysis of the data revealed that Ookla measurements over time can confirm when the library’s internet connection matches expected service levels and when they do not. When measurements are not consistent with expected service levels, libraries can observe the differences and correlate this with additional local information about the causes. Ongoing measurements conducted by the library enable local control and monitoring of this vital service and support critique and interrogation of the differences between internet measurement platforms. In addition, we learned that speed tests are useful for examining these trends but are only a small part of assessing an internet connection and how well it can be used for specific purposes. These findings have implications for state library agencies and federal policymakers interested in having access to data on observed versus advertised speeds and quality of service of public library broadband connections nationwide.

Stacks / Ed Summers

I’m looking forward to this meeting next week where Nathan Schneider will be talking about the limits and pitfalls of classical conceptions of open source, and how we can do better to build sustainable open source projects and software development practices. To prepare for this I read the two articles authored by Schneider that were mentioned as background material (Schneider, 2021; Schneider, 2022). I really enjoyed both, but for slightly different reasons. Even though the meeting itself is focused on the meet.coop organization, I’m hoping that it will inform some recent thinking and discussion about the sustainability of web archiving software.

The Tyranny of Openness synthesizes a ton of material related to changes in the open source landscape with regards to ethics, and covered several things I had not seen before. Seeing these developments presented together, through the lens of Standpoint Theory is really helpful I think, and reminded me of work Bergis and I did on the Ferguson Principles to help unpack the power dynamics at play when creating web archives. I like how Schneider calls on the work of Elinor Ostrom to emphasize the role of governance in healthy open source projects:

… clear and fairly enforced rules are essential for managing common resource pools. Peer producers, like anyone else, need a stable and trustworthy stage in order to freely contribute … One further dimension of agency … is the need for autonomy from external authorities and organizations–to ensure, for instance that participants can help craft their own standards of excellence, rather than simply adopting those of outside funders or norms. But crafting such standards means having processes for deliberation and decision, as well as the power to enforce what the community decides.

This is a topic that is top of mind for me as I’ve been working with Webrecorder tools lately at Stanford, and also helping here and there with bug reporting, support questions and some technical writing. As organizations are transitioning from the now abandoned OpenWayback project towards pywb there needs to be some way of structuring the work so these people can work together to support the software, while also allowing the Webrecorder project itself to survive. Frameworks like Github’s Minimal Viable Governance may be a good example to start from and adapt. Schneider also mentions the Debian Constitution which (since 1998) defines how that project collectively makes decisions.

In Governable Stacks Schneider strikes a balance between the colloquial notion of a tech stack (an assemblage of software and hardware), and Benjamin Bratton’s more holistic, sociotechnical concept of “The Stack” (Bratton, 2016) to encompass, for example:

… all that enables one to use a social media service … the server farms, the corporation that owns them, its investors, the software the servers run on, the secret algorithms that analyse one’s data, the mobile device, its accelerometer sending biometric data to the server farm, the network provider, the backdoor access for law enforcement, and so on.

Schneider draws on his experience as a member of the May First technical cooperative combined with other examples of technical and cooperative efforts to articulate three strategies for developing anti-colonial alternatives to market-driven technology: sovereignty, democracy and insurgency. While sovereignty is usually associated with the apparatus of the state, here Schneider is talking about autonomy and self-determination at various levels of the stack. By democracy Schneider means the daily practices that help ensure that technical stacks remain accountable to the people who use them. And by insurgency Schneider means putting the democratic practices and sovereignty to work in the defense of its members against efforts to dismantle the stack or disempower the individuals that use it. Grace Lee Boggs’ idea of dialectical humanism runs like a red thread through this paper, as the means by which a group of people develop political strategies, not through orthodoxy to a particular platform or ideology, but through cooperative struggle, together.

Now you might be thinking this sounds pretty abstract, but the really nice thing that Schneider does in these two articles is show how these ideas are getting expressed in many different projects already, and are part of a more general arc for open source software development over the past few decades. Sustaining open source software is a struggle, and projects are especially vulnerable when they become the target of enclosure or capture, and when they are weakened by benevolent dictators who must, eventually, cede their authority, by design or as a matter of circumstance.

This is especially the case for Webrecorder, which is really the only viable open source stack for creating archives of the web, and playing them back again. Some of the challenges it faces, in terms of replay of highly dynamic web applications, are really only tractable if approached collectively, drawing on the expertise of a group of individuals who share the same goals and ethics, and are willing to dive in and help. Making it clear how to constructively engage, without simply signing over authority to a group of national libraries and other powerful organizations who are able to pay the membership dues, is important work that remains to be done, especially in the context of tools like ArchiveWeb.page and ReplayWeb.page which are designed for individuals and collectives to use, and do not require large investments to keep online.

Look to the soon to be released report from New Design Congress for why shared ethics and governance are so important when it comes to deploying open source web archiving software.

References

Bratton, B. (2016). The stack. MIT Press.
Schneider, N. (2021). The Tyranny of openness: what happened to peer production? Feminist Media Studies, 1–18. https://doi.org/10.1080/14680777.2021.1890183
Schneider, N. (2022). Governable Stacks against Digital Colonialism. tripleC: Communication, Capitalism & Critique. Open Access Journal for a Global Sustainable Information Society, 20(1), 19–36. https://doi.org/10.31269/triplec.v20i1.1281

DLF Forum Featured Sponsor: Cayuse / Digital Library Federation

Featured post from 2022 DLF Forum sponsor, Cayuse. Learn more about this year’s sponsors on the Forum website.


Harnessing the Power of Open, Accessible, and Shareable Research

Cayuse Repository promotional image: pair of hands typing on a laptop on a table surrounded by flasks, beakers, and other scientific equipment.

Data is at the heart of research. So it stands to reason that data management is at the center of a researcher’s toolbox. Data Management Plans (DMP) relate to the management and storage of the data acquired during the research process. In essence, DMPs provide protection during and after the research project.

One of our top engineers, Taylor Mudd, recently shared some best practices for DMPs, including the benefits of migrating a repository and achieving compliance without complaints. Let’s dig into some of those insights that will enable you and your team to truly harness the power of your data.

Changing Role of Repository Data Management Plans

Data Management Plans are often required to be submitted alongside ethical approval and funding applications for research projects. They’re a time-consuming requirement, often managed in separate systems and stored in unstructured formats. Typically, they’re completed at the start of a research project and then ignored until the project ends.

However, recording a DMP in a structured “machine-actionable” format has significant benefits:

  • Improves researcher compliance
  • Minimizes administrative burden
  • Ensures easier data tracking and reporting

“The idea is to make it easy for researchers to create and update DMPs and see them as an integral part of their research with significant benefits,” said Taylor.

Challenges and Opportunities Associated with Repository Migration

“We know that migrating decades of data is a daunting prospect in itself, let alone dealing with any repercussions in terms of discoverability and search engine rankings,” said Taylor. There are, however, ways in which a modern repository can increase search engine discoverability, with improved user experience and even credibility.

Here are a few ways teams can improve discoverability:

  • Update repository registry services
  • Register sites with Google Webmaster and Google Scholar
  • Use more HTML <meta> tags than usual for better indexing by web crawlers (see the example sketched below)
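
For instance, repository item pages are often enriched with bibliographic meta tags of the kind described in Google Scholar's inclusion guidelines. The sketch below uses those widely documented tag names with placeholder values; it illustrates the general technique and is not output from Cayuse Repository or any specific platform.

<!-- Hypothetical item landing page: bibliographic tags with placeholder values -->
<meta name="citation_title" content="An Example Thesis Title">
<meta name="citation_author" content="Doe, Jane">
<meta name="citation_publication_date" content="2022/09/20">
<meta name="citation_pdf_url" content="https://repository.example.edu/files/1234/thesis.pdf">
<meta name="description" content="A short, human-readable summary for general search engines.">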

The benefits of a successful migration are considerable. One client averaged just over 2500 downloads a month pre-migration, and achieved an impressive 285% increase post-migration with over 9700 average downloads!

Compliance Without Complaint

Compliance with countless conflicting funder policies and convoluted, time-consuming repository submission processes have long been challenges for researchers. With the help of Cayuse Repository, not only can researchers meet compliance requirements more easily, but repository managers can also verify compliance in a simple and streamlined workflow.

“The Repository development team accomplished this by working with repository managers and researchers from the University of Westminster,” said Taylor. “The result of redesigning the submission process and researcher interface was a process that makes compliance easy, appealing, and—perhaps, most importantly—achievable without any additional effort on behalf of researchers.”

Cayuse Repository Streamlines Processes and Drives Engagement

Many research organizations and institutions struggle with disconnected systems and processes that make managing a data repository difficult. With the Cayuse Repository, researchers and administrators can organize, manage, and protect their research data in one streamlined location.

Some of the key benefits for the Cayuse Repository include:

  • A single repository for all research, enabling research outputs related to a project to be stored together and searched simultaneously.
  • An extensive API for publishing and receiving machine-readable structured data in open formats, as well as standards-based APIs. Our carefully designed APIs make it easy to support custom integrations.
  • High security and easy-to-use access request workflows allow users to manage restricted files with confidence.

Our research solutions are specifically designed for the research community. We are committed to empowering organizations to conduct globally connected research. We are honored to contribute to the data health and protection of our customers with the Cayuse Repository and Cayuse Research Suite as a whole.

Come Say Hello at the DLF Forum to Learn More About Cayuse Repository

We are excited to participate in this October’s DLF Forum in Baltimore, and hope to see you there! Stop by our booth to learn more about Cayuse Repository, and how our industry-leading research administration software can help your organization harness the power of open, accessible, and shareable research.

Until then, visit the links below to learn more about Cayuse Repository:

Join us in Baltimore October 9-13, 2022; October 9: Learn@DLF; October 10-12: 2022 DLF Forum, October 12-13: NDSA Digital Preservation/Digitizing Hidden Collections symposium

The post DLF Forum Featured Sponsor: Cayuse appeared first on DLF.

Web Archiving: Opportunities and Challenges of Client-Side Playback / Harvard Library Innovation Lab

Historically, the complexities of the backend infrastructures needed to play back and embed rich web archives on a website have limited how we explore, showcase and tell stories with the archived web. Client-side playback is an exciting emerging technology that lifts a lot of those restraints.

The replayweb.page software suite developed by our long-time partner Webrecorder is, to date, the most advanced client-side playback technology available, allowing for the easy embedding of rich web archive playbacks on a website without the need for a complex backend infrastructure. However, entirely delegating to the browser the responsibility of downloading, parsing, and restituting web archives also means transferring new responsibilities to the client, which comes with its own set of challenges.

In this post, we'll reflect on our experience deploying replayweb.page on Perma.cc and provide general security, performance and practical recommendations on how to embed web archives on a website using client-side playback.

Security model

Conceptually, embedding a web archive on a web page is equivalent to embedding a third-party website: the embedder has limited control over what is embedded, and the embedded content should therefore be as isolated as possible from its parent context.

Although the software replaying a web archive can attempt to prevent replayed JavaScript from escaping its context, we believe embedding should be implemented in a way that benefits as much as possible from the built-in protections the browser offers for such use cases: namely, the same-origin policy.

Embedding third-party code from a web archive can go wrong in a few ways: there could be an intentional cross-site scripting attack, where JavaScript code is added to a web archive with the intent of accessing or modifying information on the top-level document. There could be an accidental cookie rewrite, where the archive creates a new cookie overwriting one already in use by the embedding website. There could also be proxying conflicts, where a URL of the embedding website ends up being caught by the proxying system of the playback software, making it harder to reach.

Our experience so far tells us that these "context clashes" are more easily prevented by instructing the browser to isolate the archive replay as much as possible.

For that reason, although it is entirely possible—and convenient—to mix web archive content directly into a top-level HTML page, our recommendation is to use an iframe to do so, pointing at a subdomain of the embedding website.

<!-- On www.example.com -->
<iframe
  src="https://warc.example.com/replay/{id}"
  sandbox="allow-scripts allow-same-origin allow-modals allow-forms">
</iframe>

In this example, www.example.com uses an iframe to embed warc.example.com/replay/{id}, which serves an HTML document containing an instance of replayweb.page, pointing at an archive file identified by {id}.

A few reasons for that recommendation:

  • warc.example.com is a different origin: therefore the browser will greatly restrict interactions between the embedded replay and its parent, helping prevent context leaks that the playback system might not have accounted for. This should remain true even though the embedding iframe needs to allow both allow-scripts and allow-same-origin for the playback system to work properly.
  • But, it is still on the example.com domain: and the browser will therefore allow this frame to install a service worker. Service workers are subject to the same restrictions as cookies in a third-party embedding context: as such, if third-party cookies are blocked by the browser (which is becoming the default in most browsers), so are third-party service workers.

Client-side performance and caching

The transition from server-side to client-side playback also forces us to reconsider performance and caching strategies, informed by clients' network characteristics and the limitations of their browsers. The following recommendations are specific to replayweb.page, but are likely applicable, to a certain extent, to other client-side playback solutions.

By default, replayweb.page will try to store every payload it pulls from the archive into IndexedDB for caching purposes. Different browsers have different storage allowances and eviction mechanisms, and that allowance can run out after only a few archive playbacks. This is a problem we faced with Safari on Perma.cc, and recovery mechanisms proved difficult to implement efficiently.

While this caching feature is helpful to reduce bandwidth usage for returning visitors, turning it off via the noCache attribute may make sense.

There seems to be a strong enough correlation between browsers that grant limited storage allowances and browsers that do not support the StorageManager.estimate API to support the following recommendation: noCache should be added if StorageManager.estimate is either not available, or indicates that storage usage is above a certain threshold.
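A minimal sketch of that heuristic, assuming the player is injected as a <replay-web-page> element; the 80% threshold and the archive path are illustrative assumptions, not recommendations from this post:

// Decide whether to add the noCache attribute, following the heuristic above.
// The 0.8 threshold and the archive path are illustrative assumptions.
async function shouldDisableCache() {
  if (!('storage' in navigator) || !navigator.storage.estimate) {
    return true; // StorageManager.estimate is not available
  }
  const { usage = 0, quota = 0 } = await navigator.storage.estimate();
  return quota === 0 || usage / quota > 0.8;
}

shouldDisableCache().then((noCache) => {
  const replay = document.createElement('replay-web-page');
  replay.setAttribute('source', '/archives/example.wacz'); // hypothetical path
  replay.setAttribute('url', 'https://example.com/');      // hypothetical start URL
  if (noCache) {
    replay.setAttribute('noCache', '');
  }
  document.body.appendChild(replay);
});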

It should be noted that, even when using noCache, replayweb.page needs to store content indexes and other information about the archives in IndexedDB to function. As such, determining how much space should be left for that purpose is context-specific, and we are unfortunately unable to make a general recommendation on this topic.

Alternatively, always using noCache may be considered an acceptable trade-off, if bandwidth usage matters less than reliability for your use case.

Storing and serving archive files

Retrieving and parsing archive files directly within the browser means that client-side constraints now apply to this set of operations. The following recommendations focus on the use case of serving archive files over HTTP for use with replayweb.page or similar client-side playback solutions.

CORS

Replayweb.page uses the Fetch API to download archive files, which means the browser's Cross-Origin Resource Sharing (CORS) policy applies. Pointing replayweb.page's source attribute at a resource hosted on a different domain will trigger a preflight request, which will fail unless the target file is served with sufficiently permissive CORS headers (a quick way to check them from the browser is sketched after this list):

  • Access-Control-Allow-Origin should at least allow the embedder's origin.
  • Access-Control-Allow-Methods should allow HEAD and GET.
  • Access-Control-Allow-Headers should be permissive.
  • Access-Control-Expose-Headers should include headers needed for range request support, such as Content-Range and Content-Length. Content-Encoding should likely also be exposed.
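One way to sanity-check a hosting setup is to fetch a small byte range of the archive from a page served on the embedding origin and see which headers the browser lets you read. This is only a rough sketch; the URL is a placeholder, and a fetch that throws a TypeError usually means the CORS check failed:

// Rough check of CORS and range-request behaviour for a hosted archive.
// The URL is a placeholder; run this from a page on the embedding origin.
const archiveUrl = 'https://files.example.org/archives/example.wacz';

async function checkArchiveHosting(url) {
  // The Range header is not CORS-safelisted, so this typically triggers a
  // preflight request and exercises Access-Control-Allow-Headers as well.
  const res = await fetch(url, { headers: { Range: 'bytes=0-0' } });
  console.log('status:', res.status); // 206 means range requests are honoured
  // Response headers that are neither CORS-safelisted nor listed in
  // Access-Control-Expose-Headers come back as null.
  for (const name of ['Content-Range', 'Content-Length', 'Content-Encoding']) {
    console.log(name + ':', res.headers.get(name));
  }
}

checkArchiveHosting(archiveUrl).catch((err) => {
  console.error('CORS or network check failed:', err);
});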

Content-Type

Archive files generally need to explicitly state their MIME type for the player to identify them properly. We recommend setting the Content-Type header to the following values when serving archive files (see the serving sketch after this list):

  • application/x-gzip for .warc.gz files
  • application/wacz for .wacz files
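As an illustration, here is a minimal sketch of a Node/Express static server setting those Content-Type values explicitly. It is an assumption-laden example, not the configuration that warc-embed ships:

// Minimal sketch: serve archive files with explicit MIME types.
// Express's static handler also answers HTTP range requests by default,
// which matters for the next section.
const express = require('express');
const app = express();

app.use('/archives', express.static('archives', {
  setHeaders: (res, filePath) => {
    if (filePath.endsWith('.wacz')) {
      res.setHeader('Content-Type', 'application/wacz');
    } else if (filePath.endsWith('.warc.gz')) {
      res.setHeader('Content-Type', 'application/x-gzip');
    }
  },
}));

app.listen(8080);

Any static file server or object store works equally well, as long as the MIME types, CORS headers and range support are in place.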

Support for range requests and range-request caching

Replayweb.page makes extensive use of HTTP range requests to more efficiently retrieve resources from a given archive without having to download the entire file. This is especially true for wacz files, which were designed specifically for this purpose.

As a result, and although replayweb.page has a "standard" fallback mode for warc.gz files, servers hosting files for client-side playback should support range requests, or sit behind a proxy that compensates for the absence of that feature.

That shift from single whole-file HTTP requests to myriads of partial HTTP requests may have an impact on billing with certain cloud storage providers. Although this problem is likely vendor-specific, our experiments so far indicate that using a proxy-cache may be a viable option to deal with the issue.

That said, caching range requests efficiently is notoriously difficult and implementations vary widely from provider to provider. To our knowledge, for the use case of client-side web archives playback, slice-by-slice range request caching appears to be the most efficient approach.
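To make "slice-by-slice" concrete: a caching proxy can rewrite each arbitrary byte range into requests for fixed-size, aligned slices of the file, cache those slices, and assemble the client's exact range from them. A hedged sketch of the alignment step; the 1 MiB slice size is an arbitrary example:

// Given a requested byte range, compute the fixed-size slices that cover it.
// A caching proxy can fetch and cache these aligned slices independently,
// then assemble the client's exact range from them.
const SLICE_SIZE = 1024 * 1024; // 1 MiB, an arbitrary example value

function slicesForRange(start, end) {
  const slices = [];
  for (let offset = Math.floor(start / SLICE_SIZE) * SLICE_SIZE;
       offset <= end;
       offset += SLICE_SIZE) {
    slices.push({ start: offset, end: offset + SLICE_SIZE - 1 });
  }
  return slices;
}

// e.g. a request for bytes 1048000-1049000 maps onto two 1 MiB slices:
console.log(slicesForRange(1048000, 1049000));
// [ { start: 0, end: 1048575 }, { start: 1048576, end: 2097151 } ]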

Other recommendations

No playback outside of cross-origin iframes

As a way to ensure that an archive replay is not taken out of context and that it is executed in a cross-origin iframe, we recommend checking that properties of parent.window are not accessible before injecting <replay-web-page> into the document.
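A minimal sketch of that guard, for an embed page that injects <replay-web-page> itself; the source and url attribute values are placeholders:

// Only inject the player when running inside a cross-origin iframe:
// accessing a property of a cross-origin parent throws a SecurityError.
function isCrossOriginFramed() {
  if (window.parent === window) {
    return false; // not framed at all
  }
  try {
    void window.parent.document;
    return false; // the parent is readable, so it is same-origin
  } catch (err) {
    return true;
  }
}

if (isCrossOriginFramed()) {
  const replay = document.createElement('replay-web-page');
  replay.setAttribute('source', '/archives/example.wacz'); // hypothetical path
  replay.setAttribute('url', 'https://example.com/');      // hypothetical start URL
  document.body.appendChild(replay);
} else {
  document.body.textContent = 'This page must be embedded in a cross-origin iframe.';
}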

Replayweb.page and Apple Safari

What appears to be a bug in the way certain versions of Safari handle state partitioning in Web Workers spun from Service Workers in the context of cross-origin iframes may cause replayweb.page to freeze.

This problem should be fixed in Safari 16: in the meantime, we recommend using replayweb.page's noWebWorker option with problematic versions of Safari, which can be identified in JavaScript by the presence of window.GestureEvent, and the absence of window.SharedWorker.
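A sketch of that detection; the attribute spelling noWebWorker follows the wording above, so double-check it against the replayweb.page version you deploy:

// Heuristic from above: affected Safari versions expose GestureEvent
// but lack SharedWorker.
const needsNoWebWorker = 'GestureEvent' in window && !('SharedWorker' in window);

const replay = document.createElement('replay-web-page');
replay.setAttribute('source', '/archives/example.wacz'); // hypothetical path
replay.setAttribute('url', 'https://example.com/');      // hypothetical start URL
if (needsNoWebWorker) {
  replay.setAttribute('noWebWorker', ''); // attribute name as given above
}
document.body.appendChild(replay);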

Warc-embed: our experimental boilerplate

Warc-embed is LIL's experimental client-side playback integration boilerplate, which can be used to test out and explore the recommendations described in this article. It consists of: a basic web server configuration for storing, proxying, caching and serving web archive files; and a pre-configured "embed" page, serving an instance of replayweb.page aimed at a given archive file.

Source code and documentation on GitHub: https://github.com/harvard-lil/warc-embed.


These notes have been compiled as part of a new chapter exploring this technology, but the foundation of our insight was built long ago by Rebecca Cremona as she spearheaded the integration of client-side playback into Perma.cc.

Miners' Extractable Value / David Rosenthal

According to the official Ethereum website "Maximal Extractable Value" (MEV) is a feature not a bug. MEV is a consequence of the fact that it is the miners, or rather in almost all cases the mining pools, that decide which transactions, from the public mempool of pending transactions, or from a dark pool, or from the mining pool itself, will be included in the block that they mine, and in what order. The order is especially important in Turing-complete blockchains such as Ethereum, because it allows miners to front-run, back-run or sandwich transactions from elsewhere. The profit from doing so is MEV. MEV is being renamed from Miners Extractable Value to Maximal Extractable Value since it turns out that miners are not the only actors who can extract it.

Ethereum mining 11/07/21
In Ethereum, the MEV profit is enhanced because mining is dominated by a very small number of large pools; last November two pools shared a majority of the mining power. Thus there is a high probability that these pools will mine the next block and thus reap the MEV. Note that activities such as front-running are illegal in conventional finance, although high-frequency traders arguably use these techniques.

I wrote about these issues in Ethereum Has Issues, discussing Philip Daian et al's Flash Boys 2.0: Frontrunning in Decentralized Exchanges, Miner Extractable Value, and Consensus Instability and Julien Piet et al's Extracting Godl [sic] from the Salt Mines: Ethereum Miners Extracting Value, but this just scratched the surface. Below the fold I review ten more contributions.

Eskandari et al (9th April 2019)

Sok: Transparent dishonesty: front-running attacks on blockchain by S. Eskandari, S. Moosavi, and J. Clark was an early investigation of front-running in Ethereum. From their abstract:
We consider front-running to be a course of action where an entity benefits from prior access to privileged market information about upcoming transactions and trades. Front-running has been an issue in financial instrument markets since the 1970s. With the advent of the blockchain technology, front-running has resurfaced in new forms we explore here, instigated by blockchain’s decentralized and transparent nature. In this paper, we draw from a scattered body of knowledge and instances of front-running across the top 25 most active decentral applications (DApps) deployed on Ethereum blockchain. Additionally, we carry out a detailed analysis of Status.im initial coin offering (ICO) and show evidence of abnormal miner’s behavior indicative of front-running token purchases. Finally, we map the proposed solutions to front-running into useful categories.
They define Ethereum's front-running vulnerability thus:
Any user monitoring network transactions (e.g., running a full node) can see unconfirmed transactions. On the Ethereum blockchain, users have to pay for the computations in a small amount of Ether called gas. The price that users pay for transactions, gasPrice, can increase or decrease how quickly miners will execute them and include them within the blocks they mine. A profit-motivated miner who sees identical transactions with different transaction fees will prioritize the transaction that pays a higher gas price due to limited space in the blocks. ... Therefore, any regular user who runs a full-node Ethereum client can front-run pending transactions by sending adaptive transactions with a higher gas price
They divide front-running attacks into three classes:
displacement, insertion, and suppression attacks. In all three cases, Alice is trying to invoke a function on a contract that is in a particular state, and Mallory will try to invoke her own function call on the same contract in the same state before Alice.

In the first type of attack, a displacement attack, it is not important to the adversary for Alice’s function call to run after Mallory runs her function. Alice’s can be orphaned or run with no meaningful effect. Examples of displacement include: Alice trying to register a domain name and Mallory registering it first; Alice trying to submit a bug to receive a bounty and Mallory stealing it and submitting it first; and Alice trying to submit a bid in an auction and Mallory copying it.

In an insertion attack, after Mallory runs her function, the state of the contract is changed and she needs Alice’s original function to run on this modified state. For example, if Alice places a purchase order on a blockchain asset at a higher price than the best offer, Mallory will insert two transactions: she will purchase at the best offer price and then offer the same asset for sale at Alice’s slightly higher purchase price. If Alice’s transaction is then run after, Mallory will profit on the price difference without having to hold the asset.

In a suppression attack, after Mallory runs her function, she tries to delay Alice from running her function. After the delay, she is indifferent to whether Alice’s function runs or not.
They examined a range of Ethereum DApps. Perhaps the most revealing was their study of the Status.im ICO. As usual during times of high demand, the Ethereum network suffered from severe congestion:
During the time frame the ICO was open for participation, there were reports of Ethereum network being unusable and transactions were not confirming. ... there were many transactions sent with a higher gas price to front-run other transactions, however, these transactions were failing due to the restriction in the ICO smart contract to reject transactions with higher than 50 GWei gas price (as a mitigation against front-running).
Eskandari et al Fig 3
And, as usual, Ethereum mining was dominated by a few large pools; in this case Ethermine and F2Pool controlled 49% of the mining power. Eskandari et al discovered that at least one of them was abusing the system by front-running:
F2Pool—an Ethereum mining pool that had around 23% of the mining hash rate at the time —sent 100 Ether to 30 new Ethereum addresses before the Status.im ICO started. When the ICO opened, F2Pool constructed 31 transactions to the ICO smart contract from their addresses, without broadcasting the transactions to the network. They used their entire mining power to mine their own transactions and some other potentially failing high gas price transactions.
The "high gas price transactions" were added to congest the network and reduce the chance that competing transactions would be confirmed. Eskandari et al tracked the Ether from the 30 addresses to discover:
the funds deposited by F2Pool in these addresses were sent to Status.im ICO and mined by F2Pool themselves, where the dynamic ceiling algorithm refunded a portion of the deposited funds. A few days after these funds were sent back to F2Pool main address and the tokens were aggregated later in one single address.
This early example clearly shows a mining pool using their ability to insert non-public transactions to generate MEV.

Zhou et al (29th September 2020)

In High-frequency trading on decentralized on-chain exchanges by L. Zhou, K. Qin, C. F. Torres, D. V. Le, and A. Gervais:
focus on a combination of front- and back-running, known as a sandwiching, for a single onchain DEX. To the best of our knowledge, we are the first to formalize and quantify sandwich attacks.
They stress that:
While the SEC defines front-running as an action on private information, we only operate on public trade information.
They claim four contributions:
  • Formalization of sandwich attacks. We state a mathematical formalization of the AMM [Automated Market Maker] mechanism and the sandwich attack, providing an adversary with a framework to manage their portfolio of assets and maximize the profitability of the attack.
  • Analytic and empirical evaluation. We analytically and empirically evaluate sandwich attacks on AMM DEX. Besides an adversarial liquidity taker, we introduce a new class of sandwich attacks performed by an adversarial liquidity provider. We quantify the optimal adversarial revenue and validate our results on the Uniswap exchange (largest DEX, with 5M USD trading volume at the time of writing). Our empirical results show that an adversary can achieve an average daily revenue of 3,414 USD. Even without collusion with a miner, we find that, in the absence of other adversaries, the likelihood to position a transaction before or after another transaction within a blockchain block is at least 79%, using a transaction fee payment strategy of ±1 Wei.
  • Multiple Attacker Game. We simulate the sandwich attacks under multiple simultaneous attackers that follow a reactive counter-bidding strategy. We find that the presence of 2, 5 and 10 attackers respectively reduce the expected profitability for each attacker by 51.0%, 81.4% and 91.5% to 0.45, 0.17, 0.08 ETH (67, 25, 12 USD), given a victim that transacts 20 ETH to DAI on Uniswap with a transaction pending on the P2P layer for 10 seconds before being mined. If the blockchain is congested (i.e. the victim transaction remains pending for longer than the average block interval), we show that the breakeven of the attacker becomes harder to attain.
  • DEX security vs. scalability tradeoff. Our work uncovers an inherent tension between the security and scalability of an AMM DEX. If the DEX is used securely (i.e. under a low or zero price slippage), trades are likely to fail under high transaction volume; and an adversarial trader may profit otherwise.
There are a number of interesting points here. It is striking that even a small 1 Wei difference gave almost 4 in 5 chance of correct positioning. And that means that even without a miner a lone sandwich trader could have made about $1.25M/year. But, of course, an opportunity like this is rapidly competed away, especially if the victim transaction is visible for more than one block. The advantage of dark pools is clear.
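To make the sandwich mechanics concrete, here is a hedged sketch against a constant-product (x * y = k) pool of the kind Uniswap V2 uses; the pool size, trade sizes and 0.3% fee are illustrative numbers, not figures from the paper:

// Constant-product AMM (x * y = k) with a 0.3% fee, as in Uniswap V2.
// Returns the amount of the output token received for a given input.
function swapOut(amountIn, reserveIn, reserveOut, fee = 0.003) {
  const amountInWithFee = amountIn * (1 - fee);
  return (reserveOut * amountInWithFee) / (reserveIn + amountInWithFee);
}

// Illustrative pool: 10,000 ETH / 30,000,000 DAI (about 3,000 DAI per ETH).
let eth = 10_000;
let dai = 30_000_000;

function buyEthWithDai(daiIn) {
  const ethOut = swapOut(daiIn, dai, eth);
  dai += daiIn;
  eth -= ethOut;
  return ethOut;
}

function sellEthForDai(ethIn) {
  const daiOut = swapOut(ethIn, eth, dai);
  eth += ethIn;
  dai -= daiOut;
  return daiOut;
}

// 1. Front-run: the attacker buys ETH with 500,000 DAI, pushing the price up.
const attackerEth = buyEthWithDai(500_000);

// 2. The victim's pending purchase now executes at the worse price.
buyEthWithDai(1_000_000);

// 3. Back-run: the attacker immediately sells the ETH back.
const attackerDaiOut = sellEthForDai(attackerEth);

console.log('attacker profit (DAI):', (attackerDaiOut - 500_000).toFixed(2));
// Roughly +30,000 DAI with these toy numbers; the profit exists only because
// the victim's trade moved the price in between, and a real attacker must
// also cover gas, outbid competitors, and fit under the victim's slippage limit.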

The tradeoff they identify is interesting, because it is based on the inevitable "price slippage":
Price slippage is the change in the price of an asset during a trade. Expected price slippage is the expected increase or decrease in price based on the volume to be traded and the available liquidity, where the expectation is formed at the beginning of the trade. The higher the quantity to be traded, the greater the expected slippage ... Unexpected price slippage refers to any additional increase or decrease in price, over and above the expected slippage, during the interveni[ng] period from the submission of a trade commitment to its execution.
They describe the tradeoff thus:
Our work, sheds light on a dilemma facing DEXs: if the default slippage is set too low, the DEX is not scalable (i.e. only supports few trades per block), if the default slippage is too high, adversaries can profit.

Zhou et al (3rd March 2021)

In On the just-in-time discovery of profit-generating transactions in defi protocols, Liyi Zhou, Kaihua Qin, Antoine Cully, Benjamin Livshits and Arthur Gervais:
investigate two methods that allow us to automatically create profitable DeFi trades, one well-suited to arbitrage and the other applicable to more complicated settings. We first adopt the Bellman-Ford-Moore algorithm with DEFIPOSERARB and then create logical DeFi protocol models for a theorem prover in DEFIPOSER-SMT. While DEFIPOSER-ARB focuses on DeFi transactions that form a cycle and performs very well for arbitrage, DEFIPOSER-SMT can detect more complicated profitable transactions. We estimate that DEFIPOSER-ARB and DEFIPOSER-SMT can generate an average weekly revenue of 191.48 ETH (76,592 USD) and 72.44 ETH (28,976 USD) respectively, with the highest transaction revenue being 81.31 ETH (32,524 USD) and 22.40 ETH (8,960 USD) respectively.
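For the arbitrage side, the classic idea behind cycle-finding tools like DEFIPOSER-ARB is to model markets as a graph whose edge weights are negative log exchange rates, so that a negative-weight cycle corresponds to a profitable round trip, which Bellman-Ford(-Moore) can detect. A hedged sketch of that idea, with made-up rates and no claim to reproduce the paper's implementation:

// Detect a profitable trading cycle by running Bellman-Ford on a graph
// whose edge weights are -log(exchange rate). A negative cycle means the
// product of rates around the cycle exceeds 1, i.e. an arbitrage.
const edges = [
  // from, to, rate (made-up numbers; a real model would fold in fees)
  ['ETH', 'DAI', 3000],
  ['DAI', 'USDC', 1.001],
  ['USDC', 'ETH', 1 / 2990],
].map(([from, to, rate]) => ({ from, to, weight: -Math.log(rate) }));

function hasArbitrage(edges) {
  const nodes = [...new Set(edges.flatMap((e) => [e.from, e.to]))];
  // A virtual source connected to every node lets us start all distances at 0.
  const dist = Object.fromEntries(nodes.map((n) => [n, 0]));
  // Relax all edges |V| - 1 times.
  for (let i = 0; i < nodes.length - 1; i++) {
    for (const { from, to, weight } of edges) {
      if (dist[from] + weight < dist[to]) dist[to] = dist[from] + weight;
    }
  }
  // One more pass: any further relaxation implies a negative cycle.
  return edges.some(({ from, to, weight }) => dist[from] + weight < dist[to]);
}

console.log(hasArbitrage(edges)); // true, since 3000 * 1.001 / 2990 > 1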
The key to their analysis is the ability to execute trades on multiple DeFi platforms:
A peculiarity of DeFi platforms is their ability to interoperate; e.g., one may borrow a cryptocurrency asset on one platform, exchange the asset on another, and for instance, lend the resulting asset on a third system. DeFi’s composability has led to the emergence of chained trading and arbitrage opportunities throughout the tightly intertwined DeFi space. Reasoning about what this easy composition entails is not particularly simple; on one side, atomic composition allows to perform risk-free arbitrage — that is to equate asset prices on different DeFi markets. Arbitrage is a benign and important endeavor to keep markets synchronized.

On the other side, we have seen multi-million-revenue trades that cleverly use the technique of flash loans to exploit economic states in DeFi protocols ... While exploiting economic states, however, is not a security attack in the traditional sense, the practitioners’ community often frames these high-revenue trades as “hacks.” Yet, the executing trader follows the rules set forth by the deployed smart contracts. Irrespective of the framing, liquidity providers engaging with DeFi experience millions of USD in unexpected losses.
They validate their approach by showing that:
DEFIPOSER-SMT finds the known economic bZx attack from February 2020, which yields 0.48M USD. Our forensic investigations show that this opportunity existed for 69 days and could have yielded more revenue if exploited one day earlier.
The details of the attack DEFIPOSER-SMT found were:
On the 15th of February, 2020, a trader performed a pump and arbitrage attack on the margin trading platform bZx3 . The core of this trade was a pump and arbitrage involving four DeFi platforms atomically executed in one single transaction. ... this trade resulted in in 4,337.62 ETH (1,735,048 USD) loss from bZx loan providers, where the trader gained 1,193.69 ETH (477,476 USD) in total
Zhou et al Fig 7
One interesting aspect of their results is the relationship between capital employed and return:
We visualize in Figure 7 the revenue generated by DEFIPOSER-SMT and DEFIPOSER-ARB as a function of the initial capital. If a trader owns the base asset (e.g., ETH), most strategies require less than 150 ETH. Only 10 strategies require more than 100 ETH for DEFIPOSER-SMT, and only 7 strategies require more than 150 ETH for DEFIPOSER-ARB.
But this need for capital can be avoided:
This capital requirement is reduced to less than 1.00 ETH (400 USD) when using flash loans (cf. Figure 7 (b, d)).
Eric Budish showed that, to be safe, the value of transactions in a Bitcoin block must not exceed the block reward. Zhou et al study the effect of the MEV they find on Ethereum's safety:
Looking beyond the financial gains mentioned above, forks deteriorate the blockchain consensus security, as they increase the risks of double-spending and selfish mining. We explore the implications of DEFIPOSER-ARB and DEFIPOSER-SMT on blockchain consensus. Specifically, we show that the trades identified by our tools exceed the Ethereum block reward by up to 874×. Given optimal adversarial strategies provided by a Markov Decision Process (MDP), we quantify the value threshold at which a profitable transaction qualifies as Miner Extractable Value (MEV) and would incentivize MEV-aware miners to fork the blockchain. For instance, we find that on Ethereum, a miner with a hash rate of 10% would fork the blockchain if an MEV opportunity exceeds 4× the block reward.
ETH mining 09/03/22
Note that, as I write, three mining pools have more than 10% of the ETH hash rate (Ethermine 27.1, f2pool 15.5, Hiveon.net 10.2) for a total of 52.8%. The risk to Ethereum safety that Zhou et al identify is thus very real.

Torres et al (11th August 2021)

Frontrunner jones and the raiders of the dark forest: An empirical study of frontrunning on the ethereum blockchain by Christof Ferreira Torres, Ramiro Camino and Radu State:
aims to shed some light into what is known as a dark forest and uncover these predators’ actions. We present a methodology to efficiently measure the three types of frontrunning: displacement, insertion, and suppression. We perform a largescale analysis on more than 11M blocks and identify almost 200K attacks with an accumulated profit of 18.41M USD for the attackers, providing evidence that frontrunning is both, lucrative and a prevalent issue.
That is an average of about $92 per attack. Given the skewed distribution, this is surprisingly small. The authors provide a breakdown of attack cost and profit in their Figure 7. The most likely results are:
  • Displacement: cost around $10 and profit over $100.
  • Insertion: cost below $10 and profit under $100.
  • Suppression: cost around $1,000 and profit around $10,000.
Thus the typical return on investment is around 10x. Despite the relatively small dollar profit per attack, automation and the 10x RoI are enough to motivate attackers.

Torres et al observe the same DEX dilemma as Zhou et al:
Uniswap, the DEX most affected by frontrunning, is aware of the frontrunning issue and proposes a slippage tolerance parameter that defines how distant the price of a trade can be before and after execution. The higher the tolerance, the more likely the transaction will go through, but also the easier it will be for an attacker to frontrun the transaction. The lower the tolerance, the more likely the transaction will not go through, but also the more difficult it will be for an attacker to frontrun the transaction. As a result, Uniswap’s users find themselves in a dilemma.

Judmayer et al (21st September 2021)

Estimating (miner) extractable value is hard, let’s go shopping! by Aljosha Judmayer, Nicholas Stifter, Philipp Schindler, and Edgar Weippl describes:
different forms of extractable value and how they relate to each other. Furthermore, we outline a series of observations which highlight the difficulties in defining these different forms of extractable value and why a generic and thorough definition of it (and thus its precise calculation) is impossible for permissionless cryptocurrencies without assuming bounds regarding all available resources (e.g., other cryptocurrencies) that are, or could be of relevance for economically rational players. We also describe a way to estimate the minimum extractable value, measured in multiples of normalized block rewards of a reference resource, to incentivize adversarial behaviour of participants which can lead to consensus instability. In the end, we propose a peculiar yet straightforward technique for choosing the personal security parameters k regardless of extractable value opportunities.
They explain the role of k:
As the concept of MEV/BEV is tied to the economic incentives of whether or not to fork a certain block/chain [44], this question also relates to economic considerations regarding the choice of the personal security parameter k of merchants. An accurate estimation of the overall MEV/BEV value, would allow to adapt and increase k accordingly in periods of high overall MEV/BEV. The choice of the security parameter k, which determines the number of required confirmation blocks until a payment can safely be considered confirmed, has been studied in a variety of works. Rosenfeld [35] showed that, although waiting for more confirmations exponentially decreases the probability of successful attacks, no amount of confirmations will reduce the success rate of attacks to 0 in the probabilistic security model of PoW, and that there is nothing special about the often-cited figure of k = 6 confirmations.
Goharshady showed that there is inevitable risk for merchants in selling goods (or even fiat currency) for cryptocurrency. In Section 4.1 the authors describe a technique by which a merchant can choose k so as to transfer this risk to their counterparty.

Note: I could not find a specific publication date for this paper, the 21st September 2021 date is from the Wayback Machine.

Obadia et al (7th December 2021)

Unity is strength: A formalization of cross-domain maximal extractable value by Alexandre Obadia, Alejo Salles, Lakshman Sankar, Tarun Chitra, Vaibhav Chellani, and Philip Daian examines the potential for MEV across multiple interconnected blockchains:
In this work, we call each of these interconnected blockchains ‘domains’, and study the manifestation of Maximal Extractable Value (MEV, a generalization of “Miner Extractable Value”) across them. In other words, we investigate whether there exists extractable value that depends on the ordering of transactions in two or more domains jointly.
There are many types of domain:
Layer 1s, Layer 2s, side-chains, shards, centralized exchanges are all examples of domains.
The key to Cross-Domain Maximal Extractable Value is:
MEV extraction has historically been thought of as a self-contained process on a single domain, with a single actor (traditionally the miner) earning an atomic profit that serves as an implicit transaction fee. In a multi-chain future, extracting the maximum possible value from multiple domains will likely require collaboration, or collusion, of each domain’s sequencers if action across multiple domains is required to maximize profit.
...
We expect that, given the deployment of AMMs and other MEV-laden technologies across multiple domains, the benefit of extracting MEV across multiple domains will often outweigh the cost of collusion
Obadia Fig 1
And thus:
Since it exists across domains and given it is finite, the competition for such opportunities will be fierce and it is likely no bridge will be fast enough to execute a complete arbitrage transaction as exemplified in Figure 1.

One observation is that a player that already has assets across both domains does not need to bridge funds to capture this MEV profit, reducing the time, complexity, and trust required in the transaction. This means that cross-domain opportunities may be seized in two simultaneous transactions, with inventory management across many domains being internal to a player’s strategy to optimize their MEV rewards.
And this, for advocates of decentralization, is a problem because:
Such behavior is similar to the practice of inventory management that market makers and bridges in traditional finance do, which primarily consists of keeping assets scattered across multiple heterogeneous-trust domains (typically centralized exchanges), managing risks associated with these domains, and determining relative pricing. Some key differences include the ability to coordinate with actors in the system other than themselves, such as through a DAO or a system like Flashbots. However, given these conclusions, it is likely that traditional financial actors may have a knowledge-based advantage in cross-domain MEV risk management, which may induce centralization vectors that come from such actors being able to run more profitable validators.

Despite being grim, this fact is important as it reveals a key property that cross-domain interactions are subject to: the loss of composability. There is no more atomic execution. This introduces additional execution risk, as well as requires higher capital requirements, further raising the barriers-to-entry required to extract MEV.
This is not the only "negative externality" the authors identify. Another comes from the dominance of large, coin-operated miners or validators, who can afford to dominate in multiple domains at once:
Cross-domain extractable value may create an incentive for sequencers (i.e. validators in most domains) to amass votes across the networks with the most extractable value.

This is especially relevant when realizing there already exist large validators and staking providers running infrastructure across many networks. It is unlikely that those who are for-profit entities will forego access to MEV revenue long term if such revenue is substantial.
Another is the risk of consensus instability:
Time-bandit attacks were first introduced in [Daian et al], and consist in looking at cases where the miner has a direct financial incentive to re-org the chain it is mining on.

In a cross-domain setting, there may now exist incentives to re-org multiple domains, or to re-org weaker domains in order to execute attacks resembling double-spends. This will be particularly relevant to the security of bridges.
Yet another arises from my favorite issue, Economies of Scale in Peer-to-Peer Networks:
One general worry is of the potential economies of scale, or moats, that a trader could create across domains which would end up increasing the barrier to entry for new entrants and enshrine existing players’ dominance.

example. Suppose a domain orders transactions on a first-in first-out basis (FIFO). Such a domain effectively creates a latency race between traders going after the same opportunity. As seen in traditional markets, traders will likely invest in latency infrastructure in order to stay competitive, possibly reducing the efficiency of the market.

If several domains also have such ordering rules, traders could engage in a latency race in each of them, or a latency race to propagate arbitrage across domains, making the geographical points in which these systems operate targets for latency arbitrage. It is likely the infrastructure developed to optimize one domain, can be used across multiple domains. This could create a ‘super’ latency player, and certainly advantages entities which already have considerable expertise in building such systems in traditional finance. Such latency-sensitive systems may erode the security of systems that do not rely on latency, by advantaging latency-optimizing players in the cross-domain-MEV game. This area warrants substantial further study, as it is well-known that global network delays require relatively long block times in Nakamoto-style protocols to achieve security under asynchrony [Pass et al], and it is possible cross-domain-MEV may erode the fairness of validator rewards in such protocols.
And finally, a consequence of Cross-Domain Maximal Extractable Value is:
the introduction of heterogeneous pricing models across different players in the blockchain ecosystem. Previously, MEV was an unambiguous quantity denominated in a common base asset, ETH. In a multi-chain future, the relative price differences between actors is not only relevant in calculating MEV, it can in fact create MEV. For example, a previous validator may leave a system in what is in their pricing model a 0-MEV state, but the next validator, who disagrees with this pricing model, may see opportunities to rebalance MEV to increase assets it subjectively values more. This provides yet another intution for why MEV is fundamental to global, permissionless systems even if they do not suffer from ordering manipulation of the kind described in [Daian et al].
To sum up, the availability of MEV derived from executing transactions in multiple domains greatly advantages large players, especially those with expertise from traditional finance, and increases the risk of consensus instability.

Qin et al (10th December 2021)

Quantifying Blockchain Extractable Value: How dark is the forest? by Kaihua Qin, Liyi Zhou and Arthur Gervais claims three main contributions:
  • We are the first to comprehensively measure the breadth of BEV from known trading activities (i.e., sandwich attacks, liquidations, and arbitrages). Although related works have studied sandwich attacks in isolation, there is a lack of quantitative data from real-world exploitation to objectively assess their severity.
  • We are the first to propose and empirically evaluate a transaction replay algorithm, which could have resulted in 35.37M USD of BEV. Our algorithm extends the total captured BEV by 35.18M USD, while intersecting with only 1.43% of the liquidation and 0.11% of the arbitrage transactions
  • We are the first to formalize the BEV relay concept as an extension of the P2P transaction fee auction model. Contrary to the suggestions of the practitioner community, we find that a BEV relayer does not substantially reduce the P2P network overhead from competitive trading.
The authors examined trading over 32 months from the 1st of December, 2018 to the 5th of August, 2021 looking for the three types of attacks. As regards sandwich attacks they found:
2,419 Ethereum user addresses and 1,069 smart contracts performing 750,529 sandwich attacks on Uniswap V1/V2/V3, Sushiswap, and Bancor, with a total profit of 174.34M USD (cf. Fig. 3). Our heuristics do not find sandwich attacks on Curve, Swerve, and 1inch. Curve/Swerve are specialized in correlated, i.e., pegged coins with minimal slippage. Despite the small market cap. (< 1% of Bitcoin), SHIB is the most sandwich attack-prone ERC20 token with an adversarial profit of 6.84M USD.

We notice that 240,053 sandwich attacks (31.98%) are privately relayed to miners (i.e., zero gas price), accumulating a profit of 81.04M USD. Sandwich attackers therefore actively leverage BEV relay systems (cf. Section VI-A) to extract value. We also observe that 17.57% of the attacks use different accounts to issue the front- and back-running transactions.
...
Zhou et al. [2] estimate that under the optimal setting, the adversary can attack 7,793 Uniswap V1 transactions, and realize 98.15 ETH of revenue from block 8M to 9M. Based on our data, we estimate that only 63.30% (62.13 ETH) of the available extractable value was extracted.
Qin et al Fig 5
Note that the 32% of non-public sandwich attacks generated 46% of the MEV, so there is an advantage in keeping them private. Clearly, with only 63% of the potential MEV realized, there was room to optimize sandwich trading strategies. As regards liquidations, they examined:
all liquidation events on Aave (Version 1 and 2), Compound, and dYdX from their inception until block 12965000 (5th of August, 2021). We observe a total of 31,057 liquidations, yielding a collective profit of 89.18M USD over 28 months (cf. Fig. 5a and 5b). Note that we use the prices provided by the price oracles of the liquidation platforms to convert the profits to USD at the moment of the liquidation.
...
We identify 1,956 transactions (6.3%) with zero gas price out of the 31,057 liquidation events, implying that liquidators relay liquidation transactions to miners privately without using the P2P network. These privately relayed transactions yield a total profit of 10.69M USD.
The 6.3% of private transactions generated 12% of the MEV. As regards arbitrage, they write:
From the 1st of December, 2018 to the 5th of August, 2021, we identify 6,753 user addresses and 2,016 smart contracts performing 1,151,448 arbitrage trades on Uniswap V1/V2/V3, Sushiswap, Curve, Swerve, 1inch, and Bancor, amounting to a total profit of 277.02M USD. We find that 110,026 arbitrage transactions (9.6%) are privately relayed to miners, representing 82.75M USD of extracted value.
...
ETH, USDC, USDT, and DAI are involved in 99.91% of the detected arbitrages.
Note that the 9.6% of non-public transactions represent 30% of the MEV, so the advantage of keeping arbitrage transactions private is obvious. They also investigated attacks that congest the blockchain:
A clogging attack is, therefore, a malicious attempt to consume block space to prevent the timely inclusion of other transactions. To perform a clogging attack, the adversary needs to find an opportunity (e.g., a liquidation, gambling, etc.) which does not immediately allow to extract monetary value. The adversary then broadcasts transactions with high fees and computational usage to congest the pending transaction queue. Clogging attacks on Ethereum can be successful because 79% of the miners order transactions according to the gas price
...
We identify 333 clogging periods from block 6803256 to 12965000, where 10 user addresses and 75 smart contracts are involved ... While the longest clogging period lasts for 5 minutes (24 blocks), most of the clogging periods (83.18%) account for less than 2 minutes (10 blocks).
Much longer clogging attacks have happened:
This is what appears to have happened with the infamous Fomo3D game, where an adversary realized a profit of 10,469 ETH by conducting a clogging attack over 66 consecutive blocks.
Blockchain scalability and the fragmentation of crypto by Frederic Boissay et al shows that my Fixed Supply, Variable Demand was correct in suggesting that congestion was necessary to the stability of blockchains. Boissay et al write:
To maintain a system of decentralised consensus on a blockchain, self-interested validators need to be rewarded for recording transactions. Achieving sufficiently high rewards requires the maximum number of transactions per block to be limited. As transactions near this limit, congestion increases the cost of transactions exponentially. While congestion and the associated high fees are needed to incentivise validators, users are induced to seek out alternative chains. This leads to a system of parallel blockchains that cannot harness network effects, raising concerns about the governance and safety of the entire system.
Thus these systems are necessarily vulnerable to clogging attacks.

Capponi et al (11th February 2022)

In The Evolution of Blockchain: from Lit to Dark, Agostino Capponi, Ruizhe Jia and Ye Wang describe a model built to:
study the economic incentives behind the adoption of blockchain dark venues, where users’ transactions are observable only by miners on these venues. We show that miners may not fully adopt dark venues to preserve rents extracted from arbitrageurs, hence creating execution risk for users. The dark venue neither eliminates frontrunning risk nor reduces transaction costs. It strictly increases payoff of miners, weakly increases payoff of users, and weakly reduces arbitrageurs’ profits. We provide empirical support for our main implications, and show that they are economically significant. A 1% increase in the probability of being frontrun raises users’ adoption rate of the dark venue by 0.6%. Arbitrageurs’ cost-to-revenue ratio increases by a third with a dark venue.
In other words, dark pools transfer MEV mostly from external arbitrageurs to miners, with a smaller transfer to users. But the equilibrium involves continued use of the public mempool because miners enjoy the extra fees arbitrageurs use to attempt to position their trades relative to their victim's trades. Capponi et al describe a pair of tradeoffs. First, for blockchain users:
On the one hand, using the dark venue alone presents execution risk to users. Transaction submitted to the dark venue face the risk of not being observed by the miner updating the blockchain, who may not have adopted the dark venue. On the other hand, users who only submit through the dark venue avoid the risk of being frontrun.
Second, for arbitrageurs:
Arbitrageurs who only use the dark venue would not leak out information about the identified opportunity to their competitors. They also gain prioritized execution for their orders because miners on the dark venue prioritize transactions sent through such venue. ... If instead the execution risk is high, arbitrageurs will use both the lit and the dark venue: through the dark venue they gain prioritized execution, and through the lit venue they are guaranteed execution.
Their model predicts that:
both arbitrageurs and the frontrunnable user will submit their transactions through the dark venue, if sufficiently many miners adopt it.
There is a weakness in their model hiding behind "sufficiently many miners". Their model of mining consists of:
a continuum of homogeneous, rational miners. All miners have the same probability of earning the right to append a new block to the blockchain.
But this does not model the real world. Miners are not the ones who choose transactions from the lit or dark pools; the choice is made by mining pools, and a small number of large pools have a much higher "probability of earning the right to append a new block to the blockchain". Thus the use of the dark venue depends upon whether the large mining pools adopt it. Given that typically 2 or 3 pools dominate Ethereum mining, if they adopt it, dark pool usage will be high and users of the lit venue will be disadvantaged.

The authors also study dark pool usage empirically:
Our dataset contain dark venue transaction-level data of Ethereum blockchain collected from Flashbots API, Ethereum block data, and transaction-level data from Uniswap V2 and Sushiswap AMMs. Our empirical analysis confirms that the dark venue is partially adopted, and further estimates the dark venue adoption rate around 60% as of Jul 2021. Our analysis also shows that miners who join the dark venue have higher revenue than those who stay on the lit venue.
The methodology described in Section 6.3 suggests that what the authors mean is that about 60% of the hash rate has adopted the dark pools; for Ethereum this would be no more than 4 mining pools.

Elder (10th June 2022)

Bryce Elder's How should we police the trader bots? is about bots trading in the conventional markets, but the message should be heeded by the developers of cryptocurrency trading bots:
An overarching rule of securities legislation is that market abuse is market abuse, irrespective of whether it’s committed by a human or machine. What matters is behaviour. An individual or firm can expect trouble if they threaten to undermine market integrity, destabilise an order book, send a misleading signal, or commit myriad other loosely defined infractions. The mechanism is largely irrelevant.

And importantly, an algorithm that misbehaves when pitted against another firm’s manipulative or borked trading strategy is also committing market abuse. Acting dumb under pressure is no more of an alibi for a robot than it is for a human.

For that reason, trading bots need to be tested before deployment. Firms must ensure not only that they will work in all weathers, but also that they won’t be bilked by fat finger errors or popular attack strategies such as momentum ignition. The intention here is to protect against cascading failures such as the “hot potato” effect that contributed to the 2010 flash crash, where algos didn’t recognise a liquidity shortage because they were trading rapidly between themselves.

Mifid II (in force from 2018) applies a very broad Voight-Kampff test. Investment companies using European venues are obliged to ensure that any algorithm won’t contribute to disorder and will keep working effectively “in stressed market conditions”. The burden of policing falls partly on exchanges, which should be asking members to certify before every deployment or upgrade that bots are fully tested in “real market conditions”.
It is easy to mandate testing but not so easy to do it. Running the bot against market history is easy but doesn't actually prove anything, because the history of trades doesn't reflect the effect on prices of the bot's trades, and thus doesn't reflect the response of all the other bots to this bot's trades, and their effects on prices, which would have caused both this bot and all the other bots to respond, which ...

And, even if realistic testing were possible, since the DEX are "decentralized" they have no way to require that bots using them are tested, or how they are tested. So, based on the history of conventional markets, even if cryptocurrencies were not infested with pump-and-dump schemes, flash crashes and the resulting market disruptions are inevitable.

Auer et al (16th June 2022)

Auer et al Graph 1
Raphael Auer, Jon Frost and Jose Maria Vidal Pastor's Miners as intermediaries: extractable value and market manipulation in crypto and DeFi provides a readable overview of MEV:
Far from being “trustless”, cryptocurrencies and decentralised finance (DeFi) rely on intermediaries who must be incentivised to maintain the ledger of transactions. Yet each of the validators or “miners” updating the blockchain can determine which transactions are executed and when, thus affecting market prices and opening the door to front-running and other forms of market manipulation.
Auer et al explain:
MEV can hence resemble illegal front-running by brokers in traditional markets: if a miner observes a large pending transaction in the mempool that will substantially move market prices, it can add a corresponding buy or sell transaction just before this large transaction, thereby profiting from the price change (at the expense of other market participants). Miners can also engage in “back-running” or placing a transaction in a block directly after a user transaction or market-moving event. This could entail buying new tokens just after they are listed, eg in automated strategies from multiple addresses, to manipulate prices. Finally, miners can engage in “sandwich trading”, where they execute trades both before and after a user, thus making profits without having to take on any longer-term position in the underlying assets.
Auer et al Graph 2
They estimate the scale of the problem:
Since 2020, total MEV has amounted to an estimated USD 550–650 million on just the Ethereum network, according to two recent estimates (Graph 2, left-hand panel). In addition to sandwich attacks, MEV results from liquidation attacks (ie forcing liquidations), replay attacks (cloning and front-running a victim’s trade) and decentralised exchange arbitrage (right-hand panel; online appendix). Notably, these estimates are based on just the largest protocols and are hence likely to be understated. Thus, the amount of MEV captured in the data is only one portion of the total profits that miners can extract from other users.
Extracting data from this chart and assuming that the $550-650M represents "price" at the time, it appears that over the same period Ethereum miners earned about $27,616M, so MEV represented only about 2.4% additional mining income. This sounds small, but most of the time Proof-of-Work mining is a low-margin business so 2.4% extra income could well be significant. It isn't clear how much capital is needed to exploit MEV, so 2.4% needs to be decreased by the cost of the required capital.

Of course, this being Ethereum, capturing MEV has been automated:
"Bots” that exploit MEV are now active on different decentralised exchanges. This imposes a fixed cost of mining, encouraging concentration. Solutions such as moving to dark venues, where transactions are only visible to miners, have not so far reduced front-running risk (Capponi et al (2022)). The additional fees and unpredictability for users mean an additional form of insider rents in DeFi markets.
And things are only going to get worse:
Looking forward, MEV could intensify. Indeed, in general equilibrium, miners may be forced to engage in MEV to survive. Miners who engage in MEV will on average make higher profits and buy more computing power, and they could thus eventually crowd out miners who do not. Thus, a form of rat race develops from the combination of the competitive and decentralised nature of updating and the fact that every miner can assemble their block any way they want. It has been argued that MEV forms an existential risk to the integrity of the Ethereum ledger (Daian et al (2019); Obadia (2020)).

UNREJECTED / Nick Ruest

Two months ago I wrote about being denied for Twitter Academic Access. Yesterday, inexplicably, I received an email from Twitter stating that I was approved for Twitter Academic Access.

Academic Research Access Application Email

Much like the rejection, the approval is pretty opaque. I didn’t appeal, because you can’t appeal. I’m very grateful to have access, but also extremely confused. Maybe I shouldn’t be bothered by the insectoid Gregor Samsa letting me know that he approves of me, and my trial is over. But, this is a pretty big problem, and again, I can’t imagine I’m alone here. Why the rejection? Why the approval out of nowhere months later?

…a couple hours after writing the above paragraph, I see this Tweet, and the mystery is resolved!

I am extremely grateful and humbled by all the support that came out of my initial post. Thank you all so very much, and a very special thank you to daniel and Igor Brigadir!!

Grateful Ed / Ed Summers

My life has taken quite a few turns, but some days (like today) I’m overcome by a sense of how grateful I am to be able to work in the place that I live, and with friendly people who enjoy doing the same.

My desk, also known as the kitchen table.

What Legal Hackers Can Learn From Libraries / Harvard Library Innovation Lab

This is a lightly edited transcript of a talk I gave at the 2022 Legal Hackers International Summit on September 10, 2022.

Hello, everyone! I'm Jack Cushman. I'm the director of the Harvard Library Innovation Lab.

Jameson encouraged us to include a big idea in these talks. And we're here at Legal Hackers, whose mission is to work on "the most pressing issues at the intersection of law and technology."

So the big idea I wanted to bring to you as legal hackers is: the most pressing issue at the intersection of law and technology is that we don't know how to have a civilization anymore.

Larry Lessig famously said that what's at the intersection of law and technology is us: we're this pathetic dot at the middle, being regulated by law, by tech, by markets, by norms.

And the Internet has disrupted all of those! It's made all of those start to regulate us in much faster, less predictable ways. So we're now exploring what it means to be a civilization, what our options are, much faster than we ever did before, and we don't know if any of that works yet.

We don't know if we can have a civilization in the presence of the Internet yet.

What it means to have a job is changing incredibly fast right now. We can no longer assume that the same kind of jobs will exist at the end of our careers as the start of our careers.

What it means to form a consensus truth is changing incredibly fast right now.

What it means to choose a government is changing incredibly fast right now, and we don't know if it works yet.

What I want to bring to you beyond that moment of panic is to say, hey, I work at a library.

I work at a law library and I want all of you legal hackers, all of us legal hackers who are reinventing how the world works — that's what legal hacking is! — to steal more from libraries. Steal more ideas from libraries.

Ideas like, libraries are places that help us remember who we are, and they help us remember generationally. They help us remember, at a scale of decades and centuries, who we are and where we came from and where we're going. Steal that idea.

Libraries, especially public libraries, are the places of last resort where you go when you just don't know what to do next. Whether you're in a domestic violence situation or you don't know how to file your taxes or you just don't know what to read next, libraries are places with a person with an ethical commitment to help you out as best they can. It's an extraordinary resource. Let's borrow that idea.

Libraries are an essential part of the speech network that we maintain as societies. Even a tiny town will pay to have a public library, because the public library is a core part of how we form consensus truth. We need to pay attention to those networks that help tell us who we are.

Libraries are little anti-capitalist experiments! You have your economy working along in whatever way it does, and then within the walls of the library they're like, "it all works differently in here! Let's try this other thing for a while!" Whatever economy you're in, libraries are a chance to try something else to experiment and learn. They help you stabilize the change that's happening in your society by experimenting.

And libraries are places that think about citizens and not consumers or users. Libraries call you "patrons." And what we mean by patrons is sort of like citizens of your community — not citizens on a government list, but in the sense of people who are part of this community that we're trying to build, people who are part of our civic infrastructure.

That's how your library sees you.

They don't see you as a user, they don't see you as a resource to exploit. They see you as someone they can help be whatever it is you're trying to be.

We need to borrow that idea.

We need to borrow all those ideas because, after fifty years of the internet, libraries are the one information technology I know of that actually scales. Meaning, the more it grows the more it helps knit your social fabric together instead of tearing it apart. [OK, I didn't say this line in the talk, but I meant to.]

If we are to answer this pressing question of, like, "can we have civilization together anymore," now that we can all talk to each other all the time and don't know what to say — if we are to answer that, I think libraries are one of the core tools that we can use to do it.

And since I'm here from a library, I wanted to pass that along.

That was only three minutes and 45 seconds. So let me tell you very quickly a few of the things that I would love to talk with you about that we're working on at the Harvard Library Innovation Lab, and the very small part of the "saving civilization" problem that we're thinking about:

How do we collaboratively update the legal curriculum? I mean questions like, how do we teach criminal law? We have to start moving faster and including more people in that question. Tools like our Open Casebook platform can help professors collaboratively decide what to teach.

How do we make core legal data open and computable — like our Caselaw Access Project, which scanned all of the precedential legal cases in the United States. And what happens when we do, and who gets exposed, and is that good or bad or both?

How do we preserve data for the next fifty years? The internet is only fifty years old and we don't know if we can remember things from generation to generation yet. Websites break within months of posting them; they need constant maintenance. We need to make websites that last for decades. We need to make data that lasts for centuries. Let's figure out how to do that together.

We're thinking about how to get more people included in that cultural record. The question of whether you are remembered, whether you are part of that generational memory the libraries offer, has always depended on how legally precarious you are. I'm thinking of examples like the sex worker advocacy movement that responded to the SESTA-FOSTA debate, that is now at risk of being forgotten already because the platforms where the movement happened were removed by the law that the movement was about. What gets remembered in the record depends a lot on who you are, and the law has a lot to say about that, and technology does too. So we're thinking about those sorts of precarious archives that are legally in danger.

And we're thinking about, how do we help internet communities grow into civic communities?

As we move from, "my people are on Main Street, my civic life is on Main Street, my civic sustenance is on Main Street," to where my people are in a Slack group, or maybe they're a group of people I talk to on Twitter, but maybe they don't talk to each other — there's a sense of hollowness that comes from what we left behind, and haven't figured out how to bring along yet.

I get to think about that from the library perspective, because libraries are one of those core resources in a small town. I think they might be a core resource in our new civic life as well, in those Slack groups and the other ways that we build a civic society online — but libraries certainly are not the only one. What else does it take to build a government out of a pile of online communities, to build a people, a society, a civilization out of online communities?

Finally, since we are coming from a bunch of law schools, how do we involve students in this conversation? When we're teaching classes about innovation, beyond the design thinking stuff — which is really important, but it's just a tool they can use — what conversation are we trying to have with students about this saving-the-world stuff? Many of them won't just go out and work at law firms anymore, so what other perspectives should we be bringing to them?

So that's what's on my mind. Thank you so much.

Browsertrix Drivers / Ed Summers

TL;DR: if you are annoyed that your web archiving crawls aren’t collecting everything you want, and you don’t mind a bit of JavaScript, then Browsertrix Crawler “drivers” might be for you.


As part of my involvement with web archiving at Stanford I’ve had the opportunity to work with Peter Chan (Stanford’s Web Archivist) to diagnose some crawls that have proven difficult to do with Archive-It for various reasons.

One example is the Temporality website which was created by poet Stephen Ratcliffe. I actually haven’t asked Peter the full story of how this site came to be selected for archiving at Stanford. But Ratcliffe is the author of many books, the Director of the Creative Writing program at Mills College (in Oakland, CA) and was also a fellow at Stanford, so there is a strong California connection.

At first Temporality looks like a pretty vanilla Blogger website, with some minor customization, which should pose little difficulty for archiving. But on closer inspection things get interesting because Ratcliffe has been writing a post every day there since May 1, 2009. Yes, every day–sometimes the post includes a photograph too.

Given that time span there should be about 5000 posts, and perhaps a few thousand more post listing pages. But the problem is that even after 2 months of continuous crawling with Archive-It, and collecting 1,010,351 pages (90GB), the dashboard still indicated that it needed to crawl another million pages.

We took a look at the Archive-It reports and noticed the crawler had spent a lot of time fetching URLs like this:

https://www.blogger.com/navbar.g?targetBlogID=5356039418422135632&blogName=Temporality&publishMode=PUBLISH_MODE_BLOGSPOT&navbarType=LIGHT&layoutType=LAYOUTS&searchRoot=https://stephenratcliffe.blogspot.com/search&blogLocale=en&v=2&homepageUrl=https://stephenratcliffe.blogspot.com/&vt=7450728192459364499&usegapi=1&jsh=m%3B%2F_%2Fscs%2Fabc-static%2F_%2Fjs%2Fk%3Dgapi.lb.en.z9QjrzsHcOc.O%2Fd%3D1%2Frs%3DAHpOoo8359JQqZQ0dzCVJ5Ui3CZcERHEWA%2Fm%3D__features__#id=navbar-iframe&_gfid=navbar-iframe&parent=https%3A%2F%2Fstephenratcliffe.blogspot.com&pfname=&rpctoken=24676009 

These turned out to be some kind of iframe tracker that seemed to generate a unique URL for every page load. Unless these were blocked, the crawl would never complete!

Adding exclusions for these isn’t hard in Archive-It–but knowing what patterns to exclude when you are looking at millions of queued URLs can be a bit tricky.
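
One trick that helps is to sanity-check a candidate pattern against a handful of URLs copied out of the crawl report before adding the exclusion. Here is a minimal sketch you can run in a browser or Node console; the pattern and the second sample URL are just illustrative assumptions, not the exclusion we ultimately used:

// Does the candidate pattern match the tracker URLs while leaving real blog URLs alone?
const candidate = /www\.blogger\.com\/navbar\.g\?/;

const samples = [
  // truncated version of the tracker URL seen in the crawl report
  "https://www.blogger.com/navbar.g?targetBlogID=5356039418422135632&blogName=Temporality",
  // a hypothetical post URL that should be kept
  "https://stephenratcliffe.blogspot.com/2009/05/some-post.html",
];

for (const url of samples) {
  console.log(candidate.test(url) ? "exclude" : "keep", url);
}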

Unfortunately, we had expended our monthly budget on this crawl, so we decided to give Browsertrix Crawler a try using our staff workstation.

There was actually another reason why we wanted to try Browsertrix Crawler.

We noticed during playback that the calendar menu on the right wasn’t working correctly. When you selected a month, it would just display a Loading message forever.

The problem here is that the button to open a month was a <span> element with an onClick handler that triggered an AJAX call. Since it wasn’t a button or an anchor tag, crawlers wouldn’t usually click on it, and so the AJAX response would never be made part of the archive.
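
You can get a feel for the scale of the problem from the browser console. This is just a quick sketch, using the archive widget selectors that show up in the driver further down, to count the toggles a link-following crawler will never touch:

// Count the collapsed year/month entries in Blogger's archive widget.
// These toggles are plain elements with click handlers, not links or buttons,
// so a crawler that only follows <a> tags will never open them.
const collapsed = document.querySelectorAll("#ArchiveList li.collapsed");
console.log(`${collapsed.length} collapsed archive entries a crawler would skip`);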

Browsertrix Crawler does have a feature called behaviors, which lets you trigger custom crawler actions for popular social media platforms like Twitter, Instagram, and Facebook. There isn’t a behavior for Blogger, so I started out by creating one (more about this in a future post).

In the process of writing a unit test for my new Blogger behavior I discovered that Blogger can be customized quite a bit. The calendar menu present on Temporality was just one of many possible ways to configure your archive links. So a general purpose behavior probably wasn’t the right way to approach this problem.

Browsertrix Crawler offers another option called drivers, which inject crawl-specific behavior directly into the crawl. This was actually quite a bit simpler than developing a plugin, because it was easier to ensure the behavior ran only once during the entire crawl, rather than for each page (which is how Browsertrix Behaviors usually work).

Like the site-specific behavior plugins, drivers are a bit of JavaScript that Browsertrix Crawler will load and call when it starts up. It passes in data, which includes information about the crawl; a page, which is a Chromium browser page managed by Puppeteer; and the Browsertrix crawler object.

In my case, all I had to do was create a new function, openCalendar(), which clicks all the links that trigger the AJAX calls and then continues with the crawl by calling crawler.loadPage(). Note the use of the openedCalendar flag to track whether the calendar links have already been opened.

let openedCalendar = false;

async function openCalendar(page) {

  // the first browser window crawls the calendar
  if (openedCalendar) return;
  openedCalendar = true;

  // Go to the Temporality homepage
  await page.goto("https://stephenratcliffe.blogspot.com/")

  // This is evaluated in the context of the browser page with access to the DOM
  await page.evaluate(async () => {
  
    // A helper function for sleeping
    function sleep(timeout) {
      return new Promise((resolve) => setTimeout(resolve, timeout));
    }

    // Get each element in the list of years
    for (const year of document.querySelectorAll("#ArchiveList > div > ul.hierarchy > li")) {

      // Expand each year
      const yearToggle = year.children[0];
      if (year.classList.contains("collapsed")) {
        yearToggle.click();
      }

      // Expand each month within each year
      for (const month of year.querySelectorAll("ul.hierarchy > li.collapsed")) {
        const monthToggle = month.children[0];
        await monthToggle.click();
        await sleep(250);
      }
    }
  });

}

module.exports = async ({data, page, crawler}) => {
  await openCalendar(page);
  await crawler.loadPage(page, data);
};

I’ve found that running Browsertrix Crawler is easiest (and most repeatable) when you put all the options in a YAML configuration file and then start up Browsertrix Crawler pointed at it:

docker run -p 9037:9037 -it -v $PWD:/crawls/ webrecorder/browsertrix-crawler crawl --config /crawls/config/temporality.yaml

You can keep the YAML file as a record of how the crawl was configured, you can use it to rerun the crawl, and you can copy the config into new files and edit them to set up similar crawls of different sites.

collection: stephenratcliffe
workers: 4
generateWACZ: true
screencastPort: 9037
driver: /crawls/drivers/stephenratcliffe.js
seeds:
  - url: https://stephenratcliffe.blogspot.com
    scopeType: host
    exclude:
      - stephenratcliffe\.blogspot\.com/navbar.g.*

The -p 9037:9037 in the long docker command above lets you open http://localhost:9037 in your browser and watch the progress of your crawl as the 4 browser windows (one per configured worker) do their work. This is surprisingly useful for diagnosing when things appear to be stuck.

I wanted to write this up mostly to encourage others to fine-tune their crawl behaviors using drivers to meet the specific needs of the websites they are trying to archive. While there are aspects of web archiving that can be generalized, the web is a platform for distributing client-side applications, so there is a real limit on what can be done generically, and a real need to be able to control aspects of the crawl.

I also wanted to practice writing about this topic because I want to help think about the maintenance and further development of Browsertrix Behaviors, which are so important to the Webrecorder approach to web archiving.


PS. The Browsertrix crawl ran for 2.6 hours, collected 9,994 pages, and generated 1.4 GB of WARC data. An improvement!

PPS. Karen has some other good examples of where drivers can be helpful!

2022 Julie Allinson Award Recipient / Samvera

It is our great pleasure to announce that the recipient of the 2022 Julie Allinson Award is tamsin johnson from the University of California, Santa Barbara.

tamsin was jointly nominated by Julie Hardesty (Indiana University), Jessica Hilt (University of California, San Diego), and Chrissy Rissmeyer (University of California, Santa Barbara). In their nomination letter, they nominated tamsin for the award “in recognition of their outstanding technical leadership, mentorship, and tireless efforts in addressing and raising awareness about issues of equity, diversity, and inclusion in our community.” 

They also highlighted that tamsin is “generous with their knowledge, always willing to pair with new and existing members of the Samvera Community. They are patient with new contributors, encouraging members to take tickets that are outside their comfort zone with the offer of support and education throughout the process.” 

The letter ends by noting that “the efforts of one person can truly make a difference. Julie Allinson was one such individual and tamsin johnson is as well.”

Congratulations tamsin, and thank you for your outstanding and continuing contributions to the Samvera Community!

The post 2022 Julie Allinson Award Recipient appeared first on Samvera.

Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2022 September 6 / HangingTogether

The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by Jay Weitz.

Access to information on reproductive health

The American Library Association Executive Board responded to efforts to limit information about reproductive health. Its 2022 August 9 statement, "American Library Association (ALA) Condemns Proposed State Legislation Limiting Access to Information on Reproductive Health," says in part, "As members of a profession committed to free and equitable access to information and the pursuit of knowledge, we stand firm in opposing any effort to suppress access to information about reproductive health, including abortion, whether for medical purposes or as a matter of public concern and individual liberty." Specific guidance for libraries and library workers beyond the many ALA resources already available is under development. "ALA calls on elected officials and policymakers to honor their oaths of office and protect the First Amendment rights and privacy of the people whom they are entrusted to serve," the document continues, calling upon advocates to Unite Against Book Bans and stand up to all forms of censorship.

Ensuring that people experiencing homelessness can vote

Julie Ann Winkelstein, who co-edits the SRRT Newsletter of ALA’s Social Responsibilities Round Table, calls attention to Every One Votes: A Toolkit to Ensure People Experiencing Homelessness Can Exercise Their Right to Vote from the nonprofit National Alliance to End Homelessness. The free online resource emphasizes how vitally important it is to get out the vote by making sure that every eligible citizen can get registered and exercise this most fundamental of rights.

Michigan library millage defeated over LGBTQ books

In the Michigan election on 2022 August 2, voters in the state’s Jamestown Township voted by a margin of 25 points to reject the proposed renewal of a millage to support the Patmos Library (OCLC Symbol: MIPAT), raising the possibility that it would have to close during 2023, according to the nonprofit and nonpartisan online Bridge Michigan in “Upset over LGBTQ books, a Michigan town defunds its library in tax vote.” The local group Jamestown Conservatives — describing itself on its Facebook page as “created to help others of the community to be aware of the pushed agenda of explicit sexual content that is being infiltrated into our local libraries aiming toward our children. We stand to keep our children safe, and protect their purity, as well as to keep the nuclear family intact as God designed” — campaigned against the millage via flyers, yard signs, and attendance at library board meetings. Ironically, the library’s community room hosted one of the township’s three voting precincts in the August 2nd election. As reported in the Michigan Bridge on August 11, over $100,000 had been raised in two GoFundMe counter campaigns to save the library.

Community resilience through truth and equity

The Center for Community Resilience (CCR) of the George Washington University (OCLC Symbol: DGW) Milken Institute School of Public Health has been compiling information intended to build community resilience through countering what they call the “Pair of ACEs” — Adverse Childhood Experiences and Adverse Community Environments. CCR’s “Fostering Equity: Creating Shared Understanding for Building Community Resilience” brings together a webinar, four equity-based learning modules, case studies, and additional resources for community engagement in developing equity and promoting truth.

“Afternoon of Social Justice”

Recordings of the ALA Social Responsibilities Round Table (SRRT) "Afternoon of Social Justice," held virtually on 2022 August 3, are freely available. In "Paying Better Attention to Indigenous Communities," Karleen Delaurier-Lyle spoke about service to Indigenous students at Canada's University of British Columbia (OCLC Symbol: UBC) Xwi7xwa Library and Kael Moffat of Saint Martin's University (OCLC Symbol: WSL) of Lacey, Washington, USA, considered means by which libraries that are based on Western ways of knowing and sit on occupied lands might begin to "desettle" and "hear tribal voices." "Neurodiversity in the Library" featured Kate Thompson and Rachel Bussan, both of West Des Moines Public Library (OCLC Symbol: IW9), talking about improving the library environment for users, staff, and potential employees. The originally planned third session, "Diversity is Not a Bad Word," is expected to be rescheduled at a later date.

“Critical Conversations in LIS”

During September and October, the University of South Carolina (OCLC Symbol: SUC) School of Library and Information Science will present five "Critical Conversations in LIS" guest lectures as part of its course on "Critical Cultural Information Studies":

Each lecture is free and open to the whole LIS community. The lectures will be presented online and will also be recorded for later viewing.

Academic libraries helping first-generation students

As part of its Carterette Series Webinars, the Georgia Library Association will present “Creating an Inclusive Learning Environment: The Academic Library’s Role in Helping First-Generation College Students Succeed” on September 21, 2022, 2 p.m. Eastern. Liya Deng, Social Sciences Librarian and Associate Professor at Eastern Washington University (OCLC Symbol: WEA), will address the expansion of library services to help ensure that the diverse needs of first-generation students are met, encouraging them to overcome the cultural, social, communication, and economic hurdles they may face.

Ethical cataloging

The current issue (Volume 60, Number 5, 2022) of Cataloging and Classification Quarterly contains several articles that touch upon DEI topics. Treshani Perera of the University of Kentucky (OCLC Symbol: KUK) researches how lived experience may inform descriptive inclusivity and reflect the professional values that librarians claim to work by in “Description Specialists and Inclusive Description Work and/or Initiatives—An Exploratory Study” (pages 355-386). Paromita Biswas of the University of California, Los Angeles (OCLC Symbol: CLU) considers the role that the Program for Cooperative Cataloging (PCC) CONSER program might play in more ethical authority work for corporate entities in her opinion piece “Can CONSER Lead the Way? Considering Ethical Implications for Corporate Bodies in Name Authority Records” (pages 387-399).

“Weeding Out Oppression in Libraries”

Both the fifth and sixth episodes of the podcast Overdue: Weeding Out Oppression in Libraries from the Oregon Library Association (OLA) Diversity, Equity, Inclusion, and Antiracism Committee are now available. Episode 5 features American Library Association (OCLC Symbol: IEH) Executive Director Tracie D. Hall in discussion about diversifying library staff with Ericka Brunson-Rochette, Community Librarian at the Deschutes Public Library (OCLC Symbol: DCH) in Oregon, and Melissa Anderson, Campus Engagement and Research Services Librarian at Southern Oregon University (OCLC Symbol: SOS), in “Mentoring and Developing the Profession.” In Episode 6, “How Bias, Power, and Privilege Show Up In Libraries,” consultant Christina Fuller-Gregory, who facilitates the Libraries of Eastern Oregon EDI Cohort and serves as assistant director of libraries at the South Carolina Governor’s School for the Arts and Humanities, talks with hosts Brittany Young of the Lane County Law Library in Eugene, Oregon, and Roxanne M. Renteria of the Deschutes Public Library (OCLC Symbol: DCH), on dealing with inequities and bias in the workplace.

The post Advancing IDEAs: Inclusion, Diversity, Equity, Accessibility, 2022 September 6 appeared first on Hanging Together.

NDSA Interest Group Meetings Open to All / Digital Library Federation

NDSA Interest Groups meet quarterly on a rotating schedule. Meetings are held via Zoom and are open to all. You are invited to attend to learn more about digital preservation, to meet colleagues in other organizations, and to keep up to date on NDSA. Meeting registration is not used, so you are free to drop in as your schedule permits. The three interest groups focus on the following areas: Infrastructure, Standards and Practices, and Content. Please visit the group webpages to learn more about each interest group, including links to meeting agendas and notes, which include Zoom meeting information.

Interest group meetings for the remainder of 2022

September 29, 3 pm Eastern time

  • Infrastructure
  • Co-chairs: Robin Ruggaber and Eric Lopatin
  • https://ndsa.org/groups/infrastructure/
  • In September, the Infrastructure Interest Group will shift its meeting format to a reading group approach. With our Infrastructure hats on, we intend to discuss two documents that have recently been introduced through different means to the larger digital preservation community. Between now and the end of September, please take some time to review:
    • The Digital Preservation Declaration of Shared Values (version 3 draft) put forth by the Digital Preservation Services Collaborative, and
    • Preservica’s Charter for Long-term Digital Preservation Sustainability

October 17, 1 pm Eastern time

  • Standards and Practices
  • Co-chairs: Felicity Dykas and Ann Hanlon
  • https://ndsa.org/groups/standards-and-practices/
  • Note that the October meeting date was changed to avoid a conflict with the DLF Forum.
  • Guest speakers Amy Currie and Sharon McMeekin from the Digital Preservation Coalition will give an overview of the newly released DPC Digital Preservation Competency Framework.

November 2, 1 pm Eastern time

December 19, 3 pm Eastern time

There is much happening with digital preservation infrastructure, content, and standards and practices. The more people at the table, the better our digital preservation efforts will be. Please join us!

The post NDSA Interest Group Meetings Open to All appeared first on DLF.

DLF Digest: September 2022 / Digital Library Federation

DLF Digest

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation.


This month’s news:


This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Can’t find meeting call-in information? Email us at info@diglib.org


DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member institution. Learn more about our working groups and how to get involved on the DLF website. Interested in starting a new working group or reviving an older one? Need to schedule an upcoming working group call? Check out the DLF Organizer’s Toolkit to learn more about how Team DLF supports our working groups, and send us a message at info@diglib.org to let us know how we can help. 

The post DLF Digest: September 2022 appeared first on DLF.