Planet Code4Lib

Searching Project Gutenberg at the Distant Reader / Eric Lease Morgan

The venerable Project Gutenberg is a collection of about 60,000 transcribed editions of classic literature in the public domain, mostly from the Western canon. A subset of about 30,000 Project Gutenberg items has been cached locally, indexed, and made available through a website called the Distant Reader. The index is freely available for anybody, anywhere, to use. This blog posting describes how to query the index.

The index is rooted in a technology called Solr, a very popular indexing tool. The index supports simple searching, phrase searching, wildcard searches, fielded searching, Boolean logic, and nested queries. Each of these techniques is described below:

  • simple searches – Enter any words you desire, and you will most likely get results. In this regard, it is difficult to break the search engine.
  • phrase searches – Enclose query terms in double-quote marks to search the query as a phrase. Examples include: "tom sawyer", "little country schoolhouse", and "medieval europe".
  • wildcard searches – Append an asterisk (*) to any non-phrase query to perform a stemming operation on the given query. For example, the query potato* will return results including the words potato and potatoes.
  • fielded searches – The index has many different fields. The most important include: author, title, subject, and classification. To limit a query to a specific field, prefix the query with the name of the field and a colon (:). Examples include: title:mississippi, author:plato, or subject:knowledge.
  • Boolean logic – Queries can be combined with three Boolean operators: 1) AND, 2) OR, or 3) NOT. The use of AND creates the intersection of two queries. The use of OR creates the union of two queries. The use of NOT creates the negation of the second query. The Boolean operators are case-sensitive. Examples include: love AND author:plato, love OR affection, and love NOT war.
  • nested queries – Boolean logic queries can be nested to return more sophisticated sets of items; nesting allows you to override the way rudimentary Boolean operations get combined. Use matching parentheses (()) to create nested queries. An example includes (love NOT war) AND (justice AND honor) AND (classification:BX OR subject:"spiritual life"). Of all the different types of queries, nested queries will probably give you the most grief.
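Putting these techniques together, queries can be submitted to the underlying Solr index over HTTP. The sketch below (in Python, using a hypothetical endpoint URL; the Distant Reader's actual address and core name may differ) builds a properly escaped select URL for a nested query:

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the Distant Reader's actual Solr URL.
SOLR_URL = "http://example.org/solr/gutenberg/select"

def build_query(q, rows=10, fields=("author", "title", "subject")):
    """Return a Solr select URL for the given query string."""
    params = {
        "q": q,                  # the query, using any of the techniques above
        "rows": rows,            # number of results to return
        "fl": ",".join(fields),  # fields to include in each result
        "wt": "json",            # response format
    }
    return SOLR_URL + "?" + urlencode(params)

# A nested Boolean query, as described above
url = build_query('(love NOT war) AND (classification:BX OR subject:"spiritual life")')
```

Fetching that URL with any HTTP client returns a JSON document whose `response.docs` array holds the matching records.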

Because this index is a full-text index covering a wide variety of topics, you will probably need to exploit the query language to create truly meaningful results.

Stablecoins Part 2 / David Rosenthal

I wrote Stablecoins about Tether and its "magic money pump" seven months ago. A lot has happened and a lot has been written about it since, and some of it explores aspects I didn't understand at the time, so below the fold at some length I try to catch up.

In the postscript to Stablecoins I quoted David Gerard's account of the December 16th pump that pushed BTC over $20K:
We saw about 300 million Tethers being lined up on Binance and Huobi in the week previously. These were then deployed en masse.

You can see the pump starting at 13:38 UTC on 16 December. BTC was $20,420.00 on Coinbase at 13:45 UTC. Notice the very long candles, as bots set to sell at $20,000 sell directly into the pump.
In 2020 BTC had dropped from around $7.9K on March 10th to under $5K on March 11th. It spiked back up on March 18th, then gradually rose to just under $11K by October 10th.

During that time Tether issuance went from $4.7B to $15.7B, an increase of over 230% with large jumps on four occasions:
  1. March 28-29th: $1.6B = $4.6B to $6.2B (a weekend)
  2. May 12-13th $2.4B = $6.4B to $8.8B
  3. July 20th-21st $0.8B = $9.2B to $10B (a weekend)
  4. August 19-20th $3.4B = $10B to $13.4B (a weekend)
Then both BTC and USDT really took off, with BTC peaking April 13th at $64.9K and USDT issuance exceeding $30B. BTC then started falling. Tether continued to issue USDT, peaking 55 days later on May 30th, after nearly another $16B, at $61.8B. Issuance then slowed dramatically, peaking 19 days later on June 18th at $62.7B, when BTC had dropped to $35.8K, 55% of the peak. Since then USDT has faced gradual redemptions; it is now down to $61.8B.

What on earth is going on? How could USDT go from around $6B to around $60B in just over a year?


In Crypto and the infinite ladder: what if Tether is fake?, the first of a two-part series, Fais Kahn asks the same question:
Tether (USDT) is the most used cryptocurrency in the world, reaching volumes significantly higher than Bitcoin. Each coin is supposed to be backed by $1, making it “stable.” And yet no one knows if this is true.

Even more odd: in the last year, USDT has exploded in size even faster than Bitcoin - going from $6B in market cap to over $60B in less than a year. This includes $40B of new supply - a straight line up - after the New York Attorney General accused Tether of fraud.
I and many others have considered a scenario in which the admitted fact that USDT is not backed 1-for-1 by USD causes a "run on the bank". Among the latest is Taming Wildcat Stablecoins by Gary Gorton and Jeffery Zhang. Zhang is one of the Federal Reserve's attorneys, but who is Gary Gorton? Izabella Kaminska explains:
Over the course of his career, Gary Gorton has gained a reputation for being something of an experts’ expert on financial systems. Despite being an academic, this is in large part due to what might be described as his practitioner’s take on many key issues.

The Yale School of Management professor is, for example, best known for a highly respected (albeit still relatively obscure) theory about the role played in bank runs by information-sensitive assets.
The two authors make the implicit about stablecoins explicit: however you slice them, dice them or frame them in new technology, in the grand scheme of financial innovation stablecoins are actually nothing new. What they really amount to, they say, is another form of information sensitive private money, with stablecoin issuers operating more like unregulated banks.
Gorton and Zhang write:
The goal of private money is to be accepted at par with no questions asked. This did not occur during the Free Banking Era in the United States—a period that most resembles the current world of stablecoins. State-chartered banks in the Free Banking Era experienced panics, and their private monies made it very hard to transact because of fluctuating prices. That system was curtailed by the National Bank Act of 1863, which created a uniform national currency backed by U.S. Treasury bonds. Subsequent legislation taxed the state-chartered banks’ paper currencies out of existence in favor of a single sovereign currency.
Unlike me, Kahn is a "brown guy in fintech", so he is better placed to come up with answers than I am. For a start, he is skeptical of the USDT "bank run" scenario:
The unbacked scenario is what concerns investors. If there were a sudden drop in the market, and investors wanted to exchange their USDT for real dollars in Tether’s reserve, that could trigger a “bank run” where the value dropped significantly below one dollar, and suddenly everyone would want their money. That could trigger a full on collapse.

But when might that actually happen? When Bitcoin falls in the frequent crypto bloodbaths, users actually buy Tether - fleeing to the safety of the dollar. This actually drives Tether’s price up! The only scenario that could hurt is when Bitcoin goes up, and Tether demand drops.

But hold on. It’s extremely unlikely Tether is simply creating tokens out of thin air - at worst, there may be some fractional reserve (they themselves admitted at one point it was only 74% backed) that is split between USD and Bitcoin.

The NY AG’s statement that Tether had “no bank anywhere in the world” strongly suggests some money being held in crypto (Tether has stated this is true, but less than 2%), and Tether’s own bank says they use Bitcoin to hold customer funds! That means in the event of a Tether drop/Bitcoin rise, they are hedged.

Tether’s own Terms of Service say users may not redeem immediately. Forced to wait, many users would flee to Bitcoin for lack of options, driving the price up again.
Kahn agrees with me that Tether may have a magic "money" pump:
It’s possible Tether didn’t have the money at some point in the past. And it’s just as possible that, with the massive run in Bitcoin the last year Tether now has more than the $62B they claim!

In that case Tether would seem to have constructed a perfect machine for printing money. (And America has a second central bank.)
Of course, the recent massive run down in Bitcoin will have caused the "machine for printing money" to start running in reverse.

Matt Levine listened to an interview with Tether's CTO Paolo Ardoino and General Counsel Stuart Hoegner, and is skeptical about Tether's backing:
Tether is a stablecoin that we have talked about around here because it was sued by the New York attorney general for lying about its reserves, and because it subsequently disclosed its reserves in a format that satisfied basically no one. Tether now says that its reserves consist mostly of commercial paper, which apparently makes it one of the largest commercial paper holders in the world. There is a fun game among financial journalists and other interested observers who try to find anyone who has actually traded commercial paper with Tether, or any of its actual holdings. The game is hard! As far as I know, no one has ever won it, or even scored a point; I have never seen anyone publicly identify a security that Tether holds or a counterparty that has traded commercial paper with it.
USDT reserve disclosure
Levine contrasts Tether's reserve disclosure with that of another instrument that is supposed to maintain a stable value, a money market fund:
Here is the website for the JPMorgan Prime Money Market Fund. If you click on the tab labeled “portfolio,” you can see what the fund owns. The first item alphabetically is $50 million face amount of asset-backed commercial paper issued by Alpine Securitization Corp. and maturing on Oct. 12. Its CUSIP — its official security identifier — is 02089XMG9. There are certificates of deposit at big banks, repurchase agreements, even a little bit of non-financial commercial paper. ... You can see exactly how much (both face amount and market value), and when it matures, and the CUSIP for each holding.

JPMorgan is not on the bleeding edge of transparency here or anything; this is just how money market funds work. You disclose your holdings.


But the big picture is that USDT pumped $60B into cryptocurrencies. Where did the demand for the $60B come from? In my view, some of it comes from whales accumulating dry powder to use in pump-and-dump schemes like the one illustrated above. But Kahn has two different suggestions. First:
One of the well-known uses for USDT is “shadow banking” - since real US dollars are highly regulated, opening an account with Binance and buying USDT is a straightforward way to get a dollar account.

The CEO of USDC himself admits in this Coindesk article: “In particular in Asia where, you know, these are dollar-denominated markets, they have to use a shadow banking system to do it...You can’t connect a bank account in China to Binance or Huobi. So you have to do it through shadow banking and they do it through tether. And so it just represents the aggregate demand. Investors and users in Asia – it’s a huge, huge piece of it.”
Binance also hosts a massive perpetual futures market, which is “cash-settled” using USDT. This allows traders to make leveraged bets of 100x margin or more...which, in layman’s terms, is basically a speculative casino. That market alone provides around $27B of daily volume, where users deposit USDT to trade on margin. As a result, Binance is by far the biggest holder of USDT, with $17B sitting in its wallet.
Wikipedia describes "perpetual futures" thus:
In finance, a perpetual futures contract, also known as a perpetual swap, is an agreement to non-optionally buy or sell an asset at an unspecified point in the future. Perpetual futures are cash-settled, and differ from regular futures in that they lack a pre-specified delivery date, and can thus be held indefinitely without the need to roll over contracts as they approach expiration. Payments are periodically exchanged between holders of the two sides of the contracts, long and short, with the direction and magnitude of the settlement based on the difference between the contract price and that of the underlying asset, as well as, if applicable, the difference in leverage between the two sides.
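The periodic payment mechanism can be sketched numerically. The sketch below is a simplification under assumed prices (real exchanges clamp the premium, add an interest-rate component, and settle on a fixed schedule, typically every eight hours), but it shows how the payments pull the contract price toward the underlying index:

```python
def funding_payment(position_usdt, mark_price, index_price, interest=0.0001):
    """Simplified periodic funding payment on a perpetual future.

    Positive: longs pay shorts (the contract is trading above the index);
    negative: shorts pay longs. A sketch only -- real exchanges clamp the
    premium and use time-weighted averages rather than a single price.
    """
    premium = (mark_price - index_price) / index_price
    rate = premium + interest
    return position_usdt * rate

# A 10,000 USDT long position while the contract trades 1% above the index
payment = funding_payment(10_000, mark_price=40_400, index_price=40_000)
```

With the contract above the index, the long pays about 101 USDT to the short that period, an incentive to sell the contract back toward the index price.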
In Is Tether a Black Swan? Bernhard Mueller goes into more detail about Binance's market:
According to Tether’s rich list, 17 billion Tron USDT are held by Binance alone. The list also shows 2.68B USDT in Huobi’s exchange wallets. That’s almost 20B USDT held by two exchanges. Considering those numbers, the value given by CryptoQuant appears understated. A more realistic estimate is that ~70% of the Tether supply (43.7B USDT) is located on centralized exchanges.

Interestingly, only a small fraction of those USDT shows up in spot order books. One likely reason is that a large share is sitting on wallets to collateralize derivative positions, in particular perpetual futures. The CEX futures market is essentially a casino where traders bet on crypto prices with insane amounts of leverage. And it’s a massive market: Futures trading on Binance alone generated $60 billion in volume over the last 24 hours. It’s important to understand that USDT perpetual futures implementations are 100% USDT-based, including collateralization, funding and settlement. Prices are tied to crypto asset prices via clever incentives, but in reality, USDT is the only asset that ever changes hands between traders. This use-case generates significant demand for USDT.
Why is this "massive perpetual futures market" so popular? Kahn provides answers:
That crazed demand for margin trading is how we can explain one of the enduring mysteries of crypto - how users can get 12.5% interest on their holdings when banks offer less than 1%.
The high interest is possible because:
The massive supply of USDT, and the host of other dollar stablecoins like USDC, PAX, and DAI, creates an arbitrage opportunity. This brings in capital from outside the ecosystem seeking the “free money”, making trades like this using a combination of 10x leverage and an 8.5% variance between stablecoins to generate an 89% profit in just a few seconds. If you’re only holding the bag for a minute, who cares if USDT is imaginary dollars?
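The arithmetic behind that return is simple: leverage multiplies whatever spread the trade captures. A back-of-envelope sketch, ignoring fees and funding (which is why it lands near, rather than exactly on, the quoted 89%):

```python
def leveraged_return(spread, leverage):
    """Gross return on capital from capturing a price spread at leverage."""
    return spread * leverage

# An 8.5% variance between stablecoins, traded at 10x leverage,
# yields roughly 85% on the capital actually deployed.
gross = leveraged_return(0.085, 10)
```

The same multiplier works in reverse, of course: a spread moving 10% against a 10x position wipes out the collateral entirely.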
Rollicking good times like these attract the attention of regulators, as Amy Castor reported on July 2nd in Binance: A crypto exchange running out of places to hide:
Binance, the world’s largest dark crypto slush fund, is struggling to find corners of the world that will tolerate its lax anti-money laundering policies and flagrant disregard for securities laws.
As a result, Laurence Fletcher, Eva Szalay and Adam Samson report that Hedge funds back away from Binance after regulatory assault:
The global regulatory pushback “should raise red flags for anyone keeping serious capital at the exchange”, said Ulrik Lykke, executive director at ARK36, adding that the fund has “scaled down” exposure.
Lykke described it as “especially concerning” that the recent moves against Binance “involve multiple entities from across the financial sphere”, such as banks and payments groups.
This leaves some serious money looking for an off-ramp from USDT to fiat. These are somewhat scarce:
if USDT holders on centralized exchanges chose to run for the exits, USD/USDC/BUSD liquidity immediately available to them would be relatively small. ~44 billion USDT held on exchanges would be matched with perhaps ~10 billion in fiat currency and USDC/BUSD
This, and the addictive nature of "a casino ... with insane amounts of leverage", probably account for the relatively small drop in USDT market cap since June 18th. Amy Castor reported July 13th on another reason in Binance: Fiat off-ramps keep closing, reports of frozen funds, what happened to Catherine Coley?:
Binance customers are becoming trapped inside of Binance — or at least their funds are — as the fiat exits to the world’s largest crypto exchange close around them. You can almost hear the echoes of doors slamming, one by one, down a long empty corridor leading to nowhere.

In the latest bit of unfolding drama, Binance told its customers today that it had disabled withdrawals in British Pounds after its key payment partner, Clear Junction, ended its business relationship with the exchange.
There’s a lot of unhappy people on r/BinanceUS right now complaining their withdrawals are frozen or suspended — and they can’t seem to get a response from customer support either.
Binance is known for having “maintenance issues” during periods of heavy market volatility. As a result, margin traders, unable to exit their positions, are left to watch in horror while the exchange seizes their margin collateral and liquidates their holdings.
And it isn't just getting money out of Binance that is getting hard, as David Gerard reports:
Binance is totally not insolvent! They just won’t give anyone their cryptos back because they’re being super-compliant. KYC/AML laws are very important to Binance, especially if you want to get your money back after suspicious activity on your account — such as pressing the “withdraw” button. Please send more KYC. [Binance]
Issues like these tend to attract the attention of the mainstream press. On July 23rd the New York Times' Eric Lipton and Ephrat Livni profiled Sam Bankman-Fried of the FTX exchange in Crypto Nomads: Surfing the World for Risk and Profit:
The highly leveraged form of trading these platforms offer has become so popular that the overall value of daily purchases and sales of these derivatives far surpasses the daily volume of actual cryptocurrency transactions, industry data analyzed by researchers at Carnegie Mellon University shows.
FTX alone has one million users across the world and handles as much as $20 billion a day in transactions, most of them derivatives trades.

Like their customers, the platforms compete. Mr. Bankman-Fried from FTX, looking to out promote BitMEX, moved to offer up to 101 times leverage on derivatives trades. Mr. Zhao from Binance then bested them both by taking it to 125.
Then on the 25th, as the regulators' seriousness sank in, the same authors reported Leaders in Cryptocurrency Industry Move to Curb the Highest-Risk Trades:
Two of the world’s most popular cryptocurrency exchanges announced on Sunday that they would curb a type of high-risk trading that has been blamed in part for sharp fluctuations in the value of Bitcoin and the casino-like atmosphere on such platforms globally.

The first move came from the exchange, FTX, which said it would reduce the size of the bets investors can make by lowering the amount of leverage it offers to 20 times from 101 times. Leverage multiplies the traders’ chance for not only profit, but also loss.
About 14 hours later, Changpeng Zhao [CZ], the founder of Binance, the world’s largest cryptocurrency exchange, echoed the move by FTX, announcing that his company had already started to limit leverage to 20 times for new users and it would soon expand this limit to other existing clients.
Early the next day, Tom Schoenberg, Matt Robinson, and Zeke Faux reported for Bloomberg that Tether Executives Said to Face Criminal Probe Into Bank Fraud:
U.S. probe into Tether is homing in on whether executives behind the digital token committed bank fraud, a potential criminal case that would have broad implications for the cryptocurrency market.

Tether’s pivotal role in the crypto ecosystem is now well known because the token is widely used to trade Bitcoin. But the Justice Department investigation is focused on conduct that occurred years ago, when Tether was in its more nascent stages. Specifically, federal prosecutors are scrutinizing whether Tether concealed from banks that transactions were linked to crypto, said three people with direct knowledge of the matter who asked not to be named because the probe is confidential.

Federal prosecutors have been circling Tether since at least 2018. In recent months, they sent letters to individuals alerting them that they’re targets of the investigation, one of the people said.
Once again, David Gerard pointed out the obvious market manipulation:
This week’s “number go up” happened several hours before the report broke — likely when the Bloomberg reporter contacted Tether for comment. BTC/USD futures on Binance spiked to $48,000, and the BTC/USD price on Coinbase spiked at $40,000 shortly after.

Here’s the one-minute candles on Coinbase BTC/USD around 01:00 UTC (2am BST on this chart) on 26 July — the price went up $4,000 in three minutes. You’ve never seen something this majestically organic
And so did Amy Castor in The DOJ’s criminal probe into Tether — What we know:
Last night, before the news broke, bitcoin was pumping like crazy. The price climbed nearly 17%, topping $40,000. On Coinbase, the price of BTC/USD went up $4,000 in three minutes, a bit after 01:00 UTC.

After a user placed a large number of buy orders for bitcoin perpetual futures denominated in tethers (USDT) on Binance — an unregulated exchange struggling with its own banking issues — The BTC/USDT perpetual contract hit a high of $48,168 at around 01:00 UTC on the exchange.

Bitcoin pumps are a good way to get everyone to ignore the impact of bad news and focus on number go up. “Hey, this isn’t so bad. Bitcoin is going up in price. I’m rich!”
As shown in the graph, the perpetual futures market is at least an order of magnitude larger than the spot market upon which it is based. And as we saw, for example, on December 16th and July 26th, the spot market is heavily manipulated. Pump-and-dump schemes in the physical market are very profitable, and connecting them to the casino in the futures market with its insane leverage can juice profitability enormously.

Tether and Binance

Fais Kahn's second part, Bitcoin's end: Tether, Binance and the white swans that could bring it all down, explores the mutual dependency between Tether and Binance:
There are 62B USDT tokens in circulation, much of which exists to fuel the massive casino that is the perpetual futures market on Binance. These complex derivatives markets, which are illegal to trade in the US, run in the tens of billions and help drive up the price of Bitcoin by generating the basis trade.
The "basis trade":
involves buying a commodity at spot (taking a long position) and simultaneously establishing a short position through derivatives like options or futures contracts
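In the crypto setting, that means buying BTC spot while shorting the future at its premium and pocketing the difference when the prices converge. A minimal sketch of the annualized return, with assumed prices and no fees, funding, or margin costs:

```python
def annualized_basis(spot, future, days_to_expiry):
    """Annualized gross return of buying spot and shorting the future.

    Assumes the two prices converge at expiry; ignores fees, funding
    payments, and margin requirements.
    """
    return (future - spot) / spot * (365 / days_to_expiry)

# A future trading 2.5% above spot with 90 days to run
r = annualized_basis(spot=40_000, future=41_000, days_to_expiry=90)
```

That works out to roughly 10% annualized, nominally market-neutral, which is exactly the kind of "free money" that pulls outside capital into the ecosystem.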
Kahn continues:
For Binance to allow traders to make such crazy bets, it needs collateral to make sure if traders get wiped out, Binance doesn’t go bankrupt. That collateral is now an eye-popping $17B, having grown from $3B in February and $10B in May:

But for that market to work, Binance needs USDT. And getting fresh USDT is a problem now that the exchange, which has always been known for its relaxed approach to following the laws, is under heavy scrutiny from the US Department of Justice and IRS: so much so that their only US dollar provider, Silvergate Bank, recently terminated their relationship, suggesting major concerns about the legality of some of Binance’s activities. This means users can no longer transfer US dollars from their bank to Binance, which were likely often used to fund purchases of USDT.

Since that shutdown, the linkages between Binance, USDT, and the basis trade are now clearer than ever. In the last month, the issuance of USDT has completely stopped:

Likewise, futures trading has fallen significantly. This confirms that most of the USDT demand likely came from leveraged traders who needed more and more chips for the casino. Meanwhile, the basis trade has completely disappeared at the same time.

Which is the chicken and which is the egg? Did the massive losses in Bitcoin kill all the craziest players and end the free money bonanza, or did Binance’s banking troubles choke off the supply of dollars, ending the game for everyone? Either way, the link between futures, USDT, and the funds flooding the crypto world chasing free money appears to be broken for now.
This is a problem for Binance:
Right now Tether is Binance’s $17B problem. At this point, Binance is holding so much Tether the exchange is far more dependent on USDT’s peg staying stable than it is on any of its banking relationships. If that peg were to break, Binance would likely see capital flight on a level that would wreak untold havoc in the crypto markets
Regulators have been increasing the pace of their enforcement actions. In other words, they are getting pissed, and the BitMEX founders going to jail is a good example of what might await.

Binance has been doing all it can to avoid scrutiny, and you have to award points for creativity. The exchange was based in Malta, until Malta decided Binance had “no license” to operate there, and that Malta did not have jurisdiction to regulate them. As a result, CZ began to claim that Binance “doesn’t have” a headquarters.

Wonder why? Perhaps to avoid falling under anyone’s direct jurisdiction, or to avoid a paper trail?

CZ went on to only reply that he is based in “Asia.” Given what China did to Jack Ma recently, we can empathize with a desire to stay hidden, particularly when unregulated exchanges are a key rail for evading China’s strict capital controls. Any surprise that the CFO quit last month?
But it is also a problem for Tether:
Here’s what could trigger a cascade that could bring the exchange down and much of crypto with it: the DOJ and IRS crack down on Binance, either by filing charges against CZ or pushing Biden and Congress to give them the death penalty: full on sanctions. This would lock them out of the global financial system, cause withdrawals to skyrocket, and eventually drive them to redeem that $17B of USDT they are sitting on.

And what will happen to Tether if they need to suddenly sell or redeem those billions?

We have no way of knowing. Even if fully collateralized, Tether would need to sell billions in commercial paper on short notice. And in the worst case, the peg would break, wreaking absolute havoc and crushing crypto prices.
It’s possible that regulators will move as slow as they have been all along - with one country at a time unplugging Binance from its banking system until the exchange eventually shrinks down to be less of a systemic risk than it is.
That's my guess — it will become increasingly difficult either to get USD or cryptocurrency out of Binance's clutches, or to send them fiat, as banks around the world realize that doing business with Binance is going to get them in trouble with their regulators. Once customers realize that Binance has become a "roach motel" for funds, and that about 25% of USDT is locked up there, things could get quite dynamic.

Kahn concludes:
Everything around Binance and Tether is murky, even as these two entities dominate the crypto world. Tether redemptions are accelerating, and Binance is in trouble, but why some of these things are happening is guesswork. And what happens if something happens to one of those two? We’re entering some uncharted territory. But if things get weird, don’t say no one saw it coming.

Policy Responses

Gorton and Zhang argue that the modern equivalent of the "free banking" era is fraught with too many risks to tolerate. David Gerard provides an overview of the era in Stablecoins through history — Michigan Bank Commissioners report, 1839:
The wildcat banking era, more politely called the “free banking era,” ran from 1837 to 1863. Banks at this time were free of federal regulation — they could launch just under state regulation.

Under the gold standard in operation at the time, these state banks could issue notes, backed by specie — gold or silver — held in reserve. The quality of these reserves could be a matter of some dispute.

The wildcat banks didn’t work out so well. The National Bank Act was passed in 1863, establishing the United States National Banking System and the Office of the Comptroller of the Currency — and taking away the power of state banks to issue paper notes.
Gerard's account draws from a report of Michigan's state banking commissioners, Documents Accompanying the Journal of the House of Representatives of the State of Michigan, pp. 226–258, which makes clear that Tether's lack of transparency as to its reserves isn't original. Banks were supposed to hold "specie" (money in the form of coin) as backing but:
The banking system at the time featured barrels of gold that were carried to other banks, just ahead of the inspectors
For example, the commissioners reported that:
The Farmers’ and Mechanics’ bank of Pontiac, presented a more favorable exhibit in point of solvency, but the undersigned having satisfactorily informed himself that a large proportion of the specie exhibited to the commissioners, at a previous examination, as the bona fide property of the bank, under the oath of the cashier, had been borrowed for the purpose of exhibition and deception; that the sum of ten thousand dollars which had been issued for “exchange purposes,” had not been entered on the books of the bank, reckoned among its circulation, or explained to the commissioners.
Gorton and Zhang summarize the policy choices thus:
Based on historical lessons, the government has a couple of options: (1) transform stablecoins into the equivalent of public money by (a) requiring stablecoins to be issued through FDIC-insured banks or (b) requiring stablecoins to be backed one-for-one with Treasuries or reserves at the central bank; or (2) introduce a central bank digital currency and tax private stablecoins out of existence.
Their suggestions for how to implement the first option include:
  • the interpretation of Section 21 of the Glass-Steagall Act, under which "it is unlawful for a non-bank entity to engage in deposit-taking"
  • the interpretation of Title VIII of the Dodd-Frank Act, under which the Financial Stability Oversight Council could "designate stablecoin issuance as a systemic payment activity". This "would give the Federal Reserve the authority to regulate the activity of stablecoin issuance by any financial institution."
  • Congress could pass legislation that requires stablecoin issuers to become FDIC-insured banks or to run their business out of FDIC-insured banks. As a result, stablecoin issuers would be subject to regulations and supervisory activities that come along with being an FDIC-insured bank.
Alternatively, the second option would involve:
Congress could require the Federal Reserve to issue a central bank digital currency as a substitute to privately produced digital money like stablecoins
The question then becomes whether policymakers would want to have central bank digital currencies coexist with stablecoins or to have central bank digital currencies be the only form of money in circulation. As discussed previously, Congress has the legal authority to create a fiat currency and to tax competitors of that uniform national currency out of existence.
They regard the key attribute of an instrument that acts as money to be that it is accepted at face value "No Questions Asked" (NQA). Thus, based on history they ask:
In other words, should the sovereign have a monopoly on money issuance? As shown by revealed preference in the table below, the answer is yes. The provision of NQA money is a public good, which only the government can supply.

Building a Transatlantic Digital Scholarship Skills Exchange for Research Libraries: Moving Forward / Digital Library Federation

There may be an ocean between the US and the UK, but in the age of Zoom, collaboration can transcend geographical boundaries. Research Libraries UK’s Digital Scholarship Network (DSN) and CLIR’s Digital Library Federation’s Data and Digital Scholarship (DDS) working group are continuing to foster a partnership aimed at building transatlantic collaborations and connections. The two groups have had a series of conversations, and we have now hosted two joint meetings. At an April 2021 event, we divided into small groups to share ideas about the potential collaborative future of our two groups.  Then we conducted a follow-up Expressions of Interest survey in April-May 2021. 

In response to the April event and the Expressions of Interest survey, we hosted a second event on 14 July 2021, at which we launched a beta Skills Exchange Directory, and we offered a series of tailored “skills conversations.” 

This post provides more detail on the July event and plans for next steps in our transatlantic collaboration.

July Event

One of the main takeaways from the first event in April was that participants appreciated being able to meet colleagues from both the UK and the US, so we were keen to facilitate this again and started the July event with one-to-one or small-group “meet and greets”. Participants enjoyed the serendipity of these meetings, and individual connections have already been made. 

The second part of the event was more structured and aimed to build on the information we had gathered from the Expressions of Interest survey. From this survey we identified the key areas of digital scholarship skills that colleagues were most keen to develop, and matched these with colleagues willing to share their expertise in those areas by facilitating curated conversations. The areas identified were:

  • Artificial Intelligence and machine learning 
  • Tools for digital scholarship 
  • Planning and managing a digital scholarship centre 
  • Assessment and Metrics

The success of these curated conversations was dependent on the experts being willing to share their knowledge, and we are grateful to Carol Chiodo (Harvard Library), Alexandra Sarkozy (Wayne State University), Sarah Melton (Boston College), Kirsty Lingstadt (University of Edinburgh), Eleonora Gandolfi (University of Southampton), Gavin Boyce (University of Sheffield) and Matt Philips (University of Southampton) for being so willing to participate and share their experiences so openly. 

In these breakout sessions the experts each spoke for ten minutes about their services and how they developed skills, and then participants who had signed up for the session were able to ask follow-up questions. DSN and DDS partners acted as moderators in each breakout session, and we were impressed with how informative and interactive the sessions were. If you weren’t able to make it to the July event, each session documented the conversation in a shared notes document.

These interactive sessions are core to the success of a skills exchange, but the finale of this event was the launch of the Skills Directory – a dynamic resource through which colleagues from both groups can share their skills and expertise with one another. 

Transatlantic Skills Directory

Participants were introduced to the directory (designed by Stephanie Jesper and Susan Halfpenny at the University of York) and shown how to search (via the Google Sheets filter function) to identify colleagues with varying levels of expertise across 18 skills areas relating to digital scholarship activities within research libraries. The directory includes the names and contact details of colleagues who are willing to share their skills and expertise around a skills area, and the means through which they are willing to do so (e.g. one-on-one conversation, contributing to a training session, etc.). Due to the potential demand on colleagues, the directory is only available to DSN and DDS members, but a recording of the introduction to the directory from the session is available here.


screenshot of the RLUK/DDS skills directory

The success of the Directory depends on colleagues signing up to share their skills, and after the demonstration participants were given time to register their own skills – the group was impressed by the response. As we write this, 30 colleagues have registered in the directory, offering more than 175 skills between them. However, we still need more, and we encourage members of RLUK DSN and DLF DDS to register their skills. We need your expertise to make this a success! We also encourage members to make good use of the Directory to learn new skills or move forward with digital scholarship services, and we would be keen to hear of any contacts made via the directory.

Next Steps

These events and tools have helped us learn more about our colleagues on both sides of the Atlantic, and RLUK DSN and DLF DDS look forward to continuing this partnership. Please check out the shared meeting notes, resources, and other materials available on our Open Science Framework site. We plan to host more joint events to support networking and idea-generation, and we will continue to expand the directory with more colleagues who are interested in exchanging skills.

Thanks to everyone involved in arranging these events and the directory and of course to everyone who participates in the skills exchange!

Colleagues leading this work

Beth Clark, Associate Director, Digital Scholarship & Innovation, London School of Economics, and RLUK DSN member.

Sara Mannheimer, Associate Professor, Data Librarian, Montana State University, and DLF DDS co-convener.

Jason Clark, Professor, Lead for Research Informatics, Montana State University, and DLF DDS co-convener.

Susan Halfpenny, Head of Digital Scholarship & Innovation, University of York, and RLUK DSN member.

Matt Greenhall, Deputy Executive Director, RLUK

Thanks go to Gayle Schechter (Program Associate, CLIR/DLF), Louisa M. Kwasigroch (Director, Outreach and Engagement at CLIR and Interim DLF Senior Program Officer), Kirsty Lingstadt (Deputy Director, University of Edinburgh and RLUK DSN co-convener), Eleonora Gandolfi (Head of Digital Scholarship and Innovation, University of Southampton and RLUK DSN co-convener), Stephanie Jesper (Teaching & Learning Advisor, University of York), and Melanie Cheung (RLUK Executive Assistant).

The post Building a Transatlantic Digital Scholarship Skills Exchange for Research Libraries: Moving Forward appeared first on DLF.

Register your Interest: Open Knowledge Justice Programme Community Meetups / Open Knowledge Foundation

What’s this about?

The Open Knowledge Justice Programme is kicking off a series of free, monthly community meetups to talk about Public Impact Algorithms.

Register here.

Who is this for?

Do you want to learn more about Public Impact Algorithms?

Would you like to know how to spot one, and how they might affect the clients you represent?

Do you work in government, academia, policy-making or civil society – and are interested in learning how to deploy a Public Impact Algorithm fairly?

Tell me more

Whether you’re new to tech or a seasoned pro, join us once a month to share your experiences, listen to our guest speakers, and ask our data expert questions on this fast-changing issue.

= = = = =
When? Lunch time every second Thursday of the month – starting September 9th 2021.
How? Register your interest here
= = = = =

More info:

DLF Digest: August 2021 / Digital Library Federation

DLF Digest

A monthly round-up of news, upcoming working group meetings and events, and CLIR program updates from the Digital Library Federation

This month’s news:

This month’s DLF group events:

DLF Digital Library Pedagogy group – #DLFteach Twitter Chat

Tuesday, August 17, 8pm ET/5pm PT; participate on Twitter using the hashtag #DLFteach

Join the DLF Digital Library Pedagogy group for this month’s Twitter chat on building stronger community engagement for open source. Twitter chat details, instructions, and archives of past #DLFteach chats are available on the DLF wiki.

This month’s open DLF group meetings:

For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), make sure to bookmark the DLF Community Calendar. Can’t find meeting call-in information? Email us at

DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member institution. Learn more about our working groups and how to get involved on the DLF website. Interested in starting a new working group or reviving an older one? Need to schedule an upcoming working group call? Check out the DLF Organizer’s Toolkit to learn more about how Team DLF supports our working groups, and send us a message at to let us know how we can help. 

The post DLF Digest: August 2021 appeared first on DLF.

Arabic Translations Available of the 2019 Levels of Digital Preservation / Digital Library Federation

The NDSA is pleased to announce that the 2019 Levels of Preservation documents have been translated into Arabic by our colleagues from the Thesaurus Islamicus Foundation’s Qirab project.

Translations for the Levels of Digital Preservation Matrix and Implementation Guide were completed.

Links to these documents are found on the 2019 Levels of Digital Preservation OSF site.

If you are interested in translating the Levels of Digital Preservation V2.0 into another language, please contact us at

Arabic Translations of the 2019 Levels of Preservation and Associated Documents

The National Digital Stewardship Alliance is pleased to announce the release of an Arabic translation of the 2019 Levels of Digital Preservation documents, carried out by our colleagues from the Qirab project of the Thesaurus Islamicus Foundation.

Translations of the Levels of Digital Preservation Matrix and Implementation Guide have been completed.

These documents can be accessed through the 2019 Levels of Digital Preservation page on the Open Science Framework (OSF) site.

If you are interested in translating the second version of the Levels of Digital Preservation into another language, please contact us at the following email address:

The post Arabic Translations Available of the 2019 Levels of Digital Preservation appeared first on DLF.

Presentation: Two Metadata Directions / Lorcan Dempsey


Presentation: Two Metadata Directions

I was pleased to deliver a presentation at the Eurasian Academic Libraries Conference - 2021, organized by The Nazarbayev University Library and the Association of University Libraries in the Republic of Kazakhstan. Thanks to April Manabat of Nazarbayev University for the invitation and support. I was asked to talk about metadata and to mention OCLC developments. The conference topic was: Contemporary Trends in Information Organization in the Academic Library Environment.

Further information here:

Nazarbayev University LibGuides: Eurasian Academic Libraries Conference - 2021: Home


I spoke about two trends in metadata developments: entification and pluralization. Each of these is important, and under each heading I provided an example of a related initiative at OCLC.

I discuss these trends in more detail in a recent blog entry, which recapitulates and extends the material presented at the conference:

Two metadata directions in libraries
Metadata practice is evolving. I discuss two important trends here: entification and pluralization.

Additional materials

The slides as presented are here:

This is a presentation about trends in metadata, focusing on two important issues. The first is entification, moving from strings to things. The second is pluralization, as we seek to better represent the diversity of perspectives, experiences and memories. It discusses an OCLC initiative associated…

The conference organizers have made a video of the sessions available. Here is the video for day two, which should begin as I begin speaking. Move back to see more of the presentations.

As we prepared the video, I did reflect on the future of conferences and conference-going. Clearly, much to work through here, and we are certainly seeing new and engaging online and hybrid experiences. In writing the accompanying blog entry, I finished with this observation:

The Pandemic is affecting how we think about work travel and the design of events, although in as yet unclear ways. One pandemic effect, certainly, has been the ability to think about both audiences and speakers differently. It is unlikely that I would have attended this conference had it been face to face, however, I readily agreed to be an online participant. // Two Metadata Directions

Economics Of Evil Revisited / David Rosenthal

Eight years ago I wrote Economics of Evil about the death of Google Reader and Google's habit of leaving its customers (er, users) in the lurch. In the comments to the post I started keeping track of accessions to le petit musée des projets Google abandonnés. So far I've recorded at least 33 dead products, an average of more than 4 a year. Two years ago Ron Amadeo wrote about the problem this causes in Google’s constant product shutdowns are damaging its brand:
We are 91 days into the year, and so far, Google is racking up an unprecedented body count. If we just take the official shutdown dates that have already occurred in 2019, a Google-branded product, feature, or service has died, on average, about every nine days.
Below the fold, some commentary on Amadeo's latest report from the killing fields, in which he detects a little remorse.

Belatedly, someone at Google seems to have realized that repeatedly suckering people into using one of your products then cutting them off at the knees, in some cases with one week's notice, can reduce their willingness to use your other products. And they are trying to do something about it, as Amadeo writes in Google Cloud offers a model for fixing Google’s product-killing reputation:
A Google division with similar issues is Google Cloud Platform, which asks companies and developers to build a product or service powered by Google's cloud infrastructure. Like the rest of Google, Cloud Platform has a reputation for instability, thanks to quickly deprecating APIs, which require any project hosted on Google's platform to be continuously updated to keep up with the latest changes. Google Cloud wants to address this issue, though, with a new "Enterprise API" designation.
What Google means by "Enterprise API" is:
Our working principle is that no feature may be removed (or changed in a way that is not backwards compatible) for as long as customers are actively using it. If a deprecation or breaking change is inevitable, then the burden is on us to make the migration as effortless as possible.
They then have this caveat:
The only exception to this rule is if there are critical security, legal, or intellectual property issues caused by the feature.
And go on to explain what should happen:
Customers will receive a minimum of one year’s notice of an impending change, during which time the feature will continue to operate without issue. Customers will have access to tools, docs, and other materials to migrate to newer versions with equivalent functionality and performance. We will also work with customers to help them reduce their usage to as close to zero as possible.
This sounds good, but does anyone believe if Google encountered "critical security, legal, or intellectual property issues" that meant they needed to break customer applications they'd wait a year before fixing them?

Amadeo points out that:
Despite being one of the world's largest Internet companies and basically defining what modern cloud infrastructure looks like, Google isn't doing very well in the cloud infrastructure market. Analyst firm Canalys puts Google in a distant third, with 7 percent market share, behind Microsoft Azure (19 percent) and market leader Amazon Web Services (32 percent). Rumor has it (according to a report from The Information) that Google Cloud Platform is facing a 2023 deadline to beat AWS and Microsoft, or it will risk losing funding.
The linked story from 2019 actually says:
While the company has invested heavily in the business since last year, Google wants its cloud group to outrank those of one or both of its two main rivals by 2023
On Canalys numbers, the "and" target to beat (AWS plus Azure) has happy customers forming 51% of the market. So there is 42% of the market up for grabs. If Google added every single one of them to its 7% they still wouldn't beat a target of "both". Adding six times their customer base in 2 years isn't a realistic target.

Even the "or" target of Azure is unrealistic. Since 2019 Google's market share has been static while Azure's has been growing slowly. Catching up in the 2 years remaining would involve adding 170% of Google's current market share. So le petit musée better be planning to enlarge its display space to make room for a really big new exhibit in 2024.
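The back-of-the-envelope arithmetic above is easy to verify. A minimal check using the Canalys shares quoted earlier (nothing here is from The Information's report; it is just the published market-share figures):

```python
# Canalys 2021 cloud infrastructure market shares quoted above
aws, azure, google = 0.32, 0.19, 0.07

# The "and" target: beating AWS plus Azure combined (51% of the market)
combined = aws + azure
up_for_grabs = 1 - combined - google          # share held by none of the top three
growth_needed = (combined - google) / google  # growth required, as a multiple of Google's base

# The "or" target: merely catching Azure
azure_gap = (azure - google) / google

print(f"up for grabs: {up_for_grabs:.0%}")                    # 42%
print(f"to beat AWS + Azure: add {growth_needed:.1f}x base")  # 6.3x
print(f"to catch Azure: add {azure_gap:.0%} of base")         # 171%
```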

Call for Nominations to the NDSA Coordinating Committee / Digital Library Federation

NDSA will be electing three members to its Coordinating Committee (CC) this year, with terms starting in January 2022. CC members serve a three year term, participate in a monthly call, and meet at the annual Digital Preservation Conference. The Coordinating Committee provides strategic leadership to the organization in coordination with group co-chairs. NDSA is a diverse community with a critical mission, and we seek candidates to join the CC that bring a variety of cultures and orientations, skills, perspectives and experiences, to bear on leadership initiatives. Working on the CC is an opportunity to contribute your leadership for the community as a whole, while collaborating with a wonderful group of dynamic and motivated professionals. 

If you are interested in joining the NDSA Coordinating Committee (CC) or want to nominate another member, please complete the nomination form by 11:59pm EDT Friday, August 13, 2021, which asks for the name, e-mail address, brief bio/candidate statement (nominee-approved), and NDSA-affiliated institution of the nominee. We particularly encourage and welcome nominations of people from underrepresented groups and sectors. 

As members of the NDSA, we join together to form a consortium of more than 260 partnering organizations, including businesses, government agencies, nonprofit organizations, professional associations and universities, all engaged in the long-term preservation of digital information. Committed to preserving access to our national digital heritage, we each offer our diverse skills, perspectives, experiences, cultures and orientations to achieve what we could not do alone. 

The CC is dedicated to ensuring a strategic direction for NDSA, to the advancement of NDSA activities to achieve community goals, and to further communication among digital preservation professionals and NDSA member organizations. The CC is responsible for reviewing and approving NDSA membership applications and publications; updating eligibility standards for membership in the alliance, and other strategic documents; engaging with stakeholders in the community; and working to enroll new members committed to our core mission. More information about the duties and responsibilities of CC members can be found at the NDSA’s Leadership Page.

We hope you will give this opportunity serious consideration and we value your continued contributions and leadership in our community.

Any questions can be directed to  

Thank you,

Nathan Tallman, Vice Chair
On behalf of the NDSA Coordinating Committee

The post Call for Nominations to the NDSA Coordinating Committee appeared first on DLF.

Fedora Migration Paths and Tools Project Update: July 2021 / DuraSpace News

This is the latest in a series of monthly updates on the Fedora Migration Paths and Tools project – please see the previous post for a summary of the work completed up to that point. This project has been generously funded by the IMLS.

We completed some final performance tests and optimizations for the University of Virginia pilot. Both the migration to their AWS server and the Fedora 6.0 indexing operation were much slower than anticipated, so the project team tested a number of optimizations, including:

  1. Adding more processing threads
  2. Increasing the size of the server instance 
  3. Using a separate and larger database server 
  4. Using locally attached flash storage

Fortunately, these improvements made a big difference; for example, ingest speed was increased from 6.8 resources per second to 45.6 resources per second. In general, this means that institutions with specific performance targets can use a combination of parallel processing and increased computational resources. Feedback from this pilot has been incorporated into the migration guide, updates to the migration-utils to improve performance, updates to the aws-deployer tool to provide additional options, and improvements to the migration-validator to handle errors.

The Whitman College team has begun their production migration using Islandora Workbench. Initial benchmarking has shown that running Workbench from the production server rather than locally on a laptop achieves much better performance, so this is the recommended approach. The team is working collection-by-collection using CSV files and a tracking spreadsheet to keep track of each collection as it is ingested and ready to be tested. They have also developed a Quality Control checklist to make sure everything is working as intended – we anticipate doing detailed checks on the first few collections and spot checks for subsequent collections.

As we near the end of the pilot project phase of the grant work we are focused on documentation for the migration toolkit. We plan to complete a draft of this documentation over the summer, after which this draft will be shared with the broader community for feedback. We will organize meetings in the Fall to provide opportunities for community members to provide additional feedback on the toolkit and make suggestions for improvements.

The post Fedora Migration Paths and Tools Project Update: July 2021 appeared first on

How well does EAD tag usage support finding aid discovery? / HangingTogether

In November, we shared information with you about the Building a National Finding Aid Network project (NAFAN). This is a two-year research and demonstration project to build the foundation for a (US) national archival finding aid network. OCLC is engaged as a partner in the project, leading qualitative and quantitative research efforts. This post gives some details on just one of those research strands: evaluation of finding aid data quality.

In considering building a nationwide aggregation of finding aids, looking at the potential raw materials that will make up that resource helps us both to scope the network’s functionality to the finding aid data and to lay the groundwork for data remediation and expanded network features.

We have two main research questions when approaching finding aid data quality:

  • What is the structure and extent of consistency across finding aid data in current aggregations?
  • Can that data support the needs to be identified in the user research phase of the study? If so, how? If not, what are the gaps?

About the research aggregation

Twelve NAFAN partners made their finding aids available to the project for quantitative analysis, producing a total of over 145 thousand documents. The finding aids were provided in the Encoded Archival Description (EAD) format. EAD is an XML-based standard for describing collections of archival materials. 

As a warning to the reader: this post delves deeply into EAD elements and attributes and assumes at least a passing knowledge of the encoding standard. For those wishing to learn more about the definitions and structure, we recommend the official EAD website or the less official but highly readable and helpful EADiva site.

A treemap visualization of finding aid sources in the NAFAN research aggregation.

This treemap visualizes the relative proportion of the finding aid aggregation from the partners:

  • Archival Resources in Wisconsin
  • Archives West
  • Arizona Archives Online (AAO)
  • Black Metropolis Research Consortium (BMRC)
  • Chicago Collections Consortium
  • Connecticut’s Archives Online (CAO)
  • Empire Archival Discovery Cooperative (EmpireADC)
  • Online Archives of California (OAC)
  • Philadelphia Area Archival Research Portal (PAARP)
  • Rhode Island Archives and Manuscripts Online (RIAMCO)
  • Texas Archival Resources Online (TARO)
  • Virginia Heritage

​Though a few of the partners provided much of the content, the aggregation is a very good mix from a wide variety of United States locales and institution types.​

Dimensions for analysis

This analysis continues work carried out previously, including a 2013 EAD tag analysis that OCLC worked on with a different aggregation of EAD documents, based on about 120,000 finding aids drawn from OCLC’s ArchiveGrid discovery system. ​You can check out that previous study, “Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems” published in code4lib Journal issue 22.

OCLC’s 2013 analysis looked at EAD tag and attribute usage from a discovery perspective. For that study, we identified five high-level features that were often present in archival discovery systems.  ​

  • Search: all discovery systems have a keyword search function; many also include the ability to search by a particular field or element.​
  • Browse: many discovery systems include the ability to browse finding aids by title, subject, dates, or other facets.​
  • Results display: once a user has done a search, the results display will return portions of the finding aid to help with further evaluation.​
  • Sort: once a user has done a search, they may have the option to reorder the results.​
  • Facet: once a user has done a search, they may have the option to narrow the results to only include results that fall within certain facets.​

The analysis used that framework of high-level discovery features to select EAD elements and attributes that, if present, could be accessed, indexed, and displayed.​

This is the categorization of EAD elements and attributes that the study found to be relevant for supporting discovery system features.  ​

For example, dates could potentially be utilized as search terms, or leveraged for browsing or sorting. They may also be important for disambiguating similarly named collections in displays. Similarly, material types, represented by form and genre terms, could be important for narrowing a large result set using a facet.

(Thank you to EADiva for providing the excellent tag library that is linked from the EAD element names above.)

The question then was, how often are these key elements and attributes used?​
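A tally of that kind can be sketched in a few lines. This is an illustrative script, not the project's actual analysis tooling: the watched-element list and file paths are hypothetical, and namespaces are stripped so that EAD 2002 and namespaced documents count alike:

```python
import glob
from collections import Counter
from xml.etree import ElementTree as ET

# A few of the discovery-relevant EAD elements discussed above (illustrative subset)
WATCHED = {"unittitle", "unitdate", "origination", "controlaccess",
           "genreform", "abstract", "physdesc"}

def tag_counts(paths):
    """Count, for each watched element, the number of finding aids
    that use it at least once."""
    usage = Counter()
    for path in paths:
        try:
            root = ET.parse(path).getroot()
        except ET.ParseError:
            continue  # skip malformed documents rather than abort the run
        # Drop any namespace prefix, e.g. {urn:isbn:1-931666-22-9}unittitle,
        # and count each element at most once per document
        seen = {el.tag.rsplit("}", 1)[-1]
                for el in root.iter() if isinstance(el.tag, str)}
        usage.update(seen & WATCHED)
    return usage

# e.g.: counts = tag_counts(glob.glob("finding-aids/*.xml"))
```

Dividing each count by the number of finding aids gives the per-element usage rate, which is the kind of statistic the threshold tables summarize.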

Defining Thresholds for Discovery

A table showing the thresholds of EAD tag usage for supporting discovery.

We should preface this by saying that it is difficult to predefine thresholds for the level of usage of an element at which it becomes more or less useful for discovery. ​Is an element that is used 95% of the time still useful but one that is used 94% not?​

OCLC’s 2013 study developed these thresholds after evaluating the EAD aggregation.​ The absence of an element does not directly lead to a breakdown in a discovery system. It is more like a gradual decay of its effectiveness. ​

Although we used these levels as a reference point in the 2013 study, we recognized that correlating usage with discovery is an artificial construct. ​

A table comparing EAD tag usage in 2021 and 2013.

The above figure compares the usage thresholds from the 2013 study (right) with the same tag analysis applied to the NAFAN corpus in 2021 (left).

You will notice that there are some elements that we have added for analysis in the 2021 study, thanks to input from the expanded project team and the advisory board, which is reviewing and providing input into the work. This input has helped to breathe new life into old research.

​The findings of the 2013 study were decidedly mixed. Some important elements were at the high or complete thresholds.  But many elements that are necessary for discovery interfaces were at medium or low use. Though the NAFAN EAD aggregation is a different corpus of data provided by different contributing institutions at a different time, the EAD tag analysis for it hasn’t changed the picture very much.  ​

A few elements have moved from the high threshold to complete, and a few from medium to high. And we found that there was mostly low-level use of content tags in origination and control access.​ Apart from that, the 2013 study’s appraisal of how well EAD supports the typical features of discovery systems could be considered mostly unchanged.​ This may be due in part to the relatively static nature of EAD finding aids. Once written and published, some documents may not receive further updates and improvements. It is not uncommon to find EAD documents in this aggregation that were published several years ago and have not been updated since.

Looking back on the conclusions of the 2013 study suggests that its cautionary forecast about underutilization of EAD to support discovery has proven to be accurate, while the study’s vision of the promise and potential for improving EAD encoding has yet to be fulfilled.​

If the archival community continues on its current path, then the potential of the EAD format to support researchers or the public in discovery of material will remain underutilized. At a minimum, collection descriptions that fall below the thresholds for discovery will hinder researchers' efforts; at worst, those collections will remain hidden from view.

Perhaps with emerging evidence about the corpus of EAD, continued discussion of practice, recognition of a need for greater functionality, and shared tools both to create new EAD documents and improve existing encoding, we can look forward to further increasing the effectiveness and efficiency of EAD encoding and develop a practice of EAD encoding that pushes collection descriptions across the threshold of discovery.


More research opportunities

Though replicating the 2013 EAD Tag Analysis was an important step to confirm what we previously understood about the content and character of EAD finding aids, it only scratched the surface of what’s left to learn.​ While OCLC’s qualitative research is still being carried out and its findings won’t be available until later in the project, we can pursue other quantitative research right now to learn more about the NAFAN finding aid aggregation.​

Here are some of the areas that we’re investigating:​

  • What is the linking potential of the NAFAN EAD finding aids?​
  • What is the completeness and consistency of the description of collections’ physical characteristics and genre?​
  • Are content element values associated with controlled vocabularies, or can they be?​
  • Is institutional contact information in EAD finding aids consistent and reliable? ​
  • How do EAD finding aids inform researchers about access to, use of, and reuse of materials in the described collections?​

There are many possible avenues for research, but we want to be truly informed by the focus groups and researcher interviews before investing additional effort.​

The first area of investigation noted here about finding aid links to digital content correlates with early findings from OCLC’s NAFAN pop-up survey which show that, for many users, only digitized materials would be of interest.   ​

Investigating the linking potential of the aggregated finding aids could help answer several questions, including:

  • What is the average number of external links per finding aid?
  • What EAD elements and attributes are most frequently used for external links?
  • What types of digital objects are linked?
  • How many relative URLs are present that rely on the finding aid being accessed within its local context?
  • What percentage of external links still resolve?
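Several of these questions lend themselves to simple scripting. As a minimal sketch (the sample fragment and matching logic below are hypothetical, and real NAFAN processing would be considerably more involved), a Python script can tally `href`-bearing attributes in an EAD file and separate absolute URLs from relative references:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

def count_ead_links(ead_xml: str):
    """Tally link attributes in an EAD finding aid, split into
    absolute URLs and relative references (which depend on being
    resolved within the finding aid's local context)."""
    root = ET.fromstring(ead_xml)
    absolute, relative = [], []
    for el in root.iter():
        for attr, value in el.attrib.items():
            # Match both plain 'href' and namespace-qualified '{...xlink}href'.
            if attr.split("}")[-1] == "href":
                target = absolute if urlparse(value).scheme else relative
                target.append(value)
    return {"absolute": absolute, "relative": relative}

# Toy finding-aid fragment (invented data, not from the aggregation):
sample = """<ead xmlns:xlink="http://www.w3.org/1999/xlink">
  <dao xlink:href="https://example.org/scans/box1.pdf"/>
  <dao xlink:href="images/item2.jpg"/>
</ead>"""
print(count_ead_links(sample))
```

Answering the link-rot question would then be a matter of issuing HTTP requests against the absolute URLs and recording the response codes.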

OCLC will be investigating these areas and publishing findings over the coming months. Please get in touch with us if you’d like to discuss this work in more detail.

The post How well does EAD tag usage support finding aid discovery? appeared first on Hanging Together.

Applications for the CoAct Open Calls on Gender Equality (July 1st, 2021- September 30th, 2021) are open! / Open Knowledge Foundation

CoAct is launching a call for proposals, inviting civil society initiatives to apply for our cascading grants of up to 20,000 Euro to conduct Citizen Social Science research on the topic of Gender Equality. A maximum of four (4) applicants will be selected across three (3) different open calls. Applications from a broad range of backgrounds are welcome, including feminist, LGBTQ+, non-binary, and critical masculinity perspectives.

Eligible organisations can apply until September 30th, 2021, 11:59 PM GMT. All information for submitting applications is available here:

If selected, CoAct will support your work by

  • providing funding for your project (10 months max), alongside dedicated activities, resources, and tools;
  • providing a research mentoring program for your team: in collaborative workshops you will be supported to co-design and explore available tools, working together with the CoAct team to achieve your goals;
  • connecting you to a community of people and initiatives tackling similar challenges and contributing to common aims. You will have the opportunity to discuss your project with the other grantees and, moreover, will be invited to join CoAct’s broader Citizen Social Science network.

You should apply if you:

  • are an ongoing Citizen Social Science project looking for support, financial and otherwise, to grow and become sustainable;
  • are a community interested in co-designing research to generate new knowledge about gender equality topics, broadly defined;
  • are a not-for-profit organization focused on community building, increasing the visibility of specific communities, or increasing civic participation, and are interested in exploring the use of Citizen Social Science in your work.

Read more about the Open Calls here:

AltAir / Ed Summers

The star Altair (Звезда Альтаир)

We use Airtable quite a bit at $work for building static websites. It provides a very well designed no-code or low-code environment for creating and maintaining databases. It has an easy to use, beautifully documented, API which makes it simple to use your data in many different settings, and also to update the data programmatically, if that’s needed.

Airtable is a product, which means there is polished documentation, videos, user support, and helpful people keeping the lights on. But Airtable also has a fiendishly inventive marketing and sales department that is quite artful at designing a pricing scheme around the features that will get you in the door, and the features that will act as pressure points to drive you to start paying them money … and then more money.

Of course, it’s important to pay for services you use on the web…it helps sustain them, which is good for everyone. But tying a fundamental part of your infrastructure to the whims of a company trying to maximize its profits sometimes has its downsides, which normally manifest over time. Wouldn’t it be nice to be able to pay for a no-code database service like Airtable that had more of a platform-cooperative mindset, where the thing being sustained was the software, the hardware and an open participatory organization for managing them? I think this approach has real value, especially in academia and other non-profit and activist organizations, where the focus is not endless growth and profits.

I’ve run across a couple open source alternatives to Airtable and thought I would just quickly note them down here for future self in case they are ever useful. Caveat lector: I haven’t actually tried either of them yet, so these are just general observations after quickly looking at their websites, documentation and their code repositories.


nocodb is a TypeScript/Vue web application that has been designed to provide an interface for an existing database such as MySQL, PostgreSQL, SQLite, or SQL Server. I suspect you can also use it to create new databases, but the fact that it can be used with multiple database backends distinguishes it from the next example. The idea is that you deploy nocodb on your own infrastructure (using Docker or installing it into a NodeJS environment). They also provide a one-click Heroku installer. It has token-based REST and GraphQL APIs for integration with other applications. All the code is covered by a GNU AGPL 3 license. It seems like nocodb is tailored for gradual introduction into an already existing database ecosystem, which is good for many people.
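Again, caveat lector: I haven't tried the API, so as a sketch only, integration with a token-based REST API like this tends to look something like the following Python fragment. The base URL, route, and header name here are placeholders for illustration, not nocodb's documented endpoints; check their API docs for the real routes.

```python
from urllib.request import Request

# Placeholders: substitute your deployment's base URL, your token,
# and the documented route for listing a table's rows.
API_BASE = "https://nocodb.example.org/api/v1"
TOKEN = "your-api-token"

def build_rows_request(table: str) -> Request:
    """Prepare (but do not send) a token-authenticated GET for a table."""
    req = Request(f"{API_BASE}/tables/{table}/rows")
    req.add_header("xc-token", TOKEN)  # header name is an assumption
    return req

req = build_rows_request("projects")
print(req.full_url)
```

The nice thing about this style of API is that the same pattern works from a static-site build step: fetch the rows at build time, render them into HTML, and the database never has to be exposed to readers.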


baserow is a Python/Django + Vue + PostgreSQL application that provides Airtable like functionality in a complete application stack. So unlike nocodb, baserow seems to want to manage the database entirely. While this might seem like a limitation at first I think it’s probably a good thing, since PostgreSQL is arguably the best open source relational database out there in terms of features, support, extensibility and scalability. The fact that nocodb supports so many database backends makes me worry that it might not take full advantage of each, and it may be more difficult to scale. Perhaps the nocodb folks see administering and tuning the database as an orthogonal problem to the one they are solving. Having an application that uses one open source database, and uses it well seems like a plus. But that assumes that there are easy ways to import existing data.

While the majority of the baserow code is open source with an MIT Expat license, they do have some code that is designated as Premium, with a separate Baserow Premium Edition License that requires you to get a key to deploy. It’s interesting that the premium code appears to be open in their GitLab repository, and that they are relying on people to do right by purchasing a key if they use it. Or I guess it’s possible that the runtime requires a valid key to be in place for premium features? Their pricing also includes a hosted version if you don’t want to deploy the application stack yourself, which is “free for now”, implying that it won’t be in the future, which makes sense. But it’s kind of strange to have to think about the hosting costs and the premium costs together. Having JSON and XML export be a premium feature seems a bit counter-intuitive, unless it’s meant to be a way to quickly extract money as people leave the platform.


Anyway these are some general quick notes. If I got anything wrong, or you know of other options in this area of open source, no-code databases please let me know. If I ever get around to trying either of these I’ll be sure to update this post.

To return to the earlier idea of a platform co-op that supported these kinds of services, I don’t think we see that idea present in either nocodb or baserow. It looks like baserow was started by Bram Wiepjes in the Netherlands in 2020, and that it is being set up as a profitable company. nocodb also appears to be a for-profit startup. What would it look like to structure these kinds of software development projects around co-operative governance? Another option is to deploy and sustain these open source technologies as part of a separate co-op, which is actually how I found out about baserow, through the Co-op Cloud Matrix chat. One downside to this approach is that all the benefits of having a participatory decision-making process accrue to the people who are running the infrastructure, and not to the people designing and making the software. Unless, of course, there is overlap in membership between the co-op and the software development effort.

Social interoperability: Getting to know all about you / HangingTogether

Photo by Mihai Surdu on Unsplash

Building and sustaining productive cross-campus partnerships in support of the university research enterprise is both necessary and fraught with challenges. Social interoperability – the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding – is the key to making these partnerships work. There are strategies and tactics for building social interoperability – can you use them to learn more about an important campus stakeholder and potential partner in research support services?

This was the challenge we posed to participants in the third and final session of the joint OCLC-LIBER online workshop Building Strategic Relationships to Advance Open Scholarship at your Institution, based on the findings of the recent OCLC Research report Social Interoperability in Research Support: Cross-Campus Partnerships and the University Research Enterprise. This three-part workshop brought together a dynamic group of international participants to examine the challenges of working across the institution, identify tools for cross-unit relationship building, and develop plans for increasing their own social interoperability.

In this post, we share insights and perspectives from the workshop’s final session: “Making your plan for developing cross-functional relationships at your institution.” In the first two sessions, we learned why social interoperability is important, and how it can be developed into a skill set we can use in our work. In the third session, we explored how, as library professionals, we can use our social interoperability skill set to reach out to other parts of the campus.

The Library as campus partner

The session began with a reminder of the important role the Library plays as a stakeholder and partner in the delivery of research support services, with recognized expertise in areas such as metadata, licensing/vendor negotiations, and bibliometrics/research impact. At the same time, the Library often occupies a distinct space in terms of the perspective it brings to its mission, such as a strong preference for “free” and “open” solutions. Given this unique blend of skills and values, the Library is often viewed as a trusted and “agnostic” partner on campus, as we heard from our interviewees for our Social Interoperability report.

But we also heard from our interviewees about several sources of frustration encountered when working with the Library. For example, some campus stakeholders described how, in their experience, the Library did not focus enough on the “bottom line”, or moved too slowly in comparison to the needs and workflows of researchers. During the session, we conducted a quick poll of the workshop participants, asking them to put themselves in the role of a different campus stakeholder and consider how that unit would describe the Library in one word or phrase. The results were revealing: while the top responses included “supportive”, “helpful”, “competent”, and “expert”, “slow” was also a frequent choice, and other responses included “friendly but not fully relevant”, “opaque”, and “reactive not proactive”.

Social interoperability: One word or phrase to describe the library

The important takeaway was that in building cross-campus relationships, library professionals need to take into account, and in some cases, shift, how the Library is perceived by potential collaborative partners, rather than relying on self-perceptions. While unflattering characterizations of the Library may be based on misinformation, unfamiliarity, or differing priorities, they can still impede the development of productive working relationships (see the Social Interoperability report for more on the Library as a campus partner).

Learn about your partners

Our breakout discussions were motivated by an enormously valuable resource shared by colleagues at Rutgers University-New Brunswick Libraries. Developed as part of a strategic planning initiative, the resource is a questionnaire that the Library used to structure conversations with their stakeholders across campus. The purpose of the questionnaire was to learn about the Library’s stakeholders: their goals, their challenges, their needs. Consequently, almost all of the questions focused on the stakeholder’s priorities and interests; it is only toward the end of the questionnaire that library services are mentioned. This helped elicit the context that the Library needed to align its work with the needs of stakeholders (see the Social Interoperability report for the full questionnaire).

The first discussion placed our participants in the role of discovering information about a hypothetical campus partner, using the Rutgers questionnaire as a guide. This discussion elicited some good advice on how to approach prospective campus partners and learn about them. For example, in considering how to break the ice and learn about colleagues in campus IT, several participants suggested initiating a discussion of general IT-related topics, such as network security or new technologies. Another participant suggested that a careful review of unit web sites would help in gathering information about prospective campus partners. While the pandemic has certainly introduced challenges in connecting to colleagues around campus, it has also sometimes made it easier: as one person pointed out, more people are attending inter-departmental meetings because it is easier to join a Zoom meeting than to be physically present at a particular location on campus.

Discovering information about other units on campus is not without challenges, as many of our workshop participants shared in the discussions. For example, several participants reported that it was sometimes difficult to pinpoint who to connect with, or even to identify the hierarchy of the unit. While direct, interpersonal contact usually helped in forming relationships, campus units with high staff turnover – such as the IT unit – made this problematic. And sometimes the information you discover about another unit’s responsibilities and needs may make forming successful partnerships even more daunting – for example, when it is clear that the other unit’s priorities do not easily mesh with those of the Library. By way of illustration, some participants observed that university Communications teams often seem remote from the Library, and do not adequately relay information about the Library and its activities. But as one person noted, it is a challenge for them to communicate everything, and currently, outreach to students and COVID-related information are understandably their priorities.

But participants also told many stories of how discovering more about their colleagues around campus helped create actionable opportunities for the Library. Often, it was as simple as discovering how Library skills and capacities matched up to the needs of other units. For example, one participant described how units in the area of Academic Affairs valued input from the Library in regard to accreditation processes. Others talked about the fact that mandates aimed at promoting open science have started to have more “teeth”, with compliance receiving greater scrutiny. This creates a stronger demand for Library expertise in areas such as data management plans. Several participants mentioned that they have learned that rising interest in inter-disciplinary, “Grand Challenge” projects has created a need for project support capacities that the Library can provide. And hearing from other campus units and departments about their need to better document productivity and output creates opportunities for Library staff to initiate conversations about tools such as ORCID that help advance that goal.

What do your campus partners know about you?

In one of the breakout discussions, a participant observed that they did not know very much about what their university Communications team did. But, the participant continued, that probably means the Communications staff did not know much about what the Library does, either!

In the second set of breakout discussions, participants once again utilized the Rutgers script as a frame as they considered what other campus units might say about how the Library and its services contribute toward their work. One participant noted that the script questions were particularly useful for bringing out misconceptions about what the library can and cannot do. In the course of the discussions, several themes emerged from participants’ experiences of how the Library is perceived across campus.

First, it was clear that campus stakeholders often do not have a clear picture of the expertise and capacities of today’s academic library. Participants noted, for example, that stakeholders often are unaware that the Library can help with every aspect of the research cycle. Much more effort is needed to raise awareness about the Library’s role in the university research enterprise. One participant observed that at their university, the Library provides good data management support, but it took a lot of work to make researchers see this expertise. Another person related a similar experience, remarking that Library participation in cross-campus projects requires a lot of energy and communication – not least because many campus stakeholders are not fully aware of what the Library can do. Or, as one participant pointed out, stakeholders may utilize Library services without realizing it is the Library that is providing them.

Another theme touched on the need to establish a clear boundary around Library services within the broader university service eco-system. One participant remarked that the Library provides many services, but some of them are also offered by other campus units. For example, if a researcher needs data storage services, should they go to campus IT, or to the Library? One participant described a circular process whereby the Library receives a technical question and passes it to the IT unit, which then passes it back to the Library for resolution. Participants also noted that as the Library takes on new, emerging roles beyond its traditional functions, there is a tendency to “step on toes” and awaken territorial instincts in other campus units. But some participants also pointed out synergies that could be leveraged. A good example is that both the Library and the IT unit face budgetary challenges. When requests are received for support that neither unit can provide, but for which there is a clear need, they can collaborate to build a case for additional resources to address the gap.

Finally, workshop participants noted a shared need to elevate the perception of the Library’s capabilities across campus. Participants shared examples they have encountered of outdated or even indifferent perceptions of the Library and its services: “important but not essential to Research Office day-to-day business”; “the Office of the Vice Provost would not think of many areas that the Library supports in the university research enterprise”; “the Library buys the books”; “seen as useful, but difficult to get them to see things the Library should take the lead on”; “not seen as thought leaders or a source for answers, but as service providers.” Several participants cautioned against the Library being seen strictly in an administrative support role in cross-unit initiatives; one person observed that the Library’s responsibility to manage article processing charges (APCs) reinforces a perception as “book keeper” or “note-taker”. Library staff are often included in projects only after funding is received, rather than being included as a partner as the project is being developed.

How to counteract these perceptions? Participants emphasized the need for a “negotiation” process to ease the tension between what is expected from libraries and what libraries can offer. In short, libraries must learn to say “No” when necessary. Other campus units often expect a great deal from libraries, and library staff must strike a difficult balance between doing as much as possible to advance the interests of other units while at the same time preserving clear goals and advocating for Library-related priorities. As one person noted, “there is SO MUCH education to be done” to dispel the notion that libraries are useful only for administrative support. Libraries must break down and re-build these expectations. To do this, library staff need to be more proactive, rather than reactive, in their cross-campus partnerships. More openness across units is also needed, and libraries can set a good example in promoting transparency. And because, as one participant put it, “our services are not always top of mind”, library staff should work with the university Communications team, as well as influential faculty and administrators, “to get our message across.”

How do you feel about cross-unit partnerships now?

We concluded the workshop by asking participants to select one word to describe their current feelings about the prospect for cross-campus partnerships at their institution, in light of what they learned over the three sessions. We were gratified to see that the top response was “optimistic”! And indeed, with careful attention to the importance and need for social interoperability, and the techniques and practices we discussed to build it in the campus environment, library staff can be optimistic that their campus partnerships will be successful, and that the full value proposition of the Library will be better understood and utilized across the university research enterprise.

Social interoperability: How do you feel about cross-campus collaboration

Special thanks to all of our workshop participants for sharing their insights through lively and enlightening discussions, and to our colleagues at LIBER for working with us to make the workshop a success (a great example of social interoperability in action!)

The post Social interoperability: Getting to know all about you appeared first on Hanging Together.

Yet Another DNA Storage Technique / David Rosenthal

An alternative approach to nucleic acid memory by George D. Dickinson et al. from Boise State University describes a fundamentally different way to store and retrieve data using DNA strands as the medium. Will Hughes et al. have an accessible summary in DNA ‘Lite-Brite’ is a promising way to archive data for decades or longer:
We and our colleagues have developed a way to store data using pegs and pegboards made out of DNA and retrieving the data with a microscope – a molecular version of the Lite-Brite toy. Our prototype stores information in patterns using DNA strands spaced about 10 nanometers apart.
Below the fold I look at the details of the technique they call digital Nucleic Acid Memory (dNAM).

The traditional way to use DNA as a storage medium is to encode the data in the sequence of bases in a synthesized strand, then use sequencing to retrieve the data. Instead:
dNAM uses advancements in super-resolution microscopy (SRM)15 to access digital data stored in short oligonucleotide strands that are held together for imaging using DNA origami. In dNAM, non-volatile information is digitally encoded into specific combinations of single-stranded DNA, commonly known as staple strands, that can form DNA origami nanostructures when combined with a scaffold strand. When formed into origami, the staple strands are arranged at addressable locations ... that define an indexed matrix of digital information. This site-specific localization of digital information is enabled by designing staple strands with nucleotides that extend from the origami.


In dNAM, writing the 20-character message "Data is in our DNA!\n" involved encoding it into fifteen 16-bit fountain-code droplets, then synthesizing two different types of DNA sequences:
  • Origami: There is one origami for each 16 bits of data to be stored. It forms a 6x8 matrix holding a 4-bit index, the 16 bits of droplet data, 20 bits of parity, 4 bits of checksum, and 4 orientation bits. Each of the 48 cells thus contains a unique, message-specific DNA sequence.
  • Staples: There is one staple for each of the 15x48 matrix cells, with one end of the strand matching the matrix cell's sequence, and the other indicating a 0 or a 1 by the presence or absence of a sequence that binds to the fluorescent DNA used for reading.
When combined, the staple strands bind to the appropriate cells in the origami, labelling each cell as a 0 or a 1.
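To make the bit bookkeeping concrete, here is a toy Python sketch that packs one droplet's fields into a 48-bit string and reshapes it into the 6x8 matrix. The field order and layout are my own illustrative guesses for counting purposes, not the actual bit map used in the paper:

```python
def pack_origami(index: int, droplet: int, parity: int,
                 checksum: int, orientation: int):
    """Toy packing of one origami's 48 bits into a 6x8 matrix:
    4-bit index + 16 data bits + 20 parity bits + 4 checksum bits
    + 4 orientation bits. Field order here is illustrative only."""
    bits = (
        f"{index:04b}" + f"{droplet:016b}" + f"{parity:020b}"
        + f"{checksum:04b}" + f"{orientation:04b}"
    )
    assert len(bits) == 48  # 4 + 16 + 20 + 4 + 4
    # Reshape the 48-bit string into 6 rows of 8 cells.
    return [bits[r * 8:(r + 1) * 8] for r in range(6)]

grid = pack_origami(index=3, droplet=0xBEEF, parity=0,
                    checksum=0xA, orientation=0b1111)
for row in grid:
    print(row)
```

The arithmetic makes the overhead visible: only 16 of the 48 bits per origami carry droplet data; the rest buy addressability, error detection, and orientation.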


The key difference between dNAM and traditional DNA storage techniques is that dNAM reads data without sequencing the DNA. Instead, it uses optical microscopy to identify each "peg" (staple strand) in each matrix cell as either a 0 or a 1:
The patterns of DNA strands – the pegs – light up when fluorescently labeled DNA bind to them. Because the fluorescent strands are short, they rapidly bind and unbind. This causes them to blink, making it easier to separate one peg from another and read the stored information.
The difficulty in doing so is that the pegs are on a 10 nanometer grid:
Because the DNA pegs are positioned closer than half the wavelength of visible light, we used super-resolution microscopy, which circumvents the diffraction limit of light.
The technique is called "DNA-Points Accumulation for Imaging in Nanoscale Topography (DNA-PAINT)". The process to recover the 20-character message was:
40,000 frames from a single field of view were recorded using DNA-PAINT (~4500 origami identified in 2982 µm2). The super-resolution images of the hybridized imager strands were then reconstructed from blinking events identified in the recording to map the positions of the data domains on each origami ... Using a custom localization processing algorithm, the signals were translated to a 6 × 8 grid and converted back to a 48-bit binary string — which was passed to the decoding algorithm for error correction, droplet recovery, and message reconstruction ... The process enabled successful recovery of the dNAM encoded message from a single super-resolution recording.
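The grid-translation step can be caricatured in a few lines of Python: snap each localization to its nearest 10 nm cell and mark the cell as a 1. This is a toy, of course; the paper's custom localization-processing algorithm also handles drift correction, clustering, and intensity thresholds:

```python
def localizations_to_bits(points, pitch=10.0, rows=6, cols=8):
    """Toy grid translation: a cell reads 1 if any blink event
    localized inside it, 0 otherwise. Coordinates in nanometers."""
    grid = [[0] * cols for _ in range(rows)]
    for x, y in points:
        r, c = int(y // pitch), int(x // pitch)
        if 0 <= r < rows and 0 <= c < cols:
            grid[r][c] = 1
    return grid

# Two invented localizations, in nanometers:
grid = localizations_to_bits([(5.0, 5.0), (25.0, 15.0)])
for row in grid:
    print(row)
```

The resulting 48 cells, read in order, form the binary string that is handed to the decoding algorithm for error correction and droplet recovery.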


The first thing to note is that whereas traditional DNA storage techniques are volumetric, dNAM like hard disk or tape is areal. It will therefore be unable to match the extraordinary data density potentially achievable using the traditional approach. dNAM claims:
After accounting for the bits used by the algorithms, our prototype was able to read data at a density of 330 gigabits per square centimeter.
Current hard disks have an areal density of 1.3 Tbit/inch2, or about 200 Gbit/cm2, so for a prototype this is good but not revolutionary. The areal density is set by the 10nm grid spacing, so it may not be possible to greatly reduce it. Hard disk vendors have demonstrated 400 Gbit/cm2 and have roadmaps to around 800 Gbit/cm2.
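For the record, the unit conversion between the two densities quoted above is straightforward:

```python
# 1 inch = 2.54 cm, so 1 square inch = 2.54**2 = 6.4516 square cm.
CM2_PER_IN2 = 2.54 ** 2

def tbit_per_in2_to_gbit_per_cm2(density_tbit_in2: float) -> float:
    """Convert areal density from Tbit per square inch to Gbit per
    square centimeter (1 Tbit = 1000 Gbit)."""
    return density_tbit_in2 * 1000 / CM2_PER_IN2

# 1.3 Tbit/inch2 works out to roughly 200 Gbit/cm2, as quoted above.
print(tbit_per_in2_to_gbit_per_cm2(1.3))
```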

dNAM's writing process seems more complex than the traditional approach, so is unlikely to be faster or cheaper. The read process is likely to be both faster and cheaper, because DNA-PAINT images a large number of origami in parallel, whereas sequencing is sequential (duh!). But, as I have written, the big barrier to adoption of DNA storage is the low bandwidth and high cost of writing the data.

Searching CORD-19 at the Distant Reader / Eric Lease Morgan

This blog posting documents the query syntax for an index of scientific journal articles called CORD-19.

flowerCORD-19 is a data set of scientific journal articles on the topic of COVID-19. As of this writing, it includes more than 750,000 items. This data set has been harvested, pre-processed, indexed, and made available as a part of the Distant Reader. Access to the index is freely available to anybody and everybody.

The index is rooted in a technology called Solr, a very popular indexing tool. The index supports simple searching, phrase searching, wildcard searches, fielded searching, Boolean logic, and nested queries. Each of these techniques is described below:

  • simple searches – Enter any words you desire, and you will most likely get results. In this regard, it is difficult to break the search engine.
  • phrase searches – Enclose query terms in double-quote marks to search the query as a phrase. Examples include: "waste water", "circulating disease", and "acute respiratory syndrome".
  • wildcard searches – Append an asterisk (*) to any non-phrase query to perform a stemming operation on the given query. For example, the query virus* will return results including the words virus and viruses.
  • fielded searches – The index has many different fields. The most important include: authors, title, year, journal, abstract, and keywords. To limit a query to a specific field, prefix the query with the name of the field and a colon (:). Examples include: title:disease, abstract:"cardiovascular disease", or year:2020. Of special note is the keywords field. Keywords are sets of statistically significant and computer-selected terms akin to traditional library subject headings. The use of the keywords field is a very efficient way to create a small set of very relevant articles. Examples include: keywords:mrna, keywords:ribosome, or keywords:China.
  • Boolean logic – Queries can be combined with three Boolean operators: 1) AND, 2) OR, or 3) NOT. The use of AND creates the intersection of two queries. The use of OR creates the union of two queries. The use of NOT creates the negation of the second query. The Boolean operators are case-sensitive. Examples include: covid AND title:SARS, abstract:cat* OR abstract:dog*, and abstract:cat* NOT abstract:dog*
  • nested queries – Boolean logic queries can be nested to return more sophisticated sets of articles; nesting allows you to override the way rudimentary Boolean operations get combined. Use matching parentheses (()) to create nested queries. An example includes ((covid AND title:SARS) OR abstract:cat* OR abstract:dog*) NOT year:2020. Of all the different types of queries, nested queries will probably give you the most grief.
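All of these query styles travel to Solr as the q parameter of a select request, so they are easy to script. Here is a minimal Python sketch; the base URL below is a placeholder, not the Distant Reader's actual endpoint:

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute the real Solr URL for the CORD-19 index.
SOLR_SELECT = "https://example.org/solr/cord/select"

def build_query_url(q: str, rows: int = 10) -> str:
    """Compose a Solr select URL. The q value may use any of the
    styles above: simple, phrase, wildcard, fielded, Boolean, or
    nested queries; urlencode handles the URL-escaping."""
    return SOLR_SELECT + "?" + urlencode({"q": q, "rows": rows, "wt": "json"})

# A fielded Boolean query, as in the examples above:
print(build_query_url("keywords:mrna AND year:2020"))
```

Fetching that URL returns a JSON envelope whose response.docs array holds the matching articles.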

Strategic and effective storytelling with data / Tara Robertson

Vertical bar graph with three bars: women in tech globally (11.6%), women in leadership roles, and underrepresented minorities (8.3%)

I was delighted to speak at Data Science by Design’s Creator Conference. DSxD is a community of researchers, educators, artists, and computer scientists whom the conference organizers described as curious, dynamic, creative, and interdisciplinary. I chose to talk about the challenge of communicating about diversity metrics in a way that informs and inspires your audience to want to push for change, using Mozilla’s last external diversity disclosure as an example.

Here are 3 important things to keep in mind when storytelling using data:

  1. Understand who your audience is
  2. Share the context of the data with your audience
  3. Think about how you want your audience to feel  (and what you want them to do) after seeing the data 


As this was an external disclosure it is obvious that one audience was people outside Mozilla. Transparency about DEI metrics is table stakes for companies that say they care about these things. It’s also an important part of a company’s employer brand, especially for younger workers. When I’ve been interested in a job, I’m checking to see if companies are sharing their diversity metrics, what photos they use to illustrate who works there, what diversity in senior leadership looks like and how they’re telling the story of what their culture is like. When I see stories of free beer Fridays, ping pong tables and “a work hard play hard” culture, I am much less interested in applying. I don’t drink alcohol, I don’t like ping pong, and I value work-life balance.

Equally important to me was the internal audience at Mozilla. We had done a lot of internal systems work, including redesigning our hiring process, removing meritocracy from our governance and leadership structures, and improving accessibility internally by live captioning all company meetings and being more intentional about accessibility at events. I wanted people to see that all of these projects laddered up to measurable change. I wanted them to see progress and feel proud of all of our hard work.


A few years ago this is something I would have said: 

By the end of 2019, representation of women in technical roles was 21.6%.

This would prompt many questions from people, including:

  • Is this any good? 
  • How do we compare to other companies?
  • How do we compare to the labor market?
  • What does our pipeline look like?
  • Are we getting better or worse?
  • What do you mean by technical role? Is a data analyst a technical role? What about a data scientist? 

Context is so important! 

I decided to add an explainer video to the disclosure to help people understand the context. In addition, the video starts with the big picture context on why we were invested in D&I at Mozilla: 

  • from the individual experience of feeling like you belong
  • to being directly connected to the mission “to make the internet open and accessible for all”
  • as well as the business case on innovation and performance.


Taking a data-driven approach is necessary in an engineering organization–most people want to see and understand the numbers. For the last diversity disclosure I prepared, I saw the opportunity to try and tell a story that could connect to people’s heads and hearts. 


A rainbow-colored, watercolor "emotion wheel"-- a circle divided into six pie-like wedges. Each wedge has a core emotion written in the center and related emotions written in a middle and outer layer. The top wedge is yellow with "joy" at the center. Moving clockwise, the next wedge is orange with "genius" at the center, then red with "anger" at the center; the bottom wedge is blue with "sad" at the center, then purple with "fear" at the center, and finally green with "disgust" at the center.

I wanted them to feel something, whether that was pride at the progress we’d made or frustration that we weren’t making change quickly enough. My ideal outcome was to pique people’s curiosity and have them ask “what can I do to make Mozilla a diverse and inclusive place?”. My worst-case scenario was that people would hear this update and think “meh, whatever”.

Here’s the 3 minute video:

Drew Merit is the illustrator who brought this idea to life.

I’d love examples from your work, or examples that you’ve seen out in the wild where people have used data to tell a story that inspires the audience to take action.

The post Strategic and effective storytelling with data appeared first on Tara Robertson Consulting.

Untitled / Ed Summers

Several years ago someone in our neighborhood was moving out of the area and was giving away their old piano. We figured out a way to get it to our house, and it has sat (still untuned) in a corner of our living room ever since.


The kids have sporadically taken piano lessons, and the various teachers who have heard it have all been polite not to comment on its current state. Mostly it just sits there, part table, part decoration, until someone stops by to play a little tune.

It’s interesting to hear the kids develop little signature riffs that they play while walking by the piano in the morning, or sometimes before going to bed. Here’s a short untitled little tune that Maeve regularly plays, almost like a little prayer, or memory:

Reimagine Descriptive Workflows: meeting the challenges of inclusive description in shared infrastructure / HangingTogether

In a previous blog post, I told you about our Reimagine Descriptive Workflows project, and the path we took to get there. In that post, I shared the three objectives we have in this project.

  • Convene a conversation of community stakeholders about how to address the systemic issues of bias and racial inequity within our current collection description infrastructure.
  • Share with libraries the need to build more inclusive and equitable library collections.
  • Develop a community agenda to help clarify issues for those who do knowledge work in libraries, archives, and museums; prioritize areas for attention from these institutions; and provide guidance for national agencies and suppliers.

Coming together

In this post, I’m going to fill you in on the Reimagine Descriptive Workflows convening we held in June. Our virtual meeting took place June 22–24 in North America (June 23–25 in Australia and New Zealand). Fifty-nine people from the US, Canada, Australia, and New Zealand attended the meeting, which was designed and co-facilitated by Shift Collective.

Prior to the convening, the project team met twice with the advisory group, who helped shape the following goals for the event:

  • Create a safe space to share and connect honestly as humans
  • Lay the foundations for relationship building and repair
  • Build a basis for reciprocal relationships between communities and centers of power
  • Inspire radical thinking to rebuild a more just metadata infrastructure
  • Start building a concrete roadmap for change in the sector and keep the conversation going

The project team identified potential participants through a consultative process and via self-nomination. We prioritized attendance for those who had demonstrated leadership working in the area of “just descriptions.” We also prioritized the attendance of BIPOC colleagues, as well as others with lived experience as members of underrepresented groups. All participants were offered a stipend to acknowledge and partially compensate them for the valuable time, labor, and expertise they would bring to the event.

The Reimagine Descriptive Workflows meeting was held, as so many are in these times, via Zoom. Recognizing that virtual meeting fatigue is real (and that our group included participants from the middle of Australia all the way to the east coast of North America), we met between two and three hours each day. For meeting organizers this presented a challenge: how to structure the time together so that people could connect with one another as humans and build trust, but also produce concrete outputs that would help move the conversation forward.

In order to help foster connections and build community, Asante Salaam (who I think of as the Shift team’s Minister of Culture!) helped to create a unique set of “cultural infusions” for participants, bringing in artists, musicians, a chef, and a poet and fostering conversations to give us an encounter with local flavor and culture from just some of the communities we were connecting with. It was not the same as being able to share a meal or gallery walk with others, but, for me, her efforts created a communal experience that supported the opportunity to connect with others outside of the official convening agenda.

To help establish the space we would share for three days, the meeting hosts put forward the following Agreements.

  • Share the space, step forward/step back
  • Listen and share bravely
  • Listen for understanding
  • Sense and speak your feelings
  • Use “I” statements
  • Discomfort is not the same as harm
  • No Alphabet Soup (don’t use acronyms and insider language without explaining it)
  • Be kind to yourself and others
  • Take care of your needs

Although we were together as the full group at the beginning and end of each day, and for our “cultural infusions,” most time was spent in smaller groups of five to six people, each supported by a guide. The guides took notes on behalf of the group and offered timekeeping support and gentle moderation when needed. Notes were kept in Miro (an ever-expanding online whiteboard and collaboration space). Here is an example of one discussion group’s Miro board at the conclusion of the convening. Thanks to Shift team member Tayo Medupin, who put a lot of effort and artistic touches into the design of these boards – it made me feel like I was in a distinctive space as opposed to a characterless virtual room.

Screen capture: notes from one discussion group

Composting, weeding, and seeding

The topic for the first day was “Composting: What is driving us forward?” During this day we worked through prompts such as… Why is taking this journey together meaningful? Where is our abundance, and what assets should we bring with us? What stories and experiences of positive change will we build upon? What must be acknowledged?

On day two, the topic was “Weeding: What is holding us back?” Here, participants were given time to draw a map or diagram of what a just, anti-racist and equitable descriptive workflow would look like. Prompts included calling out systemic, technical or procedural, social, cultural or personal blockers that might exist to implementing that workflow.

Between the second and third days, Tayo Medupin, together with other members of the Shift team, worked her Miro magic, collecting notes from the various discussion groups and mapping them to eleven Design Challenges. On the final day of the convening the small groups explored the topic “Seeding: Opportunities for change?” and adopted a Design Challenge, using the time to explore questions related to the challenge. Because of the limited amount of time we had to spend together, the small groups were only able to dig into a few of these.

Reimagine Descriptive Workflows design challenges

[Note: These are in draft form and have received little review. We are sharing to give a sense of meeting outcomes.]

Screen capture: notes from discussions mapped to design challenges


Insight: We’re trying to catalog and describe a world which is dynamic, fluid, complex and evolving over time in a cataloging culture that rewards the singular, definitive, and static.

Opportunity: How might we create the conditions for / support a move towards a cataloging culture that embraces the long-term view, valuing and rewarding evolution, deepening, enrichment and progress over the concept of ‘complete’?


Insight: We’re trying to slow down and involve communities in our workflows in equitable ways within a cataloging culture that pushes us to speed up and to spend and value time / resource in ways that can be at odds with slowing down and equitable collaboration.

Opportunity: How might we create the conditions for / support a move towards a cataloging culture that demonstrably values community engagement by making it accepted and even expected to slow down and invest our time and money in this way?


Insight: We have been and will be trying to create just metadata description across multiple generations. We are currently riding a wave of socio-political interest and prioritization that may or may not last.

Opportunity: How might we create the conditions for / support the foundations for a resilient (anti-fragile) system of actors and activity pushing towards just metadata description that will be able to survive the generation to come?


Insight: We are trying to redress hundreds of years of white supremacist colonial describing at scale in a system that is judged and valued on the legacy descriptions and language we can still see right now.

Opportunity: How might we create the conditions for / support a move towards a mutuality of understanding about where we are in the journey and what road is left ahead?


Insight: We are trying to work towards a just, equitable, anti-racist, anti-oppressive approach, but are we working within a common understanding of what this means and should/could look like in the sector?

Opportunity: How might we create the conditions for / support the creation of shared visions and definitions of ‘good’ held by those working towards just description?


Insight: We’re often trying to make changes within organizational structures and cultures that can feel resistant or challenging to change.

Opportunity: How might we create the conditions for / support individuals, teams and collectives to help shape and reshape the cultures of our core institutions to ready them for this long and hard period of change?


Insight: We’re trying to change a huge legacy system often in our silos, in isolation, experiencing scarcity and without the clout of a network of others also making strides in the fight.

Opportunity: How might we create the conditions / support the growth of a thriving and resilient network of people, groups and organizations sharing the energy, bravery, resource, ideas, information and rest needed for the sector to transform?


Insight: We’re trying to change a huge legacy system in our own ways, but many of us in our work, teams, institutions and sector do not feel we have the power and agency to make the necessary change.

Opportunity: How might we create the conditions / support the growth of a sector where everyone feels the power and agency to drive forward the necessary change?


Insight: There are pockets of the future in the present in smaller institutions and in individuals who are pioneering just, anti-oppressive approaches, but they are often hampered by scale, visibility, recognition and reward.

Opportunity: How might we create the conditions / support the growth and progress of our system liberators to help them to create and scale the changes and cultures we need to transform us?


Insight: We’re trying to create just, equitable, anti-racist and anti-oppressive descriptions within a structure and worldview of describing which is conceptually unjust, inequitable, racist and oppressive.

Opportunity: How might we create the conditions for / support a radical rethink of the very concept of cataloging and metadata description, to lay the foundations for an approach that will better serve us for the next 200 years?


Insight: We’re trying to create just metadata description in a culture that doesn’t currently prioritize, demand, embrace or leave space for external feedback.

Opportunity: How might we create the conditions for / support a move towards a cataloging culture that demands, prioritizes, and creates room for external / community feedback?

Thanks and gratitude

An important output of the Reimagine Descriptive Workflows project was the construction of a novel online convening that helped to support a brave space for productive and honest conversations about the challenges and solutions around inclusive and anti-racist description. This convening was the mudsill, setting the stage for everything to come. For seven hours of meeting time, many, many more hours went into the planning.

First and foremost, we want to thank our advisory group, which has really been at the heart of this project. We are grateful to this amazing group, which brings not only substantial professional perspectives but also network connections and lived experiences in this space. This group devoted heart and dedication, doing this work on top of their very busy professional and personal lives: Stacy Allison-Cassin, Jennifer Baxmeyer, Dorothy Berry, Kimberley Bugg, Camille Callison, Lillian Chavez, Trevor A. Dawes, Jarret Martin Drake, Bergis Jules, Cellia Joe-Olsen, Katrina Tamaira, Damien Webb.

We were gratified that nearly every person who was invited to the meeting not only accepted our invitation but came to the meeting and shared experiences and ideas. Convening attendees added so much by contributing, preparing, and being present. They made this so much more than another Zoom meeting. Audrey Altman, Jill Annitto, Heidy Berthoud, Kelly Bolding, Stephanie Bredbenner, Itza Carbajal, May Chan, Alissa Cherry, Sarah Dupont, Maria Estorino, Sharon Farnel, Lisa Gavell, Marti Heyman, Jay Holloway, Jasmine Jones, Michelle Light, Sharon Leon, Koa Luke, Christina Manzella, Mark Matienzo, Rachel Merrick, Shaneé Yvette Murrain, Lea Osborne, Ashwinee Pendharkar, Treshani Perera, Nathan Putnam, Keila Zayas Ruiz, Holly Smith, Gina Solares, Michael Stewart, Katrina Tamaira, Diane Vizine-Goetz, Bri Watson, Beacher Wiggins, and Pamela Wright.

Many thanks also to the team at Shift Collective that helped to design and facilitate the meetings: Gerry Himmelreich, Jennifer Himmelreich, Lynette Johnson, Tayo Medupin, Asante Salaam, and Jon Voss. An OCLC team also contributed to the planning and implementation: Rachel Frick, Bettina Huhn, Nancy Lensenmayer, Mercy Procaccini, Merrilee Proffitt, and Chela Scott Weber.

Finally, a big thank you to the Andrew W. Mellon Foundation for co-investing alongside OCLC. This seed funding made this convening possible.

Next steps

The eleven Design Challenges barely scratch the surface of everything that was covered at the meeting. The project team still has hours of transcripts and other meeting outputs to dig through. We’ll be using those outputs to construct a draft Community Agenda, as we promised at the outset of this project. We will make that draft available for broad community comment before publishing. We will also be using the Community Agenda to structure conversations with library leaders and other stakeholders – we believe it is important in socializing this work to get a sense of how those with power and access to purse strings see their role in implementing this work. And, of course we will be doing work internal to OCLC to consider our own role in the vision that this community has created.

As we consider our next steps, we are taking seriously our responsibilities as stewards of this conversation. Although preliminary feedback from the meeting was overwhelmingly positive, many attendees expressed a yearning to be able to connect, or continue to connect and learn from one another. We are considering how best to nurture that seed so that it can grow.

Thanks to Marti Heyman, Andrew Pace, Mercy Procaccini, and Chela Weber who reviewed and improved this blog post.

The post Reimagine Descriptive Workflows: meeting the challenges of inclusive description in shared infrastructure appeared first on Hanging Together.

Dismantling the Evaluation Framework / In the Library, With the Lead Pipe

(Atharva Tulsi, Unsplash,

By Alaina C. Bull, Margy MacMillan, and Alison J. Head

In brief

For almost 20 years, instruction librarians have relied on variations of two models, the CRAAP Test and SIFT, to teach students how to evaluate printed and web-based materials. Dramatic changes to the information ecosystem, however, present new challenges amid a flood of misinformation where algorithms lie beneath the surface of popular and library platforms collecting clicks and shaping content. When applied to increasingly connected networks, these existing evaluation heuristics have limited value. Drawing on our combined experience at community colleges and universities in the U.S. and Canada, and with Project Information Literacy (PIL), a national research institute studying college students’ information practices for the past decade, this paper presents a new evaluative approach for teaching students to see information as the agent, rather than themselves. Opportunities and strategies are identified for evaluating the veracity of sources, first as students, leveraging the expertise they bring with them into the classroom, and then as lifelong learners in search of information they can trust and rely on.

1. Introduction

Arriving at deeply considered answers to important questions is an increasingly difficult task. It often requires time, effort, discernment, and a willingness to dig below the surface of Google-ready answers. Careful investigation of content is needed more than ever in a world where information is in limitless supply but often tainted by misinformation, while insidious algorithms track and shape content that users see on their screens. Teaching college students evaluative strategies essential for academic success and in their daily lives is one of the greatest challenges of information literacy instruction today.

In the last decade, information evaluation — the ability to ferret out the reliability, validity, or accuracy of sources — has changed substantively in both teaching practice and meaning. The halcyon days of teaching students the CRAAP Test1, a handy checklist for determining the credibility of digital resources, are over2; and, in many cases, SIFT3, another reputation heuristic, is now in use on numerous campuses. At the same time, evaluative strategies have become more nuanced and complex as librarians continue to debate how to best teach these critically important skills in changing times.4 

In this article, we introduce the idea of proactivity as an approach that instruction librarians can use for re-imagining evaluation. We explore new ways of encouraging students to question how information works, how information finds them, and how they can draw on their own strengths and experiences to develop skills for determining credibility, usefulness, and trust of sources in response to an information ecosystem rife with deception and misinformation. Ultimately, we discuss how a proactive approach empowers students to become experts in their own right as they search for reliable information they can trust.

2. A short history of two models for teaching evaluation

Mention “information literacy instruction” and most academic librarians and faculty think of evaluation frameworks or heuristics that have been used and adapted for nearly two decades. The most widely known are the CRAAP method, and more recently, SIFT, both designed to determine the validity and reliability of claims and sources.

CRAAP debuted in 2004,5 when several academic librarians developed an easy-to-use assessment framework for helping students and instructors evaluate information for academic papers. CRAAP, a catchy acronym for Currency, Relevancy, Accuracy, Authority, Purpose, walks students through the criteria for assessing found content. For librarians, this approach to evaluation is a manifestation of the Information Literacy Competency Standards for Higher Education developed by the ACRL, and especially an outcome of Standard 3.2: “Examines and compares information from various sources in order to evaluate reliability, validity, accuracy, authority, timeliness, and point of view or bias.”6

When the CRAAP method was first deployed nearly 20 years ago, the world was still making the transition from Web 1.0 to Web 2.0. Most online content was meant to be consumed, not interacted with, altered, changed, and shared. CRAAP was developed in a time when you found information, before the dramatic shift to information finding you. As monolithic players like Google and Facebook began using tracking software on their platforms in 2008 and selling access to this information in 2012, web evaluation became a very different process. In a role reversal, media and retail platforms, such as Amazon, had begun to evaluate their users to determine what information they should receive, rather than users evaluating what information they found.

Since 2015, criticism has mounted about the CRAAP test, despite its continued and widespread use on campuses. Checklists like CRAAP are meant to reduce cognitive overload, but they can actually increase it, leading students to make poor decisions about the credibility of sources, especially in densely interconnected networks.7 As one critic has summed it up: “CRAAP isn’t about critical thinking – it’s about oversimplified binaries.”8 We agree: CRAAP was designed for a fairly narrow range of situations, where students might have little background knowledge to assist in judging claims and often had to apply constraints of format, date, or other instructor-imposed requirements; but these bore little resemblance to everyday interactions with information, even then.

When Mike Caulfield published the SIFT model in 2019, it gave instruction librarians a  progressive alternative to the CRAAP test. Caulfield described his evaluation methods as a “networked reputation heuristic,”9 developed in response to the spread of misinformation and disinformation in the post-truth era. The four “moves” he identified — Stop, Investigate, Find, Trace — are meant to help people recontextualize information through placing a particular work and its claims within the larger realm of content about a topic.

SIFT offers major improvements over CRAAP in speed, simplicity, and applicability to a wider scope of print and online publications, platforms, and purposes. Recently, researchers have identified the benefits of using this approach,10 and, in particular, the lateral reading strategies it incorporates. SIFT encourages students to base evaluation on cues that go beyond the intrinsic qualities of the article and to use comparisons across media sources to understand the trustworthiness of an article. This is what Justin Reich,11 Director of the MIT Teaching Systems Lab, noted in a 2020 Project Information Literacy (PIL) interview, calling SIFT a useful “first step,” since it may assist students in acquiring the background knowledge they need to evaluate the next piece of information they encounter on the topic. 

Crucially, SIFT also includes the context of the information needed as part of evaluation – some situations require a higher level of verification than others. The actions SIFT recommends are more closely aligned with the kind of checking students are already using to detect bias12 and decide what to believe and how researchers themselves judge the quality of information.13 And while it is much better suited to today’s context, where misinformation abounds and algorithms proliferate, SIFT is still based on students encountering individual information objects, without necessarily understanding them as part of a system. 

Our proposed next step, what we call proactive evaluation, would allow them not only to evaluate what they’re seeing but consider why they’re seeing what they do and what might be missing. SIFT, like CRAAP, is based on a reactive approach: the individual is an agent, acting upon information objects they find. In today’s information landscape, we think it is more useful to invert this relationship and consider the information object as the agent that is acting on the individual it finds. 

3.  Information with agency

Thinking of information as having agency allows us to re-examine the information environment we think we know. By the time they get to college, today’s students are embedded in the information infrastructure: a social phenomenon of interconnected sources, creators, processes, filters, stories, formats, platforms, motivations, channels, and audiences. Their profiles and behaviors affect not only what they see and share but also the relative prominence of stories, images, and articles in others’ feeds and search results. Information enters, flows through, and ricochets around the systems they inhabit – fueled, funded, and filtered by data gathered from every interaction.

Research from PIL,14 and elsewhere,15 indicates that students who see algorithmic personalization at work in their everyday information activities already perceive information as having agency, specifically, the  ability to find them, follow them across platforms, and keep them in filter bubbles. They understand the bargain they are required to make with corporations like Amazon, Alphabet, and Facebook where they exchange personal data for participation in communities, transactions, or search efficiency.

When PIL interviewed 103 undergraduates at eight U.S. colleges and universities in 2019 for the algorithm study, one student at a liberal arts college described worries we heard from others about the broader social impact of these systems: “I’m more concerned about the large-scale trend of predicting what we want, but then also predicting what we want in ways that push a lot of people towards the same cultural and political endpoint.”16

This student’s concern relates to the effects of algorithmic personalization and highlights student awareness of deliberate efforts to affect and, in many cases, infect the system.17 Subverting the flow of information for fun and profit has become all too common practice for trolls, governments, corporations, and other interest groups.18 The tactics we’ve taught students for evaluating items one at a time provide slim defenses against the networked efforts of organizations that flood feeds, timelines, and search results. While SIFT at least considers information as part of an ecosystem, we still need to help students go beyond evaluating individual information objects and understand the systems that intervene during the search processes, sending results with the agency to nudge, if not shove, users in certain directions. 

That is why it is time to consider a new approach to the teaching of source evaluation in order to keep up with the volatile information ecosystem. Allowing for information to have agency, i.e. acknowledging information as active, targeted, and capable of influencing action, fundamentally alters the position of the student in the act of evaluation and demands a different approach from instruction librarians. We call this approach proactive evaluation.

4. Proactive evaluation

What happens if we shift our paradigm from assuming that students are agents in the information-student interaction to assuming that the information source is the agent? This change in perspective will dramatically reframe our instruction in important ways. This perspective may initially seem to disempower the information literacy student and instructor, but given widespread disinformation in this post-truth era, this reversal might keep us, as instructors, grounded in our understanding of information literacy.

Once we shift the understanding of who is acting upon whom, we can shift our approaches and techniques to reflect this perspective. This change in thinking allows us to move from reactive evaluation, that is, “Here is what I found, what do I think of it?” to proactive evaluation, “Because I understand where this information came from and why I’m seeing it, I can trust it for this kind of information, and for this purpose.”

What does a proactive approach look like? Table 1 presents comparisons between reactive and proactive approaches to information literacy as a starting point for thinking about this shift in thinking. This typology acknowledges that college and university students come into our classrooms with a deep and wide knowledge of the information landscapes in which they exist. 

Table 1. A Model for Transitioning from Reactive to Proactive Evaluation

Understanding of information
  • Individual objects you find → Networked objects that find you

Understanding of evaluation
  • Intrinsic (to the object) → Contextual (within the network)
  • User is the agent → Information is the agent → Both the user and the information have agency in a dynamic relationship

How/what we teach
  • Closed yes/no questions with defined answers → Open questions
  • Binaries (good/bad, scholarly/popular) → Contextual continua (useful for topic x in circumstance y if complemented by z)
  • Student as perpetual novice (evaluates from scratch every time) → Student as developing expert with existing knowledge, who brings expertise about information, subject, sources, processes
  • Evaluate individual objects with novice tools and surface heuristics → Evaluate based on network context and connections, and build networks of trusted knowledge/sources
  • CRAAP → SIFT → Into the unknown

As this typology suggests, our thinking rejects the “banking model of education” where students are empty vessels that educators must fill.19 To illustrate this point, PIL’s 2020 algorithm study has confirmed what we have long suspected: many students are already using evasive strategies to circumvent algorithmic tracking and bias. Their tactics, learned from friends and family, not their instructors, range from creating throwaway email accounts to using VPNs and ad-blocking apps to guard their personal data from algorithms.20

Students know that information is constantly trying to find them, identify them, label them, and sway them. And they may know this better than the faculty that teach them.21 Applying this to information literacy instruction means acknowledging that students approach information skeptically, and at least some students arrive in the classroom with defensive practices for safeguarding their privacy and mitigating invasive, biased information as they navigate the web and search for information.

To build on this premise, we should be asking students to apply their defensive strategies to classroom-based tasks. Instead of asking yes/no questions, e.g., “Is it written by an expert?” or “Is it current?”, we should shift our assessment questions to open-ended inquiry with students: “If this information showed up in your news stream, what tactics would you use to decide if you wanted to pass it along?” “What do you look for to know if this is valid or useful information?”

An example of how this could work would be asking the class what they do when they encounter a new piece of information in their own information landscape, such as a news story. How would students go about deciding if they would reshare it?  What are their motivations for sharing a news story? In PIL’s news study, for instance, more than half of the almost 6,000 students surveyed (52%) said their reason for sharing news on social media was to let friends and followers know about something they should be aware of, while more than two fifths (44%) said sharing news gives them a voice about a larger political or social cause.22 Does the same drive hold true for students in this classroom example? 

Librarians using a proactive approach like this one could hold a classroom discussion to see whether their students also see themselves as stewards of what is important to know, while having a voice about larger causes in the world. A proactive approach also allows students to bring their prior networked knowledge into the discussion, rather than looking at a single point of information in isolation when directed by an instruction librarian. Asking students to make their tacit processes more explicit will also help them see the information networks they have already built more clearly. They may be using other factors in their decision-making, like who recommended a source or the context in which the information will be used. These evaluation points are also used by researchers when assessing the credibility of information.23 Providing opportunities for students to reflect on and articulate their interactions with information in the subject areas where they feel confident may allow them to transfer skills more easily to new, less familiar, academic domains.

Students sharing these kinds of spontaneous reflections can leverage the social aspect of information skills. PIL studies have shown repeatedly that students lean on each other when they evaluate content for academic, employment, and everyday purposes; when necessary they also look to experts, including their instructors, to suggest or validate resources. Evaluation is fundamentally a social practice, but the existing heuristics don’t approach it this way. Reliance on other people as part of trusted information networks is rarely even acknowledged, let alone explicitly taught in formal instruction, as we tend to focus on the stereotype of the solitary scholar.

By gaining an understanding of their own information networks, students can learn to see the operations of other networks, within disciplines, news, and other commercial media. If they are aware of the interconnectedness of information, they can use those connections to evaluate content and develop their mental Rolodexes of trusted sources.24 Understanding which sources are trustworthy for which kinds of information in which contexts is foundational knowledge for both academic work and civic engagement.

Building on SIFT strategies, it’s possible for students to accumulate knowledge about sources by validating them with tools like Wikipedia. Comparing and corroborating may illuminate the impact of algorithms and other systems that make up the information infrastructure.25 Developing this kind of map of their network of trusted sources can help them search and verify more strategically within that network, whether they’re in school or not.

As they come to understand themselves as part of the information infrastructure, students may be able to reclaim some agency from the platforms that constrain and control the information they see. While they may not ever be able to fully escape mass personalization, looking more closely at its effects may increase awareness of when and how search results and news feeds are being manipulated. Students need to understand why they see the information that streams at them, the news that comes into their social media feeds, the results that show up at the top of a search, and what they can do to balance out the agency equation and regain some control.

Admittedly, this form of instruction is clearly more difficult to implement than turnkey checklists and frameworks. It is much harder to fit into the precious time of a one-shot. It requires trust in the students, trust in their prior knowledge, and trust in their sense-making skills. This change in perspective about how we teach evaluation is not a magic bullet for fixing our flawed instruction practices. But we see proactive evaluation as an important step for moving our profession forward in teaching students how to navigate an ever-changing information landscape. This proactive model can be used in conjunction with, or independent of, SIFT to create a more complex information literacy.

Reactive evaluation considers found information objects in isolation, based on intrinsic qualities, regardless of the user or intended use. In a proactive approach, the user considers the source while evaluating information contextually, through its relationships to other sources and to the user. Over time, a user can construct their own matrix of trusted sources. It’s similar to getting to know a new city; a newcomer’s mental map gradually develops overlays of shortcuts, the safe and not-so-safe zones, and the likely places to find what they need in a given situation. Eventually, they learn where to go for what, a critical thinking skill they can take with them through the rest of their education and everyday lives and apply with confidence long after graduation.

5. Into the unknown

Reactive approaches to evaluation are not sufficient to equip students to navigate the current and evolving information landscape. What we have proposed in this paper is an alternative, what we call a proactive approach, to information evaluation that moves away from finite and simple source evaluation questions to open-ended and networked questions. While a proactive approach may feel unfamiliar and overwhelming at first, it moves away from the known to the unknown to create a more information-literate generation of students and lifelong learners.  

But what if this approach is actually not as unfamiliar as it may seem? The current ACRL framework paints a picture of the “information-literate student” that speaks to a pedagogy that cultivates a complex and nuanced understanding of the information creation process and landscape. For example, in the “Scholarship as Conversation” frame, these dispositions include “recognize that scholarly conversations take place in various venues,” and “value user-generated content and evaluate contributions made by others.”26 

Both dispositions require a nuanced understanding of the socialness of scholarship and imply evaluation within a social context. And while heuristics that rely on finite and binary responses are easy to teach, they create more problems than they solve. Focusing on the network processes that deliver the information in front of us, instead of focusing on these finite questions, allows for a different kind of knowing. 

The next question for instructors to tackle is what this proactive approach looks like in the classroom. In our field, discussions of “guide on the side” and “sage on the stage” are popular, but what we are actually advocating in this article isn’t a guide or a sage, as both assume a power structure and expertise that is incomplete and outdated. In the classroom, we advocate a shift from guiding or lecturing to conversation. We do not have a set of desired answers that we are hoping to coax out of the students: Which of these sources is valid? Who authored this source, and are they an expert? Rather, a proactive approach encourages students to engage and interact with their ideas and previous experiences around information agency, the socialness of the information, and how they evaluate non-academic sources. This will allow students to bring their deep expertise into the classroom.

We have alluded to open-ended questions as part of the proactive approach, but this is more accurately described as an open dialogue. This type of instruction is difficult in the one-shot structure, as it relies on trust. An unsuccessful session looks like your worst instruction experience, with the students staring blankly at you and not engaging, leaving lots of empty space and the strong desire to revert to lecturing on database structures. A successful session will feel like an intellectual conversation where you as the “teacher” learn as much as you impart, and the conversation with students is free-flowing and engaging. 

Returning to the earlier example of asking how a student would choose whether or not to reshare a news story, this type of dialogue could include conversations about what they already know about the news source, what they know about the person or account that initially shared it, how they might go about reading laterally, what their instincts say, how this does or does not fit with their prior knowledge on the subject, and their related reactions. The discussion will reveal which areas of information literacy and assessment need more dialogue and which areas the students are already skilled and comfortable in.

The kind of information literacy instruction that assumes agency rests solely with the user, who finds and then evaluates individual information objects, is no longer valid now that information seeks out the user through networked connections. This reversal of the power dynamic underlies many of the gaps between how evaluation is taught in academic settings and how it occurs in everyday life. The approach we advocate balances out these extremes and helps students recognize and regain some of their agency. By understanding how information infrastructures work and their roles within them, students can adapt the tactics that many of them are already using to become more conscious actors.

6. Looking Ahead 

In this article, we have discussed an alternative to current evaluation approaches that is closely tied to the issue of trust: trusting our students to bring their own experiences and expertise to the information literacy classroom. But our work doesn’t end there. Our approach also requires us to trust ourselves as instructors. We will need to trust that we do in fact understand the continuously changing information landscape well enough to engage with open-ended, complex questions, rather than a prescribed step-by-step model. We must continue to inform ourselves and reevaluate information systems — the architectures, infrastructures, and fundamental belief systems — so we can determine what is trustworthy. We have to let go of simple solutions to teach about researching complex, messy problems.

For college students in America today, knowing how to evaluate news and information is not only essential for academic success but urgently needed for making sound choices during tumultuous times. We must embrace that instruction, and information evaluation, are going to be ugly, hard, and confusing for us to tackle but worth it in the end to remain relevant and useful to the students we teach. 


We are grateful to Barbara Fister, Contributing Editor of the “PIL Provocation Series” at Project Information Literacy (PIL) for making incisive suggestions for improving this paper, and Steven Braun, Senior Researcher in Information Design at PIL, for designing Table 1. The article has greatly benefited from the reviewers assigned by In the Library with the Lead Pipe: Ian Beilin, Ikumi Crocoll, and Jessica Kiebler.


“Framework for Information Literacy for Higher Education.” 2016. Association of College and Research Libraries. January 16.

“Information Literacy Competency Standards for Higher Education.” 2000. Association of College and Research Libraries. January 18. ala/mgrps/divs/acrl/standards/standards.pdf.

Bengani, Priyanjana. “As Election Looms, a Network of Mysterious ‘Pink Slime’ Local News Outlets Nearly Triples in Size.” Columbia Journalism Review, August 4, 2020.

Blakeslee, Sarah. “The CRAAP Test.” LOEX Quarterly 31, no. 3 (2004).

Breakstone, Joel, Mark Smith, Priscilla Connors, Teresa Ortega, Darby Kerr, and Sam Wineburg. “Lateral Reading: College Students Learn to Critically Evaluate Internet Sources in an Online Course.” The Harvard Kennedy School Misinformation Review 2, no. 1 (2021): 1–17.

Brodsky, Jessica E., Patricia J. Brooks, Donna Scimeca, Ralitsa Todorova, Peter Galati, Michael Batson, Robert Grosso, Michael Matthews, Victor Miller, and Michael Caulfield. “Improving College Students’ Fact-Checking Strategies through Lateral Reading Instruction in a General Education Civics Course.” Cognitive Research: Principles and Implications 6 (2021).

Caulfield, Mike. “A Short History of CRAAP.” Blog. Hapgood (blog), September 14, 2018.

———. Truth is in the network. Email, May 31, 2019.

———. Web Literacy for Student Fact-Checkers, 2017.

Dube, Jacob. “No Escape: The Neverending Online Threats to Female Journalists.” Ryerson Review of Journalism, no. Spring 2018 (May 28, 2018).

Fister, Barbara. “The Information Literacy Standards/Framework Debate.” Inside Higher Ed, Library Babel Fish, January 22, 2015.

Foster, Nancy Fried. “The Librarian-Student-Faculty Triangle: Conflicting Research Strategies?” Library Assessment Conference, 2010.

Freire, Paulo. “The Banking Model of Education.” In Critical Issues in Education: An Anthology of Readings, 105–17. Sage, 1970.

Haider, Jutta, and Olof Sundin. “Information Literacy Challenges in Digital Culture: Conflicting Engagements of Trust and Doubt.” Information, Communication and Society, 2020.

Head, Alison J., Barbara Fister, and Margy MacMillan. “Information Literacy in the Age of Algorithms.” Project Information Literacy Research Institute, January 15, 2020.

Head, Alison J., John Wihbey, P. Takis Metaxas, Margy MacMillan, and Dan Cohen. “How Students Engage with News: Five Takeaways for Educators, Journalists, and Librarians.” Project Information Literacy Research Institute, October 16, 2018.

Maass, Dave, Aaron Mackey, and Camille Fischer. “The Foilies 2018.” Electronic Frontier Foundation, March 11, 2018.

Meola, Marc. “Chucking the Checklist: A Contextual Approach to Teaching Undergraduates Web-Site Evaluation.” Libraries and the Academy 4, no. 3 (2004): 331–44.

Reich, Justin. Tinkering Toward Networked Learning: What Tech Can and Can’t Do for Education. December 2020.

Seeber, Kevin. “Wiretaps and CRAAP.” Blog. Kevin Seeber (blog), March 18, 2017.

  1. The CRAAP Test (Currency, Relevance, Authority, Accuracy, Purpose) is a reliability heuristic designed by Sarah Blakeslee and her librarian colleagues at Chico State University. See: Sarah Blakeslee, “The CRAAP Test,” LOEX Quarterly 31 no. 3 (2004):
  2. Kevin Seeber, “Wiretaps and CRAAP,” Kevin Seeber [Blog], (March 18, 2017):
  3. Mike Caulfield, “The Truth is in the Network” [email interview by Barbara Fister], Project Information Literacy, Smart Talk Interview, no. 31, (December 1, 2020)
  4. Barbara Fister, “The Information Literacy Standards/Framework Debate,” Library Babel Fish column, Inside Higher Education, (January 22, 2015):
  5. Sarah Blakeslee, “The CRAAP test,” op. cit.
  6. Association of College and Research Libraries, Information Literacy Competency Standards for Higher Education, (2000),  Note: These standards were rescinded in 2016.
  7. Mike Caulfield, “A Short History of CRAAP,” Hapgood, (June 14, 2018):
  8. Kevin Seeber (March 18, 2017), “Wiretaps and CRAAP,” op. cit.
  9. Mike Caulfield, “The Truth is in the Network,” op. cit. Caulfield developed SIFT from an earlier version of this heuristic, “four moves and a habit,” described in his 2017 OER book Web Literacy for Student Fact-Checkers, (December 1, 2020)
  10. Jessica E. Brodsky, Patricia J. Brooks, Donna Scimeca, Ralitsa Todorova, Peter Galati, Michael Batson, Robert Grosso, Michael Matthews, Victor Miller, and Michael Caulfield, “Improving College Students’ Fact-Checking Strategies Through Lateral Reading Instruction in a General Education Civics Course,” Cognitive Research: Principles and Implications 6, no. 1 (2021): 1-18; Joel Breakstone, Mark Smith, Priscilla Connors, Teresa Ortega, Darby Kerr, and Sam Wineburg, “Lateral Reading: College Students Learn to Critically Evaluate Internet Sources in an Online Course,” The Harvard Kennedy School Misinformation Review 2, no. 1 (2021): 1-17,
  11. Justin Reich, “Tinkering Toward Networked Learning: What Tech Can and Can’t Do for Education” [email interview by Barbara Fister], Project Information Literacy, Smart Talk Interview, no. 33, (December 2020):
  12. Alison J. Head, John Wihbey, P. Takis Metaxas, Margy MacMillan, and Dan Cohen, How Students Engage with News: Five Takeaways for Educators, Journalists, and Librarians, Project Information Literacy Research Institute, (October 16, 2018), pp. 24-28, 
  13. Nancy Fried Foster, “The Librarian‐Student‐Faculty Triangle: Conflicting Research Strategies?” 2010 Library Assessment Conference, (2010):
  14. Alison J. Head, Barbara Fister, and Margy MacMillan, Information Literacy in the Age of Algorithms, Project Information Literacy Research Institute, (January 15, 2020):
  15. Jutta Haider and Olof Sundin (2020), “Information Literacy Challenges in Digital Culture: Conflicting Engagements of Trust and Doubt,” Information, Communication and Society, ahead-of-print,
  16. Alison J. Head, Barbara Fister, and Margy MacMillan (January 15, 2020), op. cit.
  17. Alison J. Head, Barbara Fister, and Margy MacMillan, (January 15, 2020), op. cit., 5-8.
  18. See for example, Dave Maass, Aaron Mackey, and Camille Fischer, “The Foilies 2018,” Electronic Frontier Foundation, (March 11, 2018):; Jacob Dube, “No Escape: The Neverending Online Threats to Female Journalists,” Ryerson Review of Journalism, (May 28, 2018):; Priyanjana Bengani, “As Election Looms, a Network of Mysterious ‘Pink Slime’ Local News Outlets Nearly Triples in Size,” Columbia Journalism Review, (August 4, 2020):
  19. Paulo Freire, “The Banking Model of Education,” In Provenzo, Eugene F. (ed.). Critical Issues in Education: An Anthology of Readings, Sage, (1970), 105-117.
  20. Alison J. Head, Barbara Fister, and Margy MacMillan (January 15, 2020), op.cit., 16-19.
  21. Alison J. Head, Barbara Fister, and Margy MacMillan (January 15, 2020), op.cit., 22-25.
  22. Alison J. Head, John Wihbey, P. Takis Metaxas, Margy MacMillan, and Dan Cohen (October 16, 2018), op.cit., 20
  23. Nancy Fried Foster (2010), op. cit.
  24. Barbara Fister, “Lizard People in the Libraries,” PIL Provocation Series, No. 1, Project Information Literacy Research Institute,(February 3, 2021): 
  25. Marc Meola, “Chucking the Checklist: A Contextual Approach to Teaching Undergraduates Web-site Evaluation,” portal: Libraries and the Academy 4, no. 3 (2004): 331-344, p. 338
  26. Association of College and Research Libraries, Framework for Information Literacy for Higher Education (2016)

Welcome Livemark – the New Frictionless Data Tool / Open Knowledge Foundation

We are very excited to announce that a new tool has been added to the Frictionless Data toolkit: Livemark.

What is Frictionless?

Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The project is funded by the Sloan Foundation and Open Data Institute.

Learn more about Frictionless data here.

What is Livemark?

Livemark is a tool that makes it very easy to publish data articles, letting you see your data live on a working website in the blink of an eye.

How does it work?

Livemark is a Python library that generates a static page from Markdown, extending it with interactive charts, tables, scripts, and much more. Within a Livemark document you can use the Frictionless framework, available as a frictionless variable, to work with your tabular data.

Livemark offers a series of useful features, like automatically generating a table of contents and providing a scroll-to-top button when you scroll down your document. You can also customise the layout of your newly created webpage.

How can you get started?

Livemark is very easy to use. We invite you to watch this great demo by developer Evgeny Karev:

You can also have a look at the documentation on GitHub.

What do you think?

If you create a site using Livemark, please let us know! Frictionless Data is an open source project, so we encourage you to give us feedback. Let us know your thoughts, suggestions, or issues by joining us in our community chat on Discord or by opening an issue in the GitHub repo.

Take a Virtual Tour of Samvera Repositories / Samvera

The Samvera Repository Online Tour provides an overview of a range of Samvera-based digital repositories and their collections at institutions across the US and UK. Click through to explore how Samvera technologies allow institutions and organizations to organize and provide access to diverse materials — from oral histories and historic photographs, to student projects and research data. 

The tour was planned by the Samvera Marketing Working Group and created using StoryMapJS by Lafayette College Libraries student workers Grayce Walker, Deja Jackson, and Khaknazar Shyntassov. A huge thanks to them, as well as to Charlotte Nunes at Lafayette College Libraries Digital Scholarship Services for coordinating the work.

If you’d like to include your Samvera-based repository on the tour, simply fill out this form.


Alternatives To Proof-of-Work / David Rosenthal

The designers of peer-to-peer consensus protocols such as those underlying cryptocurrencies face three distinct problems. They need to prevent:
  • Being swamped by a multitude of Sybil peers under the control of an attacker. This requires making peer participation expensive, such as by Proof-of-Work (PoW). PoW is problematic because it has a catastrophic carbon footprint.
  • A rational majority of peers from conspiring to obtain inappropriate benefits. This is thought to be achieved by decentralization, that is a network of so many peers acting independently that a conspiracy among a majority of them is highly improbable. Decentralization is problematic because in practice all successful cryptocurrencies are effectively centralized.
  • A rational minority of peers from conspiring to obtain inappropriate benefits. This requirement is called incentive compatibility. This is problematic because it requires very careful design of the protocol.
In the rather long post below the fold I focus on some potential alternatives to PoW, inspired by Jeremiah Wagstaff's Subspace: A Solution to the Farmer’s Dilemma, the white paper for a new blockchain technology.

Careful design of the economic mechanisms of the protocol can in theory ensure incentive compatibility, or as Ittay Eyal and Emin Gun Sirer express it:
the best strategy of a rational minority pool is to be honest, and a minority of colluding miners cannot earn disproportionate benefits by deviating from the protocol
They showed in 2013 that the Bitcoin protocol was not incentive-compatible, but this is in principle amenable to a technical fix. Unfortunately, ensuring decentralization is a much harder problem.


Vitalik Buterin, co-founder of Ethereum, wrote in The Meaning of Decentralization:
In the case of blockchain protocols, the mathematical and economic reasoning behind the safety of the consensus often relies crucially on the uncoordinated choice model, or the assumption that the game consists of many small actors that make decisions independently.
The Internet's basic protocols, TCP/IP, DNS, SMTP, HTTP are all decentralized, and yet the actual Internet is heavily centralized around a few large companies. Centralization is an emergent behavior, driven not by technical but by economic forces. W. Brian Arthur described these forces before the Web took off in his 1994 book Increasing Returns and Path Dependence in the Economy.

Similarly, the blockchain protocols are decentralized but ever since 2014 the Bitcoin blockchain has been centralized around 3-4 large mining pools. Buterin wrote:
can we really say that the uncoordinated choice model is realistic when 90% of the Bitcoin network’s mining power is well-coordinated enough to show up together at the same conference?
This is perhaps the greatest among the multiple failures of Satoshi Nakamoto's goals for Bitcoin. The economic forces driving this centralization are the same as those that centralized other Internet protocols. I explored how they act to centralize P2P systems in 2014's Economies of Scale in Peer-to-Peer Networks. I argued that an incentive-compatible protocol wasn't adequate to prevent centralization. The simplistic version of the argument was:
  • The income to a participant in an incentive-compatible P2P network should be linear in their contribution of resources to the network.
  • The costs a participant incurs by contributing resources to the network will be less than linear in their resource contribution, because of the economies of scale.
  • Thus the proportional profit margin a participant obtains will increase with increasing resource contribution.
  • Thus the effects described in Brian Arthur's Increasing Returns and Path Dependence in the Economy will apply, and the network will be dominated by a few, perhaps just one, large participant.
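The dynamic in the bullets above can be made concrete with a toy simulation (an illustration of this argument, not code from the original post; all numbers are assumed for the example): rewards are linear in contribution, costs grow sub-linearly because of economies of scale, and each participant reinvests its profit. The largest participant's share of the network grows round after round.

```python
# Toy model of the economies-of-scale argument. Illustrative assumptions:
# rewards are linear in contribution, costs grow sub-linearly
# (exponent < 1), and every participant reinvests its profit.

def simulate(contributions, rounds=50, reward_per_unit=1.0,
             unit_cost=0.8, cost_exponent=0.9):
    shares = list(contributions)
    for _ in range(rounds):
        updated = []
        for c in shares:
            reward = reward_per_unit * c              # linear in contribution
            cost = unit_cost * c ** cost_exponent     # sub-linear: scale helps
            updated.append(c + max(reward - cost, 0.0))  # reinvest profit
        shares = updated
    total = sum(shares)
    return [c / total for c in shares]

# Three participants contributing 1, 10 and 100 units: the largest has the
# highest proportional profit margin, so its share keeps increasing.
final = simulate([1.0, 10.0, 100.0])
```

Because the proportional margin rises with scale, the end state is the one Brian Arthur's analysis predicts: domination by the largest contributor.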
And I wrote:
The advantages of P2P networks arise from a diverse network of small, roughly equal resource contributors. Thus it seems that P2P networks which have the characteristics needed to succeed (by being widely adopted) also inevitably carry the seeds of their own failure (by becoming effectively centralized). Bitcoin is an example of this.
My description of the fundamental problem was:
The network has to arrange not just that the reward grows more slowly than the contribution, but that it grows more slowly than the cost of the contribution to any participant. If there is even one participant whose rewards outpace their costs, Brian Arthur's analysis shows they will end up dominating the network. Herein lies the rub. The network does not know what an individual participant's costs, or even the average participant's costs, are and how they grow as the participant scales up their contribution.

So the network would have to err on the safe side, and make rewards grow very slowly with contribution, at least above a certain minimum size. Doing so would mean few if any participants above the minimum contribution, making growth dependent entirely on recruiting new participants. This would be hard because their gains from participation would be limited to the minimum reward. It is clear that mass participation in the Bitcoin network was fuelled by the (unsustainable) prospect of large gains for a small investment.
The result of limiting reward growth would be a blockchain with limited expenditure on mining which, as we see with the endemic 51% attacks against alt-coins, would not be secure. But without such limits, economies of scale mean that the blockchain would be dominated by a few large mining pools, so would not be decentralized and would be vulnerable to insider attacks. Note that in June 2014 the GHash.IO mining pool alone had more than 51% of the Bitcoin mining power.
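As background for why the 51% threshold matters, section 11 of Nakamoto's white paper computes the probability that an attacker controlling a fraction q of the hash power eventually overtakes an honest chain that is z blocks ahead. A direct transcription of that calculation:

```python
# Attacker catch-up probability from the Bitcoin white paper (section 11).
import math

def attacker_success(q, z):
    """Probability an attacker with hash-power share q ever catches up
    from z blocks behind."""
    p = 1.0 - q
    if q >= p:
        return 1.0  # a majority attacker always succeeds eventually
    lam = z * (q / p)
    prob = 1.0
    for k in range(z + 1):
        # Poisson weight for the attacker having mined k blocks so far
        poisson = math.exp(-lam) * lam ** k / math.factorial(k)
        # times the gambler's-ruin chance of closing the remaining gap
        prob -= poisson * (1.0 - (q / p) ** (z - k))
    return prob

# With 10% of the hash power the attack is hopeless after six
# confirmations; with 50% or more it always succeeds eventually.
```

The discontinuity at q = 0.5 is the point: below a majority the success probability decays exponentially with confirmations, at a majority it is certain.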

But the major current problem for Bitcoin, Ethereum and cryptocurrencies in general is not vulnerability to 51% attacks. Participants in these "trustless" systems trust that the mining pools are invested in their security and will not conspire to misbehave. Events have shown that this trust is misplaced as applied to smaller alt-coins. Trustlessness was one of Nakamoto's goals, another of the failures. But as regards the major cryptocurrencies this trust is plausible; everyone is making enough golden eggs to preserve the life of the goose.

Alternatives to Proof-of-Work

The major current problem for cryptocurrencies is that their catastrophic carbon footprint has attracted attention. David Gerard writes:
The bit where proof-of-work mining uses a country’s worth of electricity to run the most inefficient payment system in human history is finally coming to public attention, and is probably Bitcoin’s biggest public relations problem. Normal people think of Bitcoin as this dumb nerd money that nerds rip each other off with — but when they hear about proof-of-work, they get angry. Externalities turn out to matter.
Yang Xiao et al's A Survey of Distributed Consensus Protocols for Blockchain Networks is very useful. They:
identify five core components of a blockchain consensus protocol, namely, block proposal, block validation, information propagation, block finalization, and incentive mechanism. A wide spectrum of blockchain consensus protocols are then carefully reviewed accompanied by algorithmic abstractions and vulnerability analyses. The surveyed consensus protocols are analyzed using the five-component framework and compared with respect to different performance metrics.
Their "wide spectrum" is comprehensive as regards the variety of PoW protocols, and as regards the varieties of Proof-of-Stake (PoS) protocols that are the leading alternatives to PoW. Their coverage of other consensus protocols is less thorough, and as regards the various protocols that defend against Sybil attacks by wasting storage instead of computation it is minimal.

The main approach to replacing PoW with something equally good at preventing Sybil attacks but less good at cooking the planet has been PoS, but a recent entrant using Proof-of-Time-and-Space (I'll use PoTaS since the acronyms others use are confusing) to waste storage has attracted considerable attention. I will discuss PoS in general terms and two specific systems, Chia (PoTaS) and Subspace (a hybrid of PoTaS and PoS).


In PoW as implemented by Nakamoto, the probability of winning the next block is proportional to the number of otherwise useless hashes computed — Nakamoto thought by individual CPUs but now by giant mining pools driven by warehouses full of mining ASICs. The idea of PoS is that the resource being wasted to deter Sybil attacks is the cryptocurrency itself. In order to mount a 51% attack the attacker would have to control more of the cryptocurrency than the loyal peers. In vanilla PoS the probability of winning the next block is proportional to the amount of the cryptocurrency "staked", i.e. effectively escrowed and placed at risk of being "slashed" if the majority concludes that the peer has misbehaved. It appears to have been first proposed in 2011 by Bitcointalk user QuantumMechanic.
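As an illustration of vanilla PoS (a generic sketch of stake-weighted leader selection, not the algorithm of any particular chain), one can hash a shared seed to pick a point within the total stake, so each peer proposes the next block with probability proportional to its stake:

```python
# Generic sketch of stake-weighted proposer selection (illustrative only).
import hashlib

def select_proposer(stakes, seed):
    """Pick a proposer with probability proportional to each peer's
    (integer) stake, deterministically from a shared random seed."""
    total = sum(stakes.values())
    digest = int.from_bytes(hashlib.sha256(seed.encode()).digest(), "big")
    point = digest % total  # pseudo-random point within the total stake
    cumulative = 0
    for peer, stake in sorted(stakes.items()):
        cumulative += stake
        if point < cumulative:
            return peer

# Simulate many rounds: a peer with 60% of the stake should propose
# roughly 60% of the blocks.
stakes = {"alice": 60, "bob": 30, "carol": 10}
wins = {peer: 0 for peer in stakes}
for round_number in range(10_000):
    wins[select_proposer(stakes, f"round-{round_number}")] += 1
```

The sketch also makes the security assumption visible: whoever holds a majority of the stake wins a majority of the blocks, which is exactly why stake concentration matters as much as hash-power concentration.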

The first cryptocurrency to use PoS, albeit as a hybrid with PoW, was Peercoin in 2012. There have been a number of pure PoS cryptocurrencies since, including Cardano from 2015 and Algorand from 2017 but none have been very successful.

Ethereum, the second most important cryptocurrency, understood the need to replace PoW in 2013 and started work in 2014. But as Vitalik Buterin then wrote:
Over the last few months we have become more and more convinced that some inclusion of proof of stake is a necessary component for long-term sustainability; however, actually implementing a proof of stake algorithm that is effective is proving to be surprisingly complex.

The fact that Ethereum includes a Turing-complete contracting system complicates things further, as it makes certain kinds of collusion much easier without requiring trust, and creates a large pool of stake in the hands of decentralized entities that have the incentive to vote with the stake to collect rewards, but which are too stupid to tell good blockchains from bad.
Buterin was right about making "certain kinds of collusion much easier without requiring trust". In On-Chain Vote Buying and the Rise of Dark DAOs Philip Daian and co-authors show that "smart contracts" provide for untraceable on-chain collusion in which the parties are mutually pseudonymous. It is obviously much harder to prevent bad behavior in a Turing-complete environment. Seven years later Ethereum is still working on the transition, which they currently don't expect to be complete for another 18 months:
Shocked to see that the timeline for Ethereum moving to ETH2 and getting off proof-of-work mining has been put back to late 2022 … about 18 months from now. This is mostly from delays in getting sharding to work properly. Vitalik Buterin says that this is because the Ethereum team isn’t working well together. [Tokenist]
Skepticism about the schedule for ETH2 is well-warranted, as Julia Magas writes in When will Ethereum 2.0 fully launch? Roadmap promises speed, but history says otherwise:
Looking at how fast the relevant updates were implemented in the previous versions of Ethereum roadmaps, it turns out that the planned and real release dates are about a year apart, at the very minimum.
Are there other reasons why PoS is so hard to implement safely? Bram Cohen's talk at Stanford included a critique of PoS:
  • Its threat model is weaker than Proof of Work.
  • Just as Proof of Work is in practice centralized around large mining pools, Proof of Stake is centralized around large currency holdings (which were probably acquired much more cheaply than large mining installations).
  • The choice of a quorum size is problematic. "Too small and it's attackable. Too large and nothing happens." And "Unfortunately, those values are likely to be on the wrong side of each other in practice."
  • Incentivizing peers to put their holdings at stake creates a class of attacks in which peers "exaggerate one's own bonding and blocking it from others."
  • Slashing introduces a class of attacks in which peers cause others to be fraudulently slashed.
  • The incentives need to be strong enough to overcome the risks of slashing, and of keeping their signing keys accessible and thus at risk of compromise.
  • "Defending against those attacks can lead to situations where the system gets wedged because a split happened and nobody wants to take one for the team"
Cohen seriously under-played PoS's centralization problem. It isn't just that the Gini coefficients of cryptocurrencies are extremely high, but that this is a self-reinforcing problem. Because the rewards for mining new blocks, and the fees for including transactions in blocks, flow to the HODL-ers in proportion to their HODL-ings, whatever Gini coefficient the system starts out with will always increase. As I wrote, cryptocurrencies are:
a mechanism for transferring wealth from later adopters, called suckers, to early adopters, called geniuses.
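A toy simulation illustrates the compounding. Each block, a winner is chosen with probability proportional to holdings and receives a fixed reward; every holder's expected share stays constant, but the lottery's variance compounds, so the Gini coefficient drifts upward on average. The parameters below are arbitrary and the code is only a sketch of the dynamic, not of any real protocol.

```python
import random

def gini(xs):
    # Standard Gini coefficient over a list of non-negative holdings.
    xs = sorted(xs)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * sum(xs)) - (n + 1) / n

def simulate(holdings, blocks, reward, rng):
    holdings = list(holdings)
    for _ in range(blocks):
        # Winner picked with probability proportional to current holdings,
        # as in vanilla PoS block production.
        winner = rng.choices(range(len(holdings)), weights=holdings)[0]
        holdings[winner] += reward
    return holdings
```

Starting from holdings `(5, 3, 1, 1)` (Gini 0.35) and averaging a few hundred runs shows the mean Gini rising, even though each holder's expected share never changes.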
PoS makes this "ratchet" mechanism much stronger than PoW, and thus renders them much more vulnerable to insider 51% attacks. I discussed one such high-profile attack by Justin Sun of Tron on the Steemit blockchain in Proof-of-Stake In Practice :
One week later, on March 2nd, Tron arranged for exchanges, including Huobi, Binance and Poloniex, to stake tokens they held on behalf of their customers in a 51% attack:
According to the list of accounts powered up on March. 2, the three exchanges collectively put in over 42 million STEEM Power (SP).

With an overwhelming amount of stake, the Steemit team was then able to unilaterally implement hard fork 22.5 to regain their stake and vote out all top 20 community witnesses – server operators responsible for block production – using account @dev365 as a proxy. In the current list of Steem witnesses, Steemit and TRON’s own witnesses took up the first 20 slots.
Although this attack didn't provide Tron with an immediate monetary reward, the long term value of retaining effective control of the blockchain was vastly greater than the cost of staking the tokens. I've been pointing out that the high Gini coefficients of cryptocurrencies mean Proof-of-Stake centralizes control of the blockchain in the hands of the whales since 2017's Why Decentralize?, which quoted Vitalik Buterin pointing out that a realistic scenario was:
In a proof of stake blockchain, 70% of the coins at stake are held at one exchange.
Or in this case three exchanges cooperating.
Note that economic analyses of PoS, such as More (or less) economic limits of the blockchain by Joshua Gans and Neil Gandal, assume economically rational actors care about the illiquidity of staked coins and the foregone interest. But true believers in "number go up" have a long-term perspective similar to Sun's. The eventual progress of their coin "to the moon!" means that temporary, short-term costs are irrelevant to long-term HODL-ers.

Jude C. Nelson amplifies the centralization point:
PoW is open-membership, because the means of coin production are not tied to owning coins already. All you need to contribute is computing power, and you can start earning coins at a profit.

PoS is closed-membership with a veneer of open-membership, because the means of coin production are tied to owning a coin already. What this means in practice is that no rational coin-owner is going to sell you coins at a fast enough rate that you'll be able to increase your means of coin production. Put another way, the price you'd pay for the increased means of coin production will meet or exceed the total expected revenue created by staking those coins over their lifetime. So unless you know something the seller doesn't, you won't be able to profit by buying your way into staking.

Overall, this makes PoS less resilient and less egalitarian than PoW. While both require an up-front capital expenditure, the expenditure for PoS coin-production will meet or exceed the total expected revenue of those coins at the point of sale. So, the system is only as resilient as the nodes run by the people who bought in initially, and the only way to join later is to buy coins from people who want to exit (which would only be viable if these folks believed the coins are worth less than what you're buying them for, which doesn't bode well for you as the buyer).
Nelson continues:
PoW requires less proactive trust and coordination between community members than PoS -- and thus is better able to recover from both liveness and safety failures -- precisely because it both (1) provides a computational method for ranking fork quality, and (2) allows anyone to participate in producing a fork at any time. If the canonical chain is 51%-attacked, and the attack eventually subsides, then the canonical chain can eventually be re-established in-band by honest miners simply continuing to work on the non-attacker chain. In PoS, block-producers have no such protocol -- such a protocol cannot exist because to the rest of the network, it looks like the honest nodes have been slashed for being dishonest. Any recovery procedure necessarily includes block-producers having to go around and convince people out-of-band that they were totally not dishonest, and were slashed due to a "hack" (and, since there's lots of money on the line, who knows if they're being honest about this?).
PoS conforms to Mark 4:25:
For he that hath, to him shall be given: and he that hath not, from him shall be taken even that which he hath.
In Section VI(E) Yang Xiao et al identify the following types of vulnerability in PoS systems:
  1. Costless simulation:
    literally means any player can simulate any segment of blockchain history at the cost of no real work but speculation, as PoS does not incur intensive computation while the blockchain records all staking history. This may give attackers shortcuts to fabricate an alternative blockchain.
    It is the basis for attacks 2 through 5.
  2. Nothing at stake
    Unlike a PoW miner, a PoS minter needs little extra effort to validate transactions and generate blocks on multiple competing chains simultaneously. This “multi-bet” strategy makes economical sense to PoS nodes because by doing so they can avoid the opportunity cost of sticking to any single chain. Consequently if a significant fraction of nodes perform the “multi-bet” strategy, an attacker holding far less than 50% of tokens can mount a successful double spending attack.
    The defense against this attack is usually "slashing", forfeiting the stake of miners detected on multiple competing chains. But slashing, as Cohen and Nelson point out, is in itself a consensus problem.
  3. Posterior corruption
    The key enabler of posterior corruption is the public availability of staking history on the blockchain, which includes stakeholder addresses and staking amounts. An attacker can attempt to corrupt the stakeholders who once possessed substantial stakes but little at present by promising them rewards after growing an alternative chain with altered transaction history (we call it a “malicious chain”). When there are enough stakeholders corrupted, the colluding group (attacker and corrupted once-rich stakeholders) could own a significant portion of tokens (possibly more than 50%) at some point in history, from which they are able to grow a malicious chain that will eventually surpass the current main chain.
    The defense is key-evolving cryptography, which ensures that the past signatures cannot be forged by the future private keys.
  4. Long-range attack as introduced by Buterin:
    foresees that a small group of colluding attackers can regrow a longer valid chain that starts not long after the genesis block. Because there were likely only a few stakeholders and a lack of competition at the nascent stage of the blockchain, the attackers can grow the malicious chain very fast and redo all the PoS blocks (i.e. by costless simulation) while claiming all the historical block rewards.
    Evangelos Deirmentzoglou et al's A Survey on Long-Range Attacks for Proof of Stake Protocols provides a useful review of these attacks. Even if there are no block rewards, only fees, a variant long-range attack is possible as described in Stake-Bleeding Attacks on Proof-of-Stake Blockchains by Peter Gazi et al, and by Shijie Zhang and Jong-Hyouk Lee in Eclipse-based Stake-Bleeding Attacks in PoS Blockchain Systems.
  5. Stake-grinding attack
    unlike PoW in which pseudo-randomness is guaranteed by the brute-force use of a cryptographic hash function, PoS’s pseudo-randomness is influenced by extra blockchain information—the staking history. Malicious PoS minters may take advantage of costless simulation and other staking-related mechanisms to bias the randomness of PoS in their own favor, thus achieving higher winning probabilities compared to their stake amounts
  6. Centralization risk as discussed above:
    In PoS the minters can lawfully reinvest their profits into staking perpetually, which allows the one with a large sum of unused tokens to become wealthier and eventually reach a monopoly status. When a player owns more than 50% of tokens in circulation, the consensus process will be dominated by this player and the system integrity will not be guaranteed.
    There are a number of papers on this problem, including Staking Pool Centralization in Proof-of-Stake Blockchain Network by Ping He et al, Compounding of wealth in proof-of-stake cryptocurrencies by Giulia Fanti et al, and Stake shift in major cryptocurrencies: An empirical study by Rainer Stütz et al. But to my mind none of them suggest a realistic mitigation.
These are not the only problems from which PoS suffers. Two more are:
  • Checkpointing. Long-range and related attacks are capable of rewriting almost the entire chain. To mitigate this, PoS systems can arrange for consensus on checkpoints, blocks which are subsequently regarded as canonical, forcing any rewriting to start no earlier than the following block. Winkle – Decentralised Checkpointing for Proof-of-Stake is:
    a decentralised checkpointing mechanism operated by coin holders, whose keys are harder to compromise than validators’ as they are more numerous. By analogy, in Bitcoin, taking control of one-third of the total supply of money would require at least 889 keys, whereas only 4 mining pools control more than half of the hash power
    It is important that consensus on checkpoints is achieved through a different mechanism than consensus on blocks. To over-simplify, Winkle piggy-backs votes for checkpoints on transactions; a transaction votes for a block with the number of coins remaining in the sending account, and with the number sent to the receiving account. A checkpoint is final once a set proportion of the coins have voted for it. For the details, see Winkle: Foiling Long-Range Attacks in Proof-of-Stake Systems by Sarah Azouvi et al.
  • Lending. In Competitive equilibria between staking and on-chain lending, Tarun Chitra demonstrates that it is:
    possible for on-chain lending smart contracts to cannibalize network security in PoS systems. When the yield provided by these contracts is more attractive than the inflation rate provided from staking, stakers will tend to remove their staked tokens and lend them out, thus reducing network security. ... Our results illustrate that rational, non-adversarial actors can dramatically reduce PoS network security if block rewards are not calibrated appropriately above the expected yields of on-chain lending.
    I believe this is part of a fundamental problem for PoS. The token used to prevent a single attacker appearing as a multitude of independent peers can be lent, and thus the attacker can borrow a temporary majority of the stake cheaply, for only a short-term interest payment. Preventing this increases implementation complexity significantly.
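To over-simplify the Winkle mechanism described above even further, the vote tally can be reduced to a toy sketch. The structure and names below are my own, not Winkle's, and vote updates, delegation and all the subtleties of the actual protocol are omitted.

```python
def checkpoint_tally(votes, total_supply, threshold=0.5):
    """Each entry in `votes` is (checkpoint_hash, coin_weight): a transaction
    votes for a checkpoint with the coins remaining in the sending account
    and with the coins sent to the receiving account. A checkpoint is final
    once a set proportion of all coins has voted for it."""
    weight = {}
    for checkpoint, coins in votes:
        weight[checkpoint] = weight.get(checkpoint, 0) + coins
    for checkpoint, coins in weight.items():
        if coins / total_supply >= threshold:
            return checkpoint  # finalized
    return None  # no checkpoint has reached the threshold yet
```

The key property is that finality is weighted by coins, not by validators, which is why compromising it requires compromising far more keys than compromising block production does.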
In summary, despite PoS's potential for greatly reducing PoW's environmental impact and the cost of defending against Sybil attacks, it has a major disadvantage. It is significantly more complex and thus its attack surface is much larger, especially when combined with a Turing-complete execution environment such as Ethereum's. It therefore needs more defense mechanisms, which increase complexity further. Buterin and the Ethereum developers realize the complexity of the implementation task they face, which is why their responsible approach is taking so long. Currently Ethereum is the only realistic candidate to displace Bitcoin, and thus reduce cryptocurrencies' carbon footprint, so the difficulty of an industrial-strength implementation of PoS for Ethereum 2.0 is a major problem.


Back in 2018 I wrote about Bram Cohen's PoTaS system, Chia, in Proofs of Space and Chia Network. Instead of wasting computation to prevent Sybil attacks, Chia wastes storage. Chia's "space farmers" create and store "plots" consisting of large amounts of otherwise useless data. The technical details are described in Chia Consensus. They are comprehensive and impressively well thought out.

Because, like Bitcoin, Chia is wasting a real resource to defend against Sybil attacks, it lacks many of PoS's vulnerabilities. Nevertheless, the Chia protocol is significantly more complex than Bitcoin's and thus likely to possess additional vulnerabilities. For example, whereas in Bitcoin there is only one role for participants, mining, the Chia protocol involves three roles:
  • Farmer, "Farmers are nodes which participate in the consensus algorithm by storing plots and checking them for proofs of space."
  • Timelord, "Timelords are nodes which participate in the consensus algorithm by creating proofs of time".
  • Full node, which involves "broadcasting proofs of space and time, creating blocks, maintaining a mempool of pending transactions, storing the historical blockchain, and uploading blocks to other full nodes as well as wallets (light clients)."
Figure 11
Another added complexity is that the Chia protocol maintains three chains (Challenge, Reward and Foliage), plus an evanescent chain during each "slot" (think Bitcoin's block time), as shown in the document's Figure 11. The document therefore includes a range of attacks and their mitigations which are of considerable technical interest.

Cohen's praiseworthy objective for Chia was to avoid the massive power waste of PoW because:
"You have this thing where mass storage medium you can set a bit and leave it there until the end of time and its not costing you any more power. DRAM is costing you power when its just sitting there doing nothing".
Alas, Cohen was exaggerating:
A state-of-the-art disk drive, such as Seagate's 12TB BarraCuda Pro, consumes about 1W spun-down in standby mode, about 5W spun-up idle and about 9W doing random 4K reads.
Which is what it would be doing much of the time while "space farming". Clearly, PoTaS uses energy, just much less than PoW. Reporting on Cohen's 2018 talk at Stanford I summarized:
Cohen's vision is of a PoSp/VDF network comprising large numbers of desktop PCs, continuously connected and powered up, each with one, or at most a few, half-empty hard drives. The drives would have been purchased at retail a few years ago.
My main criticism in those posts was Cohen's naiveté about storage technology, the storage market and economies of scale:
There would appear to be three possible kinds of participants in a pool:
  • Individuals using the spare space in their desktop PC's disk. The storage for the Proof of Space is effectively "free", but unless these miners joined pools, they would be unlikely to get a reward in the life of the disk.
  • Individuals buying systems with CPU, RAM and disk solely for mining. The disruption to the user's experience is gone, but now the whole cost of mining has to be covered by the rewards. To smooth out their income, these miners would join pools.
  • Investors in data-center scale mining pools. Economies of scale would mean that these participants would see better profits for less hassle than the individuals buying systems, so these investor pools would come to dominate the network, replicating the Bitcoin pool centralization.
Thus if Chia's network were to become successful, mining would be dominated by a few large pools. Each pool would run a VDF server to which the pool's participants would submit their Proofs of Space, so that the pool manager could verify their contribution to the pool.

The emergence of pools, and dominance of a small number of pools, has nothing to do with the particular consensus mechanism in use. Thus I am skeptical that alternatives to Proof of Work will significantly reduce centralization of mining in blockchains generally, and in Chia Network's blockchain specifically.
As I was writing the first of these posts, TechCrunch reported:
Chia has just raised a $3.395 million seed round led by AngelList’s Naval Ravikant and joined by Andreessen Horowitz, Greylock and more. The money will help the startup build out its Chia coin and blockchain powered by proofs of space and time instead of Bitcoin’s energy-sucking proofs of work, which it plans to launch in Q1 2019.
Even in 2020 the naiveté persisted, as Chia pitched the idea that space farming on a Raspberry Pi was a way to make money. It still persists, as Chia's president reportedly claims that "recyclable hard drives are entering the marketplace". But when Chia Coin actually started trading in early May 2021 the reality was nothing like Cohen's 2018 vision:
  • As everyone predicted, the immediate effect was to create a massive shortage of the SSDs needed to create plots, and the hard drives needed to store them. Even Gene Hoffman, Chia's CEO, admitted that Bitcoin rival Chia 'destroyed' hard disc supply chains, says its boss:
    Chia, a cryptocurrency intended to be a “green” alternative to bitcoin has instead caused a global shortage of hard discs. Gene Hoffman, the president of Chia Network, the company behind the currency, admits that “we’ve kind of destroyed the short-term supply chain”, but he denies it will become an environmental drain.
    The result of the spike in storage prices was a rise in the vendors stock:
    The share price of hard disc maker Western Digital has increased from $52 at the start of the year to $73, while competitor Seagate is up from $60 to $94 over the same period.
    To give you some idea of how rapidly Chia has consumed storage in the two months since launch, it is around 20% of the rate at which the entire industry produced hard disks in 2018.

  • Chia Pools
    Mining pools arose. As I write the network is storing 30.06EB of otherwise useless data, of which one pool is managing 10.78EB, or 39.3%. Unlike Bitcoin, the next two pools are much smaller, but large enough that the top four pools have 42% of the space. The network is slightly more decentralized than Bitcoin has been since 2014, and for reasons discussed below is less vulnerable to an insider 51% attack.

  • Chia "price"
    The "price" of Chia Coin collapsed, from $1934.51 at the start of trading to $165.41 Sunday before soaring to $185.78 as I write. Each circulating XCH corresponds to about 30TB. The investment in "space farming" hardware vastly outweighs, by nearly six times, the market cap of the cryptocurrency it is supporting.

  • The "space farmers" are earning $1.69M/day, or about $20/TB/year. A 10TB internal drive is currently about $300 on Amazon, so it will be about a 18 months before it earns a profit. The drive is only warranted for 3 years. But note that the warranty is limited:
    Supports up to 180 TB/yr workload rate. Workload Rate is defined as the amount of user data transferred to or from the hard drive.
    Using the drive for "space farming" would likely void the warranty and, just as PoW does to GPUs, burn out the drive long before its warranted life. If the drive does last two years, the $300 investment theoretically earns a 25% return before power and other costs.

  • But the hard drive isn't the only cost of space farming. In order to become a "space farmer" in the first place you need to create plots containing many gigabytes of otherwise useless cryptographically-generated data. You need lots of them; the probability of winning your share of the $2.74M/day is proportional to how big a fraction of the nearly 30EB you can generate and store. The 30EB is growing rapidly, so the quicker you can generate the plots, the better your chance in the near term. To do so in finite time you need, in addition to the hard drive, a large SSD at extra cost. Using it for plotting will void its warranty and burn it out in as little as six weeks. And you need a powerful server running flat-out to do the cryptography, which both casts doubt on how much less power than PoW Chia really uses, and increases the payback time significantly.

  • In my first Chia post I predicted that "space farming" would be dominated by huge data centers such as Amazon's. Sure enough, Wolfie Zhao reported on May 7th that:
    Technology giant Amazon has rolled out a solution dedicated to Chia crypto mining on its AWS cloud computing platform.

    According to a campaign page on the Amazon AWS Chinese site, the platform touts that users can deploy a cloud-based storage system in as quickly as five minutes in order to mine XCH, the native cryptocurrency on the Chia network.
    Two weeks later David Gerard reported that:
    The page disappeared in short order — but an archive exists.
    Because Chia mining trashes the drives, something else I pointed out in my first Chia post, storage services are banning users who think that renting something is a license to destroy it. In any case, 10TB of Amazon's S3 Reduced Redundancy Storage costs $0.788/day, so it would be hard to make ends meet. Cheaper storage services, such as Wasabi at $0.20/day are at considerable risk from Chia.

  • Although this isn't an immediate effect, as David Gerard writes, because creating Chia plots wears out SSDs, and Chia farming wears out hard disks:
    Chia produces vast quantities of e-waste—rare metals, assembled into expensive computing components, turned into toxic near-unrecyclable landfill within weeks.
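Pulling the farming economics above together in one place, the payback arithmetic is simple. All figures are the rough ones quoted above and will have changed since; the calculation deliberately ignores the SSD burned out by plotting, the server, and power, all of which lengthen the real payback.

```python
# Rough figures quoted above (mid-2021; all approximate).
reward_per_tb_year = 20.0   # $/TB/year earned network-wide by farmers
drive_tb = 10               # a retail 10TB internal drive
drive_cost = 300.0          # approximate Amazon price
warranty_years = 3          # limited warranty, likely voided by farming

annual_revenue = reward_per_tb_year * drive_tb      # $200/year
payback_years = drive_cost / annual_revenue         # 1.5 years, i.e. ~18 months
profit_over_warranty = annual_revenue * warranty_years - drive_cost
```

Even this best case assumes the drive survives its full warranted life under a workload the warranty explicitly excludes, and that the reward per terabyte doesn't decay as the network grows, which, as discussed below, it does.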
Miners are incentivized to join pools because they prefer a relatively predictable, frequent flow of small rewards to very infrequent large rewards. The way pools work in Bitcoin and related protocols is that the pool decides what transactions are in the block it hopes to mine, and gets all the pool participants to work on that block. Thus a pool, or a conspiracy among pools, that had 51% of the mining power would have effective control over the transactions that were finalized. Because they make the decision as to which transactions happen, Nicholas Weaver argues that mining pools are money transmitters and thus subject to the AML/KYC rules. But in Chia pools work differently:
First and foremost, even when a winning farmer is using a pool, they themselves are the ones who make the transaction block - not the pool. The decentralization benefits of this policy are obvious.
The potential future downside is that while Bitcoin miners in a pool can argue that AML/KYC is the responsibility of the pool, Chia farmers would be responsible for enforcing the AML/KYC rules and subject to bank-sized penalties for failing to do so.

In Bitcoin the winning pool receives and distributes both the block reward and the (currently much smaller) transaction fees. Over time the Bitcoin block reward is due to go to zero and the system is intended to survive on fees alone. Alas, research has shown that a fee-only Bitcoin system is insecure.

Chia does things differently in two ways. First:
all the transaction fees generated by a block go to the farmer who found it and not to the pool.

Trying to split the transaction fees with the pool could result in transaction fees being paid ‘under the table’ either by making them go directly to the farmer or making an anyone can spend output which the farmer would then pay to themselves. Circumventing the pool would take up space on the blockchain. It could also encourage the emergence of alternative pooling protocols where the pool makes the transaction block which is a form of centralization we wish to avoid.
The basic argument is that in Bitcoin the 51% conspiracy is N pools where in Chia it is M farmers (M ≫ N). Chia are confident that this is safe:
This ensures that even if a pool has 51% netspace, they would also need to control ALL of the farmer nodes (with the 51% netspace) to do any malicious activity. This will be very difficult unless ALL the farmers (with the 51% netspace) downloaded the same malicious Chia client programmed by a Bram like level genius.
I'm a bit less confident because, like Ethereum, Chia has a Turing-complete programming environment. In On-Chain Vote Buying and the Rise of Dark DAOs Philip Daian and co-authors showed that "smart contracts" provide for untraceable on-chain collusion in which the parties are mutually pseudonymous. Although their conspiracies were much smaller, similar techniques might be the basis for larger attacks on blockchains with "smart contracts".

This method has the downside of reducing the smoothing benefits of pools if transaction fees come to dominate fixed block rewards. That’s never been a major issue in Bitcoin and our block reward schedule is set to only halve three times and continue at a fixed amount forever after. There will always be block rewards to pay to the pool while transaction fees go to the individual farmers.
So unlike the Austrian economics of Bitcoin, Chia plans to reward farming by inflating the currency indefinitely, never depending wholly on fees. In Bitcoin the pool takes the whole block reward, but the way block rewards work is different too:
fixed block rewards are set to go 7/8 to the pool and 1/8 to the farmer. This seems to be a sweet spot where it doesn’t reduce smoothing all that much but also wipes out potential selfish mining attacks where someone joins a competing pool and takes their partials but doesn’t upload actual blocks when they find them. Those sort of attacks can become profitable when the fraction of the split is smaller than the size of the pool relative to the whole system.
Last I checked, the largest pool had almost 40% of the total system.
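The profitability condition quoted above is simple enough to state directly. This is a sketch of the stated rule only, not of Chia's actual analysis:

```python
def withholding_profitable(farmer_split, pool_share):
    """Per the Chia design note quoted above: joining a competing pool and
    taking its partials while withholding found blocks becomes profitable
    when the farmer's fixed fraction of the block reward is smaller than
    the pool's share of total netspace."""
    return farmer_split < pool_share
```

With Chia's 1/8 farmer share, a pool holding around 40% of netspace sits on the wrong side of the inequality, which is presumably why the size of the largest pool matters.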

Rational economics are not in play here. "Space farming" makes sense only at scale or for the most dedicated believers in "number go up". Others are less than happy:
So I tested this Chia thing overnight. Gave it 200GB plot and two CPU threads. After 10 hours it consumed 400GB temp space, didn’t sync yet, CPU usage is always 80%+. Estimated reward time is 5 months. This isn’t green, already being centralised on large waste producing servers.
The problem for the "number go up" believers is that the "size go up" too, by about half an exabyte a day. As the network grows, the chance that your investment in hardware will earn a reward goes down, because it represents a smaller proportion of the total. Unless "number go up" much faster than "size go up", your investment is depreciating rapidly, not just because you are burning it out but because its cost-effectiveness is decaying. And as we have seen, "size go up" rapidly while "number go down". And economies of scale mean that return on investment in hardware will go up significantly with the proportion of the total the farmer has. So the little guy gets the short end of the stick even if they are in a pool.

Chia's technology is extremely clever, but the economics of the system that results in the real world don't pass the laugh test. Chia is using nearly a billion dollars of equipment, paid for by inflating the currency at a rate of currently about two-thirds of a billion dollars a year, to process transactions at a rate of around five billion dollars a year, a task that could probably be done using a conventional database and a Raspberry Pi. The only reason for this profligacy is to be able to claim that it is "decentralized". It is more decentralized than PoW or PoS systems, but over time economies of scale and free entry will drive the reward for farming in fiat terms down and mean that small-scale farmers will be squeezed out.

The Chia "price" chart suggests that it might have been a "list-and-dump" scheme, in which A16Z and the other VCs incentivized the miners to mine and the exchanges to list the new cryptocurrency so that the VCs could dump their HODL-ings on the muppets seduced by the hype and escape with a profit. Note that A16Z just raised a $2.2B fund dedicated to pouring money into similar schemes. This is enough to fund 650 Chia-sized ventures! (David Gerard aptly calls Andreesen Horowitz "the SoftBank of crypto") They wouldn't do that unless they were making big bucks from at least some of the ones they funded earlier. Chia's sensitivity about their PR led them to hurl bogus legal threats at the leading Chia community blog. Neither is a good look.


As we see, the Chia network has one huge pool and a number of relatively minuscule pools. In Subspace: A Solution to the Farmer's Dilemma, Wagstaff describes the "farmer's dilemma" thus:
Observe that in any PoC blockchain a farmer is, by-definition, incentivized to allocate as much of its scarce storage resources as possible towards consensus. Contrast this with the desire for all full nodes to reserve storage for maintaining both the current state and history of the blockchain. These competing requirements pose a challenge to farmers: do they adhere to the desired behavior, retaining the state and history, or do they seek to maximize their own rewards, instead dedicating all available space towards consensus? When faced with this farmer’s dilemma rational farmers will always choose the latter, effectively becoming light clients, while degrading both the security and decentralization of the network. This implies that any PoC blockchain would eventually consolidate into a single large farming pool, with even greater speed than has been previously observed with PoW and PoS chains.
Subspace proposes to resolve this using a hybrid of PoS and PoTaS:
We instead clearly distinguish between a permissionless farming mechanism for block production and permissioned staking mechanism for block finalization.
Wagstaff describes it thus:
  1. To prevent farmers from discarding the history, we construct a novel PoC consensus protocol based on proofs-of-storage of the history of the blockchain itself, in which each farmer stores as many provably-unique replicas of the chain history as their disk space allows.
  2. To ensure the history remains available, farmers form a decentralized storage network, which allows the history to remain fully-recoverable, load-balanced, and efficiently-retrievable.
  3. To relieve farmers of the burden of maintaining the state and preforming [sic] redundant computation, we apply the classic technique in distributed systems of decoupling consensus and computation. Farmers are then solely responsible for the ordering of transactions, while a separate class of executor nodes maintain the state and compute the transitions for each new block.
  4. To ensure executors remain accountable for their actions, we employ a system of staked deposits, verifiable computation, and non-interactive fraud proofs.
Separating consensus (PoTaS) and computation (PoS) has interesting effects:
  • Like Chia, the only function of pools is to smooth out farmers' rewards. They do not compose the blocks. Pools will compete on their fees. Economies of scale mean that the larger the pool, the lower the fees it can charge. So, just like Chia, Subspace will end up with one, or only a few, large pools.
  • Like Chia, if they can find a proof, farmers assemble transactions into a block which they can submit to executors for finalization. Subspace shares with Chia the property that a 51% attack requires M farmers not N pools (M ≫ N), assuming of course no supply chain attack or abuse of "smart contracts".
  • Subspace uses a LOCKSS-like technique of electing a random subset of executors for each finalization. Because any participant can unambiguously detect fraudulent execution, and thus that the finalization of a block is fraudulent, the opportunity for bad behavior by executors is highly constrained. A conspiracy of executors has to hope that no honest executor is elected.
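The security of that random election can be sketched with a little hypergeometric arithmetic. This is my illustration, not the paper's analysis, and it assumes executors are elected uniformly at random rather than stake-weighted:

```python
from math import comb

def p_conspiracy_wins(n_executors, n_dishonest, committee_size):
    """Probability that a committee elected uniformly at random
    contains no honest executor (a hypergeometric tail)."""
    if n_dishonest < committee_size:
        return 0.0
    return comb(n_dishonest, committee_size) / comb(n_executors, committee_size)

# Even if 3/4 of 256 executors conspire, a 32-seat committee is
# almost never entirely dishonest:
print(p_conspiracy_wins(256, 192, 32))  # roughly 5e-5
```

Stake-weighting changes the numbers but not the shape of the argument: a conspiracy must capture the entire committee, which is exponentially unlikely unless it controls almost all of the electorate.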
As with Chia, the technology is extremely clever, but there are interesting economic aspects. As regards farmers, Wagstaff writes:
To ensure the history does not grow beyond total network storage capacity, we modify the transaction fee mechanism such that it dynamically adjusts in response to the replication factor. Recall that in Bitcoin, the base fee rate is a function of the size of the transaction in bytes, not the amount of BTC being transferred. We extend this equation by including a multiplier, derived from the replication factor. This establishes a mandatory minimum fee for each transaction, which reflects its perpetual storage cost. The multiplier is recalculated each epoch, from the estimated network storage and the current size of the history. The higher the replication factor, the cheaper the cost of storage per byte. As the replication factor approaches one, the cost of storage asymptotically approaches infinity. As the replication factor decreases, transaction fees will rise, making farming more profitable, and in-turn attracting more capacity to the network. This allows the cost of storage to reach an equilibrium price as a function of the supply of, and demand for, space.
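The quoted mechanism can be sketched in code. The functional form of the multiplier below is my assumption; the paper only states its limiting behavior (cheaper per byte as replication rises, unbounded as replication approaches one):

```python
def minimum_fee(tx_bytes, base_fee_per_byte, network_capacity, history_size):
    """Hypothetical minimum-fee rule. The replication factor is how many
    copies of the history the network's pledged space could hold; the
    multiplier grows without bound as it approaches one."""
    replication = network_capacity / history_size
    if replication <= 1.0:
        raise ValueError("history no longer fits in the network")
    multiplier = 1.0 + 1.0 / (replication - 1.0)  # assumed form, not from the paper
    return tx_bytes * base_fee_per_byte * multiplier

# More pledged space (higher replication) means cheaper storage per byte:
print(minimum_fee(250, 0.01, 400, 100))  # replication 4: low multiplier
print(minimum_fee(250, 0.01, 200, 100))  # replication 2: higher multiplier
```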
There are some issues here:
  • The assumption that the market for fees can determine the "perpetual storage cost" is problematic. As I first showed back in 2011, the endowment needed for "perpetual storage" depends very strongly on two factors that are inherently unpredictable, the future rate of decrease of media cost in $/byte (Kryder rate), and the future interest rate. The invisible hand of the market for transaction fees cannot know these, it only knows the current cost of storage. Nor can Subspace management know them, to set the "mandatory minimum fee". Thus it is likely that fees will significantly under-estimate the "perpetual storage cost", leading to problems down the road.
  • The assumption that those wishing to transact will be prepared to pay at least the "mandatory minimum fee" is suspect. Cryptocurrency fees are notoriously volatile because they are set by a blind auction; when no-one wants to transact, a "mandatory minimum fee" is a deterrent, and when everyone wants to transact, fees become unaffordable. Research has shown that systems become unstable when fees dominate block rewards.
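The first point can be illustrated with a toy endowment calculation in the spirit of my 2011 work. The rates below are illustrative assumptions, but they show how sensitive the cost of "perpetual storage" is to two numbers the fee market cannot know:

```python
def endowment(first_year_cost, kryder_rate, interest_rate, years=100):
    """Money needed up front to pay for `years` of storage, with media
    cost falling at `kryder_rate`/year, discounted at `interest_rate`/year."""
    return sum(
        first_year_cost * (1.0 - kryder_rate) ** t / (1.0 + interest_rate) ** t
        for t in range(years)
    )

# Historic ~25%/year media cost declines vs. the slower declines of the 2010s:
print(round(endowment(100, 0.25, 0.02)))  # ≈ 378
print(round(endowment(100, 0.05, 0.02)))  # ≈ 1456, nearly 4x as much
```

A fee set assuming the historic Kryder rate under-funds storage by a factor of about four if the rate turns out to be what the 2010s actually delivered.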
Wagstaff's paper doesn't seem to describe how block rewards work; I assume that they go to the individual farmer or are shared via a pool for smoother cash flow. I couldn't see from the paper whether, like Chia, Subspace intends to avoid depending upon fees.

As regards executors:
For each new block, a small constant number of executors are chosen through a stake-weighted election. Anyone may participate in execution by syncing the state and placing a small deposit.
But the chance that they will be elected and gain the reward for finalizing a block and generating an Execution Receipt (ER) depends upon how much they stake. The mechanism for rewarding executors is:
Farmers split transaction fee rewards evenly with all executors, based on the expected number of ERs for each block.7 For example, if 32 executors are elected, the farmer will take half of the all transaction fees, while each executor will take 1/64. A farmer is incentivized to include all ERs which finalize execution for its parent block because doing so will allow it to claim more of its share of the rewards for its own block. For example, if the farmer only includes 16 out of 32 expected ERs, it will instead receive 1/4 (not 1/2) of total rewards, while each of the 16 executors will still receive 1/64. Any remaining shares will then be escrowed within a treasury account under the control of the community of token holders, with the aim of incentivizing continued protocol development.
Although the role of executor demands significant resources, both in hardware and in staked coins, these rewards seem inadequate. Every executor has to execute the state transitions in every block, but for each block only a small fraction of the executors is elected, and each elected executor receives, in the example above, only 1/64 of the fees. Note also footnote 7:
7 We use this rate for explanatory purposes, while noting that in order to minimize the plutocratic nature of PoS, executor shares should be smaller in practice.
So Wagstaff expects that an executor will receive only a small fraction of even this 1/64 share of the transaction fees. Even supposing the stake distribution among executors were even, which is unlikely in practice, for the random election mechanism to be effective there need to be many times 32 executors. For example, if there are 256 executors, and executors share 1/8 of the fees, each can expect around 0.05% of the fees. Bitcoin currently runs with fees less than 10% of the block rewards. If Subspace had the same split, in my example executors as a class would expect around 1.2% of the block rewards, while farmers as a class would receive 100% of the block rewards plus 87.5% of the fees.
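Making the arithmetic explicit (the parameters are from my illustrative example above, not from the paper):

```python
def expected_fee_share(n_executors, committee_size, executor_class_share):
    """Expected fraction of each block's fees a single executor earns:
    chance of election times the per-seat share."""
    p_elected = committee_size / n_executors
    per_seat = executor_class_share / committee_size
    return p_elected * per_seat  # simplifies to executor_class_share / n_executors

share = expected_fee_share(256, 32, 1 / 8)
print(f"{share:.4%}")  # 0.0488% of the fees per block
# If fees are under 10% of total miner income, executors as a class
# get at most 1/8 * 10% = 1.25% of it.
```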

There is another problem — the notorious volatility of transaction fees set against the constant cost of running an executor. Much of the time there would be relatively low demand for transactions, so a block would contain relatively few transactions that each offered the mandatory minimum fee. Unless the fees, and especially the mandatory minimum fee, are large relative to the block reward it isn't clear why executors would participate. But fees that large would risk the instability of fee-only blockchains.

There are two other roles in Subspace, verifiers and full nodes. As regards incentivizing verifiers:
we rely on the fact that all executors may act as verifiers at negligible additional cost, as they are already required to maintain the valid state transitions in order to propose new ERs. If we further require them to reveal fraud in order to protect their own stake and claim their share of the rewards, in the event that they themselves are elected, then we can provide a more natural solution to the verifier’s dilemma.
As regards incentivizing full nodes, Wagstaff isn't clear.
In addition to executors, any full node may also monitor the network and generate fraud proofs, by virtue of the fact that no deposit is required to act as verifier.
As I read the paper, full nodes have similar hardware requirements as executors but no income stream to support them unless they are executors too.

Overall, Subspace is interesting. But the only advantage of Subspace over Chia from a farmer's point of view is that the farmer's whole storage resource is devoted to farming; everything else is not really significant, and all of it would be outweighed by a fairly small difference in "price". Add the fact that Chia has already occupied the market niche for new PoTaS systems, and has high visibility via Bram Cohen and A16Z, and the prospects for Subspace don't look good. If Subspace succeeds, economies of scale will have two effects:
  • Large pools will dominate small pools because they can charge smaller fees.
  • Large farmers will dominate small farmers because their rewards are linear in the resources they commit while their costs are sub-linear, so their profit is super-linear. As a result, the most profitable, hassle-free way for smaller players to participate will likely be investing in a pool rather than actually farming.


The overall theme is that permissionless blockchains have to make participating in consensus expensive in some way to defend against Sybil attacks. And if participants must expend an expensive resource, economies of scale are an unavoidable part of the Sybil defense. If you want to be "decentralized" to avoid 51% attacks from insiders, you need some really powerful mechanism pushing back against economies of scale. I see three possibilities. The blockchain protocol designers:
  1. Don't understand why successful cryptocurrencies are centralized, so don't understand the need to push back on economies of scale.
  2. Do understand the need to push back on economies of scale but can't figure out how to do it. It is true that figuring this out is incredibly difficult, but their response should be to say "if the blockchain is going to end up centralized, why bother wasting resources trying to be permissionless?" not to implement something they claim is decentralized when they know it won't be.
  3. Don't care about decentralization, they just want to get rich quick, and are betting it will centralize around them.
In most cases, my money is on #3. At least both Chia and Subspace have made efforts to defuse the worst aspects of centralization.

Open Data Day 2021 – read the Report / Open Knowledge Foundation

= = = = = = =

We are really pleased to share with you our Report on Open Data Day 2021.

= = = = = = =

We wrote this report for the Open Data Day 2021 funding partners – Microsoft, UK Foreign, Commonwealth and Development Office, Mapbox, Global Facility for Disaster Reduction and Recovery, Latin American Open Data Initiative, Open Contracting Partnership and Datopian.

But we also decided to publish it here so that everyone interested in Open Data Day can learn more about the Open Data Day mini-grant scheme – and the impact of joining Open Data Day 2022 as a funding partner.

= = = = = = =

Highlights from the report include:

  • a list of the 36 countries that received mini-grants in 2021

  • a breakdown of the 56 mini-grants by World Bank region. It’s notable that most of the Open Data Day 2021 mini-grants were distributed to Sub-Saharan Africa and Latin America & the Caribbean. No mini-grants were sent to North America or the Middle East & North Africa. If you would like to help us reach these two regions in Open Data Day 2022 – please do email us at

  • a chart showing 82% of mini-grants went to Lower, Lower Middle or Upper Middle income countries (by World Bank lending group). We think this is probably about the right kind of distribution.
  • eleven case studies demonstrating the impact of the Open Data Day mini-grant programme.

= = = = = = =

To find out more, you can download the report here

Please do email us at if you would like a high resolution copy.

= = = = = = =

If you would like to learn more about Open Data Day, please visit, join the Open Data Day forum or visit the Open Knowledge Foundation blog, where we regularly post articles about Open Data Day.

Graphing China's Cryptocurrency Crackdown / David Rosenthal

Below the fold an update to last Thursday's China's Cryptocurrency Crackdown with more recent graphs.

McKenzie Sigalos reports that Bitcoin mining is now easier and more profitable as algorithm adjusts after China crackdown:
China had long been the epicenter of bitcoin miners, with past estimates indicating that 65% to 75% of the world's bitcoin mining happened there, but a government-led crackdown has effectively banished the country's crypto miners.

"For the first time in the bitcoin network's history, we have a complete shutdown of mining in a targeted geographic region that affected more than 50% of the network," said Darin Feinstein, founder of Blockcap and Core Scientific.

More than 50% of the hashrate – the collective computing power of miners worldwide – has dropped off the network since its market peak in May.
Here is the hashrate graph. The hashrate is currently 86.3EH/s, down from a peak of 180.7EH/s, so down 52.2% from the peak and trending strongly down. We may not have seen the end of the drop. This is good news for Bitcoin.

The result is that the Bitcoin system slowed down:
Typically, it takes about 10 minutes to complete a block, but Feinstein told CNBC the bitcoin network has slowed down to 14- to 19-minute block times.
And thus, as shown in the difficulty graph, the Bitcoin algorithm adjusted the difficulty:
This is precisely why bitcoin re-calibrates every 2016 blocks, or about every two weeks, resetting how tough it is for miners to mine. On Saturday, the bitcoin code automatically made it about 28% less difficult to mine – a historically unprecedented drop for the network – thereby restoring block times back to the optimal 10-minute window.
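The recalibration rule is simple enough to sketch. This is a simplification: the real client adjusts the proof-of-work target using the timestamps of the preceding 2016 blocks, but the effect is as below, with the adjustment clamped to a factor of four in either direction:

```python
def retarget(difficulty, actual_block_minutes, target_minutes=10.0):
    """Scale difficulty so block times return to the 10-minute target,
    clamped to a factor-of-4 change per 2016-block epoch."""
    ratio = target_minutes / actual_block_minutes
    ratio = max(0.25, min(4.0, ratio))
    return difficulty * ratio

# 14-minute blocks imply roughly the 28% cut Sigalos reports:
print(retarget(25.046e12, 14.0) / 25.046e12)  # ≈ 0.714
```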
It went from a peak of 25.046t to 19.933t, a drop of 20.4%. This is good news for Bitcoin, as Sigalos writes:
Fewer competitors and less difficulty means that any miner with a machine plugged in is going to see a significant increase in profitability and more predictable revenue.

"All bitcoin miners share in the same economics and are mining on the same network, so miners both public and private will see the uplift in revenue," said Kevin Zhang, former Chief Mining Officer at Greenridge Generation, the first major U.S. power plant to begin mining behind-the-meter at a large scale.

Assuming fixed power costs, Zhang estimates revenues of $29 per day for those using the latest-generation Bitmain miner, versus $22 per day prior to the change. Longer-term, although miner income can fluctuate with the price of the coin, Zhang also noted that mining revenues have dropped only 17% from the bitcoin price peak in April, whereas the coin's price has dropped about 50%.
Here is the miners' revenue graph. It went from a peak of $80.172M/day on April 15th to a trough of $13.065M/day on June 26th, a drop of 83.7%. It has since bounced back a little, so this is good news for Bitcoin, if not quite as good as Zhang thinks. Obviously, the trough was before the decrease in difficulty, which subsequently resulted in 6.25BTC rewards happening more frequently than before and thus increased miners' revenue somewhat.

Have you noticed how important it is to check the numbers that the HODL-ers throw around?

Matt Novak reported on June 21st that:
Miners in China are now looking to sell their equipment overseas, and it appears many have already found buyers. CNBC’s Eunice Yoon tweeted early Monday that a Chinese logistics firm was shipping 6,600 lbs (3,000 kilograms) of crypto mining equipment to an unnamed buyer in Maryland for just $9.37 per kilogram.
And Sigalos adds details:
Of all the possible destinations for this equipment, the U.S. appears to be especially well-positioned to absorb this stray hashrate. CNBC is told that major U.S. mining operators are already signing deals to patriate some of these homeless Bitmain miners.

U.S. bitcoin mining is booming, and has venture capital flowing to it, so they are poised to take advantage of the miner migration, Arvanaghi told CNBC.

"Many U.S. bitcoin miners that were funded when bitcoin's price started rising in November and December of 2020 means that they were already building out their power capacity when the China mining ban took hold," he said. "It's great timing."
And, as always, the HODL-ers ignore economies of scale and hold out hope for the little guy:
But Barbour believes that much smaller players in the residential U.S. also stand a chance at capturing these excess miners.

"I think this is a signal that in the future, bitcoin mining will be more distributed by necessity," said Barbour. "Less mega-mines like the 100+ megawatt ones we see in Texas and more small mines on small commercial and eventually residential spaces. It's much harder for a politician to shut down a mine in someone's garage."
It is good news for Bitcoin that more of the mining power is in the US, where the US government could suppress it by, for example, declaring that Mining Is Money Transmission and thus that pools need to adhere to the AML/KYC rules. Doing so would place the poor little guy in a garage in a dilemma: mine on their own and be unlikely to earn a reward before their rig was obsolete, or join an illegal pool and risk their traffic being spotted.


The Malaysian government's crackdown is an example to the world. Andrew Hayward reports that Police Destroy 1,069 Bitcoin Miners With Big Ass Steamroller In Malaysia.

Automatically extracting keyphrases from text / Ted Lawless

I've posted an explainer/guide to how we are automatically extracting keyphrases for Constellate, a new text analytics service from JSTOR and Portico. We are defining keyphrases as phrases of up to three words that are key, or important, to the overall subject matter of the document. Keyphrase is often used interchangeably with keywords, but we are opting for the former since it's more descriptive. We did a fair amount of reading to grasp prior art in this area (extracting keyphrases is a long-standing research topic in information retrieval and natural language processing) and ended up developing a custom solution based on term frequency in the Constellate corpus. If you are interested in this work generally, and not just the Constellate implementation, Burton DeWilde has published an excellent primer on automated keyphrase extraction. More information about Constellate can be found here. Disclaimer: this is a work-related post; I don't intend to speak for my employer, Ithaka.
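For the flavor of frequency-based keyphrase extraction, here is a minimal sketch. It is mine, not the Constellate implementation (which normalizes against corpus-wide frequencies), and the stopword list is a placeholder:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def keyphrases(text, max_words=3, top_n=5):
    """Score every 1-3 word span by raw frequency, skipping spans that
    start or end with a stopword, and return the most common."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            counts[" ".join(gram)] += 1
    return [phrase for phrase, _ in counts.most_common(top_n)]
```

Raw document-level counts favor common phrases; weighting each candidate by how unusual it is in the wider corpus is what separates key phrases from merely frequent ones.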

Venture Capital Isn't Working / David Rosenthal

I was an early employee at three VC-funded startups from the 80s and 90s. All of them IPO-ed and two (Sun Microsystems and Nvidia) made it into the list of the top 100 US companies by market capitalization. So I'm in a good position to appreciate Jeffrey Funk's must-read The Crisis of Venture Capital: Fixing America’s Broken Start-Up System. Funk starts:
Despite all the attention and investment that Silicon Valley’s recent start-ups have received, they have done little but lose money: Uber, Lyft, WeWork, Pinterest, and Snapchat have consistently failed to turn profits, with Uber’s cumulative losses exceeding $25 billion. Perhaps even more notorious are bankrupt and discredited start-ups such as Theranos, Luckin Coffee, and Wirecard, which were plagued with management failures, technical problems, or even outright fraud that auditors failed to notice.

What’s going on? There is no immediately obvious reason why this generation of start-ups should be so financially disastrous. After all, Amazon incurred losses for many years, but eventually grew to become one of the most profitable companies in the world, even as Enron and WorldCom were mired in accounting scandals. So why can’t today’s start-ups also succeed? Are they exceptions, or part of a larger, more systemic problem?
Below the fold, some reflections on Funk's insightful analysis of the "larger, more systemic problem".

Funk introduces his argument thus:
In this article, I first discuss the abundant evidence for low returns on VC investments in the contemporary market. Second, I summarize the performance of start-ups founded twenty to fifty years ago, in an era when most start-ups quickly became profitable, and the most successful ones rapidly achieved top-100 market capitalization. Third, I contrast these earlier, more successful start-ups with Silicon Valley’s current set of “unicorns,” the most successful of today’s start-ups. Fourth, I discuss why today’s start-ups are doing worse than those of previous generations and explore the reasons why technological innovation has slowed in recent years. Fifth, I offer some brief proposals about what can be done to fix our broken start-up system. Systemic problems will require systemic solutions, and thus major changes are needed not just on the part of venture capitalists but also in our universities and business schools.

Is There A Problem?

Funk's argument that there is a problem can be summarized thus:
  • The returns on VC investments over the last two decades haven't matched those of the golden years of the preceding two decades.
  • In the golden years startups made profits.
  • Now they don't.

VC Returns Are Sub-Par

This graph from a 2020 Morgan Stanley report shows that during the 90s the returns from VC investments greatly exceeded the returns from public equity. But since then the median VC return has been below that of public equity. This doesn't reward investors for the much higher risk of VC investments. The weighted average VC return is slightly above that of public equity because, as Funk explains:
a small percentage of investments does provide high returns, and these high returns for top-performing VC funds persist over subsequent quarters. Although this data does not demonstrate that select VCs consistently earn solid profits over decades, it does suggest that these VCs are achieving good returns.
It was always true that VC quality varied greatly. I discussed the advantages of working with great VCs in Kai Li's FAST Keynote:
Work with the best VC funds. The difference between the best and the merely good in VCs is at least as big as the difference between the best and the merely good programmers. At nVIDIA we had two of the very best, Sutter Hill and Sequoia. The result is that, like Kai but unlike many entrepreneurs, we think VCs are enormously helpful.
One thing that was striking about working with Sutter Hill was how many entrepreneurs did a series of companies with them, showing that both sides had positive experiences.

Startups Used To Make Profits

Before the dot-com boom, there used to be a rule that in order to IPO a company, it had to be making profits. This was a good rule, since it provided at least some basis for setting the stock price at the IPO. Funk writes:
There was a time when venture capital generated big returns for investors, employees, and customers alike, both because more start-ups were profitable at an earlier stage and because some start-ups achieved high market capitalization relatively quickly. Profits are an important indicator of economic and technological growth, because they signal that a company is providing more value to its customers than the costs it is incurring.

A number of start-ups founded in the late twentieth century have had an enormous impact on the global economy, quickly reaching both profitability and top-100 market capitalization. Among these are the so-called FAANMG (Facebook, Amazon, Apple, Microsoft, Netflix, and Google), which represented more than 25 percent of the S&P’s total market capitalization and more than 80 percent of the 2020 increase in the S&P’s total value at one point—in other words, the most valuable and fastest-growing compa­nies in America in recent years.
Funk's Table 2 shows the years to profitability and years to top-100 market capitalization for companies founded between 1975 and 2004. I'm a bit skeptical of the details because, for example, the table says it took Sun Microsystems 6 years to turn a profit. I'm pretty sure Sun was profitable at its 1986 IPO, 4 years from its founding.

Note Funk's stress on achieving profitability quickly. An important Silicon Valley philosophy used to be:
  • Success is great!
  • Failure is OK.
  • Not doing either is a big problem.
The reason lies in the Silicon Valley mantra of "fail fast". Most startups fail, and the costs of those failures detract from the returns of the successes. Minimizing the cost of failure, and diverting the resource to trying something different, is important.

Unicorns, Not So Much

What are these unicorns? Wikipedia tells us:
In business, a unicorn is a privately held startup company valued at over $1 billion. The term was coined in 2013 by venture capitalist Aileen Lee, choosing the mythical animal to represent the statistical rarity of such successful ventures.
Back in 2013 unicorns were indeed rare, but as Wikipedia goes on to point out:
According to CB Insights, there are over 450 unicorns as of October 2020.
Unicorns are breeding like rabbits, but the picture Funk paints is depressing:
In the contemporary start-up economy, “unicorns” are purportedly “disrupting” almost every industry from transportation to real estate, with new business software, mobile apps, consumer hardware, internet services, biotech, and AI products and services. But the actual performance of these unicorns both before and after the VC exit stage contrasts sharply with the financial successes of the previous generation of start-ups, and suggests that they are dramatically overvalued.

Figure 3 shows the profitability distribution of seventy-three unicorns and ex-unicorns that were founded after 2013 and have released net income and revenue figures for 2019 and/or 2020. In 2019, only six of the seventy-three unicorns included in figure 3 were profitable, while for 2020, seven of seventy were.
Hey, they're startups, right? They just need time to become profitable. Funk debunks that idea too:
Furthermore, there seems to be little reason to believe that these unprofitable unicorn start-ups will ever be able to grow out of their losses, as can be seen in the ratio of losses to revenues in 2019 versus the founding year. Aside from a tiny number of statistical outliers ... there seems to be little relationship between the time since a start-up’s founding and its ratio of losses to revenues. In other words, age is not correlated with profits for this cohort.
Funk goes on to note that startup profitability once public has declined dramatically, and appears inversely related to IPO valuation:
When compared with profitability data from decades past, recent start-ups look even worse than already noted. About 10 percent of the unicorn start-ups included in figure 3 were profitable, much lower than the 80 percent of start-ups founded in the 1980s that were profitable, according to Jay Ritter’s analysis, and also below the overall percentage for start-ups today (20 percent). Thus, not only has profitability dramatically dropped over the last forty years among those start-ups that went public, but today’s most valuable start-ups—those valued at $1 billion or more before IPO—are in fact less profitable than start-ups that did not reach such lofty pre-IPO valuations.
Funk uses electric vehicles and biotech to illustrate startup over-valuation:
For instance, driven by easy money and the rapid rise of Tesla’s stock, a group of electric vehicle and battery suppliers—Canoo, Fisker Automotive, Hyliion, Lordstown Motors, Nikola, and QuantumScape—were valued, combined, at more than $100 billion at their listing. Likewise, dozens of biotech firms have also achieved billions of dollars in market capitalizations at their listings. In total, 2020 set a new record for the number of companies going public with little to no revenue, easily eclipsing the height of the dot-com boom of telecom companies in 2000.
The Alphaville team have been maintaining a spreadsheet of the EV bubble. They determined that there was no way these companies' valuations could be justified given the size of the potential market. Jamie Powell's April 12th Revisiting the EV bubble spreadsheet celebrates their assessment:
At pixel time the losses from their respective peaks from all of the electric vehicle, battery and charging companies on our list total some $635bn of market capitalisation, or a fall of just under 38 per cent. Ouch.

What Is Causing The Problem

This all looks like too much money chasing too few viable startups, and too many me-too startups chasing too few total available market dollars.

Funk starts his analysis of the causes of poor VC returns by pointing to the obvious one, one that applies to any successful investment strategy. Its returns will be eroded over time by the influx of too much money:
There are many reasons for both the lower profitability of start-ups and the lower returns for VC funds since the mid to late 1990s. The most straightforward of these is simply diminishing returns: as the amount of VC investment in the start-up market has increased, a larger proportion of this funding has necessarily gone to weaker opportunities, and thus the average profitability of these investments has declined.
But the effect of too much money is even more corrosive. I'm a big believer in Bill Joy's Law of Startups — "success is inversely proportional to the amount of money you have". Too much money allows hard decisions to be put off. Taking hard decisions promptly is key to "fail fast".

Nvidia was an example of this. The company was founded in one of Silicon Valley's recurring downturns. We were the only hardware company funded in that quarter. We got to working silicon on a $2.5M A round. Think about it — each of our VCs invested $1.25M to start a company currently valued at $380,000M. Despite delivering ground-breaking performance, as I discussed in Hardware I/O Virtualization, that chip wasn't a success. But it did allow Jen-Hsun Huang to raise another $6.5M. He down-sized the company by 2/3 and got to working silicon of the highly successful second chip with, IIRC, six weeks' money left in the bank.

Funk then discusses a second major reason for poor performance:
A more plausible explanation for the relative lack of start-up successes in recent years is that new start-ups tend to be acquired by large incumbents such as the faamng companies before they have a chance to achieve top 100 market capitalization. For instance, YouTube was founded in 2004 and Instagram in 2010; some claim they would be valued at more than $150 billion each (pre-lockdown estimates) if they were independent companies, but instead they were acquired by Google and Facebook, respectively.18 In this sense, they are typical of the recent trend: many start-ups founded since 2000 were subsequently acquired by faamng, including new social media companies such as GitHub, LinkedIn, and WhatsApp. Likewise, a number of money-losing start-ups have been acquired in recent years, most notably DeepMind and Nest, which were bought by Google.
But he fails to note the cause of the rash of acquisitions, which is clearly the total Lack Of Anti-Trust Enforcement in the US. As with too much money, the effects of this lack are more pernicious than at first appears. Again, Nvidia provides an example.

Just like the founders and VCs of Sun, when we started Nvidia we knew that the route to an IPO and major return on investment involved years and several generations of product. So, despite the limited funding and with the full support of our VCs, we took several critical months right at the start to design an architecture for a family of successive chip generations based on Hardware I/O Virtualization. By ensuring that the drivers in application software interacted only with virtual I/O resources, the architecture decoupled the hardware and software release cycles. The strong linkage between them at Sun had been a consistent source of schedule slip.

The architecture also structured the implementation of the chip as a set of modules communicating via an on-chip network. Each module was small enough that a three-person team could design, simulate and verify it. The restricted interface to the on-chip network meant that, if the modules verified correctly, it was highly likely that the assembled chip would verify correctly.

Laying the foundations for a long-term product line in this way paid massive dividends. After the second chip, Nvidia was able to deliver a new chip generation every 6 months like clockwork. Six months after we started Nvidia, we knew of over 30 other startups addressing the same market. Only one, ATI, survived the competition with Nvidia's 6-month product cycle.

VCs now would be hard to persuade that the return on the initial time and money to build a company that could IPO years later would be worth it when compared to lashing together a prototype and using it to sell the company to one of the FAANMGs. In many cases, simply recruiting a team that could credibly promise to build the prototype would be enough for an "aqui-hire", where a FAANMG buys a startup not for the product but for the people. Building the foundation for a company that can IPO and make it into the top-100 market cap list is no longer worth the candle.

But Funk argues that the major cause of lower returns is this:
Overall, the most significant problem for today’s start-ups is that there have been few if any new technologies to exploit. The internet, which was a breakthrough technology thirty years ago, has matured. As a result, many of today’s start-up unicorns are comparatively low-tech, even with the advent of the smartphone—perhaps the biggest technological breakthrough of the twenty-first century—fourteen years ago. Ridesharing and food delivery use the same vehicles, drivers, and roads as previous taxi and delivery services; the only major change is the replacement of dispatchers with smartphones. Online sales of juicers, furniture, mattresses, and exercise bikes may have been revolutionary twenty years ago, but they are sold in the same way that Amazon currently sells almost everything. New business software operates from the cloud rather than onsite computers, but pre-2000 start-ups such as Amazon, Google, and Oracle were already pursuing cloud computing before most of the unicorns were founded.
Remember, Sun's slogan in the mid 80s was "The network is the computer"!

Virtua Fighter on NV1
In essence, Funk argues that successful startups out-perform by being quicker than legacy companies to exploit the productivity gains made possible by a technological discontinuity. Nvidia was an example of this, too. The technological discontinuity was the transition of the PC from the ISA to the PCI bus. It wasn't possible to do 3D games over the ISA bus; it lacked the necessary bandwidth. The increased bandwidth of the first version of the PCI bus made it just barely possible, as Nvidia's first chip demonstrated by running Sega arcade games at full frame rate. The advantages startups have against incumbents include:
  • An experienced, high-quality team. Initial teams at startups are usually recruited from colleagues, so they are used to working together and know each other's strengths and weaknesses. Jen-Hsun Huang was well-known at Sun, having been the application engineer for LSI Logic on Sun's first SPARC implementation. The rest of the initial team at Nvidia had all worked together building graphics chips at Sun. As the company grows it can no longer recruit only colleagues, so usually experiences what at Sun was called the "bozo invasion".
  • Freedom from backwards compatibility constraints. Radical design change is usually needed to take advantage of a technological discontinuity. Reconciling this with backwards compatibility takes time and forces compromise. Nvidia was able to ignore the legacy of program I/O from the ISA bus and fully exploit the Direct Memory Access capability of the PCI bus from the start.
  • No cash cow to defend. The IBM-funded Andrew project at CMU was intended to deploy what became the IBM PC/RT, which used the ROMP, an IBM RISC CPU competing with Sun's SPARC. The ROMP was so fast that IBM's other product lines saw it as a threat, and insisted that it be priced not to under-cut their existing product's price/performance. So when it finally launched, its price/performance was much worse than Sun's SPARC-based products, and it failed.
Funk concludes this section:
In short, today’s start-ups have targeted low-tech, highly regulated industries with a business strategy that is ultimately self-defeating: raising capital to subsidize rapid growth and securing a competitive position in the market by undercharging consumers. This strategy has locked start-ups into early designs and customer pools and prevented the experimentation that is vital to all start-ups, including today’s unicorns. Uber, Lyft, DoorDash, and GrubHub are just a few of the well-known start-ups that have pursued this strategy, one that is used by almost every start-up today, partly in response to the demands of VC investors. It is also highly likely that without the steady influx of capital that subsidizes below-market prices, demand for these start-ups’ services would plummet, and thus their chances of profitability would fall even further. In retrospect, it would have been better if start-ups had taken more time to find good, high-tech business opportunities, had worked with regulators to define appropriate behavior, and had experimented with various technologies, designs, and markets, making a profit along the way.
But, if the key to startup success is exploiting a technological discontinuity, and there haven't been any to exploit, as Funk argues earlier, taking more time to "find good, high-tech business opportunities" wouldn't have helped. They weren't there to be found.

How To Fix The Problem?

Funk quotes Charles Duhigg skewering the out-dated view of VCs:
For decades, venture capitalists have succeeded in defining themselves as judicious meritocrats who direct money to those who will use it best. But examples like WeWork make it harder to believe that V.C.s help balance greedy impulses with enlightened innovation. Rather, V.C.s seem to embody the cynical shape of modern capitalism, which too often rewards crafty middlemen and bombastic charlatans rather than hardworking employees and creative businesspeople.
Venture capitalists have shown themselves to be far less capable of commercializing breakthrough technologies than they once were. Instead, as recently outlined in the New Yorker, they often seem to be superficial trend-chasers, all going after the same ideas and often the same entrepreneurs. One managing partner at SoftBank summarized the problem faced by VC firms in a marketplace full of copycat start-ups: “Once Uber is founded, within a year you suddenly have three hundred copycats. The only way to protect your company is to get big fast by investing hundreds of millions.”
VCs like these cannot create the technological discontinuities that are the key to adequate returns on investment in startups:
we need venture capitalists and start-ups to create new products and new businesses that have higher productivity than do existing firms; the increased revenue that follows will then enable these start-ups to pay higher wages. The large productivity advantages needed can only be achieved by developing breakthrough technologies, like the integrated circuits, lasers, magnetic storage, and fiber optics of previous eras. And different players—VCs, start-ups, incumbents, universities—will need to play different roles in each industry. Unfortunately, none of these players is currently doing the jobs required for our start-up economy to function properly.

Business Schools

Success in exploiting a technological discontinuity requires understanding of, and experience with, the technology, its advantages and its limitations. But Funk points out that business schools, not being engineering schools, need to devalue this requirement. Instead, they focus on "entrepreneurship":
In recent decades, business schools have dramatically increased the number of entrepreneurship programs—from about sixteen in 1970 to more than two thousand in 2014—and have often marketed these programs with vacuous hype about “entrepreneurship” and “technology.” A recent Stanford research paper argues that such hype about entrepreneurship has encouraged students to become entrepreneurs for the wrong reasons and without proper preparation, with universities often presenting entrepreneurship as a fun and cool lifestyle that will enable them to meet new people and do interesting things, while ignoring the reality of hard and demanding work necessary for success.
One of my abiding memories of Nvidia is Tench Coxe, our partner at Sutter Hill, perched on a stool in the lab playing the "Road Rash" video game about 2am one morning as we tried to figure out why our first silicon wasn't working. He was keeping an eye on his investment, and providing a much-needed calming influence.

Focus on entrepreneurship means focus on the startup's business model not on its technology:
A big mistake business schools make is their unwavering focus on business model over technology, thus deflecting any probing questions students and managers might have about what role technological breakthroughs play and why so few are being commercialized. For business schools, the heart of a business model is its ability to capture value, not the more important ability to create value. This prioritization of value capture is tied to an almost exclusive focus on revenue: whether revenues come from product sales, advertising, subscriptions, or referrals, and how to obtain these revenues from multiple customers on platforms. Value creation, however, is dependent on technological improvement, and the largest creation of value comes from breakthrough technologies such as the automobile, microprocessor, personal computer, and internet commerce.
The key to "capturing value" is extracting value via monopoly rents. The way to get monopoly rents is to subsidize customer acquisition and buy up competitors, until the customers have no place to go. This doesn't create any value. In fact, once the monopolist has burnt through the investors' money they find they need a return that can only be obtained by raising prices and holding the customer to ransom, destroying value for everyone.

It is true a startup that combines innovation in technology with innovation in business has an advantage. Once more, Nvidia provides an example. Before starting Nvidia, Jen-Hsun Huang had run a division of LSI Logic that traded access to LSI Logic's fab for equity in the chips it made. Based on this experience on the supplier side of the fabless semiconductor business, one of his goals for Nvidia was to re-structure the relationship between the fabless company and the fab to be more of a win-win. Nvidia ended up as one of the most successful fabless companies of all time. But note that the innovation didn't affect Nvidia's basic business model — contract with fabs to build GPUs, and sell them to PC and graphics board companies. A business innovation combined with technological innovation stands a chance of creating a big company; a business innovation with no technology counterpart is unlikely to.


Funk assigns much blame for the lack of breakthrough technologies to Universities:
University engineering and science programs are also failing us, because they are not creating the breakthrough technologies that America and its start-ups need. Although some breakthrough technologies are assembled from existing components and thus are more the responsibility of private companies—for instance, the iPhone—universities must take responsibility for science-based technologies that depend on basic research, technologies that were once more common than they are now.
Note that Funk accepts as a fait accompli the demise of corporate research labs, which certainly used to do the basic research that led not just to Funk's examples of "semiconductors, lasers, LEDs, glass fiber, and fiber optics", but also, for example, to packet switching, and operating systems such as Unix. As I did three years ago in Falling Research Productivity, he points out that increased government and corporate funding of University research has resulted in decreased output of breakthrough technologies:
Many scientists point to the nature of the contemporary university research system, which began to emerge over half a century ago, as the problem. They argue that the major breakthroughs of the early and mid-twentieth century, such as the discovery of the DNA double helix, are no longer possible in today’s bureaucratic, grant-writing, administration-burdened university. ... Scientific merit is measured by citation counts and not by ideas or by the products and services that come from those ideas. Thus, labs must push papers through their research factories to secure funding, and issues of scientific curiosity, downstream products and services, and beneficial contributions to society are lost.
Funk's analysis of the problem is insightful, but I see his ideas for fixing University research as simplistic and impractical:
A first step toward fixing our sclerotic university research system is to change the way we do basic and applied research in order to place more emphasis on projects that may be riskier but also have the potential for greater breakthroughs. We can change the way proposals are reviewed and evaluated. We can provide incentives to universities that will encourage them to found more companies or to do more work with companies.
Funk clearly doesn't understand how much University research is already funded by companies, and how long attempts to change the reward system in Universities have been crashing into the rock comprised of senior faculty who achieved their position through the existing system.

He is more enthusiastic but equally misled about how basic research in corporate labs could be revived:
One option is to recreate the system that existed prior to the 1970s, when most basic research was done by companies rather than universities. This was the system that gave us transistors, lasers, LEDs, magnetic storage, nuclear power, radar, jet engines, and polymers during the 1940s and 1950s. ... Unlike their predecessors at Bell Labs, IBM, GE, Motorola, DuPont, and Monsanto seventy years ago, top university scientists are more administrators than scientists now—one of the greatest misuses of talent the world has ever seen. Corporate labs have smaller administrative workloads because funding and promotion depend on informal discussions among scientists and not extensive paperwork.
Not understanding the underlying causes of the demise of corporate research labs, Funk reaches for the time-worn nostrums of right-wing economists, "tax credits and matching grants":
We can return basic research to corporate labs by providing much stronger incentives for companies—or cooperative alliances of companies—to do basic research. A scheme of substantial tax credits and matching grants, for instance, would incentivize corporations to do more research and would bypass the bureaucracy-laden federal grant process. This would push the management of detailed technological choices onto scientists and engineers, and promote the kind of informal discussions that used to drive decisions about technological research in the heyday of the early twentieth century. The challenge will be to ensure these matching funds and tax credits are in fact used for basic research and not for product development. Requiring multiple companies to share research facilities might be one way to avoid this danger, but more research on this issue is needed.
In last year's The Death Of Corporate Research Labs I discussed a really important paper from a year earlier by Arora et al, The changing structure of American innovation: Some cautionary remarks for economic growth, which Funk does not cite. I wrote:
Arora et al point out that the rise and fall of the labs coincided with the rise and fall of anti-trust enforcement:
Historically, many large labs were set up partly because antitrust pressures constrained large firms’ ability to grow through mergers and acquisitions. In the 1930s, if a leading firm wanted to grow, it needed to develop new markets. With growth through mergers and acquisitions constrained by anti-trust pressures, and with little on offer from universities and independent inventors, it often had no choice but to invest in internal R&D. The more relaxed antitrust environment in the 1980s, however, changed this status quo. Growth through acquisitions became a more viable alternative to internal research, and hence the need to invest in internal research was reduced.
Lack of anti-trust enforcement, pervasive short-termism, driven by Wall Street's focus on quarterly results, and management's focus on manipulating the stock price to maximize the value of their options killed the labs:
Large corporate labs, however, are unlikely to regain the importance they once enjoyed. Research in corporations is difficult to manage profitably. Research projects have long horizons and few intermediate milestones that are meaningful to non-experts. As a result, research inside companies can only survive if insulated from the short-term performance requirements of business divisions. However, insulating research from business also has perils. Managers, haunted by the spectre of Xerox PARC and DuPont’s “Purity Hall”, fear creating research organizations disconnected from the main business of the company. Walking this tightrope has been extremely difficult. Greater product market competition, shorter technology life cycles, and more demanding investors have added to this challenge. Companies have increasingly concluded that they can do better by sourcing knowledge from outside, rather than betting on making game-changing discoveries in-house.
It is pretty clear that "tax credits and matching grants" aren't the fix for the fundamental anti-trust problem. Not to mention that the idea of "Requiring multiple companies to share research facilities" in and of itself raises serious anti-trust concerns. After such a good analysis, it is disappointing that Funk's recommendations are so feeble.

We have to add inadequate VC returns and a lack of startups capable of building top-100 companies to the long list of problems that only a major overhaul of anti-trust enforcement can fix. Lina Khan's nomination to the FTC is a hopeful sign that the Biden administration understands the urgency of changing direction, but Biden's hesitation about nominating the DOJ's anti-trust chief is not.

Update: Michael Cembalest's Food Fight: An update on private equity performance vs public equity markets has a lot of fascinating information about private equity in general and venture capital in particular. His graphs comparing MOIC (Multiple Of Invested Capital) and IRR (Internal Rate of Return) across vintage years support his argument that:
We have performance data for venture capital starting in the mid-1990s, but the period is so distorted by the late 1990’s boom and bust that we start our VC performance discussion in 2004. In my view, the massive gains earned by VC managers in the mid-1990s are not relevant to a discussion of VC investing today. As with buyout managers, VC manager MOIC and IRR also tracked each other until 2012 after which a combination of subscription lines and faster distributions led to rising IRRs despite falling MOICs. There’s a larger gap between average and median manager results than in buyout, indicating that there are a few VC managers with much higher returns and/or larger funds that pull up the average relative to the median.
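For readers unfamiliar with the two metrics, a toy calculation shows why they can diverge. MOIC is simply total distributions divided by paid-in capital, while IRR is the discount rate that zeroes the net present value of the dated cash flows, so earlier distributions (the effect of subscription lines and faster distributions) raise IRR without changing MOIC at all. The cash flows below are invented for illustration:

```python
def moic(contributions, distributions):
    """Multiple of invested capital: total cash returned / total cash invested."""
    return sum(distributions) / sum(contributions)

def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-9):
    """Annual IRR via bisection: the rate r at which the NPV of the
    (year, amount) cash flows is zero. Contributions are negative,
    distributions positive. NPV is monotonically decreasing in r here."""
    def npv(r):
        return sum(amt / (1 + r) ** yr for yr, amt in cash_flows)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if npv(mid) > 0:
            lo = mid    # NPV still positive: the root lies above mid
        else:
            hi = mid
    return (lo + hi) / 2

# Same $100 in, $300 back (MOIC 3.0x) -- only the timing differs.
slow = [(0, -100.0), (10, 300.0)]   # distributed after 10 years
fast = [(0, -100.0), (4, 300.0)]    # distributed after 4 years

print(moic([100.0], [300.0]))   # 3.0 in both cases
print(round(irr(slow), 3))      # 0.116, i.e. ~11.6%/yr
print(round(irr(fast), 3))      # 0.316, i.e. ~31.6%/yr
```

This is why Cembalest can observe rising IRRs despite falling MOICs: pulling the same distributions forward in time nearly triples the annualized rate without returning an extra dollar.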

The gap is pretty big:
VC managers have consistently outperformed public equity markets when looking at the “average” manager. But to reiterate, the gap between average and median results are substantial and indicate outsized returns posted by a small number of VC managers. For vintage years 2004 to 2008, the median VC manager actually underperformed the S&P 500 pretty substantially.
Another of Cembalest's fascinating graphs addresses this question:
One of the other “food fight” debates relates to pricing of venture-backed companies that go public. In other words, do venture investors reap the majority of the benefits, leaving public market equity investors “holding the bag”? Actually, the reverse has been true over the last decade when measured in terms of total dollars of value creation accruing to pre- and post-IPO investors: post-IPO investor gains have often been substantial.

To show this:
We analyzed all US tech, internet retailing and interactive media IPOs from 2010 to 2019. We computed the total value created since each company’s founding, from original paid-in capital by VCs to its latest market capitalization. We then examined how total value creation has accrued to pre- and post-IPO investors. Sometimes both investor types share the gains, and sometimes one type accrues the vast majority of the gains. Pre-IPO investors earn the majority of the pie when IPOs collapse or flat-line after being issued, and post-IPO investors reap the majority of the pie when IPOs appreciate substantially after being issued.

There are three general regions in the chart. As you can see, the vast majority of the 165 IPOs analyzed resulted in a large share of the total value creation accruing to public market equity investors; nevertheless, there were some painful exceptions (see lower left region on the chart).

A Modest Proposal About Ransomware / David Rosenthal

On the evening of July 2nd the REvil ransomware gang exploited a 0-day vulnerability to launch a supply chain attack on customers of Kaseya's Virtual System Administrator (VSA) product. The timing was perfect, with most system administrators off for the July 4th long weekend. By the 6th Alex Marquardt reported that Kaseya says up to 1,500 businesses compromised in massive ransomware attack. REvil, which had previously extorted $11M from meat giant JBS, announced that for the low, low price of only $70M they would provide everyone with a decryptor.

The US government's pathetic response is to tell the intelligence agencies to investigate and to beg Putin to crack down on the ransomware gangs. Good luck with that! It isn't his problem, because the gangs write their software to avoid encrypting systems that have default languages from the former USSR.
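The locale check is trivially simple to implement, which is one reason the convention is so widespread among the gangs. A minimal sketch of the idea follows; the locale list is abbreviated and illustrative, drawn from public malware analyses, not from any gang's actual code:

```python
# Illustrative, abbreviated list of former-USSR locales that many
# ransomware families (per public malware analyses) check for before
# deciding whether to encrypt a host.
EXCLUDED_LOCALES = {"ru_RU", "uk_UA", "be_BY", "kk_KZ", "hy_AM", "az_AZ", "uz_UZ"}

def would_be_skipped(system_locale: str) -> bool:
    """Return True if a gang following this convention would leave the host alone.
    Strips any codeset suffix, e.g. "ru_RU.UTF-8" -> "ru_RU"."""
    return system_locale.split(".")[0] in EXCLUDED_LOCALES

print(would_be_skipped("ru_RU.UTF-8"))  # True
print(would_be_skipped("en_US.UTF-8"))  # False
```

Real families typically query the installed keyboard layouts via the Windows API rather than a POSIX locale string, but the decision logic is just this kind of set membership test, which is why the gangs can so cheaply stay out of Putin's way.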

I've written before (here, here, here) about the importance of disrupting the cryptocurrency payment channel that enables ransomware, but it looks like the ransomware crisis has to get a great deal worse before effective action is taken. Below the fold I lay out a modest proposal that could motivate actions that would greatly reduce the risk.

It turns out that the vulnerability that enabled the REvil attack didn't meet the strict definition of a 0-day. Gareth Corfield's White hats reported key Kaseya VSA flaw months ago. Ransomware outran the patch explains:
Rewind to April, and the Dutch Institute for Vulnerability Disclosure (DIVD) had privately reported seven security bugs in VSA to Kaseya. Four were fixed and patches released in April and May. Three were due to be fixed in an upcoming release, version 9.5.7.

Unfortunately, one of those unpatched bugs – CVE-2021-30116, a credential-leaking logic flaw discovered by DIVD's Wietse Boonstra – was exploited by the ransomware slingers before its fix could be emitted.
DIVD praised Kaseya's response:
Once Kaseya was aware of our reported vulnerabilities, we have been in constant contact and cooperation with them. When items in our report were unclear, they asked the right questions. Also, partial patches were shared with us to validate their effectiveness.

During the entire process, Kaseya has shown that they were willing to put in the maximum effort and initiative into this case both to get this issue fixed and their customers patched. They showed a genuine commitment to do the right thing. Unfortunately, we were beaten by REvil in the final sprint, as they could exploit the vulnerabilities before customers could even patch.
But if Kaseya's response to DIVD's disclosure was praiseworthy, it turns out it was the exception. In Kaseya was warned about security flaws years ahead of ransomware attack, J. Fingas reports that:
The giant ransomware attack against Kaseya might have been entirely avoidable. Former staff talking to Bloomberg claim they warned executives of "critical" security flaws in Kaseya's products several times between 2017 and 2020, but that the company didn't truly address them. Multiple staff either quit or said they were fired over inaction.

Employees reportedly complained that Kaseya was using old code, implemented poor encryption and even failed to routinely patch software. The company's Virtual System Administrator (VSA), the remote maintenance tool that fell prey to ransomware, was supposedly rife with enough problems that workers wanted the software replaced.

One employee claimed he was fired two weeks after sending executives a 40-page briefing on security problems. Others simply left in frustration with a seeming focus on new features and releases instead of fixing basic issues. Kaseya also laid off some employees in 2018 in favor of outsourcing work to Belarus, which some staff considered a security risk given local leaders' partnerships with the Russian government.
The company's software was reportedly used to launch ransomware at least twice between 2018 and 2019, and it didn't significantly rethink its security strategy.
To reiterate:
  • The July 2nd attack was apparently at least the third time Kaseya had infected customers with ransomware!
  • Kaseya outsourced development to Belarus, a country where ransomware gangs have immunity!
  • Kaseya fired security whistleblowers!
The first two incidents didn't seem to make either Kaseya or its customers re-think what they were doing. Clearly, the only reason Kaseya responded to DIVD's warning was the threat of public disclosure.

Without effective action to change this attitude the ransomware crisis will definitely result in what Stephen Diehl calls The Oncoming Ransomware Storm:
Imagine a hundred new Stuxnet-level exploits every day, for every piece of equipment in public works and health care. Where every day you check your phone for the level of ransomware in the wild just like you do the weather. Entire cities randomly have their metro systems, water, power grids and internet shut off and on like a sudden onset of bad cybersecurity “weather”.

Or a time in business in which every company simply just allocates a portion of its earnings upfront every quarter and pre-pays off large ransomware groups in advance. It’s just a universal cost of doing business and one that is fully sanctioned by the government because we’ve all just given up trying to prevent it and it’s more efficient just to pay the protection racket.
To make things worse, companies can insure against the risk of ransomware, essentially paying to avoid the hassle of maintaining security. Insurance companies can't price these policies properly, because they can't do enough underwriting to know, for example, whether the customer's backups actually work and whether they are offline enough so the ransomware doesn't encrypt them too.

In Cyber insurance model is broken, consider banning ransomware payments, says think tank Gareth Corfield reports on the Royal United Services Institute's (RUSI) latest report, Cyber Insurance and the Cyber Security Challenge:
Unfortunately, RUSI's researchers found that insurers tend to sell cyber policies with minimal due diligence – and when the claims start rolling in, insurance company managers start looking at ways to escape an unprofitable line of business.
RUSI's position on buying off criminals is unequivocal, with [Jason] Nurse and co-authors Jamie MacColl and James Sullivan saying in their report that the UK's National Security Secretariat "should conduct an urgent policy review into the feasibility and suitability of banning ransom payments."
The fundamental problem is that neither the software vendors nor the insurers nor their customers are taking security seriously enough because it isn't a big enough crisis yet. The solution? Take control of the crisis and make it big enough that security gets taken seriously.

The US always claims to have the best cyber-warfare capability on the planet, so presumably they could do ransomware better and faster than gangs like REvil. The US should use this capability to mount ransomware attacks against US companies as fast as they can. Victims would see, instead of a screen demanding a ransom in Monero to decrypt their data, a screen saying:
US Government CyberSecurity Agency

Patch the following vulnerabilities immediately!

The CyberSecurity Agency (CSA) used some or all of the following vulnerabilities to compromise your systems and display this notice:
  • CVE-2021-XXXXX
  • CVE-2021-YYYYY
  • CVE-2021-ZZZZZ
Three days from now if these vulnerabilities are still present, the CSA will encrypt your data. You will be able to obtain free decryption assistance from the CSA once you can prove that these vulnerabilities are no longer present.
If the victim ignored the notice, three days later they would see:
US Government CyberSecurity Agency

The CyberSecurity Agency (CSA) used some or all of the following vulnerabilities to compromise your systems and encrypt your data:
  • CVE-2021-XXXXX
  • CVE-2021-YYYYY
  • CVE-2021-ZZZZZ
Once you have patched these vulnerabilities, click here to decrypt your data

Three days from now if these vulnerabilities are still present, the CSA will re-encrypt your data. For a fee you will be able to obtain decryption assistance from the CSA once you can prove that these vulnerabilities are no longer present.
The program would start out fairly gentle and ramp up, shortening the grace period to increase the impact.

The program would motivate users to keep their systems up-to-date with patches for disclosed vulnerabilities, which would not merely help with ransomware, but also with botnets, data breaches and other forms of malware. It would also raise the annoyance factor customers face when their supplier fails to provide adequate security in their products. This in turn would provide reputational and sales pressure on suppliers to both secure their supply chain and, unlike Kaseya, prioritize security in their product development.
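The "patch promptly" discipline the program would enforce can be partly automated on the victim's side. A minimal sketch of the idea follows; the advisory table is hypothetical (though CVE-2021-30116 and its fix in VSA 9.5.7 are real, per the DIVD disclosure above), and real tooling would pull advisory data from a vulnerability feed rather than a hard-coded dict:

```python
# Hypothetical advisory table: CVE -> (product, first fixed version).
# Versions compare lexicographically as integer tuples, e.g. "9.5.7" -> (9, 5, 7).
ADVISORIES = {
    "CVE-2021-30116": ("vsa", (9, 5, 7)),
}

def parse_version(v: str) -> tuple:
    return tuple(int(part) for part in v.split("."))

def unpatched(installed: dict) -> list:
    """List the CVEs still open, given {product: installed_version_string}."""
    open_cves = []
    for cve, (product, fixed) in ADVISORIES.items():
        version = installed.get(product)
        if version is not None and parse_version(version) < fixed:
            open_cves.append(cve)
    return sorted(open_cves)

print(unpatched({"vsa": "9.5.6"}))  # ['CVE-2021-30116']
print(unpatched({"vsa": "9.5.7"}))  # []
```

The point of the CSA's escalating notices is precisely to make running a check like this, and acting on its output within the grace period, cheaper than the alternative.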

Of course, the program above only handles disclosed vulnerabilities, not the 0-days REvil used. There is a flourishing trade in 0-days, of which the NSA is believed to be a major buyer. The supply in these markets is increasing, as Dan Goodin reports in iOS zero-day let SolarWinds hackers compromise fully updated iPhones:
In the first half of this year, Google’s Project Zero vulnerability research group has recorded 33 zero-day exploits used in attacks—11 more than the total number from 2020. The growth has several causes, including better detection by defenders and better software defenses that require multiple exploits to break through.

The other big driver is the increased supply of zero-days from private companies selling exploits.

“0-day capabilities used to be only the tools of select nation-states who had the technical expertise to find 0-day vulnerabilities, develop them into exploits, and then strategically operationalize their use,” the Google researchers wrote. “In the mid-to-late 2010s, more private companies have joined the marketplace selling these 0-day capabilities. No longer do groups need to have the technical expertise; now they just need resources.”

The iOS vulnerability was one of four in-the-wild zero-days Google detailed on Wednesday.
Based on their analysis, the researchers assess that three of the exploits were developed by the same commercial surveillance company, which sold them to two different government-backed actors.
As has been true since the Cold War era and the "Crypto Wars" of the 1990s, when cryptography was considered a munition, the US has prioritized attack over defense. The NSA routinely hoards 0-days, preferring to use them to attack foreigners rather than disclose them to protect US citizens (and others). This short-sighted policy has led to several disasters, including the Juniper supply-chain compromise and NotPetya. Senators wrote to the head of the NSA, and the EFF sued the Director of National Intelligence, to obtain the NSA's policy around 0-days:
Since these vulnerabilities potentially affect the security of users all over the world, the public has a strong interest in knowing how these agencies are weighing the risks and benefits of using zero days instead of disclosing them to vendors,
It would be bad enough if the NSA and other nations' security services were the only buyers of 0-days. But the $11M REvil received from JBS buys a lot of them, and if each could net $70M they'd be a wonderful investment. Forcing ransomware gangs to use 0-days by getting systems up-to-date with patches is good, but the gangs will have 0-days to use. So although the program above should indirectly reduce the supply (and thus increase the price) of 0-days by motivating vendors to improve their development and supply chain practices, something needs to be done to reduce the impact of 0-days on ransomware.

The Colonial Pipeline and JBS attacks, not to mention the multiple hospital chains that have been disrupted, show that it is just a matter of time before a ransomware attack has a major impact on US GDP (and incidentally on US citizens). In this light, the idea that NSA should stockpile 0-days for possible future use is counter-productive. At any time 0-days in the hoard might leak, or be independently discovered. In the past the fallout from this was limited, but no longer; they might be used for a major ransomware attack. Is the National Security Agency's mission to secure the United States, or to have fun playing Team America: World Police in cyberspace?

Unless they are immediately required for a specific operation, the NSA should disclose 0-days it discovers or purchases to the software vendor, and once patched, add them to the kit it uses to run its "ransomware" program. To do less is to place the US economy at risk.

PS: David Sanger reported Tuesday that Russia’s most aggressive ransomware group disappeared, and it’s unclear who disabled them:
Just days after President Biden demanded that President Vladimir V. Putin of Russia shut down ransomware groups attacking American targets, the most aggressive of the groups suddenly went off-line early Tuesday.
A third theory is that REvil decided that the heat was too intense, and took the sites down itself to avoid becoming caught in the crossfire between the American and Russian presidents. That is what another Russian-based group, DarkSide, did after the ransomware attack on Colonial Pipeline, ...

But many experts think that DarkSide’s going-out-of-business move was nothing but digital theater, and that all of the group’s key ransomware talent will reassemble under a different name.
This is by far the most likely explanation for REvil's disappearance, which left victims unable to pay. The same day, Bogdan Botezatu and Radu Tudorica reported that Trickbot Activity Increases; new VNC Module On the Radar:
The Trickbot group, which has infected millions of computers worldwide, has recently played an active role in disseminating ransomware.

We have been reporting on notable developments in Trickbot’s lifecycle, with highlights including the analysis in 2020 of one of its modules used to bruteforce RDP connections and an analysis of its new C2 infrastructure in the wake of the massive crackdown in October 2020.

Despite the takedown attempt, Trickbot is more active than ever. In May 2021, our systems started to pick up an updated version of the vncDll module that Trickbot uses against select high-profile targets.
As regards the "massive crackdown", Ravie Lakshmanan notes:
The botnet has since survived two takedown attempts by Microsoft and the U.S. Cyber Command,

Via Barry Ritholtz we find this evidence of Willie Sutton's law in action. When asked "Why do you rob banks?", Sutton replied "Because that's where the money is."


And, thanks to Jack Cable, there's now a site tracking ransomware payments in real time. It suffers a bit from incomplete data. Because it depends upon tracking Bitcoin addresses, it will miss the increasing proportion of demands that insist on Monero.

Top 5 Big Library Ideas in History / Hugh Rundle

GLAM Blog Club has the first in a new approach to themes this month - Top five big.... I'm also taking a new approach, with some shitty drawings to spice things up. I hope you enjoy.

5 - Found in translation (Academy of Gondishapur, Persia)

The Academy of Gondishapur (modern day Iran) was centred around a library of medical books from around the world known to Persia, and a school of talented translators to make it all readable in Persian.

Apologies Shahanshah, we have not found the cure for gout

4 - Oppression through "standards" (Library of Congress, United States of America)

Most libraries in the Anglosphere use Library of Congress Subject Headings, or variations based on them, for classifying knowledge. It's pretty ridiculous for a primary school library in Fiji to organise knowledge according to the worldview of American politicians, but it sure is effective soft power.

Say Chad, let's do cultural imperialism

3 - Shelve it like you stole it (Library of Alexandria)

The Ptolemies wanted their library to have a copy of every book in the known world. Woe betide anyone foolish enough to turn up with a boatload of books.

We keep the original, they get the copy

2 - Give communists somewhere to write (British Library, London)

World champion grump and famous communist Karl Marx washed up in London after he was thrown out of everywhere else in Europe. The British Library famously became the place he wrote most of his world-changing book Capital.

Karl Marx at the library

1 - Read in the bath (Baths of Trajan, Rome)

The enormous Baths of Trajan in ancient Rome were commissioned by Emperor Trajan and included several cold and hot baths, a swimming pool, sports stadium, and two libraries (Greek, and Latin).

A great idea for Trajan

Catching up with past NDSA Innovation Awards Winners: ePADD / Digital Library Federation

Nominations are now being accepted for the NDSA 2021 Excellence Awards.

In 2017, ePADD won the NDSA Innovation Award in the Project category. At that time, this project was an undertaking to develop free and open-source computational analysis software that facilitates screening, browsing, and access for historically and culturally significant email collections. The software incorporates techniques from computer science and computational linguistics, including natural language processing, named entity recognition, and other statistical machine-learning associated processes. Glynn Edwards accepted the award on behalf of the project. She is currently the Assistant Director in the Department of Special Collections & University Archives at Stanford University and kindly took a few minutes to help us catch up on where the project stands today.

What have you been doing since receiving an NDSA Innovation Award in 2017?

We completed two additional phases of software development for ePADD: Phase 2, funded by an Institute of Museum and Library Services (IMLS) National Leadership Grant (NLG), and Phase 3, funded by the Andrew W. Mellon Foundation. These rounds of development focused on adding new features and functionality to the software to support the appraisal, processing, discovery, and delivery of email collections of historic value. Our project team also changed before this third phase, with Sally DeBauche, our new digital archivist, taking over project management full-time for 18 months.

Before Phase 3 launched in January 2020 with Harvard Library as our official partner, Jessica Smith, Ian Gifford, and Jochen Farwer from the University of Manchester contacted us about their own independent project to redevelop aspects of ePADD. They created a prototype version of ePADD that would display a full-text email archive in the Discovery Module, allowing users to view an email collection online. 

Meetings with the Harvard team, represented by Tricia Patterson & Stephen Abrams, progressed to the proposal of ceasing the redevelopment of their in-house email processing and preservation software, EAS, and instead collaborating with us to add specific preservation functionality to ePADD. At this stage, we brought the team from the University of Manchester into those discussions to help us shape the requirements for a new version of ePADD with greater support for preservation workflows. Concurrently with our Phase 3 grant, our three institutions began working on a joint grant proposal for Phase 4 of ePADD’s software development, funded by the University of Illinois’s Email Archives: Building Capacity and Community (EA:BCC) re-grant program, supported by the Andrew W. Mellon Foundation. We have been meeting together for the past year as we document requirements and identify roles and responsibilities for each of our units to carry out this work. For this phase of the project, we have contracted with an independent software development team, Sartography, to implement changes to the software, while retaining ePADD’s original development team to ensure consistency in our approach.

Internally at Stanford, we continue to use ePADD as our production tool for appraising, processing, and delivering email archives. Our digital archives team, Sally DeBauche & Annie Schweikert, have presented on the software to our group of curators and have been in contact with them about appraising and processing new acquisitions. Annie & Sally have processed several new email collections, including the Ted Nelson email archive and the Don Knuth email archive. We have also launched a new multi-institutional online ePADD Discovery website, featuring the archive of literary critic and historical theorist Hayden White, from the UC Santa Cruz archives. To accompany the site, we have created documentation about contributing to the Discovery site.

What did receiving the NDSA award mean to you?

Beyond the recognition of our colleagues, it raised the profile of the ePADD software which garnered more users and interest. This greater following gave us the impetus for our third grant from the Mellon Foundation and allowed us to create a more stable program that can be used as a production tool for email archives.

What efforts/advances/ideas of the last few years have you been impressed with or admired in the field of data stewardship and/or digital preservation?

There has been a lot of development in the field since we started with the ePADD project. But I have been very impressed with the EaaSI project (emulation), for which Stanford serves as a node host. This project will be a game changer for our stakeholders across the university and beyond, as well as colleagues throughout the library who use this platform to provide access to legacy software and files that rely on unique and outdated software.

How has the ePADD project evolved since you won the Innovation Award?

I included a lot of this in #1 above, but I would add that the raised profile and increased interest in and use of ePADD have brought dedicated partners. The Stanford-Harvard-Manchester partnership began during our third grant and has increased the exposure of ePADD+ (as we now refer to it), and the greater involvement from colleagues at each institution has allowed the larger team to focus on different aspects of running and managing the project. One exciting outcome is the commitment of more software testers and greater input from a wider community.

What do you currently see as some of the biggest challenges in email assessment and preservation?

While I am still hoping for a more holistic way to search across all types of archival content, I think that sustainability is one of the major issues facing open-source software development projects. The cost of bug-fixes and updates with new versions of underlying programs might not always be inordinate, but securing dedicated funding is not simple and is often very time consuming. Even more difficult is getting concrete buy-in for funds needed to pay developers to create significant enhancements. We are excited to see the progress from the It Takes a Village in Practice project that aims to provide guidance to open-source software development projects on sustainability. We are engaged in beta testing for the tools that they are developing, and it will be very interesting to see how they can be of service to the broader community.

The post Catching up with past NDSA Innovation Awards Winners: ePADD appeared first on DLF.

Calls for proposals – Samvera Connect 2021 Online / Samvera

The Program Committee is pleased to announce its Call for Proposals (CfP) for workshops, presentations/panels, lightning talks, and posters for Samvera Connect 2021 Online. The online conference workshops will be held October 14th–15th, with plenary presentations October 18th–22nd.

Connect Online programming is intended to serve the needs of attendees from throughout our Community, from potential adopters to expert Samverans, in many roles including developers, managers, sysops, metadata librarians, and others who are interested in Samvera technologies and Community activities.

Workshops: submission form open through Sunday, August 15th, 2021 

The Workshops form includes the option to request a workshop on a specific topic that you would like to attend. The Program Committee will use these suggestions to solicit workshops from the Community.

Presentations and Panels: submission form open through Tuesday, August 31st, 2021

Lightning Talks: submission form open through Thursday, September 30th, 2021

Virtual Posters: submission form open through Thursday, September 30th, 2021

You may find it helpful to refer to the workshop program, presentation/lightning talk program, and posters from last year’s online conference.

The post Calls for proposals – Samvera Connect 2021 Online appeared first on Samvera.

b2c2b is the new b2b / Casey Bisson

The most recent StackOverflow developer survey shows 77% of developers prefer to use a free trial as a way to research a new service. Forrester Research reported that 93% of b2b buyers prefer self-service buying online. And a Harvard Business Review study found “that [b2b] customers are, on average, 57% of the way through the [purchase] process before they engage with supplier sales reps.” Because of this, b2b sales require internal advocates—called mobilizers—that can build consensus around purchase decisions.

Two metadata directions / Lorcan Dempsey


This short piece is based on a presentation I delivered by video to the Eurasian Academic Libraries Conference - 2021, organized by The Nazarbayev University Library and the Association of University Libraries in the Republic of Kazakhstan. Thanks to April Manabat of Nazarbayev University for the invitation and for encouragement as I prepared the presentation and blog entry. I was asked to talk about metadata and to mention OCLC developments. The conference topic was: Contemporary Trends in Information Organization in the Academic Library Environment.

The growing role and value of metadata

Libraries are very used to managing metadata for information resources - for books, images, journal articles and other resources. Metadata practice is rich and varied. We also work with geospatial data, archives, images, and many other specialist resources. Authority work has focused on people, places and things (subjects). Archivists are concerned about evidential integrity, context and provenance. And so on.

In the network environment, general metadata requirements have continued to evolve in various ways:

Information resource diversification. We want to discover, manage or otherwise interact with a progressively broader range of resources: research data, for example, or open educational resources.

Resource type diversification. However, we are also increasingly interested in more resource types than informational alone. The network connects together many entities or types of resource in new ways, and interaction between these entities requires advance knowledge, often provided by metadata, to be efficient. Workflows tie people, applications and devices together to get things done. To be effective, each of these resources needs to be described. Social applications like Strava tie together people, activities, places, and so on. Scholarly engines like Google Scholar, Semantic Scholar, Scopus or Dimensions tie together research outputs, researchers, funders, and institutions. The advance knowledge required for these workflows and environments to work well is provided by metadata and so we are increasingly interested in metadata about a broad range of entities.

Functional diversification. We want to discover, manage, request or buy resources. We also want to provide context about resources, ascertain their validity or integrity over time, determine their provenance. We want to compare resources, collect data about usage, track and measure. We want to make connections between entities, understand relationships, and actually create new knowledge. We do not just want to find individual information resources or people, we want to make sense of networks of people, resources, institutions, and so on and the relations between them.

Source diversification.  I have spoken about four sources of metadata in the past. Versions of these are becoming more important, but so is how they are used together to tackle the growing demands on metadata in digital environments.

  1. Professional. Our primary model of metadata has been a professional one, where librarians, abstract writers, archivists and so on are the primary source. Libraries have streamlined metadata creation and provision for acquired resources. Many libraries, archives and others, devote professional attention and expertise to unique resources - special collections, archives, digitised and born-digital materials, institutional research outputs, faculty profiles, and so on.
  2. Community.  I described the second as crowdsourced, and certainly the collection of contributions in this way has been of importance, in digital projects, community initiatives and in other places. However, one might extend this to a broader community source. The subject of the description or the communities from which the resources originate are an increasingly important source. This is especially the case as we pluralize description as I discuss further below.  An interesting example here is Local Contexts which works with collecting institutions and Indigenous communities and "provides a new set of procedural workflows that emphasize vetting content, collaborative curation, ethical management and sustained outreach practices within institutions."
  3. Programmatically promoted. The programmatic promotion of metadata is becoming increasingly important. We will see more algorithmically generated metadata, as natural language processing, entity recognition, machine learning, image recognition, and other approaches become more common. This is especially the case as we move towards more entity-based approaches where the algorithmic identification of entities and relationships across various resources becomes important. At the same time, we are more aware of the need for responsible operations, where dominant perspectives also influence construction of algorithms and learning sets.
  4. Intentional. A fourth source is intentional data, or usage data, data about how resources are downloaded, cited, linked and so on. This may be used to rate and rank, or to refine other descriptions. Again, appropriate use of this data needs to be responsibly managed.

Perspective diversification. A purported feature of much professional metadata activity has been neutral or objective description. However, we are aware that to be 'neutral' can actually mean to be aligned with dominant perspectives. We know that metadata and subject description have very often been partial, harmful or unknowing about the resources described, or have continued obsolescent or superseded perspectives, or have not described resources or people in ways that a relevant community expects or can easily find. This may be in relation to race, gender, nationality, sexual orientation, or other contexts. This awareness leads directly into the second direction I discuss below, pluralization. It also highlights the increasing reliance on community perspectives. It is important to understand context, cultural protocols, community meanings and expectations, through more reciprocal approaches. And as noted above, use of programmatically promoted or intentional data needs to be responsibly approached, alert to ways in which preferences or bias can be present.

So we want to make metadata work harder, but we also need more metadata and more types of metadata. Metadata helps us to work in network environments.

A more formal definition of metadata might run something like: "schematized assertions about a resource of interest." However, as we think about navigating increasingly digital workflows and environments, I like to think about metadata in this general way:

data which relieves a potential user (whether human or machine) of having to have full advance knowledge of the existence or characteristics of a resource of potential interest in the environment.

Metadata allows applications and users to act more intelligently, and this becomes more important in our increasingly involved digital workflows and environments.

Given this importance, and given the importance of such digital environments to our working, learning and social lives, it also becomes more important to think about how metadata is created, who controls it, how it is used, and how it is stewarded over time. Metadata is about both value and values.

In this short piece, and in the presentation on which it is based, I limit my attention to two important directions. Certainly, this is a part only of the larger picture of evolving metadata creation, use and design in libraries and beyond.

Two metadata directions

I want to talk about two important directions here.

  1. Entification.
  2. Pluralization.

Entification: strings and things

Google popularized the notion of moving from 'strings' to 'things' when it introduced the Google knowledge graph. By this we mean that it is difficult to rely on string matching for effective search, management or measurement of resources. Strings are ambiguous. What we are actually interested in are the 'things' themselves, actual entities which may be referred to in different ways.

Entification involves establishing a singular identity for 'things' so that they can be operationalized in applications, gathering information about those 'things,' and relating those 'things' to other entities of interest.
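The shift from strings to things can be sketched in a few lines. (The names, identifiers, and records below are invented for illustration, not drawn from any real service.)

```python
# Illustrative sketch only: names and identifiers are invented.
# 'Strings': the same text can refer to different people, and one person
# can appear under different strings.
records_by_string = [
    {"author": "J. Smith", "title": "Paper A"},
    {"author": "John Smith", "title": "Paper B"},  # same person, new string
    {"author": "J. Smith", "title": "Paper C"},    # different person, same string
]

# 'Things': each entity gets a singular identity, and records point at it.
entities = {
    "person:0001": {"name": "John Smith", "affiliation": "org:0042"},
    "person:0002": {"name": "Jane Smith", "affiliation": "org:0099"},
}
records_by_entity = [
    {"author_id": "person:0001", "title": "Paper A"},
    {"author_id": "person:0001", "title": "Paper B"},
    {"author_id": "person:0002", "title": "Paper C"},
]

# A string search is ambiguous: it matches all three records...
hits = [r for r in records_by_string if "Smith" in r["author"]]

# ...while an entity query is exact: everything by person:0001, no more, no less.
works = [r["title"] for r in records_by_entity if r["author_id"] == "person:0001"]
print(works)  # ['Paper A', 'Paper B']
```

The point of the sketch is simply that once identities are established, queries and relationships operate on entities rather than on ambiguous text.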

Research information management and the scholarly ecosystem provide a good example of this. This image shows a variety of entities of interest in a research information management system.

[Image: entities of interest in a research information management system]

These include researchers, research outputs, institutions, grants and other entities. Relationships between these include affiliation (researcher to institution), collaborator (researcher to researcher), authorship (researcher to research output), and so on. We want to know that the professor who got this grant is the same person who teaches this course or published that paper.

These identities could be (and are) established within each system or service. So a Research Information Management System does not just return strings that match a text search. It can facilitate the prospecting of a body of institutional research outputs, a set of scholars, a collection of labs and departments, and so on, and allow links to related works, scholars and institutions to be made.

Similarly, a scholarly engine like Scopus, or Semantic Scholar, or Dimensions will bring together scholarly entities and offer access to a more or less rich network of results. Of course, in this case, the metadata may be under less close control. Typically, they will also be using metadata sourced in the four ways I described above, as 'professionally created' metadata works with metadata contributed by, say, individual researchers, as entities may be established and related programmatically, and as usage data helps organize resources.

Wikidata is an important resource in this context, as a globally addressable identity base for entities of all types.

Of course one wants to work with entities across systems and services. Researchers move between institutions, collaborate with others, cite research outputs, and so on. One may need to determine whether an identity in one context refers to the same entity as an identity in another context. So important entity backbone services have emerged which allow entities to be identified across systems and services by assigning globally unique identifiers. These include Orcid for researchers, DOI for research outputs, and the emerging ROR for institutions.

These initiatives aim to create a singular identity for resources, gather some metadata about them, and make this available for other services to use. So a researcher may have an Orcid, for example, which is associated with a list of research outputs, an affiliation, and so on. This Orcid identity may then be used across services, supporting global identification and contextualization.

Here, for example, is a profile generated programmatically by Dimensions (from Digital Science) for my colleague Lynn Connaway. It aims to generate a description, but then also to recognize and link to other entities (e.g. topics, institutions, or collaborators). Again, the goal is to present a profile, and then to allow us to prospect a body of work and its connections. It pulls data from various places. We are accustomed to this more generally now with Knowledge Cards in Google (underpinned by Google's knowledge graph).

[Image: Dimensions profile for Lynn Connaway]

Of course, this is not complete, and there has been some interesting discussion about improved use of identifiers from the scholarly entity backbone services. Meadows and Jones talk about the practical advantages to scholarly communication of a 'PID-optimized world.' (PID=persistent identifier.)

In these systems and services, the entities I have been talking about will typically be nodes in an underlying knowledge graph or ontology.  

Today, KGs are used extensively in anything from search engines and chatbots to product recommenders and autonomous systems. In data science, common use cases are around adding identifiers and descriptions to data of various modalities to enable sense-making, integration, and explainable analysis.  [...]

A knowledge graph organises and integrates data according to an ontology, which is called the schema of the knowledge graph, and applies a reasoner to derive new knowledge. Knowledge graphs can be created from scratch, e.g., by domain experts, learned from unstructured or semi-structured data sources, or assembled from existing knowledge graphs, typically aided by various semi-automatic or automated data validation and integration mechanisms. // The Alan Turing Institute

The knowledge graph may be internal to a particular service (Google for example) or may be used within a domain or globally. Again, Wikidata is important because it publishes its underlying knowledge graph and can be used to provide context, matches, and so on for other resources.
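A toy sketch of that definition (the data is invented and the rule deliberately trivial, not any particular system's schema or reasoner): a graph of subject-predicate-object triples, plus a 'reasoner' that derives new membership facts by following part_of links.

```python
# Minimal knowledge-graph sketch with invented data: a set of
# subject-predicate-object triples plus one inference rule.
triples = {
    ("lab:X", "part_of", "dept:Y"),
    ("dept:Y", "part_of", "univ:Z"),
    ("person:A", "member_of", "lab:X"),
}

def infer(triples):
    """Derive member_of facts transitively through part_of (a toy reasoner)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # If someone is a member of a unit, and that unit is part of a
        # parent, then they are also a member of the parent.
        new = {
            (person, "member_of", parent)
            for (person, rel1, unit) in facts if rel1 == "member_of"
            for (u, rel2, parent) in facts if rel2 == "part_of" and u == unit
        }
        if not new <= facts:
            facts |= new
            changed = True
    return facts

derived = infer(triples)
print(("person:A", "member_of", "univ:Z") in derived)  # True
```

Real knowledge graphs use standardized models (RDF triples, OWL ontologies) and far more capable reasoners, but the shape is the same: stored assertions plus rules that yield knowledge nobody entered explicitly.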

The Library of Congress, other national libraries, OCLC, and others now manage entity backbones for the library community, sometimes rooted in traditional authority files. There is a growing awareness of the usefulness of entity-based approaches and of the importance of identifiers in this context.

In this way, it is expected that applications will be able to work across these services. For example, an Orcid identity may be matched with identifiers from other services, VIAF for example, to provide additional context or detail, or Wikidata to provide demographic or other data not typically found in bibliographic services.
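One way to picture this kind of cross-service matching (the identifiers and records below are entirely hypothetical, not real Orcid, VIAF or Wikidata data): a set of 'same as' links ties one entity's identifiers together, so that context from several services can be merged into a single profile.

```python
# Hypothetical sketch: identifiers and records are invented for illustration.
# 'same_as' links connect the identifiers that different services use
# for the same entity.
same_as = {
    "orcid:0000-0000-0000-0001": {"viaf:12345", "wikidata:Q999"},
}

# Each service holds a different slice of context about the entity.
service_records = {
    "orcid:0000-0000-0000-0001": {"works": ["Paper A"]},
    "viaf:12345": {"name_variants": ["Smith, J.", "John Smith"]},
    "wikidata:Q999": {"birth_year": 1970},
}

def merged_profile(primary_id):
    """Merge records for an entity across services via its same_as links."""
    profile = {}
    for ident in {primary_id} | same_as.get(primary_id, set()):
        profile.update(service_records.get(ident, {}))
    return profile

profile = merged_profile("orcid:0000-0000-0000-0001")
print(sorted(profile))  # ['birth_year', 'name_variants', 'works']
```

No single service needs to hold everything; the identifiers and the links between them are what make the merged view possible.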

Taken together, we can expect to see a decentralized infrastructure which can be tied together to achieve particular goals.

Pluralizing description

Systems of description are inevitably both explicitly and implicitly constructed within particular perspectives. Metadata and subject description have long been criticized for embodying dominant perspectives, and for actively shunning or overlooking the experiences, memories or expectations of parts of the communities they serve. They may also contain superseded, obsolescent or harmful descriptions.

Libraries have spoken about "knowledge organization" but such a phrase has to reckon with two challenges. First, it is acknowledged that there are different knowledges.

The TK Labels support the inclusion of local protocols for access and use to cultural heritage that is digitally circulating outside community contexts. The TK Labels identify and clarify community-specific rules and responsibilities regarding access and future use of traditional knowledge. This includes sacred and/or ceremonial material, material that has gender restrictions, seasonal conditions of use and/or materials specifically designed for outreach purposes. // Local Contexts, [Traditional Knowledge] labels

Second, knowledge may be contested, where it has been constructed within particular power relations and dominant perspectives.

Others described “the look of horror” on the face of someone who has been told to search using the term “Indians of North America.” Students — Indigenous as well as settler — who work with the collections point out to staff the many incorrect or outdated terms they encounter. // Towards respectful and inclusive description

My Research Library Partnership colleagues carried out a survey on Equity, Diversity and Inclusion in 2017. While it is interesting to note the mention of archival description as among the most changed features, there was clearly a sense that metadata description and terminologies required attention, and there was an intention to address these next.

[Image: survey responses on Equity, Diversity and Inclusion]

Such work may be retrospective, including remediation of existing description by linking to or substituting more appropriate descriptions. And there is certainly now a strong prospective focus, working on pluralizing description, on decentering dominant perspectives to respectfully and appropriately describe resources.

This work has been pronounced in Australia, New Zealand and Canada, countries which recognize the need to address harmful practices in relation to Indigenous populations. In early 2020, my RLP colleagues interviewed 41 library staff from 21 institutions in Australia, New Zealand, Canada and the US to talk about respectful and inclusive description:

Of those interviewed, no one felt they were doing an adequate job of outreach to communities. Several people weren’t even sure how to go about connecting with stakeholder communities. Some brought up the possibility of working with a campus or community-based Indigenous center. These organizations can be a locus for making connections and holding conversations. A few working in a university setting have found strong allies and partners to advocate for increased resources in faculty members of the Indigenous Studies department (or similar unit). Those with the most developed outreach efforts saw those activities as being anchored in exchanges that originated at the reference desk, such as when tribal members came into the library to learn something about their own history, language or culture using materials that are stewarded by the library. Engaging with these communities to understand needs offers the opportunity to transform interactions with them from one-time transactions to ongoing, meaningful relationships.  Learning how Indigenous community members use and relate to these materials can decenter default approaches to description and inspire more culturally appropriate ways. Some institutions have developed fellowships to foster increased use of materials by community members. // Towards respectful and inclusive description

The murder of George Floyd in the US caused a general personal, institutional and community reckoning with racism. This has extended to addressing bias in library collections and descriptions.

Materials in our collections, which comprise a part of the cultural and historical record, may depict offensive and objectionable perspectives, imagery, and norms. While we have control over description of our collections, we cannot alter the content. We are committed to reassessing and modifying description when possible so that it more accurately and transparently reflects the content of materials that are harmful and triggering in any way and for any reason.

As librarians and archivists at NYU Libraries, we are actively confronting and remediating how we describe our collection materials. We know that language can and does perpetuate harm, and the work we are undertaking centers on anti-oppressive practices. We are also making reparative changes, all to ensure that the descriptive language we use upholds and enacts our values of inclusion, diversity, equity, belonging, and accessibility. // NYU Libraries, Archival Collections Management, Statement on Harmful Language.

Many individual libraries, archives and other organizations are now taking steps to address these issues in their catalogs. The National Library of Scotland has produced an interesting Inclusive Terminology guide and glossary, which includes a list of areas of attention.


Of course, the two directions I have mentioned can be connected, as entification and linking strategies may offer ways of pluralizing description in the future:

Other magic included technical solutions, such as linked data solutions, specifically, mapping inappropriate terms to more appropriate ones, or connecting the numerous alternative terms with the single concept they represent. For example, preferred names and terms may vary by community, generation, and context—what is considered incorrect or inappropriate may be a matter of perspective. Systems can also play a role: the discovery layers should have a disclaimer stating that users may find terms that are not currently considered appropriate. Terms that are known to be offensive terms could be blurred out (with an option to reveal them). // Towards respectful and inclusive description

OCLC Initiatives

Now I will turn to discuss one important initiative OCLC has under each of these two directions.

Entification at OCLC: towards a shared entity management infrastructure

For linked data to move into common use, libraries need reliable and persistent identifiers and metadata for the critical entities they rely on. This project [SEMI] begins to build that infrastructure and advances the whole field // Lorcan Dempsey

OCLC has long worked with entification in the context of several linked data initiatives. VIAF (the Virtual International Authority File) has been foundational here. This brings together name authority files from national libraries around the world, and establishes a singular identity for persons across them. It adds bibliographic and other data for context. And it matches to some other identifiers. VIAF has become an important source of identity data for persons in the library community.

Project Passage is also an important landmark. In this project we worked with Wikibase to experiment with linked data and entification at scale. My colleague Andrew Pace provides an overview of the lessons learned:

  • The building blocks of Wikibase can be used to create structured data with a precision that exceeds current library standards.
  • The Wikibase platform enables user-driven ontology design but raises concerns about how to manage and maintain ontologies.
  • The Wikibase platform, supplemented with OCLC’s enhancements and stand-alone utilities, enables librarians to see the results of their effort in a discovery interface without leaving the metadata-creation workflow.
  • Robust tools are required for local data management. To populate knowledge graphs with library metadata, tools that facilitate the import and enhancement of data created elsewhere are recommended.
  • The pilot underscored the need for interoperability between data sources, both for ingest and export.
  • The traditional distinction between authority and bibliographic data disappears in a Wikibase description.

These initiatives paved the way for SEMI (Shared Entity Management Infrastructure). Supported by the Andrew W. Mellon Foundation, SEMI is building the infrastructure which will support OCLC's production entity management services. The goal is to have infrastructure which allows libraries to create, manage and use entity data at scale. The initial focus is on providing infrastructure for work and person entities. It is advancing with input from a broad base of partners and a variety of data inputs, and will be released for general use in 2022.


Pluralization at OCLC: towards reimagining descriptive workflows

OCLC recognizes its important role in the library metadata environment and has been reviewing its own vocabulary and practices. For example, it has deprecated the term ‘master record’ in favor of ‘WorldCat record.’

It was also recognized that there were multiple community and institutional initiatives which were proceeding independently and that there would be value in a convening to discuss shared directions.

Accordingly, again supported by the Andrew W Mellon Foundation, OCLC, in consultation with Shift Collective and an advisory group of community leaders, is developing a program to consider these issues at scale. The following activities are being undertaken over several months:

  • Convene a conversation of community stakeholders about how to address the systemic issues of bias and racial equity within our current collection description infrastructure.
  • Share with member libraries the need to build more inclusive and equitable library collections and to provide description approaches that promote effective representation and discovery of previously neglected or mis-characterized peoples, events, and experiences.
  • Develop a community agenda that will be of great value in clarifying issues for those who do knowledge work in libraries, archives, and museums, identifying priority areas for attention from these institutions, and providing valuable guidance for those national agencies and suppliers.

It is hoped that the community agenda will help mobilize activity across communities of interest, and will also provide useful input into OCLC development directions.

Find out more .. resources to check out

The OCLC Research Library Partners Metadata Managers Focus Group is an important venue for discussion of metadata directions and community needs. This report synthesizes six years (2015-2020) of discussion, and traces how metadata services are evolving:

Transitioning to the Next Generation of Metadata

This post brings together a series of international discussions about the report and its ramifications for services, staffs and organization.

Next-generation metadata and the semantic continuum

For updates about OCLC's SEMI initiative, see here:

WorldCat - Shared entity management infrastructure | OCLC

For more about reimagining descriptive workflows, see here:

Reimagine Descriptive Workflows
OCLC has been awarded a grant from The Andrew W. Mellon Foundation to convene a diverse group of experts, practitioners, and community members to determine ways to improve descriptive practices, tools, infrastructure and workflows in libraries and archives. The multi-day virtual convening is part of…

The Project Passage summary and report is here:

Creating Library Linked Data with Wikibase: Lessons Learned from Project Passage
The OCLC Research linked data Wikibase prototype (“Project Passage”) provided a sandbox in which librarians from 16 US institutions could experiment with creating linked data to describe resources—without requiring knowledge of the technical machinery of linked data. This report provides an overview…

Acknowledgements: Thanks to my colleagues John Chapman, Rachel Frick, Erica Melko, Andrew Pace and Merrilee Proffitt for providing material and/or advice as I prepared the presentation and this entry. Again, thanks to April Manabat of Nazarbayev University for the invitation and for encouragement along the way. For more information about the Conference, check out these pages:

Nazarbayev University LibGuides: Eurasian Academic Libraries Conference - 2021: Home

Picture: I took the feature picture at Sydney Airport, Australia (through a window). The pandemic is affecting how we think about work travel and the design of events, although in as yet unclear ways. One pandemic effect, certainly, has been the ability to think about both audiences and speakers differently. It is unlikely that I would have attended this conference had it been face to face; however, I readily agreed to be an online participant.

Blog Archive / LITA

The LITA Blog will now function as an archive of posts by the Library and Information Technology Association (LITA), formerly a division of the American Library Association (ALA). LITA is now a part of Core, but was previously the leading organization reaching out across types of libraries to provide education and services for a broad membership of systems librarians, library technologists, library administrators, library schools, vendors, and others interested in leading edge technology and applications for librarians and information providers.

Are you interested in becoming involved with Core or the Core News blog? Core News has more information!

Software supply chain security, SBOMs, and Biden's cybersecurity executive order / Casey Bisson

The Biden administration’s May 12 executive order on cybersecurity outlined the most comprehensive government policy yet to protect public and private resources from cyber attack, and laid out a number of requirements for federal information systems going forward. A number of sections of the order require the federal government to modernize security practices, including establishing a review board, developing playbooks, and improved sharing of threat information between agencies (and from private service providers to the agencies they serve).

Intel Did A Boeing / David Rosenthal

Two years ago, Wolf Richter noted that Boeing's failure to invest in a successor airframe was a major cause of the 737 Max debacle:
From 2013 through Q1 2019, Boeing has blown a mind-boggling $43 billion on share buybacks
I added up the opportunity costs:
Suppose instead of buying back stock, Boeing had invested in its future. Even assuming an entirely new replacement for the 737 series was as expensive as the 787 (the first of a new airframe technology), they could have delivered the first 737 replacement ($32B), and be almost 70% through developing another entirely new airframe ($11B/$16B). But executive bonuses and stock options mattered more than the future of the company's cash cow product.
Below the fold I look at how Intel made the same mistake as Boeing, and early signs that they have figured out what went wrong.

William Lazonick and Matt Hopkins chronicle Intel's version of the Boeing disease in How Intel Financialized and Lost Leadership in Semiconductor Fabrication:
In the years 2011-2015, Intel was in the running, along with TSMC and SEC, to be the fabricator of the iPhone, iPad, and iPod chips that Apple designed. While Intel spent $50b. on P&E and $53b. on R&D over those five years, it also lavished shareholders with $36b. in stock buybacks and $22b. in cash dividends, which together absorbed 102% of Intel’s net income (see Table 1). From 2016 through 2020, Intel spent $67b. on P&E and $66b. on R&D, but also distributed almost $27b. as dividends and another $45b. as buybacks. Intel’s ample dividends have provided an income yield to shareholders for, as the name says, holding Intel shares. In contrast, the funds spent on buybacks have rewarded sharesellers, including senior Intel executives with their stock-based pay, for executing well-timed sales of their Intel shares to realize gains from buyback-manipulated stock prices.
What did spending $81B on buybacks cost Intel? It is true that both Intel's leading competitors, SEC (Samsung) and TSMC, bought back stock, but there was a big difference in why they did:
The purpose of SEC’s stock buybacks in 2002-2007 and 2014-2018 was to increase the voting power of the founding Lee family, thereby consolidating its strategic control over resource allocation against the threat of corporate raiders.[17] It is clear from SEC’s remarkable history that the Lee family has used its strategic control to allocate profits to investments in world-class productive capabilities. So too at TSMC under the leadership of Morris Chang. At 50% of net income over the past decade, TSMC’s dividend payout ratio was 2.8 times that of SEC and 1.4 times that of Intel. The sole purpose of TSMC’s stock repurchases between 2003 and 2008 was to buy out the ownership stake of Philips.[18]
Two problems:
  • Intel prioritized buybacks over dividends; the competition did the reverse, using buybacks only for strategic goals.
  • Intel distributed essentially its entire net income to shareholders, whereas TSMC distributed about half and SEC under a third.
Despite Intel having very low retained earnings:
in recent years the company has made substantial allocations to P&E and R&D, even as it has distributed almost all its profits to shareholders.[20] But Intel has been able to tap other cash flows to make, simultaneously, large-scale productive investments and shareholder payouts. For the decade, 2011-2020, these other cash flows included depreciation charges of $87b., long-term debt issues of $45b., and stock sales (mainly to employees in stock-based compensation plans) of $12b.
Table 1 shows that Intel's investments in P&E and R&D for 2011-2020 totalled $236B, while the "other cash flows" they quote totalled $144B and retained earnings totalled $50B. I believe the remaining $42B is explained by Footnote 20:
Note that R&D is accounted for as a current expense, so only increments to R&D spending must be financed out of profits or other sources of funds such as depreciation, debt, or equity issues.
The $81B Intel spent on buybacks in that decade could instead have raised their investment in P&E and R&D by over one-third. Instead, according to Table 3, about $105M or 0.1% of it ended up in the pockets of Intel's first three non-engineer CEOs:
Intel’s buybacks reached $10.6b. in 2005, the year in which Otellini, Intel’s first non-engineer CEO, took over. Buybacks declined to an average of $1.7b. in 2008 and 2009 in reaction to the financial crisis, but then were jacked up to as high as $14.3b. in 2011. The following year, buybacks were $5.1b. as Otellini departed as CEO with $40m. in total compensation, 82% of it stock-based.

With Krzanich as CEO, buybacks peaked at $10.8b. in 2014. He raked in $40 million in total pay (79% stock-based) in 2017 but was ousted in mid-2018 for having a “consensual relationship” with an Intel employee.[21] In early 2018, news outlets alleged that Krzanich engaged in insider trading, based on non-public information of security flaws in Intel’s CPUs, as he sold all his Intel shares except for the minimum 250,000 he was required by contract to hold.[22]

With Krzanich’s exit, the new CEO was Robert Swan, an MBA who had spent his career in finance at a number of companies, including GE, TRW, Northrup Grumman, eBay, and General Atlantic, before joining Intel as CFO in 2016. From 2018 through 2020, Swan averaged just under $13m. in total remuneration, of which 57% was stock-based. In the Swan years, annual dividends were 19% higher and annual buybacks 186% higher than in the Krzanich years.
Boeing switched CEOs from engineers to financiers when McDonnell-Douglas' Harry Stonecipher:
“bought Boeing with Boeing’s money.” Indeed, Boeing didn’t ultimately get much for the $13 billion it spent on McDonnell Douglas, which had almost gone under a few years earlier. But the McDonnell board loved Stonecipher for engineering the McDonnell buyout, and Boeing’s came to love him as well. This was in no small part because Stonecipher cast himself as the savior of Boeing and knew just how to exploit a bad situation to get his way.
Stonecipher’s other big cultural transformation was focused on maligning and marginalizing engineers as a class, and airplanes as a business. “You can make a lot of money going out of business” was something he liked to say.
The end result of focusing the company on the stock price instead of the product was the 737 MAX disaster, the 787, KC-46 Pegasus tanker, and space program debacles, and now the long delay in certification of the 777X.

Similarly, the end result for Intel of focusing on the stock price instead of the product was:
In 2020, both TSMC and SEC transitioned from 7nm to 5nm process, and in 2021 both are making investments to commercialize 3nm. TSMC went from zero 5nm revenues in 2Q20 to 8% in Q320 and 20% in Q420.[7] SEC is intent on closing the technology gap with TSMC by allocating $28b. to capital expenditures in 2021, about the level of its 2020 plant and equipment (P&E) investments.

For its part, TSMC has announced plans to spend $100b. in total on P&E and R&D over the next three years, including $30b. in 2021, up from $17.2b. in 2020. TSMC will construct a $12b. 5nm facility in Arizona and is also considering the state as the site for a $25b. 3nm fab.[8] Most of this new capacity is slated to fabricate Apple’s M-series processors.[9]

Intel still leads the global semiconductor industry in total revenues. But, as an IDM, Intel manufactures almost all its CPUs at 14nm, and its 10nm capacity has been stuck, with limited output, since 2018.[10] Meanwhile, Apple is abandoning Intel processors for its Mac computers, turning instead to TSMC to fabricate Apple’s own designs.[11] Intel itself already contracts with TSMC and UMC to produce 15-20% of its non-CPU chips. Moreover, later this year, TSMC will commence production of Intel’s Core i3 processors, inside advanced laptops, at 5nm.[12]
Intel's competitors' announced plans are to invest 5% and 11% more than Intel's average for the last 5 years. And they've been getting a lot more bang for their buck than Intel over those years.

Earlier this year Intel appears to have realized that they'd been suffering from the wrong kind of CEO, and hired back engineer Pat Gelsinger as CEO. Perhaps it was an indication that Intel would no longer "make a lot of money going out of business" when in early May Reuters reported that Intel will ‘focus’ less on buying back company stock — CEO and that:
The new CEO said in March that Intel will spend up to $20 billion to build two new factories in Arizona, greatly expanding its advanced chip manufacturing capacity.
Note that, on TSMC's figures for their Arizona fabs, "up to $20B" isn't enough for two 5nm fabs, or one 3nm fab. So despite fewer buybacks Intel still isn't planning to compete with the foundry leaders. Though one of Gelsinger's early announcements was:
Intel Foundry Services (or IFS) is one prong of Intel’s strategy to realign itself with the current and future semiconductor market. Despite having attempted to become a foundry player in the past, whereby they build chips under contract for their customers, it hasn’t really worked out that well – however IFS is a new reinvigoration of that idea, this time with more emphasis on getting it right and expanding the scope.
Ian Cutress reported on an early stage of IFS development in Intel Licenses SiFive’s Portfolio for Intel Foundry Services on 7nm
Today’s announcement from SiFive comes in two parts; this part is significant as it recognizes that Intel will be enabling SiFive’s IP portfolio on its 7nm manufacturing process for upcoming foundry customers. We are expecting Intel to offer a wide variety of its own IP, such as some of the x86 cores, memory controllers, PCIe controllers, and accelerators, however the depth of its third party IP support has not been fully established at this point. SiFive’s IP is the first (we believe) official confirmation of specific IP that will be supported.
But on the other hand:
Intel’s financial report for Q121 shows that between February 22 and March 27, with Gelsinger at the top, Intel executed $1.5b. in buybacks. Perhaps the new CEO was focused on other things during his first month and a half in office. For Intel as for other major U.S. corporations, the addiction to buybacks is hard to kick.

The 19 publicly listed corporate members of the U.S. Semiconductor Industry Association that signed a letter to President Biden in February,[25] asking the government for financial support for their industry, did buybacks of $540b. (2020 dollars) from 2001 through 2020, with IBM, Intel, Qualcomm, and TI accounting for 84% of these repurchases. In 2016-2020 alone, these 19 companies squandered $148b. (nominal) on buybacks—almost three times the $50b. in financial aid that the Biden administration has offered the SIA.

Our policy recommendation for the Biden administration is simple: As a condition for giving the U.S. semiconductor industry $50 billion in infrastructure assistance, put a ban on SIA members doing stock buybacks as open-market repurchases.
Right, otherwise the $50B is going to end up in buybacks. So much easier than productive investments, which actually require thought, planning and time to come to fruition.

But why is it even legal for companies to manipulate their stock price in this way, allowing their stockholders to avoid tax by converting current income (dividends) into capital gains (buybacks)? The answer is the Securities and Exchange Commission’s Rule 10b-18:
Rule 10B-18 is considered a safe harbor provision. A safe harbor is a legal provision to reduce or eliminate legal or regulatory liability in certain situations as long as certain conditions are met. If the company abides by the four conditions of Rule 10B-18 when it is repurchasing the shares, the SEC will not deem the transactions in violation of anti-fraud provisions of the Securities Exchange Act of 1934.
Clearly, Congress understood in 1934 that companies buying back their stock allowed management to both avoid tax and artificially pump their stock price. But Reagan's SEC thought that avoiding tax and pumping the stock market was a good thing, hence Rule 10b-18. This short-term thinking motivated companies to choose their CEOs for their skill in financial rather than product engineering, with the consequent erosion of US technology competence. In the short term it is easy to “make a lot of money going out of business”. In the long term you're out of business but the CEO is enjoying a well-funded retirement. Repeal of Rule 10b-18 should be a priority for Biden's SEC but I wouldn't hold my breath.

Reflection: My third year at GitLab and becoming a non-manager leader / Cynthia Ng

Wow, 3 years at GitLab. Since I left teaching, because almost all my jobs were contracts, I haven't been anywhere for more than 2 years, so I find it interesting that my longest term is not only post-librarian-positions, but at a startup! Year 3 was a full-on pandemic year and it was a busy one. Due to the travel restrictions, I took less vacation than in previous years, and I'll be trying to make up for that a little by taking this week off.

Work with us on ‘Promoting some keynote events/lectures during Open Data Day 2022’ / Open Knowledge Foundation

Open Data Day 2022 is nine months away – and we are already thinking about how to make it bigger and better than 2021!

Recently we asked you about your experience with Open Data Day.

We wanted to know what worked, what didn’t and how it could be improved.

You gave us lots of really useful feedback which we published here.

One of things you told us was that we should

    ‘Promote some keynote events/lectures during Open Data Day’.

If you or your organisation would like to discuss working with us on ‘Promoting some keynote events/lectures during Open Data Day’ – please do email us on

We think focusing on one topic area, and curating and promoting some key note events, could be a powerful mechanism to show the benefits of open data and encourage the adoption of open data policies by government, business and civil society.

There are many possible topic areas that we could focus on – from climate relevant data to open law. From open contracting data to oceanographic data.

If the Open Data Day team at Open Knowledge Foundation are going to do this, however, we need to identify partners from around the world to work with us.

  • We need people or organisations with topic expertise who can help curate events.
  • We need partners with broad community reach to ensure that the work we do together reaches the most people.
  • And we need funding to make it happen.
  • If you or your organisation would like to discuss working with us on ‘Promoting some keynote events/lectures during Open Data Day’ – please do email us on

    Calling all legal practitioners – you are invited to an interactive workshop on immigration and automation / Open Knowledge Foundation

    – Are you a lawyer, campaigner or activist working in the UK immigration system?
    – Do you want to learn how automated decision systems are currently used in the immigration system in the UK ?
    – Do you want to learn about legal strategies for challenging the (mis)use of these technologies?

    = = = = = = =

    Join The Justice Programme team for an online 90-minute interactive workshop on August 19th 2021 between 12.00 – 13.30 BST (London time).

    Tickets for the event cost £110 (inc VAT) and can be purchased online here.

    = = = = = = =

    Tickets are limited to 20 people – to ensure that everyone who attends can maximise their learning experience.

    If you are unwaged and can’t afford a ticket, please email us: The Justice Programme is offering two places at a 75% discount (£27.50 each).

    All proceeds from this event are reinvested in the work of The Justice Programme, as we work to ensure Public Impact Algorithms do no harm.

    = = = = = = =

    What will I learn ?

    In this Interactive Workshop on Immigration and automation we will:

    – explore how AI and algorithms are presently being used and likely to be used in UK and elsewhere
    – get an overview of how algorithms work
    – discuss the potential harms involved at the individual and societal levels
    – summarise legal strategies, resources and best practices
    – participate in a group exercise on a realistic case study

    You will also get access to a guide summarising the key points of the workshop and documenting the answers to your questions.

    This workshop is brought to you by Meg Foulkes, Director of The Justice Programme and Cedric Lombion, our Data & Innovation Lead.

    Read more about The Justice Programme team here.

    = = = = = = =

    About The Justice Programme

    The Justice Programme is a project of the Open Knowledge Foundation, which works to ensure that Public Impact Algorithms do no harm.

    Find out more about The Justice Programme here, and learn more about Public Impact Algorithms here.

    DLTJ Now Uses Webmention and Bridgy to Aggregate Social Media Commentary / Peter Murray

    When I converted this blog from WordPress to a static site generated with Jekyll in 2018, I lost the ability for readers to make comments. At the time, I thought that one day I would set up an installation of Discourse for comments like Boing Boing did in 2013. But I never found the time to do that. Alternatively, I could do what NPR has done—abandon comments on its site in favor of encouraging people to use Twitter and Facebook—but that means blog readers don’t see where the conversation is happening. This article talks about IndieWeb—a blog-to-blog communication method—and the pieces needed to make it work on both a static website and for social-media-to-blog commentary.

    The IndieWeb is a combination of HTML markup and an HTTP protocol for capturing discussions between blogs. To participate in the IndieWeb ecosystem, a blog needs to support the “h-card” and “h-entry” microformats. These microformats are ways to add HTML markup to a site to be read and recognized by machines. If you follow the instructions, the “Level 2” steps will check your site’s webpages for the appropriate markup. The Jekyll theme I use here, minimal-mistakes, didn’t include the microformat markup, so I made a pull request to add it.

    With the markup in place, DLTJ uses the Webmention protocol to notify others when I link to their content and receive notifications from others. If you’re setting this up for yourself, hopefully someone has already gone through the effort of adding the necessary Webmention communication bits to your blog software. Since DLTJ is a static website, I’m using the Webmention.IO service to send and receive Webmention information on behalf of the site, and a Jekyll plugin called jekyll-webmention_io to integrate Webmention data into my blog’s content. The plugin gets that data from Webmention.IO, caches it locally, and builds into each article the list of webmentions and pingbacks (another kind of blog-to-blog communication protocol) received.
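    The first step a Webmention sender performs is endpoint discovery: the receiving site advertises its endpoint with rel="webmention", either in an HTTP Link header or in HTML <link>/<a> elements. As a minimal sketch (handling only the Link-header case, with a made-up example URL), discovery can look like this:

```python
import re

def find_webmention_endpoint(link_header):
    """Pull a Webmention endpoint out of an HTTP Link header value.

    Per the W3C Webmention recommendation, a receiver advertises its
    endpoint with rel="webmention". A full implementation would also
    check HTML <link> and <a> elements; only the header case is here.
    """
    for part in link_header.split(","):
        m = re.search(r'<([^>]*)>\s*;\s*rel="?([^";]*)"?', part.strip())
        if m and "webmention" in m.group(2).split():
            return m.group(1)
    return None

# Hypothetical header a receiving site might return:
header = '<https://webmention.example/endpoint>; rel="webmention"'
print(find_webmention_endpoint(header))
```

    Once the endpoint is known, the sender simply POSTs `source` and `target` URLs to it as form-encoded parameters; services like Webmention.IO handle both halves of this exchange so a static site doesn't have to.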

    Webmention.IO and jekyll-webmention_io will capture some commentary. To get comments from Twitter, Mastodon, Facebook, and elsewhere, I added the Bridgy service to the mix. From their About page: “Bridgy periodically checks social networks for responses to your posts and links to your web site and sends them back to your site as webmentions.” So all of that commentary gets fed back into the blog post as well.

    I’ve just started using this Webmention/Bridgy setup, so I may have some pieces misconfigured. I’ll be watching over the next several blog posts to make sure everything is working. If you notice something that isn’t working, please reach out to me via one of the mechanisms listed in the sidebar of this site.

    Distant Reader Workshop Hands-On Activities / Eric Lease Morgan

    This is a small set of hands-on activities presented for the Keystone Digital Humanities 2021 annual meeting. The intent of the activities is to familiarize participants with the use and creation of Distant Reader study carrels. This page is also available as a PDF file designed for printing.


    The Distant Reader is a tool for reading. Given an almost arbitrary amount of unstructured data (text), the Reader creates a corpus, applies text mining against the corpus, and returns a structured data set amenable to analysis (“reading”) by students, researchers, scholars, and computers.

    The data sets created by the Reader are called “study carrels”. They contain a cache of the original input, plain text versions of the same, many different tab-delimited files enumerating textual features, a relational database file, and a number of narrative reports summarizing the whole. Given this set of information, it is easy to answer all sorts of questions that would have previously been very time consuming to address. Many of these questions are akin to newspaper reporter questions: who, what, when, where, how, and how many.

    Using more sophisticated techniques, the Reader can help you elucidate a corpus’s aboutness, plot themes over authors and time, create maps, create timelines, or even answer sublime questions such as, “What are some definitions of love, and how did the writings of St. Augustine and Jean-Jacques Rousseau compare to those definitions?”

    The Distant Reader and its library of study carrels are located at:

    Activity #1: Compare & contrast two study carrels

    These tasks introduce you to the nature of study carrels:

    1. From the library, identify two study carrels of interest, and call them Carrel A and Carrel B. Don’t think too hard about your selections.
    2. Read Carrel A, and answer the following three questions: 1) how many items are in the carrel, 2) if you were to describe the content of the carrel in one sentence, then what might that sentence be, and 3) what are some of the carrel’s bigrams that you find interesting and why.
    3. Read Carrel B, and answer the same three questions.
    4. Answer the question, “How are Carrels A and B similar and different?”

    Activity #2: Become familiar with the content of a study carrel

    These tasks stress the structured and consistent nature of study carrels:

    1. Download and uncompress both Carrel A and Carrel B.
    2. Count the number of items (files and directories) at the root of Carrel A. Count the number of items (files and directories) at the root of Carrel B. Answer the question, “What is the difference between the two counts?”. What can you infer from the answer?
    3. Open any of the items in the directory/folder named “cache”; all of the files there ought to be exact duplicates of the original inputs, even if they are HTML documents. In this way, the Reader implements aspects of preservation. À la LOCKSS, “Lots of copies keep stuff safe.”
    4. From the cache directory, identify an item of interest; pick any document-like file, and don’t think too hard about your selection.
    5. Given the name of the file from the previous step, open the file with the similar name but located in the folder/directory named “txt”, and you ought to see a plain text version of the original file. The Reader uses these plain text files as input for its text mining processes.
    6. Given the name of the file from the previous step, use your favorite spreadsheet program to open the similarly named file located in the folder/directory named “pos”. All files in the pos directory are tab-delimited files, and they can be opened in your spreadsheet program. I promise. Once opened, you ought to see a list of each and every token (“word”) found in the original document as well as the tokens’ lemma and part-of-speech values. Given this type of information, what sorts of questions do you think you can answer?
    7. Open the file named “MANIFEST.htm” found at the root of the study carrel, and once opened you will see an enumeration and description of all the folders/files in any given carrel. What types of files exist in a carrel, and what sorts of questions can you address if given such files?
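    The same tab-delimited files can also be read programmatically. Below is a minimal Python sketch of parsing a file from the pos directory and counting part-of-speech tags; the column names (token, lemma, pos) and the sample data are assumptions made for illustration, so check a real carrel's MANIFEST.htm for the actual layout.

```python
import csv
import io
from collections import Counter

# A tiny stand-in for a file from a carrel's "pos" directory; real files
# are tab-delimited, and the column names here are assumptions.
sample = (
    "token\tlemma\tpos\n"
    "The\tthe\tDET\n"
    "cats\tcat\tNOUN\n"
    "sleep\tsleep\tVERB\n"
    "soundly\tsoundly\tADV\n"
)

# In practice, replace io.StringIO(sample) with open("pos/filename.pos").
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

# Count how often each part-of-speech tag appears in the document.
tag_counts = Counter(row["pos"] for row in rows)
print(tag_counts.most_common())
```

With a real file, the same few lines answer questions like “What are the most frequently used nouns in this document?”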

    Activity #3: Create study carrels

    Anybody can create study carrels, and there are many ways to do so; here are two:

    1. Go to, and you may need to go through ORCID authentication along the way.
    2. Give your carrel a one-word name.
    3. Enter a URL of your choosing. Your home page, your institutional home page, or the home page of a Wikipedia article are good candidates.
    4. Click the Create button, and the Reader will begin to do its work.
    5. Create an empty folder/directory on your computer.
    6. Identify three or four PDF files on your computer, and copy them to the newly created directory. Compress (zip) the directory.
    7. Go to, and you may need to go through ORCID authentication along the way.
    8. Give your carrel a different one-word name.
    9. Select the .zip file you just created.
    10. Click the Create button, and the Reader will begin to do its work.
    11. Wait patiently, and along the way the Reader will inform you of its progress. Depending on many factors, your carrels will be completed in as little as two minutes or as long as an hour.
    12. Finally, repeat Activities #1 and #2 with your newly created study carrels.
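    Steps 5 and 6 above can also be scripted. The following Python sketch creates a directory, fills it with placeholder files, and compresses it into a .zip file ready for upload; the file names are fabricated for illustration, and in practice you would copy three or four real PDFs instead.

```python
import pathlib
import shutil
import tempfile

# Create an empty working directory, as in step 5.
workspace = pathlib.Path(tempfile.mkdtemp())
carrel_input = workspace / "my-carrel"
carrel_input.mkdir()

# Stand-ins for the PDF files of step 6; use your own files in practice.
for name in ("one.pdf", "two.pdf", "three.pdf"):
    (carrel_input / name).write_bytes(b"%PDF-1.4 placeholder")

# Compress the directory; make_archive returns the path of the .zip file.
zip_file = shutil.make_archive(str(workspace / "my-carrel"), "zip", str(carrel_input))
print(zip_file)
```

The resulting .zip file is what you select in step 9.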

    Extra credit activities

    The following activities outline how to use a number of cross-platform desktop/GUI applications to read study carrels:

    • Print any document found in the cache directory and use the traditional reading process to… read it. Consider using an active reading process by annotating passages with your pen or pencil.
    • Download Wordle, a fine visualization tool, from the Wayback Machine. Open any document found in the txt directory, and copy all of its content to the clipboard. Open Wordle, paste in the text, and create a tag cloud.
    • Download AntConc, a cross-platform concordance application. Use AntConc to open one or more files found in the txt directory, and then use AntConc to find snippets of text containing the bigrams identified in Activity #1. To increase precision, configure AntConc to use the stopword list found in any carrel at etc/stopwords.txt.
    • Download OpenRefine, a robust data cleaning and analysis program. Use OpenRefine to open one or more of the files in the folder/directory named “ent”. (These files enumerate named-entities found in your carrel.) Use OpenRefine to first clean the entities, and then use it to count & tabulate things like the people, places, and organizations identified in the carrel. Repeat this process for any of the files found in the directories named “adr”, “pos”, “wrd”, or “urls”.

    Extra extra credit activities

    As sets of structured data, the content of study carrels can be computed against. In other words, programs can be written in Python, R, Java, Bash, etc. which open study carrel files, manipulate the content in ways of your own design, and output knowledge. For example, you could open the named-entity files, select the entities of type PERSON, look up those people in Wikidata, extract their birth and death dates, and finally create a timeline illustrating who was mentioned in a carrel and when they lived. The same thing could be done for entities of type GPE (place), and a map could be output. A fledgling set of Jupyter Notebooks and command-line tools has been created just for these sorts of purposes, and you can find them on GitHub:
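    The first step of that timeline example, selecting PERSON entities, might look like the Python sketch below. The tab-delimited layout and the column names (entity, type) are assumptions, and the sample data is fabricated for illustration; a real script would read the files in a carrel's ent directory.

```python
import csv
import io

# A stand-in for one of a carrel's named-entity ("ent") files; the
# column names and data here are assumptions for illustration only.
sample = (
    "entity\ttype\n"
    "Augustine\tPERSON\n"
    "Rousseau\tPERSON\n"
    "Geneva\tGPE\n"
)

rows = csv.DictReader(io.StringIO(sample), delimiter="\t")

# Keep only the people; these are the names you might then look up
# in Wikidata to retrieve birth and death dates for a timeline.
people = sorted({row["entity"] for row in rows if row["type"] == "PERSON"})
print(people)  # → ['Augustine', 'Rousseau']
```

Swapping PERSON for GPE in the filter gives you the list of places for a map.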

    Every study carrel includes an SQLite relational database file (etc/reader.db). The database file includes all the information from all the tab-delimited files (named-entities, parts-of-speech, keywords, bibliographics, etc.). Given this database, a person can query the database from the command line, write a program to do so, or use GUI tools like DB Browser for SQLite or Datasette. The results of such queries can satisfy elaborate requests such as “Find all keywords from documents dated before a given year” or “Find all documents, and output them in a given citation style.” Take a gander at the SQL file named “etc/queries.sql” to learn how the database is structured. It will give you a head start.
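    Querying the database programmatically requires nothing beyond Python's built-in sqlite3 module. In the sketch below, the table and column names (wrd, id, keyword) are assumptions standing in for the real schema, and an in-memory database with fabricated rows stands in for etc/reader.db; consult etc/queries.sql in a real carrel for the actual table names.

```python
import sqlite3

# An in-memory database stands in for a carrel's etc/reader.db; the
# schema and rows below are fabricated for illustration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE wrd (id TEXT, keyword TEXT)")
connection.executemany(
    "INSERT INTO wrd (id, keyword) VALUES (?, ?)",
    [("doc-01", "love"), ("doc-01", "war"), ("doc-02", "love")],
)

# Which keywords appear in more than one document?
query = """
    SELECT keyword, COUNT(DISTINCT id)
    FROM wrd
    GROUP BY keyword
    HAVING COUNT(DISTINCT id) > 1
"""
for keyword, documents in connection.execute(query):
    print(keyword, documents)  # → love 2
```

Against a real reader.db, the same pattern works with `sqlite3.connect("etc/reader.db")` and the carrel's actual tables.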


    Given an almost arbitrary set of unstructured data (text), the Distant Reader outputs sets of structured data known as “study carrels”. The content of study carrels can be consumed using the traditional reading process, through the use of any number of desktop/GUI applications, or programmatically. This document outlined each of these techniques.

    Embrace information overload. Use the Distant Reader.