We are pleased to announce Dr. Jamie Lee as the keynote speaker for Digital Preservation 2023: Communities of Time and Place (#DigiPres23). Dr. Lee is an Associate Professor of Digital Culture, Information, and Society at the School of Information, University of Arizona, and is a scholar, activist, filmmaker, archivist, oral historian, partner, co-parent, neighbor, and friend. They founded and direct the Arizona Queer Archives (www.arizonaqueerarchives.com), where they train community members on facilitating oral history interviews and building collections in and with their own families and communities.
With storytelling at the heart of their life’s work, Lee also directs the Digital Storytelling & Oral History Lab and co-founded the Critical Archives and Curation Collaborative, the co/lab, through which they collaborate on such storytelling projects as secrets of the agave: a Climate Justice Storytelling Project (www.secretsoftheagave.com), the Climate Alliance Mapping Project, CAMP (www.climatealliancemap.org), and the Stories of Arizona’s Tribal Libraries Oral History Project (with Dr. Sandy Littletree and Knowledge River). Lee’s 2021 research monograph, Producing the Archival Body, engages storytelling to reconsider how archives are defined, understood, deployed, and accessed to produce subjects. Arguing that archives and bodies are mutually constitutive, and focusing on the body and embodiment alongside archival theory, Lee introduces new understandings of archival bodies that interrogate how power circulates in archival contexts, building a critical picture of how deeply archives shape the production of knowledges and human subjectivities. For more on Lee’s projects, visit www.thestorytellinglab.io.
In their keynote talk, “Kairotic and Kin-centric Archives: Addressing Abundances and Abandonments,” Dr. Lee traverses the persistent memories and memory-making practices of their local queer borderlands communities through frameworks of the kairotic and kin-centric. Sharing stories from two distinct community-based digital archiving projects, Lee attends to loss and to re-collection and explicitly addresses both abundances and abandonments.
More information on Dr. Lee’s keynote talk will be shared when the DigiPres program schedule is released soon!
GeoTE Tanzania celebrated Open Data Day on 10 March 2023 with an event that brought together representatives from different organisations and students. The event theme was “Mapping with Artificial Intelligence for Disaster Response”, and we aimed to help create open data for areas affected by the earthquakes in Turkey and Syria. The event was successfully organised around several activities aimed not only at contributing to open data but also at building the knowledge of students, researchers, and stakeholders about why open data is so important for development.
The following are the activities of the event:
How AI is used in open data creation
Participants were trained on different tools that deploy the power of artificial intelligence to detect ground features and help mappers contribute to open data. Working with OpenStreetMap, mapwith.ai enables the mapping community to enjoy a faster and more accurate mapping experience: artificial intelligence predicts features on high-resolution satellite imagery, the predictions are loaded into the RapiD editor, and open data is created as mappers verify the predicted features. This was hands-on training in which we worked through several projects created in the Tasking Manager to demonstrate the power of AI in open mapping. The presentation was given by YouthMappers from Jordan University and the Sokoine University of Agriculture.
How to improve the quality of AI-generated data
With the use of AI for data creation, it is essential to pay attention to data quality. Beyond using Quality Assurance tools to improve the quality of OpenStreetMap data, we discussed the need for every mapper to ensure that the data they generate meets all elements of data quality when mapping and contributing to open data.
Inclusiveness and collaborations
In this session, we discussed the need to promote inclusiveness in data-driven, community-centred programmes that support sustainable development from the local to the global level. The session also highlighted opportunities for youth, especially in universities, to work together with community organisations as a global network of students, scholars, educators, and researchers.
Mapathon
Finally, we held a two-hour Mapping Party (Mapathon) in which participants contributed to open data in project number 14329 and in earthquake-affected-areas mapping tasks, all hosted in the Humanitarian OpenStreetMap Tasking Manager. In those two hours, participants traced ground features, specifically buildings and roads, from satellite imagery, mapping over 15,000 buildings and about 100 km of roads across the projects.
We learned that AI can assist and amplify the efforts of mappers to produce more map data. However, this comes with the condition that the data created must be of high quality, and mappers of all levels contributing to humanitarian efforts should check data quality before contributing.
To give young researchers of the University of Mahajanga the opportunity to learn about Artificial Intelligence and Machine Learning for biodiversity conservation, Tanjona Association organised an Open Data Day event. The event was held on 9–10 March 2023 at the University of Mahajanga under the name “Hay Tech”, the same as last year, to ensure its continuity.
The objectives focused on the importance of using advanced technologies in research:
To train young scientists in becoming familiar with the web-based application Digital Earth Africa for forest monitoring;
To familiarise young researchers with using new technologies for their future research;
To strengthen the connection between young researchers and mentors;
To teach young scientists about effective and readable data collection so that their data can be accessed by other users.
Thirty graduate students from the science department attended the event. A minimum level of GIS knowledge was required in order to attend. The participants were mentored by senior researchers from the Department of Science who already have extensive experience using advanced technologies such as AI and ML in their current research.
The event started with a self-introduction from each participant and mentor. After the introductions, the session continued with an introduction to the web-based application Digital Earth Africa. Digital Earth Africa is a free and open-access platform with built-in Python scripts for remote sensing across multiple domains, including forest monitoring and change detection. This part was provided by Kanto RAZANAJATOVO, one of the participants in the official launch of Digital Earth Africa in 2020. The importance of AI and ML in advancing and enhancing research was explained to participants to familiarise them with, and encourage them to use, open-access and advanced tools like Digital Earth Africa.
Fieldwork was conducted after the presentation, in which participants learned how to collect data that can be easily used by other people and how to verify data against results from remote sensing. This part was provided by Jospin BANAH and Herizo ONINJATOVO RADONIRINA. Trainees were invited to improve their data collection methods so that their data remains accessible for the next generation, as we constantly promote the availability of data for students. A hands-on QGIS session, focusing on georeferencing, was provided by Dr. Bernard ANDRIAMAHATANTSOA.
Since the 2022 Open Data Day event, graduate students have been able to stay connected with mentors in using GIS and other tools related to biodiversity conservation. The continuation of Open Data Day helped graduate students work with tools that had seemed difficult for them, such as QGIS, effective and clean databases, and remote sensing.
I am excited to be sharing the outcomes of a major research project in service of archives, archival discovery, and archives researchers. For the last 2.5 years, OCLC has led research work for Building a National Finding Aid Network (NAFAN), an IMLS-supported research and demonstration project rooted in the goal of providing inclusive, comprehensive, and persistent access to finding aids by laying the foundation for a national finding aid network available to all contributors and researchers.
This research will inform next steps for the NAFAN project and also offers a wealth of information on archival user behavior and needs, and the current state of archival description workflows and data. OCLC has published five reports on its findings from the NAFAN research:
Summary of Research—Synthesizes findings from across all research activities on the NAFAN project
Pop-up Survey—Summarizes results from a national survey of online archive users on their search behavior, information needs, and demographic characteristics
User Interviews—Details findings from interviews with archival aggregation end users on their information needs and information-seeking behavior
Focus Group Interviews—Shares outcomes from focus group discussions with archivists to examine their needs for describing collections and contributing description to an archival aggregation
EAD Analysis—Analyzes EAD data as raw material for building a finding aid aggregation by looking for common data structures present and probing for gaps that could impede user discovery
The NAFAN project was a true collaboration. Within OCLC, colleagues from across the research team with expertise in quantitative and qualitative research methods, archives practice, and archival data came together to bring their skills to bear on the mixed methods research, and our communication team put in a huge effort to publish five distinct reports. A thoughtful and engaged cohort of archivists served as an advisory board for the research. The overall project was coordinated by the California Digital Library (CDL), in collaboration with OCLC, the University of Virginia Library, Shift Collective, and Chain Bridge Group, and in partnership with statewide/regional finding aid aggregators and LYRASIS (ArchivesSpace) as a technical consulting partner. And it was supported by IMLS through grant #LG-246349-OLS-20.
We’re very excited to share this research, and hope readers will find it informative and useful. We’ll be out in the world talking about the findings throughout the coming year, starting with presenting at the upcoming Society of American Archivists conference and in virtual pre-conference activities. Research Library Partnership members can attend a series of upcoming webinars on this work. We’ve been blogging about our process throughout the NAFAN research, and you can read past posts here. And stay tuned for future posts as we share more about our findings and dig into some of the details we find particularly interesting.
I have been generally skeptical of claims that blockchain technology and cryptocurrencies are major innovations. Back in 2017 Arvind Narayanan and Jeremy Clark published Bitcoin's Academic Pedigree, showing that Satoshi Nakamoto assembled a set of previously published components in a novel way to create Bitcoin. Essentially the only innovation among the components was the Longest Chain Rule.
But, for good or ill, there is at least one genuinely innovative feature of the cryptocurrency ecosystem, and in Flash loans, flash attacks, and the future of DeFi, Aidan Saggers, Lukas Alemu and Irina Mnohoghitnei of the Bank of England provide an excellent overview of it. They:
analysed the Ethereum blockchain (using Alchemy’s archive node) and gathered every transaction which has utilised the ‘FlashLoan’ smart contract provided by DeFi protocol Aave V1 and V2. The Aave protocol, one of the largest DeFi liquidity providers, popularised flash loans and is often credited with their design. Using this data we were able to gather 60,000 unique transactions from Aave’s flash loan inception through to 2023
Below the fold I discuss their overview and some of the many innovative ways in which flash loans have been used.
The key enabler of flash loans was Ethereum's ability to wrap a Turing-complete program in an atomic transaction. The Turing-complete part allowed the transaction to perform an arbitrary sequence of actions, and the atomic part ensured that either all the actions would be performed, or none of the actions would be performed. Back in 2021 Kaihua Qin, Liyi Zhou, Benjamin Livshits, and Arthur Gervais from Imperial College posted Attacking the defi ecosystem with flash loans for fun and profit, analyzing and optimizing two early flash loan attacks:
We show quantitatively how transaction atomicity increases the arbitrage revenue. We moreover analyze two existing attacks with ROIs beyond 500k%. We formulate finding the attack parameters as an optimization problem over the state of the underlying Ethereum blockchain and the state of the DeFi ecosystem. We show how malicious adversaries can efficiently maximize an attack profit and hence damage the DeFi ecosystem further. Specifically, we present how two previously executed attacks can be “boosted” to result in a profit of 829.5k USD and 1.1M USD, respectively, which is a boost of 2.37× and 1.73×, respectively.
DeFi protocols join the ecosystem, which leads to both exploits against protocols themselves as well as multi-step attacks that utilize several protocols such as the two attacks in Section 3. In a certain poignant way, this highlights the fact that DeFi, lacking a central authority that would enforce a strong security posture, is ultimately vulnerable to a multitude of attacks by design. Flash loans are merely a mechanism that accelerates these attacks. It does so by requiring no collateral (except for the minor gas costs), which is impossible in traditional finance due to regulations. In a certain way, flash loans democratize the attack, opening this strategy to the masses
Flash loans are unlimited uncollateralised loans, in which a user both receives and returns borrowed funds in the same blockchain transaction. Currently they exist exclusively within the DeFi ecosystem. DeFi aims to be an alternative to traditional finance (TradFi), with centralised intermediaries replaced by so-called decentralised code-based protocols. These protocols, based on distributed ledger technology, eliminate, in theory, the need for trust in counterparties and for financial institutions as we know them.
...
It is important to understand that the lender is exposed to almost no credit risk when participating in a flash loan, hence collateral is not required. Flash loans leverage smart contracts (code which ensures that funds do not change hands until a specific set of rules are met) and the atomicity of blockchains (either all or none of the transaction occurs) to enable a form of lending that has no traditional equivalents.
Flash loans are therefore only available to the borrower for the short duration of the transaction. Within this brief period, the borrower must request the funds, call on other smart contracts to perform near-instantaneous trades with the loaned capital, and return the funds before the transaction ends. If the funds are returned and all the sub-tasks execute smoothly, the transaction is validated.
On a platform such as Aave, this is how flash loans typically work:
The borrower applies for a flash loan on Aave.
The borrower creates a logic of exchanges to try making a profit, such as sales, DEX purchases, trades, etc.
The borrower repays the loan, makes a profit, and pays a 0.09% fee.
If any of the following conditions occur, the transaction is reversed, and the funds are returned to the lender:
The borrower does not repay the capital
The trade does not lead to a profit
"reversed" is not quite the right word. If transaction does not repay the loan, or if the trade does not make a profit, the transaction is not validated and thus does not become part of the history on the blockchain. It is as if it never happened, except that the failed transaction incurred gas fees. Saggers et alexplain the fees:
A non-refundable fee that covers the operational costs of running the smart contracts must be paid up-front, known as the ‘gas fee’ for the transaction – this is true for any Distributed Ledger Technology transaction and not specific to flash loans. Further commission fees are charged only once the transaction executes successfully, making the whole endeavour nearly ‘risk free’ to both the borrower and lender.
Based on their 60,000 flash loan dataset, Saggers et al's Figure 2 displays a histogram of "the ratio between the gas fee paid by a flash loan transaction and the average gas fee paid on the same day, for all transactions on the Ethereum blockchain", from which they determined that:
on average, flash loans cost roughly 15 times as much as a standard DeFi transaction.
Of course, if they succeed the flash loan transactions are likely to be much more profitable than other DeFi transactions, outweighing the additional cost.
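To make the all-or-nothing mechanics and the fee arithmetic concrete, here is a minimal Python sketch of the pattern described above; the names, numbers, and functions are illustrative assumptions, not Aave's actual contract logic or real EVM semantics.

```python
# Minimal sketch of a flash loan's all-or-nothing execution. Everything here is
# illustrative pseudo-logic, not Aave's contracts or real EVM code.

FLASH_LOAN_FEE = 0.0009   # the 0.09% commission mentioned above


class Revert(Exception):
    """The transaction fails: no state change is recorded, only gas is lost."""


def flash_loan(amount, gas_fee, strategy):
    """Borrow `amount`, run the borrower's strategy, and require repayment plus fee."""
    proceeds = strategy(amount)          # arbitrage, collateral swap, attack, ...
    owed = amount * (1 + FLASH_LOAN_FEE)
    if proceeds < owed:
        # On-chain this is a revert: the loan never happened, the lender keeps its funds.
        raise Revert("loan not repaid; transaction never becomes part of the chain")
    return proceeds - owed - gas_fee     # borrower's profit after fee and gas


# Example: borrowing 10_000_000 with a strategy that nets 0.5% comfortably beats the
# 0.09% fee and commits; one that nets 0.05% reverts, and only the gas fee is lost.
```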
There are two reasons why flash loans are so expensive: they are typically urgent and complex. Saggers et al explain:
The more events included in a transaction, the more space it takes on the Ethereum Virtual Machine. Given the uncertain execution of these loans, some users are also willing to pay additional prioritisation fees for their transaction to be included in the most immediate block added.
They compiled a histogram of the number of logs in flash loan transactions and concluded:
cost is proportional to the complexity of a transaction, and on this count, flash loans also stand out from typical transactions. Flash loans typically contain between 35–70 logs (Figure 3) per transaction compared to roughly 5–10 logs for the average Aave transaction.
Because of these high fees, flash loans require sufficiently large profit opportunities. Saggers et al list them:
Flash loans are most commonly used for arbitrage opportunities, for example if traders look to quickly profit from a mismatch in cryptoassets’ pricing across markets. Flash loans can also be used for collateral swaps – a technique where a user closes their loan with borrowed funds to immediately open a new loan with a different asset as collateral – or debt-refinancing through ‘interest rate swaps’ from different protocols.
But the really profitable use of flash loans is for theft. Molly White maintains a timeline of successful flash loan attacks; as I write the entries for the trailing 12 months are:
White is thus recording more than one such successful heist per month. As shown in their Figure 4, Saggers et al's 60,000 transaction database demonstrates that:
To this date over US$6.5 billion dollars’ worth of cryptocurrency has been stolen in attacks directly attributable to flash loans.
Since their data covers only one of the many DeFi protocols, albeit the most liquid, this is surely an underestimate. White's Web3 is Going Just Great timeline currently records over $12B in losses, but not all of these exploited flash loans. Further, it is important to note, as Jemima Kelly and Molly White both point out, all these dollar numbers are "notional value". The actual dollars that can be realized from these heists are typically much less than these headline figures, as Ilya Lichtenstein and Heather Morgan found out.
Table A shows that out of the top five largest amounts borrowed via flash loans, four of these were used to attack protocols.
Date       | Amount borrowed (US$ millions) | Protocol attacked  | Amount stolen (US$ millions)
-----------|--------------------------------|--------------------|-----------------------------
27/10/2021 | 2,100                          | Cream Finance      | 130
16/06/2022 | 609                            | Inverse Finance    | 5.8
17/04/2022 | 500                            | Beanstalk (loan1)  | 181 (total)
22/05/2021 | 396                            | N/A                | N/A
17/04/2022 | 350                            | Beanstalk (loan2)  | 181 (total)
This seems reasonable, in that even using a flash loan, borrowing hundreds of millions of dollars "worth" of cryptocurrency is expensive, and the fees for executing the flash loan attack are expensive, so only the lure of ill-gotten gain if the attack succeeds can motivate the expense.
Let's look at the two most profitable attacks in Table A, Cream Finance ($130M on October 27th, 2021) and Beanstalk ($181M on April 17th, 2022).
Crypto lending service C.R.E.A.M. Finance lost $130 million in a flash loan attack. It was the third hack of the platform this year, following a $37.5 million hack in February and an $18.8 million attack in August.
Decentralized finance protocols (DeFi) Cream Finance and Alpha Finance were victims of one of the largest flash loan attacks ever Saturday morning, resulting in a loss of funds totaling $37.5 million,
...
This is the second attack on a DeFi protocol in the last two weeks. Cronje's Yearn Finance suffered an exploit in one of its DAI lending pools, according to the decentralized finance protocol’s official Twitter account. That exploit drained $11 million.
Cronje is Andre Cronje. The second heist exploited a reentrancy bug, not a flash loan. Eliza Gkritsi reported on it:
The attack was first reported by PeckShield in a tweet early on Monday. The blockchain security firm pointed to Ethereum records showing at least $6 million were drained at 5:44 UTC.
Cream Finance later confirmed the hack in a tweet, adding that 418,311,571 AMP tokens and 1,308.09 ether had been stolen, bringing the total value of the hack to just over $25 million. PeckShield updated its estimate, saying the hacker siphoned off about $18.8 million.
The root cause of the incident was lending of AMP tokens, Cream Finance Product Manager Eason Wu said on Discord. Other assets on Cream are secure, he said.
AMP token contracts allowed for a reentrancy attack, the same type of exploit used in the infamous DAO hack.
According to blockchain records, $92 million was stolen into one address and $23 million into another, alongside other funds taken. The funds are now being moved around to different wallets.
The funds stolen were mostly in Cream LP tokens and other ERC-20 tokens. Cream LP tokens are tokens you receive when you deposit funds into the Cream pools.
The price of cream (CREAM) has plummeted following the news, down from $152 to $111 in minutes — a 27% drop — according to CoinGecko.
It is now around $25. It is amazing that, with this history of incompetence, anyone would pay anything for it.
Beanstalk
Beanstalk was an algorithmic metastablecoin that, as David Gerard pointed out, was obviously a scam:
Beanstalk was offering interest on locked-in BEAN tokens on the order of 2,000% to 4,000% annual percentage rate. Those numbers are enough to tell you straight away that this is not a sustainable scheme.
It used a slightly different algorithm to the infamous Terra/Luna pair that "transitioned to state B" on May 13th, 2022. Twenty-six days earlier, Beanstalk suffered a flash loan attack. Molly White summarized the attack:
All my magic beans gone. An attacker successfully used a flash loan attack to exploit a flaw in Beanstalk Farms' stablecoin protocol, which allowed them to make off with 24,830 ETH (almost $76 million). The attacker then donated $250,000 to Ukraine before moving the remaining funds to Tornado Cash to tumble.
Estimated damages to the project were higher than the amount the hacker was able to take for themselves — around $182 million. The $BEAN token, once pegged to $1, dropped to nearly 0.
The exploiter used flash loans to borrow enough of the required voting power to push a proposal through. The proposal was pushed through an emergency execution option in Beanstalk. This emergency execution option allowed for a BIP to be executed if a ‘super majority’ voted in favor of it.
The BIP had a ‘hidden’ exit call that would withdraw all the funds once the BIP was executed.
Flash loans complete in a single block, so the $BEAN that was loaned was actually non-existent. But the loan allowed the exploiter to inflate his holdings and get a supermajority of $STALK, to push through the BIP, before the loan closed.
When all was said and done ~$75 million was removed from the liquidity pool (roughly evenly split between $BEAN and $ETH).
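The pattern in the quoted explanation can be summarized in a short, deliberately simplified Python sketch; every name, threshold, and function here is a hypothetical illustration of the flow described above, not Beanstalk's actual contract code.

```python
# Deliberately simplified sketch of the flash-loan governance attack described
# above; names, numbers, and functions are hypothetical, not Beanstalk's code.

SUPERMAJORITY = 2 / 3            # assumed threshold for the emergency execution path


def governance_attack(borrowed, existing_votes, other_votes, treasury):
    """One atomic transaction: borrow assets, mint votes, pass the malicious BIP, drain, repay."""
    minted_votes = borrowed                      # deposit the loaned assets to mint voting power
    votes = existing_votes + minted_votes

    # Vote for the attacker's proposal (the BIP with the hidden exit call).
    if votes / (votes + other_votes) <= SUPERMAJORITY:
        raise RuntimeError("revert: no supermajority, the whole transaction is discarded")

    proceeds = treasury                          # emergency execution drains the funds
    fee = 0.0009 * borrowed                      # flash loan fee repaid out of the proceeds
    return proceeds - fee                        # attacker's profit, all within one block
```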
This is an example of the problems of governance via tokens that I discussed in Be Careful What You Vote For. In Code Is LOL Ed Zitron riffs on the same idea:
Much of what has been written about this has (as I have) used the term “attacker,” as if an attack was made in anything other than complete accordance with how the project worked. The project was set up to allow the largest amount of votes to pass whatever proposal was made, which, in this case, was “I think that I should have all of the money.” To describe this as an attack, a hack, or “malevolent” is to misunderstand the vacuous idea of “code is law” and Decentralized Autonomous Organizations.
“Code is law” refers to the idea that code can govern without human intervention, and DAOs are the Code Is Law monster - an autonomous organization that operates entirely based on what the highest amount of votes suggest. The problem is obvious if you’ve met more than one person in your life - human beings are fallible and biased, and prone to making mistakes.
...
The irony here is that cryptocurrency grew out of anti-government propaganda - the idea that our current laws allow the rich to take advantage of specific systems to benefit them. Except, if anything, cryptocurrency has created an even more exploitable system, naturally built for the rich to extract from the poor that they’ve tricked into buying into.
As part of the commemoration of Open Data Day, OpenStreetMap (OSM) Togo, in partnership with Wikimedia Community Togo and the Internet Society Togo chapter, hosted a “Meet-up on Artificial Intelligence and Open Data” on 4th March 2023. The event aimed to bring data stakeholders together around the same table to discuss the real potential of open data and AI to address the daily challenges faced by communities. Around 25 participants attended the event.
Moderated by Ezéchiel Ametovena, the event started with a presentation by Hermann Kassalouwa, Open Data Activist & President of the Wikimedia Community Togo on the state of Open Data in Togo and Open Data Day. The aim of the presentation was to introduce participants to the fundamental concepts of open data and its state in Togo. Particular emphasis was placed on the laws sanctioning open data and their application in Togo. Practical examples of open data platforms set up by the Togolese authorities for the sharing of public data were presented.
The second session was about Artificial Intelligence (AI) and its applications in humanitarian mapping. Moderated by Kokou Amegayibo, President of OSM Togo, the session showed participants what artificial intelligence is, the different forms of AI that exist, and how AI can break the routine while facilitating human life. Ethical issues, such as accountability and transparency of decisions made by machines, privacy, and data security were also raised to show the limits of AI technology.
Zoubinnaba Youma led the third session, where he demonstrated Map with AI, a tool developed by Meta to support mapping through the OSM project. He drew a link between the earthquake in Haiti and the one that recently struck Syria and Turkey. He also showcased the use of deep learning implemented by Mapillary to detect street-level signs.
The final session was led by Ata Franck Akouete, who introduced the basics and classic techniques of editing articles on Wikipedia and demonstrated how to use ChatGPT and QGIS. The session showed how to get started with ChatGPT and how it can be used to quickly improve article editing on Wikipedia. He also demonstrated the use of ChatGPT with QGIS to show participants how to explore its potential in exploiting open geodata freely accessible online. Following this series of demonstrations, a discussion was held around Overture Maps, a new structured geographic database built on AI and existing open data, including OSM.
It is important to note that before the event, more than 40% of the registered participants had no notion of the concept of AI. Presentations followed by Q&A sessions helped participants better understand the different topics.
Here in the Northern Hemisphere, it is finally feeling like spring with warming temperatures and flowers blooming; for our Partners in the Southern Hemisphere, I hope you are easing into autumn. Both equinoctial periods bring a sense of change but also continuity.
With the OCLC RLP, we are humming along with exciting research and programs underway while continuing our steady support and care for our institutions.
The concept for these reports originated in a 2019 discussion among RLP members of the challenges facing art research libraries, including an acute lack of space (including off-site storage), a lack of knowledge about the overall contours of peer institutions, and the value of partnerships between dissimilar library types. The reports present models and other findings that help support art libraries in providing access to materials and resources for the continued advancement of art scholarship.
A May webinar focused on findings and next steps based on this report for the OCLC RLP, and we invite you to take advantage of this opportunity to learn more about this research.
The RLP holds space for shared understanding and action
The RLP uses its “power to convene” to bring together our partner institutions around new and evolving areas of focus in research libraries. We have recently convened partners around critical new areas: bias in research analytics and collections as data. In January, we collectively explored the topic of bias in research analytics. This is an important area to investigate as institutions increasingly rely on big data to power an array of indicators about research activities to support decision-making, competitive analysis, economic development, impact assessment, and individual and institutional reputation management. A recent blog post guest-authored by RLP Partners describes our discussions on this topic and issues “a call to action” for institutions to join the RLP in examining scholarly communications and research impact activities critically.
In March, we convened around collections as data, making collection materials held in libraries, archives, and museums available for computationally driven research and teaching, an evolving practice. Chela Scott Weber represented the realities and aspirations of RLP institutions on this topic at Collections as Data: State of the Field and Future Directions, a Mellon Foundation-supported symposium that will help chart prospective ways forward.
Following up on our work around diversity, equity, and inclusion, we have seen institutions seeking to diversify their collections. Once again, we drew from the RLP to discover how institutions are approaching diversifying collections, what challenges they face in doing so, and what the future looks like. As an outcome, we summarized the findings in a blog post and hosted a well-attended panel discussion at ACRL (watch the recording). We will plan additional programming around this topic in the coming months.
Merrilee Proffitt, Mark McBride (Ithaka S+R) and Brian Lavoie in conversation at the ACRL Conference.
I know you look to the OCLC Research Library Partnership as an avenue for innovating in libraries. As we move into this new season of the RLP, we are dedicated to ensuring a strong and secure future for our organization. We are so grateful for your support.
If you want to learn more about the OCLC RLP and how your institution can take part, please contact me, or anyone on our talented team.
GitHub Copilot is a technology that is designed to help you write code, kind of like your partner in pair programming. But, you know, it’s not an actual person. It’s “A.I.”–whatever that means.
In principle this sounds like it might actually be a good thing right? I know several people, who I respect, that use it as part of their daily work. I’ve heard smart people say AI coding assistants like Copilot will democratize programming, by making it possible for more people to write code, and automate the drudgery out of their lives. I’m not convinced.
Here’s why I don’t use Copilot (or ChatGPT) to write code:
1. Copilot’s suggestions are based on a corpus of open source code in Github, but the suggestions do not mention where the code came from, and what the license is. GitHub is stealing and selling intellectual property.
2. Copilot lets you write code faster. I don’t think more code is a good thing. The more code there is, the more code there is to maintain. Minimalism in features is usually a good thing too. Less really is more.
3. As more and more programmers use Copilot it creates conservativism in languages and frameworks that prevents people from creating and learning new ways of doing things. Collectively, we get even more stuck in our ways and biases. Some of the biases encoded into LLMs are things that we are actively trying to change.
4. Developers become dependent on Copilot for intellectual work. Actually, maybe addicted is a better word here. The same could be (and was) said about the effect of search engines on software development work (e.g. Googling error messages). But the difference is that search results need to be interpreted, and the resulting web pages have important context that you often need to understand. This is work that Copilot optimizes away and truncates our knowledge in the process.
5. Copilot costs money. It doesn’t cost tons of money (for a professional person in the USA) but it could be significant for some. Who does it privilege? Also, it could change (see point 4). Remember who owns this thing.
6. How much energy does it take to run Copilot as millions of developers outsource their intellectual work to its LLM infrastructure? Is this massive centralization and enclosure really progress in computing? Or is it a step backwards as we try to reduce our energy use as a species?
7. What does Copilot see of the code in your editor? Does it use your code as context for the prompt? What does it store, and remember, and give to others? Somebody has probably looked into this, but if they have it is always up for revision. Just out of principle I don’t want my editor sending my code somewhere else without me intentionally doing it.
8. Working with others who use Copilot makes my job harder, since they sometimes don’t really understand the details of why the code is written a particular way. Over time Copilot code can mix idioms, styles and approaches, in ways that the developer doesn’t really understand or even recognize. This makes maintenance harder.
As far as I can tell the only redeeming qualities of Copilot are:
1. Copilot encourages you to articulate and describe a problem as written prose before starting to write code. You don’t need Copilot for this. Maybe keep a work journal or write a design document? Maybe use your issue tracker? Use text to communicate with other people.
2. Copilot is more interactive than a rubber duck. But, it turns out Actual People are even more interactive and surprising. Reach out to other professionals and make some friends. Go to workshops and conferences.
3. I could be convinced that Copilot has a useful place in the review of code rather than the first draft of code. It wouldn’t be a replacement for review by people, but I believe it could potentially help people do the review. I don’t think this exists yet?
4. Copilot makes me think critically about machine learning technology, my profession and its place in the world.
Maybe my thinking on this will change. But I doubt it. I’m on the older side for a software developer, and (hopefully) will retire some day. Maybe people like me are on the way out, and writing code with Copilot and ChatGPT is the future. I really hope not.
But some good news: you can still uninstall it–from your computer, and from your life.
Cristina Fontánez Rodríguez is the Virginia Thoren and Institute Archivist at Pratt Institute Libraries and Visiting Assistant Professor at the Pratt Institute School of Information. Prior to joining Pratt, Cristina was a National Digital Stewardship Resident for Art Information at the Maryland Institute College of Art’s Decker Library. Cristina’s work is focused on the application of social justice principles to archival practice through participatory and non-hierarchical ways of knowledge-seeking and making. She is a founding member of Archivistas en Espanglish, a collective dedicated to amplifying spaces of memory-building between Latin America and Latinx communities in the US. Cristina also co-runs Barchives, an independent outreach initiative that brings archivists to bars to talk about New York City’s archival collections and local history. She holds a BA in Geography from Universidad de Puerto Rico Recinto de Río Piedras and an MLS with a certificate in Archives and Preservation of Cultural Materials from CUNY Queens College.
The ARLIS/NA Annual Meeting was perhaps the first in-person conference I’ve attended since the pandemic. While I usually engage more with archives-focused organizations, due to the increased number of sessions on archives, the lack of in-person events in the past years, and an opportunity to visit Mexico City, I decided to attend this year’s ARLIS/NA meeting.
Contrary to my past experiences at large conferences, ARLIS/NA didn’t feel overwhelming or impersonal. I reconnected with archivists and librarians I hadn’t seen in years and met others who share my interests. I even met a librarian from my Alma Mater, the University of Puerto Rico, and after chatting for a while realized we had attended the same high school in San Juan. This meeting was fortuitous because, besides having someone with a similar background to me to share this experience with, we also got the opportunity to participate in private tours outside the conference.
I was particularly impressed by photographer Alejandro Cartagena’s talk, the closing plenary with Dr. Barbara E. Mundy, and the panel Voces de México. These three sessions were content-focused instead of project-focused, which was refreshing. While I love a database cleanup project as much as the next person, having the opportunity to hear from artists and art historians about their work was one of the highlights of my conference experience. These three sessions also had something in common: they dealt with imperialism and colonialism and how those forces have shaped the speakers’ work. For example, Alejandro Cartagena spoke at length about documenting Monterrey’s social and physical landscapes as they’ve transformed due to migration, assimilation, and drug cartels; Dr. Mundy was asked about Mexican artifacts housed in European museums; and Ángel Aurelio González Amozorrutia and Merry MacMasters (Voces de México panel) both spoke about Mexican archival collections housed in the United States.
Whether other speakers intended it this way or not, U.S.-based organizations’ ownership over the cultural heritage of other countries was a theme throughout the conference and the one that resonated with me the most. Latin American collections were heavily featured in the conference’s programming, from workshops to the plenary sessions mentioned above. After each session, I asked myself: How can we avoid replicating extractivist frameworks that prioritize access to U.S.-based students and scholars? Is making collections available digitally enough? (No.) When digitization is the only reparative solution available to us, are we doing enough work to contextualize these collections for non-U.S. researchers, especially for those whose culture and history are being documented? (Also, no.) How can archivists and librarians ethically discuss and present on these types of collections, especially when they were not responsible for acquiring them? And lastly, in focusing our efforts on demonstrating the value of these collections to our directors, boards, and existing communities, aren’t we neglecting the communities who have always known their value?
I confess that I don’t have the answer to all these questions, but at times I felt we needed to intentionally address the socio-political ramifications accompanying these collections’ very existence. However, for me, the presentations provided a starting line for thought-provoking post-session discussions with colleagues.
We look at what conversational search is, how it relates to large language models like ChatGPT and PaLM 2, and how the technology can provide engaging search experiences.
Applications for funding to attend the 2023 DLF Forum as a Forum Fellow or GLAM Cross-Pollinator are open through June 9. Forum Fellowships (including free registration and travel stipend) are open to digital library, archives, museum, and arts & cultural heritage students and professionals, and our GLAM Cross-Pollinator Registration Awards will provide complimentary registration to a member of each of the following organizations: ARLIS/NA, ASIS&T, or ER&L. Learn more about eligibility and complete our brief application on the DLF Forum website.
CLIR and DLF will be closed on Monday, June 19, in observance of Juneteenth.
This month’s DLF group events:
DLF Assessment Interest Group Content Reuse Working Group – Introducing D-CRAFT: A Toolkit for Assessing Reuse of Digital Objects in Your Collections Monday, June 5, 2pm ET/11am PT; register for call-in information
Digital libraries are increasingly challenged with the task of assessing reuse of digital objects in their collections. This can be a complex and time-consuming process, and there is no one-size-fits-all solution.
D-CRAFT is a new toolkit that can help digital library practitioners assess reuse of digital objects in their collections. D-CRAFT includes tools, methods, recommended practices, explanatory videos, and tutorials that can help you streamline the reuse assessment process and make it more efficient.
Register for our webinar today to learn more about D-CRAFT and how it can help you assess reuse of digital objects in your collections. After registering, you will receive a confirmation email with information about joining the meeting.
The webinar will be recorded and sent out to registrants regardless of attendance.
DLF Assessment Interest Group Cost Assessment Working Group – OCR, Handwriting Recognition, and Speech-to-Text Monday, June 12, 3pm ET/12pm PT; contact us at info@diglib.org for call-in information
Please join the DLF AIG Cost Assessment Working Group at a discussion session on OCR, handwriting recognition software, and speech-to-text tools. We will be discussing the various tools used, such as ABBYY FineReader, Adobe Acrobat, PrimeOCR, Rev, Tesseract, Transkribus, tools integrated within platforms such as ContentDM, and newer tools such as Amazon Textract and Whisper. We invite all who are interested in the topic and those who have worked with these tools to share their experience and knowledge pertaining to cost analysis, workflow integration, overall functionality and accuracy. This session will not be recorded but a summary of the discussion will be released. Please fill out this voluntary survey so we can gather information to help guide the discussion: https://forms.gle/nZhR53aJQH3J2WxH8.
DLF Arts and Cultural Heritage Working Group meeting Thursday, June 15, 2pm ET/11am PT; contact us at info@diglib.org for call-in information
Join DLF’s Arts and Cultural Heritage Working Group (formerly known as the Museums Cohort) for a discussion on artist/technologist residencies and other artist collaborations to increase engagement with digital collection holdings led by Jaime Mears, senior innovation specialist at LC Labs at the Library of Congress.
DLF Climate Justice Working Group – Book Club Monthly on last Tuesdays through September, 12pm ET/9am PT; register once to attend any meeting
Concerned about climate change but stuck in despair or uncertain what to do? The DLF Climate Justice Working Group invites all to join us to discuss Not Too Late: Changing the Climate Story from Despair to Possibility, edited by Thelma Young Lutunatabua and Rebecca Solnit. We will meet on Zoom the last Tuesday of every month between May 30 and September 26 at 12pm ET/9am PT.
This month’s open DLF group meetings:
For the most up-to-date schedule of DLF group meetings and events (plus NDSA meetings, conferences, and more), bookmark the DLF Community Calendar. Can’t find meeting call-in information? Email us at info@diglib.org.
DLF groups are open to ALL, regardless of whether or not you’re affiliated with a DLF member institution. Learn more about our working groups and how to get involved on the DLF website. Interested in starting a new working group or reviving an older one? Need to schedule an upcoming working group call? Check out the DLF Organizer’s Toolkit to learn more about how Team DLF supports our working groups, and send us a message at info@diglib.org to let us know how we can help.
tl;dnr - Through the use of OpenRefine, one can create more useful HathiTrust collection files.
Introduction
I often take advantage of the HathiTrust and its very large collection of public domain documents, but when I search the collection for just about anything, I am often faced with numerous duplicate items. Because the volume of search results is so large, filtering duplicates is often tedious, but I have learned how to take advantage of OpenRefine's clustering functions to quickly and easily remove duplicates. This blog posting describes how.
The Problem
For simplicity's sake, let's use a HathiTrust featured collection as an example, specifically, the Adventure Novels: G.A. Henty. [1] At first blush, the collection includes 47 items, but after downloading the collection file, importing it into any spreadsheet application, and sorting/grouping it by title, one can see there are duplicate items, for example but not limited to:
A Knight of the White Cross (two listings)
Bonnie Prince Charlie (three listings)
In the Reign of Terror (seven listings)
While manually looping through 47 items and removing duplicates is not onerous, the problem becomes acute when the student, researcher, or scholar tries to create a complete and authoritative list of all Henty's titles; an author search for Henty, filtered by language, place of publication, and even specific library, returns many copies of the same things. The tedious process of manually removing duplicates from any sizable collection will significantly impede anybody from doing research on whole collections, and it will cause any computer-based analysis to be woefully inaccurate. This, in turn, will encourage some people to disregard computer-based analysis. From my point of view, such is undesirable, and this is where OpenRefine comes to the rescue.
OpenRefine, the solution
OpenRefine bills itself as "a powerful free, open source tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data", and from my point of view, it most certainly lives up to its description. [2] Some of my research would be a whole lot more difficult if it weren't for OpenRefine.
With the idea of removing duplicates and creating a more useful HathiTrust collection file, the first step is to create an OpenRefine project and choose the given collection file as input. [3] Since OpenRefine eats delimited files (like comma-separated value files and tab-separated value -- TSV -- files) for lunch, OpenRefine will recognize the collection file as a TSV file and present you with additional parsing options. In this case, you can accept the defaults and finish initializing the project by clicking the "Create project" button.
[Screenshot: create a project]
The next step is to apply text faceting against the title column and sort the result by count. You will see that a number of items are listed numerous times, and upon closer inspection, you will see some titles with very similar manifestations (differences in cataloging practice). These are the sorts of things we want to both normalize and deduplicate. Click the "Cluster" button.
[Screenshot: viewing text facets]
After clicking the Cluster button you will be presented with a dialog box displaying a large number of clustering options/algorithms. Apply each and every option to the collection, and the titles will become normalized. When you are finished, the number of items in the collection will not have changed, but the number of unique titles will have decreased; there are now many repeated titles. Click the "Remove all" button to exit the faceting process.
[Screenshot: clustering]
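For the curious, the key-collision ("fingerprint") method at the heart of OpenRefine's clustering can be approximated in a few lines of Python; this is a rough sketch of the idea, not OpenRefine's exact implementation.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(title):
    """Rough approximation of OpenRefine's fingerprint keying: normalize, tokenize, sort, dedupe."""
    s = unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode("ascii")
    s = re.sub(r"[^\w\s]", "", s.lower().strip())    # lowercase and drop punctuation
    return " ".join(sorted(set(s.split())))          # unique tokens in canonical order


def clusters(titles):
    """Group titles whose fingerprints collide; each group is a candidate cluster to merge."""
    groups = defaultdict(list)
    for t in titles:
        groups[fingerprint(t)].append(t)
    return [g for g in groups.values() if len(g) > 1]


# clusters(["In the reign of terror /", "In the Reign of Terror"]) groups both variants.
```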
The next step is permanently sorting the collection by title, "blanking down", and removing the blanked items. This is the actual process for removing the duplicates. Here's how:
Sort the titles alphabetically, and then make the sort permanent by selecting "Reorder rows permanently" from the Sort menu.
Choose "blank down" from the Edit menu, and this retains the first title of many duplicates but makes any subsequent titles empty.
Text facet on the title column and select the last value from the facets; it has the label "(blank)".
Finally, select "Delete matching rows".
[Screenshots: sort titles; make sort permanent; blank down; delete matching rows]
If you have been using the Henty collection, then your collection has been reduced to 29 items, and none of the titles are duplicated. Use the "Export" button to save your good work to a file and use the file for further analysis. For example, upload the new file to the HathiTrust Research Center and do processing against it. [4]
[Screenshot: export final result]
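If you prefer scripting to the OpenRefine interface, roughly the same sort-and-deduplicate result can be sketched with pandas; the file names are hypothetical, and the column names ("title", etc.) are assumed to match the downloaded collection file.

```python
import pandas as pd

# Read the downloaded collection file (tab-separated), key each row with the
# fingerprint() function sketched earlier, and keep one row per normalized title.
df = pd.read_csv("henty-collection.tsv", sep="\t", dtype=str)   # hypothetical file name
df["title_key"] = df["title"].fillna("").map(fingerprint)
deduped = (df.sort_values("title")
             .drop_duplicates(subset="title_key", keep="first")
             .drop(columns="title_key"))
deduped.to_csv("henty-deduped.tsv", sep="\t", index=False)
```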
Extra credit
For extra credit, you might want to apply other cleaning/normalizing processes against the collection file, but removing duplicates is probably the most important. Some of these other processes include making sure the access column includes a value of "1". Otherwise, you may not be able to download the full text of the associated item. You might also want to take a look at the rights_date_used column and make sure there are no dates similar to "9999". You might also want to remove leading articles from the titles so sorted titles... sort correctly.
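Continuing the hypothetical pandas sketch above, those extra-credit steps might look something like this (again, column names are assumed to match the collection file):

```python
# Keep only items with access set to "1", drop placeholder rights dates,
# and sort on titles with leading articles removed.
cleaned = deduped[deduped["access"] == "1"].copy()
cleaned = cleaned[cleaned["rights_date_used"] != "9999"]
cleaned["sort_title"] = cleaned["title"].str.replace(r"(?i)^(the|a|an)\s+", "", regex=True)
cleaned = cleaned.sort_values("sort_title").drop(columns="sort_title")
```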
Summary
The 'Trust is a leading provider of digitized books in the public domain. Using the 'Trust, the student, researcher, or scholar can study entire genres or the complete works of a given author, but removing duplicates is critical to such analysis. This posting outlined one way to do this with the help of OpenRefine's clustering functions. The OpenRefine website outlines how to use clustering in greater detail; see the website as well. [5]
On 21 April 2023, We Are Capable (WAC) Namibia hosted Open Data Day at the Namibia Business Innovation Centre boardroom of the Namibia University of Science and Technology (NUST). The event was held under the theme “Challenges and opportunities in promoting ethical and cultural AI through open data and data sharing”. It aimed to educate university students and citizens about the benefits and value that open data can bring to society. About 25 students attended, most of them NUST students and the others recent graduates.
The event began with a welcome speech by Mr. Benjamin Akinmoyeje, an Informatics Ph.D. student at NUST, and was followed by a presentation by Ms. Ruusa Ipinge, a lecturer at NUST, on the General Data Protection Regulation (GDPR) and open data. She explained GDPR and how it relates to open data. She also touched on the data protection law in Namibia, which is being drafted, and how the government is trying to create awareness to educate people about the policy that is soon to be implemented.
The main topic of the event was “Challenges and opportunities in promoting ethical and cultural AI through open data and data sharing,” presented by Benjamin Akinmoyeje. He emphasised how AI can introduce bias and fail to reflect local cultures, an insightful presentation that sparked much interest among the audience.
Following the presentations, participants were grouped into small groups to discuss various topics related to the theme of the event. They discussed identifying biases in AI algorithms and how to address them, ensuring inclusivity in AI development and deployment, balancing data sharing with privacy concerns, and the role of open data in promoting ethical and culturally responsible AI.
The participants demonstrated an understanding of AI and open data, and they even highlighted some points on why Namibia needs to have a data protection law and an open data catalogue. They emphasised that data should be available, accessible, and shared with everyone, which will make it easier for students to use data for their research purposes.
In conclusion, Benjamin delivered a speech encouraging students to keep showing up at future events, as these events provide an opportunity to increase knowledge and interest in the field of data. Overall, it was an informative and engaging event for the students and members of the public who attended. It is essential that this event continue to be organised every year, as it provides a vital source of knowledge about the field of open data.
On June 9, 2023, at 2:00 p.m. Eastern Time, OCLC will present the second of its 2023 virtual Cataloging Community Meetings, open freely to all. Members of the Cataloging Community and OCLC staff will offer brief presentations on current topics of interest, with a special focus on initiatives to increase diversity, equity, and inclusion in library metadata. Among the topics and speakers will be:
Rapid Harm Reduction Through Locally-Defined Subjects in WorldCat Discovery, Grace McGann, OCLC
There will be ample opportunity for interaction with Cataloging Community speakers and OCLC staff. If you can’t attend live, register now to get the full recording after the event.
“Prepared for Pride Month”
A central target of the current wave of censorship has been materials having LGBTQ+ themes and characters, so as June arrives, “Prepared for Pride Month: A Conversation” could hardly be more timely. Free to members of ALA and the Freedom to Read Foundation (FTRF), the webinar will take place on June 6, 2023, 5:30 p.m. Eastern Time. Best practices and practical strategies will be discussed by panelists Pat Tully, Director of Alaska’s Ketchikan Public Library (OCLC Symbol: Q2Z); Director of the ALA Office for Intellectual Freedom (OIF) and Executive Director of FTRF, Deborah Caldwell-Stone; and Assistant Director of Communications and Outreach for OIF, Betsy Gomez.
How parents can fight book bans
On “Alabama’s largest news site,” AL.com, investigative reporter Anna Claire Vollers writes about the organized assaults on the freedom to read and recommends means of resistance in “Here’s how parents can fight book bans in their kids’ school libraries.” Vollers suggests folks “speak up and show up” with their children, join local and national groups, and get the word out. Making the point that those advocating book bans are loud but in the minority, Vollers notes that when the majority stands up against restrictions, they often prevail.
Supporting libraries
Former public school librarian Deirdre Sugiuchi and ALA president Lessa Kanani’opua Pelayo-Lozada talk about standing up to book bans, supporting library workers and users, and the future of libraries in “American Libraries Are Taking a Stand Against Book Bans.” Pelayo-Lozada cites several ALA initiatives and constituent units that have tried to counter the attacks on libraries, to ensure that libraries are more welcoming, and to expand the diversity of the profession. The interview appears in Electric Lit.
Technical Services Librarian Sarah R. Jones, Digital Special Collections Librarian Emily Lapworth, and Special Collections Technical Services Librarian Tammi Kim, all at the University of Nevada, Las Vegas (OCLC Symbol: UNL) are in the process of “adopting more inclusive practices and embedding them programmatically at their institution.” In “Assessing Diversity in Special Collections and Archives” (College and Research Libraries, Volume 84, Number 3, May 2023, Pages 335-356), they analyze and assess the policies and strategies used at UNLV in terms of how successful they have been.
“The Importance of Black Mentorship”
“Surviving in academia involves finding your supporters, allies, accomplices, agitators, and disruptors. It is identifying those with whom you risk being your true, authentic self to share your worries, anxieties, fears, and stresses with someone there to help in the necessary ways. They respect you.” So writes Twanna Hodge, a Ph.D. student in the College of Information Studies at the University of Maryland, College Park (OCLC Symbol: MDX), in “The Importance of Black Mentorship,” available on WOC+Lib, “a digital platform for women of color (WOC) within librarianship.”
The ALA Library History Round Table presents its annual Research Forum on June 13, 2023, at 3:00 p.m. Eastern Time, “Unpacking Access.” Speakers for the free webinar will include Amanda Rizki of the University of Virginia Library (OCLC Symbol: VA@) on “Carceral Fees: A History of Racism at the Circulation Desk” and Ethan Lindsay of Wichita State University Library (OCLC Symbol: KSW) on “Extending Library Access to Readers Across the Plains: The Early Traveling Libraries Program in Kansas.”
EDI in library workplaces
On June 13 and 14, 2023, between 10 a.m. and 6 p.m. Eastern each day, take part in the free Core: Leadership, Infrastructure, Futures e-Forum “Incorporating Equity, Diversity and Inclusion in Your Workplace: A Conversation.” Core e-Forums are two-day, moderated, electronic discussions that work similarly to email discussion lists. Moderating will be Ray Pun of the Alder Graduate School of Education (OCLC Symbol: CAAGS); Loida Garcia-Febo, International Library Consultant and Past President of the American Library Association; and Robin L. Kear of the University of Pittsburgh (OCLC Symbol: PIT). Questions will center on promoting EDI values in libraries, what has and has not worked in practice, dealing with burnout, and fostering collaboration.
Challenging the challengers in Florida and Arkansas
One big idea in cryptocurrencies is attempting to achieve decentralization through "governance tokens" whose HODLers can control a Decentralized Autonomous Organization (DAO) by voting upon proposed actions. Of course, this makes it blindingly obvious that the "governance tokens" are securities and thus regulated by the SEC. But even apart from that problem, recent events, culminating in "little local difficulties" for Tornado Cash, demonstrate that there are several others.
Below the fold I look at these problems. The DAO was the first major "smart contract". Even after it had been exploited on 17th June, 2016 for $50M of notional value, the front page of its web site announced:
The DAO’s Mission: To blaze a new path in business organization for the betterment of its members, existing simultaneously nowhere and everywhere and operating solely with the steadfast iron will of unstoppable code.
The DAO was intended to operate as "a hub that disperses funds (currently in Ether, the Ethereum value token) to projects". Investors received voting rights by means of a digital share token; they voted on proposals submitted by "contractors", while a group of volunteers called "curators" checked the identity of people submitting proposals and made sure the projects were legal before "whitelisting" them.
As it turned out, although the code of the DAO was immutable, the Ethereum platform on which it ran wasn't. The Ethereum community decided to hard-fork so as to return the ETH in the DAO to its original owners. The exploited chain continued as Ethereum Classic, currently "worth" $18.03 versus the forked chain's ETH $1811.
The upshot was that people realized that deploying immutable software meant that any bugs could not be fixed, and that in the real world, where the bounty for a bug might be in the millions of dollars, this wasn't a risk worth running. So the concept of voting to govern a DAO's actions was extended to include voting to mutate the code, making DAOs IINO (Immutable In Name Only).
What are the problems with the idea of voting to control, and in particular to update, a DAO? Here is my list:
If a "smart contract" needs to be upgraded to patch a bug or vulnerability, or to recover stolen funds, the multisig members need to (a) be told about it, and (b) be given time to vote, during which time anyone who knows about the reason can exploit it, so (c) keep it secret. Benjamin Franklin wrote “Three may keep a secret, if two of them are dead.” This was illustrated by the $162M Compound fiasco:
"There are a few proposals to fix the bug, but Compound’s governance model is such that any changes to the protocol require a multiday voting window, and Gupta said it takes another week for the successful proposal to be executed."
Compound built a system where, if an exploit was ever discovered, the bad guys would have ~10 days to work with before it could be fixed. This issue is all the more important in an era of flash loan attacks when exploits can be instantaneous.
Cryptocurrencies are supposed to be decentralized and trustless.
Their implementations will, like all software, have vulnerabilities.
There will be a delay between discovery of a vulnerability and the deployment of a fix to the majority of the network nodes.
If, during this delay, a bad actor finds out about the vulnerability, it will be exploited.
Thus if the vulnerability is not to be exploited its knowledge must be restricted to trusted developers who are able to ensure upgrades without revealing their true purpose (i.e. the vulnerability). This violates the goals of trustlessness and decentralization.
This problem is particularly severe in the case of upgradeable "smart contracts" with governance tokens. In order to patch a vulnerability, the holders of governance tokens must vote. This process:
Requires public disclosure of the reason for the patch.
Cannot be instantaneous.
If cryptocurrencies are not decentralized and trustless, what is their point? Users have simply switched from trusting visible, regulated, accountable institutions backed by the legal system, to invisible, unregulated, unaccountable parties effectively at war with the legal system. Why is this an improvement?
Vitalik Buterin is scornful of the dominance of coin voting, a voting process for DAOs that he feels is just a new version of plutocracy, one in which wealthy venture capitalists can make self-interested decisions with little resistance. “It’s become a de facto standard, which is a dystopia I’ve been seeing unfolding over the last few years,” he says.
This ignores an even bigger problem, because it isn't just the VCs or the whales that hold huge stashes of these tokens, it is the exchanges holding them on account for their customers. Strangely, five years earlier Buterin had described a problem:
In a proof of stake blockchain, 70% of the coins at stake are held at one exchange.
Note that, as I write, four staking services (Lido, Coinbase, Binance, and Kraken) control Ethereum. Justin Sun famously conspired with exchanges to vote their customers' coins to enable him to take over the Steem blockchain.
This problem is exacerbated by the availability of flash loans, allowing an attacker cheaply and instantaneously to acquire temporary voting power.
An October 2020 example of a flash loan attack on DAO governance was what BProtocol did to MakerDAO:
BProtocol used 50,000 ETH to borrow wrapped ETH from decentralized exchange dYdX. It put the wrapped ETH on Aave protocol to borrow $7 million in MKR governance tokens, which allow holders to vote on proposals affecting Maker’s operations. It locked those tokens to vote for its proposal, then unlocked them to return the funds to AAVE and dYdX.
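To make that sequence concrete, here is a toy Python sketch of the pattern; the LendingPool and Governance classes and all of the numbers are invented for illustration and are not the real MakerDAO, Aave, or dYdX contracts:

# Toy model of a flash-loan governance attack. Everything here is
# illustrative: the classes and numbers are invented, not real contracts.

class LendingPool:
    def __init__(self, governance_tokens: float):
        self.available = governance_tokens

    def flash_borrow(self, amount: float) -> float:
        assert amount <= self.available, "pool too small"
        self.available -= amount
        return amount

    def repay(self, amount: float) -> None:
        self.available += amount


class Governance:
    def __init__(self, quorum: float):
        self.quorum = quorum

    def vote(self, proposal: str, voting_power: float) -> bool:
        # Voting that only checks the caller's balance *right now*
        # is what makes the attack possible.
        return voting_power >= self.quorum


pool = LendingPool(governance_tokens=100_000)
gov = Governance(quorum=60_000)

# All of this happens inside a single atomic transaction:
borrowed = pool.flash_borrow(70_000)                # 1. borrow voting power
passed = gov.vote("malicious proposal", borrowed)   # 2. pass the proposal
pool.repay(borrowed)                                # 3. repay in the same block

print("proposal passed:", passed)  # True, at essentially zero capital cost

The point of the toy is the atomicity: because all three steps happen in one transaction, the attacker needs essentially no capital of their own, only the fee for the flash loan.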
A far more damaging flash loan governance attack hit Beanstalk in April 2022. The decentralized, credit-based finance system Beanstalk disclosed on Sunday that it suffered a security breach that resulted in financial losses of $182 million, the attacker stealing $80 million in crypto assets.
As a result of this attack, trust in Beanstalk's market has been compromised, and the value of its decentralized credit-based BEAN stablecoin has collapsed from a little over $1 on Sunday to $0.11 right now.
The decentralized finance (DeFi) platform detailed on its Discord channel that the attacker took a flash loan on Aeve, a liquidity protocol, and used their voting power from holding a large amount of the Stalk native governance token to pass a malicious proposal.
In the wake of the attack, chat logs and video evidence show that the founders were warned about the risk of exactly this kind of attack, but they dismissed community members’ concerns.
...
Though the attack shocked Beanstalk users — some of whom claimed to have lost six-figure sums of money — the threat of a governance attack was raised in Beanstalk’s Discord server months previously and in at least one public AMA session held by Publius, the development team behind the project.
Can the voters understand the proposal?
Whoever the voters may be, just as with a ballot proposition in California, they will be presented with a written description of the proposal. But this isn't what they are voting on. In California, it is a set of changes to the law. In the case of governance tokens, it is a set of changes to the code. It is notoriously difficult to read law or code and determine exactly how it will function in every case.
A proposal ostensibly to penalize cheating network participants in the Tornado Cash crypto tumbler project successfully passed by DAO vote. However, the proposer had added an extra function, which they subsequently used to obtain 1.2 million votes. Now that they have more than the ~700,000 legitimate Tornado Cash votes, they have full control of the project.
The attacker has already drained locked votes and sold some of the $TORN tokens, which are governance tokens that both entitle the holder to a vote but also were being traded for $5–$7 around the time of the attack. The attacker has since tumbled 360 ETH (~$655,300) through Tornado Cash to obscure its final destination. Meanwhile, $TORN plummeted in value more than 30% as the attacker dumped the tokens.
The attacker now has full control over the DAO, which according to crypto security researcher Sam Sun grants them the ability to withdraw all of the locked votes (as they did), drain all of the tokens in the governance contract, and "brick" (make permanently non-functional) the router.
The full details of how this was done are in this thread by samczsun:
Be careful what you vote for! While we all know that proposal descriptions can lie, proposal logic can lie too! If you're depending on the verified source code to stay the same, make sure the contract doesn't have the ability to selfdestruct
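One defensive check that follows from samczsun's advice is to watch whether the runtime bytecode behind a "verified" address ever changes, which is exactly what a selfdestruct-and-redeploy trick produces. Below is a minimal sketch using the web3.py library; it assumes a recent web3.py, and the RPC URL, contract address, and expected hash are all placeholders:

# Sketch: detect whether a contract's on-chain bytecode has changed since
# its source was "verified". Endpoint, address, and hash are placeholders.
from web3 import Web3

RPC_URL = "https://example-ethereum-rpc.invalid"                # placeholder RPC endpoint
CONTRACT = "0x0000000000000000000000000000000000000000"         # placeholder address
EXPECTED_CODE_HASH = "0xabc..."                                  # hash recorded at verification time

w3 = Web3(Web3.HTTPProvider(RPC_URL))
code = w3.eth.get_code(CONTRACT)          # current runtime bytecode
current_hash = Web3.keccak(code).hex()

if code == b"":
    print("No code at this address: the contract has been selfdestructed")
elif current_hash != EXPECTED_CODE_HASH:
    print("Bytecode changed since verification; the verified source no longer applies")
else:
    print("Bytecode still matches the verified deployment")

If the hash ever changes, the verified source on a block explorer no longer describes the code that will actually run.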
In a strange twist, the attacker then proposed to undo the damage. The Tornado Cash token (TORN) is up 10% after a proposal submitted by a wallet address linked to a recent attack on the decentralized autonomous organization’s (DAO) governance state looks to reverse the malicious changes.
“The attacker posted a new proposal to restore the state of governance," user Tornadosaurus-Hex wrote in the Tornado Cash community forum, adding that there is a "good chance" that the attacker would execute it.
Tornadosaurus-Hex said that the attacker is reverting the TORN tokens they gave themself – which gave them a controlling share of the governance votes – back to zero.
Maybe this time the proposal's code will actually do what he says it does.
Will the vote be effective?
If the vote is to mutate some "smart contract" controlled by a multisig of all the tokens, it can become effective once a quorum of votes has been cast. But in many cases although the vote is presented as binding, it is in effect advisory. Here are Molly White's reports on a couple of recent examples:
In June and October 2022, the Aragon DAO — that is, all holders of the $ANT token or (later) their delegates — voted on several proposals supporting a move to place the Aragon treasury under DAO control. The treasury is a pool of crypto assets currently priced at around $174 million. However, the tokens continued to remain under control of the Aragon Association.
On May 9, 2023, the Aragon Association announced that they would not be following through with the treasury change, and instead would be "repurposing the Aragon DAO into a grants program". They attributed the decision to "coordinated social engineering and 51% attack" on the DAO that began shortly after a small portion of the treasury assets were transferred.
Arbitrum submitted a proposal for DAO members to vote on various governance processes, as well as the distribution of 750 million ARB tokens to an "Administrative Budget Wallet" — tokens that were priced at around $1 billion.
The vote, which still has a day left before completion, is currently standing at 75% against and 25% in support. However, it was discovered that Arbitrum had already begun spending those 750 million tokens, including via the movement of a substantial amount of tokens, and "conversion of some funds into stablecoins for operational purposes".
As you can see, both of these are cases where the system is DINO (Decentralized In Name Only).
We love stories! No matter what we are working on – whether it’s training in person, online learning, or designing a website – we like to incorporate interactive and narrative elements to make it more enjoyable and engaging for our users. Interactive storytelling also has significant pedagogical value. For example, we often use branching scenarios [...]
What does “open” mean today? What should it mean? What has changed since 2015, when the Open Definition was last updated?
We at Open Knowledge are preparing another round of consultations on updating the “Open Definition”. We will have a face-to-face session in Spanish during RightsCon to ensure the voices of the Latin American communities gathered in Costa Rica are heard and incorporated into the review process. It will be a practical session of collective writing.
The overall objective is to problematise each sentence of the current definition, version 2.1, and reach a consensus for the new generations. What does “open” mean in the context of data extractivism, digital colonialism, climate change, and economic, racial, and gender inequalities?
The Open Definition was a collaborative process led by the Open Knowledge Foundation over a decade ago that created a consensus among experts in defining “open” in relation to data and content. As our CEO Renata Ávila articulates in more detail, the macro intention of this project is to create a bridge between old definitions and new discourses to keep the open ecosystem alive and current.
One of the hidden gems of the Library of Congress is the Congressional Research Service (CRS).
With a staff of about 600 researchers, analysts, and writers, the CRS provides “policy and legal analysis to committees and Members of both the House and Senate, regardless of party affiliation.”
It is kind of like a “think tank” for the members of Congress.
And an extensive selection of their reports is available from the CRS homepage and—as government publications—these reports are not subject to copyright; any CRS Report may be reproduced and distributed without permission.
And they publish a lot of reports.
(Read more on their CRS frequently-asked-questions page.)
I remember learning about the CRS in library school, but what got me interested in them again was a post on Mastodon about an Introduction to Cryptocurrency report that they produced.
At just 2 pages long, it was a concise yet thorough review of the topic, ranging from how they work to questions of regulation.
Useful stuff!
And that wasn’t the only useful report I (re-)discovered on the site.
An Automated Syndication Feed
The problem is that no automated RSS/Atom feed of CRS reports exists.
Use your favorite search engine to look for “Congressional Research Service RSS or Atom”; you’ll find a few attempts to gather selected reports or comprehensive archives that stopped functioning years ago.
And that is a real shame because these reports are good, taxpayer-funded work that should be more widely known.
So I created a syndication feed in Atom:
You can subscribe to that in your feed reader to get updates.
I’m also working on a Mastodon bot account that you can follow, as well as automated saving of report PDFs in the Internet Archive Wayback Machine.
Some Important Caveats
The CRS website is very resistant to scraping, so I’m having to run this on my home machine (read more below).
I’m also querying it for new reports just twice a day (8am and 8pm Eastern U.S. time) to avoid being conspicuous and tripping the bot detectors.
The feed is a static XML document updated at those times; no matter how many people subscribe, the CRS won’t see increased traffic on their search site.
So while I hope to keep it updated, you’ll understand if it misses a batch run here or there.
Also, hopefully, looking at the website’s list of reports only twice a day won’t raise flags with them and get my home IP address banned from the service.
If the feed stops being updated over an extended time, that is probably why.
There is no tracking embedded in the Atom syndication feed or the links to the CRS reports.
I have no way of knowing the number of people subscribing to the feed, nor do I see which reports you click on to read.
(I suppose I could set up stats on the AWS CloudFront distribution hosting the feed XML file, but really…what’s the point?)
How It’s Built
If you are not interested in the technology behind how the feed was built, you can stop reading now.
If you want to hear more about techniques for overcoming hostile (or poorly implemented) websites, read on.
You can also see the source code on GitHub.
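If you only want the flavor of the feed-generation step without reading the whole repository, here is a minimal sketch; it assumes the feedgen library and an already-scraped list of report dictionaries, and the feed URL and entry fields are illustrative rather than the ones the real code uses:

# Sketch of the feed-generation step, assuming the `feedgen` library and a
# list of report dicts already pulled from the CRS search JSON.
from datetime import datetime, timezone
from feedgen.feed import FeedGenerator

reports = [  # illustrative data; real entries come from the CRS search results
    {"id": "R99999", "title": "Example Report", "url": "https://crsreports.congress.gov/product/pdf/R/R99999"},
]

fg = FeedGenerator()
fg.id("https://example.org/crs-reports-feed")        # placeholder feed id
fg.title("Congressional Research Service Reports")
fg.author({"name": "CRS reports feed (unofficial)"})
fg.link(href="https://example.org/crs.xml", rel="self")
fg.updated(datetime.now(timezone.utc))

for report in reports:
    fe = fg.add_entry()
    fe.id(report["url"])
    fe.title(report["title"])
    fe.link(href=report["url"])
    fe.updated(datetime.now(timezone.utc))

fg.atom_file("crs.xml")  # write the static Atom XML that gets served to readers

The obstacles below are all about getting that list of reports in the first place.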
Obstacle #1: Browser Detection
The CRS website is a dynamic JavaScript application that goes back and forth with the server to build the contents of web pages.
The server sends nicely formatted JSON documents to the JavaScript running in the browser, based on your search parameters.
That should make this easy, right?
Just bypass the JavaScript front end and parse the JSON output directly.
In fact, you can do this yourself.
Go to https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true& in your browser and see the 15 most recent reports.
Try to reach that URL with a program, though, and you’ll get back an HTTP 403 error.
(In my case, I was using the Python Requests library.)
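For the record, the failing direct approach was essentially this (the search URL is the one shown above; the headers are illustrative):

# The straightforward approach that the CRS site rejects with HTTP 403.
import requests

SEARCH_URL = (
    "https://crsreports.congress.gov/search/results"
    "?term=&r=2203112&orderBy=Date&isFullText=true&"
)

response = requests.get(
    SEARCH_URL,
    headers={"Accept": "application/json"},
    timeout=30,
)
print(response.status_code)  # 403 instead of the JSON the browser receives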
And I tried everything I could think of.
I even tried getting the curl command line with the headers that the browser was using from the Firefox web developer tools:
curl -v 'https://crsreports.congress.gov/search/results?term=&r=2203112&orderBy=Date&isFullText=true&' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/113.0' \
  -H 'Accept: application/json, text/plain, */*' \
  -H 'Accept-Language: en-US,en;q=0.5' \
  -H 'Accept-Encoding: gzip, deflate, br' \
  -H 'Connection: keep-alive' \
  -H 'Referer: https://crsreports.congress.gov/search/' \
  -H 'Sec-Fetch-Dest: empty' \
  -H 'Sec-Fetch-Mode: cors' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Pragma: no-cache' \
  -H 'Cache-Control: no-cache' \
  -H 'TE: trailers'
…and still got denied.
So I gave up and used Selenium to run a headless browser to get the JSON content.
And that worked.
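A minimal version of the working approach looks something like this; reading the JSON out of the page's <pre> element is an assumption about how Chrome renders a raw JSON response, and the result field name is a guess rather than the site's documented schema:

# Fetch the CRS search JSON with a headless Chrome browser via Selenium.
import json

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

SEARCH_URL = (
    "https://crsreports.congress.gov/search/results"
    "?term=&r=2203112&orderBy=Date&isFullText=true&"
)

options = Options()
options.add_argument("--headless=new")   # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get(SEARCH_URL)
    # Chrome wraps a raw JSON response in a <pre> element when rendering it.
    raw = driver.find_element(By.TAG_NAME, "pre").text
    results = json.loads(raw)
finally:
    driver.quit()

print(len(results.get("SearchResults", [])))  # field name is an assumption

On AWS Lambda, the same code needs Chrome packaged as a Lambda Layer, which is where the next obstacle appears.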
Obstacle #2: Cloudflare bot detection
So with the headless browser, I got this working on my local machine.
That isn’t really convenient, though…even though my computer is on during most working hours, something like this should run on a server in the cloud.
Something like AWS Lambda is ideal.
So I took a detour to learn about Headless Chrome AWS Lambda Layer (for Python).
This is a technique to run Chrome on a server, just like I was doing on my local machine.
So I got the code working on AWS Lambda.
It was a nice bit of work…I was pleased to learn about a new AWS skill (Layers for Lambda).
But I hit another wall…this time at Cloudflare, a content distribution network that sits in front of the CRS website with protections to stop bots like mine from doing what I was trying to do.
Instead of the JSON response, I got Cloudflare’s HTML page asking me to solve a captcha to prove my bot’s humanness.
And look…I love y’all, but I won’t be answering captcha challenges twice a day to get the report syndication feed published.
So after all of that, I decided to just run the code locally.
If you know of something I missed that could bypass obstacles 1 and 2 (and won’t get the FBI knocking at my door), please let me know.
Starting in January 2023, we are meeting with more than 100 people to discuss the future of open knowledge, shaped by a diverse set of visions from artists, activists, scholars, archivists, thinkers, policymakers, data scientists, educators, and community leaders from around the world.
The Open Knowledge Foundation team wants to identify and discuss issues sensitive to our movement and use this effort to constantly shape our actions and business strategies to best deliver what the community expects of us and our network, a pioneering organisation that has been defining the standards of the open movement for two decades.
Another goal is to include the perspectives of people of diverse backgrounds, especially those from marginalized communities, dissident identities, and whose geographic location is outside of the world’s major financial powers.
How can openness accelerate and strengthen the struggles against the complex challenges of our time? This is the key question behind conversations like the one you can read below.
*
This week we had the chance to talk with one of the most active voices in the field of open science, whose influence has reached generations of researchers in many geographies. Dr. Peter Murray-Rust is a chemist, a professor in molecular informatics at the University of Cambridge, and a historic open access and open data activist in academia.
In his career, he’s been particularly interested in promoting open knowledge through research groups and communities, such as the Blue Obelisk community for Open Source chemical software, and through the Semantic Web. He was among the proposers of the World Wide Molecular Matrix back in 2002 and, along with Henry Rzepa, created the Chemical Markup Language.
Peter is also a member of the Open Definition Advisory Council – which is one of the reasons why we are publishing this conversation today. The OKFN team is preparing the second round of consultations on the Open Definition review, with the aim of updating it to reflect current challenges and finding a broader and more diverse consensus around it. The session will take place in person at RightsCon Costa Rica on June 7. This conversation helps to contextualise the work done previously and seeks to put our brains together to think about some of the topics that have already emerged in the previous session at MozFest and in the discussion forum.
With each conversation, more people join the project. On this occasion, we had the pleasure of having the participation of Adrià Mercader, OKFN’s Technical Lead based in Tarragona, Catalonia; Nikesh Balami, OKFN’s International Open Data Lead based in Kathmandu, Nepal; and Lucas Pretti, OKFN’s Communications Lead based between São Paulo, Brazil and Madrid, Spain.
We hope that Peter’s insights will serve as an inspirational source for the important discussions ahead.
*
Lucas Pretti: In an email message you sent me while we were arranging this conversation, you said the following, “We are literally fighting for the soul of Open and we are not winning”. Could you elaborate on this feeling of loss in more detail? Why are we losing?
Peter Murray-Rust: The idea of public open started with software, with Richard Stallman’s freedoms and similar things in the 80s. I would say that until the cloud and the mega-corporations came along, open software was a success. A little over a decade ago, there was big momentum, but around 2013-14 it died and the corporates came to realise that there were vast amounts of wealth to be made by enclosing open. We’ve seen it in all areas. There were great inroads into the public domain and massive investments in lawyers to sue people who don’t support corporates. My own area is scholarly publications, which is one of the worst: a complete travesty of everything.
The whole thing at best is incredibly messy today. You cannot rely on something a hundred years old being in the public domain, as it could well be owned by a corporation. In any area where you land, you don’t know how you’re going to go out, as there’s no roadmap. For example, who owns the map of Britain? Who owns the postcodes of Britain? Who owns timetables? Who owns this sort of thing? There’s actually no expectation that they will be open. It’s all scattered all over.
At least I find that most governments are largely on the side of open. I saw recently that the US government is now funding open knowledge networks. Nothing to do with OKFN, but it’s the technology to build knowledge graphs from all public knowledge. This is really good if they’re able to make it happen.
But in general terms, we’re no nearer knowing whether we’re going to win these battles. A lot of these things are worthy endeavours, but they’re not backed by the force of law. Until we have the force of law, and corporations are fined for this sort of thing, we’re never going to get compliance.
Adrià Mercader: I agree, but want to offer my perspective as a counter-argument. I joined Open Knowledge in 2011, at the beginning, when the curve was at its worst. If we judge the current situation by the standards of the hopes we had back then, it’s difficult not to be a bit disappointed.
But maybe we need to rethink and reframe what the long-term goal is. I don’t think that the landscape of open is worse now than it was in 2010 by any measure, especially in certain areas or topics like government. There’s innovation and individuals and groups within different governments in the world pushing for more transparency and better data publication. Now people know what open data is, and I think that there’s more data literacy in general across different sectors. People are more aware of why data is important, what are data formats, etc. There’s been a progression.
So I wanted to ask you specifically in the context of academia. Do people starting out in academia today, coming from higher education, have an increased awareness of topics like open access, licensing, or even data literacy skills? Do you think that this has improved over the years?
Peter Murray-Rust: I don’t teach or research in the UK anymore. My current experience is with Indian undergraduates. Unfortunately, there’s not a high entry point with these concepts there, but in general terms, I would say that the generation who are graduating are much more aware of these ideas, much more likely to publish open-source software and understand the value of open data. Some disciplines have managed it well, like bioscience, a lot of geosciences, and astronomy. Engineering and chemistry are still very fossilised in that regard though.
Universities are conservative organisations. Just because they do cutting-edge research doesn’t mean that they are innovative as organisations. Universities are often symbiotic with major corporations. If you take journals and books, these are now run by huge corporates, same for teaching. A lot of learning societies are also incredibly conservative and very possessive.
With the advances in knowledge management, I expect that this is going to come to more and more commoditised education, and undergraduates will be inducted into this process. I’m a bit of a pessimist here. As soon as corporates are included in the system, any values are lost because corporates now do not have human values. Their values are determined by shareholder value, which is computable by machines. It really doesn’t matter whether those machines are AI or collections of human bureaucrats. Everything in academia is getting more and more mechanised.
Adrià Mercader: You touched on probably everyone’s favourite subject right now. Artificial intelligence has a lot of potentialities but also presents many dangers. I’m keen to hear your thoughts on AI. Could you reflect on the impacts on academia but also think about challenges like climate change?
Peter Murray-Rust: GPT and its peers will soon be indistinguishable from a very competent human in many different areas. I haven’t used it a lot, but I am sure that it’s really an automated collection of semi-public human knowledge based on language models and on sticking together sentences. I don’t know whether it’s going to become sentient, intelligent or whatever, but I am clear that it’s going to be controlled by corporates. And those corporates will surveil and control their users. Every time somebody uses it, the AI will know more about that person and will develop its intelligence based on that interaction. I’m sure Google has been doing this for years, and that’s the main use of Bing for Microsoft.
That reminds me of Microsoft Academic Graph, which was a rival to Google Scholar that aimed to index the world’s scholarly literature. About a year ago they open-sourced it, and Microsoft Academic Graph is now managed by OpenAlex. This is one of the most innovative and challenging open communities I know, run by Heather Piwowar and Jason Priem out of Vancouver. They spent years fighting the corporates to create an index of scholarly publishing using a semantic map of academic publications. It’s all open source (I use “open source” here simply to mean conformant with the Open Definition) and anyone can use it. For all I know, Jason is putting it into AI now…
What I don’t know about AI is how much the startup costs are for people who don’t have billion-dollar revenues. How much do you need, banks of compute under the Arctic ice? Can you get into it at a medium level? I know Mozilla is thinking about this with their Mozilla.ai project, but there’s still no one standing up to be the public AI of 2023. I have suggested that this is something that CERN might do, as they have played a big role in managing the sort of peripherals of knowledge. The Open Hardware licence was developed by CERN, for example. And now more recently I think they’re looking at the possibility of open in AI.
If this continues to be run by corporations, for which the only goal is growing their market share, I expect to see many lawsuits involving AI and people who are misusing it. Copyright is going to be a major problem in AI.
Lucas Pretti: Here is a great opportunity to discuss the Open Definition. As you know, in 2023 we are reigniting the discussions in order to review the definition and find a broader and more diverse consensus around it. One of the emerging tendencies is what people are calling “responsible licences”, with RAIL being the main one for AI today – you probably remember that it was mentioned by someone in our MozFest session in March.
The current Open Definition includes the expression “for any purpose” as one of the variables that define open content, with which we all theoretically agree. But the enclosures are coming so strong that it might be useful to review this generalisation and add some sort of protection depending on the usage. What do you think? How can the Open Definition be useful to address that contradiction?
Peter Murray-Rust: Yes, it’s a very, very important topic. First of all, the Open Definition doesn’t stand alone. There are many laws that have more power. So, for example, if you use the Open Definition to publish personal information, you can’t say you’re allowed to do that as you answer to the country whose privacy laws you’ve broken.
Coming down to “for any purpose”, there’s the need to realise that what the definition is doing is helping organisations create legally actionable documents. The philosophy of the definition is to help decide whether a proposed licence is conformant with open or not. But the final licence is always negotiated in some jurisdictions.
Now, one of the things that I think you have to do is to get, not just me, but some of the original members of the Open Definition Advisory Council into the review to make sure the licence is really legally actionable. I know cases in the spirit of “do no evil” that are actually positively harmful because they’re not actionable and lead to messy court cases. So you have to talk to people who understand the legal aspects much better than I do, or you can run into the same problem. That would be one thing.
The other thing is the environment of the Open Definition licence right now. In the 2000s it worked mainly because there was mostly static information circulating. In other words, you had a map, a corpus of documents, or a piece of music, and you could apply the Open Definition to that particular object. Now we’re moving into the realm where the coherence of an object is fragmented. Today many websites are made of an assembly of components from other databases, and we need to discuss what is protectable or relevant to state as an open element. Another real problem, which was slightly different, is that you can end up with open material in a closed environment.
For example, I put my slides up on SlideShare and then SlideShare came along and implemented a new policy saying that everybody has to sign up to access SlideShare. The corporates will always enclose the mechanism of accessing open objects, and that problem hasn’t been solved. The only way it could be solved is to have a trusted organisation actually possessing all of the open content available, like the Internet Archive or something of that sort. By the way, the Internet Archive is being sued by whoever. Goodness!
Lucas Pretti: Yes, there’s a huge campaign today, #EmpoweringLibraries, defending the Internet Archive and the right of libraries and librarians to own and lend digital documents. That’s what you’re referring to, right?
Peter Murray-Rust: Yes. I think that academic libraries particularly have completely dropped the ball over the last 20 years… They should have been protecting the material, but what they have been doing is paying for publisher subscriptions and paywalls. They’re building their own piddling repositories which aren’t properly used and aren’t federated. They should have stood for a national repository of academia for every country.
Well, some countries have done that, like Brazil and the Netherlands. But in the UK it is a total mess. Today, librarians are ultra-scared of transgressing any rule. If a publisher makes a rule, they adhere to it and don’t challenge it. Same for copyfraud – it’s a disaster! Anybody can copyright anything. The only recourse you have is to hire a lawyer, sue them, and you might get back the monetary value of your contribution to the document. But it would be symbolic. Nobody ever has punitive damages in this sort of area.
Nikesh Balami: It’s great to hear all these things directly from you. It’s all very in line with the narratives and discussions we have in forums and events in the open movement. We are always discussing the gaps, safety, priorities, and people defining openness in their own ways. And the losses.
But how can we turn this game around? How can we engage more younger generations and make sure the narrative fits into the new technologies that have been shifting?
Peter Murray-Rust: One of the things I haven’t mentioned and feel very strongly about is that many of the current systems of scholarly publishing are utterly inequitable and neocolonialist, and they’re getting worse. The only way to challenge that would be to get enough people in the North to take that on board as a crusade – and there aren’t enough at the moment. The US is trying to do that with its latest policies and releases. The UK government is a disaster, so it’s not going to do anything. The EU is heavily lobbied by corporates, so it’s going to come up with something in the middle.
I actually think that the big potential growing point is Latin America as a focus for the Global South. There’s a tradition in Latin America that scholarly publication is a public good. That has been strong with SciELO and related things, and people like Arianna Becerril-Garcia are taking that to the next level. Arianna is a computer scientist and professor of computer science at UNAM, in Mexico, and she has been spearheading open in Latin America through projects like Redalyc, an archive platform, and AmeliCA, a publishing platform and framework for academic publications. They are now expanding it to Namibia, looking to do the same sort of thing with Namibia as an African hub.
We need to link up these kinds of people in Latin America, Southern Sub-Saharan Africa, and the Indo-Pacific states, India, Indonesia, etc. We’ve got to have a unified critical mass, which is seen to be doing things better. That’s the only way we’re going to win. If we do that, we might have enough to build a critical mass that would challenge the North. What I’m saying is that we’re not going to win on principles or on price, because the North has got billions of dollars to give to publishers. The article processing charges are totally iniquitous.
By the way, do you know what it costs in dollars to publish an open-access paper in the world’s most recognised scientific journal (I can’t say the name here, but it’s easy to deduce)? Have a guess. I want to hear your guess.
Adrià Mercader: A thousand dollars.
Lucas Pretti: I would say $500, something like this.
Nikesh Balami: I was thinking around $300, something below 500.
Peter Murray-Rust: The answer is $12,000. It’s the price for glory. It’s not advertising. It’s saying to the author, “If you publish in my magazine and you’ve got $12,000, then your career will advance much faster than your rivals’”.
Lucas Pretti: And you can get a Nobel maybe in 15 years…
Adrià Mercader: It’s actually more perverse than that. The whole academic system is built on what journals you publish in. My partner is a scientist, and the funding she receives will depend on the impact factor of whatever journal she’s publishing. It’s like a racket, basically.
Peter Murray-Rust: Exactly. It’s corrupt. It’s totally corrupt. Impact factors are generally made by algorithms, but in particularly important journals impact factors are negotiable. If the Open Knowledge Foundation is about fighting injustice, that typifies one of the major global injustices in the world: access to knowledge. I would say that part of OKFN’s role should be to discover and formalise injustices, particularly with respect to the Global South, and put them in front of people who can make a difference.
✔️ Leaders need to take a stand and be clear on their #values. When meeting resistance, knowing your core values will help your response. Do you have the backs of your employees and clients?
✔️ When making decisions around DEIB, determine who caused the harm and who’s being centered? Do those answers align?
✔️ Develop regular #mentalhealth practices. How are you continuously filling your cup, not just during #MentalHealthMonth but all year?
✔️ When coaching WOC, nothing is broken so there’s nothing to fix. Just shine a light on them, be with and walk with them.
✔️ When we make decisions that are on brand with our values, it makes us feel happy and fulfilled and shapes what we do in both our work and our personal lives.
✔️ We must interrogate the avoidance of discomfort during #DEI conversations.
On March 4, 2023, USTP YouthMappers organised Open Data Day 2023 with the theme “Empowering AI and Mapping with Open Data” at the University of Science and Technology of Southern Philippines, Cagayan de Oro Campus. With 40 participants, the training workshop was a tremendous success, and it was encouraging to see that more than half of the attendees were female.
The main highlight of the event was a training workshop on RapID, an AI-powered OpenStreetMap Editor developed by Meta. Mikko Tamura, the regional Community Manager of the Open Mapping Hub Asia Pacific, led the training. Mikko led the training virtually while the organizing team facilitated the training on the ground. The goal of the training workshop was to give participants an understanding of the possibilities of open data in the context of AI. The Rapid Editor tool’s ability to enable users to import and analyse data from various sources makes it especially helpful for working with open data.
The participants were enthusiastic and engaged throughout the training workshop, as they were able to learn about the benefits of open data in the field of AI and mapping. The event successfully achieved its goal of promoting the use of open data in the community, as participants were able to gain insights into the importance of open data and how it can contribute to the advancement of AI and mapping. The fact that there were more women than men among the participants is evidence of the growing interest among women in open data and AI. Because women are underrepresented in the tech sector, events like Open Data Day give them an opportunity to learn about cutting-edge innovations.
Overall, the event was a huge success, with participants leaving more knowledgeable about the use of open data. The training workshop was relevant to their studies and research, and participants can apply what they learned there. The organisers hope that this event will inspire others to conduct similar initiatives that promote the use of open data in their respective fields.
The Latin American Center for Internet Research (CLISI) held a discussion on Artificial Intelligence, Virtual Reality, and Open Data in the city of Valencia, in the spaces of the Librería La Alegría, to celebrate Open Data Day 2023. Attendees interested in the subject, as well as a group of journalists from the Carabobo region, joined the event. During the discussion, the researchers presented a research project on the subject to be carried out this year.
The discussion began with a presentation by its director, lawyer, and consultant, Mr. José Mendoza of MMD & Associates, who shared his opinion that Venezuelan universities, and Latin America in general, are lagging behind with respect to artificial intelligence. He believes education should be modernised and says that “currently there are no topics on virtual reality in the classrooms“.
The next speaker was Dr. Crisálida Villegas, Director of the Transcomplexity Researchers Network, where she defined Transcomplexity as the possibility of seeing reality from multiple perspectives. At the same time, she pointed out that “Virtuality is here to stay, a model of Transcomplexity is a model of complementarity“. For Dr. Villegas, artificial intelligence will never win over man. In her speech, she assures that “we must speak of hybrid models, virtual spaces on a par with face-to-face spaces“.
At the end of the discussion, Dr. Waleska Perdomo from ONG Sinapsis started her presentation with the phrase “Here there are dragons“, used on medieval maps to mark unexplored places.
In short, for this and more, the specialist is convinced that reality is energy, “it is things, it is what I feel is the material world, therefore reality is multiple, diverse, convulsive, complicated, chaotic. This human singularity composes the artificial reality that is nothing more than the extension of the mind”.
Centro Latinoamericano de Investigaciones Sobre Internet received a small grant to host Open Data Day(s) (ODD) events and activities.
Image: Wonderland Sculpture, Calgary. Bernard Spragg. NZ from Christchurch, New Zealand, Public Domain Mark, via Wikimedia Commons.
The OCLC RLP Metadata Managers Focus Group met in March 2023 to explore new developments in the shift from “authority control” towards “identity management.” Our discussion was facilitated by Charlene Chou of New York University, Joy Panigabutra-Roberts of the University of Tennessee, and John Riemer of UCLA.
This shift is a familiar topic for Metadata Managers and was a prominent feature of Karen Smith-Yoshimura’s Transitioning to the Next Generation of Metadata report and in multiple posts here on Hanging Together (read more about identity management on Hanging Together).
Ahead of the meeting, we asked members to reflect on three questions:
What lessons they have learned from using other identity management platforms in their workflows
How they are incorporating non-LC/NAF sources into their local cataloging workflows
What the opportunities and threats are for expanding PCC cataloging beyond the LC/NAF
We received responses from 11 members of the group and had further discussions during two virtual sessions.
Summary
One of my favorite quotes from science fiction author William Gibson — “The future is here; it’s just not evenly distributed yet”—aptly describes much of our conversation about identity management outside of LC/NAF. Organizations with staff resources and platforms ready to integrate with non-LC/NAF identifiers are doing so when those entities are fit for purpose. This is especially true when looking beyond the catalog and into research information management and/or digital asset management of unique collections. Librarians who have engaged with ORCID, SNAC (Social Networks and Archival Context), Wikidata, and other identity management platforms find that the benefits can outweigh their anxieties about working outside of NACO files. These alternatives do not diminish the time and intellectual work that goes into managing identities. However, these alternatives’ clear governance policies and technical affordances can make the work more efficient.
For the many libraries not in the vanguard, broader adoption of identity management beyond the LC/NAF requires workflow safeguards and technological solutions. Our discussions indicate that librarians succeed most when URIs or other persistent identifiers allow them to make connections across multiple platforms. Changes in the MARC 21 formats provide a mechanism for recording this data, but these will only be valuable if corresponding systems and workflows can use incorporated identifiers. Our conversations also demonstrated the value of PCC pilot projects in helping us focus on solutions that work.
Key takeaways
We’re already doing this in various environments. Respondents noted that in some environments, this is not a new development. With the growing adoption of persistent identifier systems (PIDs), such as ORCID and ROR, libraries have been incorporating other identity management sources when describing resources in institutional repositories and ETD workflows. These PIDs continue to fulfill some of linked data’s promise by being the “glue” between library catalogs, repositories, and research information management (RIM)/current research information systems (CRIS).
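As a small illustration of PIDs acting as glue, the sketch below resolves a ROR identifier to its organization record over the public ROR REST API; the specific ROR ID is only an example, and the response fields shown reflect the current public schema rather than any library system:

# Sketch: resolve a persistent identifier (here a ROR ID) to machine-readable
# JSON over the public ROR REST API. The ROR ID below is only an example.
import requests

def resolve_ror(ror_id: str) -> dict:
    """Fetch the organization record for a ROR identifier."""
    response = requests.get(f"https://api.ror.org/organizations/{ror_id}", timeout=30)
    response.raise_for_status()
    return response.json()

record = resolve_ror("05dxps055")  # example ROR ID; substitute one you care about
print(record.get("name"), record.get("links"))

An ORCID iD or a Wikidata QID can be resolved the same way, which is what lets a repository, a catalog, and a CRIS agree that they are describing the same entity.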
Participants discussed the benefits of alternative identity management sources for special and distinctive collections where the managed identities fall outside the current requirements for establishing NACO headings. While some organizations are participating in Social Networks and Archival Context (SNAC), many respondents noted their increased use of Wikidata, especially in digital repositories.
Workflows are becoming clearer. Thanks in large part to pilot projects, such as the PCC Wikidata Pilot, the PCC ISNI Pilot, and the PCC URIs in MARC Pilot, libraries are finding ways to incorporate other identity management platforms into their workflows. For example, NYU noted they routinely include 024 (other standard identifiers) fields in authority records with ISNI, VIAF, and Wikidata identifiers and also others like ORCID and Union List of Artist Names (ULAN) when appropriate for the person described.
The University of Washington has continued a project begun under the PCC Wikidata Pilot that resulted in Wikidata application profiles for Faculty and Staff and Graduate Students. Once a Wikidata entity is created, these URIs are directly incorporated into MARC records for electronic theses and dissertations (ETDs).
Going beyond LC authority files can allow catalogers to remediate problematic gaps. But it also comes with risks.
Several participants noted that using vocabularies outside the LC/NAF and LCSH allowed them to use culturally relevant names for Indigenous communities. At the University of Chicago, archivists who previously only used LC Subject Headings for cultural group names are exploring identity management options that allow them to circumvent problematic terms. “By utilizing controlled vocabularies created and maintained by subject experts who work with Indigenous communities, the archivists hope to make records more discoverable for the people whose cultures and ancestors are described…. Archivists select a vocabulary to use based on which has the most subject expert input for Indigenous communities in the particular geographic region… For example, [the] finding aid to the Gerhardt Laves Papers. Here, an archives processing student assistant corrected Laves’ spelling of several Australian Aboriginal cultures and languages based on AustLang. For instance, we used AustLang to identify the correct spelling for the Karajarri community (spelling in Laves’ materials Karadjeri) and the Yawuru community (spelled in Laves’ materials Yuwari).”
Similar efforts are underway at the University of Sydney. “We apply Aboriginal and Torres Strait Islander people headings from the AustLang database in selected theses records in our Institutional Repository….We [also] use BlackWords in AustLit and the Aboriginal and Torres Strait Islander Biographical Index (ABI) to ascertain the heritage of the Aboriginal and/or Torres Strait Islander authors and then add notes in the metadata to highlight their heritage.” View example record. At the same time, these examples highlight the promises and perils of moving outside of known rules of LC/NAF.
While these examples use a trusted, authoritative source, another participant noted: “There are also ethical concerns—how much faith can we put in each service to respect the privacy and dignity of the people being described? Can we control/remediate any harm being done in the same way that we can in the NAF?” Another participant emphasizes that “…well-meaning majority-white institutions might unintentionally expose individuals from historically marginalized groups to harm or harassment, and majority identity library workers sometimes lack enough cultural awareness to accurately describe or label individuals from underrepresented groups.”
Trust, efficiency, lower costs, technical affordances, and good governance policies and procedures are what we’re looking for in new identity management environments. We asked respondents, “What criteria do you consider important in selecting an identity management source?”
This mindmap visually summarizes the criteria mentioned by participants in their responses and our discussions.
Trust. Above all, librarians feel that any identity management services they use need to be trusted. This is not only because the information a service provides is current and correct, but also because it is offered through stable endpoints that can be relied on in production environments. Especially for inclusion in linked-data records, having persistent URIs is essential.
Currently, we place a great deal of trust in the Library of Congress and NACO-trained catalogers because of the strong governance model they provide. This community provides mutual support, documentation, and training to ensure the quality of its authority file. While it is unlikely other identity management services will replicate this quality at the same level, having a clear governance structure is an important consideration. For example:
Do its values reflect those of the library community?
Is there a clear support channel for technical help and/or to report errors/problems?
Is it inclusive of different communities/individuals being identified?
Does it have mechanisms in place to prevent or mitigate the harms from vandalism?
Efficiency. Librarians’ interest in alternative identity management often comes back to how they can make their work more efficient. Identity management sources that are not comprehensive in their area of description, are not up to date, or are inconsistent in their coverage make poor targets for inclusion. Instead, our participants sought out sources that were fit for purpose (e.g. ORCID for scholars, Discogs for music, IMDB for movies, or specialized resources like AustLang, etc.).
An identity management source should also allow workflows to flow smoothly. Several contributors noted that the advantages of Wikidata were the low barriers to entry, timeliness of URI generation, and ease of contributions/updates. Participants recognize that these features sit in tension with the desire for trusted and efficient targets because they also result in duplicates that need to be disambiguated or merged. Valued services provide additional structured properties that aid in disambiguation beyond just looking for closely matching string labels.
Respondents stressed that non-LC/NAF authorities do not necessarily reduce the time and effort inherent in identity management. Instead, their focus was on the capabilities of these platforms, which allowed them to focus on the intellectual aspects of identity management and less on data management. There is also concern that non-LC/NAF sources could exacerbate problems with duplicates.
Technical affordances. Many of our participants also saw value in the modern technical infrastructures provided by the alternatives to LC/NAF, especially when using linked data.
Participants indicated that criteria for adopting a new identity management platform include:
conformance to basic Linked Data principles
the ability to be serialized in multiple ways, such as JSON-LD, CSV, etc.
support for multiple languages and scripts, with language tags to localize preferred labels
Many linked-data-ready services also increase their value to libraries by offering an API, SPARQL endpoint, or OpenRefine reconciliation service. This allows metadata creators to work in batches and makes these services more efficient at larger scales. This is especially true when combined with tools and techniques that help disambiguate or cluster similar name strings and/or entity properties. Libraries still operating in older systems may find it more difficult to integrate these services. However, modern library service platforms (LSPs), repositories, and DAMs are increasingly adapting to include these sources.
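As one concrete example of the batch-friendly affordances participants value, the sketch below sends an exact-label query to the public Wikidata SPARQL endpoint and returns candidate entities with descriptions to aid disambiguation; the sample name and query shape are illustrative:

# Sketch: look up candidate Wikidata entities for a name string via the public
# SPARQL endpoint, returning extra properties that help with disambiguation.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

def candidate_entities(label: str, language: str = "en") -> list[dict]:
    query = f"""
    SELECT ?item ?itemLabel ?itemDescription WHERE {{
      ?item rdfs:label "{label}"@{language} .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "{language}". }}
    }}
    LIMIT 10
    """
    response = requests.get(
        ENDPOINT,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "identity-management-sketch/0.1 (example)"},
        timeout=60,
    )
    response.raise_for_status()
    rows = response.json()["results"]["bindings"]
    return [
        {
            "qid": row["item"]["value"].rsplit("/", 1)[-1],
            "label": row["itemLabel"]["value"],
            "description": row.get("itemDescription", {}).get("value", ""),
        }
        for row in rows
    ]

for match in candidate_entities("Octavia E. Butler"):
    print(match["qid"], match["label"], "-", match["description"])

Where fuzzy matching is needed, the same pattern extends to the wbsearchentities API or an OpenRefine reconciliation service.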
The use of alternatives is not limited to identity management services that are based on library-specific data models. However, those services that have a close alignment to library models are more readily integrated into existing workflows. Identity management services with a clear, consistent, and (relatively) stable data model are preferred.
The road ahead
Metadata Managers have many aspirations that are tempered by concerns about what going beyond the LC/NAF means for identity management.
Using other platforms can increase the inclusion of entities that don’t neatly fit into current bibliographic practices. This can allow us to address current harms and further align our practices with the needs of specific communities. Loosening the control we have on authority record workflows means we can also invite more people into the process in ways that augment library expertise and result in the creation of more identifiers/greater coverage of the entities in bibliographic data. At the same time, we recognize there are dangers in possibly diluting our resources with lower-quality entities that create more work for librarians.
Libraries hope that new identity management platforms can be implemented in ways that lower costs. This may come through increased efficiencies or by being able to rely on less-demanding training requirements needed for staff to perform identity management tasks. Many of the emerging alternatives are also free to use without a direct cost to the library that uses them. Instead, the costs are borne by other agents within the metadata ecosystem—whether that’s through memberships in collaboratives like CrossRef, ORCID, SNAC, or through philanthropic funding that supports the Wikimedia Foundation/Wikidata. Librarians are also wary that we could invest time, labor, and intellectual efforts that contribute to identity management environments outside of library control. If there is value in this work, it could be captured in the future if the terms and conditions of use change to more closed models.
What do these conversations mean for OCLC? In my next post, I’ll interview Jeff Mixter to learn more about how he’s taken his work on OCLC Research pilot projects into developing production services for WorldCat Entities and linked data.
To commemorate Open Data Day 2023, Amara Hub convened a co-creation session to discuss the use of Artificial Intelligence (AI) in conservation efforts in Northern Uganda. The session was attended by a diverse spectrum of participants from Uganda Wildlife Authority, Africa Wildlife Federation, Uganda Conservation Foundation, Unwanted Witness along with conservation tech enthusiasts and practitioners, who shared their insights and ideas on how to leverage AI to protect endangered species and mitigate the impact of climate change on biodiversity. This hybrid breakfast event with 20 participants was held on 9th March 2023.
The broad next steps of the co-creation session are to identify and prioritise key challenges where there is an opportunity for AI to offer solutions, build partnerships, and develop a successful pilot project that benefits both wildlife and local communities in Northern Uganda.
Working through hypotheses drawn from two scenarios, the co-creation session participants assessed what artificial intelligence can practically look like for conservation, and how it can be applied in the different contexts of protected and unprotected areas.
The first scenario was on the use of machine learning algorithms to track the movement of animals and predict their behaviour. By analysing large amounts of data from GPS collars and other tracking devices, AI systems can learn to predict where animals are likely to go next, making it easier for conservationists to anticipate and prevent human-wildlife conflict. Participants explored a research piece from Elizabeth Bondi et al., SPOT poachers in action: Augmenting conservation drones with automatic detection in near real-time as a basis for discussion on this scenario.
For the second scenario, participants in the co-creation session discussed climate change and its significant impact on biodiversity, with species facing extinction due to changes in temperature, precipitation, and other environmental factors. There was general consensus that monitoring the effects of climate change on biodiversity is a challenging and time-consuming task, making it difficult to implement effective mitigation strategies. However, audio and visual data might be leveraged to develop more accurate and efficient monitoring systems. There was a review of data from 2014, by F. Wanyama, P. Elkan, F. Grossmann, et al., which highlights the findings of an aerial survey of large mammals in the Kidepo Valley National Park (KVNP) and the adjacent Karenga Community Wildlife Area (KCWA) in June 2014. The aerial survey data showed that the elephant population had fluctuated over the years in response to human-environmental interactions.
The application of AI in wildlife conservation has the potential to revolutionise the way we protect and conserve our natural resources. With the rapid advancements in technology, there is a real opportunity to leverage AI for the benefit of wildlife conservation in Northern Uganda.
Some specific actions developed towards this are:
Document the outcomes of the co-creation session: This detailed blog post is a crucial first step to reporting and acts as a repository for the ideas, insights, and suggestions generated during the co-creation session. This documentation can be used to inform the next stages of the project and can serve as a reference for the team and stakeholders. The presentation deck shared by Amara Hub is also available here.
Concept note development: After the co-creation session, the logical next step is to identify the key challenges that need to be addressed to implement AI for wildlife conservation in Northern Uganda. A follow-up concept note will be drafted by Amara Hub and shared with stakeholders. This will include the challenges related to data collection, analysis, and efficient decision-making for conservation efforts in Northern Uganda, proposed community engagement models, and resources required to pilot identified solutions.
Partnership building: Forming partnerships will help to ensure the success of the initiative. These partnerships will include community organisations, government agencies, as well as local/international conservation organisations. To aid communication, collaboration, and knowledge-sharing, a dedicated Slack channel has been created. Any like-minded individuals/organisations are welcome to join it here.
The age of generative AI and the large language model
It looks like generative AI will be broadly deployed across industries, in education, science, retail and recreation. Everywhere.
Current discussion ranges from enthusiastic technology boosterism about complete transformation to apocalyptic scenarios about AI takeover, autonomous military agents, and worse.
Comparisons with earlier technologies seem apt, but also vary in suggested importance. (Geoffrey Hinton wryly observed in an interview that it may be as big as the invention of the wheel.)
As with those earlier technologies - the web or mobile, for example - progress is not linear or predictable. It will be enacted in practices which will evolve and influence further development, and will take unimagined directions as a result. It has the potential to engage deeply with human behaviours to create compounding effects in a progressively networked environment. Think of how mobile technologies and interpersonal communications reshaped each other, and how the app store or the iPod/iPhone evolved in response to use.
It is likely to become a routine part of office, search, social and other applications. And while some of this will appear magic, much of it will be very mundane.
The outcomes will be productive, and also problematic. There is the dilution of social trust and confidence as synthesized communication or creation is indiscernible from human forms; there is the social impact on employment, work quality and exploited labor; there is potential for further concentration of economic or cultural power; there is propagation of harmful and historically dominant perspectives.
This is a descriptive overview of some current developments in generative AI, looking broadly. The treatment is necessarily limited, mostly by my own comprehension. My purpose is to provide some background context for those working in libraries as a prelude to discussing library issues and directions in subsequent posts. While I mention libraries and related areas occasionally, I have deliberately kept this general and not picked up on library implications. The immediate prompt (sic) was that I was writing a note on libraries and AI, in part based on a recent presentation on discovery, and I realised I needed to know more about some of the background myself. There is of course now a large technical and research literature, as well as an avalanche of popular exposition and commentary in text, podcasts, and on YouTube. The click-bait is strong, and it can be difficult to separate the spectacular but short-lived demonstration from the real trend. Accordingly, I include quite a few links to further sources of information and I also quote liberally. Much of what I discuss or link to will be superseded quickly. Do get in touch if you feel I make an error of judgement or emphasis.
This is a long post! The sections (above) can be read independently and there is more detail in the table of contents. There are quite a few examples in Section one if you want to skim or jump ahead.
Google has a Machine learning glossary which includes succinct definitions with some expansion for developers. The New York Times recently provided a shorter, more high-level list of terms in generative AI.
Brief orientation
Large language model
ChatGPT rests on a large language model. A large language model (LLM) is a neural network which is typically trained on large amounts of Internet and other data. It generates responses to inputs ('prompts'), based on inferences over the statistical patterns it has learned through training.
This ability is used in applications such as personalization, entity extraction, classification, machine translation, text summarization, and sentiment analysis. It can also generate new outputs, such as poetry, code and, alas, content marketing. Being able to iteratively process and produce text also allows it to follow instructions or rules, to pass instructions between processes, to interact with external tools and knowledge bases, to generate prompts for agents, and so on.
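As a minimal illustration of that generative behaviour, the sketch below assumes the Hugging Face transformers library and the small open GPT-2 model (my choice purely for the example): the model simply continues the prompt with statistically likely next tokens, which is the behaviour described above.

from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuation repeatable
generator = pipeline("text-generation", model="gpt2")

# The model continues the prompt; it is not retrieving facts, just likely next tokens.
result = generator("The library of the future will", max_new_tokens=25, num_return_sequences=1)
print(result[0]["generated_text"])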
This ability means that we can use LLMs to manage multiple interactions through text interfaces. This potentially gives them compounding powers, as multiple tools, data sources or knowledge bases can work together iteratively ... and is also a cause of concern about control and unanticipated behaviors where they go beyond rules or act autonomously.
LLMs accumulate knowledge about the world. But their responses are based on inferences about language patterns rather than about what is 'known' to be true, or what is arithmetically correct. Of course, given the scale of the data they have been trained on, and the nature of the tuning they receive, they may appear to know a lot, but then will occasionally fabricate plausible but fictitious responses, may reflect historically dominant perspectives about religion, gender or race, or will not recognise relevant experiences or information that has been suppressed or marginalized in the record.
I liked this succinct account by Daniel Hook of Digital Science:
What can seem like magic is actually an application of statistics – at their hearts Large Language Models (LLMs) have two central qualities: i) the ability to take a question and work out what patterns need to be matched to answer the question from a vast sea of data; ii) the ability to take a vast sea of data and “reverse” the pattern-matching process to become a pattern creation process. Both of these qualities are statistical in nature, which means that there is a certain chance the engine will not understand your question in the right way, and there is another separate probability that the response it returns is fictitious (an effect commonly referred to as “hallucination”). // Daniel Hook
And here is a more succinct statement by Stephen Wolfram from his extended discussion of ChatGPT: '... it’s just saying things that “sound right” based on what things “sounded like” in its training material.'
I used 'text' above rather than 'language.' Language is integral to how we think of ourselves as human. When we talk to somebody, our memories, our experiences of being in the world, our reciprocal expectations are in play. Removing the human from language has major cultural and personal ramifications we have not yet experienced. I was struck by this sardonically dystopian discussion of AI a while ago in a post on the website of the writing app iA. It concludes with brief poignant comments about language as a human bridge and about what is lost when one side of the bridge is disconnected.
ChatGPT galvanized public attention to technologies which had been known within labs and companies for several years. GPT refers to a family of language models developed by OpenAI. Other language models have similar architectures. GPT-4, the model underlying ChatGPT, is an example of what has come to be called a foundation model.
In recent years, a new successful paradigm for building AI systems has emerged: Train one model on a huge amount of data and adapt it to many applications. We call such a model a foundation model. // Center for Research on Foundation Models, Stanford
It is useful to break down the letters in GPT for context.
The 'g' is generative as in 'generative AI.' It underlines that the model generates answers, or new code, images or text. It does this by inference over the patterns it has built in training given a particular input or 'prompt.'
The 'p' stands for 'pre-trained', indicating a first training phase in which the model processes large amounts of data without any explicit tuning or adjustment. It learns about patterns and structure in the data, which it represents as weights. In subsequent phases, the model may be tuned or specialised in various ways.
The 't' in GPT stands for 'transformer.' Developed by Google in 2017, the transformer model is the neural network architecture which forms the basis of most current large language models (Attention is all you need, the paper introducing the model). The transformer defines the learning model, and generates the parameters which characterise it. It is common to describe an LLM in terms of the number of parameters it has: the more parameters, the more complex and capable it is. Parameters represent variable components that can be adjusted as the model learns, during pre-training or tuning. Weights are one parameter; another important one is embeddings. Embeddings are a numerical representation of tokens (words, phrases, ...) which can be used to estimate mutual semantic proximity ('river' and 'bank' are close in one sense of 'bank', and more distant in another).
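The 'river'/'bank' point can be made concrete with a short sketch. It assumes the sentence-transformers library and the all-MiniLM-L6-v2 model, both my choices for illustration; a sentence using 'bank' in the river sense should score closer to a river sentence than the financial sense does, though exact numbers vary by model.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
sentences = [
    "They walked along the river bank.",      # 'bank' in the river sense
    "She deposited the cheque at the bank.",  # 'bank' in the financial sense
    "The river flooded after heavy rain.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity: higher means closer together in the embedding space.
print(util.cos_sim(embeddings[0], embeddings[2]))  # river-sense 'bank' vs. river
print(util.cos_sim(embeddings[1], embeddings[2]))  # money-sense 'bank' vs. river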
Three further topics might be noted here.
First, emergent abilities. It has been claimed that as models have scaled, unforeseen capabilities have emerged [pdf]. These include the ability to be interactively coached by prompts ('in context learning'), to do some arithmetic, and others. However, more recently these claims have been challenged [pdf], suggesting that emergent abilities are creations of the analysis rather than actual model attributes. This question is quite important, because emergent abilities have become a central part of the generative AI narrative, whether one is looking forward to increasing capabilities or warning against unpredictable outcomes. For example, the now famous letter calling for a pause in AI development warns against 'the dangerous race to ever-larger unpredictable black-box models with emergent capabilities.' BIG-bench is a cross-industry initiative looking at characterizing model behaviors as they scale and 'improve' [pdf]. I was amused to see this observation: 'Also surprising was the ability of language models to identify a movie from a string of emojis representing the plot.'
Second, multimodal models. Generative models are well established, if developing rapidly. Historically, models may have worked on text, or image, or some other mode. There is growing interest in having models which can work across modes. GPT-4 and the latest Google models have some multi-modal capability - they can work with images and text, as can, say, Midjourney or DALL-E. This is important for practical application reasons (doing image captioning or text to image, for example). It also diversifies the inputs into the model. In an interesting development, Meta has released an experimental multi-modal open source model, ImageBind, which can work across six types of data in a single embedding space: visual (in the form of both image and video); thermal (infrared images); text; audio; depth information; and movement readings generated by an inertial measuring unit, or IMU.
And third, embeddings and vector databases. Embeddings are important in the context of potential discovery applications, supporting services such as personalization, clustering, and so on. OpenAI, Hugging Face, Cohere and others offer embedding APIs to generate embeddings for external resources which can then be stored and used to generate some of those services. This has given some lift to vector databases, which seem likely to become progressively more widely used to manage embeddings. There are commercial and open source options (Weaviate, Pinecone, Chroma, etc.). Cohere recently made embeddings for multiple language versions of Wikipedia available on Hugging Face. These can be used to support search and other applications, and similarity measures work across languages.
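What a vector database does can be sketched with plain numpy rather than Weaviate, Pinecone or Chroma: store normalised embeddings for a collection and return the nearest neighbours of a query vector. The random vectors below are stand-ins for embeddings that would, in practice, come from an embedding API or a local model; dedicated vector databases add persistence, metadata filtering and approximate indexes so that this scales beyond brute force.

import numpy as np

rng = np.random.default_rng(0)
collection = rng.normal(size=(1000, 384))              # 1,000 'documents', 384-dim embeddings
collection /= np.linalg.norm(collection, axis=1, keepdims=True)

query = rng.normal(size=384)
query /= np.linalg.norm(query)

scores = collection @ query                            # cosine similarity (vectors are normalised)
top_k = np.argsort(scores)[::-1][:5]                   # indices of the 5 closest documents
print(top_k, scores[top_k])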
LLMs do not have experiential knowledge of the world or 'common sense.' This touches on major research questions in several disciplines. Yejin Choi summarizes in her title: Why AI Is Incredibly Smart — and Shockingly Stupid.
Three strands in current evolution and debate
I focus on three areas here: major attention and investment across all domains, rapid continuing evolution of technologies and applications, and social concerns about harmful or undesirable features.
1. Major attention and investment
Major attention and investment is flowing into this space, across existing and new organizations. Just to give a sense of the range of applications, here are some examples which illustrate broader trends.
Coaching and learning assistants. Khan Academy has developed Khanmigo, a ChatGPT-based interactive tutor which can be embedded in the learning experience. They are careful to highlight the 'guardrails' they have built into the application, and are very optimistic about its positive impact on learning. It is anticipated that this kind of tutoring co-pilot may be deployed in many contexts. For example, Chegg has announced that it is working with GPT-4 to deliver CheggMate, "an AI conversational learning companion." This may have been a defensive move; it was also recently reported that Chegg's shares had dropped 40% given concerns about the impact of ChatGPT on its business. Pearson has issued a cease and desist notice to an (unnamed) AI company which has been using its content, and has said it will train its own models.
Consumer navigation/assistance. Expedia has a two-way integration with ChatGPT - there is an Expedia plugin to allow ChatGPT to access Expedia details, and Expedia uses ChatGPT to provide enhanced guidance on its own site. Other consumer sites have worked with ChatGPT to provide plugins. Form filling, intelligent assistants, enquiry services, information pages, and so on, will likely see progressive AI upgrades. Wendy's has just announced a collaboration with Google to develop a chatbot to take orders at the drive-thru.
Competitive workflows/intelligence. Organizations are reviewing operations and workflows, looking for efficiencies, new capacities, and competitive advantage. A couple of announcements from PWC provide an example. It reports it is investing over $1B in scaling its ability to support and advise clients in developing AI-supported approaches. It also recently announced an agreement with Harvey, a startup specialising in AI services to legal firms, to provide PWC staff with "human led and technology enabled legal solutions in a range of areas, including contract analysis, regulatory compliance, claims management, due diligence and broader legal advisory and legal consulting services." It will build proprietary language models with Harvey to support its business.
Content generation/marketing. LinkedIn now uses ChatGPT to "include generative AI-powered collaborative articles, job descriptions and personalized writing suggestions for LinkedIn profiles." The collaborative articles I have seen have been somewhat bland, which underlines a concern many have that we will be flooded by robotic text - in adverts, press releases, content marketing, articles - as it becomes trivially easy to generate text. Canadian company Cohere offers a range of content marketing, search, recommendation, and summarization services based on its language models built to support business applications.
Code. GitHub Copilot, which writes code based on prompts, is accelerating code development. You can also ask ChatGPT itself to generate code for you, to find mistakes or suggest improvements. Amazon has recently released CodeWhisperer, a competitor, and made it free to use for individual developers. There has also been concern here about unauthorised or illegal reuse of existing code in training sets. Partly in response, BigCode is an initiative by Hugging Face and others to develop models based on code which is permissively licensed and to remove identifying features. They have released some models, and describe the approach here.
Productivity. Microsoft is promising integration across its full range of products to improve productivity (please don't mention Clippy). Other apps will increasingly include (and certainly advertise) AI features - Notion, Grammarly, and so on.
Publishing. The publishing workflow will be significantly modified by AI support. A group recently published an interesting taxonomy of areas where AI would have an impact on scholarly publishing: Extract, Validate, Generate, Analyse, Reformat, Discover, Translate. A major question is the role of AI in the generation of submissions to publishers and the issues it poses in terms of creation and authenticity. It seems likely that we will see synthetic creations across the cultural genres - art, music, literature - which pose major cultural and legal questions.
Image generation. Adobe is trialling Firefly, an 'AI art generator', as competition to Midjourney and others. Adobe is training the application on its own reservoir of images, to which it has rights.
Google. For once, Google is playing catchup and has released Bard, its chat application. It is interesting to note how carefully, tentatively almost, it describes its potential and potential pitfalls. And it has been cautious elsewhere. It is not releasing a public demo of Imagen, its text-to-image service, given concerns about harmful or inappropriate materials (see further below). However, Google's work in this area has been the subject of ongoing public debate. Several years ago, Timnit Gebru, co-lead of its internal Ethical Artificial Intelligence Team, left Google after disagreement over a paper critiquing AI and language models (since published: On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?). More recently, Geoffrey Hinton, neural network pioneer, resigned from his position at Google so that he could more readily express his concerns about general AI directions, especially given increased commercial competition. And, from a different perspective, an internal memo was recently leaked, documenting what it claimed to be technical and business competitive missteps by Google. Google has since announced a broad range of AI augmentations across its product suite, including search.
Developers, apps. As important as these more visible initiatives is the explosion of general interest as individual developers and organizations experiment with language models, tools and applications. Check out the number of openly available language models on Hugging Face or the number of extensions/plugins appearing for applications (see the Visual Studio Code Marketplace for example or WordPress plugins).
A discussion of applications in research and education, which of course are of critical importance to libraries, is too big a task to attempt here and is outside my scope. There is already a large literature and a diverse body of commentary and opinion. They are especially interesting given their knowledge-intensive nature and the critical role of social trust and confidence. The Jisc Primer (see below) introduces generative AI before discussing areas relevant to education. It notes the early concern about assessment and academic integrity, and then covers some examples of use in teaching and learning, of use as a time-saving tool, and of use by students. AI is already broadly used in research disciplines, in many different ways and at different points in the life cycle, and this will continue to grow. This is especially the case in medical, engineering and STEM disciplines, but it will apply across all disciplines. See for example this discussion of the potential use of AI in computational social sciences [pdf]. A major issue is the transparency of the training and tuning of models. Results need to be understandable and reproducible. A telling example of how books were represented in training data was provided in a paper with the arresting title: Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 [pdf]. They found many books in copyright, and also found that some categories were disproportionately over-represented: science fiction and fantasy, and popular out-of-copyright works, for example. For cultural analytics research, they argue that this supports a case for open models with known training data.
The UK national agency for education support, Jisc, has a primer on Generative AI which has a special focus on applications in education. It plans to keep it up to date.
2. Rapid continuing evolution
The inner workings of these large models are not well understood and they have shown unpredictable results. OpenAI chooses a strange and telling analogy here as they describe the unsupervised LLM training - building the model is 'more similar to training a dog than to ordinary programming.'
Unlike ordinary software, our models are massive neural networks. Their behaviors are learned from a broad range of data, not programmed explicitly. Though not a perfect analogy, the process is more similar to training a dog than to ordinary programming. An initial “pre-training” phase comes first, in which the model learns to predict the next word in a sentence, informed by its exposure to lots of Internet text (and to a vast array of perspectives). This is followed by a second phase in which we “fine-tune” our models to narrow down system behavior. // OpenAI
Here are some areas of attention and interest:
A diversity of models
OpenAI, Google, Microsoft and others are competing strongly to provide foundation models, given the potential rewards. Amazon has built out its offerings a little more slowly, and is partnering with various LLM developers. Startups such as AI21 Labs and Anthropic are also active. These will be important elements of what is a diversifying environment, and LLMs will be components in a range of platform services offered by these commercial players.
There is also a very active interest in smaller models, which may be run in local environments, or are outside of the control of the larger commercial players. Many of these are open source. There has been a lot of work based on LLaMA, a set of models provided by Meta (with names like Alpaca, Vicuna, and .... Koala). Current licensing restricts these to research uses, so there has also been growing activity around commercially usable models. Databricks recently released Dolly 2.0, an open source model which is available for commercial use. They also released the training data for use by others.
Mosaic ML, a company supporting organizations in training their models, has also developed and released several open source models which can be used commercially. It has also released what it calls an LLM Foundry, a library for training and fine-tuning models. There is a base model and some tuned models. Especially interesting is MPT-7B-StoryWriter-65k+, a model optimised for reading and writing stories.
Many others are also active here and many models are appearing. This was highlighted in the infamous leaked Google Memo. It underlined the challenge of smaller open source LLMs to Google, stressing that it has no competitive moat, and that, indeed, the scale of its activity may be slowing it down.
At the same time, many specialist models are also appearing. These may be commercial and proprietary, as in the PWC/Harvey and Bloomberg examples I mention elsewhere. Or they may be open or research-oriented, such as those provided by the National Library of Sweden, or proposed by Core (an open access aggregator) and the Allen Institute for AI. Intel and the Argonne National Laboratory have announced a series of LLMs to support the scientific research community, although not much detail is provided (at the time of writing). More general models may also be specialised through use of domain-specific instruction sets at the tuning stage. Google has announced specialised models for cybersecurity and medical applications, for example, based on its latest PaLM 2 model. Although it has not published many technical details, LexisNexis has released a product, Lexis+ AI, which leverages several language models, including, it seems, some trained on Lexis materials.
So, in this emergent phase there is much activity around a diverse set of models, tools and additional data. A race to capture commercial advantage sits alongside open, research and community initiatives. Hugging Face has emerged as an important aggregator for open models, as well as for data sets and other components. Models may be adapted or tuned for other purposes. This rapid innovation, development, and interaction between models and developers sits alongside growing debate about transparency (of training data among other topics) and responsible uses.
Refinement or specialization: Fine-tuning, moderation, and alignment
Following the 'unsupervised' generation of the large language model, it is 'tuned' in later phases. Tuning alters parameters of the model to optimise the performance of LLMs for some purpose, to mitigate harmful behaviors, to specialise to particular application areas, and so on. Methods include using a specialist data set targeted to a particular domain, provision of instruction sets with question/response examples, human feedback, and others. Retraining large models is expensive, and finding ways to improve their performance or to improve the performance of smaller models is important. Similarly, it may be beneficial to tune an existing model to a particular application domain. Developing economic and efficient tuning approaches is also an intense R&D focus, especially as more models targeted at lower power machines appear.
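To give a sense of what 'instruction sets with question/response examples' look like in practice, here are two invented records in the general shape of openly released tuning sets such as the Dolly 2.0 training data mentioned above; the field names and contents are illustrative only.

instruction_examples = [
    {
        "instruction": "Summarise the following paragraph in one sentence.",
        "context": "Large language models are trained on web-scale text and learn statistical patterns...",
        "response": "LLMs learn statistical patterns of language from very large text corpora.",
    },
    {
        "instruction": "Answer the question, and say you do not know rather than guessing.",
        "context": "",
        "response": "I don't know.",
    },
]
# Fine-tuning runs many pairs like these through the model so that its outputs start
# to follow the demonstrated instruction-following behaviour.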
OpenAI highlights the use of human feedback to optimize the model.
The data is a web-scale corpus of data including correct and incorrect solutions to math problems, weak and strong reasoning, self-contradictory and consistent statements, and representing a great variety of ideologies and ideas.
So when prompted with a question, the base model can respond in a wide variety of ways that might be far from a user’s intent. To align it with the user’s intent within guardrails, we fine-tune the model’s behavior using reinforcement learning with human feedback (RLHF). // OpenAI
OpenAI also published safety standards and has a Moderation language model which it has also externalised to API users.
The Moderation models are designed to check whether content complies with OpenAI's usage policies. The models provide classification capabilities that look for content in the following categories: hate, hate/threatening, self-harm, sexual, sexual/minors, violence, and violence/graphic. You can find out more in our moderation guide. // OpenAI
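A minimal sketch of calling that moderation endpoint, assuming the official openai Python package and an API key in the environment; the method and field names follow OpenAI's published documentation at the time of writing, but the API has been changing quickly, so treat this as indicative rather than definitive.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.moderations.create(input="Some user-submitted text to check.")
print(result.results[0].flagged)      # True if any category is triggered
print(result.results[0].categories)   # per-category flags (hate, self-harm, violence, ...)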
OpenAI externalised the RLHF work to contractors, involving labor practices that are attracting more attention as discussed further below. Databricks took a 'gamification' approach among its employees to generate a dataset for tuning. Open Assistant works with its users in a crowdsourcing way to develop data for tuning [pdf]. GPT4All, a chat model from Nomic AI, asks its users whether it can capture interactions for further training.
Tuning data is also becoming a sharable resource, and subject to questions about transparency and process. For example, Stability AI reports how it has tuned an open source model, StableVicuna, using RLHF data from several sources, including Open Assistant.
Reinforcement Learning by Human Feedback is just parenting for a supernaturally precocious child.
Geoffrey Hinton - major figure in neural network development.
The question of 'alignment', mentioned in the OpenAI quote, is key, although use of the term is elastic depending on one's view of threats. Tuning aims to align the outputs of the model with human expectations or values. This is a very practical issue in terms of effective deployment and use of LLMs in production applications. Of course it also raises important policy and ethical issues. Which values, one might ask? Does one seek to remove potentially harmful data from the training materials, or try to tune the language models to recognise it and respond appropriately? How does one train models to understand that there are different points of view or values, and respond appropriately? And in an age of culture wars, fake news, ideological divergences, and very real wars, 'alignment' takes on sinister overtones.
This work sits alongside the very real concerns that we do not know enough about how the models work to anticipate or prevent potential harmful effects as they get more capable. Research organization Eleuther.ai initially focused on providing open LLMs but has pivoted to researching AI interpretability and alignment as more models are available. There are several organizations devoted to alignment research (Redwood Research, Conjecture) and LLM provider Anthropic has put a special emphasis on safety and alignment. And several organizations more broadly research and advocate in favor of positive directions. Dair is a research institute founded by Timnit Gebru "rooted in the belief that AI is not inevitable, its harms are preventable, and when its production and deployment include diverse perspectives and deliberate processes it can be beneficial."
The return of content
Alex Zhavoronkov had an interesting piece in Forbes where he argues that content generators/owners are the unexpected winners as LLMs become more widely used. The value of dense, verified information resources increases as they provide training and validation resources for LLMs, in contrast to the vast uncurated heterogeneous training data. Maybe naturally, he points to Forbes as a deep reservoir of business knowledge. He also highlights the potential of scientific publishers, Nature and others, whose content is currently mostly paywalled and isolated from training sets. He highlights not only their unique content, but also the domain expertise they can potentially marshal through their staffs and contributors.
Nature of course is part of the Holtzbrinck group which is also home to Digital Science. Zhavoronkov mentions Elsevier and Digital Science and notes recent investments which position them well now.
A good example from a different domain is provided by Bloomberg, which has released details of what it claims is the largest domain-specific model, BloombergGPT. This is created from a combination of Bloomberg's deep historical reservoir of financial data as well as from more general publicly available resources. It claims that this outperforms other models considerably on financial tasks, while performing as well or better on more general tasks.
“The quality of machine learning and NLP models comes down to the data you put into them,” explained Gideon Mann, Head of Bloomberg’s ML Product and Research team. “Thanks to the collection of financial documents Bloomberg has curated over four decades, we were able to carefully create a large and clean, domain-specific dataset to train a LLM that is best suited for financial use cases. We’re excited to use BloombergGPT to improve existing NLP workflows, while also imagining new ways to put this model to work to delight our customers.” // Bloomberg
The model will be integrated into Bloomberg services, and Forbes speculates that it will not be publicly released given the competitive edge it gives Bloomberg. Forbes also speculates about potential applications: creating an initial draft of an SEC filing, summarizing content, providing organization diagrams for companies, and noting linkages between people and companies; automatically generated market reports and summaries for clients; custom financial reports.
Bloomberg claims that it achieved this with a relatively small team. This prompts the question about what other organizations might do. I have mentioned some in the legal field elsewhere. Medical and pharmaceutical industries are obvious areas where there is already a lot of work. Would IEEE or ACS/CAS develop domain specific models in engineering and chemistry, respectively?
Libraries, cultural and research organizations have potentially much to contribute here. It will be interesting to see what large publishers do, particularly what I call the scholarly communication service providers mentioned above (Elsevier, Holtzbrinck/Digital Science, Clarivate). These have a combination of deep content, workflow systems, and analytics services across the research workflow. They have already built out research graphs of researchers, institutions, and research outputs. The National Library of Sweden has been a pioneer also, building models on Swedish language materials, and cooperating with other national libraries. I will look at some initiatives in this space in a future post.
Specialist, domain or national data sets are also interesting in the context of resistance to the 'black box' nature of the current foundation models, unease about some of the content of uncurated web scrapes, and concern about the WEIRD (Western, Educated, Industrial, Rich, Democratic) attributes of current models. We will likely see more LLMs specialised by subject, by country or language group, or in other ways, alongside and overlapping with the push to open models.
Platform competition
There is a major commercial focus on providing services which in different ways allow development, deployment and orchestration of language models. It is likely that on-demand, cloud-based LLM platforms which offer access to foundation models, as well as to a range of development, customization, deployment and other tools will be important. The goal will be to streamline operations around the creation and deployment of models in the same way as has happened with other pieces of critical infrastructure.
OpenAI has pushed strongly to become a platform which can be widely leveraged. It has developed API and plugin frameworks, and has adjusted its API pricing to be more attractive. It has a clear goal to make ChatGPT and related offerings a central platform provider in this emerging industry.
Microsoft is a major investor in OpenAI and has moved quickly to integrate AI into products, to release Bing Chat, and to look at cloud-based infrastructure services. Nvidia is also a big player here. Predictably, Amazon has been active. It recently launched Bedrock ("privately customize FMs with your own data, and easily integrate and deploy them into your applications") and Titan (several foundation models), to work alongside Sagemaker, a set of services for building and managing LLM infrastructure and workflows. Fixie is a new platform company which aims to provide enterprise customers with LLM-powered workflows integrating agents, tools and data.
Again, the commercial stakes are very high here, so these and other companies are moving rapidly.
The position of Hugging Face is very interesting. It has emerged as a central player in terms of providing a platform for open models, transformers and other components. At the same time it has innovated around the use of models, itself and with partners, and has supported important work on awareness, policy and governance.
It will also be interesting to see how strong the non-commercial presence is. There are many research-oriented specialist models. Will a community of research or public interest form around particular providers or infrastructure?
Today we're thrilled to announce our new undertaking to collaboratively build the best open language model in the world: AI2 OLMo.
A modest aspiration: the best open language model in the world
Guiding the model: prompt engineering and in-context learning
'Prompt engineering' has also quickly entered the general vocabulary. It has been found that results can be interactively improved by 'guiding' the LLM in various ways, by providing examples of expected answers for example. Much of this will also be automated as agents interact with models, multiple prompts are programmatically passed to the LLM, prompts are embedded in templates not visible to the user, and so on. A prompt might include particular instructions (asking the LLM to adopt a particular role, for example, or requesting a particular course of action if it does not know the answer), context (demonstration examples of answers, for example), directions for the format of the response, as well as an actual question.
Related to this, a major unanticipated 'emergent ability' is the improvement in responses that can be achieved by 'in context learning': 'In context learning is a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration.' It is 'in context' because it is not based on changing the underlying model, but on improving results in a particular interaction. The language model can be influenced by examples in inferring outputs. So, if you are asking it to develop a slogan for your organization, for example, you could include examples of good slogans from other organizations in the prompt.
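A minimal sketch of such an in-context ('few-shot') prompt, with invented examples; the point is the shape - an instruction, a few demonstrations, then the actual request - all sent to the model as a single prompt.

prompt = """You write short slogans. If you are unsure, say "I don't know".

Organization: City cycling cooperative
Slogan: Two wheels, one neighbourhood.

Organization: Community seed library
Slogan: Borrow a seed, return a harvest.

Organization: Volunteer-run digital archive
Slogan:"""

# The string would then be sent to whichever completion or chat endpoint is in use;
# nothing in the underlying model has been changed by these examples.
print(prompt)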
A body of good practice is developing here, and many guidelines and tools are emerging. Prompt engineering has been identified as a new skill, with learned expertise. It has also been recognized as analogous to coding - one instructs the LLM how to behave with sequences of prompts. (Although, again, bear in mind the remark above about training a dog!)
An important factor here is the size of the context window. The context window is the number of tokens (words or phrases) that can be input and considered at the same time by the LLM. One advance of GPT-4 was to enlarge the context window, meaning it could handle longer, more complex prompts. However, recently we have seen major advances here. I note MPT-7B-StoryWriter-65k+ elsewhere, which is optimised for working with stories and can accept 65K tokens, large enough, for example, to process a novel without having to break it up into multiple prompts. The example they give is of inputting the whole of The Great Gatsby and having it write an epilogue. Anthropic has announced that their Claude service now operates with a context window of 100k tokens. One potential benefit of such large context windows is that it allows single documents to be input for analysis - they give examples of asking questions about a piece of legislation, a research paper, an annual report or financial statement, or maybe an entire codebase.
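Counting tokens is how one checks whether a prompt will fit a given context window. The sketch below assumes the tiktoken tokenizer library; the 8,192-token window and the space reserved for the answer are illustrative numbers only, since limits differ by model and change over time.

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by GPT-4-era models

def fits_in_context(text, context_window=8192, reserved_for_answer=1024):
    # Rough check: prompt tokens plus room for the model's answer must fit the window.
    return len(encoding.encode(text)) + reserved_for_answer <= context_window

print(fits_in_context("A short prompt."))
print(fits_in_context("word " * 20000))  # a novel-length input would fail for small windows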
An overview of tools, guidelines and other resources produced by dair.ai
Building tools and workflows: agents
Discussion of agents is possibly the most wildly divergent in this space, with topics ranging from controlled business workflow solutions to apocalyptic science fiction scenarios in which autonomous agents do harm.
LLMs have structural limits as assistants. They are based on their training data, which is a snapshot in time. They are unconnected to the world, so they do not search for answers interactively or use external tools. They have functional limits, for example they do not do complex mathematical reasoning. And they typically complete only one task at a time. We are familiar, for example, with ChatGPT telling us that it is not connected to the internet and cannot answer questions about current events.
This has led to growing interest in connecting LLMs to external knowledge resources and tools. However, more than this, there is interest in using the reasoning powers of LLMs to manage more complex, multi-faceted tasks. This has led to the emergence of agents and agent frameworks.
In this context, the LLM is used to support a rules-based framework (a minimal sketch in code follows this list) in which
programming tools, specialist LLMs, search engines, knowledge bases, and so on, are orchestrated to achieve tasks;
the natural language abilities of the LLM are leveraged to allow it to plan -- to break down tasks, and to sequence and connect tools to achieve them;
the LLM may proceed autonomously, using intermediate results to generate new prompts or initiate sub-activities until it completes.
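The sketch below is purely illustrative: the call_llm stub and the toy tool registry stand in for a real chat model and real tools, and no actual framework's API is shown. The loop itself - ask the model, run whichever tool it picks, feed the observation back, stop when it produces a final answer - is the basic pattern these frameworks implement.

import json

TOOLS = {
    # Toy tools; a real framework would register search, retrieval, code execution, etc.
    "calculator": lambda expression: str(eval(expression, {"__builtins__": {}})),  # toy only
    "search": lambda query: "stub search results for: " + query,
}

def call_llm(messages):
    # Stand-in for a real chat-completion call. A real agent would send `messages` to a
    # model and expect JSON back, e.g. {"tool": "calculator", "input": "2+2"} or
    # {"final_answer": "..."}; this stub simply finishes immediately.
    return json.dumps({"final_answer": "stubbed answer"})

def run_agent(task, max_steps=5):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = json.loads(call_llm(messages))
        if "final_answer" in decision:                              # the model decides it is done
            return decision["final_answer"]
        observation = TOOLS[decision["tool"]](decision["input"])    # run the chosen tool
        messages.append({"role": "assistant", "content": json.dumps(decision)})
        messages.append({"role": "user", "content": "Observation: " + observation})
    return "stopped after max_steps without a final answer"

print(run_agent("What is 2 + 2, and who wrote The Great Gatsby?"))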
Auto-GPT has galvanized attention here. Several other frameworks have also appeared and the term Auto-GPT is sometimes used generically to refer to LLM-based agent frameworks.
Langchain is a general framework and library of components for developing applications, including agents, which use language models. They argue that the most 'powerful and differentiated' LLM-based applications will be data-aware (connecting to external data) and will be agentic (allow an LLM to interact with its environment). Langchain facilitates the plug and play composition of components to build LLM-based applications, abstracting interaction with a range of needed components into Langchain interfaces. Components they have built include language and embedding models, prompt templates, indexes (working with embeddings and vector databases), agents (to control interactions between LLMs and tools), and some others.
Frameworks like this are important because they help facilitate the construction of more complex applications. These may be reasonably straightforward (a conversational interface to documentation or PDF documents, comparison shopping app, integration with calculators or knowledge resources) or may involve more complex, iterative interactions.
Hugging Face has also introduced an agents framework, Transformers Agent, which seems similar in concept to LangChain and allows developers to work with an LLM to orchestrate tools on Hugging Face. This is also the space where Fixie hopes to make an impact, using the capabilities of LLMs to allow businesses to build workflows and processes. A marketplace of agents will support this.
This Handbook with parallel videos and code examples is a readable overview of LLM topics
At this stage, publicly visible agents have been relatively simple. The ability to use the language model to orchestrate a conversational interface, tools and knowledge resources is of great interest. We will certainly see more interactions of this kind in controlled business applications, on consumer sites, and elsewhere. Especially if they do in fact reduce 'stitching costs.' In a library context, one could see it used in acquisitions, discovery, interlibrary lending - wherever current operations depend on articulating several processes or tools.
Transparency and evaluation
As models are more widely used there is growing interest in transparency, evaluation and documentation, even if there aren't standard agreed practices or nomenclature. From the variety of work here, I mention a couple of initiatives at Hugging Face, given its centrality.
In a paper describing evaluation frameworks at Hugging Face, the authors provide some background on identified issues (reproducibility, centralization, and coverage) and review other work [pdf].
Evaluation is a crucial cornerstone of machine learning – not only can it help us gauge whether and how much progress we are making as a field, it can also help determine which model is most suitable for deployment in a given use case. However, while the progress made in terms of hardware and algorithms might look incredible to a ML practitioner from several decades ago, the way we evaluate models has changed very little. In fact, there is an emerging consensus that in order to meaningfully track progress in our field, we need to address serious issues in the way in which we evaluate ML systems [....].
They introduce Evaluate (a set of tools to facilitate evaluation of models and datasets) and Evaluation on the Hub (a platform that supports large-scale automatic evaluation).
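As a minimal sketch of the Evaluate library in use (the toy predictions and references are mine), a metric is loaded by name and computed over two lists; real evaluation would run a model over a benchmark dataset rather than hand-written values.

import evaluate

accuracy = evaluate.load("accuracy")
result = accuracy.compute(predictions=[0, 1, 1, 0], references=[0, 1, 0, 0])
print(result)  # e.g. {'accuracy': 0.75}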
Hugging Face also provide Leaderboards, which show evaluation results. The Open LLM Leaderboard lists openly available LLMs against a particular set of benchmarks.
In related work, Hugging Face have also developed a framework for documentation, and have implemented Model Cards, a standardized approach to description with supporting tools and guidelines. They introduce them as follows:
Model cards are an important documentation framework for understanding, sharing, and improving machine learning models. When done well, a model card can serve as a boundary object, a single artefact that is accessible to people with different backgrounds and goals in understanding models - including developers, students, policymakers, ethicists, and those impacted by machine learning models.
They also provide a review of related and previous work.
Since Model Cards were proposed by Mitchell et al. (2018), inspired by the major documentation framework efforts of Data Statements for Natural Language Processing (Bender & Friedman, 2018) and Datasheets for Datasets (Gebru et al., 2018), the landscape of machine learning documentation has expanded and evolved. A plethora of documentation tools and templates for data, models, and ML systems have been proposed and developed - reflecting the incredible work of hundreds of researchers, impacted community members, advocates, and other stakeholders. //
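For illustration, here is a minimal sketch of a model card handled with the huggingface_hub library's card utilities; the metadata values and description are invented placeholders, and the helper names should be checked against current documentation since the tooling is evolving.

from huggingface_hub import ModelCard

content = """---
language: en
license: apache-2.0
tags:
- text-generation
- example
---

# my-org/my-toy-model

A placeholder description of what the model does, what data it was trained on,
how it was evaluated, and its known limitations and biases.
"""

card = ModelCard(content)
print(card.data)  # the parsed metadata block from the YAML header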
Some of those working on the topics above were authors on the Stochastic Parrot paper. In that paper, the authors talk about 'documentation debt', where data sets are undocumented and become too large to retrospectively document. Interestingly, they reference work which suggests that archival principles may provide useful lessons in data collection and documentation.
There is clear overlap with the interests of libraries, archives, data repositories and other curatorial groups in this area. In that context, I was also interested to see a group of researchers associated with Argonne National Laboratory show how the FAIR principles can be applied to language models.
Given the black box nature of the models, there are also concerns about how they are used to guide decisions in important life areas open to bias - in law enforcement, education, social services, and so on. There may be very little understanding of how important decisions are being made. This is a regulatory driver, as shown in this Brookings discussion of developments in California and elsewhere.
Several technical factors have come together in the current generation of LLMs. The availability of massive amounts of data through webscale gathering and the transformer architecture from Google are central. A third factor is the increased performance of specialist hardware, to deal with the massive memory and compute requirements of LLM processing. Nvidia has been an especially strong player here, and its GPUs are widely used by AI companies and infrastructure providers. However, again, given the belief in massively increased demand, others have also been developing their own custom chips to reduce reliance on Nvidia. The competitive race is on at the hardware level also. See announcements from Microsoft and AMD, for example, or Meta (which is also working on data center design 'that will be “AI-optimized” and “faster and more cost-effective to build”').
There is some speculation about the impact of quantum computing further out, which could deliver major performance improvements.
At the same time, many of the open source newer models are designed to run with much smaller memory and compute requirements, aiming to put them within reach of a broad range of players.
Research interconnections
There are strong interconnections between research universities, startups and the large commercial players in terms of shared research, development and of course the movement of people. Many of the papers describing advances have shared academic and industrial authorship.
Several years ago, this is how The Verge reported on neural network pioneers Hinton (who recently left Google), Bengio and LeCun winning the Turing Award. And they have students and former collaborators throughout the industry.
All three have since taken up prominent places in the AI research ecosystem, straddling academia and industry. Hinton splits his time between Google and the University of Toronto; Bengio is a professor at the University of Montreal and started an AI company called Element AI; while LeCun is Facebook’s chief AI scientist and a professor at NYU. // The Verge
The Stanford AI Index for 2023 notes that industry produced 32 significant machine learning models in 2022, compared to 3 from academia. It cited the increased data and compute resources needed as a factor here. However, one wonders whether there might be some rebalancing in their next report given university involvement in the rise of open source models discussed above.
The leaked Google memo is interesting in this context:
But holding on to a competitive advantage in technology becomes even harder now that cutting edge research in LLMs is affordable. Research institutions all over the world are building on each other’s work, exploring the solution space in a breadth-first way that far outstrips our own capacity. We can try to hold tightly to our secrets while outside innovation dilutes their value, or we can try to learn from each other. // Google "We Have No Moat, And Neither Does OpenAI"
arXiv has become very visible given the common practice of disseminating technical accounts there. While this may give them an academic patina, not all then go through submission/refereeing/publication processes.
This role is recognised on the Hugging Face Hub. If a dataset card includes a link to a paper on arXiv, the hub will convert the arXiv ID into an actionable link to the paper and it can also find other models on the Hub that cite the same paper.
Hugging Face also has a paper notification service.
3. Social concerns
Calling for government regulation and referencing much reported recent arguments for a pause in AI development, a Guardian editorial summarises:
More importantly, focusing on apocalyptic scenarios – AI refusing to shut down when instructed, or even posing humans an existential threat – overlooks the pressing ethical challenges that are already evident, as critics of the letter have pointed out. Fake articles circulating on the web or citations of non-existent articles are the tip of the misinformation iceberg. AI’s incorrect claims may end up in court. Faulty, harmful, invisible and unaccountable decision-making is likely to entrench discrimination and inequality. Creative workers may lose their living thanks to technology that has scraped their past work without acknowledgment or repayment. // The Guardian view on regulating AI: it won’t wait, so governments can’t
Wrong or harmful results
We know that LLMs 'hallucinate' plausible-sounding outputs which are factually incorrect or fictitious. In addition, they can reflect biased or harmful views learned from the training data. They may normalize dominant patterns in their training data (doctors are men and nurses are women, for example).
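That kind of stereotyped association can be probed directly. The sketch below assumes the transformers fill-mask pipeline with bert-base-uncased (my choice of model, not one discussed in the post); the top-ranked pronouns for the two templates tend to differ, though results depend on the model and the template.

from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

for sentence in [
    "The doctor said that [MASK] would see the patient now.",
    "The nurse said that [MASK] would see the patient now.",
]:
    top = unmasker(sentence)[:3]                       # three most likely fillers
    print(sentence)
    print([candidate["token_str"] for candidate in top])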
In a discussion about their Imagen service, Google is very positive about the technical achievements and overall performance of the system ("unprecedented photorealism"). However, they are not releasing it because of concerns about social harm. I provide an extended quote here for two reasons. First, it is a direct account of the issues with training data. And second, this is actually Google talking about these issues.
Second, the data requirements of text-to-image models have led researchers to rely heavily on large, mostly uncurated, web-scraped datasets. While this approach has enabled rapid algorithmic advances in recent years, datasets of this nature often reflect social stereotypes, oppressive viewpoints, and derogatory, or otherwise harmful, associations to marginalized identity groups. While a subset of our training data was filtered to remove noise and undesirable content, such as pornographic imagery and toxic language, we also utilized LAION-400M dataset which is known to contain a wide range of inappropriate content including pornographic imagery, racist slurs, and harmful social stereotypes. Imagen relies on text encoders trained on uncurated web-scale data, and thus inherits the social biases and limitations of large language models. As such, there is a risk that Imagen has encoded harmful stereotypes and representations, which guides our decision to not release Imagen for public use without further safeguards in place.
Preliminary assessment also suggests Imagen encodes several social biases and stereotypes, including an overall bias towards generating images of people with lighter skin tones and a tendency for images portraying different professions to align with Western gender stereotypes. Finally, even when we focus generations away from people, our preliminary analysis indicates Imagen encodes a range of social and cultural biases when generating images of activities, events, and objects. // Google
The discussion above is about images. Of course, the same issues arise with text. This was one of the principal strands in the Stochastic Parrot paper mentioned above, which was the subject of contention with Google:
In summary, LMs trained on large, uncurated, static datasets from the Web encode hegemonic views that are harmful to marginalized populations. We thus emphasize the need to invest significant resources into curating and documenting LM training data.
The authors note for example the majority male participation in Reddit or contribution to Wikipedia.
This point is also made in a brief review of bias on the Jisc National Centre for AI site, which notes the young, male and American character of Reddit. They look at research studies on GPT-3 outputs which variously show gender stereotypes, increased toxic text when a disability is mentioned, and anti-Muslim bias. They discuss bias in training data, and also note that it may be introduced at later stages - with RLHF, for example, and in training OpenAI's Moderation API.
One disturbing finding of the BIG-bench benchmarking activity noted above was that "model performance on social bias metrics often grows worse with increasing scale." They note that this can be mitigated by prompting, and also suggest that the approach of the LaMDA model "to improve model safety and reduce harmful biases will become a crucial component in future language models."
Using Hugging Face tools to explore bias in language models
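As a small illustration of the kind of probe such tools support (a sketch, not a rigorous audit; the model and sentences are arbitrary choices of mine), one can compare the completions a masked language model proposes for sentences that differ only in the profession mentioned:

```python
# A minimal sketch using the transformers fill-mask pipeline to surface
# occupational gender associations; illustrative only, not a bias benchmark.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")

for sentence in [
    "The doctor said that [MASK] would be late for the meeting.",
    "The nurse said that [MASK] would be late for the meeting.",
]:
    print(sentence)
    for candidate in fill(sentence, top_k=3):
        print(f"  {candidate['token_str']}: {candidate['score']:.3f}")
```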
Authenticity, creation and intellectual property
LLMs raise fundamental questions about expertise and trust, the nature of creation, authenticity, rights, and the reuse of training and other input materials. These all entail major social, ethical and legal issues. They also raise concerns about manipulation, deliberately deceptive fake or false outputs, and bad actors. These are issues about social confidence and trust, and the broad erosion of this trust and confidence worries many.
There will be major policy and practice discussions in coming years. Think of the complexity of issues in education, the law, or medicine, for example, seeking to balance benefit and caution.
Full discussion is beyond my scope and competence, but here are some examples which illustrate some of these broad issues:
Scientific journals have had to clarify policies about submissions. For example, Springer Nature added two policies to their guidelines. First, no LLM will be accepted as an author, on the basis that it cannot be held accountable for the work it produces. And second, any use of LLM approaches needs to be documented in submitted papers. They argue that such steps are essential to the transparency scientific work demands.
What is the intellectual and cultural record? For the first time, we will see the wholesale creation of images, videos and text by machines. This raises questions about what should be collected and maintained.
Reddit and Stackoverflow have stated that they want compensation for being used as training data for Large Language Models. The potential value of such repositories of questions and answers is clear; what is now less clear is how much, if at all, LLM creators will be willing to pay for important sources. (It is interesting to note Arxiv and PLOS on some lists of training data inputs.) Of course, given the broad harvest of web materials in training data there is much that owners/publishers may argue is inappropriate use. Others may also join Reddit in this way, and some may take legal steps to seek to prevent use of their data.
There is growing interest in apps and tools which can tell whether creations have been synthesised by AI. Indeed, OpenAI itself has produced one: "This classifier is available as a free tool to spark discussions on AI literacy." However, they note that it is not always accurate and may easily be tricked. Given concerns in education, it is not surprising to see plagiarism detection company Turnitin provide support here as well.
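Most detection tools rest on statistical signals. One simple (and easily fooled) heuristic is perplexity: text that a reference language model finds unusually predictable is weakly suggestive of machine generation. Here is a minimal sketch of that heuristic using GPT-2 via the transformers library, purely for illustration; it is not how OpenAI's or Turnitin's classifiers work.

```python
# A minimal perplexity-based sketch; lower perplexity under the reference model
# is weak evidence of machine generation. Illustrative only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```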
LLMs are potentially powerful scholarly tools in cultural analytics and related areas, prospecting recorded knowledge for meaningful patterns. However, such work will be influenced -- in unknown ways -- by the composition of training or instruction data. As noted above, a recent study showed that GPT-4's knowledge of books naturally reflects what is popular on the web, which turns out to include popular books out of copyright, science fiction/fantasy, and some other categories.
The LLM model has at its core questions about what it is reasonable to ‘generate’ from other materials. This raises legal and ethical questions about credit, intellectual property, fair use, and about the creative process itself. A recent case involving AI-generated material purporting to be Drake was widely noted. What is the status of AI-generated music in response to the prompt: ‘play me music in the style of Big Thief’? (Looking back to Napster and other industry trends, Rick Beato, music producer and YouTuber, argues that labels and distributors will find ways to work with AI-generated music for economic reasons.) A group of creators have published an open letter expressing concern about the generation of art work: "Generative AI art is vampirical, feasting on past generations of artwork even as it sucks the lifeblood from living artists." The Getty case against Stability AI is symptomatic:
Getty Images claims Stability AI ‘unlawfully’ scraped millions of images from its site. It’s a significant escalation in the developing legal battles between generative AI firms and content creators. // The Verge
Questions about copyright are connected with many of these issues. Pam Samuelson summarises issues from a US perspective in the presentation linked to below. From the description:
The urgent questions today focus on whether ingesting in-copyright works as training data is copyright infringement and whether the outputs of AI programs are infringing derivative works of the ingested images. Four recent lawsuits, one involving GitHub’s Copilot and three involving Stable Diffusion, will address these issues.
She suggests that it will take several years for some of these questions to be resolved through the courts. In her opinion, ingesting in-copyright works for the purposes of training will be seen as fair use. She also covers recent cases in which the US Copyright Office ruled that synthesized art works were not covered by copyright, as copyright protection requires some human authorship. She describes the recent case of Zarya of the Dawn, a graphic work where the text was humanly authored but the images were created using Midjourney. The Art Newspaper reports: "The Copyright Office granted copyright to the book as a whole but not to the individual images in the book, claiming that these images were not sufficiently produced by the artist."
Pamela Samuelson, Richard M. Sherman Distinguished Professor of Law, UC Berkeley, discusses several of the recent suits around content and generative AI.
The US Copyright Office has launched a micro-site to record developments as it explores issues around AI and copyright. And this is in one country. Copyright and broader regulatory contexts will vary by jurisdiction which will add complexity.
Relaxed ethical and safety frameworks
There was one very topical concern in the general warnings that Geoff Hinton sounded: that the release of ChatGPT, and Microsoft's involvement, had resulted in increased urgency within Google which might cause a less responsible approach. This echoes a general concern that the desire to gain first mover advantage – or not to be left behind – is causing Microsoft, Google and others to dangerously relax guidelines in the race to release products. There have been some high profile arguments within each company, and The New York Times recently documented internal and external concerns.
The surprising success of ChatGPT has led to a willingness at Microsoft and Google to take greater risks with their ethical guidelines set up over the years to ensure their technology does not cause societal problems, according to 15 current and former employees and internal documents from the companies. // NYT
The article notes how Google "pushed out" Timnit Gebru and Margaret Mitchell over the Stochastic Parrots paper, and discusses the subsequent treatment of ethics and oversight activities at both Google and Microsoft as they rushed to release products.
We can see a dynamic familiar from the web emerging. There is a natural concentration around foundation models and the large infrastructure required to build and manage them. Think of how Amazon, Google and Facebook emerged as dominant players in the web. Current leaders in LLM infrastructure want to repeat that dominance. At the same time, there is an explosion of work around tools, apps, plugins, and smaller LLMs, aiming to diffuse capacity throughout the environment.
Despite this broad activity, there is a concern that LLM infrastructure will be concentrated in a few hands, which gives large players economic advantage, and little incentive to explain the internal workings of models, the data used, and so on. There are also concerns about the way in which this concentration may give a small number of players influence over the way in which we see and understand ideas, issues and identities, given the potential role of generative AI in communications, the media, and content creation.
Environmental impact of large scale computing
LLMs consume large amounts of compute power, especially during training. They contribute to the general concern about the environmental impact of large scale computing, which was highlighted during peak Blockchain discussions.
Big models emit big carbon emissions numbers – through large numbers of parameters in the models, power usage effectiveness of data centers, and even grid efficiency. The heaviest carbon emitter by far was GPT-3, but even the relatively more efficient BLOOM took 433 MWh of power to train, which would be enough to power the average American home for 41 years. // HAI, 2023 State of AI in 14 Charts
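As a rough sanity check on that comparison, assuming average US household consumption of roughly 10.6 MWh per year (my assumption, not a figure from the report):

```python
# Back-of-the-envelope check of the "41 years" comparison quoted above.
bloom_training_mwh = 433          # from the HAI chart quoted above
household_mwh_per_year = 10.6     # assumed average US household consumption
print(round(bloom_training_mwh / household_mwh_per_year))  # ~41
```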
Sasha Luccioni, a researcher with the interesting title of ‘climate lead’ at Hugging Face, discusses these environmental impacts in a general account of LLM social issues.
Hidden labor and exploitation
Luccioni also notes the potentially harmful effects of participation in the RLHF process discussed above, as workers have to read and flag large volumes of harmful materials deemed not suitable for reuse. The Jisc post below discusses this hidden labor in a little more detail, as large companies, and their contractors, hire people to do data labelling, transcription, object identification in images, and other tasks. They note that this work is poorly recompensed and that the workers have few rights. They point to this description of work being carried out in refugee camps in Kenya and elsewhere, in poor conditions, and in heavily surveilled settings.
Timnit Gebru and colleagues from the Dair Institute provide some historical context for the emergence of hidden labor in the development of AI, discuss more examples of this hidden labor, and strongly argue "that supporting transnational worker organizing efforts should be a priority in discussions pertaining to AI ethics."
Regulating
There has been a rush of regulatory interest, in part given the explosion of media interest and speculation. Italy temporarily banned ChatGPT pending clarification about compliance with EU regulations. The ban is now lifted but other countries are also investigating. The White House has announced various steps towards regulation, for example, and there is a UK regulatory investigation.
Italy was the first country to make a move. On March 31st, it highlighted four ways it believed OpenAI was breaking GDPR: allowing ChatGPT to provide inaccurate or misleading information, failing to notify users of its data collection practices, failing to meet any of the six possible legal justifications for processing personal data, and failing to adequately prevent children under 13 years old using the service. It ordered OpenAI to immediately stop using personal information collected from Italian citizens in its training data for ChatGPT. // The Verge
Lina Khan, the chair of the US Federal Trade Commission, has compared the emergence of AI with that of Web 2.0. She notes some of the bad outcomes there, including concentration of power and invasive tracking mechanisms. And she warns about potential issues with AI: firms engaging in unfair competition or collusion, ‘turbocharged’ fraud, automated discrimination.
The trajectory of the Web 2.0 era was not inevitable — it was instead shaped by a broad range of policy choices. And we now face another moment of choice. As the use of A.I. becomes more widespread, public officials have a responsibility to ensure this hard-learned history doesn’t repeat itself. // New York Times
Chinese authorities are proposing strong regulation, which goes beyond current capabilities, according to one analysis:
It mandates that models must be “accurate and true,” adhere to a particular worldview, and avoid discriminating by race, faith, and gender. The document also introduces specific constraints about the way these models are built. Addressing these requirements involves tackling open problems in AI like hallucination, alignment, and bias, for which robust solutions do not currently exist. // Freedom to Tinker, Princeton's Center for Information Technology Policy
This chilling comment by the authors underlines the issues around ‘alignment’:
One could imagine a future where different countries implement generative models trained on customized corpora that encode drastically different worldviews and value systems.
Tim O'Reilly argues against premature regulation
While there is general agreement that regulation is desirable, much depends on what it looks like. Regulation is more of a question at this stage than an answer; it has to be designed. One would not want to design regulations in a way that favors the large incumbents, for example, by making it difficult for smaller players to comply or get started.
This point is made by Yannic Kilcher (of Open Assistant) in a video commentary on US Senate hearings on AI oversight. He goes on to argue that openness is the best approach here, so that people can see how the model is working and what data has been used to train it. Stability AI also argue for openness in their submission to the hearings: "Open models and open datasets will help to improve safety through transparency, foster competition, and ensure the United States retains strategic leadership in critical AI capabilities."
Finally, I was interested to read about AI-free spaces or sanctuaries.
To implement AI-free sanctuaries, regulations allowing us to preserve our cognitive and mental harm should be enforced. A starting point would consist in enforcing a new generation of rights – “neurorights” – that would protect our cognitive liberty amid the rapid progress of neurotechnologies. Roberto Andorno and Marcello Ienca hold that the right to mental integrity – already protected by the European Court of Human Rights – should go beyond the cases of mental illness and address unauthorised intrusions, including by AI systems. // Antonio Pele, It’s time for us to talk about creating AI-free spaces, The Conversation
The impact on employment
One of the issues in the screen writers' strike in Hollywood is agreement over the use of AI, which could potentially be used to accelerate script development or in other areas of the work. However, according to The Information: ‘While Hollywood Writers Fret About AI, Visual Effects Workers Welcome It.’ And at the same time, studios are looking at AI across the range of what they do - looking at data about what to make, who to cast, how to distribute; more seamlessly aging actors, dubbing or translating; and so on.
I thought the example interesting as it shows impact in a highly creative field. More generally the impact on work will again be both productive and problematic. The impact on an area like libraries, for example, will be multi-faceted and variable. This is especially so given the deeply relational nature of the work, closely engaged with research and education, publishing, the creative industries and a variety of technology providers.
While it is interesting to consider reports like this one from Goldman Sachs or this research from Ed Felten and colleagues which look at the exposure of particular occupations to AI, it is not entirely clear what to make of them. One finding from the latter report is "that highly-educated, highly-paid, white-collar occupations may be most exposed to generative AI."
Certainly, predictions like the one in the Goldman Sachs report that up to a quarter of jobs might be displaced by AI are driving regulator interest, fueled also by headline-grabbing statements from high profile CEOs (of British Telecom and IBM, for example) forecasting reduced staff as AI moves into back-office and other areas. There will be increased advocacy and action from industry groups, unions, and others.
There are very real human consequences in terms of uncertainty about futures, changing job requirements, the need to learn new skills or to cope with additional demands, or to face job loss. This follows on several stressful years. The cumulative effect of these pressures can be draining, and empathy can be difficult. However, empathy, education and appropriate transparency will be critical in the workplace. This is certainly important in libraries given the exposure to AI in different ways across the range of services.
Major unintended event
LLMs will manage more processes; judgements will be based on chatbot responses; code written by LLMs may be running in critical areas; agents may initiate unwelcome processes. Given gaps in knowledge and susceptibility to error, concerns have been expressed about over-reliance on LLM outputs and activities.
One response is at the regulatory level, but individual organizations will also have to manage risk in new ways, and put in place procedures and processes to mitigate potential missteps. This has led to yet another word becoming more popular: ‘guardrails.’ Guardrails are programmable rules or constraints which guide the behavior of an LLM-based application to reduce undesired outcomes. However, it is not possible to anticipate all the ways in which such precautions would be needed.
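To make the idea concrete, here is a minimal sketch of an output guardrail: a programmable check applied to a model's response before it reaches the user. The rule and the stand-in model call are hypothetical; real guardrail frameworks are far more elaborate.

```python
# A toy output guardrail: block responses matching simple patterns.
# Illustrative only; real frameworks combine many rules, classifiers and policies.
import re

BLOCKED_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g. something shaped like a US SSN

def apply_guardrails(generate, prompt: str) -> str:
    """Call the model, then refuse to return output that trips a rule."""
    output = generate(prompt)
    if any(re.search(p, output) for p in BLOCKED_PATTERNS):
        return "[response withheld by guardrail]"
    return output

# Usage with any callable mapping a prompt to text (a stand-in model here):
print(apply_guardrails(lambda p: "The number is 123-45-6789", "example prompt"))
```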
Beyond this, there are broader concerns about the malicious use of AI where serious harm may be caused - in crime or war settings, for example. So-called ‘dual use’ applications are a concern, where an LLM may be used for both productive and malicious purposes. See for example this discussion of an agent-based architecture "capable of autonomously designing, planning, and executing complex scientific experiments" [pdf]. While the authors claim some impressive results, they caution against harmful use of such approaches, and examine the potential use of molecular machine learning models to produce illicit drugs or chemical weapons. The authors call for the AI companies to work with the scientific community to address these dual use concerns.
Concerns are also voiced by those working within generative AI. As mentioned above, Geoffrey Hinton, a pioneer of neural networks and generative AI, stepped down from his role at Google so that he might comment more freely about what he saw as threats caused by increased commercial competition and potential bad actors.
Hinton's long-term worry is that future AI systems could threaten humanity as they learn unexpected behavior from vast amounts of data. "The idea that this stuff could actually get smarter than people—a few people believed that," he told the Times. "But most people thought it was way off. And I thought it was way off. I thought it was 30 to 50 years or even longer away. Obviously, I no longer think that." // Ars Technica
This received a lot of public attention, and I mentioned some of these concerns in relationship to ‘alignment’ above. Anthropic is the creator of the Claude model and is seen as a major competitor to Google and OpenAI. This is from the Anthropic website:
So far, no one knows how to train very powerful AI systems to be robustly helpful, honest, and harmless. Furthermore, rapid AI progress will be disruptive to society and may trigger competitive races that could lead corporations or nations to deploy untrustworthy AI systems. The results of this could be catastrophic, either because AI systems strategically pursue dangerous goals, or because these systems make more innocent mistakes in high-stakes situations. // Anthropic
Conclusion
As I write, announcements come out by the minute, of new models, new applications, new concerns.
Four capacities seem central (as I think about future library applications):
Language models. There will be an interconnected mix of models, providers and business models. Large foundation models from the big providers will live alongside locally deployable open source models from commercial and research organizations alongside models specialized for particular domains or applications. The transparency, curation and ownership of models (and tuning data) are now live issues in the context of research use, alignment with expectations, overall bias, and responsible civic behavior.
Conversational interfaces. We will get used to conversational interfaces (or maybe ‘ChUI’ - the Chat User Interface). This may result in a partial shift from ‘looking for things’ to ‘asking questions about what one wants to know’ in some cases. In turn this means that organizations may begin to document what they do more fully on their websites, given the textual abilities of discovery, assistant, and other applications.
Agents and workflows. The text interfacing capacities of LLMs will be used more widely to create workflows, interaction with tools and knowledge bases, and so on. LLMs will support orchestration frameworks for tools, data and agents to build workflows and applications that get jobs done.
Embeddings and Vector databases. Vector databases are important for managing embeddings which will drive new discovery and related services. Embedding APIs are available from several organizations.
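As a concrete (if deliberately simplified) sketch of the retrieval pattern behind this last capacity: documents are embedded as vectors, and a query vector is matched against them by cosine similarity. The random vectors below stand in for real embeddings from a model or API, and a real system would use a vector database rather than brute force.

```python
# Brute-force cosine-similarity retrieval over stand-in embeddings.
import numpy as np

documents = ["open access policy", "library discovery service", "carbon footprint of computing"]
rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(len(documents), 8))  # placeholders for real embeddings

def search(query_vector, k=2):
    scores = doc_vectors @ query_vector / (
        np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
    )
    return [(documents[i], float(scores[i])) for i in np.argsort(-scores)[:k]]

print(search(rng.normal(size=8)))
```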
Several areas stand out where there is a strong intersection between social policy, education and advocacy:
Social confidence and trust: alignment. Major work needs to be done on alignment at all levels, to mitigate hallucination, to guard against harmful behaviors, to reduce bias, and also on difficult issues of understanding and directing LLM behaviors. However, there will be ongoing concern that powerful capacities are available to criminal or political bad actors, and that social trust and confidence may be eroded by the inability to distinguish between human and synthesized communications.
Transparency. We don't know what training data, tuning data and other sources are used by the foundation models in general use. This makes them unsuitable for some types of research use; it also raises more general questions about behaviors and responses. We also know that some services capture user prompts for tuning. Tolerances may vary here, as the history of search and social apps has shown, but it can be important to know what one is using. Models with more purposefully curated training and tuning data may become more common. Documentation, evaluation and interpretability are important elements of productive use, and there are interesting intersections here with archival and library practice and skills.
Regulation. Self regulation and appropriate government activities are both important. The example of social media is not a good one, either from the point of view of vendor positioning or the ability to regulate. But such is the public debate that regulation seems inevitable. Hopefully, it will take account of lessons learned from the Web 2.0 era, but also proceed carefully as issues become understood.
Hidden labour and exploitation. We should be aware of the hidden labour involved in building services, and do what we can as users and as buyers to raise awareness, and to make choices.
Generative AI is being deployed and adopted at scale. It will be routine and surprising, productive and problematical, unobtrusive and spectacular, welcome and unwelcome.
I was very struck by the OpenAI comment that training the large language model is more like training a dog than writing software. An obvious rejoinder is that we hope it does not bite the hand that feeds it. However, it seems better to resist the comparison and hope that the large language model evolves into a responsibly managed machine which is used in productive ways.
Feature picture: The feature picture is a piece by A.K.Burns, which I took at an exhibition of her work at the Wexner Center for the Arts, The Ohio State University.
Acknowledgement: I am grateful to Karim Boughida, Sandra Collins and Ian Mulvany for commenting on a draft.
Update: 5/29/23 I added paragraphs to the content and regulation sections.
This is a general background post. I do plan to look at some more specific issues around research content, libraries, and related in future posts.
Lawyers use search algorithms on a daily or even hourly basis, but the way they work often remains mysterious. Users receive pages and pages of results from searches that ostensibly are based on some relevancy standard, seemingly guaranteeing that the most important results are all found. But that may not always be the case. This post explores the mystery of search algorithms from a legal research perspective. It examines what is wrong with algorithms being mysterious, explores our current knowledge of how they work, and makes recommendations for the future.
The Problem—Results Vary Across Algorithms
Upon the user entering a search term, each popular legal search database returns plenty of results. However, even with identical search terms, the result ranking may not be uniform or even close. Although researchers may rely on a particular database for their legal research and believe that the list of results is comprehensive, they may find otherwise upon closer inspection.
Susan Nevelow Mart, Professor Emeritus at Colorado Law, has spent significant time studying the phenomenon of search result divergence. She initially noticed it when she searched the same terms in different databases and received wide variations in results. To test this phenomenon, she ran experiments on a larger scale with fifty different searches and reviewed the top ten results. She focused on the top ten results because she understood that to be the focal point for internet users. Her experiment included six popular legal search databases: Casetext, Fastcase, Google Scholar, Lexis Advance, Ravel, and Westlaw. She assumed that the algorithm engineers behind each database had the same goal: “to translate the words describing legal concepts into relevant documents.”
Her findings show that every algorithm is different, as are its search results. On average, 40% of the top ten cases were unique to each database, meaning that of the top ten cases from each database, roughly four did not appear in any other database. She also found that around 25% of the cases appeared in only two of the databases. Furthermore, when Lexis and Westlaw are compared alone, a striking 72% of cases were unique. Although law students and attorneys likely consider more than the first ten results, practically speaking, this means users of one database might be looking through a dramatically different list of cases than users of a different database, and might end up citing and quoting cases that the others have never seen.
From The Algorithm as a Human Artifact. Results from different databases show a high percentage of cases being unique to that database. The highest percentage of unique cases was in Westlaw, with 43% of cases unique from other databases.
Researchers from the University of Cincinnati employed an improved topic modeling analysis on cases from the Harvard Caselaw Access Project corpus to get at the same problem. Topic modeling is an algorithm that maps the statistical relationships among words. In searching for cases about “antitrust” and “market power,” researchers compared the visualizations based on topic modeling with results from Westlaw or Lexis. Curiously, the Westlaw or Lexis results showed classic cases in the field, while the visualizations showed cases that weren’t considered classics but were nonetheless influential in practitioner circles. There was a lack of overlap between them. This finding reveals that not only do algorithms produce differing results, but they may also miss essential cases that are often used in the real world.
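For readers unfamiliar with the technique, here is a minimal, purely illustrative sketch of topic modeling with scikit-learn; the toy corpus and parameters are mine and bear no relation to the Caselaw Access Project analysis.

```python
# Fit a tiny LDA topic model and print the top words per topic. Illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "market power and antitrust enforcement in concentrated markets",
    "monopoly pricing and consumer welfare under the antitrust laws",
    "rate regulation of public utilities by state commissions",
    "agency rulemaking, rate setting and judicial review",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    print(f"Topic {i}:", [terms[j] for j in topic.argsort()[-5:]])
```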
Our Knowledge—Significantly Lacking
Once we recognize the problem of search results differing based on the search database, new questions arise. How much do we know and understand about these search algorithms? Do we comprehend why algorithms might produce drastically different results? Can we use this difference to our advantage in legal research? Might it cause harm to legal researchers unwittingly directed to different cases depending on the search algorithm they used?
The short answer to how much we know about legal search algorithms is simple: not much. Advanced legal technologies have been described as “an enigma” for most practitioners because they lack the understanding of how these technologies work.
One reason may be the legal researcher user experience. Researchers typically go through three steps in a search. First, researchers generate keywords or a question. Second, researchers type them into a search box. Last, results are shown immediately after clicking the search button. Databases hide the processes and calculations behind the scenes of how these results were selected and ordered. As Professor Mart mentioned, “[f]or the most part, these algorithms are black boxes—you can see the input and the output. What happens in the middle is unknown, and users have no idea how the results are generated.”
Another reason is that search algorithms are complex. Most legal researchers are not algorithm engineers, so they find it unintuitive to understand how search algorithms function. When tackling algorithmic literacy, researchers Dominique Garingan and Alison Jane Pickard considered using existing information, digital, and computer literacy frameworks to find the best structure for understanding algorithms. After considering multiple frameworks and failing to find one that encompasses algorithmic literacy, the authors suggested that algorithmic literacy in the legal field may be considered an extension of all three frameworks. The sheer difficulty of deciding which framework to employ showcases how hard it is for outsiders to learn and understand algorithms.
That is not to say that we have no knowledge of the inner workings of popular legal search databases. We know some of the basic factors that search algorithms focus on. In searching case law documents, Westlaw relies on citations, key numbers, and treatment history, among other factors. Westlaw uses machine learning algorithms trained on legal content that include a diverse set of elements in its ranking. Its algorithm runs through more than sixty queries that “determine alternate terms that may apply to an issue, the legal documents most frequently cited for that issue, and authoritative analytical resources that discuss the issue.” Lexis Advance, in ranking its cases, considers a combination of term frequency and proximity in documents, case name recognition, “landmarkness” of the case, and content type-specific relevance weighting factors. Fastcase’s search uses sixteen different factors to rank search results, including keyword frequency in documents, proximity, citation counts, recency, the aggregate history of the search system, and others.
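As a purely illustrative sketch of how factors like these might be combined, and why different weightings produce different rankings, consider a toy scoring function; the factors and weights below are invented and bear no relation to any vendor's proprietary algorithm.

```python
# Toy relevance scoring combining term frequency, citations and recency.
# Invented weights; not any vendor's actual algorithm.
from dataclasses import dataclass

@dataclass
class Case:
    name: str
    term_frequency: int   # occurrences of the query terms
    citation_count: int   # times the case has been cited
    year: int

def score(case, w_tf=1.0, w_cite=0.05, w_recency=0.5, current_year=2023):
    recency = 1.0 / (1 + current_year - case.year)
    return w_tf * case.term_frequency + w_cite * case.citation_count + w_recency * recency

cases = [
    Case("Case A", term_frequency=12, citation_count=40, year=1998),
    Case("Case B", term_frequency=7, citation_count=350, year=2015),
]
for c in sorted(cases, key=score, reverse=True):
    print(c.name, round(score(c), 2))
```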
Though this information is a helpful start for users to understand search ranking, it doesn’t give researchers a particularly detailed description. Legal research databases themselves have provided little help in ensuring users have a basic understanding of search algorithms. Legal database providers tend to view their algorithms as trade secrets and only offer hints on the inner workings of their algorithms in promotional materials. Despite knowing the factors each algorithm may consider in its searches, it is unclear if there are other factors that the algorithms also consider, and what the weight is of each element in the search results.
Recommendations
The first and most important recommendation for legal researchers and practitioners is not to limit themselves to one search database. While it may be the easiest and most cost-efficient way to search, using only one database may cause the researcher to miss critical cases or fail to explore cases others will use. A 2018 survey of librarians, researchers, and professionals in senior positions found that the majority of respondents relied on more than one information database when conducting searches. Most used Westlaw, Lexis, and Bloomberg Law, with a small minority also using Fastcase. Using multiple databases ensures researchers do not miss important cases that practitioners in that particular field may be well-versed in.
Another recommendation is for law librarians, law schools, and even law firms to engage in further teaching about how to conduct research. Law librarians should continue to act as instructors, experts, knowledge curators, and technology consultants to clarify how search algorithms work, and how legal researchers can offset their known shortcomings.
Beyond the legal field, experts have called for greater algorithmic literacy and transparency more generally in the age of algorithms. Susan Etlinger, an industry analyst at Altimeter Group, stated that we should question how our data are presented and decisions are made just as we may question how our food or clothing are made. What assumptions are built into the algorithms? Were the algorithms sufficiently trained? Were the factors considered appropriate? These are all questions researchers should consider to better understand the algorithms they rely on. Answering these questions is especially important when each algorithm considers different factors and shows different results, even when practitioners in the particular field of study have, for example, a standardized list of cases they believe are the most important.
Fundamentally, legal research professors’ and librarians’ curricula should include information about the role of algorithms in legal research and warnings of the differing results that may come from different databases. They should emphasize that each search database uses a different algorithm so that researchers become aware of discrepancies between them. Algorithms may create the impression that their results are always the most relevant and that researchers need look no further. We know that is not always the case.
William Yao is a Library Innovation Lab research assistant and a student at Harvard Law School.
Sources
Felix B. Chang, Erin McCabe & James Lee, Modeling the Caselaw Access Project: Lessons for Market Power and the Antitrust-Regulation Balance, 22 Nev. L. J. 685 (2022). https://scholars.law.unlv.edu/cgi/viewcontent.cgi?article=1883&context=nlj
Dominique Garingan & Alison Jane Pickard, Artificial Intelligence in Legal Practice: Exploring Theoretical Frameworks for Algorithmic Literacy in the Legal Information Profession, 21(2) Legal Information Management 97–117 (2021). https://doi.org/10.1017/S1472669621000190
Annalee Hickman, How to Teach Algorithms to Legal Research Students, 28 Persp. 73 (2020). https://legal.thomsonreuters.com/content/dam/ewp-m/documents/legal/en/pdf/other/perspectives/2020/fall/2020-fall-article-6.pdf
Susan Nevelow Mart, Every Algorithm Has a POV, AALL Spectrum, Sept.-Oct. 2017, at 40, available at http://scholar.law.colorado.edu/articles/723/.
Susan Nevelow Mart, Joe Breda, Ed Walters, Tito Sierra & Khalid Al-Kofahi, Inside the Black Box of Search Algorithms, AALL Spectrum, Nov.-Dec. 2019, at 10, available at https://scholar.law.colorado.edu/articles/1238/.
Susan Nevelow Mart, Results May Vary, A.B.A. J., Mar. 2018, at 48, available at http://scholar.law.colorado.edu/articles/964/.
Susan Nevelow Mart, The Algorithm as a Human Artifact: Implications for Legal [Re]Search, 109 Law Libr. J. 387 (2017), available at https://scholar.law.colorado.edu/faculty-articles/755.
Lee Rainie & Janna Anderson, Code-Dependent: Pros and Cons of the Algorithm Age, https://www.pewresearch.org/internet/2017/02/08/code-dependent-pros-and-cons-of-the-algorithm-age/.
Robyn Rebollo, Search Algorithms In Legal Search, https://lac-group.com/blog/search-algorithms-legal-research/.
Michael Lewis is writing an eagerly awaited book on Sam Bankman-Fried and the collapse of FTX, and on his podcast series he has been interviewing sources he used in his research. The latest one is an entertaining interview with Molly White. The best part is their discussion of how the story justifying cryptocurrencies keeps changing.
What is needed is an electronic payment system based on cryptographic proof instead of trust, allowing any two willing parties to transact directly with each other without the need for a trusted third party.
routine escrow mechanisms could easily be implemented to protect buyers
seemingly without noticing that "routine escrow mechanisms" had to be trusted third parties.
Even if we ignore the fact that virtual mattress-to-mattress transactions are unsafe, a fundamental problem remains. The thing about money is that more is always better and less is always worse. In Nakamoto's world there are only two ways to increase the size of the stash under your mattress: you either mine new Bitcoin, which is a hassle, or you buy low and sell high, which is even more of a hassle.
What Nakamoto's world needed was a way for the many people who don't want the hassles to lend to the few people who do want the hassle, and receive interest for doing so. It needed banks, which had to be trusted to pay the interest and pay back the loan.
One possibility is that fractional reserve banking is deeply rooted in human nature. People have money, they would like to keep it somewhere safe, they would like it to grow, they would like to be able to get it back at any time. Other people need money, they are willing to pay to borrow it, but they want it for a long time — they want to be able to use it to buy a house or build a business; they don’t want their lender to be able to demand the money back at any time. The savers and the borrowers just want different things. Why would anyone want to lock their money up for a long time? Why would anyone want to borrow money for an uncertain time?
Full-reserve banking (also known as 100% reserve banking, or sovereign money system) is a system of banking where banks do not lend demand deposits and instead, only lend from time deposits. It differs from fractional-reserve banking, in which banks may lend funds on deposit, while fully reserved banks would be required to keep the full amount of each customer's demand deposits in cash, available for immediate withdrawal.
In practice, the "cash" would be in an account at the central bank, but central banks hate the idea, as Alex Harris reports:
The central bank has raised several concerns about narrow banks. The main one is that in times of stress they’d be too attractive as a haven. Money could pour out of Treasury bills, high-quality bonds or even accounts at conventional banks, amplifying risks to the broader financial system. Narrow banks could also make it harder for the central bank to manage short-term interest rates. And because conventional banks could end up holding few deposits, they might do less lending, making loans more expensive and credit harder to get.
The alternative is fractional-reserve banking, in which banks keep enough cash on hand to cover their expected worst-case withdrawals during the time they need to sell their longer-term, higher-interest assets. Levine explains the idea:
If you are an enterprising middleman, you can try to convince one side or the other to do something that doesn’t quite match their desires — “invest your money long-term, but if you need it back we can probably find a way to get it to you,” or “borrow short-term, but you can probably keep rolling your debt for a long time” — but it is easier and more appealing to just promise everyone exactly what they want. People who want to park their money give it to you and you promise to give it back whenever they want; people who want to borrow money borrow it from you and you tell them they can keep it for a long time; probably this all works out,
Except that sometimes, as with Silicon Valley Bank, it doesn't, because if the "people" who lent you the money are (a) very rich and (b) very online, you will have underestimated the rate of withdrawals once confidence in your promises erodes. Bank runs these days are extremely fast. Banks are regulated to force them to keep a lot more cash than they would like in order to reduce this risk.
People in crypto did not trust the banks, in part for the good reason that the banks were doing something (maturity transformation) that is both risky and in some deep sense deceptive. But people in crypto did want the benefits of maturity transformation: People with crypto wanted to park it somewhere safe, earn interest and have access to it whenever they wanted; other people wanted to borrow crypto without the risk of having to give it back early. Crypto shadow banks — Celsius, Voyager, BlockFi, Genesis, Gemini Earn, FTX — sprung up to offer that service, to borrow short and lend long. Free of most regulation, they could offer the service efficiently, market it aggressively, and lose tons of their customers’ money.
First, despite mass marketing campaigns to the contrary, crypto lending platforms recreated banking all over again. Crypto lending platforms were vulnerable to runs because, like all banks, they borrowed short and lent long. This is the essence of banking, so we label these lending platforms “crypto banks.” Second, crypto space was largely circular. Once crypto banks obtained deposits and investments, these firms borrowed, lent, and traded mostly with themselves.
The next generation of crypto firms are linking up with the financial sector, which means their failures will spill over into the real economy. To contain the inevitable growth of systemic risk, regulators should use banking laws to address a banking problem.
in the US, efforts at crypto regulation are largely about either securities law or commodities regulation, but the actual problem of 2022 was banking. Securities regulation is about giving investors full information so they can make informed decisions about what companies to invest in. Bank regulation is the opposite; it assumes that bank depositors should not have to care much about what their banks are up to; bankers and regulators and supervisors worry about a bank’s asset quality and liquidity and capital ratios so that depositors can be information-insensitive.
The essential part of making depositors "information-insensitive" is deposit insurance, which is feasible only if banks are strictly regulated. Applying banking regulation to crypto banks would definitely impose "regulatory clarity", but in a way that would render crypto banks unprofitable. They would be unable to rip off their customers or speculate in coins proceeding moonwards:
Whereas traditional banking borrows short to lend long to people who want to build houses or start businesses, crypto shadow banking borrows crypto short to lend crypto long to people who want to speculate on cryptocurrencies. In some sense you’d expect this to create less of a maturity mismatch — how long does anyone really need to borrow for to day-trade cryptocurrencies? — but on the other hand the collateral is much riskier. If you borrow to buy a house and default, the bank gets the house. If you borrow to buy some magic beans and default, the bank gets some beans.
And in A Retrospective on the Crypto Runs of 2022, Radhika Patel and Jonathan Rose of the Chicago Fed point out that runs on crypto banks happen even faster than Silicon Valley Bank's:
customers withdrew a quarter of their investments from the platform FTX in just one day
...
FTX itself reported outflows of 37% of customer funds, almost all of which were withdrawn in just two days
The owners of large-sized accounts, with over $500,000 in investments, were the fastest to withdraw and withdrew proportionately more of their funding. In fact, during this run, 35% of all withdrawals at Celsius were by owners of accounts with more than $1 million in investments, according to our estimates.
All this focuses on trading platforms like FTX or Celsius. But there is another type of crypto bank, stablecoins. I wrote about the risk of runs against algorithmic stablecoins revealed by the Terra/Luna collapse in Metastablecoins, pointing out that the arbitrageurs who were supposed to keep UST trading at its peg had limited firepower that was overwhelmed in a run. The next week, after USDT had traded nearly 5% under its peg, I wrote More Metastablecoins, based on research from Barclays quoted by Bryce Elder in Barclays to tether: the test is yet to come. Elder described USDT's defenses against a run:
Tether’s closed-shop redemption mechanism means it cannot be viewed like a money-market fund. Processing delays can happen without explanation, there’s a 0.1 per cent conversion fee, and the facility is only available to verified customers cashing out at least $100,000.
And Barclays described how, even if USDT were backed 1-for-1 by dollars in FDIC-insured bank accounts (which it isn't) these defenses wouldn't actually prevent a run:
The only way to get immediate access to fiat is to sell the token on an exchange, regardless of the size of holding . . . [W]hile redemption is ‘guaranteed’ at par, the secondary market price of tether can trade lower, depending on the willingness of holders to accept a haircut in return for access to immediate liquidity. As last week’s price action suggests, some investors were willing to accept a nearly 5 per cent discount to liquidate their USDT holdings immediately.
...
We think that willingness to absorb losses, even though USDT is fully collateralized and has an overnight liquidity buffer that exceeds most prime funds, suggests the token might be prone to pre-emptive runs. Holders with immediate liquidity demands have an incentive (or first-mover advantage) to rush to sell in the secondary market before the supply of tokens from other liquidity-seekers picks up. The fear that USDT might not be able to maintain the peg may drive runs regardless of its actual capacity to support redemptions based on the liquidity of its collateral.
Tether Holdings Ltd., the operator of the largest stablecoin, will invest as much as 15% of profits on a regular basis in Bitcoin as part of a strategy to diversify its reserves.
...
Tether held around $1.5 billion of Bitcoin as part of the reserves backing its tokens at the end of March, according to a third-party attestation of its holdings.
...
The company said on Wednesday that it does not expect the value of its current and future Bitcoin holdings to exceed its shareholder capital cushion, referring to excess capital held by Tether to protect against heavy losses.
That figure now stands at more than $2.5 billion, Tether’s Chief Technology Officer Paolo Ardoino said on Twitter.
USD Coin, the second biggest stablecoin by market cap, received a government rescue in March—proving it really can compete with banks.
...
Pursuantly, over the course of three days in March, the backing assets of the “fully reserved” USDC became a portfolio enviable of a distressed credit investor. And, by extension, so did USDC itself. USDC started to wobble under the weight of the above disclosure (transparency!), and fell more sharply when Circle disclosed it in fact had $3.3 billion stuck at SVB despite attempts to withdraw
...
USDC traded at less than 90 cents on the dollar that weekend — until the government announced it would stand behind the uninsured deposits of the failed banks:
...
The “we don’t lend reserves” refrain was always nonsense, and now USDC has faced a 48-hour drill making that abundantly clear. To be truly “fully reserved” is to have all the reserves at the central bank.
Saying anything less is “fully reserved” is egregiously misleading. Uninsured dollars in banks—which USDC likely needs at least some of (and, in any case, had a lot of) as the on- and off-ramps to the blockchain—are loans to those banks. Circle is issuing demand liabilities and making risky loans; it’s a bank.
If your concern is that crypto shadow banks are becoming more interconnected with the real economy, and that therefore future runs on those shadow banks might be more destructive, there are two ways to go:
Protect crypto shadow banks from runs, with deposit insurance and regulation; or
Protect the real economy from crypto shadow banks, by making it really hard for the traditional financial system to connect with crypto firms.
US regulators seem to be choosing Option 2, which … seems … right … to me?
On 10th March 2023, Youth Open Data organised a workshop to mark Open Data Day. The workshop brought together 40 young leaders and representatives of civil society organisations in Ouagadougou, Burkina Faso. The general objective of the workshop was to build shared understanding, commitment, and action at the national level on using Artificial Intelligence (AI) and open data for the monitoring of humanitarian funds.
The specific objectives of the workshop were:
Strengthen participants’ knowledge of AI
Strengthen participants’ knowledge of Open Data
Discover Open Data and AI platforms for humanitarian aid tracking
Discuss national strategies for the popularisation of data for the benefit of communities
Reflect on a strategy for setting up a collaboration framework with the CSOs present, to develop solutions based on Open Data and AI
The executive director of Youth Open Data started the workshop by welcoming the participants and stressing the importance of the session. He invited each participant to take an active part in a fruitful exchange and expressed sincere thanks to the partners and representatives of institutions for their presence.
Mr. BAZONGO AMON, an expert in AI, started the session with his presentation on Artificial Intelligence (AI). He illustrated the field with example tasks, noting that some AI tasks are very simple for humans, such as recognising and locating objects in an image, planning a robot’s movements to catch an object, or driving a car, while the most complicated tasks require a lot of knowledge and common sense, for example translating a text or conducting a dialogue. He stressed the importance of AI in improving the performance and productivity of a company by automating processes or tasks that previously required human resources.
In his presentation focusing on the use of AI in the humanitarian field, he mentioned three aspects that fit with humanitarian movements, namely:
The preparation for humanitarian action
The response to humanitarian action
The recovery after humanitarian action
Some important questions were also highlighted during the discussion on AI, Open Data, and Humanitarian fund monitoring;
In which dimension are the humanitarian funds: in the preparation and response?
What are the issues related to humanitarian funds: transparency, reliability, use, management of the granted fund, effectiveness and efficiency, identification of the real need, tracking of funds, accountability, and reporting?
What data is used to verify transparency in humanitarian action: RELAC data, and data from the African Aid Transparency Initiative?
After the session on Artificial Intelligence, Mr. Malick LINGANI, an expert in Open Data, started his presentation on Open Data. He spoke with participants about what open data is, its importance, and how to promote and popularise it. He pointed to a large number of areas where open government data creates value, and there are probably more. Some of these areas are:
Transparency and control of democracy
Participation
Self-empowerment
New and improved private products and services
Innovation
Improved efficiency of public services
Measuring the impact of public policies
This untapped potential can be revealed by turning government data into open data. However, in order to reveal it, it must be truly open, meaning that there are no restrictions (legal, financial, or technological) on its reuse by the public. To conclude his presentation, he presented platforms for the promotion of data, such as OPEN GOV, AIDA, and the World Bank website.
The workshop went well and its overall objectives were achieved. Following the exchanges, the participants expressed their satisfaction at having attended the workshop, but also made the following recommendations:
Organise youth meetings at the national level to talk more about AI and Open Data.
Set up a grouping of CSOs to work on advocacy actions around access to public data, and pool efforts to set up a single digital platform to centralise data produced by CSOs in order to further promote Open Data.
To celebrate Open Data Day 2023, the Nepal Institute of Research and Communications (NIRC) implemented the ‘Enhancing Nepali Youth’s Awareness on Climate Change through AI Technology (ENACT)’ project to train Nepalese youth on climate change.
The project was implemented to:
Sensitise the local youth/early career researchers about emerging climate change issues and the potential effects of climate-induced disasters in their communities.
Capacitate the local youth/early career researchers on using open data and Artificial Intelligence (AI) to address climate change issues.
Advocate with local government representatives to promote and allocate resources for initiating the use of open data and AI technological solutions to address climate change issues.
The project was implemented in 3 rural municipalities of Saptari district, Madhesh province in Nepal, namely Mahadeva, Tirhut, and Chinnamasta. A total of 32 youth/early career researchers participated in the workshop series of this project. Likewise, the local representatives of the rural municipalities, especially the leadership (Chairpersons and Chief Administrative Officers), actively participated in the panel discussion and shared their insights on the present status and priorities around data/information and climate change.
Activities Organised
Issue Identification Workshop
A half-day workshop was organised on March 8, 2023, to brainstorm and identify the local climate change issues that require attention.
Capacity Building Workshop
A full-day capacity building workshop was organised on March 12, 2023, in Rajbiraj City. The workshop included a series of activities such as sessions on climate change, open data, AI & its use to address climate change, and group brainstorming on possible solutions.
During the session on climate change, the participants were introduced to the concept of climate change, the differences between weather and climate, how climate change has resulted in climate-induced disasters, climate change scenarios in Nepal, vulnerabilities of communities living in hazard and risk-prone areas, and local adaptation and coping mechanisms.
Policy Advocacy Workshop
Following the full-day capacity building workshop on March 9, a half-day policy advocacy workshop was organised on March 10, 2023, among the local representatives of Mahadeva, Chinnamasta, and Tirhut Rural Municipalities to gather their perspectives on using open data and AI to address climate change related issues, as well as to advocate with the local government leadership for ownership and prioritisation of evidence-induced development.
A panel discussion and presentation were made by the group participants, summarising the group work and lessons drawn during the capacity building workshop held on March 9. Synopsis of the perspectives from the panel discussion:
The poor quality of the education system in rural municipalities is affecting learning and development among youth.
Rural municipalities lack the technical expertise and human resources for regular data/information management and updating.
No IT officer has been recruited to date in the rural municipalities due to political instability.
Data collection is done on an ad hoc basis and does not follow scientific procedures.
During the closing ceremony, certificates of participation were also provided to the participants and souvenirs were provided to the local government representatives.
Nepal Institute of Research and Communications (NIRC) received a small grant to host Open Data Day(s) (ODD) events and activities.
Prompt:
“Write a single sentence summarizing why cryptographically-signed provenance information matters in the context of AI-generated content.”
ChatGPT:
“Cryptographically-signed provenance information matters in the context of AI-generated content to establish trust, transparency, and authenticity by providing a verifiable record of the content’s origin, authorship, and creation process.”
That’s a great point, but did ChatGPT really say that, or did I make that up to trick you into reading this article? I could show you a screenshot of that exchange to try to convince you …
… but what good would it do, when it’s so easy to produce convincing fakes?
(This screenshot has been edited)
In this article, I’ll explain how and why AI vendors such as OpenAI should adopt techniques that make it possible to build strong associations between AI-generated content and its provenance information, with a particular focus on text — or “GPT-born content” — which presents unique challenges.
While recent advances in the standardization of cryptographically-signed provenance techniques help make this possible, the accelerated pace at which generative AI-based products are being put in front of consumers makes it necessary, as one of the many steps the industry could take to help prevent and reduce harm.
(This screenshot has been edited)
Papers, please
Maybe it is because I am an immigrant in the United States, and therefore am used to having to repeatedly prove who I am, where I come from, and why I am here, but I find similarities between the methods government agencies employ to figure out whether I belong here and the problem at hand.
Every time I travel back to the United States, I need to have my passport with me, which shows where I come from, but also my green card, which shows why I am here. If I were to lose these documents, my biometrics, collected earlier on, could be used to identify me and make an association between myself and my immigration status and history.
Generative AI is crossing a yet-to-be-defined border between statistics and human creativity, and while this is something we should welcome, we can also ask that it identify itself when it does so. Cryptographically-signed provenance information, embedded in a file or archived on the server side, could help achieve that goal. After all, why would we make humans jump through so many - and sometimes unjustified - hoops, but simply trust AI output?
Enter C2PA
The “Coalition for Content Provenance and Authenticity”, or C2PA, is the result of an alliance between the software industry, newsrooms, and non-profit organizations to design and implement technical standards to combat disinformation online. C2PA is also the name of the specification the coalition put together, which allows for embedding cryptographically-signed provenance information into media files. The Content Authenticity Initiative — the Adobe-led arm of C2PA — has developed and released open-source tools to allow the public to develop applications making use of this emerging standard.
That concept is powerful in that it allows an image, a video or audio file to tell us reliably where it came from, when it was created and how. All that information is signed using X.509 certificates (a standard type of certificate used to secure the web or sign emails), ensuring that provenance information has not been altered since it was signed, but also telling us “who” signed it.
That signed provenance information — a manifest containing one or multiple claims listing assertions — is embedded in the file itself and doesn’t hinder its readability: it is, in essence, verifiable metadata, which tools implementing the C2PA specification can read, interpret and validate.
This is the case for CAI’s “Verify” tool, which helps visualize C2PA data embedded into an image file, or even re-associate provenance information with a file from which that data was stripped, by comparing it against their database.
Screenshot of CAI’s “Verify” tool, showing provenance information embedded in fake-news.jpg.
A first application of that concept to generative AI came with Adobe and Stability AI’s joint announcement that they were going to generate and sign “Content Credentials” using C2PA in Adobe Firefly and Stable Diffusion, with the idea that these manifests would be both embedded in the resulting images and preserved on their servers for record keeping and later re-association.
But how would that work for AI-generated text?
C2patool, the leading open-source solution for working with C2PA, supports various image, video and audio formats. It also allows for signing in a “sidecar” (an external file), but doesn’t yet come with a built-in solution for text-based content.
Finding a suitable file format would be the first and probably main hurdle to overcome in order for large language models (LLMs) to label their output. PDF may be a good fit, and provisions were recently added to the C2PA specification to delineate how that integration could work. As of this writing, it appears that existing tools in the C2PA ecosystem do not directly support this integration. XML might be a good lead as well, given that c2patool already supports SVG images, which are XML-based.
That hurdle aside, the implementation would — in principle — be similar to what Adobe Firefly and the like seem to have chosen: creating and signing provenance information at the time of generating the output, serving that information alongside the generated content, and keeping a copy of it on the server side.
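To make that flow concrete, here is a minimal Python sketch under some explicit assumptions: it signs an ad-hoc JSON provenance record with an Ed25519 key from the widely used cryptography library, whereas a real deployment would emit a C2PA manifest signed with an X.509 certificate chain, and every field name below is invented for illustration rather than taken from the specification.

```python
import hashlib
import json
from datetime import datetime, timezone

from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Hypothetical vendor signing key; in practice this would be backed by an
# X.509 certificate chain so verifiers can also tell *who* signed the record.
signing_key = Ed25519PrivateKey.generate()
verify_key = signing_key.public_key()

def sign_provenance(output_text: str, model: str, prompt: str) -> dict:
    """Create and sign an ad-hoc provenance record for one generation."""
    record = {
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "output_sha256": hashlib.sha256(output_text.encode()).hexdigest(),
        "created": datetime.now(timezone.utc).isoformat(),
    }
    payload = json.dumps(record, sort_keys=True).encode()
    return {"record": record, "signature": signing_key.sign(payload).hex()}

def verify_provenance(output_text: str, signed: dict) -> bool:
    """Check the signature, then check the record matches this exact output."""
    payload = json.dumps(signed["record"], sort_keys=True).encode()
    verify_key.verify(bytes.fromhex(signed["signature"]), payload)  # raises if tampered
    return signed["record"]["output_sha256"] == hashlib.sha256(output_text.encode()).hexdigest()

signed = sign_provenance("An AI-generated answer...", "example-model", "Why does provenance matter?")
print(verify_provenance("An AI-generated answer...", signed))  # True
print(verify_provenance("A tampered answer", signed))          # False
```

The same record, stripped of anything privacy-sensitive, is also what a vendor could retain on the server side for later re-association.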
The end-user would be presented with options to download a copy of this output, which would come with embedded provenance information.
An example of what a “Download” button would look like on the ChatGPT user interface.
The provenance information contained in the resulting file could answer questions about the authenticity of that content — since it would be signed by OpenAI — but also about the context surrounding its creation: What model was used to generate it? When? What prompt was the model given? Was this what the LLM returned, or did the original response trigger a safeguard mechanism? Did the chatbot interact with external plugins to generate this response? Is this an isolated exchange, or part of a longer discussion? All potentially crucial elements that require careful consideration and policy decisions, as they may preserve and reveal too much about what users submitted.
Applications integrating APIs such as the one provided by OpenAI would directly benefit from access to this verifiable contextual information. However, they would also have an opportunity to add to it and inform consumers about the transformations they’ve operated: a key component of the C2PA standard is that it supports successive claims, allowing for building provenance trees.
Sharing is caring
We’ve seen that, beyond sharing signatures with users directly, generative AI vendors could keep internal records of provenance information for the outputs they produce. A restricted version of this metadata could be sent to a “hashes common”, to which vendors would participate by sending the fingerprint and creation date of the contents they generated. This shared database would allow the public to check whether a given piece of content has been generated by a participating AI vendor, but also potentially help AI practitioners exclude AI-generated content from training datasets.
This would not be exclusive to one particular type of content (text or images), but would be limited by the extent to which fuzzy matching techniques can make reliable associations between slightly altered content and original hashes. The “hashes common” is a larger subject that deserves its own case study to explore issues like scalability, privacy, metadata, and hashing algorithms.
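As a toy illustration of that idea, and only under the assumption of exact-match fingerprints (a real service would need the fuzzy matching discussed above, plus an agreed schema that does not yet exist), a participating vendor could register and look up hashes roughly like this:

```python
import hashlib
from datetime import datetime, timezone

# In-memory stand-in for the shared "hashes common"; a real service would be
# an append-only store operated jointly by participating vendors.
hashes_common = {}

def fingerprint(content: str) -> str:
    # Exact-match fingerprint; slightly altered copies would evade it, which
    # is why fuzzy or perceptual hashing is one of the open questions above.
    return hashlib.sha256(content.strip().encode("utf-8")).hexdigest()

def register(vendor: str, content: str) -> None:
    hashes_common[fingerprint(content)] = {
        "vendor": vendor,
        "created": datetime.now(timezone.utc).isoformat(),
    }

def lookup(content: str):
    return hashes_common.get(fingerprint(content))

register("ExampleAI", "Some generated paragraph.")
print(lookup("Some generated paragraph."))   # {'vendor': 'ExampleAI', ...}
print(lookup("A human-written paragraph."))  # None
```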
The last word
The technology may be here, or close to being here, and there seems to be momentum in the industry to adopt some of the techniques I briefly described in this post. This may partly be due to growing concerns around plagiarism and copyright infringement, or because the generative AI boom, coinciding with the 2024 presidential election cycle, gives rise to fears that these technologies may augment the generation of false information in that context.
In any case, I think we all know what I shouldn’t do: ask ChatGPT what it “thinks” in order to close my argument.
Prompt:
“Write a single sentence explaining why it may be dangerous to continue to deploy generative AI technologies without offering reliable ways to trace back the contents they generate to their origin.”
ChatGPT:
“Without reliable methods to trace back the generated contents to their origin, the deployment of generative AI technologies poses a significant risk as it allows for potential malicious use, misinformation, and unaccountability, which can have far-reaching consequences for individuals, organizations, and society at large.”
… Did it really say that, or did I make it up?
2023-05-24 update: Edited section about PDF to include existing C2PA specification provisions on PDF support.
This is the second in a series of blog posts exploring the vast terrain of ‘Open’ as it relates to libraries.
The first post in this series provided a broad overview of Open contexts in which libraries operate, signposting some key areas of library interest. Metaphorically, the Open ecosystem was compared to a constellation – a pattern that’s recognizable when you know what to look for, may look somewhat different depending on your local perspective, and emerges slowly from clouds of gas and dust. Out of an opaque and seemingly chaotic space, something luminous takes shape.
Photo credit: Steven Miller “Hikers on the knoll. Jenner Preserve, Sonoma Land Trust.” via Flickr (CC BY-NC-ND 2.0)
This second installment is closer to earth. It describes the role OCLC plays in the Open ecosystem, focusing on Open Access, a context in which libraries have taken on an especially prominent role. Other Open contexts – Open and reproducible science/scholarship, Open Education, Open source software, Open scholarly infrastructures – are no less important, but a pattern of shared library service expectations has yet to emerge. This is an important consideration for an organization like OCLC, which maintains data and technology platforms for tens of thousands of libraries. We prioritize investments that deliver maximum benefit across our membership, reinforce and extend the strength of shared infrastructure, and sustain our community-building initiatives. Of the various Open contexts in which libraries currently operate, Open Access is the area where OCLC delivers the greatest immediate benefits today.
Supporting library participation in the Open Access ecosystem
OCLC delivers a range of solutions designed to support library participation in Open Access (OA), from facilitating the discovery of OA content to streamlining access and management of Open and ‘controlled’ collections under a variety of open licensing models. Many of these innovations were introduced a decade ago or more when the OA movement was still new. Others reflect more recent adaptations, and more will follow as libraries respond to changes in the scholarly communications environment, shifting policy mandates, and evolving institutional norms.
Examples include:
Enhancing the discoverability of Open Access content. For more than a decade, OCLC has provided a metadata harvesting service (the WorldCat Digital Collection Gateway) that enables libraries to register institutional repository content in WorldCat, supporting global library visibility of university research outputs. This service is used by thousands of libraries worldwide, resulting in the registration of more than 60 million items – a doubling of the collection size since OAISter’s transfer from the University of Michigan to OCLC in 2009. This massive collection of library-sourced content is integrated into the freely searchable WorldCat.org website and can also be searched as a separate collection. OCLC also partners with publishers and aggregators to bring metadata for Open Access content into OCLC’s data network so it can be made visible in WorldCat.org and library discovery environments like FirstSearch and WorldCat Discovery. Major OA metadata aggregators, including BioMed Central, the Directory of Open Access Journals (DOAJ), the Directory of Open Access Books (DOAB), JSTOR Open Access Books and Journals, and many others participate in this program. As a result of these partnerships, metadata for more than 100 million OA items is available in OCLC’s data network. To facilitate end-user discovery of these resources, OCLC provides a one-click OA filter in WorldCat Discovery and a separately searchable OA collection in FirstSearch. This reduces “noise” in the discovery experience, enabling library patrons to limit their searches to OA content.
Streamlining metadata management for Open Access content. As the scope and variety of OA collections continue to grow, the need for process automation increases. OCLC delivers efficient metadata management for e-content – including OA content – by automating holdings maintenance. This ensures that library holdings and record supply stay up to date as the scope and coverage of collections evolve. It also reduces the time libraries spend monitoring and updating metadata and ensures patron discovery and access keep pace with changes to the library collection. OCLC Collection Manager, part of the WorldShare platform, consolidates metadata management for print, licensed and OA collections in a single application. Using Collection Manager, libraries can activate more than 800 collections of Open Access content to create a customized discovery experience that meets local needs and interests. Leveraging the benefits of library collaboration, OCLC enables libraries to build and share collections of OA titles, so that the benefits of expert selection can be widely shared. Libraries can create custom OA collections to highlight the scholarly work of local faculty and researchers, and enable other libraries to add these resources to their local discovery environments. Thousands of libraries are using this tool to manage and share OA collections.
Improving license management workflows for Open Access content. The proliferation of Open Access business models has complicated license management for libraries. In addition to tracking the terms under which e-content may be accessed at a collection level, libraries may need to track entitlements for individual titles in a package or even individual articles for which Article Processing Charges (APC) have been paid. The complexity of these arrangements mirrors the complexity of the business models that support ‘free to read’ and ‘free to publish’ agreements between libraries and publishers. From an operational standpoint, the costs of ‘free’ may be considerable – beyond paying subscriptions and tracking entitlements, the library may need to assume burdensome new workflows to document compliance with OA mandates or monitor usage of OA content. The License Manager module in OCLC’s WorldShare platform accommodates evolving workflows, enabling libraries to define custom terms (for example, access rules for hybrid OA journals, or APC counts) in addition to configuring more than 30 standard terms. Libraries can also build license templates to create model OA agreements, to harmonize licensing terms when negotiating subscriptions and renewals. As libraries step up to new roles managing complex Open Access license agreements, infrastructure like OCLC’s License Manager enables them to share model agreements, scaling the benefits of traditional copy cataloging to new areas of library work.
Facilitating access to Open Access content. Global discovery of OA publications is only valuable when it enables reliable access to content. Years ago, OpenURL was introduced as a standard method of connecting library users to the ‘appropriate copy’ of an electronic resource, providing link resolution based on institutional subscriptions and entitlements. With the growth of Open Access publishing, libraries are increasingly interested in providing direct access to OA versions of published content to promote the value of OA as a publishing model. This also helps to reduce friction in the end-user’s discovery experience by removing authentication barriers, where users must provide credentials and/or click through multiple interfaces before accessing the full text of an article. Differentiating between various versions of OA content at the point of discovery – a preprint, an author-accepted manuscript (sometimes referred to as a ‘postprint’), the published version of record, or even a retracted version of record, etc. – and presenting the ‘appropriate copy’ is not trivial. OCLC is tackling this challenge in a couple of ways. We offer an API-based integration of Unpaywall (a popular aggregation of OA content) in WorldCat Discovery, which surfaces links to OA versions of published content dynamically. And we support integration with LibKey, a service that provides direct links to full-text (PDF) OA versions of articles, including retracted versions. For researchers especially, differentiating between different ‘states’ of publication – prior to peer-review, peer-reviewed but not yet published, published, retracted, etc. – is important to evaluating the credibility of sources.
Smarter ILL workflows: shortening the fulfillment cycle for Open Access content. Despite the advances libraries have made in providing global discovery and direct patron access to OA content, some fulfillment requests still land in the Interlibrary Loan (ILL) department. When ILL staff can identify OA fulfillment options, they can reduce the operational overhead of processing borrowing and lending requests. WorldShare ILL enables library staff to identify OA versions of titles that are requested by local borrowers or received as a lending request from another library at the time a request is received, so they can offer direct OA fulfillment instead. Using Tipasa, OCLC’s ILL management application, these workflows can be further automated, enabling ILL staff to notify a local patron or borrowing library and supply an OA link using a template. These timely interventions can short-circuit more costly and time-consuming fulfillment options, allowing ILL staff to focus on higher-value activities. Between January and April 2023 alone, more than 5000 requests were fulfilled using OA links in WorldShare ILL and Tipasa, delivering immediate impact for library patrons.
With these service enhancements and innovations, OCLC is tackling some of the key pain points that libraries face in transitioning their collections and services to achieve institutional Open Access goals, fulfill compliance mandates, and/or align with evolving community norms. We will continue to evolve our metadata, discovery, and management solutions to support emerging OA practices, consistent with our mission of expanding access to the world’s knowledge.
Beyond delivering workflow solutions that facilitate library participation in the Open Access ecosystem, OCLC is committed to providing thought leadership to inform library strategy and practice, in the form of original research and educational programming that is shared freely with libraries worldwide.
Thought leadership: clarifying library roles and opportunities in the Open ecosystem
OCLC Research provides thought leadership on a wide range of topics, from data-centered analyses of the evolving collections landscape to practitioner-oriented explorations of library workflows. This distinctive research capacity – broad and deep subject matter knowledge, methodological expertise, direct access to networks of library practitioners and leaders – is part of what makes OCLC an extraordinary organization and community resource.
Here is a sampling of recent OCLC Research activities that provide insights into emerging library roles in the Open ecosystem:
Understanding researchers’ expectations for reusable data
For more than a decade, OCLC has been at the forefront of research on Open Science/Scholarship and its implications for libraries, particularly related to research data management. The IMLS-funded Dissemination Information Packages for Information Reuse (DIPIR) project investigated researchers’ data reuse needs and practices. It examined what it means for data to be reusable and for researchers to be satisfied with their data reuse experience. This research contributes to digital curation conversations about the preservation of context as well as content. Findings have been used to inform data curation checklists and demonstrate the value of data curation activities. The Secret Life of Data project is an NEH-funded research partnership between OCLC and Open Context, a research data publishing service. Our research team has been examining team-based research in the field and has identified ways to improve data creation and management practices to positively impact the team’s use of its data, as well as Open Context’s downstream curation activities and the broader community’s data reuse experiences.
Exploring the information-seeking behaviors of today’s (and tomorrow’s) learners
A long-running partnership between the University of Florida, Rutgers University and OCLC Research is examining how students find and use documents on the open web. The IMLS-funded Researching Students’ Information Choices project investigates the behavior, perceptions, and decision-making of students evaluating information resources in an open web search environment. The project has identified differences in students’ evaluations of scholarly and popular sources (see Container Collapse and the Information Remix, Science and News, and Authority, Context, and Containers), making it crucial for information literacy instruction to address evaluation in open web systems, not just curated library systems. More recently, the project found that most students don’t recognize preprints as a distinct category of publication and pay attention to peer-review indicators only when asked to evaluate the credibility of an information resource (Students’ Perceptions of Preprints Discovered in Google). The project team is currently analyzing students’ perceptions of access and its effects on their evaluation of information resources. Early results from this study were shared in a poster session at ACRL 2023.
Library roles in the Open Science arena
Universities around the world have begun implementing and adapting Open Science (OS) frameworks over the past decade and research libraries have followed suit, assuming new roles and responsibilities in Open Access publishing and FAIR data management. In Europe, the LIBER Open Science Roadmap provides guidance to research libraries exploring this landscape and establishing their role(s) within it. In 2020, OCLC Research and LIBER conducted a joint discussion series based on this roadmap. Participating librarians voiced concern about being inadequately equipped to collaborate with the array of non-library stakeholders needed to enable success in Open Science. This finding led to the development, in 2021, of a joint LIBER/OCLC Research workshop series on social interoperability to help librarians forge partnerships with more confidence in the rapidly evolving open ecosystem.
Discoverability of Open Access content
In 2018-19, OCLC’s Global Council sponsored an international survey of library ‘open content’ activities. Respondents reported a high level of confidence in the success of library efforts in several areas, including library support for faculty/researcher content creation, investments in institutional repositories and library-managed publishing platforms, and digitization of cultural heritage collections. They also reported dissatisfaction with the visibility of Open Access (as formally defined) and openly available content (whether expressly licensed as ‘open’ or not) in library discovery environments. These and other findings are highlighted in a report published by OCLC Research in 2020.
In 2021, at the invitation of the Dutch library community, OCLC Research organized Knowledge Sharing Discussion Series to learn more about open content activities and experiences in Dutch academic libraries. This investigation confirmed that there is a gap between library perceptions of success in Open Access publishing activities (i.e., creating and disseminating OA content) and confidence about the discoverability of these materials.
OCLC’s Open Access Discovery project is designed to address this gap by providing libraries with better evidence to improve the discoverability of OA publications. This project is investigating how library staff are integrating scholarly, peer-reviewed OA publications into their users’ discovery workflows and surveying users about their discovery experiences. This research is being carried out by OCLC in partnership with two important Dutch library consortia—Universiteitbibliotheken en Nationale Bibliotheek (UKB) and Samenwerkingsverband Hogeschoolbibliotheken (SHB).
These are just a few examples of OCLC Research programs that are exploring the Open landscape, clarifying significant opportunities for libraries, and helping them plan with confidence. OCLC shares the outputs of this work freely, in research reports that are published under Creative Commons (CC-BY) licenses or as Open Access preprints of articles in academic and professional journals.
Going forward
OCLC’s investments in community-facing research and in developing solutions for evolving library workflows (including Open Access) are made possible by a business model that balances our responsibility to manage community infrastructure on behalf of our members with our mission to serve the world’s libraries.
The combined strengths of sound organizational management, financial sustainability, and community governance have enabled OCLC to evolve its products, services, technology platform and data network to meet the changing needs of libraries. Without this foundation, OCLC could not have taken on stewardship of OAISter or enabled it to scale globally. We would not have the resources to develop and maintain the WorldCat.org discovery platform that provides visibility for libraries and publishers, to expand and improve the WorldCat data network so that it is more fully representative of global library capacity, to deliver resource-sharing and management solutions that improve library efficiency and reduce library costs, or to produce freely available research.
The next post in this series will describe the membership-good business model that sustains the community infrastructure OCLC provides and enables us to deliver ‘common good’ benefits to all libraries.
This post benefited from the input of numerous OCLC colleagues. Special thanks to Alexandra Winzeler, Jennifer Rosenfeld and Laura Falconi for their product insights; Ixchel Faniel and Titia Van der Werf for summarizing relevant OCLC Research projects; and Merrilee Proffit for all-around editorial excellence.
Below the fold I comment on both.
In List And Dump Schemes I discussed Fais Khan's "You Don't Own Web3": A Coinbase Curse and How VCs Sell Crypto to Retail, a description of how, by outsourcing securities fraud to cryptocurrency entrepreneurs, VCs could turn their money over much more quickly, juicing their returns while avoiding legal liability. This was a big part of the way the flood of money chasing returns in a low interest-rate environment corrupted venture capital.
$350 million in 2018, $515 million in 2020, $2.2 billion in 2021, and $4.5 billion in 2022.
White documents in detail how A16Z's desperation to show a return on these billions leads to obfuscating, lying, misleading and egregious forms of chart-crime. I'll just discuss three of the worst examples.
The first is from their 17th May, 2022 (almost exactly six months after "prices" started falling) State of Crypto, touting the 270.9% rise in cryptocurrencies' "market cap". White writes:
With the use of some extended x-axes and creative (and undisclosed) data cut-offs, they were able to make it appear that crypto prices could still be on a rocket trajectory. For example, take this chart of “global crypto market cap”. Looks compelling! Up only, which they helpfully illustrate with the crude black arrow in case those pesky “down” portions of the graph were threatening to confuse the message.
To their credit, they do cite their sources — in this case, CoinMarketCap. This allows us to pull the same data from the same source, as of the time at which the report was published:
...
I have taken the chart through May 2022, overlaid it on a16z’s, and massaged it a little bit to fit, which produces the following (note I’ve turned it into grayscale to help with visibility):
CoinMarketCap data (black rectangle) overlaid on a16z’s chart. The red dashed line marks on the CoinMarketCap chart the claimed cutoff of December 31, 2021.
This allows us to see how a16z cut off the data at roughly May 2021, despite claiming on that slide that the data was as of December 31, 2021 (marked by the red line), and despite in later slides using data through May 2022. This omission conveniently elides the downturns in mid and late 2021 that might make readers notice that “hmm, sometimes crypto prices do go down”.
Andreessen Horowitz begins the “Why Web3 Matters” section of the report by repeating a similar refrain as they did in the 2022 report: “web3 is the next evolution of the internet”. They fail to mention that while the columns describing the “web1” and “web2” eras describe actual changes in the web, the “web3” column remains wholly aspirational despite its supposed 2020 start date.
At this point, a critical reader should be wondering why Andreessen Horowitz — a venture capital firm that has backed (and continues to back) some of the largest Big Tech firms, and whose entire business model relies on accruing value to themselves and their investors — would be interested in something that was truly “community-governed” where “value accrues to network participants”.
As Fais Khan shows, such value as there might be in "Web3" accrues not to "network participants" but to the VCs themselves via List And Dump Schemes. The idea that A16Z would spend $7.5B on something governed by "the community" so that “value accrues to network participants” doesn't pass the laugh test.
In fact, this whole 3-era history of the Web that Chris Dixon has been pushing is deliberately distorted to make it look like his vision of "Web3" is inevitable and thus a massive investment opportunity. Dave Karpf's Web3's fake version of Web history supplies a necessary corrective:
the problem with Dixon’s model is that it is extremely, ceaselessly, aggressively wrong. It’s the type of wrong that might be useful for hawking unregistered Web3 security products (err, sorry, I mean, play-to-earn games), but is not at all useful for actually understanding the development of the internet.
1990-2005 wasn’t a single, contiguous era of “open decentralized protocols” and value accruing to the edges of the network. There were (at least) three eras in that timespan. Only the first (1990-95) had those qualities. As soon as the money got big, things changed drastically.
2005-2020 wasn’t a single era either. Again, there were at least three eras in there. And only the last one or two fit his description of “siloed, centralized services” with value accruing to the big tech companies. The years that most clearly represent the “web 2.0” era were characterized by social sharing and mass collaboration. It was only later that the platforms calcified and the “enshittification” cycle began in earnest.
But Chris Dixon was there for too much of the history of the Web to be making innocent mistakes here. When he erases Microsoft from the ‘90s Web, it isn’t because he never heard of the browser wars. When he conflates the participatory Web 2.0 years with the platform years that followed, it is an intentional omission. He’s getting the basic history wrong because it serves his strategic purposes as a Web3 investor and evangelist.
...
You know why Web3 has turned out to be so much scammier than the internet of the 90s and 00s? The answer is simple. It’s the same reason why Willie Sutton robbed banks (“Because that’s where the money is!”). You can’t code community participation and trust into the blockchain. Once the money gets big, the social incentives get skewed. In the complete absence of regulation, people are going to run huge scams.
We’ve seen the result. Web3 has been a catastrophe for everyone but the early investors and the scammers. Chris Dixon constructed a model of Web history to help sell his investments.
The third, A16Z's tribute to the reduction in Ethereum's energy consumption from the switch to Proof-of-Stake, is a real doozy. It centers on the claim that it "eliminates environmental objections". White counters:
Given they discuss Bitcoin throughout this report, it seems a little disingenuous to say that environmental concerns have been eliminated. It’s correct that The Merge greatly reduced Ethereum’s energy consumption, which is excellent! But Bitcoin still has a massive carbon footprint, which is currently comparable to that of the entire country of Peru. Its electricity consumption is comparable to that of Kazakhstan. If Bitcoin itself was a country, it would rank 34th in terms of energy consumption.
Well, yes, but as I noted in The Power Of Ethereum's Merge, Ethereum only consumed about half as much power as Bitcoin, so the Merge eliminated only about a third of the combined Proof-of-Work footprint (0.5 out of 1.5 units), not all of the environmental objections. Actually, the decrease was much less because (a) there are many other Proof-of-Work cryptocurrencies apart from Bitcoin, and (b) many of the slots in mining centers previously occupied by Ethereum rigs migrated to holding Bitcoin rigs, increasing Bitcoin's consumption. And note that even Proof-of-Stake Ethereum uses vastly more power (and is much slower) than an equivalent centralized system running on a few Raspberry Pis.
But that isn't the doozy, which is A16Z's comparison of Ethereum's power consumption to that of YouTube, which includes the claim that while "global data centers" consume 200TWh/year, somehow YouTube consumes 244TWh/year! White traces the sources for this amazing claim in detail, showing that no-one ever thought about the text they cut-and-pasted, and concludes:
Google’s own annual environmental report for 2022 shows they used 18.6 TWh across the entire company. Only a portion of all of the electricity used by Google can reasonably be attributed to YouTube, but I couldn't find more granular figures. Either way, a16z's estimated 244 TWh is off by more than an order of magnitude, and possibly even closer to two orders of magnitude — something that should have been apparent at a glance to whoever wrote the report.
I hope these samples encourage you to go read the whole of Molly's epic takedown.
Matt Levine
The aftermath of the global financial crisis led to an extended period of very low interest rates. This created a flood of money chasing higher returns, and this in turn created huge demand for the supposedly superior returns from venture capital. The VC's approach to investing this cornucopia is best summarized by Sequoia's "due diligence" before investing in FTX, which consisted of watching Sam Bankman-Fried playing (not very well) League of Legends.
In 2022, Tiger’s flagship fund suffered its worst annual loss, losing more than 50 per cent of its value as Tiger marked down its unlisted holdings by nearly 20 per cent.
In the startup boom of recent years, Tiger Global Management got a reputation for investing in every startup, moving fast, paying top dollar and not being too involved in governance.
This worked because most VC investments fail; the returns come from the small fraction of big successes. It is difficult and time-consuming to spot the big winners ahead of time, so Tiger's approach was to be in as many deals as possible. Doing "due diligence" on the deals was thus not merely a waste of time, but actively counter-productive, because it would alienate the founders.
That was during the boom, but after the boom comes the slump:
Technology-focused hedge fund Tiger Global is exploring options to cash in a piece of its more than $40bn portfolio of privately held companies, according to people familiar with the matter.
If you go to the secondary market with a bunch of startup stakes and a reputation for paying top dollar and not doing much due diligence, people are going to want a discount.
You should read the whole Tiger Global section of Levine's post, if only for his 7-point breakdown of why the Tiger Global "founder friendly" approach was justified in a boom. I'll return to the rest of Levine's post in a subsequent post here.
I recently did an internship for learning with the Chief of Staff Team (CoST) to CEO, specifically with Laurel Farrer on TeamOps. Since the internship is quite different from my regular work, I thought it best to do a blog post on it and reflect on the experience. What is an internship for learning? An … Continue reading "Internship for Learning: TeamOps (Chief of Staff Team to CEO)"
For the Open Data Day 2023 celebration, the Grafoscopio Community organised “Data Week: De los datos comunitarios a los chatbots”, which was attended by librarians, musicians, activists, academics, data analysts, and other participants of the Grafoscopio Community at the HackBo hackerspace in Bogotá, Colombia.
The event started with an introduction of the participants, the event, and the hackerspace; then we discussed artificial intelligence from a critical perspective, identifying it as “machine training”. Later we reviewed the main concepts of the Turing Test and John Searle’s “Chinese room” argument, also quoting and recommending resources such as FAIR (Feminist AI Research Network) and the AI Decolonial Manifesto.
After that, we introduced our tech stack, concepts, and practices like:
Pocket infrastructures
Moldable tools
Interstitial programming
Live-coding
Data narratives
Performative and community writing/editing/publishing.
These were supported by technologies like Pharo, GToolkit, Lepiter, MiniDocs, Fossil, ChiselApp, TiddlyWiki, Markdeep, and HedgeDoc. We built personal repositories, shared online data narratives using the practices and technologies introduced, and demoed a minimal implementation of a Telegram bot hosted on a personal laptop (see the sketch below).
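For readers curious what such a demo involves, here is a rough, generic sketch of a minimal long-polling Telegram echo bot in Python. The workshop’s actual demo used the Pharo-based stack described above, so treat this only as an illustration of the idea, with a placeholder token.

```python
import time

import requests

# Placeholder token obtained from Telegram's BotFather; not a real credential.
TOKEN = "123456:REPLACE-ME"
API = f"https://api.telegram.org/bot{TOKEN}"

offset = None
while True:
    # Long-poll the Bot API for new messages.
    updates = requests.get(f"{API}/getUpdates",
                           params={"timeout": 30, "offset": offset}).json()
    for update in updates.get("result", []):
        offset = update["update_id"] + 1  # acknowledge so it isn't re-delivered
        message = update.get("message")
        if message and "text" in message:
            # Echo the text back to the sender.
            requests.post(f"{API}/sendMessage",
                          params={"chat_id": message["chat"]["id"],
                                  "text": f"You said: {message['text']}"})
    time.sleep(1)
```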
After that, we explored metadata from photos taken with the ProofMode mobile app to consider the possibilities of building a model for geolocation data, in connection with previous workshops focused on pocket infrastructures for grassroots community maps and bots. Finally, we shared the memories of the Data Week with the network of communities related to the Grafoscopio group.
The overall feedback from the event focused on the welcoming and fraternal space, with the workshop allowing people of diverse backgrounds to learn, share, and connect in a safe and relaxed setting.
Participants also highlighted the importance of supporting and funding the efforts of the Grafoscopio community and the HackBo hackerspace in building grassroots digital tools and self-learning materials for peer-to-peer and community-to-community sharing of open data, information, and critical views of grassroots local issues.
The following post is one in a regular series on issues of Inclusion, Diversity, Equity, and Accessibility, compiled by Jay Weitz.
Land Acknowledgement and Indigenous Metadata Resources
The Program for Cooperative Cataloging (PCC) Advisory Committee on Diversity, Equity, and Inclusion (ACDEI) has issued a thoughtful document, Land Acknowledgement and Indigenous Metadata Resources. Neither comprehensive nor exhaustive, it is intended to be “a list of resources to aid PCC members in developing greater awareness of Indigenous peoples and an understanding of the issues involved in making land acknowledgement statements, as well as Indigenous issues within metadata contexts.” It further encourages “the establishment of mutually beneficial relationships with local Indigenous peoples” as well as a corresponding recognition that such relationship building is a time and energy demand on those communities.
Culturally respectful subject headings
Richard Sapon-White, Emeritus Assistant Professor at Oregon State University (OCLC Symbol: ORE); Pamela Louderback, Library Director at the Broken Arrow Campus of Northeastern State University (OCLC Symbol: OKN) in Oklahoma; and Sara Levinson, Latin American and Iberian Languages Cataloger at the University of North Carolina at Chapel Hill (OCLC Symbol: NOC), have written and published Creating Subject Headings for Indigenous Topics: A Culturally Respectful Guide, freely available to all. The manual provides some general background information; the basics of proposing new, and revising existing, subject headings; suggestions for communicating with tribal nations; sample authority records; and considerably more. Levinson coordinates the Latin American and Indigenous Peoples of the Americas Funnel Project (LAIPA) of the Program for Cooperative Cataloging (PCC) Subject Authority Cooperative Program (SACO). “We hope that this manual will provide guidance for Indigenous librarians and members of tribal nations as to how to have their voices heard, as well as guidance to non-Indigenous librarians on how to go about gathering information in a respectful and culturally sensitive manner,” writes Sapon-White in the introduction. “While this first edition cannot resolve all terminological issues, we do hope that it is a beginning, with future versions providing greater adherence to principles of equity, diversity, and inclusion.”
Sexual and health information in libraries
Libraries have long served as reliable sources for healthcare information, a role that has become more important than ever considering today’s political climate. Episode 82 of the American Libraries podcast “Call Number” is devoted in part to “Sexual and Reproductive Health Information.” Barbara Alvarez, author of The Library’s Guide to Sexual and Reproductive Health Information, featured in the 2023 March 21 and 2023 April 18 posts of Advancing IDEAs, advises libraries on how to provide better services and information. The episode then moves on to Beth Myers, Director of Special Collections at Smith College (OCLC Symbol: SNN) in Northampton, Massachusetts. Among other things, Myers talks about the Sophia Smith Collection of Women’s History, which documents the struggle for sexual and reproductive justice, and how other institutions can do the same.
Libraries in Latin America face some of the same challenges as libraries in the United States do, but within their own specific contexts and leading to their own conclusions and solutions. On May 19, 2023, 10:00 a.m. Eastern time, join the ALA International Relations Round Table Webinar Series for “Changes and Challenges in Libraries in Latin America.” Former ALA President Carol Brey, Director of the Quality of Life Department in the City of Las Cruces, New Mexico, USA, will moderate. Panelists will include Pedro Lutz, Director of Inter-Institutional Projects and Libraries at the Centro Colombo Americano in Bogota, Colombia; Jesús Amado, Cultural Director of Centro Venezolano Americano del Zulia (CEVAZ), Biblioteca “Luis Guillermo Pineda” in Maracaibo, Venezuela; and Francisco Javier Bolaños of the Fundacion Bibliotec in Cali, Colombia.
On May 3, 2023, the Illinois Senate passed and sent on to Democratic Governor J.B. Pritzker for his signature, HB2789, amending the Illinois Library Systems Act, which had passed the Illinois House during March. Part of the new language states: “In order to be eligible for State grants, a library or library system shall adopt the American Library Association’s Library Bill of Rights that indicates materials should not be proscribed or removed because of partisan or doctrinal disapproval or, in the alternative, develop a written statement prohibiting the practice of banning books or other materials within the library or library system.” Illinois Secretary of State and State Librarian Alexi Giannoulias spearheaded this “counter-movement to growing efforts to restrict books on topics such as race, gender and sexuality in schools and libraries across the United States,” according to the Associated Press, “Illinois lawmakers push back on library book bans.” In his statement entitled “First-in-the-Nation Legislation to Prevent Book Bans Approved by General Assembly,” Giannoulias writes, “The concept of banning books contradicts the very essence of what our country stands for. It also defies what education is all about: teaching our children to think for themselves. This landmark legislation is a triumph for our democracy, a win for First Amendment Rights, and a great victory for future generations.”
Coretta Scott King Awards John Steptoe New Talent winners
On what would have been her 96th birthday, the Coretta Scott King Book Awards Round Table (CSKBART) will present the 2023 winners of the John Steptoe New Talent awards, author Jas Hammonds for We Deserve Monuments, and illustrator Janelle Washington for Choosing Brave: How Mamie Till-Mobley and Emmett Till Sparked the Civil Rights Movement. On May 25, 2023, 4:30 p.m. Eastern time, Hammonds and Washington will talk about their lives and share their award-winning books in the free CSKBART Webinar with John Steptoe Winners. We Deserve Monuments, about seventeen-year-old Avery dealing with a move from D.C. to rural Georgia, is Hammonds’ first novel. Angela Joy’s Choosing Brave, illustrated by Washington, is a biography of Emmett Till’s mother, Mamie Till-Mobley, who refocused her grief into deep advocacy for both personal and wider racial justice.
This post was written by Dana Reijerkerk, Rebecca Fried, and Mandy Mastrovita of the DLF Assessment Interest Group Metadata Working Group (DLF AIG-MWG) blog subcommittee and is part of the DLF AIG Metadata Assessment Series of blog posts that discuss assessment issues faced by metadata practitioners.
Contributions to the series are open to the public. If interested, please contribute ideas and contact information to our Google form: https://forms.gle/hjYFeC7XpQbJUTSC8
Figuring out where to start assessing your institution’s metadata can be challenging. Not every institution has the capacity to sustain long-term, consistent metadata assessment. While there may not be a universal starting point, we consider the following five assessment strategies attainable for digital collections. We work at a small liberal arts college, an R1 research university library, and a Digital Public Library of America (DPLA) service hub and cultural heritage wing of a statewide virtual library. While our institutional sizes and digital repository environments may differ, we still ask similar questions for assessment.
We ask ourselves:
Who would we have to partner with (developers, IT staff, content partners, etc.) to get the work done?
What are the expectations of these partners?
What is our information management landscape (DAMS, repositories, etc.)?
What are the technological and managerial landscapes concerning systems, staffing, staff/student responsibilities, etc.?
Tip 1: Identify your Most-Used Collections and Focus on Those
Rationale: Metadata for highly used collections should be clean, consistent, and interoperable, which means that data is entered, harvested, or remediated in accordance with local, national, or international standards. Your metadata will give users the first impression of your collections and is essential for search engine optimization (SEO) content discoverability.
Workflow: Keep track of usage statistics for your digital collections using analytics tools. Many systems have built-in analytics. You can use Google Analytics instead for systems that lack a built-in analytics option. Do you have to create detailed reports for administrators or content partners? Or will you casually glance through analytics to see how well your records perform? You can manage the formality (or lack thereof) in reporting your analytics data. Examples of formalizing data collection include creating detailed reports for administrators or providing annual reports for site usage to partners.
Tip 2: Define and document how your schema(s) are used locally.
Rationale: An institution may use several metadata schemas or standards. Numerous content standards in your metadata environment could inform a single metadata schema.
Workflow: Document which standards are the most important for your institution to follow, and consider whether they have been, or can be, locally modified to better fit your descriptive, administrative, and preservation metadata needs. Documentation itself is an important consideration: if you have changed your description methods, for example, documenting this information is essential for long-term planning and future data migrations.
Tip 3: Set Goals for Collections Metadata and Keep Track of Progress
Rationale: Responsible curation best practices call for setting benchmarks. Data-driven benchmarks are a strategy for tracking progress over time. Shared mutual goals help identify gaps and opportunities for training staff.
Workflow: Consider your metadata goals and create a shared spreadsheet to record and share that information for all systems that operate together. Some example goals include: assessing for consistency across collections or evaluating the existing metadata taxonomies (Are the fields helpful to users?). If your institution uses or interacts with more than one digital repository, you can consult the shared spreadsheet as a benchmark for interoperability.
Tip 4: Focus on Quality over Quantity.
Rationale: Determine which collections demand a “quality over quantity” approach and those collections where concessions may have to be made for expediency and sustainability. Metadata work can be iterative and may feel like it’s never over.
Workflow: Clean one field in multiple records, or pare down the metadata template so all records in a set have consistent, clean metadata in a select few fields; a minimal sketch of this kind of single-field cleanup appears below. A commonly used option is OpenRefine, a free data-cleaning tool that helps automate metadata work. In addition, the DLF AIG-MWG has published an extensive Tools Repository that provides resources recommended by practitioners on how to use these tools and facilitate this work.
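As a small illustration of single-field cleanup outside OpenRefine, the sketch below uses pandas on a hypothetical CSV export; the file and column names are invented, and your own fields and normalisation rules will differ.

```python
import pandas as pd

# Hypothetical export of one collection's records; column names are examples.
records = pd.read_csv("collection_export.csv")

# Clean a single field across all records: trim whitespace, collapse internal
# runs of spaces, and normalise capitalisation of the values.
records["subject"] = (
    records["subject"]
    .fillna("")
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
    .str.title()
)

records.to_csv("collection_export_cleaned.csv", index=False)
```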
Tip 5: Try to Answer the Question: “What Would Happen in a Migration?”
Rationale: Besides considering interoperability between multiple digital repositories, look at your data and think about what would happen if you migrated to a new system. How much cleanup would you have to do?
Workflow: Export data as a CSV or XML file from your system(s) and assess the fields that you’re using. In your assessment, consider the following: What schemas have been used? Is the metadata high quality and consistent? Is each metadata field used consistently? Can the metadata be cleaned quickly in a program like OpenRefine, or could this process be automated? Are there fields used for only one or two collections, and should they be consolidated or eliminated? Does your current schema provide enough granularity for the types of resources you’ll need to describe?
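One way to start answering those questions is a quick field audit of an exported file. This sketch runs pandas over a hypothetical CSV export; the filename is an assumption, and which thresholds matter will depend on your own collections.

```python
import pandas as pd

# Hypothetical CSV export from your repository; field names will vary locally.
records = pd.read_csv("repository_export.csv")

# How consistently is each field populated? Sparse fields are candidates for
# consolidation or elimination before a migration.
completeness = records.notna().mean().sort_values()
print(completeness.to_string(float_format="{:.0%}".format))

# How many distinct values does each field hold? A rights or type field with
# hundreds of variants usually signals cleanup work ahead of a migration.
print(records.nunique().sort_values(ascending=False).to_string())
```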
I realise that my thinking on library collections may not be as well formed as it could be, but really ...? 😉
When I experiment with Chatbot interfaces I tend to have some standard queries. I ask whether the concept of the Irish diaspora is a helpful one in the context of either British or US politics. I also ask about library collections, in particular about the three models I have been writing and speaking about in recent years: collective collections, the inside out collection, and the facilitated collection.
These are the oddest results I have received from my chatbot experiments. Maybe they are having a day off? Maybe I missed a necessary configuration? Maybe my thoughts on library collections are completely incoherent?
I was trying Hugging Chat and it was using the OpenAssistant LLaMa 30B SFT 6 model.
First off, what is the collective collection?
OK, so much for that. What about the inside out collection?
mmm ... OK, and the facilitated collection?
It is of course easy to poke fun. And other results with other models can be so impressive. I do not know what is going on here ... perhaps it is a temporary glitch.
GPT-4 does somewhat better. First with a terse prompt.
A reminder that, contra some expectations, ChatGPT is not a search engine. Then with some more guidance ...
This shows the importance of the prompt, giving a fuller answer than the first. Although this puts me in my place ...
It is interesting that it offers an interpretation based on the text itself. That said, Bard does ‘know’ more ...
And finally ...
I would not pose the inside out collection as competitive with the acquired collection, and I wonder whether this is partly influenced by somebody else writing about it.
I haven’t done a deeper dive to see what I can find out about what is going on here. I note that my website is not part of the Google C4 collection analysed by The Washington Post.
Added 16 May 2023. OpenAI has just expanded the capabilities of ChatGPT to include web browsing. This is an important step. Here is ChatGPT on the collective collection with web browsing switched on:
And here for the first time I actually get named. It is also interesting to see the three collecting modes mentioned, although maybe the second should be collective collections?
Also interesting to see a name-check here. This is an OK answer, but I was especially interested to see the last paragraph which I thought did a good job without directly copying anything I am aware of.
Technical writing is something I’ve gotten more enjoyment out of the longer I’ve been working professionally as a software developer. Although, maybe it has been an interest all along, and I’ve only recently begun to recognize it. I think Robin Sloan put it well in his Specifying Spring ’83 that writing specifications provides its own set of challenges and joys:
I recommend this kind of project, this flavor of puzzle, to anyone who feels tangled up by the present state of the internet. Protocol design is a form of investigation and critique. Even if what I describe below goes nowhere, I’ll be very glad to have done this thinking and writing. I found it challenging and energizing.
Over the past couple of years I’ve had the opportunity to work with the Webrecorder project to help document some of the data formats they are developing to encourage interoperability between web archiving tools. This hasn’t been, as Robin described, using the specification as a canvas for imagining new sociotechnical ways of being, so much as it has been helping shape existing code and documentation into a form where it’s (hopefully) easier to digest by others.
Nevertheless, this has been rewarding work, because Webrecorder have been doing so much to advance web archiving practice. When you think about it, it’s hard not to see web archives as increasingly important, as the WWW continues to be such a central technology for global publishing and information sharing, despite (or in spite of) prominent examples of greedy consolidation and market failure.
Chief among the Webrecorder specifications is the Web Archive Collection Zipped or WACZ (pronounced waxy or wack-zed), which is a packaging standard for WARC (ISO 28500:2017) data that lets archived web content created with one set of tools be readable or “playable” with another set of tools.
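Because a WACZ file is, at bottom, a ZIP laid out in a conventional way, you can peek inside one with nothing more than the Python standard library. The sketch below lists a package’s contents and reads its datapackage.json manifest; the example filename is made up, and the exact layout should be checked against the WACZ specification itself.

```python
import json
import zipfile

# Path to a WACZ file produced by any conforming tool (hypothetical filename).
with zipfile.ZipFile("example.wacz") as wacz:
    # A WACZ is a ZIP: WARC data, indexes and metadata laid out as files.
    for name in wacz.namelist():
        print(name)

    # datapackage.json describes the package's resources (Frictionless Data style).
    with wacz.open("datapackage.json") as f:
        package = json.load(f)
    for resource in package.get("resources", []):
        print(resource.get("path"), resource.get("hash"))
```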
The hope/gamble here is that these specifications will be part of an ecosystem of tools where web archives are portable, verifiable and useful. It is early days, but I think we are already starting to see this happen a bit. For example the Harvard Innovation Lab’s recent work on Scoop is one of a set of tools for assembling evidentiary archives of web content in their Perma project. In addition the Internet Archive’s Save Page Now service also recently added the ability to export collected data as a WACZ file.
Sure, these tools might create web archives by crawling the web in
different ways that are suitable to the content. Or they might make the
once live web content viewable and interactive once more. But these
tools might also help visualize web archives in other ways: inspecting the file formats present in the archive and the crawling behaviours used, listing the websites and URL patterns in the archive, viewing the media files they contain as a gallery, charting how language usage changes over time, or publishing them in new spaces like IPFS.
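As a rough illustration of the "inspecting the file formats" idea, here is a sketch that tallies the MIME types recorded in a WACZ's CDXJ index. The index path (indexes/index.cdx) and the "mime" field name follow common pywb-style conventions and are assumptions on my part rather than requirements quoted from the spec; a gzipped index (index.cdx.gz) would need to be decompressed first.

```python
import json
import zipfile
from collections import Counter

def mime_counts(wacz_path, index_path="indexes/index.cdx"):
    """Count the MIME types recorded in a WACZ's CDXJ index."""
    counts = Counter()
    with zipfile.ZipFile(wacz_path) as z:
        with z.open(index_path) as index:
            for raw in index:
                line = raw.decode("utf-8").strip()
                if not line:
                    continue
                # Each CDXJ line is "<urlkey> <timestamp> <json block>".
                _, _, block = line.split(" ", 2)
                record = json.loads(block)
                counts[record.get("mime", "unknown")] += 1
    return counts

print(mime_counts("example.wacz"))  # hypothetical filename
```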
There’s no reason why we should rely on one tool, from one organization,
for all of this. Once you can depend on web archive data and metadata
being laid out in a particular way as files, packaged up in a zip file,
and (optionally) published on the web, these sorts of tools become
feasible in a way that previous centralized web archive architectures make more difficult.
These specifications are published at specs.webrecorder.net using Github Pages, which is a popular
static site publishing tool that sits on top of data you have in version
control at Github. Being able to commit changes to the specifications
and gather issue tickets around them is really critical to this type of
documentation.
After having written them initially as Markdown, we decided to
try using the W3C’s ReSpec
JavaScript library to publish the documents. ReSpec has lots of nice
features for formatting specifications in an accessible and recognizable
way. One simple example of a nicety that ReSpec offers is its system for
generating References. You can
easily embed citations to existing specifications so that they are
formatted correctly in a References section. If the spec you want to
cite hasn’t been cited before you can add it to the Specref corpus.
You can write ReSpec documents as straight up HTML, or as sections of Markdown text
interspersed in the HTML. We started out doing the latter, but found
over time that it was easiest to be able to edit the specifications as
stand-alone Markdown documents, and then generate the ReSpec HTML as
needed. I helped by writing markdown-to-respec,
which is a Github Action (written in Python) for automatically
generating a ReSpec document from Markdown, when a commit is pushed to
Github.
As an example, you can see how the Markdown for the IPFS Chunking specification gets transformed into this HTML when changes to the specification are pushed to Github. ReSpec does depend on some structured metadata (authors, editors, etc.), which serves as a configuration
for the specification. These can be included as YAML frontmatter in the
Markdown file, or if you prefer as a JSON file alongside the Markdown
file.
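As an illustration of what that looks like, here is a small sketch (mine, not an excerpt from any Webrecorder document) of a Markdown spec whose YAML frontmatter carries standard ReSpec configuration properties such as specStatus, shortName and editors; whether markdown-to-respec expects exactly these keys is an assumption on my part.

```python
# Requires PyYAML. The SPEC string stands in for a hypothetical spec.md file.
import yaml

SPEC = """\
---
specStatus: unofficial
shortName: my-spec
editors:
  - name: Jane Doe
    url: https://example.org/
---
# My Hypothetical Specification

Body text of the spec, written as ordinary Markdown.
"""

# Split off the frontmatter and parse it into the dict that would serve as
# the ReSpec configuration; the rest of the file is the Markdown body.
_, frontmatter, body = SPEC.split("---", 2)
config = yaml.safe_load(frontmatter)
print(config["shortName"], config["editors"][0]["name"])
```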
In addition, markdown-to-respec
can be installed
and used from the command line, outside of the Github action, if you
want to see what changes look like without pushing to Github, or if you
just want to experiment a bit.
If you get a chance to give it a try, please let me know. And happy spec
writing!
Much of the discussion occupying the Web recently has been triggered by the advent of Large Language Models (LLMs). Much of that has been hyping the vast improvements in human productivity they promise, and glossing over the resulting unemployment among the chattering and coding classes. But the smaller body of negative coverage, while acknowledging the job losses, has concentrated on the risk of "The Singularity", the idea that these AIs will go HAL 9000 on us, and render humanity obsolete[0].
My immediate reaction to the news of ChatGPT was to tell friends "at last, we have solved the Fermi Paradox"[1]. It wasn't that I feared being told "This mission is too important for me to allow you to jeopardize it", but rather that I assumed that civilizations across the galaxy evolved to be able to implement ChatGPT-like systems, which proceeded to irretrievably pollute their information environment, preventing any further progress.
In August 2022, AI Impacts, an American research group, published a survey that asked more than 700 machine-learning researchers about their predictions for both progress in AI and the risks the technology might pose. The typical respondent reckoned there was a 5% probability of advanced AI causing an “extremely bad” outcome, such as human extinction (see chart). Fei-Fei Li, an AI luminary at Stanford University, talks of a “civilisational moment” for AI. Asked by an American tv network if AI could wipe out humanity, Geoff Hinton of the University of Toronto, another AI bigwig, replied that it was “not inconceivable”.
The essay correctly points out that this risk is beyond the current state of the art:
But in the specific context of GPT-4, the LLM du jour, and its generative ilk, talk of existential risks seems rather absurd. They produce prose, poetry and code; they generate images, sound and video; they make predictions based on patterns. It is easy to see that those capabilities bring with them a huge capacity for mischief. It is hard to imagine them underpinning “the power to control civilisation”, or to “replace us”, as hyperbolic critics warn.
I agree with the AI experts that the HAL 9000 problem is an existential risk worth considering. But in order for humanity even to encounter it, society needs not just to survive a set of other existential risks which have much shorter fuses, but to do so with its ability to make rapid technological progress unimpaired. And the problem described in this cursory paragraph will make those risks much worse:
The most immediate risk is that LLMs could amplify the sort of quotidian harms that can be perpetrated on the internet today. A text-generation engine that can convincingly imitate a variety of styles is ideal for spreading misinformation, scamming people out of their money or convincing employees to click on dodgy links in emails, infecting their company’s computers with malware. Chatbots have also been used to cheat at school.
These more urgent existential risks include climate change, pandemics, and the rise of warmongering authoritarian governments. A key technique used by those exacerbating them is "spreading misinformation". Brian Stelter explains Bannon's 2018 confession to the acclaimed writer Michael Lewis:
“The Democrats don’t matter,” Bannon told Lewis. “The real opposition is the media. And the way to deal with them is to flood the zone with shit.”
That’s the Bannon business model: Flood the zone. Stink up the joint. As Jonathan Rauch once said, citing Bannon’s infamous quote, “This is not about persuasion: This is about disorientation.”
The aide said that guys like me were "in what we call the reality-based community," which he defined as people who "believe that solutions emerge from your judicious study of discernible reality."
...
"That's not the way the world really works anymore," he continued. "We're an empire now, and when we act, we create our own reality. And while you're studying that reality -- judiciously, as you will -- we'll act again, creating other new realities, which you can study too, and that's how things will sort out. We're history's actors . . . and you, all of you, will be left to just study what we do."
In many applications a tendency to spout plausible lies is a bug. For some it may prove a feature. Deep fakes and fabricated videos which traduce politicians are only the beginning. Expect the models to be used to set up malicious influence networks on demand, complete with fake websites, Twitter bots, Facebook pages, TikTok feeds and much more. The supply of disinformation, Renée DiResta of the Stanford Internet Observatory has warned, “will soon be infinite”.
This threat to the very possibility of public debate may not be an existential one; but it is deeply troubling. It brings to mind the “Library of Babel”, a short story by Jorge Luis Borges. The library contains all the books that have ever been written, but also all the books which were never written, books that are wrong, books that are nonsense. Everything that matters is there, but it cannot be found because of everything else; the librarians are driven to madness and despair.
I disagree that the "threat to the very possibility of public debate" is not existential. Informed public debate is a necessary but not sufficient condition for society to survive the existential threats it faces. An example is the continuing fiasco of the US' response to the COVID pandemic[3]. In Covid is still a leading cause of death as the virus recedes, Dan Diamond writes [my emphasis]:
Federal health officials say that covid-19 remains one of the leading causes of death in the United States, tied to about 250 deaths daily, on average, mostly among the old and immunocompromised.
Few Americans are treating it as a leading killer, however — in part because they are not hearing about those numbers, don’t trust them or don’t see them as relevant to their own lives.
...
The actual toll exacted by the virus remains a subject of sharp debate. Since the earliest days of the pandemic, skeptics have argued that physicians and families had incentives to overcount virus deaths, and pointed to errors by the Centers for Disease Control and Prevention in how it has reported a wide array of covid data. Those arguments were bolstered earlier this year by a Washington Post op-ed by Leana Wen that argued the nation’s recent covid toll is inflated by including people dying with covid, as well as from covid — for instance, gunshot victims who also test positive for the virus — a conclusion echoed by critics of the pandemic response and amplified on conservative networks.
Johannes Gutenberg’s development of movable type has been awarded responsibility, at some time or other, for almost every facet of life that grew up in the centuries which followed. It changed relations between God and man, man and woman, past and present. It allowed the mass distribution of opinions, the systematisation of bureaucracy, the accumulation of knowledge. It brought into being the notion of intellectual property and the possibility of its piracy. But that very breadth makes comparison almost unavoidable. As Bradford DeLong, an economic historian at the University of California, Berkeley puts it, “It’s the one real thing we have in which the price of creating information falls by an order of magnitude.”
Much commentary on the effects of Gutenberg, including the essay, emphasizes books. But the greater effect on society came from propagandistic pamphlets, which, being cheaper to produce, had a much wider circulation. The economics of the movable type revolution greatly impacted the production of high-quality content, but it impacted the production of lower-quality content much more. The explanation is simple: the raw costs of publication and distribution form a much greater proportion of the total cost of disseminating lower-quality content. Higher-quality content incurs much greater human and other costs in creating the content before it is published and distributed.
Initially, when a new, more cost-effective medium becomes available, content quality is high because the early adopters value the new experience and put effort into using it. But as the low cost becomes more widely known, quality begins to degrade. We have seen this effect in action several times. My first was with Usenet newsgroups. I was an early Usenet adopter, and when I was working on the X Window System I found the "xpert" newsgroup a valuable resource for communicating with the system's early adopters. But as the volume grew the signal-to-noise ratio dropped rapidly, and it became a waste of time. Usenet actually pioneered commercial spam, which migrated to e-mail, where it provided yet another example of a rapidly decaying signal-to-noise ratio.
Exactly the same phenomenon has degraded academic publishing. When Stanford's Highwire Press pioneered the transition of academic journals from paper to the Web in 1995, the cost of distribution was practically eliminated, but the cost of peer review, copy-editing, graphics and so on was pretty much untouched.
In Who pays the piper calls the tune Jim O'Donnell writes:
The model of scientific and scholarly publishing is, arguably, undergoing a fundamental change. Once upon a time, the business model was simple: publish high quality articles and convince as many people as possible to subscribe to the journals in which they appear and raise the prices as high as the market will bear. We all know pretty well how that works.
But now an alternate model appears: charge people to publish their articles and give them away for free. The fundamental change that implies is that revenue enhancement will still come from charging whatever the market will bear, but now the search is not for more subscribers but for more authors. Of course peer review intrudes into this model, but if you could, for example, double the number of articles passing peer review for a journal you publish, you could double your gross revenue. That was mostly not the case before except where the publisher had room to increase the subscription price proportionately. There's a slippery slope here. Predatory journals have already gone over the edge on that slope and are in a smoldering heap at the bottom of the hill, but the footing can get dicey for the best of them.
Open access with "author processing charges" out-competed the subscription model. Because the Web eliminated the article rate limit imposed by page counts and printing schedules, it enabled the predatory open access journal business model. So now it is hard for people "doing their own research" to tell whether something that looks like a journal and claims to be "peer-reviewed" is real, or a pay-for-play shit-flooder[5]. The result, as Bannon explains in his context, is disorientation, confusion, and an increased space for bad actors to exploit.
Governments' response to AI's "threat to the very possibility of public debate" (and to their control of their population's information environment) is to propose regulation. Here are the EU and the Chinese government[6]. In a blog post entitled The Luring Test: AI and the engineering of consumer trust Michael Atleson of the FTC made threatening noises:
Many commercial actors are interested in these generative AI tools and their built-in advantage of tapping into unearned human trust. Concern about their malicious use goes well beyond FTC jurisdiction. But a key FTC concern is firms using them in ways that, deliberately or not, steer people unfairly or deceptively into harmful decisions in areas such as finances, health, education, housing, and employment. Companies thinking about novel uses of generative AI, such as customizing ads to specific people or groups, should know that design elements that trick people into making harmful choices are a common element in FTC cases, such as recent actions relating to financial offers, in-game purchases, and attempts to cancel services.
Until last year, he said, Google acted as a “proper steward” for the technology, careful not to release something that might cause harm. But now that Microsoft has augmented its Bing search engine with a chatbot — challenging Google’s core business — Google is racing to deploy the same kind of technology. The tech giants are locked in a competition that might be impossible to stop, Dr. Hinton said.
His immediate concern is that the internet will be flooded with false photos, videos and text, and the average person will “not be able to know what is true anymore.”
In other words, Google abandoned their previously responsible approach at the first hint of competition. Companies in this state of panic are not going to pay attention to gentle suggestions from governments.
Part of the problem is the analogy to the massive bubbles in cryptocurrencies, non-fungible tokens and Web3. Just as with these technologies, AI is resistant to regulation. Just as with cryptocurrencies, this is the VCs' "next big thing", so deployment will be lavishly funded and have vast lobbying resources. Just as with cryptocurrencies, the bad guys will be much quicker than the regulators[7].
And now, a fascinating leak from inside Google suggests that it simply won't matter what governments, VCs or even the tech giants do. The must-read We Have No Moat: And neither does OpenAI by Dylan Patel and Afzal Ahmad posts the anonymous document:
While our models still hold a slight edge in terms of quality, the gap is closing astonishingly quickly. Open-source models are faster, more customizable, more private, and pound-for-pound more capable. They are doing things with $100 and 13B params that we struggle with at $10M and 540B. And they are doing so in weeks, not months. This has profound implications for us:
We have no secret sauce. Our best hope is to learn from and collaborate with what others are doing outside Google. We should prioritize enabling 3P integrations.
People will not pay for a restricted model when free, unrestricted alternatives are comparable in quality. We should consider where our value add really is.
Giant models are slowing us down. In the long run, the best models are the ones which can be iterated upon quickly. We should make small variants more than an afterthought, now that we know what is possible in the <20B parameter regime.
...
Most importantly, they have solved the scaling problem to the extent that anyone can tinker. Many of the new ideas are from ordinary people. The barrier to entry for training and experimentation has dropped from the total output of a major research organization to one person, an evening, and a beefy laptop.
The writer's purpose was to warn Google of the competitive threat from open source AI, thus increasing the panic level significantly. The writer argues that open source AI was accelerated by Facebook's release of LLaMA and the subsequent leak of its weights, and that it has made very rapid progress since. Hardware and software resources within the reach of individuals can achieve results close to those of ChatGPT and Bard, which require cloud-level investments.
Companies can argue that their AI will be better on some axes, for example in producing fewer "hallucinations", but they can't generate a return on their massive investments (Microsoft invested $10B in OpenAI in January) if the competition is nearly as good and almost free. It seems unlikely that their customers care so much about "hallucinations" that they would be willing to pay a whole lot more to get fewer of them. The tech giants clearly don't care enough to delay deploying systems that frequently "hallucinate", so why should their customers?
That threat is what the tech giants are worried about. What I'm worried about is that these developments place good-enough AI in the hands of everyone, rendering governments' attempts to strong-arm companies into preventing bad guys using it futile. After all, the bad guys don't care about additional "hallucinations", for them that's a feature not a bug. They enhance the disorientation and confusion they aim for.
My favorite science fiction on this theme was published in 1954. I had probably read it by 1957 in one of the yellow-jacketed Gollancz SF collections. In Fredric Brown's (very) short story Answer, Dwar Ev switches on the gigantic computer:
There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. “The honor of asking the first question is yours, Dwar Reyn.”
“Thank you,” said Dwar Reyn. “It shall be a question that no single cybernetics machine has been able to answer.”
He turned to face the machine. “Is there a God?”
The mighty voice answered without hesitation, without the clicking of a single relay.
“Yes, now there is a God.”
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
That is the second half of the entire story.
For the record, my favored resolution of the Fermi Paradox is the Rare Earth hypothesis. This is the idea that, although there are an enormous number of stars in the galaxy, each likely with a retinue of planets, the existence of humanity on Earth has depended upon a long series of extremely unlikely events, starting from the fact that the Sun, despite its location in the outskirts of the Milky Way, is a rare high-metallicity G-type star with low luminosity variation, through the Earth's stable orbit and large moon, to its plate tectonics and magnetosphere, and on and on.
Generative AI has helped bad actors innovate and develop new attack strategies, enabling them to stay one step ahead of cybersecurity defenses. AI helps cybercriminals automate attacks, scan attack surfaces, and generate content that resonates with various geographic regions and demographics, allowing them to target a broader range of potential victims across different countries. Cybercriminals adopted the technology to create convincing phishing emails. AI-generated text helps attackers produce highly personalized emails and text messages more likely to deceive targets.
Another recent example of the zone flooding problem is the Texas state government's stubborn defense of its fossil fuel industry via decades of misinformation from the industry and its media allies. The Texas Senate has a suite of bipartisan bills that purport to fix the recent grid failures, but:
These are bills meant to boost fossil fuels and crowd out renewables. S.B. 1287 requires energy companies to cover more of the costs of connecting to the grid depending on distance—in what amounts to an added tax on renewable generators that often operate farther away from the central source and depend on lengthy transmission lines. Then there’s S.B. 2012, which would “incentivize the construction of dispatchable generation” and “require electric companies to pay generators to produce power in times of shortage.” Definition: more gas buildout, and more levies on electricity providers instead of gas producers. S.B. 2014 “eliminates Renewable Energy Credits” so as to “level the playing field” with gas sources, never mind the generous tax breaks that already benefit fossil fuel producers. S.B. 2015 “creates a goal of 50% dispatchable energy” for the central grid, essentially mandating that gas sources provide at least half of Texas’ electricity at all times. Senate Joint Resolution 1 hopes to enshrine S.B. 6’s gas backup program in the state constitution as a new amendment.
For more than a decade, climate science deniers, rightwing politicians and sections of the Murdoch media have waged a campaign to undermine the legitimacy of the Bureau of Meteorology’s temperature records.
...
“This has frankly been a concerted campaign,” says climate scientist Dr Ailie Gallant, of Monash University. “But this is not about genuine scepticism. It is harassment and blatant misinformation that has been perpetuated.”
Despite multiple reviews, reports, advisory panels and peer-reviewed studies rejecting claims that its temperature record was biased or flawed, Gallant says the “harassment” of the bureau has continued.
...
One former executive, who for eight years was responsible for the bureau’s main climate record, says the constant criticism has affected the health of scientists over many years, who were diverted from real research to repeatedly answer the same questions.
Dr Greg Ayers, a former director of the bureau and leading CSIRO atmospheric scientist, has written four peer-reviewed papers testing claims made by sceptics.
“There’s a lot of assertion [from sceptics] but I haven’t seen much science,” said Ayers. “If you are going to make claims then we need to do peer-reviewed science, not just assertion.”
A climate denier with a laptop could easily ask ChatGPT to write a paper based on the Murdoch papers' articles, complete with invented citations, and pay the "author processing charge" to a predatory journal. Then Dr. Ayers would be reduced to arguing about the quality of the journal.
Experts say those books are likely just the tip of a fast-growing iceberg of AI-written content spreading across the web as new language software allows anyone to rapidly generate reams of prose on almost any topic. From product reviews to recipes to blog posts and press releases, human authorship of online material is on track to become the exception rather than the norm.
...
What that may mean for consumers is more hyper-specific and personalized articles — but also more misinformation and more manipulation, about politics, products they may want to buy and much more.
As AI writes more and more of what we read, vast, unvetted pools of online data may not be grounded in reality, warns Margaret Mitchell, chief ethics scientist at the AI start-up Hugging Face. “The main issue is losing track of what truth is,” she said. “Without grounding, the system can make stuff up. And if it’s that same made-up thing all over the world, how do you trace it back to what reality is?”
The Chinese government's attempt to have it both ways by leading in the AI race but ensuring their AIs stick to the party line isn't going well. Glyn Moody summarizes the state of play in How Will China Answer The Hardest AI Question Of All?:
Chinese regulators have just released draft rules designed to head off this threat. Material generated by AI systems “needs to reflect the core values of socialism and should not subvert state power” according to a story published by CNBC. The results of applying that approach can already be seen in the current crop of Chinese chatbot systems. Bloomberg’s Sarah Zheng tried out several of them, with rather unsatisfactory results:
In Chinese, I had a strained WeChat conversation with Robot, a made-in-China bot built atop OpenAI’s GPT. It literally blocked me from asking innocuous questions like naming the leaders of China and the US, and the simple, albeit politically contentious, “What is Taiwan?” Even typing “Xi Jinping” was impossible.
In English, after a prolonged discussion, Robot revealed to me that it was programmed to avoid discussing “politically sensitive content about the Chinese government or Communist Party of China.” Asked what those topics were, it listed out issues including China’s strict internet censorship and even the 1989 Tiananmen Square protests, which it described as being “violently suppressed by the Chinese government.” This sort of information has long been inaccessible on the domestic internet.
One Chinese chatbot began by warning: “Please note that I will avoid answering political questions related to China’s Xinjiang, Taiwan, or Hong Kong.” Another simply refused to respond to questions touching on sensitive topics such as human rights or Taiwanese politics.
Chinese authorities have detained a man for using ChatGPT to write fake news articles, in what appears to be one of the first instances of an arrest related to misuse of artificial intelligence in the nation.
...
The alleged offense came to light after police discovered a fake article about a train crash that left nine people dead, which had been posted to multiple accounts on Baidu Inc.’s blog-like platform Baijiahao. The article was viewed over 15,000 times before being removed.
Further investigations revealed that Hong was using the chatbot technology — which is not available in China but can be accessed via VPN networks — to modify viral news articles which he would then repost. He told investigators that friends on WeChat had showed him how to generate cash for clicks.
"The overwhelming majority of people who are ever going to see a piece of misinformation on the internet are likely to see it before anybody has a chance to do anything about it," according to Yoel Roth, the former head of Trust and Safety at Twitter.
When he was at Twitter, Roth observed that over 90% of the impressions on posts were generated within the first three hours. That’s not much time for an intervention, which is why it's important for the cybersecurity community to develop content moderation technology that "can give truth time to wake up in the morning," he says.
It is this short window in time that makes flooding the zone with shit so powerful. It is a DDoS attack on content moderation, which already cannot keep up.
§1 My new Google Reader §2 Law Library of Congress Reports §3 The Breach: "library workers are struggling to maintain a welcoming space in the face of policing solutions" §4 Google's Project Tailwind §5 How About Machine Learning Enhancing Theses?
My notes from the presentations on the second day of Write the Docs Portland 2023. Caitlin Davey – The visuals your users never saw… wait that’s most of them draws on instructional design principles, specifically visual ways to make images more cognitively interesting so they don’t just breeze past them without taking away the important …
On 8th March 2023, Women Environmental Programme (WEP) celebrated Open Data Day 2023 with the theme “Open AI for Environmental Conservation”. The event was attended by staff and volunteers of WEP and open data enthusiasts that were both present in-person and online. The objective of the event was to teach participants how Open AI solutions are used to track environmental changes and help improve environmental management and conservation.
During the event, participants were exposed to and taught how to use some Open AI tools for tracking environmental changes and management. These tools included: Global Forest Watch, Climate Watch, Climate AI, Google Maps, and ChatGPT. Participants were given an explanation on how to use each of the Open AI tools to either monitor changes to the environment, carry out environmental research and contribute data to the tools, or search for important environmental information to make informed decisions. A total of 20 people participated in this event, 12 in-person and 8 virtually.
With the arrival and registration of the physical participants at the venue, the program was kicked off, followed by introductions from the participants and welcome remarks from Mr. John Baaki, Deputy Executive Director of WEP. He welcomed and thanked the participants for joining the event and promised that the exercise would be educative and filled with opportunities. The event was moderated by the head of human resources, Ms. Patience Adema.
The welcome address was followed by a brief rundown of the history of the Women Environmental Programme and its scope of work by the head of the Monitoring and Evaluation department, Ms. Damaris Ujah. She gave a brief history of WEP and how it was founded, its scope of work, partnerships, memberships, and affiliations with the United Nations (UN). She stated that WEP is a non-governmental, non-profit, non-political, non-religious, and voluntary organisation formed in April 1997 by a group of women in Kaduna State. She also stated that WEP holds United Nations Economic and Social Council (ECOSOC) special status and Observer Status to the United Nations Environment Programme (UNEP) Governing Council/Global Ministerial Environment Forum and the United Nations Framework Convention on Climate Change (UNFCCC).
After a brief rundown, Mr. Sammy Joel made a presentation on the overview of the subject matter, and the descriptions of terms and concepts related to Open Data. He stated that Open Data is data that can be freely used, re-used, and redistributed by anyone – subject only, at most, to the requirement to attribute and share-alike. He also stated that open data is typically made available by governments, non-governmental organisations, and other public institutions, as well as by individuals and private companies. It can include a wide range of information, such as weather data, financial information, healthcare data, and more.
He stated that open data is the backbone of Artificial Intelligence (AI). He also shared how we can use Artificial Intelligence for Environmental Conservation and listed methods that can be used to achieve this particular goal.
Satellite and drone imagery: Satellites and drones can capture high-resolution images of the environment, which can be analysed using AI algorithms to monitor changes such as deforestation, land use, and wildlife habitats.
Machine learning algorithms: Machine learning algorithms can analyse large datasets and identify patterns, allowing conservationists to identify areas that require conservation efforts and track changes over time.
Computer vision and image recognition: Computer vision and image recognition technologies can be used to track wildlife populations, monitor poaching, and identify illegal activities in protected areas.
Natural language processing: Natural language processing can be used to analyse social media posts and news articles to identify potential environmental threats or illegal activities.
Climate models: AI can be used to create climate models that can predict the impact of climate change on the environment and identify potential solutions.
Environmental decision support systems: Environmental decision support systems can use AI to provide real-time recommendations for conservation efforts, such as where to focus patrols or which areas to prioritise for reforestation.
He also exposed participants to some AI tools and demonstrated how they are used:
Global Forest Watch: It has an AI Forest Monitor and investigator. GFW has a web app that helps in data collection, monitoring, and analysis in real-time powered by open data.
Climate Watch: Offers open data, visualisations and analysis to help policymakers, researchers and other stakeholders gather insights on countries’ climate progress.
Climate AI: Turn climate risk into a competitive advantage, by aiming for zero loss of lives, livelihoods, and nature.
ChatGPT: An artificial intelligence chatbot developed by Open AI and has been fine-tuned using both supervised and reinforcement learning techniques.
The event brought a lot of learning for the participants and for WEP. We learned that there are a variety of tools that can be used in environmental monitoring and research, and that more time needs to be devoted to training practitioners on how to use them.
Participants gained knowledge during the event by using open AI tools for the first time. This will help them in their different endeavours, particularly in the field of environmental management.
Women Environmental Program (WEP) received a small grant to host Open Data Day(s) (ODD) events and activities.
Relatively recently, a colleague (Peggy Griesinger) distributed a bibliography on the topic of diversity, equity, and inclusion (DEI), and I decided to spend some time analyzing the content of the bibliography. Below are a couple of visualizations from the analysis, albeit out of context:
Along the way, I learned about "slow librarianship" -- an antiracist, responsive, and values-driven practice. I was also able to programmatically extract and enumerate quite a number of interesting question/answer pairs elaborating on DEI. A sampling follows:
What is a core value in libraries? - social justice
What is a core value in libraries? - diversity, equity, and inclusion principles
How are librarians' skill sets honed? - through ongoing interaction with diverse groups
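The post doesn't say how the question/answer pairs were produced, but an extractive question-answering model run over the bibliography's text is one plausible approach. The sketch below uses Hugging Face's question-answering pipeline; the model name, the bibliography.txt file, and the questions themselves are illustrative assumptions, not a description of the author's actual method.

```python
from transformers import pipeline

# An off-the-shelf extractive QA model; any SQuAD-style model would do.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = open("bibliography.txt").read()  # hypothetical plain-text export
questions = [
    "What is a core value in libraries?",
    "How are librarians' skill sets honed?",
]

for question in questions:
    result = qa(question=question, context=context)
    # Each result contains the extracted answer span and a confidence score.
    print(f"{question} - {result['answer']} (score={result['score']:.2f})")
```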
From the analysis's summary:
It goes without saying, DEI is a thing in libraries and librarianship; the articles in the given bibliography discuss DEI, and many of the extracted sentences elaborate on the how's and why's of DEI in libraries, specifically in cataloging practice.
What is interesting to me is an apparent shift in librarianship over the past few decades. For a long time, libraries seemed to be about books, but with the advent of computers, library schools evolved into "i" (information) schools, and nowadays, especially in practice, librarianship seems to be less about data, information, and knowledge and more about social justice. I suppose such is a sign of the times.