The fundamental problem in the design of the LOCKSS system was to audit the integrity of multiple replicas of content stored in unreliable, mutually untrusting systems without downloading the entire content:
Multiple replicas, in our case lots of them, resulted from our way of dealing with the fact that the academic journals the system was designed to preserve were copyrighted, and the copyright was owned by rich, litigious members of the academic publishing oligopoly. We defused this issue by insisting that each library keep its own copy of the content to which it subscribed.
Unreliable, mutually untrusting systems was a consequence. Each library's system had to be as cheap to own, administer and operate as possible, to keep the aggregate cost of the system manageable, and to keep the individual cost to a library below the level that would attract management attention. So neither the hardware nor the system administration would be especially reliable.
Without downloading was another consequence, for two reasons. Downloading the content from lots of nodes on every audit would be both slow and expensive. But worse, it would likely have been a copyright violation and subjected us to criminal liability under the DMCA.
Lots of replicas are essential to the working of the LOCKSS protocol, but more normal systems don't have that many, for obvious economic reasons. Back then, integrity audit systems that didn't need an excess of replicas were developed, including work by Mehul Shah et al, and Jaja and Song. But, primarily because the implicit threat models of most archival systems in production assumed trustworthy infrastructure, these systems were not widely used. Outside the archival space, there wasn't a requirement for them.
A decade and a half later the rise of, and risks of, cloud storage have sparked renewed interest in this problem. Yangfei Lin et al's Multiple‐replica integrity auditing schemes for cloud data storage provides a useful review of the current state-of-the-art. Below the fold, a discussion of their work and some related research. Their abstract reads:
Cloud computing has been an essential technology for providing on‐demand computing resources as a service on the Internet. Not only enterprises but also individuals can outsource their data to the cloud without worrying about purchase and maintenance cost. The cloud storage system, however, is not fully trustable. Cloud data integrity auditing is crucial for defending against the security threats of data in the untrusted multicloud environment. Storing multiple replicas is a commonly used strategy for the availability and reliability of critical data. In this paper, we summarize and analyze the state‐of‐the‐art multiple‐replica integrity auditing schemes in cloud data storage. We present the system model and security threats of outsourcing data to the cloud with classification of ongoing developments. We also summarize the existing data integrity auditing schemes for multicloud data storage. The important open issues and potential research directions are addressed.
There are three possible system architectures for auditing the integrity of multiple replicas:
As far as I'm aware, LOCKSS is unique in using a true peer-to-peer architecture, in which nodes storing content mutually audit each other.
In another possible architecture the data owner (DO in Yangfei Lin et al's nomenclature) audits the replicas.
Yangfei Lin et al generally consider an architecture in which a trusted third party audits the replicas on behalf of the DO.
Proof-of-Possession vs. Proof-of-Retrievability
There are two kinds of audit:
A Proof-of-Retrievability (PoR) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed and every bit was intact.
A Proof-of-Possession (PoP) audit allows the auditor to assert with very high probability that, at audit time, the audited replica existed, but not that every bit was intact. The paper uses the acronym PDP for Provable Data Possession.
Immutable, Trustworthy Storage
The reason integrity audits are necessary is that storage systems are neither reliable nor trustworthy, especially at scale. Some audit systems depend on storing integrity tokens, such as hashes, in storage which has to be assumed reliable. If the token storage is corrupted, it may be possible to detect but not recover from the corruption. It is generally assumed that, because the tokens are much smaller than the content to whose integrity they attest, they are correspondingly more reliable. But it is easy to forget that both the tokens and the content are made of the same kind of bits, and that even storage protected by cryptographic hardware has vulnerabilities.
In many applications of cloud storage it is important that confidentiality of the data is preserved by encrypting it. In the digital preservation context, encrypting the data adds a significant single point of failure, the loss or corruption of the key, so is generally not used. If encryption is used, some means for ensuring that the ciphertext of each replica is different is usually desirable, as is the use of immutable, trustworthy storage for the decryption keys. The paper discusses doing this via probabilistic encryption using public/private key pairs, or via symmetric encryption using random noise added to the plaintext.
If the replicas are encrypted they are not bit-for-bit identical, so their hashes will differ whether they are intact or corrupt. Thus a homomorphic encryption algorithm must be used:
Homomorphic encryption is a form of encryption with an additional evaluation capability for computing over encrypted data without access to the secret key. The result of such a computation remains encrypted.
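As a toy illustration of the homomorphic property in general (not of the specific schemes the paper surveys), textbook unpadded RSA is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. A sketch with deliberately tiny, insecure parameters:

```python
# Toy demonstration of a homomorphic property using unpadded ("textbook")
# RSA, which is multiplicatively homomorphic: E(a) * E(b) mod n == E(a * b).
# Illustration only -- textbook RSA with tiny primes is NOT secure.

def make_toy_rsa_keys():
    p, q = 61, 53                  # tiny primes, for illustration only
    n = p * q                      # public modulus
    phi = (p - 1) * (q - 1)
    e = 17                         # public exponent, coprime with phi
    d = pow(e, -1, phi)            # private exponent (Python 3.8+)
    return (n, e), (n, d)

def encrypt(pub, m):
    n, e = pub
    return pow(m, e, n)

def decrypt(priv, c):
    n, d = priv
    return pow(c, d, n)

pub, priv = make_toy_rsa_keys()
a, b = 7, 6
c = (encrypt(pub, a) * encrypt(pub, b)) % pub[0]   # compute on ciphertexts
assert decrypt(priv, c) == a * b                   # decrypts to the product
```

Real multiple-replica schemes use far more elaborate constructions, but the principle is the same: a computation on ciphertexts corresponds to a meaningful computation on the underlying plaintexts.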
In Section 3.3 Yangfei Lin et al discuss two auditing schemes based on homomorphic encryption:
If an audit operation is not to involve downloading the entire content, it must involve the auditor requiring the system storing the replica to perform a computation that:
Has a result the storage system does not know ahead of time.
Takes as input part (PoP) or all (PoR) of the replica.
Thus, for example, asking the replica store for the hash of the content is not adequate, since the store could have pre-computed and stored the hash, rather than the content.
PoP systems can, for example, satisfy these requirements by requesting the hash of a random range of bytes within the content. PoR systems can, for example, satisfy these requirements by providing a random nonce that the replica store must prepend to the content before hashing it. It is important that, if the auditor pre-computes and stores these random values, they be kept secret from the replica stores. If the replica store discovers them, it can pre-compute the responses to future audit requests and discard the content without detection.
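These two challenge styles can be sketched as follows (a minimal illustration with hypothetical function names, using SHA-256 in place of whatever hash a real scheme would specify):

```python
import hashlib
import os
import random

def pop_challenge(length):
    """Auditor picks a random byte range (Proof-of-Possession)."""
    start = random.randrange(length)
    end = random.randrange(start + 1, length + 1)
    return start, end

def pop_response(replica, start, end):
    """Store hashes only the challenged range."""
    return hashlib.sha256(replica[start:end]).hexdigest()

def por_challenge():
    """Auditor picks a fresh random nonce (Proof-of-Retrievability)."""
    return os.urandom(32)

def por_response(replica, nonce):
    """Store must hash the nonce prepended to the entire content,
    so it cannot reuse a pre-computed hash of the content alone."""
    return hashlib.sha256(nonce + replica).hexdigest()

# The auditor, holding its own copy (or pre-computed answers kept secret),
# compares the store's response against the expected value.
replica = b"the preserved content" * 1000
nonce = por_challenge()
assert por_response(replica, nonce) == hashlib.sha256(nonce + replica).hexdigest()
```

Note that the security of both styles rests on the challenge values being unpredictable to the store at the time it decides whether to keep the content.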
Unfortunately, it is not possible to completely exclude the possibility that a replica store, or a conspiracy among the replica stores, has compromised the storage holding the auditor's pre-computed values. An ideal auditor would generate the random values at each audit, rather than pre-computing them. Alas, this is typically possible only if the auditor has access to a replica stored in immutable, trustworthy storage (see above). In the mutual audit architecture used by the LOCKSS system, the nodes do have access to a replica, albeit not in reliable storage, so the random nonces the system uses are generated afresh for each audit.
It is an unfortunate reality of current systems that, over long periods, preventing secrets from leaking and detecting in a timely fashion that they have leaked are both effectively impossible.
Auditing Dynamic vs. Static Data
In the digital preservation context, the replicas being audited can be assumed to be static, or at least append-only. The paper addresses the much harder problem of auditing replicas that are dynamic, subject to updates through time. In Section 3.2 Yangfei Lin et al discuss a number of techniques for authenticated data structures (ADS) to allow efficient auditing of dynamic data:
There are three main ADSs: rank-based authenticated skip list (RASL), Merkle hash tree (MHT), and map version table (MVT).
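As a sketch of the general idea behind the Merkle hash tree (this illustrates plain MHT verification, not the paper's RASL or MVT constructions): the auditor stores only the root hash, and any single block can be verified against it using logarithmically many sibling hashes, which makes updates to dynamic data cheap to re-authenticate.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Root hash of a Merkle tree over the given data blocks."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                     # duplicate last node if odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_proof(leaves, index):
    """Sibling hashes needed to verify one block against the root."""
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1                    # paired node at this level
        proof.append((level[sibling], index % 2 == 0))  # (hash, node_is_left)
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return proof

def verify(block, proof, root):
    node = h(block)
    for sibling, node_is_left in proof:
        node = h(node + sibling) if node_is_left else h(sibling + node)
    return node == root

blocks = [b"block-%d" % i for i in range(5)]
root = merkle_root(blocks)
assert verify(blocks[3], merkle_proof(blocks, 3), root)
```

When a block is updated, only the hashes on its path to the root change, so the authenticated structure can track dynamic data without rehashing everything.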
Cloud storage adoption, due to the growing popularity of IoT solutions, is steadily on the rise, and ever more critical to services and businesses. In light of this trend, customers of cloud-based services are increasingly reliant, and their interests correspondingly at stake, on the good faith and appropriate conduct of providers at all times, which can be misplaced considering that data is the "new gold", and malicious interests on the provider side may conjure to misappropriate, alter, hide data, or deny access. A key to this problem lies in identifying and designing protocols to produce a trail of all interactions between customers and providers, at the very least, and make it widely available, auditable and its contents therefore provable. This work introduces preliminary results of this research activity, in particular including scenarios, threat models, architecture, interaction protocols and security guarantees of the proposed blockchain-based solution.
Everything they want from the Ethereum blockchain could be provided by the same kind of verifiable logs as are used in Certificate Transparency, thereby avoiding the problems of public blockchains. But doing so would face insuperable scaling problems under the transaction rates of industrial cloud deployments.
Over the weekend I attended WikiConference North America in Cambridge, Massachusetts. This was my fourth time participating in this meeting, which is a wonderful gathering for Wikimedians as well as librarians, educators and others interested in open access to information. This meeting is purposefully expansive, including colleagues from Mexico, the Caribbean and Canada. It was wonderful to see so many Caribbean participants, possibly more so even than from Mexico or Canada (likely due to the emergence of a new and lively Wikimedians of the Caribbean User Group ).
This year the conference was held in conjunction with the Credibility Coalition and featured a “credibility summit” with participants from Google, Facebook and Microsoft alongside members of the Wikimedia Movement. This convergence facilitated necessary and timely discussions on credibility, reliability and the role that these organizations and communities play in combating fake information on the internet.
There was a contingent of librarians / Wikibrarians attending the conference, several talks that touched on Wikicite, and many talks on Wikidata. From my perspective, notable talks included:
Presenters from Vanderbilt University Library walked through how they are considering using Wikidata (or Wikibase) as a Research Information Management (RIM) system. [notes]
Will Kent of WikiEdu facilitated a discussion on Building a Wikidata Curriculum. This was fed by lessons learned, both by Will and his WikiEdu collaborators and others in the audience who have been teaching Wikidata to others. The session etherpad has a number of good resources for teaching Wikidata.
A Harvard libraries panel on digital humanities resources / “non traditional scholarship” included discussion about whether / how these resources might be used in a Wikipedia article and how you might use Wikidata to describe / model them. [notes]
A presentation from University of New Mexico explored how Wikipedia might stand in for electronic resources (specifically, to let selectors do analysis on children’s literature). Could they use Wikipedia articles instead of subscription resources, particularly when selectors are interested in finding books that center non-dominant cultures? [slides | article]. I’ve been looking at lists like 1000 Black Girl Books and those featured on We Need Diverse Books and have been thinking about how these resources show up (or, mostly don’t) in Wikipedia and Wikidata, so this is definitely a topic I’m interested in.
CiteUnseen is a Wikipedia user plugin that shows what kinds of sources are cited in a Wikipedia article (books, websites, newspaper articles, etc.), and also flags citations that come from questionable sources.
I facilitated a discussion on “gap” projects and what tools / techniques those projects use. Some projects create simple crowd-sourced lists. Others leverage out-of-copyright topical encyclopedias or biographical dictionaries, push structured data gleaned from transcriptions of these sources into Wikidata, and then use Listeriabot to generate lists from them (here is one example from the Women in Red project). Other projects like Art+Feminism are focusing effort on articles that already exist but that are at risk, using a combination of Wikidata and information from Wikipedia articles. Overall I have been impressed by how many gap projects are leveraging Wikidata to identify and prioritize work.
The event was funded by the Credibility Coalition, Craig Newmark Philanthropies, and the Craig Newmark School of Journalism, along with support from the Knowledge Futures Group, MIT Open Learning, and many greater Boston area arts organizations. Many thanks are due to the funders, the many volunteers and the program committee for making this fun and thought-provoking meeting possible. I look forward to next year’s event, which will be hosted by Wikimedia Canada.
This guest post is part of the CAP Research Community Series. This series highlights research, applications, and projects created with Caselaw Access Project data.
Abdul Abdulrahim is a graduate student at the University of Oxford completing a DPhil in Computer Science. His primary interests are in the use of technology in government and law and developing neural-symbolic models that mitigate the issues around interpretability and explainability in AI. Prior to the DPhil, he worked as an advisor to the UK Parliament and a lawyer at Linklaters LLP.
The United States of America (U.S.) has seen declining public support for major political institutions, and a general disengagement with the processes or outcomes of the branches of government. According to Pew's Public Trust in Government survey earlier this year, "public trust in the government remains near historic lows," with only 14% of Americans stating that they can trust the government to do "what is right" most of the time. We believed this falling support could affect the relationship between the branches of government and the independence they might have.
One indication of this was a study on congressional law-making which found that Congress was more than twice as likely to overturn a Supreme Court decision when public support for the Court is at its lowest compared to its highest level (Nelson & Uribe-McGuire, 2017). Furthermore, another study found that it was more common for Congress to legislate against Supreme Court rulings that ignored legislative intentions, or rejected positions taken by federal, state, or local governments, due to ideological differences (Eskridge Jr, 1991).
To better understand how the interplay between the U.S. Congress and Supreme Court has evolved over time, we developed a method for tracking the ideological changes in each branch using word embeddings trained on text corpora generated by each. For the Supreme Court, we used the opinions for the cases provided in the CAP dataset, though we extended this to include other federal court opinions to ensure our results were stable. As for Congress, we used the transcribed speeches of Congress from Stanford's Social Science Data Collection (SSDS) (Gentzkow & Taddy, 2018). We use the case study of reproductive rights (particularly, the target word "abortion"), which is arguably one of the more contentious topics ideologically divided Americans have struggled to agree on. Over the decades, we have seen shifts in the interpretation of rights by both the U.S. Congress and the Supreme Court that arguably led to the expansion of reproductive rights in the 1960s and a contraction in the subsequent decades.
What are word embeddings?
To track these changes, we use a quantitative method of tracking semantic shift from computational linguistics, which is based on the co-occurrence statistics of words used — and corpora of Congress speeches and the Court's judicial opinions. These are also known as word embeddings. They are a distributed representation for text that is perhaps one of the key breakthroughs for the impressive performance of deep learning methods on challenging natural language processing problems. This allows us to see, using the text corpus as a proxy, how they have ideologically leaned over the years on the issue of abortion, and whether any particular case led to an ideological divide or alignment.
For a more detailed account of word embeddings and the different algorithms used, I highly recommend Sebastian Ruder's "On word embeddings".
Our experimental setup
In tracking the semantic shifts, we evaluated a couple of approaches using the word2vec algorithm. Conceptually, we formulate the task of discovering semantic shifts as follows. Given a time-sorted corpus: corpus 1, corpus 2, …, corpus n, we locate our target word and its meanings in the different time periods. We chose the word2vec algorithm based on comparisons of the performance of count-based, prediction-based, and hybrid algorithms on a corpus of U.S. Supreme Court opinions. We found that although there is variability in coherence and stability as a result of the algorithm chosen, the word2vec models show the most promise in capturing the wider interpretation of our target word. Between the two word2vec algorithms, Continuous Bag of Words (CBOW) and Skip-Gram with Negative Sampling (SGNS), we observed similar performance; however, the latter showed more promising results in capturing case law related to our target word at a specific time period.
We tested one algorithm in our experiments, a low-dimensional representation learned with SGNS, with both the incremental updates method (IN) and the diachronic alignment method (AL), giving results for two models: SGNS (IN) and SGNS (AL). In our implementation, we use parts of the Python library gensim and supplement this with implementations by Dubossarsky et al. (2019) and Hamilton et al. (2016b) for tracking semantic shifts. For the SGNS (AL) model, we extract regular word-context pairs (w,c) for each time slice and train SGNS on these separately. For the SGNS (IN) model, we similarly extract the regular word-context pairs (w,c), but rather than dividing the corpus and training on separate time bins, we train on the first time period and then incrementally add new words, updating and saving the model.
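The alignment step in the diachronic approach can be sketched with orthogonal Procrustes, as in Hamilton et al. (2016b): each time slice's embedding matrix is rotated into a common reference space before vectors are compared across periods. A minimal numpy sketch (the real pipeline also intersects vocabularies and handles frequency thresholds):

```python
import numpy as np

def align_embeddings(W_t, W_ref):
    """Align time-slice embeddings W_t to the reference space W_ref via
    orthogonal Procrustes: find the orthogonal matrix R minimizing
    ||W_t @ R - W_ref||_F. Rows of both matrices must correspond to the
    same shared vocabulary."""
    u, _, vt = np.linalg.svd(W_t.T @ W_ref)
    return W_t @ (u @ vt)

# Sanity check: a randomly rotated copy of an embedding matrix should be
# recovered (up to numerical error) by the alignment.
rng = np.random.default_rng(0)
W_ref = rng.standard_normal((100, 20))
Q, _ = np.linalg.qr(rng.standard_normal((20, 20)))   # random orthogonal matrix
W_t = W_ref @ Q
assert np.allclose(align_embeddings(W_t, W_ref), W_ref, atol=1e-6)
```

Because the rotation is orthogonal, cosine similarities within each slice are preserved; only the coordinate frames are reconciled so vectors from different periods become comparable.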
To tune our algorithm, we performed two main evaluations (intrinsic and extrinsic) on samples of our corpora, comparing the performance across different hyperparameters (window size and minimum word frequency). Based on these results, the parameters used were MIN = 200 (minimum word frequency), WIN = 5 (symmetric window cut-off), DIM = 300 (vector dimensionality), CDS = 0.75 (context distribution smoothing), K = 5 (number of negative samples) and EP = 1 (number of training epochs).
What trends did we observe in our results?
We observed some notable trends from the changes in the nearest neighbours to our target word. The nearest neighbours to "abortion" indicate how the speakers or writers who generated our corpus associate the word and what connotations it might have in the group.
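A nearest-neighbour table of this kind is just a cosine-similarity ranking over the embedding matrix. A minimal sketch with a made-up toy vocabulary (hypothetical names; in practice gensim's `most_similar` does this directly):

```python
import numpy as np

def nearest_neighbours(target, vocab, vectors, k=10):
    """Top-k words ranked by cosine similarity to a target word.
    vocab is a list of words; vectors[i] is the embedding of vocab[i]."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[vocab.index(target)]        # cosine similarities
    ranked = [vocab[i] for i in np.argsort(-sims)]
    return [w for w in ranked if w != target][:k]

# Tiny illustrative vocabulary and 2-d vectors, not real model output.
vocab = ["abortion", "roe", "wade", "tax"]
vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [-1.0, 0.0]])
print(nearest_neighbours("abortion", vocab, vectors, k=2))
```

Running this ranking per time slice, against each slice's (aligned or incrementally updated) vectors, produces the tables discussed below.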
To better assess our results, we conducted an expert interview with a Women's and Equalities Specialist to categorise the words as: (i) a medically descriptive word, i.e., it relates to common medical terminology on the topic; (ii) a legally descriptive word, i.e., it relates to case, legislation, or opinion terminology; and (iii) a potentially biased word, i.e., it is not a legal or medical term and thus was chosen by the user as a descriptor.
Nearest Neighbours Table Key. Description of keys used to classify words in the nearest neighbours by type of terminology. These were based on the insights derived from an expert interview.
A key observation we made on the approaches to tracking semantic shifts is that depending on what type of cultural shift we intend to track, we might want to pick a different method. The incremental updates approach helps identify how parts of a word sense from preceding time periods change in response to cultural developments in the new time period. For example, we see how the relevance of Roe v. Wade (1973) changes across all time periods in our incremental updates model for the judicial opinions.
In contrast, the diachronic alignment approach better reflects what the issues of that specific period are in the top nearest neighbours. For instance, the case of Roe v. Wade (1973) appears in the nearest neighbours for the judicial opinions shortly after it is decided in the decade up to 1975 but drops off our top words until the decades up to 1995 and 2015, where the cases of Webster v. Reproductive Health Services (1989), Planned Parenthood v. Casey (1992) and Gonzales v. Carhart (2007) overrule aspects of Roe v. Wade (1973) — hence, the new references to it. This is useful for detecting the key issues of a specific time period and explains why it has the highest overall detection performance of all our approaches.
Local Changes in U.S. Federal Court Opinions. The top 10 nearest neighbours to the target word "abortion" ranked by cosine similarity for each model.
Local Changes in U.S. Congress Speeches. The top 10 nearest neighbours to the target word "abortion" ranked by cosine similarity for each model.
These preliminary insights allow us to understand some of the interplay between the Courts and Congress on the topic of reproductive rights. The method also offers a way to identify bias and how it may feed into the process of lawmaking. As such, for future work, we aim to refine the methods to serve as a guide for operationalising word embedding models to identify bias, as well as the issues that arise when applied to legal or political corpora.
Information capitalism, the system where information, a historically, largely free and ubiquitous product of basic communication, is commodified by private owners for profit, is entrenched in our society. Information brokers have consolidated and swallowed up huge amounts of data, in a system that leaves data purchase, consumption, and use largely unregulated and unchecked. This article focuses on librarian ethics in the era of information capitalism, focusing specifically on an especially insidious arena of data ownership: surveillance capitalism and big data policing. While librarians value privacy and intellectual freedom, librarians increasingly rely on products that sell personal data to law enforcement, including Immigration and Customs Enforcement (ICE). Librarians should consider how buying and using these products in their libraries comports with our privacy practices and ethical standards.
As a fellow librarian, I’m here to warn you: ICE is in your library stacks. Whether directly or indirectly, some of the companies that sell your library research services also sell surveillance data to law enforcement, including ICE (U.S. Immigration and Customs Enforcement). Companies like Thomson Reuters and RELX Group (formerly Reed Elsevier) are supplying billions of data points, bits of our personal information, updated in real time, to ICE’s surveillance program.1 Our data is being collected by library vendors and sold to the police, including immigration enforcement officers, for millions of dollars.
This article examines the privacy ethics conundrum raised by contemporary publishing models, where the very services libraries depend upon to fill their collections endanger patron privacy. In the offline world of paper collections and library stacks, librarians adhere to privacy ethics and practices to ensure intellectual freedom and prevent censorship. But librarians are unprepared to apply those same ethical requirements to digital libraries. As our libraries transition to largely digital collections2, we must critically assess our privacy ethics for the digital era.3 Where are the boundaries of privacy in libraries when several “data services”4 corporations that also broker personal data own the lion’s share of libraries’ holdings?
After describing library vendors’ data selling practices and examining how those practices affect privacy in libraries, this article concludes by suggesting that library professionals organize beyond professional organizations. Librarians can demand vendor accountability and insist that vendors be transparent about how they use, repackage, and profit from personal data.
An Overview of Vendors’ Data Brokering Work
The consolidation of library vendors in the digital age has created a library services ecosystem where several vendors own the majority of databases and services upon which libraries rely.5 This puts libraries at the whim of publishing giants like Elsevier, Springer, and Taylor and Francis. This article uses Thomson Reuters and RELX Group, major publishing corporations that own Westlaw and Lexis6 , as case studies to demonstrate how information consolidation and the rise of big data impact library privacy. Thomson Reuters and RELX Group do not just duopolize the legal research market, they are powerful players in many library collections. They own a bevy of news sources and archives, academic collections including ScienceDirect, Scopus, and ClinicalKey, and all of the Reed Elsevier journals.7 Companies like Thomson Reuters and RELX Group are gradually buying up information collections that libraries and their patrons depend upon.
In addition to selling research products, both Thomson Reuters and RELX are data brokers, companies that sell personal data to marketing entities and law enforcement including ICE.8 Data brokering is fast becoming a billion dollar industry. Personal information fuels the “Big Data economy,” a system that monetizes our data by running it through algorithm-based analyses to predict, measure, and govern peoples’ behavior.9 While data brokering for commercial gain (to predict peoples’ shopping habits and needs) is insidious, the sale of peoples’ data to law enforcement is even more dangerous. Brokering data to law enforcement fuels a policing regime that tracks and detains people based not on human investigation, but on often erroneous pools of data traded between private corporations and sorted by discriminatory algorithms.10 Big data policing disparately impacts minorities, creating surveillance dragnets in Muslim communities, overpolicing in black communities, and sustaining biases inherent in the U.S. law enforcement system.11 In the immigration context, big data policing perpetuates problematic biases with little oversight12 , resulting in mass surveillance, detention, and deportation.13
ICE pays RELX Group and Thomson Reuters millions of dollars for the personal data it needs to fuel its big data policing program.14 Thomson Reuters supplies the data used in Palantir’s controversial FALCON program, which fuses together a multitude of databases full of personal data to help ICE officers track immigrant targets and locate them during raids.15 LexisNexis provides ICE data that is “mission critical”16 to the agency’s program tracking immigrants and conducting raids at peoples’ homes and workplaces.17
Information Capitalism Drives Data Brokering
The new information economy is drastically changing vendors’ and libraries’ information acquisition, sales, and purchasing norms. For Thomson Reuters and RELX Group, data brokering diversifies profit sources as the companies transition their services from traditional publishing to become “information analytics” companies.18 These corporations are no longer the publishers that librarians are used to dealing with, the kind that focus on particular data types (academic journals, scientific data, government records, and other staples of academic, public, and specialized libraries). Instead, the companies are data barons, sweeping up broad swaths of data to repackage and sell. Libraries have observed drastic changes in vendor services over the last decade.
New business models are imperative for publishing companies that must maintain profits in a changing information marketplace. They are competing to remain profitable enterprises in an era where their traditional print publishing methods are less lucrative. To stay afloat financially, publishers are becoming predictive data analytics corporations.19 Publishers realize that the traditional publishing revenue streams from books and journals are unsustainable as those items become digital and open access.20 Reed Elsevier, one of the top five largest academic publishers has been “managing down” its print publishing services to focus on more lucrative online data analytics products.21 Reed Elsevier’s corporation rebranded itself RELX Group and Morgan Stanley recategorized RELX Group as a “business company” instead of a “media group.”22
For publishers, changing their business models is imperative to survive in a world where information access is changing dramatically and publishers are learning to maintain their market share in the new digital information regime.23 While print materials are less lucrative, publishers build technology labs, developing tools that stream and manipulate digital materials. Publishers like Thomson Reuters and RELX Group are finding new opportunities to consolidate and sell digital materials.24 Where information used to come in different shapes and sizes (papers, books, cassette tapes, photographs, paintings, newspapers, blueprints, and other disparate, irregular formats) it now flows in a single form, transmitted through fiber optic cables. Thomson Reuters and RELX Group are capitalizing on this new information form, buying up millions of published materials and storing them electronically to create digital data warehouses25 stored in servers. These new publishing enterprises are data hungry and do not discriminate between different types of data, be it academic, government, or personal. They want every data type, to compile as bundles of content to sell. Today’s library vendors are less like local bookstores and more like Costcos stocked with giant buckets of information.26 The new publishing company structure is a “big box” data store of library resources. Libraries buy bundles of journals, databases, and ebooks, and other mass-packaged materials in “big deals.”27
The Costco-ization of publishing drives publishers to collect tons of data, and to make systems that will slice and dice the data into new types of saleable bundles. Thus, publishers morph into data analytics corporations, developing AI systems to parse through huge datasets to gather statistics (“How many times does Ruth Bader Ginsburg say “social justice” in her Supreme Court opinions?”) and predict trends (“How many three pointers will Stephen Curry throw in 2019?”).28
As their vendors’ service models shift, librarians have also shifted from being information owners whose collection development focuses on purchasing materials to information borrowers that rent pre-curated data bundles shared through subscription databases. In 2019, Roxanne Shirazi, a librarian at CUNY’s Grad Center, described the phenomenon of “borrowing” information from gigantic data corporations in a blog post titled The Streaming Library.29 Shirazi compares the modern library to a collection of video subscription streaming services (Hulu, Netflix, Amazon). Libraries subscribe to online collections, “streaming” resources that live within various corporate data collections without owning them. “…Libraries used to purchase materials for shared use […] those materials used to live on our shelves.” But libraries no longer own all of their research materials, they temporarily borrow subscribe to them. Vendors can provide library resources, and make them disappear, at their whim.30
As lenders, library vendors do not end their relationships with libraries when they complete a sale. Instead, as streaming content providers, vendors become embedded in libraries. They are able to follow library patrons’ research activities, storing data about how people are using their services. When companies like Thomson Reuters and RELX Group are simultaneously library service providers and data brokers, they can access library patron data and repackage that data for profit.31 Library vendors collect more and more patron data as they develop services to track patron preferences and make collection development decisions.32 Librarians have long been concerned with the privacy implications of digital authentication features vendors put in products to help verify patron identities and track their use of online databases.33 When vendors that track library patrons also participate in data brokering, it is entirely possible that patron data is in the mix of personal data the companies sell as data brokers.34 Neither Thomson Reuters nor RELX Group has denied doing so.35 Furthermore, in 2018, both Thomson Reuters and RELX Group modified their privacy statements to clarify that they use personal data across their platforms, with business partners, and with third party service providers.36
In the current information economy, librarians increasingly lack leverage to confront powerful corporate vendors like Thomson Reuters and RELX Group.37 Information capitalism, the transition of industrialist capitalism to an economic system that assigns commercial value to knowledge, information, and data38, simultaneously intensifies privacy concerns in our libraries and empowers data corporations. As publishing conglomerates buy more and more data, libraries have little choice but to purchase their research products from these information monopolies. Data brokering is an especially threatening form of information capitalism, but other manifestations of information capitalism have also seeped into librarianship. When information sellers limit access to online content, put up paywalls, and charge exorbitant article processing charges (APCs), they profit from our patrons’ information needs and our roles as information providers.
We are beholden to information capitalism39, and our profession is captured by this new brand of digital warehouse-style publishing. If we want information, we must pay a premium to wealthy data barons. The power wielded by huge publishing companies makes it hard for librarians that negotiate contracts with the companies to demand accountability. Librarians are in the awkward role of being, simultaneously, both “the biggest consumer of the materials [the corporations] sell as well as their biggest critics.”40 When librarians and their patrons try to bypass library vendors and provide open access to information, vendors have the power to stifle those demands. For instance, vendors sued the computer programmer who developed Sci-Hub, a website providing free access to scientific research and texts, forcing the website offline.41 Librarians envision a world where information is free, but live in a reality where they are largely captive to giant publishing companies.
Because personal data is the “big data” empire’s most valuable currency, sought by companies like Thomson Reuters and RELX Group, librarians should be especially concerned about vendors’ gathering personal data in libraries. Data brokering is a multi-billion dollar industry.42 Data brokering capitalizes on lax software and online platform privacy policies43, scraping and saving troves of personal data to analyze or repackage it for sale. Thus, as publishers become data analytics firms, it is useful for libraries to consider whether they unwittingly fuel the data brokering industry.
Librarians’ Roles in Data Brokering
It is important to begin the discussion about librarians’ roles in patron privacy by drawing a line between privacy ethics and the “vocational awe” that pervades our profession.44 The idea that certain parts of librarians’ work and values are sacred and beyond critique45 is harmful to our profession. We are certainly not obligated to consider ourselves the lone fighters at the front lines of academic freedom or bold crusaders for a larger cause. Much of what librarians have written about protecting patrons’ digital privacy focuses on librarians’ responsibilities, placing the burden of privacy requirements and responsibilities on libraries and their staffs.46 Library professional education programs teach librarians that they must protect their patrons from online research platforms (clearing caches, erasing patron profiles, logging out of online systems, and other custodial tasks) rather than demanding that corporations stop tracking and collecting data from library patrons. It is not a librarian’s responsibility to save patrons from digital surveillance; rather, it is incumbent upon software developers to protect user privacy in the research tools they create.
Rather than considering libraries the ultimate digital privacy saviors and library ethics as some glowing bastion that librarians are burdened with protecting, we can think of intellectual freedom and privacy ethics as one of many factors to consider when we choose which resources and tools to implement in our libraries. Library ethics are points upon which we should hold our vendors accountable, not obligations to internalize and carry on our backs. While there may be no absolute, ideal privacy solution for our libraries, privacy is something to keep in mind and add to the list of concerns we have about the form and function of modern publishing and research.
Indeed, it is not the job of libraries, but the obligation of library vendors, to ensure that patrons are not surveilled by library products. Beyond unfairly burdening librarians, post hoc efforts to contain invasive digital research tools in libraries are not as effective as preemptively incorporating privacy into library products. Libraries’ digital hygiene activities are mere attempts to clean up after library vendors that breach patron privacy. When patrons use library vendors’ products, librarians follow behind, erasing profiles, clearing personal data from vendor systems, and trying to erase patrons’ digital footprints. We take on the work of cleaning up after our vendors.
Instead, our vendors should be proactively protecting our patrons’ privacy. Privacy expert Ann Cavoukian coined the concept “privacy by design” for the knowledge economy, believing that in the age of information capitalism, information capitalists should build privacy measures into their products by default. Cavoukian set out seven principles that have been adopted by law in other nations, including the European Union (EU) in its General Data Protection Regulation (GDPR).47 The principles require that online services, including research tools and resources, be designed to proactively protect privacy. According to the principles, research products should default to privacy. Privacy should be embedded into research products’ design, with “end to end” privacy throughout the entire data lifecycle, from the moment data is created to its eventual disposal.48 These privacy measures should be transparent and clear to the end user. For instance, users should know where their data will end up, especially if their data may be packaged and resold in a data brokering scheme.49
While the EU has embraced privacy by design and required the companies doing business in its member nations to adhere to the seven principles, there is no privacy by design requirement for research services in the U.S. This leaves U.S. librarians in an ethically complicated role as major information technology users who adhere to patron privacy standards. Librarians’ information access roles keep us at the forefront of technological advancement, as most information access occurs online.50 We are information technology’s early adopters51, and we serve as gatekeepers to troves of online data collections. Oftentimes our role makes us information technology’s first critics, sounding warnings about products and practices that are oppressive to our patrons and that violate our ethical duties to protect patron privacy and intellectual freedom.52
As technology critics, we tend to focus on technologies a la carte, on a product-by-product basis.53 By homing in on specific products, companies, and practices, we’ve been able to condemn specific problems. We speak out against subscription fees and paywalls54 and e-book publishers’ give and take of online book collections.55 But scrutinizing specific products precludes a holistic critique of library vendors. When we step back and view our vendors as a class, we can see a large-scale issue that threatens the future of our profession: all of the world’s information is being consolidated by several gigantic data corporations. We must consider how vendors becoming “technology, content, and analytics” businesses56 threatens the daily work of libraries and the privacy of those we serve.
Even as library privacy is threatened by vendors, librarians’ abilities to influence vendors’ privacy practices are decreasing as publishing companies change their business models. Publishing and data companies’ new data products and new, non-library-based data access points (including websites and apps) have created scores of new, non-library customers. Our vendors depend less on library customers as they diversify their customer base and recognize that they can sell directly to researchers without relying on library gatekeepers. In the last decade, Thomson Reuters has been criticized for trying to work around law librarians. The company even issued a controversial ad saying that patrons on a first name basis with their librarians are “spending too much time at the library” when they should use Westlaw from their offices instead.57 Through anti-competitive pricing schemes and sales practices, Lexis has similarly demonstrated its decreasing consideration of librarians in its marketing and sales plans.58 Librarians and their needs are getting pushed towards the back of the customer service queue. Declining library-vendor relations59 decrease librarians’ opportunities to participate in vendor decision making.
Librarians cannot count on government intervention to protect library privacy in the digital age. While most states officially recognize and regulate library privacy60, the information capitalism that incentivizes data brokering has gone largely unchecked. Federal and state governments do little to regulate information capitalism. The Federal Trade Commission has tried to break RELX Group’s monopoly on data brokering61, but there is no comprehensive regulatory scheme in place to prevent the consolidation of information by several private entities or the unauthorized sale of personal data to law enforcement. Without regulation, library professionals are left to deal with vendors who flout privacy best practices and threaten patron privacy. Librarians should not be responsible for fixing vendor privacy practices. Instead, they should condemn them.
Solutions: Organizing Against Library Surveillance
While librarians’ relationships with their vendors may be changing, librarians still wield power as information consumers. Librarians can organize to 1) demand accountability from our vendors, and 2) insist on transparency to ensure that vendors comply with our ethics.
There are two major privacy issues raised by data brokers working as library vendors, and librarians can organize around both. The first issue is that the money libraries pay for products helps vendors develop surveillance products. The second issue is that the data that patrons provide vendors while using their products in libraries could be sold to law enforcement. These are two discrete problems that impact patron privacy, and vendors should be prepared to address both with librarians. Either one, left unaddressed, could be the difference between library privacy and libraries as surveillance hubs.
If library products sell our patron data to the government, we are essentially inviting surveillance into our libraries. When libraries pay data brokering publishing giants to enter their libraries and serve their patrons without ensuring that their patron data will not be included in data brokering products, the government does not even have to ask librarians to track researchers. Government agencies can enter libraries electronically, inserting government surveillance in the Trojan horse of online research tools. Or they can buy the data collected by the information companies, like ICE does with Thomson Reuters and RELX Group.
If libraries are funding the research and development of surveillance products with their subscription fees, libraries are spending money, often provided by patrons’ membership fees or taxes, on companies that use the income to build surveillance infrastructure that surveils people and communities that may include library patrons. For instance, in law librarianship, law libraries collectively pay millions of dollars for Lexis and Westlaw each year. According to Thomson Reuters and RELX Group’s annual reports, that money is not kept in a separate pool of profits. It ostensibly funds their growing technology labs that create data analytics products and helps the companies afford scores of private data caches sold by smaller data brokering services. Especially in the post-9/11 surveillance regime, information vendors have been fighting for spots in the booming surveillance data markets.62
Publishers like RELX Group are experts at cornering information markets. They’ve already bought the lion’s share of our academic publishing resources63, from products where scholars incubate their research to the journals that publish the research after peer review, and even the post-publication “research evaluation” products and archives. The companies cash in at every step of academic research, profiting off of academics’ free labor.64 Thomson Reuters and Reed Elsevier are similarly cornering the legal information market. Beyond owning legal research products, they’re selling the surveillance products that help law enforcement track, detain, and charge people with crimes. When those swept up in law enforcement surveillance inevitably need lawyers, the lawyers use Westlaw and Lexis to represent them. The publishing companies transform legal research profits into products that help law enforcement create more criminal and immigration law clients.65
Librarians have the right to demand accountability from vendors about where patron data and subscription fees are being used. As major customers, libraries can demand that the products they purchase maintain their ethical standards. Libraries do not have to sacrifice ethics and privacy norms for corporations like RELX Group and other information capitalists. We can research and learn about our products and their corporate purveyors and consider our privacy and intellectual freedom principles in relation to the things we buy. We should be able to discover what information our products are collecting about our patrons and who, if anyone, is using that personal data. We should also be able to find out what types of products our subscription fees support. Is the money we pay for library services supporting the research and development of police surveillance products? If it is, we should be able to make purchasing decisions with that surveillance relationship in mind.
To facilitate informed purchasing decisions, libraries can demand information about vendors’ practices. Requiring disclosures about our vendors’ research and product infrastructure should be part of doing business with data companies. With more transparency, librarians can assess which products are better at ensuring patron privacy and supporting intellectual freedom. The ethical conundrums raised by these products are multifaceted: Are we risking privacy and breaking our own ethical code? Are we funding unethical supply chains that harm people and violate ethics in the production of their products? If we are betraying the tenets of intellectual freedom, we must divest. Some library patrons, including University of California San Francisco faculty66 and thousands of mathematicians, have already advocated for boycotting and divesting from companies like RELX Group over pricing practices.67 Universities are beginning to drop their Elsevier contracts, and thousands of scholars are protesting Elsevier over the company’s “exorbitantly high prices.”68 Activism around pricing suggests that, rather than relying on corporations with sketchy practices, librarians can support and talk more about alternate companies and startups or create our own resources, open access consortia, and search options as alternatives to companies involved in ICE surveillance. When powerful academic institutions like the University of California divest from RELX Group’s Elsevier products, it shows that large libraries can lead the way in pushing back against problematic vendor practices.
Importantly, holding vendors accountable should happen beyond the confines of library professional organizations, which are largely funded by the very vendors we need to hold accountable. Organizations that usually serve as librarians’ organizing hubs depend so thoroughly on funding from corporate vendors that they are not the best venues for criticizing library products and the corporations that sell them. Although the connections between research products and law enforcement surveillance unearth huge privacy concerns for libraries, professional library organizations are loath to discuss those concerns. Fighting corporate privacy issues may look the same as fighting FBI or other government surveillance to library professionals in their daily work (surveillance is surveillance whether it’s being conducted by the FBI or through RELX Group), but our professional organizations treat corporate and government practices very differently.
Historically, library organizations have fought alongside librarians against government surveillance in libraries. The American Library Association (ALA) has protested government surveillance in libraries, decrying the PATRIOT Act’s Sections 215 and 505, provisions that give the federal government sweeping authority to surveil people and obtain peoples’ library records.69 In fact, ALA and its members’ protests were so persistent that FBI agents called librarians “radical” and “militant,” and U.S. Attorney General John Ashcroft decried librarians as “hysterical.”70 ALA pushed back, partnering with the American Civil Liberties Union (ACLU) to deploy anonymous browsing tools and other resources to protect library patrons’ privacy.71
Library organizations’ reactions to corporate surveillance, so far, have been much different. A blog post about library privacy and research vendors’ participation in ICE surveillance titled “LexisNexis’s Role in ICE Surveillance and Librarian Ethics” was taken down from the American Association of Law Libraries (AALL) website within minutes of being posted, replaced by a message stating: “This post has been removed on the advice of AALL General Counsel.”72 While professional library organizations are comfortable standing up to the government when it threatens library patron privacy, the same organizations are not prepared to stand up to library vendors for the same privacy invasions.
There are several reasons for the disparate ways library organizations react to government surveillance versus vendor surveillance. The main rationale offered by AALL when it removed the blog post critiquing legal research vendors was that vendors are equal members in the organization and that the critique of their relationships with ICE amounted to “collective member actions” that raise antitrust issues. This rationale is nonsensical, implying that librarians voicing concerns about Thomson Reuters’ and RELX Group’s ICE contracts is akin to a group boycott designed to stifle competition among legal research vendors.73 This improbable excuse was likely a smokescreen designed to stop AALL members from potentially upsetting the organization’s key donors. AALL relies on Thomson Reuters and RELX Group to sponsor its activities and scholarship programs. When library vendors are middlemen between library patrons and government surveillance, librarians may be prohibited from critiquing vendor practices in professional organizations’ forums.
The next wave of privacy concerns will come from our vendors and information sources, and they will require librarians to organize resistance outside of their professional organizations. As we begin this organizing work, we should keep track of the ways our vendors are changing and what that means for our ethical standards. This article focuses on surveillance, but it’s not the only issue that arises when publishers become data corporations. Librarians must either drop our privacy pretenses or create privacy policies that push back against information capitalism and data barons. Privacy is a new supply chain ethics problem, and librarians are stuck in its wake as major information technology purchasers and providers, promoters and gatekeepers. Privacy settings in digital products should be the default.74 Unfortunately, privacy defaults are aspirational, and largely unimplemented. When dealing with information corporations hungry for data to put on their warehouse shelves, for bundling and selling to new customers, librarians can make it clear that the surveillance work these companies do is forbidden in our stacks.
The author would like to thank Kellee Warren, Scott Young, and Ian Beilin for their thoughtful edits and for sagely shepherding this article through the peer review process. She would also like to thank Yasmin Sokkar Harker, Nicole Dyszlewski, Julie Krishnaswami, Rebecca Fordon, and the many other law librarians who have offered feedback, advice, and support throughout this research process.
The author also recognizes and applauds the critical work and purpose of In the Library with the Lead Pipe. Its role as an open access, peer reviewed library journal that supports creative solutions for major library issues makes the publication a vital part of our profession. The volunteer efforts of those who take on the challenge to “improve libraries, professional organizations, and their communities of practice by exploring new ideas, starting conversations, documenting our concerns, and arguing for solutions” are necessary for our sustenance and growth as information specialists and make discussions like the one in this article possible.
Davis, Caroline. (2019) Print Cultures: A Reader in Theory and Practice. Red Globe Press.
Dixon, Pam. (2008) “Ethical Issues Implicit in Library Authentication and Access Management: Risks and Best Practices.” Journal of Library Administration 47:3-4, 141-162.
Dooley, Jim. (2016) “University of California, Merced: Primarily an Electronic Library.” In Suzanne M. Ward et al., eds., Academic E-Books: Publisher, Librarians, and Users, 93-106. Purdue University Press.
Dunie, Matt. (2015) “Negotiating With Content Vendors: An Art or A Science?.” E-content Quarterly 1:4.
Kulp, Patrick. (2018) “Here’s How Publishers Are Opening Their Data Toolkits to Advertisers.” AdWeek (May 29, 2018), https://www.adweek.com/digital/heres-how-publishers-are-opening-their-data-science-toolkits-to-advertisers/.
Libraries are trending towards digitized collections. For instance, University of California’s Merced campus transitioned to a 90% digital library according to its 2003 development plans. Jim Dooley, “University of California, Merced: Primarily an Electronic Library,” in Academic E-Books: Publisher, Librarians, and Users 93-106 (Suzanne M. Ward, et al. eds. 2016).
April Lambert, et al., “Library patron privacy in jeopardy: an analysis of the privacy policies of digital content vendors,” Proceedings of the Association for Information Science and Technology (February 24, 2016).
The phenomenon of library services consolidation is not new, but it has increased as library services move to online platforms. See Carolyn E. Lipscomb, “Mergers in the Publishing Industry,” Bulletin of the Medical Library Association (2001), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC34566/. Consolidation among vendors has changed the way libraries approach collection development and acquisition, pushing librarians from an a la carte model, where librarians pick their collection based on specific needs and titles, to a “big deal” model, where librarians buy huge bundles of information from only several publishers that own the lion’s share of library materials and user platforms. See Peggy Johnson, Fundamentals of Collection Development and Management 10-11 (4th ed. 2018).
Westlaw and Lexis are the go-to digital collections and research tools of law librarianship, Barbara Bintliff, et al., Fundamentals of Legal Research, Tenth Edition (April 7, 2015).
Albert Opher, et al., The Rise of the Data Economy: Driving Value Through Internet of Things Data Monetization (2014), https://hosteddocs.ittoolbox.com/rise_data_econ.pdf; Mike Michael & Deborah Lupton, Toward a Manifesto for the ‘Public Understanding of Big Data’, Public Understanding of Science (2015).
Andrew Guthrie Ferguson, The Rise of Big Data Policing: Surveillance, Race, and the Future of Law Enforcement (2018).
In 2016, the Bureau of Labor Statistics reported on the bleak employment outlook in the traditional publishing industry, showing that employment in the book, traditional news, and periodical industry has declined since 1990, as employment in online and movie industries soared. Bureau of Labor Statistics, Employment Trends in Newspaper Publishing and Other Media (2016), https://www.bls.gov/opub/ted/2016/employment-trends-in-newspaper-publishing-and-other-media-1990-2016.htm. Meanwhile, open access has become an increasingly normalized part of our information ecosystem. See Dan Pollock & Ann Michael, “Open Access Mythbusting: Testing Two Prevailing Assumptions About the Effects of Open Access Adoption,” The Association of Learned & Professional Society Publishers (January 2019).
The concept of the “data warehouse” was originally conceived by computer scientist Bill Inmon. He envisioned data warehouses as centralized storage for large collections of data integrated from various sources. Bill Inmon, Building the Data Warehouse (4th ed. 2005).
Matt Dunie, “Negotiating With Content Vendors: An Art or A Science?,” E-content in Libraries, A Marketplace Perspective (Sue Polanka, ed.), Library Technology Reports, ALA. This writing describes how libraries struggle to cover the rising cost of data bundles with decreasing budgets. Libraries must pay for the collections that patrons require by decreasing spending in other categories, like personnel and even numbers of library branches. Yet, vendors have discovered that their content is so critical that, despite rising prices, libraries continue to acquire content at the same rate. The reliance of libraries on vendor content gives vendors the leverage to set prices ever higher.
Gabe Ignatow, “Information Capitalism” (2017)
Tessa Morris-Suzuki describes the ways the growth of the “information economy” limits access to freely available information, placing once-accessible research and reporting behind paywalls and monetizing information that used to be considered a public good. Morris-Suzuki identifies libraries as the former hub for free information “paid for by society as a whole”, and describes how the commodification of information alters both the concept of libraries as spaces where information is not commodified, and also libraries’ access to information collections. Tessa Morris-Suzuki, “Capitalism in the Computer Age,” The New Left Review (1986), https://newleftreview.org/issues/I160/articles/tessa-morris-suzuki-capitalism-in-the-computer-age
See Michael Gressel, “Are Libraries Doing Enough to Safeguard Their Patrons’ Digital Privacy?,” 67 The Serials Librarian 137 (2014). Librarians are tasked with educating patrons about digital privacy hygiene and ensuring that every public access computer in their libraries is properly set up to protect patrons against software practices that violate privacy. However, it is not the responsibility of librarians to be masters of digital privacy; rather, corporations should be held accountable and made to design their products in a way that protects everyone, including library patrons. See Sarah Lamdan, “Social Media Privacy: A Rallying Cry to Librarians,” 85 The Library Quarterly (July 2015).
Most modern information is born-digital, and librarians are pivoting from paper collections to online collection curation/building/digitization, John Palfrey & Urs Gasser, Born Digital: Understanding the First Generation of Digital Natives (2010).
For instance, Safiya Umoja Noble warns about bias in search algorithms in her book, Algorithms of Oppression: How Search Engines Reinforce Racism, and Sarah T. Roberts writes about how social media moderation, behind the scenes, takes an emotional toll on its workers in Behind the Screen: Content Moderation in the Shadows of Social Media.
We protest individual contracting schemes by various vendors but we do not examine information capitalism as its own structure.
In 2008, the Federal Trade Commission ordered Reed Elsevier to divest part of its ChoicePoint acquisition to Thomson Reuters to ensure competition between the two data brokers, but no overarching law or particular action has broken up the data companies’ duopoly or significantly regulated privacy in the data broker industry. See Federal Trade Commission, “FTC Challenges Reed Elsevier’s Proposed $4.1 Billion Acquisition of ChoicePoint, Inc.” (September 16, 2008), https://www.ftc.gov/news-events/press-releases/2008/09/ftc-challenges-reed-elseviers-proposed-41-billion-acquisition
Sarah (@snewyuen) is an advocate for open, accessible, and secure technologies. While studying as a Master of Library and Information Science candidate at the University of Washington iSchool, she is expressing research interests through a few gigs: Project Coordinator for Preserve This Podcast at METRO, Assistant Research Scientist for Investigating & Archiving the Scholarly Git Experience at NYU Libraries, Instructional Design Technologist at CUNY City Tech Open Education Resources Program, and Archivist for the Dance Heritage Coalition/Mark Morris Dance Group. Offline, she can be found riding a Cannondale mtb or practicing movement through dance.
DLF Forum 2019 took place during a lucky (if climate-questionable) pocket after hurricane season had passed, and sunny Tampa, FL welcomed us onto the Seminole and Tocobaga lands—it reminded me of the Saved by the Bell: Palm Springs Weekend (1991) set.
The view from outside of the DLF Forum Conference Hotel. Taken by me.
I joined my first Forum as a DLF Student & New Professional Fellow, and I left with new perspectives to explore. Topics ranged widely, but three sessions encompassed a theme that stood out to me, which I will highlight in this #DLFForum recap post.
As a current MLIS student, the questions that keep me up at night are: where do I fit in, what can I make of this, and what can the field offer me when I graduate? For me, the week started off with the tone that Dr. Marisa Duarte set during her opening plenary, “Beautiful Data: Justice, Code, and Architectures of the Sublime” (shared notes). She told us about her journey in and out of LIS and about her advocacy on algorithmically-delivered data, its effects on society and justice, and how libraries have the knowledge and social power to do “reflexive justice work”. Duarte emphasized how we, as trained and/or degreed information professionals, can educate our diverse population about digital biases and spread awareness on how to approach today’s “pan-capitalist algorithmic domination.” Duarte’s keynote went beyond basic proactive equity, diversity, and inclusion (EDI) efforts, showing that exposing the truth of current digital environments is an intentional practice that will not happen overnight. Her realistic but optimistic attitude toward digital librarianship gave me a better idea of how I’d like to represent and carry my digital library practices into the future.
Later that morning, I attended “I’m not an archivist, but…”: Working “Archives-Adjacent” to Transcend Traditional Library Roles” (slides & #m1c tweets), a panel of trained archivists, Dinah Handel, Monique Lassere, Jenna Freedman, Mary Kidd, and Stefanie Ramsay, who do not touch the actual objects but keep the archives’ other operational lifelines well-oiled: systems and operation coordination, digitization service management, digital project librarianship, and communication (respectively). This panel brought me back to Duarte’s message about the power that librarians have in conveying accessible digital infrastructures amongst Big Tech products, or in this panel’s case, within the bureaucracy of the ivory towers. It’s not just about building an unbiased database, but also about receiving proper credit for the undefined, immeasurable, extra-mile work they contribute to the archives. There’s a person behind that database infrastructure and that GitHub pull request, and those people need to be acknowledged to the same equitable standards as those in traditional archives positions (Mirza & Seale, 2017). I would be remiss not to mention that the interactive slides were the best, which further demonstrates how much creative care these archives-adjacent colleagues put into their daily work:
A slide with a spinning card from Mary Kidd’s presentation. Taken by me.
On the last morning, Chela Scott-Weber’s portion of #w1d: “The Story Disrupted: Memory Institutions and Born Digital Collecting” (shared notes & #w1d tweets) resonated with me. She emphasized “we have been and are collecting reactively within systems of traditionally white institutions. Consider power models, think less about format specific collections and build proactive, collaborative practices around community, people, and phenomena.” Her push for EDI representation in born-digital collections made me proud to be starting my professional experiences with projects centered around the community of born-digital creators, instead of putting collectors’ desires for admiration first.
This post was written by Doyin Adenuga, who received a Focus Fellowship to attend this year’s DLF Forum.
Doyin has an MLIS degree from the University of British Columbia (UBC), Canada, and has been a librarian for four years. He spent ten years in computer/systems support in Nigeria, as well as five years as technology coordinator/assistant editor at a textbook publishing center at the University of Wisconsin-Madison, where he became interested in providing support to users of information. He is interested primarily in the world of digital librarianship. Doyin started his library career during an iSchool @ UBC co-op term as a DSpace Cataloguer, and is currently the Electronic Resources Librarian at Houghton College, where he has installed and maintains a DSpace system. The public interface of DSpace @ Houghton College only provides searching and browsing of its multiple collections, while other features of this institutional repository system are disabled. You can find him on Twitter at @nugadoy.
I thank the 2019 Digital Library Federation (DLF) Forum committee for the honor of being selected as one of the fellows and being fully sponsored to attend the Forum. I have attended a number of library meetings and conferences, but none focused on digital libraries, and I appreciate the opportunity. Though the DLF Forum brings together digital library professionals specifically, I would encourage everyone in any area of the information professions to attend at least one DLF Forum.
The two areas that drew my interest even before the forum were sessions on open source repository solutions (other than DSpace) and privacy concerns with digital collections. I attended the session “Samvera – Sustainable Digital Repository Solutions & Community of Practice” (https://dlfforum2019.sched.com/event/S2U7/m1d-samvera-sustainable-digital-repository-solutions-community-of-practice) and was intrigued by what Samvera has to offer, such as its digital object viewer option. The need to digitize institutional records in service of “advancing research, learning, social justice, & the public good” (https://www.diglib.org/about/) keeps increasing, and having options for open digital library solutions will surely help. The presenters rightly said that “no one system fits all,” and given the limited time for the presentation, I came away interested in investigating Samvera further.
As much work as goes into setting up a digital repository system, even more sustained attention is needed for the ongoing privacy concerns around its digital contents. The three presentations at the “privacy” (https://dlfforum2019.sched.com/event/S2X4/w1c-privacy) session gave practical examples of privacy concerns and how the issues that arose were handled professionally. I would summarize this session into three areas: pre-archiving privacy concerns and the institutional workflows intended to minimize them, post-archiving scenarios and the actions taken to resolve them, and privacy pedagogy at the institutional level.
Finally, the Forum gave me the opportunity to connect with other information professionals, some of the DLF working groups, and a locally based digital humanities group, which I otherwise might not have known existed, and I am really grateful.
"Scientific journals still disseminate our work, but in the Internet-connected world of the 21st century, this is no longer their critical function. Journals remain relevant almost entirely because they provide a playing field for scientific and professional competition: to claim credit for a discovery, we publish it in a peer-reviewed journal; to get a job in academia or money to run a lab, we present these published papers to universities and funding agencies. Publishing is so embedded in the practice of science that whoever controls the journals controls access to the entire profession."
My only criticisms are a lack of cynicism about the perks publishers distribute:
They pay no attention to the role of librarians, who after all actually "negotiate" with the publishers and sign the checks.
we work for them for free in producing the work, reviewing it, and serving on their editorial boards
We have spoken with someone who used to manage top journals for a major publisher. His internal margins were north of 90%, and the single biggest expense was the care and feeding of the editorial board.
The Library and Information Technology Association (LITA), a division of the American Library Association (ALA), is pleased to offer an award for the best unpublished manuscript submitted by a student or students enrolled in an ALA-accredited graduate program. Sponsored by LITA and Ex Libris, the award consists of $1,000, publication in LITA’s refereed journal, Information Technology and Libraries (ITAL), and a certificate. The deadline for submission of the manuscript is February 28, 2020.
The award recognizes superior student writing and is intended to enhance the professional development of students. The manuscript can be written on any aspect of libraries and information technology. Examples include, but are not limited to, digital libraries, metadata, authorization and authentication, electronic journals and electronic publishing, open source software, distributed systems and networks, computer security, intellectual property rights, technical standards, desktop applications, online catalogs and bibliographic systems, universal access to technology, and library consortia.
The Library and Information Technology Association (LITA) is the leading organization reaching out across types of libraries to provide education and services for a broad membership of nearly 2,400 systems librarians, library technologists, library administrators, library schools, vendors, and many others interested in leading edge technology and applications for librarians and information providers. LITA is a division of the American Library Association. Follow us on our Blog, Facebook, or Twitter.
About Ex Libris Ex Libris, a ProQuest company, is a leading global provider of cloud-based solutions for higher education. Offering SaaS products for the management and discovery of the full spectrum of library and scholarly materials, as well as mobile campus solutions driving student engagement and success, Ex Libris serves thousands of customers in 90 countries. For more information about Ex Libris, see our website, and join us on Facebook, YouTube, LinkedIn, and Twitter.
The Association of Moving Image Archivists (AMIA) and DLF will be sending Marlo Longley to attend the 2019 DLF/AMIA Hack Day and AMIA conference in Baltimore, Maryland! During the event, Marlo will collaborate on projects with other attendees to develop solutions for digital audiovisual preservation and access.
About the Awardee
Marlo Longley is a Digital Repository Developer for the Metropolitan New York Library Council (METRO). At METRO he is working on the 2020 release of Archipelago, an open source repository system committed to flexible metadata. Last year he helped build Canyon Cinema’s new catalog search and website, and got interested in technical issues facing moving image archives from there. Marlo is based in Oakland, CA.
About Hack Day and the Award
The sixth AMIA+DLF Hack Day (November 13 at the Renaissance Baltimore Harborplace) will be a unique opportunity for practitioners and managers of digital audiovisual collections to join with developers and engineers for an intense day of collaboration to develop solutions for digital audiovisual preservation and access.
The goal of the AMIA + DLF Award is to bring “cross-pollinators”–developers and software engineers who can provide unique perspectives to moving image and sound archivists’ work with digital materials, share a vision of the library world from their perspective, and enrich the Hack Day event–to the conference.
In Harvard's Nuremberg Trials Project, being able to link to cited documents in each trial's transcript is a key feature of site navigation. Each document submitted into evidence by prosecution and defense lawyers is introduced in the transcript and discussed, and the site user is offered the possibility at each document mention to click open the document and view its contents and attendant metadata. While document references generally follow various standard patterns, deviations from the pattern large and small are numerous, and correctly identifying the type of document reference – is this a prosecution or defense exhibit, for example – can be quite tricky, often requiring teasing out contextual clues.
While manual linkage is highly accurate, it becomes infeasible over a corpus of 153,000 transcript pages and more than 100,000 document references to manually tag and classify each mention of a document, whether it be a prosecution or defense trial exhibit, or a source document from which the former were often chosen. Automated approaches offer the most likely promise of a scalable solution, with strategic, manual, final-mile workflows responsible for cleanup and optimization.
Initial prototyping by Harvard of automated document reference capture focused on the use of pattern matching in regular expressions. Targeting only the most frequently found patterns in the corpus, Harvard was able to extract more than 50,000 highly reliable references. While continuing with this strategy could have found significantly more references, it was not clear that once identified, a document reference could be accurately typed without manual input.
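In spirit, that regular-expression stage might look like the sketch below. The pattern is a hypothetical stand-in; the actual expressions Harvard used, and the full range of reference formats in the transcripts, are not reproduced here.

```python
import re

# Hypothetical pattern, not Harvard's actual one: matches references such as
# "Exhibit 435", "Exhibit No. 435", or "Document NO-123".
REFERENCE_RE = re.compile(
    r"\b(?:Exhibit|Document)\s+(?:No\.\s*)?(?:[A-Z]{1,4}-)?\d+\b"
)

def find_references(page_text):
    """Return every document-reference string found on one transcript page."""
    return [m.group(0) for m in REFERENCE_RE.finditer(page_text)]

page = "I offer Exhibit 435 in evidence, together with Document NO-123."
print(find_references(page))  # ['Exhibit 435', 'Document NO-123']
```

A pattern like this can find well-formed references cheaply and reliably, but, as noted above, it says nothing about whether a given reference is a prosecution or defense exhibit.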
At this point Harvard connected with Tolstoy, a natural language processing (NLP) AI startup, to ferret out the rest of the tags and identify them by type. Employing a combination of machine learning and rule-based pattern matching, Tolstoy was able to extract and classify the bulk of remaining document references.
Background on Machine Learning
Machine learning is a comprehensive branch of artificial intelligence. It is, essentially, statistics on steroids. Working from a “training set” – a set of human-labeled examples – a machine learning algorithm identifies patterns in the data that allow it to make predictions. For example, a model that is supplied many labeled pictures of cats and dogs will eventually find features of the cat images that correlate with the label “cat,” and likewise, for “dog.” Broadly speaking, the same formula is used by self-driving cars learning how to respond to traffic signs, pedestrians, and other moving objects.
In Harvard’s case, a model was needed that could learn to extract and classify, using a labeled training set, document references in the court transcripts. To enable this, one of the main features used was surrounding context, including possible trigger words that can be used to determine whether a given trial exhibit was submitted by the prosecution or defense. To be most useful, the classifier needed to be very accurate (correctly labeled as either prosecution or defense), precise (minimal false positives), and have a high recall (few missing references).
The first step in any machine learning project is to produce a thorough, unbiased training set. Since Harvard staff had already identified 53,000 verified references, Tolstoy used that, along with an additional set generated using more precise heuristics, to train a baseline model.
The model is the predictive algorithm. There are many different families of models a data scientist can choose from. For example, one might use a support vector machine (SVM) if there are fewer examples than features, a convolutional neural net (CNN) for images, or a recurrent neural net (RNN) for processing long passages requiring memory. That said, the model is only a part of the entire data processing pipeline, which includes data pre-processing (cleaning), feature engineering, and post-processing.
Here, Tolstoy used a "random forest" algorithm. This method uses a series of decision-tree classifiers with nodes, or branches, representing points at which the training data is subdivided based on feature characteristics. The random forest classifier aggregates the final decisions of a suite of decision trees, predicting the class most often output by the trees. The entire process is randomized as each tree selects a random subset of the training data and random subset of features to use for each node.
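To make the aggregation concrete, here is a from-scratch toy in which each "tree" is a single-feature decision stump fit on a bootstrap sample. The features, labels, and data are invented for illustration; the production model was Tolstoy's, not this sketch.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Majority label for each observed value of one binary feature."""
    votes = {0: Counter(), 1: Counter()}
    for row, label in zip(X, y):
        votes[row[feature]][label] += 1
    return {v: c.most_common(1)[0][0] for v, c in votes.items() if c}

def fit_forest(X, y, n_trees=25, seed=0):
    """Fit n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]   # bootstrap sample
        feature = rng.randrange(len(X[0]))         # random feature choice
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feature)
        forest.append((feature, stump))
    return forest

def predict(forest, row, default="prosecution"):
    """Aggregate the trees' votes; unseen feature values fall back to default."""
    ballots = Counter(stump.get(row[f], default) for f, stump in forest)
    return ballots.most_common(1)[0][0]

# Invented binary features per reference:
# [prosecutor named nearby, defense counsel named nearby]
X = [[1, 0], [1, 0], [0, 1], [0, 1]]
y = ["prosecution", "prosecution", "defense", "defense"]
forest = fit_forest(X, y)
print(predict(forest, [1, 0]))  # prints "prosecution"
```

Randomizing both the sample and the feature per tree is what de-correlates the trees, so their aggregated vote is more robust than any single tree's decision.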
Models work best when they are trained on the right features of the data. Feature engineering is the process by which one chooses the most predictive parts of available training data. For example, predicting the price of a house might take into account features such as the square footage, location, age, amenities, recent remodeling, etc.
In this case, we needed to predict the type of document reference involved: was it a prosecution or defense trial exhibit? The exact same sequence of characters, say "Exhibit 435," could be either defense or prosecution, depending on – among other things – the speaker and how they introduced it. Tolstoy used features such as the speaker, the presence or absence of prosecution or defense attorneys' names (or that of the defendant), and the presence or absence of country name abbreviations to classify the references.
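As a hedged sketch, turning the surrounding context into such a feature vector might look like the following; the cue phrases and names here are invented stand-ins, not the actual signals Tolstoy used.

```python
import re

# Invented cue lists for illustration only.
PROSECUTION_CUES = {"prosecution", "mr. dodd"}
DEFENSE_CUES = {"defense", "dr. seidl", "on behalf of the defendant"}

def featurize(context):
    """Turn the text around a reference into a binary feature vector:
    [prosecution cue present, defense cue present, country abbreviation present]."""
    text = context.lower()
    return [
        int(any(cue in text for cue in PROSECUTION_CUES)),
        int(any(cue in text for cue in DEFENSE_CUES)),
        # Crude proxy for country-name abbreviations such as "USA-435".
        int(bool(re.search(r"\b[A-Z]{2,4}-\d", context))),
    ]

print(featurize("MR. DODD: I offer Exhibit USA-435 for the prosecution."))
# [1, 0, 1]
```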
Machine learning is a great tool in a predictive pipeline, but in order to gain very high accuracy and recall rates, one often needs to combine it with heuristics-based methods as well. For example, in the transcripts, phrases like “submitted under” or “offered under” may precede a document reference. These phrases were used to catch references that had previously been missed. Other post-processing included catching and removing tags from false positives, such as years (e.g. “January 1946”) or descriptions (e.g. “300 Germans”). These techniques allowed us to preserve high precision while maximizing recall.
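Rules like these are easy to express as patterns. The expressions below are illustrative guesses at the kind of false-positive filter described, not the project's actual rules.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
FALSE_POSITIVE_RE = re.compile(
    r"\b(?:" + MONTHS + r")\s+\d{4}\b"   # dates such as "January 1946"
    r"|\b\d+\s+[A-Z][a-z]+s\b"           # counts such as "300 Germans"
)

def is_false_positive(candidate):
    """True if a captured string looks like a date or a count, not a reference."""
    return bool(FALSE_POSITIVE_RE.search(candidate))

print(is_false_positive("January 1946"))  # True
print(is_false_positive("300 Germans"))   # True
print(is_false_positive("Exhibit 435"))   # False
```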
Collaborative, Iterative Build-out
In the build-out of the data processing pipeline, it was important for both Tolstoy and Harvard to carefully review interim results, identify and discuss error patterns and suggest next-step solutions. Harvard, as a domain expert, was able to quickly spot areas where the model was making errors. These iterations allowed Tolstoy to fine-tune the features used in the model, and amend the patterns used in identifying document references. This involved a workflow of tweaking, testing and feedback, a cycle repeated numerous times until full process maturity was reached. Ultimately, Tolstoy was able to successfully capture more than 130,000 references throughout the 153,000 pages, with percentages in the high 90s for accuracy and low 90s for recall. After final data filtering and tuning at Harvard, these results will form the basis for the key feature enabling interlinkage between the two major data domains of the Nuremberg Trials Project: the transcripts and evidentiary documents. Working together with Tolstoy and machine learning has significantly reduced the resources and time otherwise required to do this work.
The Caselaw Access Project makes 360 years of U.S. case law available as a machine-readable text corpus. In developing a research community around the dataset, we’ve been creating and sharing resources for getting started.
The Frictionless Data for Reproducible Research Fellows Programme is training early career researchers to become champions of the Frictionless Data tools and approaches in their field. Fellows will learn about Frictionless Data, including how to use Frictionless Data tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content.
Hello there! My name is Monica Granados and I am a food-web ecologist, science communicator and a champion of open science. There are not too many times or places in my life where it is so easy to demarcate a “before” and an “after.” In 2014, I travelled to Raleigh, North Carolina to attend the Open Science for Synthesis (OSS) course co-facilitated by the National Centre for Ecological Synthesis and Analysis and the Renaissance Computing Institute.
I was there to learn more about the R statistical programming language to aid my quest for a PhD. At the conclusion of the course I did come home with more knowledge about R and programming but what I couldn’t stop thinking about was what I learned about open science. I came home a different scientist, truth be told a different person. You see at OSS I learned that there was a different way to do science – an approach so diametrically opposite to what I had been taught in my five years in graduate school. Instead of hoarding data and publishing behind paywalls, open science asks – wouldn’t science be better if our data, methods, publications and communications were open?
When I returned from Raleigh, I uploaded all of my data to GitHub and sought out open access options for my publications. Before OSS I was simply interested in contributing my little piece to science, but after OSS I dedicated my career to the open science movement. In the years since OSS, I have made all my code, data and publications open and I have delivered workshops and designed courses for others to work in the open. I now run a not-for-profit that teaches researchers how to do peer review using open access preprints and I am a policy analyst working on open science at Environment and Climate Change Canada. I wanted to become a Frictionless Data fellow because open science is continually evolving, and I wanted to learn more about reproducible research. When research is reproducible, it is more accessible, and that sets off a chain reaction of beneficial consequences. Open data, methods and publications mean that if you are interested in knowing more about the course of treatment your doctor prescribed, or you are a doctor in the midst of an outbreak searching for the latest data on the epidemic, or perhaps a decision maker looking for guidance on what habitat to protect, this information is available to you. Easily, quickly and free of charge.
I am looking forward to building some training materials and data packages to make it easier for scientists to work in the open through the Frictionless Data fellowship. And I look forward to updating you on my and my fellow fellows’ progress.
More on Frictionless Data
The Fellows programme is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. Frictionless Data’s other current projects include the Tool Fund, in which four grantees are developing open source tooling for reproducible research. The Fellows programme will be running until June 2020, and we will post updates to the programme as they progress.
We’re continuing to add serial information to the Deep Backfile project that I announced here last month. I’m adding some of the existing information in The Online Books Page serials listings and our first serial renewals listings that hadn’t initially been linked in when I made the first announcement. I’ve added journals with deep backfiles from a couple more publishers (Oxford and Cambridge). I’ve started adding some new information on a few journals that I’ve heard people be interested in. And I’ve heard from some librarians who are interested in contributing more information, which I welcome, since there are a lot of journals with information still to fill in.
But we needn’t stop with librarians and journals. I’ve seen many kinds of serials written about online that potentially have public domain content, or that otherwise offer free online issues. Many of them have articles about them in Wikipedia, sometimes a short summary stub, and sometimes a more extensive write-up. I’m most familiar with English Wikipedia, the largest and oldest edition, and recently wondered how many of its serials had free online issues or were old enough to potentially have public domain issues. So I decided to answer that question by building a table for that set of serials.
It turns out to be a very big table: over 10,000 serials with English Wikipedia articles that have free or potentially public domain content. That’s bigger than the combination of all the other publisher and provider tables I currently link to from the Deep Backfile page. There are lots of serials in it with no copyright or free issue information available, and it would take any single person a very long time to find such information, verify it, and fill it in.
But I think it’s still useful in its current state. You can use it to find out about a lot of public domain and open access serials you’ve probably never heard of, as well as many that you have. You can click through serial titles to see their Wikipedia articles, and improve on them if you have more information. You can click on their Wikidata IDs to see and add to their metadata. (As you can see from the relatively small number of end dates shown in the “coverage” column, there is limited information currently in Wikidata for many of the serials.) You can see what we know about their copyrights, and about free online issue availability, and follow the “Contact us” links if you want to contribute more information about either of those. (Last month’s post included instructions on how to research serial copyrights. Links to the two main resources you need to research them – the first renewals listing and the Copyright Office database – are now provided directly from the form you get to when you select a “Contact us” link.) And when a new English Wikipedia article and Wikidata entry on a serial gets added that shows it was published before 1964, it will be automatically added to this table the next time we generate it.
Whether you’re a Wikipedian, a librarian, or just a reader interested in journals, magazines, newspapers, comics, or other serials, I hope you find this information useful, and I invite you to help fill it in as your interests and time permit. Let me know or comment here if you have any questions, comments, or suggestions.
The Distant Reader takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process.
The Distant Reader empowers one to use & understand large amounts of textual information both quickly & easily. For example, the Distant Reader can consume an entire issue of a scholarly journal, the complete works of a given author, or the content found at the other end of an arbitrarily long list of URLs. Thus, the Distant Reader is akin to a book’s table-of-contents or back-of-the-book index but at scale. It simplifies the process of identifying trends & anomalies in a corpus, and then it enables a person to further investigate those trends & anomalies.
The Distant Reader is designed to “read” everything from a single item to a corpus of thousands of items. It is intended for the undergraduate student who wants to read the whole of their course work in a given class, the graduate student who needs to read hundreds (thousands) of items for their thesis or dissertation, the scientist who wants to review the literature, or the humanist who wants to characterize a genre.
How it works
The Distant Reader takes five different forms of input:
a URL – good for blogs, single journal articles, or long reports
a list of URLs – the most scalable, but creating the list can be problematic
a file – good for that long PDF document on your computer
a zip file – the zip file can contain just about any number of files from your computer
a zip file plus a metadata file – with the metadata file, the reader’s analysis is more complete
Once the input is provided, the Distant Reader creates a cache — a collection of all the desired content. This is done via the input or by crawling the ‘Net. Once the cache is collected, each & every document is transformed into plain text, and along the way basic bibliographic information is extracted. The next step is analysis against the plain text. This includes rudimentary counts & tabulations of ngrams, the computation of readability scores & keywords, basic topic modeling, parts-of-speech & named entity extraction, summarization, and the creation of a semantic index. All of these analyses are manifested as tab-delimited files and distilled into a single relational database file. After the analysis is complete, two reports are generated: 1) a simple plain text file which is very tabular, and 2) a set of HTML files which are more narrative and graphical. Finally, everything that has been accumulated & generated is compressed into a single zip file for downloading. This zip file is affectionately called a “study carrel”. It is completely self-contained and includes all of the data necessary for more in-depth analysis.
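As a flavor of the counts & tabulations step, here is a minimal ngram counter emitting tab-delimited output, the sort of file format described above. It is a simplification for illustration, not the Distant Reader's actual code.

```python
from collections import Counter

def ngrams(tokens, n):
    """Sliding windows of n adjacent tokens, joined into strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(plain_text, n=2):
    """Count the ngrams in a plain-text document."""
    tokens = plain_text.lower().split()
    return Counter(ngrams(tokens, n))

text = "the quick brown fox jumps over the lazy dog near the quick brown cat"
for gram, freq in count_ngrams(text).most_common(3):
    print(f"{gram}\t{freq}")  # tab-delimited: ngram, count
```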
What it does
The Distant Reader supplements the traditional reading process. It does this in the way of traditional reading apparatus (tables of content, back-of-book indexes, page numbers, etc), but it does it more specifically and at scale.
Put another way, the Distant Reader can answer a myriad of questions about individual items or the corpus as a whole. Such questions are not readily apparent through traditional reading. Examples include but are not limited to:
How big is the corpus, and how does its size compare to other corpora?
How difficult (scholarly) is the corpus?
What words or phrases are used frequently and infrequently?
What statistically significant words characterize the corpus?
Are there latent themes in the corpus, and if so, then what are they and how do they change over both time and place?
How do any latent themes compare to basic characteristics of each item in the corpus (author, genre, date, type, location, etc.)?
What is discussed in the corpus (nouns)?
What actions take place in the corpus (verbs)?
How are those things and actions described (adjectives and adverbs)?
What is the tone or “sentiment” of the corpus?
How are the things represented by nouns, verbs, and adjectives related?
Who is mentioned in the corpus, how frequently, and where?
What places are mentioned in the corpus, how frequently, and where?
People who use the Distant Reader look at the reports it generates, and they often say, “That’s interesting!” This is because it highlights characteristics of the corpus which are not readily apparent. If you were asked what a particular corpus was about or what are the names of people mentioned in the corpus, then you might answer with a couple of sentences or a few names, but with the Distant Reader you would be able to be more thorough with your answer.
The questions outlined above are not necessarily apropos to every student, researcher, or scholar, but the answers to many of these questions will lead to other, more specific questions. Many of those questions can be answered directly or indirectly through further analysis of the structured data provided in the study carrel. For example, each & every feature of each & every sentence of each & every item in the corpus has been saved in a relational database file. By querying the database, the student can extract every sentence with a given word or matching a given grammar to answer a question such as “How was the king described before & after the civil war?” or “How did this paper’s influence change over time?”
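For instance, a query for every sentence mentioning the king might look like the sketch below. The `sentences` table here is hypothetical; the study carrel's real schema is defined by the Distant Reader and may differ.

```python
import sqlite3

# Build a toy in-memory database standing in for a study carrel's database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sentences (item TEXT, sentence TEXT)")
con.executemany(
    "INSERT INTO sentences VALUES (?, ?)",
    [
        ("hobbes", "The king was described as sovereign."),
        ("locke", "Property precedes the king's law."),
        ("hume", "Custom is the great guide of life."),
    ],
)

# Extract every sentence containing a given word.
hits = con.execute(
    "SELECT item, sentence FROM sentences WHERE sentence LIKE ?", ("%king%",)
).fetchall()
for item, sentence in hits:
    print(item, "-", sentence)
```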
A lot of natural language processing requires pre-processing, and the Distant Reader does this work automatically. For example, collections need to be created, and they need to be transformed into plain text. The text will then be evaluated in terms of parts-of-speech and named-entities. Analysis is then done on the results. This analysis may be as simple as the use of a concordance or as complex as the application of machine learning. The Distant Reader “primes the pump” for this sort of work because all the raw data is already in the study carrel. The Distant Reader is not intended to be used alone. It is intended to be used in conjunction with other tools, everything from a plain text editor, to a spreadsheet, to a database, to topic modelers, to classifiers, to visualization tools.
I don’t know about you, but now-a-days I can find plenty of scholarly & authoritative content. My problem is not one of discovery but instead one of comprehension. How do I make sense of all the content I find? The Distant Reader is intended to address this question by making observations against a corpus and providing tools for interpreting the results.
Next year marks the 10th anniversary of Open Data Day! Open Data Day is the annual event where we gather to reach out to new people and build new solutions to issues in our communities using open data.
The next edition will take place on Saturday 7th March 2020.
Over the last decade, this event has evolved from a small group of people in a few cities trying to convince their governments about the value of open data, to a full-grown community of practitioners and activists around the world working on putting data to use for their communities.
Like in previous years, the Open Knowledge Foundation will continue with the mini-grants scheme giving between $200 and $300 USD to support great Open Data Day events across the world, so stay tuned for that.
In the meantime, you can collaborate on the website. opendataday.org is on Github. Pull requests are welcome and we have a bunch of issues we’d love to get through.
If coding is not your thing but you know a language besides English, you can translate the website into your language, or update one of the other nine languages available so far.
If you have started planning your Open Data Day event for next year, the new form to start populating the map will be available soon. You can also connect with others and spread the word about Open Data Day using the #OpenDataDay or #ODD2020 hashtags. Alternatively you can join the Google Group to ask for advice or share tips.
To get inspired, you can read more about everything from this year’s edition on our wrap-up blog post.
It’s World Digital Preservation Day – a time of celebration and good cheer with friends and family! Not to mention a great way to raise awareness about the importance of preserving digital materials.
To get into the spirit of the day, take some time to check out the series of blog posts sponsored by
the Digital Preservation Coalition, featuring thoughts, opinions, and stories
from around the world on the theme “At-Risk Digital Materials”.
The posts are being published on a rolling basis throughout the day. Have
a look – it’s a great opportunity to take the temperature of current thinking
on digital preservation.
I was pleased to make a contribution to the posts, which is here.
Happy World Digital Preservation Day to you and yours!
Mike Lynch and Peter Sefton attended the 2019 eResearch Australasia conference
in Brisbane from 22-24 October 2019, where we presented a few things, as well as
a pre-conference summit on the 21st held by the Australian Research Data Commons,
where Mike presented our report from our small discovery project on scalable
repository technology. UTS paid for the trip.
What we presented - our work on Simple Scalable Research Data Repositories
We've posted fleshed-out versions of our conference papers as usual. Mike
presented a short version of
ARDC funded work on data repositories
at both the summit and the conference, and Peter had also put in an abstract
which is less technically focussed and gives more of the context for why this
work is important.
I (Mike) went to the breakfast given in honour of the late Dr Jacky Pallas, a senior
figure in the eResearch community who had given a keynote at last year's
conference.
The speaker was Dr Toni Collis, a research software engineer and director of
Women in High Performance Computing, on how lack of
diversity is damaging your research, making the point that diverse research and
support teams can be demonstrated to be more effective in terms of performance,
and the importance of equity, diversity and inclusivity in attracting and retaining talent.
Research as a primary function of Electronic Health Records
Prof Nikolajz Zeps' presentation was one of a couple of talks which sold
themselves as being provocative, in that they were arguing for a loosening of
traditional, restrictive health care ethics and consent practices so that data
could be made more readily available for research. He made the point that data
sharing is very difficult in the Australian healthcare system not just because
of ethical restrictions but because the system is so fragmented. He argued for
an integration of research and clinical data consent and management practices,
which would allow the information flows required for medical research to be used
for the health care system itself to better monitor the effectiveness of
treatments and patient outcomes. Moving the consent process from one of research
ethics to clinical ethics can make things simpler, in terms of administration.
(I was less impressed by the other provocative keynote in which the speaker said
"no-one ever died as the result of a health-care data breach", which I thought
was a bit of posturing, even though it got applause from some of the audience.)
Galaxy Australia and the Australian Bioinformatic Commons
These two presentations were part of the bioinformatics stream, about the
Australian node of the global Galaxy workflow and computational platform, and
the Australian BioCommons Pathfinder Project.
A lot of what we spoke about in this BOF, which was chaired by Ingrid Mason,
echoed what I (Mike) had heard the week before at a Big Data for the Digital
Humanities symposium in Canberra which was organised by the ARDC and AARNet (and
which I should write a blog post about).
The challenges faced by digital humanities researchers and support staff mirror
one another - researchers are unsure of the right way to engage with technical
staff and vice versa, and good collaboration is too labour-intensive to be
sustainable if it's going to be spread beyond a minority of researchers who are
already linked in to support networks.
This gave me a kind of wistful feeling about an earlier
keynote from Dell
about machine learning in medical science, because the speaker was very
enthusiastic about moving HPC tools out of the realm where researchers needed to
become technology experts to use them at all, into something more like commodity
software. There are some areas of the humanities where this sort of thing is
starting to happen, though - transcription is one which came up in both
Canberra and Brisbane.
I (Peter) chaired a session on Data Discovery with a couple of lead-in
talks that outlined what's going on in the world of generic research data
discovery, leading into a discussion.
From our viewpoint at UTS it was useful to get confirmation that discovery services are converging on using
Schema.org for high-level description of data sets, for
indexing by other services. Which is good, because that's the horse we bet on at UTS.
It's being used by both Research Data Australia (RDA) and
the new player
Google dataset search (that's run by
a tiny team apparently, but it will have a huge impact on how everyone has to
structure their metadata).
Amir Aryani (Swinburne) and Melroy Almeida (Australian Access Federation)
presented on the ORCID graph,
looking at collaboration networks. This is testament to the power of using
strong, URI-based identifiers: once you start doing that, metadata changes from
an unreliable soup of differently spelled, ambiguous names into something you can
do real analytics on.
Adrian Burton (ARDC, ex ANDS) has been dealing with metadata for a long time - his
take was that the Schema.org approach had won, and he suggested that this might be a
bit of a loss. The RIF-CS standard that ANDS
inherited and built RDA around had an entity-based model, with Collections,
Parties, Activities and Services (based on
ISO 2146) rather than simple flat
name-value metadata. I agree that the entity model was a strength of RIF-CS. But
for those who want to convey rich context about data, schema.org
with linked data can do everything RIF-CS can, more elegantly and with more
detail. See the work we've been doing on RO-Crate which takes things to a
deeper level with descriptions of files and (soon) even variables inside files
including provenance chains (what people and equipment did to make those files
from observations, or other files).
The leaders of that session have done a follow up survey, so I think they'll be
putting out more info soon.
The RO-Crate talk I gave (Peter here) was in a stream on Digital Preservation and data packaging.
I was talking, not taking notes, but we discussed what the research community
and cultural collections folks can learn from each other - actually I think we
made some of the same mistakes: both the eResearch community and the GLAM sector
invested in big silos which ended up not just storing data, but making it
difficult to move and re-use. To labour the metaphor a bit, silos have small holes
in the bottom, so getting data in and out is slow.
Mike's diagram of an OCFL Repository shows an alternative approach - instead of
putting data in a container with constricted ingress and egress, lay it all out
in the open. I'm not an expert in preservation systems, but I do know that
that's the approach taken by the open source
Archivematica preservation system (note:
I've done a bit of work for Artefactual Systems, which looks after it). It works
as an application that sits beside a set of files on disk - if needed you can
use the grandparent of all APIs, that is, file operations, to fetch data. All of
the talks we gave, linked above, were about this idea in one way or another.
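As a trivial illustration of that last point, this sketch lays a versioned object out on disk and fetches content back with nothing but ordinary file operations. The directory names only gesture at OCFL's real layout - this is a hypothetical, much simplified structure, not the spec:

```python
# Illustrative only: content in an OCFL-style layout can be fetched with
# plain file operations. The "object-1/v1/content" naming is a simplified
# sketch of a versioned layout, not the full OCFL specification.
from pathlib import Path
import tempfile

root = Path(tempfile.mkdtemp())
content_dir = root / "object-1" / "v1" / "content"
content_dir.mkdir(parents=True)
(content_dir / "data.txt").write_text("hello")

# the "grandparent of all APIs": an ordinary file read, no repository
# application required
text = (root / "object-1" / "v1" / "content" / "data.txt").read_text()
print(text)
```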
Trusted Repository certification - one for the UTS Roadmap
I (Peter again) attended
a session MCed by Richard Ferrers
from ARDC with contributions from people from a range of institutions and
repositories who are part of an ARDC community of practice.
They talked about the Core Trust Seal repository certification program - and the process of getting certified.
Here's some background on CTS:
Core Certification and its Benefits
Nowadays certification standards are available at different levels, from a core level to extended and formal
levels. Even at the core level, certification offers many benefits to a repository and its stakeholders.
Core certification involves a minimally intensive process whereby data repositories supply evidence that they
are sustainable and trustworthy. A repository first conducts an internal self-assessment, which is then
reviewed by community peers. Such assessments help data communities—producers, repositories, and
consumers—to improve the quality and transparency of their processes, and to increase awareness of and
compliance with established standards. This community approach guarantees an inclusive atmosphere in
which the candidate repository and the reviewers closely interact.
In addition to external benefits, such as building stakeholder confidence, enhancing the reputation of the
repository, and demonstrating that the repository is following good practices, core certification provides a
number of internal benefits to a repository. Specifically, core certification offers a benchmark for comparison
and helps to determine the strengths and weaknesses of a repository.
Right now at UTS we're in the process of making a new Digital Strategy, aligned with the UTS 2027 Strategy - one of the core goals (which are still evolving so we can't link to them just yet) is to have trusted systems. CTS would be a great way for the IT Department (that's us) to demonstrate to the organisation that we have the governance, technology and operational model in place to run a repository.
We're talking now about getting at least the first step (self certification) on the 2021 Roadmap - but before that, we'll see if we can join the community discussion and start planning.
The objective of this course is to explain how DSpace software (version 6.3) is installed on an operating system, in this case Windows. The user will be able to configure and personalize DSpace, as well as have control of testing and seeing the complete installation process.
The course will begin on Friday, November 8; please register at: https://wiki.duraspace.org/pages/viewpage.action?pageId=176490109
The Mexican DSpace Users Group is pleased to offer the community the opportunity to register for a virtual course on “Installation, Configuration and Customization of DSpace 6.3” with Sheyla Salazar Waldo and Julian Timal Tlachi of the Mexican DSpace Users Group.
Doing something great with Islandora and/or Fedora that you want to share with the community? Have a recent project that the world just needs to know about? Send us your proposals to present at the joint Islandora and Fedora Camp in Arizona! Presentations should be roughly 20-25 minutes in length (with time after for questions) and deal with Islandora and/or Fedora in some way. The camp will be focussed on the latest versions of Islandora and Fedora, so preference will be given to sessions that relate to Islandora 8 and Fedora 4 and higher, but we still welcome proposals relating to earlier versions.
All we need is a session title and a brief abstract. Submit your proposal here.
The venerable Project Gutenberg is perfect fodder for the Distant Reader, and this essay outlines how & why. (tl;dnr: Search my mirror of Project Gutenberg, save the result as a list of URLs, and feed them to the Distant Reader.)
A long time ago, in a galaxy far far away, there was a man named Michael Hart. Story has it he went to college at the University of Illinois, Urbana-Champaign. He was there during a summer, and the weather was seasonably warm. On the other hand, the computer lab was cool; after all, computers run hot, and air conditioning is a must. To cool off, Michael went into the computer lab.† While he was there he decided to transcribe the United States Declaration of Independence, ultimately in the hopes of enabling people to use computers to “read” this and additional transcriptions. That was in 1971. One thing led to another, and Project Gutenberg was born. I learned this story while attending a presentation by the now late Mr. Hart on Saturday, February 27, 2010 in Roanoke (Indiana). As it happened, it was also Mr. Hart’s birthday.
To date, Project Gutenberg is a corpus of more than 60,000 freely available transcribed ebooks. The texts are predominantly in English, but many languages are represented. Many academics look down on Project Gutenberg, probably because it is not as scholarly as they desire, or maybe because the provenance of the materials is in dispute. Despite these things, Project Gutenberg is a wonderful resource, especially for high school students, college students, or life-long learners. Moreover, its transcribed nature eliminates any problems of optical character recognition, such as one encounters with the HathiTrust. The content of Project Gutenberg is all but perfectly formatted for distant reading.
Unfortunately, the interface to Project Gutenberg is less than desirable; the index to Project Gutenberg is limited to author, title, and “category” values. The interface does not support free text searching, and there is limited support for fielded searching and Boolean logic. Similarly, the search results are neither very interactive nor faceted. Nor is there any application programming interface to the index. With so much “clean” data, so much more could be implemented. In order to demonstrate the power of distant reading, I endeavored to create a mirror of Project Gutenberg while enhancing the user interface.
To create a mirror of Project Gutenberg, I first downloaded a set of RDF files describing the collection.  I then wrote a suite of software which parses the RDF, updates a database of desired content, loops through the database, caches the content locally, indexes it, and provides a search interface to the index. [3, 4] The resulting interface is ill-documented but 100% functional. It supports free text searching, phrase searching, fielded searching (author, title, subject, classification code, language) and Boolean logic (using AND, OR, or NOT). Search results are faceted enabling the reader to refine their query sans a complicated query syntax. Because the cached content includes only English language materials, the index is only 33,000 items in size.
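For the curious, the RDF-parsing step might look something like the minimal sketch below. The sample record is illustrative, not a real catalog entry, and the namespace URIs follow the Project Gutenberg catalog dumps as I understand them - this is not the actual suite of software described above:

```python
# Hypothetical sketch: extract title, creator, and language from a
# Project Gutenberg-style RDF/XML record using only the standard library.
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dcterms": "http://purl.org/dc/terms/",
    "pgterms": "http://www.gutenberg.org/2009/pgterms/",
}

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                     xmlns:dcterms="http://purl.org/dc/terms/"
                     xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/12345">
    <dcterms:title>A Sample Title</dcterms:title>
    <dcterms:creator>
      <pgterms:agent><pgterms:name>Doe, Jane</pgterms:name></pgterms:agent>
    </dcterms:creator>
    <dcterms:language>
      <rdf:Description><rdf:value>en</rdf:value></rdf:Description>
    </dcterms:language>
  </pgterms:ebook>
</rdf:RDF>"""

def parse_record(xml_text):
    """Return a dict of the fields a mirror's indexer would need."""
    root = ET.fromstring(xml_text)
    ebook = root.find("pgterms:ebook", NS)
    return {
        "id": ebook.get("{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"),
        "title": ebook.findtext("dcterms:title", namespaces=NS),
        "creator": ebook.findtext(
            "dcterms:creator/pgterms:agent/pgterms:name", namespaces=NS),
        "language": ebook.findtext(
            "dcterms:language/rdf:Description/rdf:value", namespaces=NS),
    }

record = parse_record(SAMPLE)
print(record["title"], record["language"])
```

Records like this would then drive the database update, caching, and indexing steps.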
Project Gutenberg & the Distant Reader
The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis — reading. Given a corpus of any size, the Distant Reader will analyze the corpus, and it will output a myriad of reports enabling you to use & understand the corpus. The Distant Reader is intended to supplement the traditional reading process. Project Gutenberg and the Distant Reader can be used hand-in-hand.
As described in a previous posting, the Distant Reader can take five different types of input.  One of those inputs is a file where each line in the file is a URL. My locally implemented mirror of Project Gutenberg enables the reader to search & browse in a manner similar to the canonical version of Project Gutenberg, but with two exceptions. First & foremost, once a search has been run against my mirror, one of the resulting links is “only local URLs”. For example, below is an illustration of the query “love AND honor AND truth AND justice AND beauty”, and the “only local URLs” link is highlighted:
By selecting the “only local URLs”, a list of… URLs is returned, like this:
This list of URLs can then be saved as a file, and any number of things can be done with it. For example, there are Google Chrome extensions for the purposes of mass downloading. The file of URLs can be fed to command-line utilities (e.g. curl or wget), also for the purposes of mass downloading. In fact, assuming the file of URLs is named love.txt, the following command will download the files in parallel and really fast:
cat love.txt | parallel wget
This same file of URLs can be used as input against the Distant Reader, and the result will be a “study carrel” where the whole corpus could be analyzed — read. For example, the Reader will extract all the nouns, verbs, and adjectives from the corpus. Thus you will be able to answer what and how questions. It will pull out named entities and enable you to answer who and where questions. The Reader will extract keywords and themes from the corpus, thus outlining the aboutness of your corpus. From the results of the Reader you will be set up for concordancing and machine learning (such as topic modeling or classification) thus enabling you to search for more narrow topics or “find more like this one”. The search for love, etc returned more than 8000 items. Just less than 500 of them were returned in the search result, and the Reader empowers you to read all 500 of them at one go.
Project Gutenberg is a very useful resource because the content is: 1) free, and 2) transcribed. Mirroring Project Gutenberg is not difficult, and by doing so an interface to it can be enhanced. Project Gutenberg items are perfect items for reading & analysis by the Distant Reader. Search Project Gutenberg, save the results as a file, feed the file to the Reader and… read the results at scale.
Lehigh University Libraries has developed a new tool for querying WorldCat using the WorldCat Search API. The tool is a Google Sheet Add-on and is available now via the Google Sheets Add-ons menu under the name “MatchMarc.” The add-on is easily customizable, with no knowledge of coding needed. The tool will return a single “best” OCLC record number, and its bibliographic information for a given ISBN or LCCN, allowing the user to set up and define “best.” Because all of the information, the input, the criteria, and the results exist in the Google Sheets environment, efficient workflows can be developed from this flexible starting point. This article will discuss the development of the add-on, how it works, and future plans for development.
Columbia University Libraries recently embarked on a multi-phase project to migrate nearly 4,000 records describing over 70,000 linear feet of archival material from disparate sources and formats into ArchivesSpace. This paper discusses tools and methods brought to bear in Phase 2 of this project, which required us to look closely at how to integrate a large number of legacy finding aids into the new system and merge descriptive data that had diverged in myriad ways. Using Python, XSLT, and a widely available if underappreciated resource—the Google Sheets API—archival and technical library staff devised ways to efficiently report data from different sources, and present it in an accessible, user-friendly way. Responses were then fed back into automated data remediation processes to keep the migration project on track and minimize manual intervention. The scripts and processes developed proved very effective, and moreover, show promise well beyond the ArchivesSpace migration. This paper describes the Python/XSLT/Sheets API processes developed and how they opened a path to move beyond CSV-based reporting with flexible, ad-hoc data interfaces easily adaptable to meet a variety of purposes.
The Black Book Interactive Project at the University of Kansas (KU) is developing an expanded corpus of novels by African American authors, with an emphasis on lesser known writers and a goal of expanding research in this field. Using a custom metadata schema with an emphasis on race-related elements, each novel is analyzed for a variety of elements such as literary style, targeted content analysis, historical context, and other areas. Librarians at KU have worked to develop a variety of computational text analysis processes designed to assist with specific aspects of this metadata collection, including text mining and natural language processing, automated subject extraction based on word sense disambiguation, harvesting data from Wikidata, and other actions.
In October 2018, the Library of Congress launched its crowdsourcing program By the People. The program is built on Concordia, a transcription and tagging tool developed to power crowdsourced transcription projects. Concordia is open source software designed and developed iteratively at the Library of Congress using Agile methodology and user-centered design. Applying Agile principles allowed us to create a viable product while simultaneously pushing at the boundaries of capability, capacity, and customer satisfaction. In this article, we share more about the process of designing and developing Concordia, including our goals, constraints, successes, and next steps.
With funding from multiple sources, an augmented-reality application was developed and tested by researchers to increase interactivity for an online exhibit. The study found that augmented reality integration into a library exhibit resulted in increased engagement and improved levels of self-reported enjoyment. The study details the process of the project including describing the methodology used, creating the application, user experience methods, and future considerations for development. The paper highlights software used to develop 3D objects, how to overlay them onto existing exhibit images and added interactivity through movement and audio/video syncing.
This paper offers a primer in the programming language R for library staff members to perform factor analysis. It presents a brief overview of factor analysis and walks users through the process from downloading the software (R Studio) to performing the actual analysis. It includes limitations and cautions against improper use.
4Science, together with The Library Code and other organizations in the DSpace-CRIS community, such as the Hamburg University of Technology, the Fraunhofer Gesellschaft, the Georg-August-University Goettingen, the Otto-Friedrich-University Bamberg, the University of Bern, the University of Trieste, invite all institutions interested in DSpace-CRIS, the free open-source Research Information Management System (aka CRIS/RIMS), to join in the first International DSpace-CRIS User Group Meeting to be held in Muenster (Germany, EU) on November 18, 2019.
The event is free and is organized in the framework of the euroCRIS Membership Meeting that will be held on November 18-20, 2019 at the University of Münster, Germany, EU. At https://eurocris.uni-muenster.de there are details about the euroCRIS registration, program, venue, transport, and accommodation. Participation also in the euroCRIS event is strongly encouraged.
The DSpace-CRIS User Group Meeting is scheduled for Monday, November 18 from 14h to 17h, at: Johannisstr. 8-10, room KTh IV, Muenster (click here for the map), with thanks to the generous hospitality of the University of Muenster.
• Introduction and DSpace-CRIS roadmap, future plans and conclusions: Susanna Mornati and Andrea Bollini, 4Science (Italy), Pascal Becker, The Library Code (Germany)
• Experiences from participants and discussion:
Beate Rajski and Oliver Goldschmidt, Hamburg University of Technology (Germany)
Daniel Beucke, Georg-August University of Goettingen (Germany)
Michael Erndt and Dirk Eisengräber-Pabst, Fraunhofer Gesellschaft (Germany)
Steffen Illig, Otto-Friedrich-University Bamberg (Germany)
Anna Keller, University of Bern (Switzerland)
Jordan Piščanc, University of Trieste (Italy)
Other participants are invited to share their experiences and wish-list (open discussion)
Nominations are open for the 2020 LITA/Library Hi Tech Award, which is given each year to an individual or institution for outstanding achievement in educating the profession about cutting edge technology within the field of library and information technology. Sponsored by the Library and Information Technology Association (LITA) and Library Hi Tech, the award includes a citation of merit and a $1,000 stipend provided by Emerald Publishing, publishers of Library Hi Tech. The deadline for nominations is December 31, 2019.
The award, given to either a living individual or an institution, may recognize a single seminal work or a body of work created during or continuing into the five years immediately preceding the award year. The body of work need not be limited to published texts but can include course plans or actual courses and/or non-print publications such as visual media. Awards are intended to recognize living persons rather than to honor the deceased; therefore, awards are not made posthumously. More information and a list of previous winners can be found on the LITA website.
The award will be presented at the LITA President’s Program during the 2020 Annual Conference of the American Library Association in Chicago, IL.
The Library and Information Technology Association (LITA) is the leading organization reaching out across types of libraries to provide education and services for a broad membership of nearly 2,400 systems librarians, library technologists, library administrators, library schools, vendors, and many others interested in leading edge technology and applications for librarians and information providers. Follow us on our Blog, Facebook, or Twitter.
About Emerald Publishing
Founded in 1967, Emerald Publishing today manages a range of digital products, a portfolio of nearly 300 journals, more than 2,500 books and over 450 teaching cases. More than 3,000 Emerald articles are downloaded every hour of every day. The network of contributors includes over 100,000 advisers, authors and editors. Globally, Emerald has an extraordinary reach with 12 offices worldwide and more than 4,000 customers in over 120 countries. Emerald is COUNTER 4 compliant. It is also a partner of the Committee on Publication Ethics (COPE) and works with Portico and the LOCKSS initiative for digital archive preservation. It also works in close collaboration with a number of organizations and associations worldwide.
This presentation was given by Peter Sefton at the eResearch Australasia 2019 Conference in Brisbane, on the 24th of October 2019.
This presentation is part of a series of talks delivered here at eResearch Australasia - so it won’t go back over all of the detail already covered - see the introduction of DataCrate in 2017 and the 2018 update. The standard formerly known as DataCrate has been subsumed into a new standard called Research Object Crate - RO-Crate for short.
This is a recent snapshot of the makeup of the current RO-Crate team - compiled by Stian.
The website says:
RO-Crate is a community effort to establish a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wider variety of situations, from an individual researcher working with a folder of data, to large data-intensive computational research environments.
This is a timeline for the merging of the Research Object packaging work with DataCrate - again compiled by Stian. While our DataCrate work was driven by practical concerns and a desire to describe research data with high-quality metadata, Research Object shared those concerns but with more of a focus on reproducibility and detailed provenance for research data.
This is what an RO-Crate looks like if you open the HTML file that’s in the root directory (or you see one on the web).
Where did RO-Crate come from?
RO-Crate is the marriage of Research Objects with DataCrate. It aims to build on their respective strengths, but also to draw on lessons learned from those projects and similar research data packaging efforts. For more details, see background.
Who is it for?
The RO-Crate effort brings together practitioners from very different backgrounds, and with different motivations and use-cases. Among our core target users are: a) researchers engaged with computational and data-intensive, workflow-driven analysis; b) digital repository managers and infrastructure providers; c) individual researchers looking for a straight-forward tool or how-to guide to “FAIRify” their data; d) data stewards supporting research projects in creating and curating datasets.
RO-Crate is a collaboration between people all over the world, but the Editors are from Cork, Manchester and Katoomba
Version one of the standard will be out by Summer.
But which summer? Standard reference points are important. Standards are important.
Which brings us to the benefits of standards. Without this standardised date format, chaos would reign. What if that date had been written 05/08 or 08/05 - someone might end up eating food from May in August, or worse, eating last August’s food in May.
Anyway, If you find a partner who’ll adopt the ISO 8601 data standard then ...
… you should marry them.
Like how we married the Research Object and DataCrate - we bonded over standardisation.
Let’s explore standards a bit more. If you see this in metadata - what does it mean?
In RO-Crate there’s an HTML page which ships with each dataset that allows you to browse the object in as much detail as the author described it, and we are careful to avoid ambiguity by adding help links to each metadata term so you can see the definition.
Just wanted to shout out to ResearchGraph - led by Amir Aryani at Swinburne Uni - they are also using schema.org.
RO-Crates ship with two files, a human readable one and a machine readable JSON file. The two views (human and machine) of the data are equivalent - in fact the HTML version is generated from the JSON-LD version, via the DataCrate nodejs library.
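To give a flavour of the machine-readable side, here is a minimal, illustrative sketch of that JSON-LD structure, built as a Python dict. The identifiers, context URL, and property choices here are simplified for illustration, not a normative example from the spec:

```python
# A minimal RO-Crate-style metadata file as a Python dict. Structure is a
# sketch only: a metadata-file descriptor, a root Dataset, and one File
# entity, all described with schema.org terms in JSON-LD.
import json

crate = {
    "@context": "https://w3id.org/ro/crate/1.0/context",
    "@graph": [
        {   # the metadata file descriptor points at the root dataset
            "@id": "ro-crate-metadata.jsonld",
            "@type": "CreativeWork",
            "about": {"@id": "./"},
        },
        {   # the root dataset describes the crate as a whole
            "@id": "./",
            "@type": "Dataset",
            "name": "Example crate",
            "hasPart": [{"@id": "photo.jpg"}],
        },
        {   # one data entity, described with plain schema.org properties
            "@id": "photo.jpg",
            "@type": "File",
            "encodingFormat": "image/jpeg",
        },
    ],
}

serialized = json.dumps(crate, indent=2)
print(serialized[:60])
```

An HTML rendering like the one shown earlier can be generated from exactly this kind of JSON-LD graph.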
And here’s an automatically generated diagram extracted from the sample DataCrate showing how two images were created. The first result was an image file taken by me (as an agent) using two instruments (my camera and lens), of a place (the object: Catalina park in Katoomba). A sepia toned version was the result of a CreateAction, with the instrument this time being the ImageMagick software. The DataCrate also contains information about that CreateAction such as the command used to do the conversion and the version of the software-as-instrument.
This way of representing file provenance is Action-centred - the focus is on the action that creates a file, rather than the more usual metadata approach of having the file at the centre with properties for “Author” and the like. The action-based approach is MUCH more flexible as it can model the contribution of multiple agents and instruments separately at the expense of being somewhat counter-intuitive to those of us who are used to a library-card approach to metadata where the work is at the centre and has simple properties.
There was a question after this presentation about whether I had the arrows in this diagram pointing in the right direction. Yes, I do! The convention here is the standard way of representing a subject-predicate-object semantic triple with the subject as the source of the arrow, the predicate (in this case a Schema.org property) as a label, and the pointy end pointing at the object.
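Concretely, the provenance in a diagram like that boils down to subject-predicate-object triples. The identifiers below are illustrative, loosely mirroring the sepia-image example:

```python
# Subject-predicate-object triples for a CreateAction-centred provenance
# record. Each arrow runs FROM the subject TO the object, labelled by the
# predicate (a schema.org property).
triples = [
    ("#convert-action", "rdf:type",          "schema:CreateAction"),
    ("#convert-action", "schema:instrument", "#ImageMagick"),
    ("#convert-action", "schema:object",     "photo.jpg"),
    ("#convert-action", "schema:result",     "photo-sepia.jpg"),
]

for s, p, o in triples:
    print(f"{s} --{p}--> {o}")
```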
What’s new / developing at the moment in the RO-Crate world? I will illustrate by looking at recent activity on our Github project.
Breaking news: In the last couple of months Marco La Rosa, an independent developer working for PARADISEC, has ported 10,000 data and collection items into RO-Crate format, AND built a portal which can display them. This means that ANY repository with a similar structure (Items in Collections) could easily re-use the code and the viewers for various file types.
This shows an interlinear transcription where you can play various segments of a recording and see the transcription.
The .eaf files in the previous example are produced using ELAN software. Marco has done the groundwork for a system that could work across multiple repositories and for stand-alone RO-Crates - the crate metadata describes the files and what format they're in, and the viewer, an HTML page either served by a repository or possibly just off your hard disk, can use that information to load an appropriate display component.
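The dispatch step described above can be sketched in a few lines. This is a hypothetical illustration, not Marco's actual code: the viewer names and the MIME type for .eaf files are made up for the example.

```python
# A hypothetical sketch of format-based viewer dispatch: the crate
# metadata records each file's format, and the viewer page maps formats
# to display components. All names here are illustrative.
VIEWERS = {
    "audio/x-wav": "audio-player",
    "application/eaf+xml": "interlinear-transcription",  # ELAN .eaf files
    "image/jpeg": "image-viewer",
}

def pick_viewer(file_entity):
    """Return a viewer component name for a file described in the crate."""
    return VIEWERS.get(file_entity.get("encodingFormat"), "download-link")

print(pick_viewer({"@id": "recording.eaf",
                   "encodingFormat": "application/eaf+xml"}))
```

Because the mapping is driven entirely by crate metadata, the same viewer page works whether it is served by a repository or opened straight off your hard disk.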
RO-Crate will be released in version 1 in November 2019 - we were aiming for October, but missed that.
We will publish the parts that are well-tested and stable, and immediately start on a new version with bleeding-edge cases.
We want input from potential users and from current and prospective implementers; help drafting new parts of the spec is welcome.
<a rel="license" href="http://creativecommons.org/licenses/by/3.0/au/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/3.0/au/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/au/">Creative Commons Attribution 3.0 Australia License</a>.
This presentation was given by Peter Sefton & Michael Lynch at the eResearch Australasia 2019 Conference in Brisbane, on the 24th of October 2019.
Welcome - we’re going to share this presentation. Peter/Petie will talk through the two major standards we’re building on, and Mike will talk about the software stack we ended up with.
This project is about building highly scalable research data repositories quickly, cheaply and above all sustainably by using Standards for organizing and describing data.
We had a grant to continue our OCFL work from the Australian Research Data Commons. (I’ve used the new Research Organisation Registry (ROR) ID for ARDC, just because it’s new and you should all check out the ROR).
This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital information in a structured, transparent, and predictable manner. It is designed to promote long-term object management best practices within digital repositories.
Specifically, the benefits of the OCFL include:
Completeness, so that a repository can be rebuilt from the files it stores
Parsability, both by humans and machines, to ensure content can be understood in the absence of original software
Robustness against errors, corruption, and migration between storage technologies
Versioning, so repositories can make changes to objects allowing their history to persist
Storage diversity, to ensure content can be stored on diverse storage infrastructures including conventional filesystems and cloud object stores
Here’s a screenshot of what an OCFL object looks like - it’s a series of versioned directories, each with a detailed inventory.
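The versioned-directories-plus-inventory idea can be sketched in code. This is a deliberately simplified illustration of the layout, not a conformant OCFL object: real inventories record content digests, manifests, and fixity information.

```python
# A simplified sketch of an OCFL-style object: each version directory
# holds content, and an inventory at the object root records what each
# version contains, so the repository can be rebuilt from disk alone.
import json
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp()) / "object-01"
(root / "v1" / "content").mkdir(parents=True)
(root / "v1" / "content" / "data.txt").write_text("first version")

inventory = {
    "id": "object-01",
    "head": "v1",
    "versions": {"v1": {"state": {"data.txt": "stored in v1/content"}}},
}
(root / "inventory.json").write_text(json.dumps(inventory, indent=2))

# "Completeness": a repository can be rebuilt purely by reading the
# inventories and version directories back off the storage.
rebuilt = json.loads((root / "inventory.json").read_text())
print(rebuilt["head"])
```

Adding a v2 directory and a new inventory entry, rather than overwriting v1, is what gives OCFL its versioning and its robustness against errors.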
One of the standards we are using is RO-Crate - for describing research data sets. I presented this at eResearch as well [TODO - link]
This is an example of an RO-Crate showing that each Crate has a human-readable HTML view as well as a machine readable view.
The two views (human and machine) of the data are equivalent - in fact the HTML version is generated from the JSON-LD version using a tool called CalcyteJS.
This is a screenshot of work very much in progress - it shows an example of the repository system working at the smallest scale, with a single collection, “Farms to Freeways”, a social history project from Western Sydney, which we have exported into RO-Crate format as a demonstration. Each of the participants has been indexed for discovery. In a more typical deployment for an institutional repository, datasets would be indexed at the top level only. The point is to show that this software will be highly configurable.
OCFL needs some explaining. I’ve had a couple of conversations with developers where it takes them a little while to get what it’s for. But they DO get it, and they agree the standard is well designed.
Solr is an efficient search engine.
nginx is an industry-standard scalable web server, used by companies like DropBox and Netflix
Both are standard, open-source, easy to deploy and keep patched: unlike dedicated data repositories, which tend to be fussy and make your server team swear.
Todo: we want to use the Memento standard so that clients can request versioned resources.
We are also looking at versioned DOIs pointing to versioned URLs and resources
The codebase is in a lot of places but that’s consistent with the approach - they are all just components which we can deploy as we need them
The nginx extension is very small and would be easy to reimplement against another server
This is the most prototypical / primitive part of what we’ve got so far.
Licences on RO-Crate are indexed in the solr index. nginx authenticates web users, looks up which licences they can access, and applies access control to both search results and payloads.
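The filtering step can be sketched as follows. This is a hypothetical illustration of the idea, not the actual nginx extension: the field names and licence values are made up for the example.

```python
# A hypothetical sketch of licence-based access control: search results
# carry a licence field (as indexed in Solr), and the gateway filters
# them against the licences the authenticated user is entitled to see.
def filter_results(results, allowed_licences):
    """Drop datasets whose licence the user may not access."""
    return [r for r in results if r["licence"] in allowed_licences]

results = [
    {"id": "farms-to-freeways", "licence": "public"},
    {"id": "staff-interviews", "licence": "internal"},
]

# An unauthenticated guest is entitled to public datasets only.
visible = filter_results(results, allowed_licences={"public"})
print([r["id"] for r in visible])
```

The same check applied to payload requests, not just search results, is what keeps the two access paths consistent.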
At the moment, we’ve got a test server which doesn’t authenticate but which only serves datasets with a public licence and denies access to everything else.
The screenshot on the left is a Solr query showing public and internal licences
The screenshot on the right is a basic web view of what nginx serves to an unauthenticated guest user - datasets with internal licenses aren’t shown
Good data standards make incremental development much easier.
We were able to get real results in one- and two-day workshops with teams from PARADISEC and the State Library of New South Wales, both with large, structured digital humanities collections behind APIs.
Both the OCFL and RO-Crate standards are new and changing, but agile development means that it’s OK and even productive to keep pace with this and feed back into community consultation.
In the last couple of months Marco La Rosa, an independent developer working for PARADISEC, has ported 10,000 data and collection items into RO-Crate format, AND built a portal which can display them. This means that ANY repository with a similar structure (Items in Collections) could easily re-use the code and the viewers for various file types.
The Mitchell Collection - digitised public domain books with detailed metadata in METS and specialised OCR standards. We spent a day at the State Library and were able to successfully extract books into directories of JPEGs and metadata, package these using RO-Crate and start building an OCFL repository.
Two recent events made me think (again) about the toxic nature of “library neutrality” and the fact that, more often than not, neutrality is whiteness/patriarchy/cis-heteronormativity/ableism/etc. parading around as neutrality and causing harm to folks from historically marginalized groups. The insidious thing about whiteness and these other dominant paradigms is that they are largely invisible to people in the dominant groups. It’s depressing to say this, but I sometimes feel grateful for the antisemitic macroaggressions and microaggressions I’ve been a victim of over the years because they opened my eyes to what it feels like to be othered and bullied and made me more sensitive to when it happens to others. That doesn’t mean I don’t get things wrong plenty of the time and cause harm unintentionally (we all do), but I am trying to be better because I don’t want anyone to feel the way I did when I was a target.
The first event that got me thinking about this is the fact that the Toronto Public Library, against a flurry of opposition, allowed the feminist and transphobic Meghan Murphy to give a talk in one of their meeting rooms entitled “Gender Identity: What does It Mean for Society, the Law and Women?” Murphy is on a crusade to “protect” women and children from transwomen who seek to use women-only facilities like bathrooms or locker rooms. She has been banned from Twitter for her transphobia and misgendering in the past. TPL already has a pretty robust room booking policy that says —
Contracting Party’s event will not promote, or have the effect of promoting, discrimination, contempt or hatred for any group or person on the basis of race, ethnic origin, place of origin, citizenship, colour, ancestry, language, creed (religion), age, sex, gender identity, gender expression, marital status, family status, sexual orientation, disability, political affiliation, membership in a union or staff association, receipt of public assistance, level of literacy or any other similar factor.
But TPL didn’t see this event as something that promoted discrimination, contempt, or hatred. According to the City Librarian of TPL, Vickery Bowles, their stated purpose was “to have an educational and open discussion on the concept of gender identity and its legislation ramifications on women in Canada.” Now, let’s imagine that we could go in a time machine to the past. Can you imagine some of these titles being discussed in libraries?
“to have an educational and open discussion on the concept of blacks living in white neighborhoods and its ramifications on the safety of white women in the United States.”
“to have an educational and open discussion on the concept of Jews as teachers and its ramifications on our impressionable children in Germany.”
Clearly I’m dense because I can’t see a difference between any of these lecture topics. It is treating the existence and/or civil rights of one group as something that is 1) up for debate and 2) a danger to others. I’m baffled how anyone could not see such a talk as something “promoting, discrimination, contempt or hatred,” and yet Vickery Bowles is being treated like a hero for standing up against censorship in a number of publications (see Kris Joseph’s excellent blog post for links to a few of them). For more on the TPL controversy, other excellent blog posts you may want to consult are authored by —
Another thing came up this week on an occasion that should have been such a positive one. OLA Quarterly, the official publication of the Oregon Library Association (of which I’m a member and served on its Board last year) came out with a mostly fantastic issue focused on Equity, Diversity, and Inclusion. I’ve read it cover to cover and was so impressed with the way library workers in our state and in all sorts of positions in their organizations have made efforts (big and small) to improve diversity, equity, and inclusion. There’s some great stuff in the issue. Unfortunately, it ended with an article entitled “Yes, but … One Librarian’s Thoughts About Doing It Right” by Heather McNeil. I’m sure most of you can guess that with a title like that, no good can come, and you’d be so very right.
Honestly, the only positive thing I can see ever coming from this article is that when someone asks in the future what people mean by white fragility or by the idea of white people centering themselves in conversations about diversity, I have something to point to. Truly, I’ve seen no clearer example. It’s hard for me to imagine what would possess a librarian with a long and celebrated career as a children’s librarian to write something so uncollegial, offensive, and dismissive of diversity (not to mention poorly written and supported) as her parting gift to the profession upon her retirement. I can only imagine that her feeling that we have “overcorrect[ed] ourselves” on issues of diversity was so strong that she believed she was doing us all a favor in sharing it. And if that isn’t whiteness in its purest form, I don’t know what is. Her misrepresentations of criticisms of Dr. Seuss books, Dr. Debbie Reese’s speech (the text of which is available so you can form your own conclusions), the blog Reading While White, and others trying to improve the diversity of books in libraries, celebrate diverse books, and critique whiteness in libraries were egregious and mostly unsupported.
Like others, I wrote a letter to the editors of OLA Quarterly, which I also shared on Twitter and on our state library listserv. My hope is that the editors will address this issue publicly and revisit their editorial standards so something this unprofessional is never published in OLA Quarterly again. However, what troubles me most is that lots of people read this article prior to its inclusion in the issue and thought it appropriate for publication. Again, clear evidence of how invisible whiteness can be to people who are white.
McNeil argues in her article that the Caldecott Committee does not consider the race or ethnicity of the author in their voting, but that’s pretty much impossible in a racist society. What we find beautiful and touching and important is very much based on our worldview, which, when we’ve been baked in a racist society, is influenced by whiteness. And based on McNeil’s article, it’s clear that some people are more aware of their problematic biases than others. It left me wondering whether members of the Newbery and Caldecott Committees are given implicit bias training so they can be more aware of how their biases impact their views of each book. If not, they absolutely should.
What strikes me about both of these issues is the fundamental lack of empathy expressed for people from historically marginalized groups. McNeil seems to worry much more about libraries with limited budgets (who might not want to buy diverse books that she believes won’t circulate) and Dr. Seuss lovers than about young children of color who might be impacted by racist caricatures or a lack of books in their library’s collection featuring protagonists who look like them. In the case of Toronto, even if the Library decided to hold firm on allowing the meeting room to be used on intellectual freedom grounds, they could have provided affirmation for their trans patrons in the form of statements and programming. That City Librarian Bowles would not even deign to acknowledge that trans women are women suggests to me that there is nothing “neutral” about the library’s stance. The fact that they see the question of whether trans women are women as an academic question that could reasonably be up for debate speaks volumes.
No one on the @torontolibrary should serve this community, especially not @vbowlestpl , because regardless of their transphobic beliefs, they couldn’t even acknowledge my humanity in that moment.
I can’t even fathom what all this feels like for LGBTQ+ staff at the Toronto Public Library who are not only being harmed by this, but in whose names these harms are being perpetrated. I felt angry about the article in OLA Quarterly on behalf of those whose needs and legitimate claims were being minimized and dismissed by McNeil, but I also felt like it made all Oregon library workers look bad. It made me feel embarrassed to be an OLA member.
In both of these cases, supporting diversity, equity, and inclusion is seen as something that is nice to do, but secondary to other values libraries hold, like intellectual freedom. I wrote about the tension between access & diversity and intellectual freedom in American Libraries and while I was not allowed to take a strong stand in that publication, I can say here that I unequivocally put people over ideals (especially people who are frequently victimized by institutions). To me, events by white supremacists or TERFs (trans-exclusionary radical feminists) are designed to repudiate the dignity and existence of marginalized groups and to make those groups feel unsafe. How can we say we welcome everyone into our libraries if we welcome folks who explicitly make people from marginalized groups feel unwelcome? But instead, libraries hide behind the idea of neutrality and not taking sides when clearly, TPL did choose a side. So did McNeil. So did I. And hanging onto your supposed neutrality only ensures that your behavior and choices are going to be influenced by whiteness/patriarchy/cis-heteronormativity/ableism/etc.
Key to stopping situations like this from happening is helping people become aware of their own biases and privilege, but clearly that is a difficult pill for many white library workers to swallow. I was asked last Spring to serve on an Oregon Library Association Equity, Diversity, and Inclusion (EDI) Task Force that is going to have its first meeting soon. I was originally really excited to serve on this group because I could see that libraries and library workers in the state needed educational tools that facilitate open discussions and encourage critical reflection about EDI issues and privilege. I could imagine creating a multi-modal learning program where people read articles, watch videos, critically reflect on their own blogs, and participate in F2F or virtual group discussions. After this week, that need is even more glaring. When I saw that our charge was focused on creating an EDI plan, I worried that we would be simply creating a meaningless document that the OLA Board will file away and maybe develop a few long-term goals around. I hope I’m wrong and we really move the needle on EDI in the state. I think I’ve just been burned too many times when working to create transformative planning documents that administrators just file away and ignore. I want to support meaningful work and I don’t want to feel so cynical about it.
What makes me hopeful is reading the other articles in this OLA Quarterly issue where library workers are moving the needle on making their libraries, collections, and the information ecosystem more diverse, equitable, and inclusive in ways large and small. There is great work happening in Oregon. I hope you’ll take the time to read some of their stories too and will amplify them more than McNeil’s terrible contribution.
The first Islandora event of 2020 will also be our first joint event with Fedora! From February 24 - 26, we will be partnering with LYRASIS and hosted by Arizona State University to bring you a three day camp packed with the latest in both Islandora and Fedora. Registration is now open!
Our focus will be on the latest versions of each, so this is an excellent opportunity to learn all about Islandora 8 and get some hands-on experience. The camp will be led by a group of experienced instructors with expertise spanning the front-end and code base of both platforms:
Melissa Anez has been working with Islandora since 2012 and has been the Community and Project Manager of the Islandora Foundation since it was founded in 2013. She has been a frequent instructor in the Admin Track and developed much of the curriculum, refining it with each new Camp. Lately she has been enjoying the challenge of fitting two versions of Islandora into a single day of workshops!
Danny Lamb has his B.Sc. in Mathematics and has been programming since before he could drive. He is currently serving as the Islandora Foundation's Technical Lead, and hopes to promote a collaborative and respectful environment where constructive criticism is encouraged and accepted. He is married with two children, and lives on beautiful Prince Edward Island, Canada. If he had free time, he'd be spending it in front of his kamado style grill.
Bethany Seeger is a software developer in the library at Amherst College, a liberal arts college in Massachusetts. She’s a Fedora committer, and also is the lead committer and release manager of the Islandora Enterprise (ISLE) project. She was an instructor at Fedora Camp in Austin, TX, and co-led the ISLE workshop at Islandoracon 2019. Bethany has lurked in the Islandora community for a while watching Islandora 8 develop; during this time, she’s installed Islandora 7 (manually, and then using ISLE) and Islandora 8 (using Ansible). Currently she is working on migrating a custom Fedora 3 repository to Islandora 7 (using ISLE) with the hopes of adopting Islandora 8 in the very near future. Bethany enjoys explaining complicated processes in plain English.
Seth Shaw jumped directly into developing with Islandora 8, and became a committer in 2018. He developed the Controlled Access Terms module and an ArchivesSpace integration module. He has been teaching workshops for over a decade but this will be his first Islandora Camp. His day job is as an Application Developer for Special Collections at the University of Nevada, Las Vegas.
David Wilcox is the Product Manager for Fedora at LYRASIS. He has been working with the Fedora and Islandora communities since 2011. David organizes camps and workshops for Fedora, where he is also frequently an instructor. The Arizona camp will be the first to feature both Fedora and Islandora, and he is excited for the opportunity to bring this new, combined camp to the community.
The Frictionless Data for Reproducible Research Fellows Programme is training early career researchers to become champions of the Frictionless Data tools and approaches in their field. Fellows will learn about Frictionless Data, including how to use Frictionless Data tools in their domains to improve reproducible research workflows, and how to advocate for open science. Working closely with the Frictionless Data team, Fellows will lead training workshops at conferences, host events at universities and in labs, and write blogs and other communications content.
You can call me Daniel Ouso. My roots trace to the lake basin county of Homabay in the equatorial country in the east of Africa, Kenya. Currently, I live in its capital Nairobi – once known as “The Green City in the Sun”, although thanks to poor stewardship of Mother Nature this is now debatable. The name is Maasai for a place of cool waters.
But enough of beautiful Kenya. I work at the International Centre of Insect Physiology and Ecology as a bioinformatics expert within the Bioinformatics Unit, involved in bioinformatics training and genomic data management. I hold a Master of Science in Molecular Biology and Bioinformatics (2019) from Jomo Kenyatta University of Agriculture and Technology, Kenya. My previous work was in infectious disease management and a bit of conservation. My long-term interest is in disease genomics research.
I am passionate about research openness and reproducibility, which I gladly noticed as a common interest in the Frictionless Data Fellowship (FDF). I have had previous experience working on a Mozilla Open Science project that really piqued my interest in wanting to learn skills and to expand my knowledge and perspective in the area. To that destination, this fellowship advertised itself as the best vehicle, and it was a frictionless decision to board. My goal is to become a better champion for open-reproducible research by learning data and metadata specifications for interoperability, the associated programmes/libraries/packages and data management best practices. Moreover, I hope to discover additional resources, to network and exchange with peers, and ultimately share the knowledge and skills acquired.
Knowledge is cumulative and progressive, an infinite cycle, akin to a corn plant, which grows from a seed into a seed, helped along the way by the effort of the farmer and other factors. Whether or not the subsequent seed will be replanted depends, among other factors, on its quality. You may wonder where I am going with this, so here is the point: for knowledge to bear fruit it must be shared widely, to be verified and to be built upon. The rate of research output is very fast, and so is the need for advancement of the research findings. However, the conclusions may at times be wrong. To improve knowledge, the goal of research is to deepen understanding and confirm findings and claims through reproduction. However, this is dependent on the contribution of many people from diverse places; as such, there is an obvious need to remove or minimise obstacles to the quest for research excellence. As a researcher, I believe that to keep pace with the rate of research production, findings and data must be made available in a form that doesn’t hinder their re-use and/or validation for further research. It means reducing friction on the research wheel by making research easier, cheaper and quicker to conduct, which will increase collaboration and prevent the reinvention of the wheel. To realise this, it is incumbent on me (and others) to make my contribution both as a producer and an affected party, especially seeing that exponentially huge amounts of biological data continue to be produced. Simply, improving research reproducibility is the right science of this age.
I am a member of The Carpentries community as an instructor and currently also in the task force planning the CarpentryCon2020, and hope to meet some of OKF community members there. I am excited to join this community as a Frictionless Data Fellowship! You can find important links and follow my fellowship here.
More on Frictionless Data
The Fellows programme is part of the Frictionless Data for Reproducible Research project at Open Knowledge Foundation. This project, funded by the Sloan Foundation, applies our work in Frictionless Data to data-driven research disciplines, in order to facilitate data workflows in research contexts. Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. Frictionless Data’s other current projects include the Tool Fund, in which four grantees are developing open source tooling for reproducible research. The Fellows programme will be running until June 2020, and we will post updates to the programme as they progress.
Information Technology and Libraries (ITAL), the quarterly open-access journal published by ALA’s Library Information Technology Association, is looking for contributors for its regular “Public Libraries Leading the Way” column. This column highlights a technology-based innovation or approach to problem solving from a public library perspective. Topics we are interested in include the following, but proposals on any other technology topic are welcome.
3-D printing and makerspaces
Diversity, equity, and inclusion and technology
Privacy and cyber-security
Virtual and augmented reality
Internet of things
Geographic information systems and mapping
Library analytics and data-driven services
Anything else related to public libraries and innovations in technology
To propose a topic, use this brief form, which will ask you for three pieces of information:
Your email address
A brief (75-150 word) summary of your proposed column that describes your library, the technology you wish to write about and your experience with it.
Columns are in the 1,000-1,500 word range and may include illustrations. These will not be research articles, but are meant to share practical experience with technology development or uses within the library. Proposals are due by November 30, and selections will be made by December 15.
The Evergreen Outreach Committee is pleased to announce that October’s Contributor of the Month is Anna Goben of the Indiana State Library (ISL). Anna serves as the Evergreen Indiana Program Director and Associate Database Analyst at ISL, and has been involved with Evergreen Indiana since its earliest days in 2007. Anna oversees daily operations of the Evergreen Indiana consortium.
“Because I work with staff fighting through workflow and policy issues daily, I am especially focused on monitoring and helping to fund development which will enhance the daily experience of Evergreen for patrons and staff,” Anna tells us. She has coordinated with other Evergreen consortia to collectively fund development and also brings her workflow knowledge to Launchpad, where she has filed 33 bugs.
Anna has been involved with the international Evergreen community since 2013. She credits attending the Evergreen Conference with “fir[ing] me up to get involved in the wider community,” and encourages new members to attend community events. “I would suggest that any meeting that involves members of multiple Evergreen communities will get you excited,” she says, “As you learn that they have the same enthusiasms, frustrations, and experiences that you deal with regularly.”
Anna and ISL have served as hosts of the community Hack-a-way in 2016, 2017, and 2019. Those community members who have attended the Hack-a-way can attest to the stellar hosting capabilities Anna brings to this event, which includes a homemade first-night welcome dinner for attendees.
Anna has been heavily involved with two large transitions affecting the Evergreen community in recent years – as current President of the Evergreen Project Board, she is part of the ongoing process to establish Evergreen as its own 501(c)(3) organization. This started in 2017 with the initial transition from the Software Freedom Conservancy to MOBIUS as the project’s “home”, and Anna and the current Board are working hard to permanently establish Evergreen as its own organization.
Additionally, Anna coordinated the creation of the Evergreen Community Development Initiative (ECDI) under the aegis of ISL. ECDI has assumed all development contracts from the former MassLNC. As MassLNC did, ECDI will serve as a clearinghouse for Evergreen community development funds and will continue managing cooperative development projects for the benefit of the community at large.
This cooperative spirit is something Anna embodies. “I’m always so excited to be part of a community that makes the changes they want to see rather than just feeling like they’re stuck with what they have.”
Do you know someone in the community who deserves a bit of extra recognition? Please use this form to submit your nominations. We ask for your email in case we have any questions, but all nominations will be kept confidential.
The first in a series on the historical parallels and lessons that unite the groundings of the DC-10 and 737 Max.
I hope he's right about the series, because this first part is a must-read account of the truly disturbing parallels between the dysfunction at McDonnell-Douglas and the FAA that led to the May 25th 1979 Chicago crash of a DC-10, and the dysfunction at Boeing (whose management is mostly the result of the merger with McDonnell-Douglas) and the FAA that led to the two 737 MAX crashes. Ostrow writes:
The grounding of the DC-10 ignited a debate over system redundancy, crew alerting, requirements for certification, and insufficient oversight and expertise of an under-resourced regulator — all familiar topics that are today at the center of the 737 Max grounding. To revisit the events of 40 years ago is to revisit a safety crisis that, swapping a few specific details, presents striking similarities four decades later, all the way down to the verbiage.
Below the fold, some commentary with links to other reporting.
The DC-10 crashed because one of the pylons holding the under-wing engines broke, with massive damage to the wing. Despite this obvious mechanical failure, it took 12 days for the FAA to ground DC-10s:
On June 6, all 138 DC-10s at eight U.S. airlines were ordered grounded by the FAA when it revoked the jet’s airworthiness certificate and would stay that way for 37 days in 1979. The FAA initially opposed the grounding and the crash forced a legal battle with the American Airline Passengers Association, which sought an injunction to halt DC-10 flying in the U.S. “pending fuller analysis,” according to coverage in Flight. Inspections in the days that followed the Chicago crash revealed cracks on the engine pylons on other aircraft. FAA Administrator Langhorne Bond had no choice but to withdraw the roughly 275-seat jet’s airworthiness certificate. Carriers and regulators around the world — totaling some 274 aircraft, including 74 in Europe — followed suit.
McDonnell Douglas called the order “an extreme and unwarranted act.”
Responding to the second crash of a Boeing 737 Max 8 soon after takeoff in less than five months, China and Indonesia ordered their airlines on Monday to ground all of these aircraft that they operate.
The Civil Aviation Administration of China noted in its announcement on Monday morning of the grounding that both the Ethiopian Airlines crash on Sunday and a Lion Air crash in Indonesia in late October had involved very recently delivered Boeing 737 Max 8 aircraft that crashed soon after takeoff.
Indonesia joined China about nine hours later in also ordering its airlines to stop operating their Boeing 737 Max 8 aircraft.
On Wednesday, when announcing the grounding of the 737 MAX, the FAA cited similarities in the flight trajectory of the Lion Air flight and the crash of Ethiopian Airlines Flight 302 last Sunday.
It is doubtful whether the FAA would have acted so fast had they not been preempted. In both cases, the FAA set up a blue-ribbon review board. The 1980 board concluded:
“The committee finds that, as the design of airplanes grows more complex, the FAA is placing greater reliance on the manufacturer,” the blue-ribbon panel wrote in 1980. “The FAA’s human resources are not remotely adequate to the enormous job of certifying an airliner,” wrote Newhouse, adding that the lure of more attractive salaries in the private sector meant 94% of approval work was delegated to the manufacturers. “The committee finds that the technical competence and up-to-date knowledge required of people in the FAA have fallen behind those in industry.”
The FAA, citing lack of funding and resources, has over the years delegated increasing authority to Boeing to take on more of the work of certifying the safety of its own airplanes.
Early on in certification of the 737 MAX, the FAA safety engineering team divided up the technical assessments that would be delegated to Boeing versus those they considered more critical and would be retained within the FAA.
But several FAA technical experts said in interviews that as certification proceeded, managers prodded them to speed the process. Development of the MAX was lagging nine months behind the rival Airbus A320neo. Time was of the essence for Boeing.
A former FAA safety engineer who was directly involved in certifying the MAX said that halfway through the certification process, "we were asked by management to re-evaluate what would be delegated. Management thought we had retained too much at the FAA."
"There was constant pressure to re-evaluate our initial decisions," the former engineer said. "And even after we had reassessed it ... there was continued discussion by management about delegating even more items down to the Boeing Company."
Even the work that was retained, such as reviewing technical documents provided by Boeing, was sometimes curtailed.
The New York Times would call the panel’s findings “damning” for Boeing and the FAA. The JATR, which included regulators from nine countries along with the U.S., found “signs of undue pressure” on the delegated Boeing staff responsible for regulatory approvals of the MCAS system, which it said (without elaborating) “may be attributed to conflicting priorities and an environment that does not support FAA requirements.”
Maynard Pennell, a retired Boeing executive and aerodynamicist drafted to the blue-ribbon commission for the review of the FAA and the DC-10, told Newhouse: “Douglas met the letter of the FAA regulations, but it did not build as safe an airplane as it could have. This was not a deliberate policy on its part…Douglas was determined not to over-run or do more than required by regulation to do.”
The JATR concluded Boeing broadly met every regulation, but raised “the foundational issue” of whether or not regulations can go far enough to foster a safety culture without creating complacency. “To the extent they do not address every scenario, compliance with every applicable regulation and standard does not necessarily ensure safety.”
On the closing page of its 1980 report, the blue-ribbon committee made a recommendation stemming directly from the lessons it saw as crucial from the 1979 DC-10 crash. The report recommended that each commercial aircraft manufacturer “consider having an internal aircraft safety organization to provide additional assurance of airworthiness to company management.” [Emphasis theirs] McDonnell Douglas had created roving non-advocate review boards to assess program safety, according to a former Douglas executive, but it stopped short of a central organization. But the virtue of the recommendation didn’t end in 1980. Whether it realized it or not, Boeing’s Board of Directors on September 30, 2019 adopted the committee’s suggestion, forty years later.
[Maureen Tkacik] recounts how Boeing bought the failing McDonnell-Douglas in 1997 and basically handed management of the combined company to the team that had driven McDonnell-Douglas into the ditch:
The line on Stonecipher was that he had “bought Boeing with Boeing’s money.” Indeed, Boeing didn’t ultimately get much for the $13 billion it spent on McDonnell Douglas, which had almost gone under a few years earlier. But the McDonnell board loved Stonecipher for engineering the McDonnell buyout, and Boeing’s came to love him as well.
Douglas was determined to beat the L-1011 Tristar to the sky in 1970 and did so 10 weeks before Lockheed. The externally similar-looking tri-jet occupied an identical spot in the market, and arriving first would be part of the competitive advantage, Douglas surmised. That expediency by Douglas (which had merged with McDonnell in 1967) would invite withering criticism from those tasked with officially evaluating the jet after Flight 191, including that its design might’ve met the letter of the law, but fell far short of its spirit of safety.
Although the L-1011 was more technologically advanced, the DC-10 would go on to outsell the L-1011 by a significant margin due to the DC-10's lower price and earlier entry into the market.
In Flawed analysis, failed oversight: How Boeing and FAA certified the suspect 737 MAX flight control system Dominic Gates of the Seattle Times explains that Boeing was desperate to get the 737 MAX in the air because the Airbus A320neo had a 9-month lead in the market. And Boeing also had a serious competitive disadvantage against Airbus. Airbus's planes are fly-by-wire, and the flight control software minimizes the differences between different models, reducing the need for pilot training. Boeing was also desperate to ensure that pilots certified for earlier 737 versions would not need significant training to fly the MAX. Gates writes that Boeing:
had promised Southwest Airlines Co., the plane’s biggest customer, to keep pilot training to a minimum so the new jet could seamlessly slot into the carrier’s fleet of older 737s, according to regulators and industry officials.
[Former Boeing engineer] Mr. [Rick] Ludtke [who worked on 737 MAX cockpit features] recalled midlevel managers telling subordinates that Boeing had committed to pay the airline $1 million per plane if its design ended up requiring pilots to spend additional simulator time. “We had never, ever seen commitments like that before,” he said.
The Association for Library Collections & Technical Services (ALCTS), the Library and Information Technology Association (LITA) and the Library Leadership & Management Association (LLAMA) have collaborated to create The Exchange, an interactive, virtual forum designed to bring together experiences, ideas, expertise and individuals from these American Library Association (ALA) divisions. Modeled after the 2017 ALCTS Exchange, the Exchange will be held May 4, May 6 and May 8 in 2020 with the theme “Building the Future Together.”
As a fully online interactive forum, the Exchange will give participants the opportunity to share the latest research, trends and developments in collections, leadership, technology, innovation, sustainability and collaborations. Participants from diverse areas of librarianship will find the three days of presentations, panels and activities both thought-provoking and highly relevant to their current and future career paths. The Exchange will engage an array of presenters and participants, facilitating enriching conversations and learning opportunities. Everyone, members and non-members alike, is encouraged to register and bring their questions, experiences and perspectives to the events. Registration opens Nov. 4.
The Exchange Working Group welcomes proposals for the May 2020 forum that highlight the innovation happening in the profession and that span across areas that may have been traditionally siloed. Proposal topics should be relevant to the overarching theme, as well as the daily themes for each session. Daily themes include leadership and change management, continuity and sustainability, and collaborations and cooperative endeavors. The deadline for proposals is Dec. 6. Proposals can be submitted using the Presentation Proposal Form.
Before submitting a proposal, visit the Exchange website to learn what makes a strong proposal, view the success criteria, and check out the session formats.
The Exchange is presented by the Association for Library Collections & Technical Services (ALCTS), the Library and Information Technology Association (LITA) and the Library Leadership & Management Association (LLAMA), divisions of the American Library Association. To get more information about the proposed future for joint projects such as the Exchange, join the conversation about #TheCoreQuestion.