Planet Code4Lib

Weeknote 9 (2021) / Mita Williams

§1 Zotero PDF Reader

A new look and functionality for Zotero’s PDF Reader is still in beta.

I can’t wait for this version to be unleashed!

§2 MIT D2O

Earlier this week, MIT Press announced Direct to Open (D2O), a new Open Access Monograph program. It appears that the transition of scholarly ebooks to another form of subscription product is continuing.

§3 AI but Canadian

I’m glad to see that the Federal Government has an Advisory Council on AI and I hope they are going to meaningfully fulfill their mandate. We are already late out of the gate on this front. The city where I live is already trialing software that suggests, based on an AI’s recommendations, where road safety investments should be made.

§4 Discovering Science Misconduct via Image Integrity

Not new but new to me. I’ve recently started following Elisabeth Bik on Twitter and it has been an eye-opening experience.

Bik, a microbiologist from the Netherlands who moved to the United States almost two decades ago, is a widely lauded super-spotter of duplicated images in the scientific literature. On a typical day, she’ll scan dozens of biomedical papers by eye, looking for instances in which images are reused and reported as results from different experiments, or where parts of images are cloned, flipped, shifted or rotated to create ‘new’ data…

Her skill and doggedness have earned her a worldwide following. “She has an uncommon ability to detect even the most complicated manipulation,” says Enrico Bucci, co-founder of the research-integrity firm Resis in Samone, Italy. Not every issue means a paper is fraudulent or wrong. But some do, which causes deep concern for many researchers. “It’s a terrible problem that we can’t rely on some aspects of the scientific literature,” says Ferric Fang, a microbiologist at the University of Washington, Seattle, who worked on a study with Bik in which she analysed more than 20,000 biomedical papers, finding problematic duplications in roughly 4% of them (E. M. Bik et al. mBio 7, e00809-16; 2016). “You have postdocs and students wasting months or years chasing things which turn out to not be valid,” he says.

Nature 581, 132-136 (2020), doi: https://doi.org/10.1038/d41586-020-01363-z

§5 And here’s one thing I did this week!

Activism Outside and Inside the Institution / Tara Robertson

Activism inside and outside the institution: Strategies and tactics for increasing diversity, equity and inclusion (Tara Robertson). Slides: http://bit.ly/KPUactivism

The recording for the talk I did for Kwantlen Polytechnic University’s Digital Pedagogy Webinar Series is up. Here’s the video and the slides.

I submitted the title and abstract 6 months ago, and as I wrote the talk I realized that the dichotomy of inside/outside is much messier than the title suggests. It’s a false dichotomy to frame it as burning it down from the outside or building new ways of doing things on the inside. Most of us, whether we are inside or outside an institution, do both–we build new things AND destroy barriers and structures that need to go.

The talk had four parts:

  1. Introducing myself and sharing some context about where I’m from.
  2. Sharing about some of the queer and feminist activism I did in my 20s and 30s, including The Lesbian Avengers, swimming on several International Gay and Lesbian Aquatics teams, The Glasgow Women’s Library, and some of my union involvement especially speaking up for sex workers rights at the Workers Out human rights conference.
  3. Defining what I mean by diversity, equity and inclusion and talking about some of the work I led at Mozilla.
  4. Naming 3 people who I see as possibility models: powerful change agents within their organizations who also make bigger ripples for change in the world.

Dr. Dori Tunstall, Respectful design + decolonizing hiring @deandori_ocadu

Dr. Dori Tunstall is the Dean of Design at OCADU. She’s leading systems change work at her university. This is visible through her work on respectful design and leading very successful Indigenous and Black faculty cluster hires that are both shifting representation and more importantly changing the culture of the institution. Follow her Instagram account where she spotlights Black and Indigenous fashion and design and models self-care/repair as a leader.

Dr. Ninan Abraham, What have we been missing in racial equity in academia?

Dr. Ninan Abraham is a professor in the Department of Microbiology and Immunology and the Department of Zoology and the Associate Dean of Equity and Diversity in the Faculty of Science. I tuned into his February 11th webinar titled “What have we been missing in racial equity in academia?” and it was the most robust diversity data analysis I’ve seen in a post-secondary institution. Two things that stuck with me were the observations from the hiring funnel analysis and how the term racialized (and who sees themselves as racialized) isn’t straightforward. Look for these findings in a forthcoming publication with Carola Hibsch-Jetter, Howard Ramos and Minelle Mahtani. 

Annie Jean-Baptiste, Founder of The Equity Army

Annie Jean-Baptiste is the author of Building for Everyone, the Head of Product Inclusion at Google and the Founder of the Equity Army.

The Equity Army is a community of learners, builders, dreamers and doers who are committed to ensuring everyone, especially historically underrepresented people, feel seen in any product or service. The Equity Army meets in an informal cohort model and learns about product inclusion, shares resources, hears from subject matter experts, and most importantly, takes action.

I’m part of the current cohort and it’s the most diverse group of people I’ve ever worked with. I can see the diversity on so many intersectional axes: race, age, gender, disability, geography, education, industry, role and level. I’m sure there are other dimensions I’m not able to see yet. I’m massively excited to have found a community to learn and take action with.

As we all do the work of diversity, equity and inclusion as activists in community or as activists within our organizations it’s important to have community to learn from, to share courage with, and to be accountable to. What community do you need to do this work?

The post Activism Outside and Inside the Institution appeared first on Tara Robertson Consulting.

History Of Window Systems / David Rosenthal

Alan Kay's Should web browsers have stuck to being document viewers? makes important points about the architecture of the infrastructure for user interfaces, but also sparked comments and an email exchange that clarified the early history of window systems. This is something I've written about previously, so below the fold I go into considerable detail.

Browser Architecture

Kay's basic argument is that designers of user interface infrastructure typically cannot predict all the requirements that will be placed upon it. At some point, as we see with the Web, programmability will evolve, so it had better be designed in from the start. His main point is summarized at the end of the post:
Key Point: “sending a program, not a data structure” is a very big idea (and also scales really well if some thought is put into just how the program is set up).
The starting point that leads him to this conclusion is a contrast between his recommendations at the inception of the Web:
I made several recommendations — especially to Apple where I and my research group had been for a number of years — and generally to the field. These were partially based on the scope and scalings that the Internet was starting to expand into.
  • Apple’s Hypercard was a terrific and highly successful end-user authoring system whose media was scripted, WYSIWYG, and “symmetric” (in the sense that the “reader” could turn around and “author” in the same high-level terms and forms). It should be the start of — and the guide for — the “User Experience” of encountering and dealing with web content.
  • The underlying system for a browser should not be that of an “app” but of an Operating System whose job would be to protectively and safely run encapsulated systems (i.e. “real objects”) gotten from the web. It should be the way that web content could be open-ended, and not tied to functional subsets in the browser.
And where the Web has ended up:
One way to look at where things are today is that the circumstances of the Internet forced the web browsers to be more and more like operating systems, but without the design and the look-aheads that are needed.
  • There is now a huge range of conventions both internally and externally, and some of them require and do use a dynamic language. However, neither the architecture of this nor the form of the language, or the forms of how one gets to the language, etc. are remotely organized for the end-users. The thresholds are ridiculous when compared to both the needs and the possibilities.
  • There is now something like a terribly designed OS that is the organizer and provider of “features” for the non-encapsulated web content. This is a disaster of lock-in, and with actually very little bang for the buck.
This was all done after — sometimes considerably after — much better conceptions of what the web experience and powers should be like. It looks like “a hack that grew”, in part because most users and developers were happy with what it did do, and had no idea of what else it *should do* (and especially the larger destinies of computer media on world-wide networks).
Kay sees this as a failure of imagination:
let me use “Licklider’s Vision” from the early 60s: “the destiny of computing is to become interactive intellectual amplifiers for all humanity pervasively networked worldwide”.

This doesn’t work if you only try to imitate old media, and especially the difficult to compose and edit properties of old media. You have to include *all media* that computers can give rise to, and you have to do it in a form that allows both “reading” and “writing” and the “equivalent of literature” for all users.

Examples of how to do some of this existed before the web and the web browser, so what has happened is that a critically weak subset has managed to dominate the imaginations of most people — including computer people — to the point that what is possible and what is needed has for all intents and purposes disappeared.
He uses the example of the genesis of PostScript to make the point about programmability:
Several of the best graphics people at Parc created an excellent “printing standard” for how a document was to be sent to the printer. This data structure was parsed at the printer side and followed to set up printing.

But just a few weeks after this, more document requirements surfaced and with them additional printing requirements.

This led to a “sad realization” that sending a data structure to a server is a terrible idea if the degrees of freedom needed on the sending side are large.

And eventually, this led to a “happy realization”, that sending a program to a server is a very good idea if the degrees of freedom needed on the sending side are large.

John Warnock and Martin Newell were experimenting with a simple flexible language that could express arbitrary resolution independent images — called “JAM” (for “John And Martin” — and it was realized that sending JAM programs — i.e. “real objects” to the printer was a much better idea than sending a data structure.

This is because a universal interpreter can both be quite small and also can have more degrees of freedom than any data structure (that is not a program). The program has to be run in a protected address space in the printer computer, but it can be granted access to a bit-buffer, and whatever it does to it can then be printed out “blindly”.

This provides a much better match up between a desktop publishing system (which will want to print on any of the printers available, and shouldn’t have to know about their resolutions and other properties), and a printer (which shouldn’t have to know anything about the app that made the document).
I have worked on both a window system that sent "a data structure" and one that sent "a program" and, while I agree with Kay that in many cases sending a program is a very good idea, the picture is more complicated than he acknowledges.
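A deliberately tiny sketch of that contrast, written in Python purely for illustration (nothing here is JAM, PostScript, or NeWS): a server that accepts only a fixed data structure is limited to the vocabulary it was built with, while a server that accepts a small program run against a few exposed primitives can render things its authors never anticipated.

```python
# Illustrative only. A stub render target; a real server would rasterize into
# a bit-buffer and print or display the result.
class Canvas:
    def line(self, p0, p1):
        print(f"line {p0} -> {p1}")

    def text(self, at, s):
        print(f"text {s!r} at {at}")

# Option 1: "send a data structure". The server's vocabulary is closed; any
# document feature it was not built for fails at the server.
def render_structure(job, canvas):
    for item in job:
        if item["kind"] == "line":
            canvas.line(item["from"], item["to"])
        elif item["kind"] == "text":
            canvas.text(item["at"], item["string"])
        else:
            raise ValueError(f"unknown element: {item['kind']}")

# Option 2: "send a program". The client ships code that is run, sandboxed,
# against the primitives the server exposes; new document features need no
# change on the server side.
def render_program(source, canvas):
    primitives = {"line": canvas.line, "text": canvas.text}
    exec(source, {"__builtins__": {}}, dict(primitives))

# A client-supplied "program" drawing a dashed rule, something the
# data-structure server above has no element for.
dashed_rule = """
x = 0
while x < 100:
    line((x, 10), (x + 5, 10))
    x += 10
"""

canvas = Canvas()
render_structure([{"kind": "line", "from": (0, 0), "to": (100, 0)}], canvas)
render_program(dashed_rule, canvas)
```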

Window Systems

NeWS & Pie Menus
Philip Remaker, inspired by Kay's argument, asked in a comment about NeWS, the PostScript-based window system that James Gosling and I worked on at Sun Microsystems in the late 80s that did "send a program". Kay replied:
I liked NEWS as far as it went. I don’t know why it was so cobbled together — Sun could have done a lot more. For example, the scalable pixel-independent Postscript imaging model, geometry and rendering was a good thing to try to use (it had been used in the Andrew system by Gosling at CMU) and Sun had the resources to optimize both HW and SW for this.

But Postscript was not well set up to be a general programming language, especially for making windows oriented frameworks or OS extensions. And Sun was very intertwined with both “university Unix” and C — so not enough was done to make the high-level part of NEWS either high-level enough or comprehensive enough.

A really good thing they should have tried is to make a Smalltalk from the “Blue Book” and use the Postscript imaging model as a next step for Bitblt.

Also, Hypercard was very much in evidence for a goodly portion of the NEWS era — somehow Sun missed its significance.
Kay is wrong that the Andrew window system that Gosling and I built at C-MU used PostScript. At that time the only implementation of PostScript that we had access to was in the Apple LaserWriter. It used the same Motorola 68K technology as the Sun workstations on our desks, and its rendering speed was far too slow to be usable as a graphical user interface. Gosling announced that he was going to Sun and told me he planned to build a PostScript-based window system. I thought it was a great idea in theory but was skeptical that it would be fast enough. It wasn't until Gosling showed me PostScript being rendered on Sun/1's screen at lightning speed by an early version of SunDew that I followed him to Sun to work on what became NeWS.

It is true that vanilla PostScript wasn't a great choice for a general programming language. But Owen Densmore (with my help) leveraged PostScript's fine-grained control over name resolution to build an object-oriented programming environment for NeWS that was, in effect, a Smalltalk-like operating system, with threads and garbage collection. It is described in Densmore's 1986 Object-Oriented Programming in NeWS, and Densmore's and my 1987 A User‐Interface Toolkit in Object‐Oriented PostScript.
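To make the name-resolution point concrete, here is a small, purely illustrative Python sketch (not Densmore's actual NeWS class mechanism): if looking up a name means searching an ordered chain of dictionaries, then sending a message is just a lookup that consults the receiver's dictionaries first, and inheritance is nothing more than the order of the chain. That is the property PostScript's dictionary stack provided for free.

```python
# Illustrative sketch of dispatch-by-dictionary-search; names are hypothetical.
class DictChain:
    def __init__(self, *dicts):
        self.dicts = list(dicts)               # searched front to back

    def lookup(self, name):
        for d in self.dicts:
            if name in d:
                return d[name]
        raise NameError(name)

    def send(self, name, *args):
        return self.lookup(name)(self, *args)  # the receiver is passed explicitly

# "Class" dictionaries: a window, and a scrollbar that overrides one method.
window_class = {
    "paint":  lambda self, *a: f"paint frame {self.lookup('bounds')}",
    "bounds": (0, 0, 100, 100),
}
scrollbar_class = {
    "paint": lambda self, *a: "paint thumb, then " + window_class["paint"](self),
}

# The subclass's dictionary sits in front of the superclass's, so its names shadow them.
scrollbar = DictChain(scrollbar_class, window_class)
print(scrollbar.send("paint"))   # paint thumb, then paint frame (0, 0, 100, 100)
```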

As regards Hypercard, Don Hopkins pointed out that:
Outside of Sun, at the Turing Institute in Glasgow, Arthur van Hoff developed a NeWS based reimagination of HyperCard in PostScript, first called GoodNeWS, then HyperNeWS, and finally HyperLook. It used PostScript for code, graphics, and data (the axis of eval).

Like HyperCard, when a user clicked on a button, the Click message could delegate from the button, to the card, to the background, then to the stack. Any of them could have a script that handled the Click message, or it could bubble up the chain. But HyperLook extended that chain over the network by then delegating to the NeWS client, sending Postscript data over a socket, so you could use HyperLook stacks as front-ends for networked applications and games, like SimCity, a cellular automata machine simulator, a Lisp or Prolog interpreter, etc. SimCity, Cellular Automata, and Happy Tool for HyperLook (nee HyperNeWS (nee GoodNeWS))
Hopkins' post about HyperLook illustrates the amazing version of SimCity that was implemented in it, including Hopkins' fascinating implementation of pie menus.

Hopkins also pointed Kay to Densmore's Object-Oriented Programming in NeWS. Kay was impressed:
This work is so good — for any time — and especially for its time — that I don’t want to sully it with any criticisms in the same reply that contains this praise.

I will confess to not knowing about most of this work until your comments here — and this lack of knowledge was a minus in a number of ways wrt some of the work that we did at Viewpoints since ca 2000.
I was impressed at the time by how simple and powerful the programming environment enabled by the combination of threads, messages and control over name resolution was. Densmore and I realized that the Unix shell's PATH variable could provide exactly the same control over name resolution that PostScript's dictionaries did, so, as I recall in an afternoon, we ported the PostScript object mechanism to the shell to provide a fully object-oriented shell programming environment.

My Two Cents Worth

As you can see, NeWS essentially implemented the whole of Kay's recommendations, up to and including HyperCard. And yet it failed in the marketplace, whereas the X Window System has been an enduring success over the last three decades. Political and marketing factors undoubtedly contributed to this. But as someone who worked on both systems I now believe even in the absence of these factors X would still have won out for the following technical reasons:
  • Kay writes:
    One of the great realizations of the early Unix was that the *kernel* of an OS — and essentially the only part that should be in “supervisor mode” — would only manage time (quanta for interleaved computations) and space (memory allocation and levels) and encapsulation (processes) — everything else should be expressible in the general vanilla processes of the system.
    Fundamentally X only managed time (by interleaving rendering operations on multiple windows), space (by virtualizing the framebuffer) and encapsulation (by managing the overlapping of multiple virtualized framebuffers or windows, and managing inter-window communication such as cut-and-paste).

    It is important to note the caveat in Kay's assertion that:
    sending a data structure to a server is a terrible idea if the degrees of freedom needed on the sending side are large.
    The reason X was successful despite "sending a data structure" was that the framebuffer abstraction meant that the "degrees of freedom needed on the sending side" weren't large. BitBlt and, later, an alpha-channel allowed everything else to be "expressible in the general vanilla processes of the system". Thus X can be viewed as conforming to Kay's recommendations just as NeWS did.
  • The PostScript rendering model is designed for an environment with enough dots-per-inch that the graphic designer can ignore the granularity of the display. In the 80s, and still today to a lesser extent, this isn't the case for dynamic displays. Graphic design for displays sometimes requires the control over individual pixels that PostScript obscures. Display PostScript, which was effectively the NeWS rendering model without the NeWS operating system, also failed in the marketplace partly for this reason.
  • With the CPU power available in the mid 80s, rendering PostScript even at display resolutions fast enough to be usable interactively required Gosling-level programming skills from the implementer. It was necessary to count clock cycles for each instruction in the inner loops, and to understand the effects of the different latencies of main memory and the framebuffer. Porting NeWS was much harder than porting X, which only required implementing BitBlt (a toy sketch of which appears below). Of course, this too rewarded programming skill, but it was also amenable to hardware implementation in a way that PostScript wasn't in those days. So X had a much easier deployment.
  • The lack of CPU power in those days also meant there was deep skepticism about the performance of interpreters in general, and in the user interface in particular. Mention "interpreted language" and what sprang to mind was BASIC or UCSD Pascal, neither regarded as fast.
  • Similarly, X applications were written in single-threaded C using a conventional library of routines. This was familiar territory for most programmers of the day. Not so the object-oriented, massively multi-threaded NeWS environment with its strange "reverse-polish" syntax. At the time these were the preserve of truly geeky programmers.
I should add one more factor that should have been, but wasn't, a consideration. As we see with Flash, insecurity doesn't prevent wide adoption. The attack surface of an "operating system" such as NeWS is much greater than that of a fixed-function server such as X's. Nevertheless, in the design and initial implementation of X.11 we committed several security faux pas. I don't think NeWS was ever seriously attacked; had it been I'm sure the results would have been even more embarrassing.
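Since BitBlt carries so much weight in the argument above, here is a toy Python version of the primitive (illustrative only; real implementations add clipping, overlapping-copy handling, colour planes, and a full set of raster operations). Porting X essentially meant supplying this one operation, tuned for the target framebuffer.

```python
# Toy bit-block transfer over a framebuffer modeled as a list of pixel rows.
def bitblt(dst, src, dx, dy, sx, sy, w, h, rop=lambda d, s: s):
    """Copy a w x h block from src at (sx, sy) into dst at (dx, dy),
    combining each destination pixel with its source pixel via rop."""
    for row in range(h):
        for col in range(w):
            dst[dy + row][dx + col] = rop(dst[dy + row][dx + col],
                                          src[sy + row][sx + col])

# An 8x4 framebuffer of zeros; stamp a 3x2 glyph at (2, 1) using XOR.
fb = [[0] * 8 for _ in range(4)]
glyph = [[1, 1, 1], [1, 0, 1]]
bitblt(fb, glyph, dx=2, dy=1, sx=0, sy=0, w=3, h=2, rop=lambda d, s: d ^ s)
for row in fb:
    print(row)
```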

This is all at the window system level, but I believe similar arguments apply at the level Kay is discussing, the Web. For now, I'll leave filling out the details as an exercise to the reader.

Early Window Systems

Hopkins also pointed Kay to Methodology of Window Management, the record of a workshop in April 1985, because it contains A Window Manager for Bitmapped Displays and Unix, a paper by Gosling and me about the Andrew window system, and SunDew - A Distributed and Extensible Window System, Gosling's paper about SunDew.

The workshop also featured Warren Teitelman's Ten Years of Window Systems - A Retrospective View. Later, I summarized my view of the early history in this comment to /.:
There were several streams of development which naturally influenced each other; broadly:
  • Libraries supporting multiple windows from one or more threads in a single address space, starting from Smalltalk leading to the Mac and Windows environments.
  • Kernel window systems supporting access to multiple windows from multiple address spaces on a single machine, starting with the Blit and leading to SunWindows and a system for the Whitechapel MG-1.
  • Network window systems supporting access to multiple windows from multiple address spaces on multiple machines via a network, starting from work at PARC by Bob Sproull & (if memory serves) Elaine Sonderegger, leading to Andrew, SunDew which became NeWS, and W which became X.
Smalltalk Windows
Like me, Teitelman starts his history in the mid 70s with Smalltalk at PARC. This inspired Kay to push the history more than a decade further back. Kay graciously permitted me to quote this e-mail:
Windows didn't start with Smalltalk. The first *real* windowing system I know of was ca 1962, in Ivan Sutherland's Sketchpad (as with so many other firsts). The logical "paper" was about 1/3 mile on a side and the system clipped, zoomed, and panned in real time. Almost the same year -- and using much of the same code -- "Sketchpad III" had 4 windows showing front, side, top, and 3D view of the object being made. These two systems set up the way of thinking about windows in the ARPA research community. One of the big goals from the start was to include the ability to do multiple views of the same objects, and to edit them from any view, etc.

When Ivan went ca 1967 to Harvard to start on the first VR system, he and Bob Sproull wrote a paper about the general uses of windows for most things, including 3D. This paper included Danny Cohen's "mid-point algorithm" for fast clipping of vectors. The scheme in the paper had much of what later was called "Models-Views-and-Controllers" in my group at Parc. A view in the Sutherland-Sproull scheme had two ends (like a telescope). One end looked at the virtual world, and the other end was mapped to the screen. It is fun to note that the rectangle on the screen was called a "viewport" and the other end in the virtual world was called "the window". (This got changed at Parc, via some confusions demoing to Xerox execs).

In 1967, Ed Cheadle and I were doing "The Flex Machine", a desktop personal computer that also had multiple windows (and Cheadle independently developed the mid-point algorithm for this) -- our viewing scheme was a bit simpler.
The paper to which Kay refers is A clipping divider by Bob Sproull & Ivan Sutherland.

Next Generation Metadata… it’s getting real! / HangingTogether

This spring, OCLC Research is running a discussion series on Next Generation Metadata, where library leaders, metadata experts, and practitioners from the EMEA time zones (Europe, the Middle East, and Africa) can participate to share their experiences, deepen their understanding of the topic, and gain confidence in planning ahead.

The theme of the series is inspired by Karen Smith-Yoshimura’s OCLC Research report, “Transitioning to the Next Generation of Metadata”, which depicts this transition as an evolving process, intertwined with changing metadata standards, infrastructures, and tools. The series also offers an opportunity to showcase OCLC’s pioneering research and experimentation in this area and its current work on building a Shared Entity Management Infrastructure, which will support linked data initiatives throughout the library community.

As part of the series, seven round table discussions are held in different European languages – English, Spanish, French, German, Italian, and Dutch – to address the question:

“How do we make the transition to the Next Generation of Metadata happen at the right scale and in a sustainable manner, building an interconnected ecosystem, not a garden of silos?”

Making it happen

This blog post reports back from the Opening Plenary webinar, held on 23 February 2021, where OCLC speakers presented to kick off the series.

Rachel Frick, Executive Director, Research Library Partnership, introduced the theme by depicting how the library community is in the midst of a transformative change: the metadata is changing, the creation process and the supply chains are changing, and new architectures are emerging. At the same time, metadata departments in libraries are getting less attention, professional staff numbers are decreasing, and roles are being de-professionalized. Libraries used to be knowledge organizations and library professionals were trained in bibliographic description and authority control. Now, authorities are called entities and the new description logic is about creating a “Knowledge Graph of Entities”. Is this transition simply about putting old wine in new bottles? For a long time, linked data experts and enthusiasts were not able to convince their peers or their leadership. After some years of intensified experimentation, this has changed. OCLC, leading libraries, and many other stakeholders are now making it happen, and confidence is growing that linking data across isolated systems and services will enhance the user experience and allow for efficiencies across the value chain. Rachel concluded:

“OCLC plays an important role in cultivating understanding and helping the community to embrace this transition – and this discussion series in the EMEA region is one of the many avenues it offers for doing so.”

Annette Dortmund, Senior Product Manager and Research Consultant, went on to present key themes from Karen Smith-Yoshimura’s report, which compiles six years of discussions with the OCLC Research Library Partners Metadata Managers Focus Group on the evolution of the next generation of metadata. In these discussions, the need for change became crystal clear. “Curated text-strings in bibliographic records” are nearing obsolescence, both conceptually and technically. The report describes the changes taking place in a number of areas, including the transition to linked data and identifiers, the description of inside-out and facilitated collections, and the evolution of metadata as a service, as well as resulting staffing requirements. The transition is key to achieving important library goals, such as multilingualism – which, in turn, is closely connected to EDI (Equity, Diversity and Inclusion) goals and principles. It opens opportunities for libraries to engage in new areas where metadata is becoming key, such as Research Data Management (RDM). Karen’s rich and informative report is on the reading list for the participants of the discussion series. As Annette put it:

“In a way, it frees us from discussing the WHY and WHERE TO of this transition. Why do we need a change, and where do we need to go? It is all in there, based on so much expertise and so many hours of robust discussion. And so, it frees us to move on to the HOW and WHO. How do we make the change happen, and who can help us with it – and who is already working on it?”

From experimenting to the heavy lifting

I gave a short overview of the findings from the OCLC Research report, “Transforming Metadata into Linked Data to Improve Digital Collection Discoverability: A CONTENTdm Pilot Project”, published early in 2021. CONTENTdm is a digital library system – it is OCLC’s product that allows libraries to build, manage, and showcase digitized cultural heritage collections, and make them discoverable to people and search engines. The Product Development team, OCLC Research colleagues and library professionals from the CONTENTdm user community, together, investigated methods, workflows and tools to produce linked data from the Dublin Core-based descriptions and link up the people, places, concepts, and events that populate these descriptions across CONTENTdm systems. The pilot convincingly showed the innovative potential of linking data and using persistent identifiers – both from a data management and discovery perspective. It also brought home the realization that a paradigm shift of this scale will necessarily take time to carry out and calls for long-term planning and collaboration strategies:

“An overarching question driving the linked data project was, for a paradigm shift of this magnitude, how can the foundational changes be made more scalable, affordable, and sustainable? (…) It will require substantial and shared resource commitments from a decentralized community of practitioners who will need to be supplied with easily accessible tools and workflows for carrying out the transition.”
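As a rough sketch of the "strings to things" work the pilot describes, converting a Dublin Core description into linked data amounts to replacing free-text names for people and places with persistent identifiers. The record, the reconciliation table, and the URIs below are invented for illustration; they are not taken from the project.

```python
# Illustrative only: a flat, string-based description becomes a few RDF triples
# in which the creator and place point at (placeholder) persistent identifiers.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import DCTERMS

dc_record = {
    "title":   "Letter from the harbour, 1892",
    "creator": "Doe, Jane, 1850-1912",
    "spatial": "San Francisco (Calif.)",
}

# Hypothetical reconciliation table: text string -> persistent identifier.
reconciled = {
    "Doe, Jane, 1850-1912":   URIRef("http://viaf.org/viaf/000000000"),       # placeholder
    "San Francisco (Calif.)": URIRef("http://id.worldcat.org/fast/0000000"),  # placeholder
}

item = URIRef("https://example.org/collection/item/42")   # invented item URI
g = Graph()
g.add((item, DCTERMS.title, Literal(dc_record["title"])))
g.add((item, DCTERMS.creator, reconciled[dc_record["creator"]]))
g.add((item, DCTERMS.spatial, reconciled[dc_record["spatial"]]))

print(g.serialize(format="turtle"))
```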

In the follow up presentation, John Chapman, Senior Product Manager, explained how OCLC’s Shared Entity Management Infrastructure addresses this general concern and the needs identified by library partners, namely the need for: (1) entity URIs/persistent identifiers relevant to library workflows (for works and persons) at the point of need (during the descriptive process) and (2) facilities to link library data to non-library data and shared data to local data. The Andrew W. Mellon Foundation identified OCLC as an organization that can operate at the large scale that is required…and do so sustainably. It awarded OCLC a $2.436 million grant to develop such an infrastructure in 2 years’ time. The Knowledge Graph that is being built is seeded from the knowledge contained in authority files, WorldCat creative works, and controlled vocabularies. This requires much “semantic lifting” – i.e., turning the knowledge that is hidden “in the spaghetti pile of strings in MARC” into structured data or facts. Due attention is given to the provenance and context of the knowledge claims and to multilingual approaches. John noted that curation support for the library community will be important, as well as thinking of APIs and machines as core users. The infrastructure is currently still leveraging Wikibase as its primary technical component, but an architectural decision is in the making to move beyond and scale much larger – in John’s words:

“While Wikibase and Wikidata grew organically over time, this project is going forward very quickly and adding a lot of data very quickly, so we’ve needed to think about engineering some different loading and ingest technologies that keep up with the bigger scale.”

The presentations elicited much interest and many questions. One attendee asked: “Do you think that the value propositions of next generation metadata have been effectively explained to and understood by library directors? And what would be the main things that you would want library directors to do next, if they are engaged?” Indeed, an excellent suggestion to approach library directors now, at such a decisive moment in time, when infrastructures are emerging, and strategic choices need to be made! Another question touched on an important aspect that we propose to address during the round table discussions: “You are now creating a knowledge graph based on library data. Are other projects making other knowledge graphs, on other topics, that are then in the end hopefully connected in some kind of super knowledge graph?”

What’s next?

Next up in the discussion series are the seven round table discussions. Unfortunately, seating is limited, and we have already reached capacity for all of them. However, we will be synthesizing these discussions and sharing findings with the community in the Closing Plenary webinar on 13 April. It is open to everyone interested in the topic. We will also be sharing relevant findings via blog posts here, on the OCLC Research blog Hanging Together, and through other channels, so please stay tuned and watch this space.

The post Next Generation Metadata… it’s getting real! appeared first on Hanging Together.

Evergreen Community Spotlight: Erica Rohlfs / Evergreen ILS

The Evergreen Outreach Committee is pleased to announce Erica Rohlfs as March’s Community Spotlight.  Erica is the Senior Project Manager of Implementation for Equinox Open Library Initiative and has been gently flying along with Evergreen since its development and deployment by Georgia PINES.

When first approached about being interviewed for the Community Spotlight, Erica’s response was one of surprise. Her incredulity grew when informed that she had received more than one nomination.  Erica’s humility and service-mindedness characterize much of her contribution to the success of Evergreen users and the Evergreen community.

In 2007, Erica worked as an Information Specialist at one of the first PINES libraries to use the new Evergreen ILS.  Her experiences teaching staff and patrons the intricacies of Evergreen helped determine the direction of Erica’s professional life and her growing and continued involvement in the Evergreen Community.  

Erica joined Equinox as the Education Librarian in 2012, initially training the staff of libraries migrating from proprietary ILSes to Evergreen.  Now, as Senior Project Manager of Implementation, she coordinates those migrations to both Evergreen and Koha.  Erica says there has been a marked change in her role over the course of those years.  Initially, she spent a lot of time working with circulation and cataloging staff, gaining an understanding of needs directly from front-line workers.  Now, Erica says she works with fewer staff and her focus has shifted from discovery and training for workflows to discovery and negotiation of policies.

Although she’s taken a step back in recent years because of her expanded role with Equinox, Erica is most proud of her time spent wrangling bugs for the community, as well as submitting bug tickets and testing fixes.  She also loves encouraging staff at new Evergreen libraries and “throwing nuggets” about how to get involved in the community.

Outside of work and Evergreen, Erica is an avid gardener with a traditional Southern garden incorporating vegetables with ornamentals.  She is also an authorized and legal rescuer of native and endangered plants, and she maintains a certified Georgia native pollinator yard.

Do you know someone in the community who deserves a bit of extra recognition? Please use this form to submit your nominations. We ask for your email in case we have any questions, but all nominations will be kept confidential.

Any questions can be directed to Andrea Buntz Neiman via abneiman@equinoxinitiative.org.

The CARE Principles for Indigenous Data Governance: overview and Australian activities / HangingTogether

Reader advisory: I have retained British spelling as supplied by contributors. Some of the questions posed during the webinar have been edited for clarity.

The CARE Principles for Indigenous Data Governance focus on appropriate use and reuse of Indigenous data. The principles recenter and reframe discussion and action on the sovereign rights and dignity of Indigenous Peoples, especially against the backdrop of “big data” and broad open access initiatives that are prevalent in today’s libraries and archives.

What is Indigenous data and where might it be found in libraries and special collections? Think about cultural heritage collections, which contain photos, drawings, field notes, even objects. Think about theses and dissertations which contain information that is not appropriate for all to see (or may be inaccurate, or a misrepresentation of a people and their culture). Think about research about climate change, fire management or environmental resilience on Indigenous lands and waters. Think about census records. Think about research data sets produced from Indigenous Peoples or from samples of flora and fauna with which Indigenous Peoples relate. In short, there is already an enormous amount of Indigenous data in our collections and the fact that it has not been properly identified or accorded appropriate treatment is, or should be, of concern.

On February 2, 2021 representatives of the Global Indigenous Data Alliance (GIDA) and the Equity for Indigenous Research and Innovation Coordinating Hub (ENRICH) and representatives from National Library of Australia and University of Sydney joined attendees from Australian and New Zealand institutions for a discussion session hosted by National and State Libraries Australia (NSLA) and the OCLC Research Library Partnership. The panelists shared updates and examples of their work, as well as lessons they’ve learned. Many thanks to those who offered wisdom and expertise. This is a summary of what was shared in the session.

Attendees were asked to watch a previously recorded OCLC RLP webinar, Operationalizing the CARE Principles for Indigenous Data Governance. Additionally, GIDA and ENRICH offered a brief overview and update of the CARE Principles for Indigenous Data Governance, with examples of work done at the Library of Congress in the US, Simon Fraser University in Canada and the University of Tasmania.

GIDA and ENRICH representatives included:

  • Stephanie R. Carroll, Assistant Professor and Associate Director of the Native Nations Institute, University of Arizona
  • Maui Hudson, Director, Te Kotahi Research Institute, University of Waikato
  • Maggie Walter, Distinguished Professor of Sociology, University of Tasmania
  • Jane Anderson, Associate Professor, Department of Anthropology and Museum Studies, New York University

During the COVID-19 pandemic, ENRICH have made great strides in launching the ENRICH Cultural Institutions Network, an important space for thinking and experimentation. As of early February, there were over 56 cultural institutions across seven countries that had joined the network. The goal of the Network is to share work implementing practical mechanisms like the TK and BC Labels & Notices and the CARE Principles, as well as creating an engaged learning space for institutional staff interested in thinking through next steps, implementation pathways, and developing new workflows. In addition, GIDA, ENRICH, and others are developing training modules around Indigenous Data Sovereignty, Indigenous collections and intellectual property, decision-making and governance. ENRICH trainings will be delivered through a distinct Educational and Training Platform under development. This Platform will also make available template data sovereignty agreements, contracts, and clauses.

An important component of moving to appropriate practice within an institutional context is the incorporation of the CARE Principles into policy. A great example is the AIATSIS Code of Ethics for Aboriginal and Torres Strait Islander Research, which embeds the CARE Principles into a code of practice for ethical Australian Indigenous research.

GIDA and ENRICH representatives also answered several questions posed by attendees.

What are best practices for implementing the TK (Traditional Knowledge) and BC (Biocultural) Labels and Notices?

The ENRICH Cultural Institutions Network is sharing best practices, and Local Contexts is developing workflows and training material to support best practice in the application and implementation of the Notices and the Labels as well.

Do you see potential for the Labels to be adapted to support acknowledgement and awareness of other communities represented in data–such as data that can be disaggregated based on gender, disability, ethnicity, religion, etc.?

The Labels are oriented towards Indigenous communities who as collectives have clearer lines of accountability and responsibility between members and the governance entities. While the CARE Principles and Labels might be thought of in the context of other collectives it isn’t clear whether they could be translated to these other contexts without limiting their effectiveness for Indigenous communities.

How can we approach appropriate implementation if we lack clear cultural authorities to work with? We have Land Councils and networks of Elders but there is rarely community consensus on who holds local and cultural authority here in Australia.

The work of customizing the TK and BC Labels at a community level can take time. This is partly because conversations about internal governance and decision-making are necessary. Community consensus and decision-making are complex across every context. What we have found through the development of the TK Labels with Indigenous communities is that the Labels activate conversations that are timely and necessary within Indigenous contexts. Communities want to be recognized as the authorities over their cultural heritage and data. We encourage cultural institutions to start where they can – identifying collections, connecting them to communities. This is why we developed the Notices as a distinct tool for institutions and researchers. Notices can be added before community-developed and customized TK Labels, so institutions can act while waiting for those decisions and governance structures to develop as appropriate.

Our Library does not have Indigenous staff or representation in the Library workforce that would be implementing the CARE Principles. As a University we do have an Indigenous student support centre and we have a number of Indigenous academic staff. Who should we be looking to for guidance to make sure we implement any initiatives appropriately?

From an Australian point of view there are rarely Aboriginal and Torres Strait Islander staff available to provide this sort of expertise. One of the principles of Indigenous data governance is Indigenous leadership and decision making. If you don’t have staff, you need to think about a mechanism so that you can have that leadership and decision making. The risk on the other side is that you have academics or staff at your institution, and they become overburdened. The expectations for those individuals are high, and they already have full time jobs. Speaking with Indigenous people in senior roles at your institution is probably the best place to start, and then developing clear employment strategies so there are Aboriginal and Torres Strait Islander staff.

On a world scale, do Labels work when each cultural group might have a different understanding of the visuals conveyed by the icons?

We developed the icons over 10 years with Indigenous communities across four countries. The icons remain the same so that the Labels can be an internationally recognized system. But each community retains the sovereign right to define the protocols and use their language to make the Label unique and contextual. Here is an example of this specificity with the Sq’ewlets Virtual Museum. The system pivots on both the local and universal to be effective. From the beginning of this initiative there was a need to create icons that were relatively neutral but communicated a clear message about cultural protocols. The need to develop icons that could work across different institutions and jurisdictions was also really important. Communities have to work with multiple organisations and institutions, and organisations have to work with multiple communities. The icons are visual markers for the existence of Indigenous rights and interests in collections and data and therefore need to remain the same, but each community can customise the metadata associated with their own Labels, so a Whakatōhea Attribution Label is different from a Tainui Attribution Label for example.

Do you have any comments on any systems and information architecture features that we should be aware of when working with vendors and documenting requirements that enable systems to be compliant with these CARE Principles?

This is something that will need to emerge as the CARE implementation criteria are developed. It’s an important issue to think about in the context of Indigenous and mainstream procurement needs for digital products and services. Space for metadata, including provenance, permissions, and protocols, potentially in combination with the TK and BC Labels creates a place for Indigenous governance and control that can become a permanent component throughout the data lifecycle.

Would reference to the CARE Principles in policies and procedures have more weight than a more local reference (such as “Māori Data Sovereignty Principles”)?

CARE Principles are high level. Māori Data Sovereignty Principles are thus more specific to Aotearoa, akin to OCAP (First Nations principles of ownership, control, access, and possession) in Canada. First Nation/tribal/iwi are the rightsholders and grounded base for each Indigenous Nation’s principles or expressions of governance and self-determination in relation to their data.

Any update on adding Labels into ORCID and publications?

We are working very closely with ORCiD on this and we are anticipating this functionality within the first half of 2021. We are also moving this along with a variety of publishers. The digital publishing platform RavenSpace at the University of British Columbia Press, which is built using Scalar, has embedded the Labels within its structure. The inaugural book on this platform, ‘As I Remember It’, integrates the Labels – see here.

To provide additional context about how institutions in Australia are working with issues around Indigenous data, two institutions presented how they are grappling with Indigenous data in library collections.

Rebecca Bateman (Assistant Director, Indigenous Engagement), Kevin Bradley (Assistant Director General, Collections), and Marcus Hughes (Director, Indigenous Engagement) gave a tour of activities at the National Library of Australia (NLA).

The National Library of Australia has been working on strategic, staged processes and adjusting ways of thinking – such as the inclusion of First Nations’ languages – that can be seen as steps towards embedding Indigenous cultural perspectives rather than a string of “decolonization projects.”

After a decade of working in partnership with the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS), the Australian National Library was successful in having AUSTLANG codes added to the MARC format.* From an Indigenous perspective, the AUSTLANG codes provide a tool for identification of Country and culture through language groups, so they provide more than an association with a geographic location or place name. Adding these codes to metadata records about materials means that it is much easier to discover material related to Indigenous history, languages, and cultures. Each AUSTLANG language has an alphanumeric code and connects to additional data, such as alternative terms, spellings, number of speakers, how the language has been documented, etc. [Example: Keerray Woorroong]

To help celebrate the International Year of Indigenous Languages in 2019, NLA, AIATSIS, and NSLA partner institutions and an enthusiastic volunteer community joined forces in a national code-a-thon to identify Trove** items in Indigenous Australian languages and add AUSTLANG codes.

This event was wildly successful with 8,000 codes added, enabling search and (importantly) access to materials. The code-a-thon was a significant effort, in terms of the CARE Principles, in that tying cultural items to the appropriate language code helps to establish provenance for these items. The AUSTLANG data set and access to the API are now available via data.gov.au. The dataset includes the language names, each with a unique alphanumeric code that functions as a stable identifier, alternative/variant names and spellings, and the approximate location of each language variety.
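To illustrate why a stable code helps discovery, the sketch below shows one identifier tying together several ways a language name might have been recorded, so that a search keyed on the code retrieves items however the name was spelled. The code "XX9", the variant spellings, and the catalogue records are placeholders, not actual AUSTLANG data.

```python
# Placeholder data; the point is the stable code, not the particular values.
AUSTLANG = {
    "XX9": {
        "name": "Keerray Woorroong",
        "variants": {"Keerray Woorroong", "Kirrae Whurrong", "Girai Wurrung"},
    },
}

# Reverse index: any recorded name or variant spelling -> stable code.
name_to_code = {
    variant.lower(): code
    for code, entry in AUSTLANG.items()
    for variant in entry["variants"] | {entry["name"]}
}

catalogue = [
    {"title": "Wordlist (manuscript)", "language_note": "Girai Wurrung"},
    {"title": "Songs recorded 1965",   "language_note": "Keerray Woorroong"},
    {"title": "Unrelated pamphlet",    "language_note": "English"},
]

def tag_with_austlang(record):
    """Attach the AUSTLANG code, if any, so discovery can key on it."""
    code = name_to_code.get(record["language_note"].lower())
    return {**record, "austlang": code}

tagged = [tag_with_austlang(r) for r in catalogue]
print([r["title"] for r in tagged if r["austlang"] == "XX9"])
```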

The language codes are being harnessed in other ways as well. Over the last 18 months the Trove team has made efforts to enhance the discovery capacities for Indigenous collections, materials marked with the AUSTLANG codes and/or with AIATSIS subject headings that are signaled in Trove as “First Australians” (the terminology used for content and features relating to Aboriginal and Torres Strait Islander peoples). Rebecca demonstrated this by showing an example of the work Culture in Translation, which has the “First Australians” designator. Additionally, most works in Trove have a mechanism for reporting culturally sensitive content. Visual material that may be sensitive or inappropriate for viewing (again, driven by subject headings) is presented with blurred thumbnail images, which the Trove user must agree to have unblurred.

Other related efforts include the development and implementation of guidelines for describing published materials. That work has been completed, and NLA has now turned attention to unpublished materials. The NLA Indigenous Engagement team is better positioned to collaborate with communities and individuals, and to use this initial work as a basis for Indigenous data sovereignty regarding how materials are found and described. Speakers described the TK and BC Labels as powerful tools to promote that sovereignty.

As revealed in the audience Q&A, there was considerable interest around the underpinnings of the AUSTLANG codes and the mechanisms for reporting culturally sensitive content.

What happens when a report on Indigenous cultural sensitivity is submitted via Trove? What is the turnaround on those reports? How is the feedback passed on to contributing libraries?

All queries sent via the Indigenous cultural sensitivity form on Trove are sent to the NLA Indigenous Engagement Team. Trove staff also see these queries as they come in, however it is Indigenous Engagement staff who review and take action.

Turnaround really depends on the nature of the query. Most are answered within a few days; however, complex issues will take longer to resolve. We aim to make contact with the person who has sent the form as soon as we are able to.

At this stage necessary changes are made to records in Trove only. We would like to work more closely with partner organisations on the resolution of these issues and are currently considering what that might look like.

Is there a way to “roll back” or “roll into” these Trove features into individual library catalogues?

From a technical perspective this is a question for Trove. The best way to ensure these features apply to the records your organisation provides to Trove is by making sure those records contain metadata that will be picked up by the criteria. NLA Indigenous Engagement and Trove staff are currently working on a suite of resources, including a webinar, to provide partner organisations with guidance around this.

AIATSIS has identified 465 languages in Australia in Austlang. What is occurring to get these records included in mainstream library catalogues?

The work done in 2019, via the Austlang code-a-thon and related activities, was the first step in promoting the use of Austlang codes in mainstream Library catalogues. There are plans for a second code-a-thon and other activities to further this work.

Finally, Jennifer Stanton (Manager Digital Collections) from University of Sydney gave an overview of activities that have happened over the last five years. The library has been grounded in embedding culturally safe practices into their activities, mindful of Indigenous Cultural Intellectual Property as well as encouraging ethical use of First Nations cultural knowledge and culturally appropriate research practices.

Jennifer reflected on their previous practices, “where in-house decisions … came out of ad-hoc approaches.” Looking at theses as a specific example, there is generally no reliable way to tell if there is Indigenous data or culturally sensitive information in those materials; staff are left to scour tables of contents, keywords, and occasional notes. This lack of provenance is what Jennifer identified as one of the biggest barriers.

In previous practice, they might go ahead and digitize the materials OR contact the Department of Anthropology, but would not seek advice outside the university. Decisions and actions were often not recorded.

This began to change after cultural competence training was implemented at a University level, giving library staff more confidence in engaging in discussions around Indigenous data governance. Discussions led to questioning previous practices. Jennifer shared that in her own experience, training led to more open discussions becoming the norm. Now when staff encounter Indigenous data or culturally sensitive materials, they can have an open conversation about how to move forward.

Very recently, Aboriginal and Torres Strait Islander Cultural Protocols were created by the Library’s Wingara Mura group, which was led by Nathan Sentance, a Wiradjuri man who worked with the Library as a Wingara Mura Advisor. These protocols, which are now available, are intended as a set of principles and guidelines to enhance and embed culturally competent practice within an Australian academic library context. They focus on things like identifying potentially sensitive First Nations cultural materials, dealing with takedown requests relating to culturally sensitive materials, and how to approach involving communities in the decision-making process. They have been endorsed by the University Executive Indigenous Strategy and Services Committee, and the library will soon begin implementing these protocols in systems, procedures, and culture.

The library has already taken a number of smaller actions in moving towards culturally safe collections and practices, including:

  • A cultural care warning, now on the digital collections website and soon across all library websites, so that visitors are aware that they may encounter culturally sensitive materials
  • Cultural care warnings on relevant individual digitised items, mainly theses
  • Documenting items that may contain culturally sensitive information so they can be reviewed once a formalized decision-making process is in place
  • Adding Austlang codes to metadata when an item includes a particular First Nations language or research about that language.
  • Enabling submitters to the institutional repository to select a First Nations language as the language of the item.

Jennifer highlighted that there is no blanket solution when it comes to working with communities in the Australian landscape, and it will take time to build relationships and find appropriate solutions. Ideally, organizations will develop a central communications office that can provide expertise and support dialog and relationship building.

An additional complication is harvested content, which appears (from an end user perspective) as being in the library’s collection. Addressing the type of metadata that turns up in a federated search is a difficult challenge and one that requires an ecosystem of content and record creators who are all working towards the same goals. The library can, however, be more discerning about the content that is being harvested and prioritize content coming from Aboriginal and Torres Strait Islander organizations like AIATSIS.

The overall goal of the library is to be inclusive, to go beyond the colonial perspective, for staff to feel informed and supported in their knowledge about Indigenous data governance, and for the library to work with communities outside the walls of the institution.

This was a well-attended webinar with terrific audience engagement, and I encourage you to view it in its entirety!

For more guidance on working with Indigenous collections in Australia, see NSLA’s online resources based on the ATSILIRN Protocols.

*In Australia there are more than 250 Indigenous languages, encompassing some 800 dialects. Each language is specific to a particular place and people. 

**Trove is a discovery portal representing a collaboration between the National Library of Australia and hundreds of Partner organisations around Australia.

Many thanks to the presenters for taking the additional time to review and edit this blog post and to Barbara Lemon (NSLA) and Mercy Procaccini (OCLC RLP) who also reviewed and supplied helpful suggestions.

The post The CARE Principles for Indigenous Data Governance: overview and Australian activities appeared first on Hanging Together.

Digitization Wars, Redux / CrossRef

 (NB: IANAL) 

 Because this is long, you can download it as a PDF here.

From 2004 to 2016 the book world (authors, publishers, libraries, and booksellers) was involved in the complex and legally fraught activities around Google’s book digitization project. Once known as “Google Book Search,” the company claimed that it was digitizing books to be able to provide search services across the print corpus, much as it provides search capabilities over texts and other media that are hosted throughout the Internet. 

Both the US Authors Guild and the Association of American Publishers sued Google (both separately and together) for violation of copyright. These suits took a number of turns including proposals for settlements that were arcane in their complexity and that ultimately failed. Finally, in 2016 the legal question was decided: digitizing to create an index is fair use as long as only minor portions of the original text are shown to users in the form of context-specific snippets. 

We now have another question about book digitization: can books be digitized for the purpose of substituting remote lending in the place of the lending of a physical copy? This has been referred to as “Controlled Digital Lending (CDL),” a term developed by the Internet Archive for its online book lending services. The Archive has considerable experience with both digitization and providing online access to materials in various formats, and its Open Library site has been providing digital downloads of out of copyright books for more than a decade. Controlled digital lending applies solely to works that are presumed to be in copyright. 

Controlled digital lending works like this: the Archive obtains and retains a physical copy of a book. The book is digitized and added to the Open Library catalog of works. Users can borrow the book for a limited time (2 weeks) after which the book “returns” to the Open Library. While the book is checked out to a user no other user can borrow that “copy.” The digital copy is linked one-to-one with a physical copy, so if more than one copy of the physical book is owned then there is one digital loan available for each physical copy. 
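
To make the one-to-one constraint concrete, here is a minimal sketch in Python of how such a system could tie the number of concurrent digital loans to the number of physical copies held. This is not the Archive’s actual implementation; the class and method names are hypothetical.

    from datetime import datetime, timedelta

    LOAN_PERIOD = timedelta(weeks=2)  # the two-week loan period described above

    class CDLTitle:
        """One digitized title whose digital loans are capped by the physical copies owned."""

        def __init__(self, title: str, physical_copies: int):
            self.title = title
            self.physical_copies = physical_copies  # copies owned and held out of circulation
            self.active_loans = {}                  # borrower -> due date

        def checkout(self, borrower: str, now: datetime) -> datetime:
            self._expire(now)
            if borrower in self.active_loans:
                raise ValueError("borrower already has this title checked out")
            if len(self.active_loans) >= self.physical_copies:
                raise ValueError("no copy available: every owned copy is already on loan")
            due = now + LOAN_PERIOD
            self.active_loans[borrower] = due
            return due

        def checkin(self, borrower: str) -> None:
            self.active_loans.pop(borrower, None)   # the "copy" returns and can be lent again

        def _expire(self, now: datetime) -> None:
            # loans lapse automatically at the end of the lending period
            self.active_loans = {b: d for b, d in self.active_loans.items() if d > now}

    book = CDLTitle("Example Title", physical_copies=1)
    book.checkout("reader-a", datetime(2020, 3, 1))
    # A second simultaneous borrower would be refused until the first loan is returned or expires.

In these terms, the National Emergency Library change described below amounted to dropping the availability check, so that loans were no longer capped by the number of physical copies.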

The Archive is not alone in experimenting with lending of digitized copies: some libraries have partnered with the Archive’s digitization and lending service to provide digital lending for library-owned materials. In the case of the Archive the physical books are not available for lending. Physical libraries that are experimenting with CDL face the added step of making sure that the physical book is removed from circulation while the digitized book is on loan, and reversing that on return of the digital book. 

Although CDL has an air of legality due to limiting lending to one user at a time, authors and publishers associations had raised objections to the practice. [nwu] However, in March of 2020 the Archive took a daring step that pushed their version of the CDL into litigation: using the closing of many physical libraries due to the COVID pandemic as its rationale, the Archive renamed its lending service the National Emergency Library [nel] and eliminated the one-to-one link between physical and digital copies. Ironically this meant that the Archive was then actually doing what the book industry had accused it of (either out of misunderstanding or as an exaggeration of the threat posed): it was making and lending digital copies beyond its physical holdings. The Archive stated that the National Emergency Library would last only until June of 2020, presumably because by then the COVID danger would have passed and libraries would have re-opened. In June the Archive’s book lending service returned to the one-to-one model. Also in June a suit was filed by four publishers (Hachette, HarperCollins, Penguin Random House, and Wiley) in the US District Court of the Southern District of New York. [suit] 

Controlled digital lending, like the Google Books project, raises many interesting questions about the nature of “digital vs physical,” not only in a legal sense but in the sense of what it means to read and to be a reader today. The lawsuit does not further our understanding of this fascinating question; it sinks immediately into hyperbole, fear-mongering, and either mis-information or mis-direction. That is, admittedly, the nature of a lawsuit. What follows here is not a full analysis but a few of the questions that are foremost in my mind.

 Apples and Oranges 

Each of the players in this drama has admirable reasons for their actions. The publishers explain in their suit that they are acting in support of authors, in particular to protect the income of authors so that they may continue to write. The Authors’ Guild provides some data on author income, and by their estimate the average full-time author earns less than $20,000 per year, putting them at poverty level.[aghard] (If that average includes the earnings of highly paid best-selling authors, then the actual earnings of many authors are quite a bit less than that.) 

The Internet Archive is motivated to provide democratic access to the content of books to anyone who needs or wants it. Even before the pandemic caused many libraries to close, the collection housed at the Archive contained some works that are available only in a few research libraries. This is because many of the books were digitized during the Google Books project, which digitized books from a small number of very large research libraries whose collections differ significantly from those of the public libraries available to most citizens. 

Where the pronouncements of both parties fail is in making a false equivalence between some authors and all authors, and between some books and all books, and the result is a lawsuit pitting apples against oranges. We saw in the lawsuits against Google that some academic authors, who may gain status based on their publications but very little if any income, did not see themselves as among those harmed by the book digitization project. Notably, the authors in this current suit, as listed in the bibliography of pirated books in the appendix to the lawsuit, are ones whose works would be characterized best as “popular” and “commercial,” not academic: James Patterson, J. D. Salinger, Malcolm Gladwell, Toni Morrison, Laura Ingalls Wilder, and others. Not only do the living authors here earn above the poverty level; all of them provide significant revenue for the publishers themselves. And all of the books listed are in print and available in the marketplace. No mention is made of out-of-print books, and no academic publishers seem to be involved. 

On the part of the Archive, they state that their digitized books fill an educational purpose, and that their collection includes books that are not available in digital format from publishers:

“ While Overdrive, Hoopla, and other streaming services provide patrons access to latest best sellers and popular titles,  the long tail of reading and research materials available deep within a library’s print collection are often not available through these large commercial services.  What this means is that when libraries face closures in times of crisis, patrons are left with access to only a fraction of the materials that the library holds in its collection.”[cdl-blog]

This is undoubtedly true for some of the digitized books, but the main thesis of the lawsuit points out that the Archive has digitized and is also lending current popular titles. The list of books included in the appendix of the lawsuit shows that there are in-copyright and most likely in-print books of a popular reading nature that have been part of the CDL. These titles are available in print and may also be available as ebooks from the publishers. Thus while the publishers are arguing that current, popular books should not be digitized and loaned (apples), the Archive is arguing that they are providing access to items not available elsewhere, and for educational purposes (oranges). 

The Law 

The suit states that publishers are not questioning copyright law, only violations of the law.

“For the avoidance of doubt, this lawsuit is not about the occasional transmission of a title under appropriately limited circumstances, nor about anything permissioned or in the public domain. On the contrary, it is about IA’s purposeful collection of truckloads of in-copyright books to scan, reproduce, and then distribute digital bootleg versions online.” ([Suit] Page 3).

This brings up a whole range of legal issues in regard to distributing digital copies of copyrighted works. There have been lengthy arguments about whether copyright law could permit first sale rights for digital items, and the answer has generally been no; some copyright holders have made the argument that since transfer of a digital file is necessarily the making of a copy there can be no first sale rights for those files. [1stSale] [ag1] Some ebook systems, such as the Kindle, have allowed time-limited person-to-person lending for some ebooks. This is governed by license terms between Amazon and the publishers, not by the first sale rights of the analog world. 

Section 108 of the copyright law does allow libraries and archives to make a limited number of copies. The first point of section 108 states that libraries can make a single copy of a work as long as 1) it is not for commercial advantage, 2) the collection is open to the public, and 3) the reproduction includes the copyright notice from the original. This sounds like what the Archive is doing. However, the next two sections (b and c) provide limitations on that first section that appear to put the Archive in legal jeopardy: section “b” clarifies that copies may be made for preservation or security; section “c” states that copies can be made if the original item is deteriorating and a replacement can no longer be purchased. Neither of these applies to the Archive’s lending. 

In addition to its lending program, the Archive provides downloads of scanned books in DAISY format for those who are certified as visually impaired by the National Library Service for the Blind and Physically Handicapped in the US. This is covered in section 121A of the copyright law, Title 17, which allows the distribution of copyrighted works in accessible formats. This service could possibly be cited as a justification for the scanning of in-copyright works at the Archive, although it does not mitigate the complaints about lending those copies to others. It is a laudable service of the Archive if the scans are usable by the visually impaired, but the DAISY-compatible files are based on the OCR’d text, which can be quite dirty. Without data on downloads under this program it is hard to know the extent to which it benefits visually impaired readers. 

 Lending 

Most likely as part of the strategy of the lawsuit, very little mention is made of “lending.” Instead the suit uses terms like “download” and “distribution,” which imply that the user of the Archive’s service is given a permanent copy of the book:

“With just a few clicks, any Internet-connected user can download complete digital copies of in-copyright books from Defendant.” ([suit] Page 2). “... distributing the resulting illegal bootleg copies for free over the Internet to individuals worldwide.” ([suit] Page 14).

Publishers were reluctant to allow the creation of ebooks for many years, until they saw that DRM would protect the digital copies. It was then another couple of years before they could feel confident about lending - and by lending I mean lending by libraries. It appears that Overdrive, the main library lending platform for ebooks, worked closely with publishers to gain their trust. The lawsuit questions whether the lending technology created by the Archive can be trusted.

“...Plaintiffs have legitimate fears regarding the security of their works both as stored by IA on its servers” ([suit] Page 47).

In essence, the suit accuses IA of a lack of transparency about its lending operation. Of course, any collaboration between IA and publishers around the technology is not possible because the two are entirely at odds and the publishers would reasonably not cooperate with folks they see as engaged in piracy of their property. 

Even if the Archive’s lending technology were proven to be secure, lending alone is not the issue: the Archive copied the publishers’ books without permission prior to lending. In other words, they were lending content that they neither owned (in digital form) nor had licensed for digital distribution. Libraries pay, and pay dearly, for the ebook lending service that they provide to their users. The restrictions on ebooks may seem to be a money-grab on the part of publishers, but from their point of view it is a revenue stream that CDL threatens. 

Is it About the Money?

“... IA rakes in money from its infringing services…” ([suit] Page 40). (Note: publishers earn, IA “rakes in”)
“Moreover, while Defendant promotes its non-profit status, it is in fact a highly commercial enterprise with millions of dollars of annual revenues, including financial schemes that provide funding for IA’s infringing activities.” ([suit] Page 4).

These arguments directly address section (a)(1) of Title 17, section 108: “(1) the reproduction or distribution is made without any purpose of direct or indirect commercial advantage”. 

At various points in the suit there are references to the Archive’s income, both for its scanning services and donations, as well as an unveiled show of envy at the over $100 million that Brewster Kahle and his wife have in their jointly owned foundation. This is an attempt to show that the Archive derives “direct or indirect commercial advantage” from CDL. Non-profit organizations do indeed have income, otherwise they could not function, and “non-profit” does not mean a lack of a revenue stream, it means returning revenue to the organization instead of taking it as profit. The argument relating to income is weakened by the fact that the Archive is not charging for the books it lends. However, much depends on how the courts will interpret “indirect commercial advantage.” The suit argues that the Archive benefits generally from the scanned books because this enhances the Archive’s reputation which possibly results in more donations. There is a section in the suit relating to the “sponsor a book” program where someone can donate a specific amount to the Archive to digitize a book. How many of us have not gotten a solicitation from a non-profit that makes statements like: “$10 will feed a child for a day; $100 will buy seed for a farmer, etc.”? The attempt to correlate free use of materials with income may be hard to prove. 

Reading 

Decades ago, when the service Questia was just being launched (Questia ceased operation December 21, 2020), Questia sales people assured a group of us that their books were for “research, not reading.” Google used a similar argument to support its scanning operation, something like “search, not reading.” The decision in Google’s case held that Google’s scanning was fair use (and transformative) because the books were not available for reading: Google was not presenting the full text of the book to its audience.[suit-g] 

The Archive has taken the opposite approach, a “books are for reading” view. Beginning with public domain books, many from the Google books project, and then with in-copyright books, the Archive has promoted reading. It developed its own in-browser reading software to facilitate reading of the books online. [reader] (*See note below)

Although the publishers sued Google for its scanning, they lost due to the “search, not reading” aspect of that project. The Archive has been very clear about its support of reading, which takes the Google justification off the table. 

“Moreover, IA’s massive book digitization business has no new purpose that is fundamentally different than that of the Publishers: both distribute entire books for reading.” ([suit] Page 5). 

However, the Archive's statistics on loaned books show that a large proportion of the books are used for 30 minutes or less. 

“Patrons may be using the checked-out book for fact checking or research, but we suspect a large number of people are browsing the book in a way similar to browsing library shelves.” [ia1] 

 In its article on the CDL, the Center for Democracy and Technology notes that “the majority of books borrowed through NEL were used for less than 30 minutes, suggesting that CDL’s primary use is for fact-checking and research, a purpose that courts deem favorable in a finding of fair use.” [cdt] The complication is that the same service seems to be used both for reading of entire books and as a place to browse or to check individual facts (the facts themselves cannot be copyrighted). These may involve different sets of books, once again making it difficult to characterize the entire set of digitized books under a single legal claim. 

The publishers claim that the Archive is competing with them using pirated versions of their own products. That leads us to the question of whether the Archive’s books, presented for reading, are effectively substitutes for those of the publishers. Although the Archive offers actual copies, those copies are significantly inferior to the originals. However, the question of quality did not change the judgment in the lawsuit against copying of texts by Kinko’s [kinkos], which produced mediocre photocopies from printed and bound publications. It seems unlikely that the quality differential will serve to absolve the Archive from copyright infringement, even though the poor quality of some of the books interferes with their readability. 

Digital is Different

Publishers have found a way to monetize digital versions, in spite of some risks, by taking advantage of the ability to control digital files with technology and by licensing, not selling, those files to individuals and to libraries. It’s a “new product” that gets around First Sale because, as it is argued, every transfer of a digital file makes a copy, and it is the making of copies that is covered by copyright law. [1stSale] 

The upshot of this is that because a digital resource is licensed, not sold, the right to pass along, lend, or re-sell a copy (as per Title 17 section 109) does not apply even though technology solutions that would delete the sender’s copy as the file safely reaches the recipient are not only plausible but have been developed. [resale] 

“Like other copyright sectors that license education technology or entertainment software, publishers either license ebooks to consumers or sell them pursuant to special agreements or terms.” ([suit] Page 15)

“When an ebook customer obtains access to the title in a digital format, there are set terms that determine what the user can or cannot do with the underlying file.”([suit] Page 16)

This control goes beyond the copyright holder’s rights in law: DRM can exercise controls over the actual use of a file, limiting it to specific formats or devices, allowing or not allowing text-to-speech capabilities, even limiting copying to the clipboard.

Publishers and Libraries 

The suit claims that publishers and libraries have reached an agreement, an equilibrium.

“To Plaintiffs, libraries are not just customers but allies in a shared mission to make books available to those who have a desire to read, including, especially, those who lack the financial means to purchase their own copies.” ([suit] Page 17).

In the suit, the publishers contrast the Archive’s operation with their own relationship with libraries. Compared with the Archive’s lending program, libraries are the “good guys.”

“... the Publishers have established independent and distinct distribution models for ebooks, including a market for lending ebooks through libraries, which are governed by different terms and expectations than print books.”([suit] Page 6).

These “different terms” include charging libraries much higher prices for ebooks and limiting the number of times an ebook can be loaned. [pricing1] [pricing2]

“Legitimate libraries, on the other hand, license ebooks from publishers for limited periods of time or a limited number of loans; or at much higher prices than the ebooks available for individual purchase.” [agol]

The equilibrium of which publishers speak looks less equal from the library side of the equation: library literature is replete with stories about the avarice of publishers in relation to library lending of ebooks. Some authors and publishers even argue that library lending cuts into sales, an argument that has been made for physical books as well as ebooks.

“If, as Macmillan has determined, 45% of ebook reads are occurring through libraries and that percentage is only growing, it means that we are training readers to read ebooks for free through libraries instead of buying them. With author earnings down to new lows, we cannot tolerate ever-decreasing book sales that result in even lower author earnings.” [agliblend][ag42]

The ease of access to digital books has become a boon for book sales, and ebook sales are now rising while hard copy sales fall. This economic factor is a motivator for any of those engaged with the book market. The Archive’s CDL is a direct affront to the revenue stream that publishers have carved out for specific digital products. There are indications that the ease of borrowing of ebooks - not even needing to go to the physical library to borrow a book - is seen as a threat by publishers. This has already played out in other media, from music to movies. 

It would be hard to argue that access to the Archive’s digitized books is merely a substitute for library access. Many people do not have actual physical library access to the books that the Archive lends, especially those digitized from the collections of academic libraries. This is particularly true when you consider that the Archive’s materials are available to anyone in the world with access to the Internet. If you don’t have an economic interest in book sales, and especially if you are an educator or researcher, this expanded access could feel long overdue. 

We need numbers 

We really do not know much about the uses of the Archive’s book collection. The lawsuit cites some statistics of “views” to show that the infringement has taken place, but the page in question does not explain what is meant by a “view”. Archive pages for downloadable files of metadata records also report “views” which most likely reflect views of that web page, since there is nothing viewable other than the page itself. Open Library book pages give “currently reading” and “have read” stats, but these are tags that users can manually add to the page for the work. To compound things, the 127 books cited in the suit have been removed from the lending service (and are identified in the Archive as being in the collection “litigation works”).

Although numbers may not affect the legality of the controlled digital lending, the social impact of the Archive’s contribution to reading and research would be clearer if we had this information. Although the Archive has provided a small number of testimonials, a proof of use in educational settings would bolster the claims of social benefit which in turn could strengthen a fair use defense. 

Notes

(*) The NWU has a slide show [nwu2] that explains what it calls Controlled Digital Lending at the Archive. Unfortunately this document conflates the Archive's book Reader with CDL and therefore muddies the water. It muddies it because it does not distinguish between sending files to dedicated devices (which is what the Kindle is) or to dedicated lending apps (such as Libby, which many libraries use), and the Archive's use of a web-based reader. It is not beyond reason to suppose that the Archive's Reader software does not fully secure loaned items. The NWU claims that files representing all of the book pages viewed are left in the browser cache: "There’s no attempt whatsoever to restrict how long any user retains these images". (I cannot reproduce this. In my minor experiments those files disappear at the end of the lending period, but this requires more concerted study.) However, this is not a fault of CDL but a fault of the Reader software. The Reader is software that works within a browser window. In general, electronic files that require secure and limited use are not used within browsers, which are general-purpose programs.

Conflating the Archive's Reader software with Controlled Digital Lending will only hinder understanding. Already CDL has multiple components:

  1. Digitization of in-copyright materials
  2. Lending of digital copies of in-copyright materials that are owned by the library in a 1-to-1 relation to physical copies

We can add #3, the leakage of page copies via the browser cache, but I maintain that poorly functioning software does not automatically moot points 1 and 2. I would prefer that we take each point on its own in order to get a clear idea of the issues.

The NWU slides also refer to the Archive's API which allows linking to individual pages within books. This is an interesting legal area because it may be determined to be fair use regardless of the legality of the underlying copy. This becomes yet another issue to be discussed by the legal teams, but it is separate from the question of controlled digital lending. Let's stay focused.

 

Citations

[1stSale] https://abovethelaw.com/2017/11/a-digital-take-on-the-first-sale-doctrine/ 

[ag1] https://www.authorsguild.org/industry-advocacy/reselling-a-digital-file-infringes-copyright/ 

[ag42] https://www.authorsguild.org/industry-advocacy/authors-guild-survey-shows-drastic-42-percent-decline-in-authors-earnings-in-last-decade/ 

[aghard] https://www.authorsguild.org/the-writing-life/why-is-it-so-goddamned-hard-to-make-a-living-as-a-writer-today/

[agliblend] https://www.authorsguild.org/industry-advocacy/macmillan-announces-new-library-lending-terms-for-ebooks/

[agol] https://www.authorsguild.org/industry-advocacy/update-open-library/ 

[cdl-blog] https://blog.archive.org/2020/03/09/controlled-digital-lending-and-open-libraries-helping-libraries-and-readers-in-times-of-crisis

[cdt] https://cdt.org/insights/up-next-controlled-digital-lendings-first-legal-battle-as-publishers-take-on-the-internet-archive/ 

[kinkos] https://law.justia.com/cases/federal/district-courts/FSupp/758/1522/1809457

[nel] http://blog.archive.org/national-emergency-library/

[nwu] "Appeal from the victims of Controlled Digital Lending (CDL)". (Retrieved 2021-01-10) 

[nwu2] "What is the Internet Archive doing with our books?" https://nwu.org/wp-content/uploads/2020/04/NWU-Internet-Archive-webinar-27APR2020.pdf

[pricing1] https://www.authorsguild.org/industry-advocacy/e-book-library-pricing-the-game-changes-again/ 

[pricing2] https://americanlibrariesmagazine.org/blogs/e-content/ebook-pricing-wars-publishers-perspective/ 

[reader] Bookreader 

[resale] https://www.hollywoodreporter.com/thr-esq/appeals-court-weighs-resale-digital-files-1168577 

[suit] https://www.courtlistener.com/recap/gov.uscourts.nysd.537900/gov.uscourts.nysd.537900.1.0.pdf 

[suit-g] https://cases.justia.com/federal/appellate-courts/ca2/13-4829/13-4829-2015-10-16.pdf?ts=1445005805

Engaging in “Difficult Conversations” on race: lessons learned from an RLP team practice group / HangingTogether

In June 2020, following the murder of George Floyd, I felt overcome with sadness, anger, and a sense of powerlessness. Within my personal network I saw people reacting in a range of ways. Some participated in demonstrations and marches, not only in the US but across the world. Others engaged in a practice of reading and reflection or joined book clubs (sending books like The New Jim Crow and How to Be an Antiracist to the top of the sales charts). Still others engaged in vigorous discussion with loved ones, living in close quarters with family members due to COVID (I know many people who had these discussions with their children who were home from college). Many did all these things, and more.

Well before 2020, conversations about bias and racism in libraries were very much part of the professional discourse, a regular part of meetings and conferences attended by the OCLC Research Library Partnership team and well documented in the professional literature. However, last year what had been a steady and consistent stream grew to a tidal wave, and we recognized that it was urgent that we accelerate our learning to become fluent participants in this conversation.

The OCLC RLP team are regularly called on to facilitate conversations about important issues facing research libraries – and at this moment of national racial reckoning, confronting and disrupting embedded racism is among the most important issues facing our profession. How well were we equipped to participate in – let alone lead – conversations about the legacy of white supremacy and racial injustice? Out of these discussions, our “Difficult Conversations” practice group was born. In this blog post, I am sharing what we did to help expose our practice, to share resources, and to invite feedback.

What was the practice?

We had three concrete goals for our practice group:

  • Develop skills in facilitating difficult conversations in general.
  • Develop specific skills and knowledge to talk about race and race-related issues.
  • Understand what productive and unproductive conversations look and feel like, and ways to steer unproductive conversations back on track.

We utilized a few key resources. The Readers Guide for White Fragility and the Aorta Anti-Oppressive Facilitation for Democratic Process documents both have useful guidance to help identify common patterns of unproductive conversations, and ways to get discussions back on track (way easier said than done!). We used the Aorta Community Agreements as a basis for establishing community norms for behavior within the group.

We started by going through the Smithsonian National Museum of African American History and Culture’s (NMAAHC) Talking About Race portal. This was a perfect resource for our purpose – there are a number of learning modules, including (listed in the order we went through them):

  • historical foundations of race
  • social identities and systems of oppression
  • race and racial identity
  • bias
  • whiteness
  • community building

Each topic in the Talking About Race portal has several learning resources, such as videos and short articles, paired with questions, activities and exercises.

Once we completed the Talking About Race portal, we read a pair of articles: Low morale in ethnic and racial minority academic librarians: An experiential study (Kaetrena Davis Kendrick and Ione T. Damasco, 2019) and The low-morale experience of academic librarians: A phenomenological study (Kaetrena Davis Kendrick, 2017).

We scheduled monthly 90-minute conversations, and took turns facilitating the discussion in pairs. Facilitators would review the learning materials ahead of time to prioritize the materials that we would focus on, as well as come up with several questions to help guide the conversation. Each meeting started with a reminder of our goals, and the question: “What in this unit was new to you or challenged the way you previously thought about or approached [topic]?” As this was a skill-building exercise, we reserved the last 30 minutes of our meetings to reflect on the conversation itself, how it felt to participate, how it felt to facilitate the discussions, what we did well and what we did poorly or would want to approach differently in the future.

What was my experience?

Here, I am sharing my own experience and not speaking for anyone else in the group. I entered these conversations with a fair bit of skepticism – I did not think it was realistic for me to emerge from these discussions as someone qualified to “facilitate” a discussion centered on issues related to race. I am a white woman who has grown up in a culture that has persistently upheld white supremacy. I can easily do damage, eroding trust and credibility by engaging in non-helpful ways.

I am fortunate to work with a group of people I trusted fully, well before we went down this path. I want to linger on this point of existing trust for a moment because it seems to be an essential ingredient in my own ability to move forward – if I did not feel cared for by my colleagues, this would have been a more difficult endeavor. The Kendrick / Damasco readings were useful in helping me to realize that trust in the workplace is a privilege and one I benefit from every day. In this practice, knowing that my colleagues are invested in my success made it possible for me to be open and honest and fully engage in what were almost always emotional conversations. In being open, I made mistakes. Remember what I said above about unproductive conversations? I am grateful to colleagues for correcting my mistakes, being patient with me, and investing in my growth!

After our meetings, I felt both energized and drained. Energized because I was learning new things and gaining new perspectives. Drained because the issues we discussed seem enormous, pervasive, and resistant to change. I also frequently felt a sense of embarrassment about being oblivious to history and patterns of behavior.

After months of talking and thinking about the issues outlined in the Talking About Race portal, I still do not feel prepared or qualified to lead in a discussion about race. However, I do believe I can be a more productive contributor in discussions that others are leading.  Engaging directly with difficult topics requires practice; an invaluable aspect of this conversation group has been setting aside the time to practice on a regular basis.

Lessons learned and suggestions for others

If you are interested in mirroring or adapting our approach, here are a few ideas.

  • Establishing community norms for our group was an important first step and I’m glad we budgeted appropriate time for this purpose. It helped to create space and community for this specific purpose, even within a group that meets regularly. I have now attended a few meetings that make use of community agreements and I love them, as long as appropriate time is set aside for understanding, discussion, adaptation, adoption, and buy in. They should not, in my opinion, be treated as a click-through license agreement with the assumption that everyone has a common reference point.
  • Working in pairs and cycling through facilitators has worked well for our group. Everyone got a chance to lead and “host” these discussions.
  • It was beneficial to devote time to the content and also to discuss how our learnings will be applied in the work we do. We noted specific tools and examples that we might use in our work. This has resulted in more concrete, actionable outcomes than simply following a set of suggested discussion questions.
  • The Talking About Race portal is excellent, and I’d suggest it to others who are looking for study materials. Once we finished working our way through the NMAAHC materials, we sought to identify other materials that pair learning resources with tools and exercises. We’d very much welcome suggestions for such material!

I’m certain that many of you have engaged in similar practices of self-study, and we would love to hear about your approach, what resources you would recommend, and any other ideas you have to share with the team. Let us know in the comments, or send us an email.

[Other participants in the “Difficult Conversations” practice group are members of the RLP team: Chela Scott Weber, Dennis Massie, Karen Smith-Yoshimura (now retired), Mercy Procaccini, Rachel Frick, and Rebecca Bryant.]

The post Engaging in “Difficult Conversations” on race: lessons learned from an RLP team practice group appeared first on Hanging Together.

Weeknote 8 (late) 2021 / Mita Williams

Last week I had a week that was more taxing than normal and I had nothing in the tank by Friday. So I’m putting together last week’s weeknotes today. Also, going forward each section heading has been anchor tagged for your link sharing needs. e.g. §1 §2 §3 §4 §5 and §6.

I say this recognizing that the weeknote format resists social sharing which I consider a feature not a bug.

§1 We Are Here

From Library and Archives Canada:

Over the past three years, We Are Here: Sharing Stories has digitized and described over 590,000 images of archival and published materials related to First Nations, Inuit and the Métis Nation.

Digitized and described content includes textual documents, photographs, artworks and maps as well as numerous language publications. All items are searchable and linked in our Collection Search or Aurora databases.

In order to make it easier to locate recently digitized Indigenous heritage content at LAC, we have created a searchable list of the collections and introduced a Google map feature – allowing users to browse archival materials by geographic region!

Visit the We Are Here: Sharing Stories page to pick your destination and start your research!

Those who know me know that I’ve been advocating for more means of discovery via maps and location for a while now. While my own mapping has slowed down, I still bookmarked Georeferencing in QGIS 2.0 from The Programming Historian today.

If used appropriately, maps hold a great deal of potential as a means to discover works related to Indigenous peoples. Some forms of Indigenous Knowledge Organization, such as the X̱wi7x̱wa Classification Scheme, emphasize geographic grouping over alphabetical grouping.

§2 Bookfeedme, Seymour! *

Not every author has a newsletter that you can subscribe to in order to be informed when they have a new book out.

You would think it would be easier to be notified otherwise, but with the mothballing of Amazon Alerts, the only other way I know to be notified is through Bookfeed.io, which uses the Google Books API at its core.
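
For readers who would rather roll their own alerts than rely on Bookfeed.io, here is a minimal sketch of the same idea against the public Google Books volumes endpoint. I don’t know exactly how Bookfeed.io queries the API; the author name below and the lack of an API key are just for illustration.

    import requests

    def latest_books(author: str, max_results: int = 5):
        """Ask the Google Books API for the newest volumes attributed to an author."""
        resp = requests.get(
            "https://www.googleapis.com/books/v1/volumes",
            params={
                "q": f'inauthor:"{author}"',
                "orderBy": "newest",
                "maxResults": max_results,
            },
            timeout=10,
        )
        resp.raise_for_status()
        for item in resp.json().get("items", []):
            info = item.get("volumeInfo", {})
            yield info.get("title"), info.get("publishedDate")

    for title, published in latest_books("Ursula K. Le Guin"):
        print(published, title)

Run on a schedule and diffed against the previous result, a query like this is essentially what a new-book feed boils down to.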

If you aren’t familiar with RSS, see About Feeds for more help.

* musical reference

§3 Best article title in librarianship for 2021

Ain’t no party like a LibGuides Party / ’cause a LibGuides Party is mandatory **

** musical reference

§4 This is the time and this is the record of the time ***

ScholComm librarians ask: Do we want a Version of Record or Record of Versions?

*** musical reference

§5 The 5000 Fingers of Dr. T ****

A Hand With Many Fingers is a first-person investigative thriller. While searching through a dusty CIA archive you uncover a real Cold War conspiracy. Every document you find has new leads to research. But the archive might not be as empty as you think…  

Slowly unravel a thrilling historical conspiracy

Discover new clues through careful archival research

Assemble your theories using corkboard and twine

Experience a story of creeping paranoia

**** musical reference / movie reference

Hat tip: Errant Signal’s Bad Bosses, Beautiful Vistas, and Baffling Mysteries: Blips Episode 8

§6 Citational politics bibliography

I’m not entirely sure how this bibliography on the politics of citation and references crossed my twitter stream, but I immediately bookmarked it. The bibliography is from a working group of CLEAR from Memorial University:

Civic Laboratory for Environmental Action Research (CLEAR) is an interdisciplinary natural and social science lab space dedicated to good land relations directed by Dr. Max Liboiron at Memorial University, Canada. Equal parts research space, methods incubator, and social collective, CLEAR’s ways of doing things, from environmental monitoring of plastic pollution to how we run lab meetings, are based on values of humility, accountability, and anti-colonial research relations. We specialize in community-based and citizen science monitoring of plastic pollution, particularly in wild food webs, and the creation and use of anti-colonial research methodologies.

To change science and research from its colonial, macho, and elitist norms, CLEAR works at the level of protocol. Rather than lead with good intentions, we work to ensure that every step of research and every moment of laboratory life exemplifies our values and commitments. To see more of how we do this, see the CLEAR Lab Book, our methodologies, and media coverage of the lab.

About CLEAR

I have no musical reference for this.

What is Metadata Assessment? / Digital Library Federation

This blog post was authored by Hannah Tarver and Steven Gentry, members of the Digital Library Assessment Interest Group’s Metadata Assessment Working Group (DLF AIG MWG). It is intended to provide a summary overview of metadata assessment in digital libraries, including its importance and benefits. 

If you are interested in metadata evaluation, or want to learn more about the group’s work, please consider attending one of our meetings!


Metadata assessment involves evaluating metadata to enhance its usefulness for both internal and external users. There are three main categories of metadata:

[1] Administrative metadata provides information about the management or preservation of digital objects, such as when an object was archived, what access restrictions are placed on an item, a unique/permanent identifier for an object, when files were last migrated/copied/checked, etc.

[2] Descriptive metadata is the human-readable text describing the creation and content of an item, such as who made it, what it is about, and when it was made/published. This information is displayed in a publicly-accessible and searchable user interface (while administrative and structural metadata may be less visible, or only internally accessible).

[3] Structural metadata names all of the files associated with an item (e.g., a single PDF or multiple individual image files, metadata files, OCR files, etc.) and describes the relationship among them. For example, if there are images for individual text pages, or multiple views of a physical object, the structural metadata would express how many images there are and the order in which they should be displayed.

Specific pieces of information may be stored in different types of metadata depending on a local system (e.g., some access information may be in public-facing descriptive metadata, or some preservation information may be incorporated into structural metadata). An organization could evaluate various characteristics for any or all of these metadata types to ensure that a digital library is functioning properly; however, most researchers and practitioners focus on descriptive metadata because this information determines whether users can find the materials that fit their personal or scholarly interests.
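
As a rough illustration only (the field names below are invented for the example and not drawn from any particular schema or system), a single digitized item’s record might carry all three kinds of information:

    # Hypothetical record for one digitized pamphlet, grouping the three metadata types.
    record = {
        "administrative": {
            "archived": "2019-06-14",
            "access": "public",
            "identifier": "ark:/99999/example123",   # placeholder permanent identifier
            "last_fixity_check": "2021-01-05",
        },
        "descriptive": {
            "title": "Annual Report of the Example Society",
            "creator": "Example Society",
            "date": "1912",
            "subjects": ["Societies", "Annual reports"],
        },
        "structural": {
            # page images in display order, plus the OCR text that accompanies them
            "files": ["page01.jpg", "page02.jpg", "page03.jpg", "ocr.txt"],
            "display_sequence": ["page01.jpg", "page02.jpg", "page03.jpg"],
        },
    }

In most systems the descriptive block is what searchers see, while the administrative and structural blocks stay behind the scenes.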

Metadata Errors

Metadata assessment is necessary because errors and/or inconsistencies inevitably creep into records. Collections are generally built over time, which means that many different people are involved in the lifecycle of metadata; standards or guidelines may change; and information may be moved or combined. There are a number of different quality aspects that organizations may want to evaluate within metadata values; here are some examples:

Accuracy

  • Typos.  Spelling or formatting errors may happen by accident or due to a misunderstanding about formatting rules. Even when using controlled lists, values may be copied or selected incorrectly.
  • Mis-identification.  Metadata creators may incorrectly name people or places represented or described in an item. This is especially problematic for images.
  • Wrong records.  Depending on how items and their metadata records are matched in a particular system, a record describing one item may be applied to an entirely different item (see Figure 1).
Figure 1. A record imported from a database; the information, e.g., title (in orange box), does not match the document.

Completeness 

  • Missing information.  Whether due to a lack of resources or simply by accident, information may be left out of metadata records. This could include required data that affects system functionality, or optional information that could help users find an item.
  • Unknown information.  Especially for cultural heritage objects—such as historic photos and documents—information that would benefit researchers (e.g., detailing the creation of an item or important people/locations) may be absent (see Figure 2).
Figure 2. An image that does not have creator/creation date information in the record (noted by orange box).

Conformance to expectations

  • Inappropriate terminology.  Sometimes, the language used in records does not align with the terms that a primary user-group might prefer (e.g., a subject value for “kittens” instead of “felines” in a science database record). This may be due to an inconsistent use of words (e.g., “cars” vs. “automobiles”) or an editor’s lack of knowledge about the most appropriate or precise descriptors (e.g., “flower brooch” for corsage, or “caret-shaped roof” for gabled roofs).
  • Outdated language.  Collections that describe certain groups of people—such as historically underrepresented or marginalized groups—may use language that is inappropriate and harmful. This is particularly relevant for records that rely on slow-changing, commonly shared vocabularies such as Library of Congress Subject Headings (see Further Reading, below).

Consistency

  • Formatting differences.  If exact-string matching is important, or if fields use controlled vocabularies, any differences in formatting (e.g., “FBI” vs. “F.B.I.”) might affect searching or public interface search filters.
  • Name variations.  The same name may be entered differently in different records depending on how they are written on items (e.g., “Aunt Betty” vs. “Beatrice”), name changes (e.g., maiden names or organizational mergers), information available over time, or inconsistent use of a name authority (see Figure 3). 

 

Figure 3. Examples of inconsistent values: name variations in creator entries (left) and language entries (right).

Timeliness

  • Legacy or Harvested Data.  If formatting rules have changed over time, or if information has been migrated or imported from another system, inconsistent values or artifacts may be present in the records. These include MARC subdivisions in name/subject values (see Figure 4), technical mark-up from databases (e.g., “. pi. /sup +/, p”), or broken character encodings (e.g., “’” instead of an apostrophe).
Figure 4. Example subject entries, including one that was copied with a MARC subdivision (|x).

Benefits

There are a number of benefits to users and organizations when metadata is assessed and improved over time. For example: 

Users:

  • Records with complete, accurate, and consistent metadata are more findable in online searches.
  • Similarly-described materials allow relevant items to be more easily collocated.
  • Good metadata may allow public interfaces to enhance the user experience (e.g., via filtering search results).

Organizations maintaining digital collections:

  • Error-free metadata is easier to migrate from one system to another or to integrate with other resources (e.g., a discovery layer).
  • Complete records make it easier for staff to find and promote/advertise special items when opportunities arise.
  • Well-formed metadata records are more easily shared with other organizations (e.g., the Digital Public Library of America), thereby making such materials more widely accessible.
  • Good records reflect well on the organization, as users might be put off by spelling, grammar, or related issues.

Methods/Resources

Although metadata assessment is tremendously beneficial, it often requires organizational support, such as a large or ongoing commitment of people and other resources. First and foremost, knowledgeable personnel are crucial: trained professionals contribute the metadata expertise (e.g., the ability to determine which values need to be reviewed or changed) and the subject-area specialties necessary for successful assessment efforts (particularly for larger projects). Additionally, assessing and then remediating or enhancing collections requires significant personnel time to evaluate and edit metadata.

Another major component of metadata assessment activities is tooling, which may include spreadsheet-based resources (e.g., OpenRefine) or specialized scripts written in a variety of programming languages. An important note to bear in mind is that even though tools can expedite metadata assessment efforts, they may require technical experience and training to be used effectively.
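
As an example of the kind of specialized script mentioned above, the sketch below automates a few of the checks described earlier (completeness, consistency, and legacy artifacts). It is only a sketch: the required-field list, field names, and artifact patterns are assumptions for illustration, not a standard.

    import re
    from collections import Counter

    REQUIRED_FIELDS = ["title", "creator", "date", "language"]   # assumed local requirements
    MARC_SUBDIVISION = re.compile(r"\|[a-z0-9]")                 # e.g. "|x" left over from MARC
    MOJIBAKE = re.compile(r"â€™|â€œ|â€")                         # common broken encodings

    def assess(records):
        """Report missing required fields, leftover artifacts, and creator-value variants."""
        report = {"missing": Counter(), "artifacts": [], "creator_variants": Counter()}
        for rec in records:
            for field in REQUIRED_FIELDS:
                if not rec.get(field):
                    report["missing"][field] += 1
            for field, value in rec.items():
                if isinstance(value, str) and (MARC_SUBDIVISION.search(value) or MOJIBAKE.search(value)):
                    report["artifacts"].append((rec.get("id"), field, value))
            if rec.get("creator"):
                report["creator_variants"][rec["creator"].strip()] += 1
        return report

    records = [
        {"id": 1, "title": "Example", "creator": "Smith, Jane", "date": "1950", "language": "eng"},
        {"id": 2, "title": "Example 2", "creator": "Smith, Jane, 1901-1980", "date": "", "language": "eng"},
        {"id": 3, "title": "Report", "creator": "F.B.I.", "date": "1965â€™", "language": ""},
    ]
    print(assess(records))

Counts like these do not decide what the correct value is (that still takes the trained professionals and subject expertise described above), but they do narrow review down to the records most likely to need attention.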

Aside from using tools for broad analysis, one popular assessment method is manually evaluating records (i.e., looking at an individual record and reviewing all of the values). Employing this kind of workflow would appeal to professionals for a few reasons:

  • Manual metadata assessment requires the least amount of technological training. 
  • Particularly for smaller collections, checking all values in a record may allow for fewer edits & revisions (i.e., records are not “touched” as often).
  • Some aspects of metadata quality (e.g., accuracy) can only be determined through manual evaluation. 

However, there are challenges to consider when assessing metadata. Effective manual evaluation, for example, can be hard to scale as records increase, and may not provide collection-level insights. Additionally, as collections grow in size, comprehensive assessment becomes more difficult and requires increased resources to review and correct errors or outdated values. Finally, it is important to recognize that improving records is an ongoing and often iterative process. Overall, metadata assessment is a resource-balancing exercise.

Further Reading

This blog post was informed by numerous resources and practical experience. The following resources provide additional information if you want to learn more about various aspects of metadata assessment:

Papers/Publications

Metadata Assessment Working Group Resources

Example images come from the Digital Collections at the University of North Texas (UNT) — https://digital2.library.unt.edu/search — and from the Digital Public Library of America (DPLA) — https://dp.la/

The post What is Metadata Assessment? appeared first on DLF.

What do you miss least about pre-lockdown life? / Jez Cope

@JanetHughes on Twitter: What do you miss the least from pre-lockdown life?

I absolutely do not miss wandering around the office looking for a meeting room for a confidential call or if I hadn’t managed to book a room in advance. Let’s never return to that joyless frustration, hey?

10:27 AM · Feb 3, 2021

After seeing Terence Eden taking Janet Hughes’ tweet from earlier this month as a writing prompt, I thought I might do the same.

The first thing that leaps to my mind is commuting. At various points in my life I’ve spent between one and three hours a day travelling to and from work and I’ve never more than tolerated it at best. It steals time from your day, and societal norms dictate that it’s your leisure & self-care time that must be sacrificed. Longer commutes allow more time to get into a book or podcast, especially if not driving, but I’d rather have that time at home than spend it trying to be comfortable in a train seat designed for some mythical average man shaped nothing like me!

The other thing I don’t miss is the colds and flu! Before the pandemic, British culture encouraged working even when ill, which meant constantly coming into contact with people carrying low-grade viruses. I’m not immunocompromised but some allergies and residue of being asthmatic as a child meant that I would get sick 2-3 times a year. A pleasant side-effect of the COVID precautions we’re all taking is that I haven’t been sick for over 12 months now, which is amazing!

Finally, I don’t miss having so little control over my environment. One of the things that working from home has made clear is that there are certain unavoidable aspects of working in my shared office that cause me sensory stress, and that are completely unrelated to my work. Working (or trying to work) next to a noisy automatic scanner; trying to find a light level that works for 6 different people doing different tasks; lacking somewhere quiet and still to eat lunch and recover from a morning of meetings or the constant vaguely-distracting bustle of a large shared office. It all takes energy. Although it’s partly been replaced by the new stress of living through a global pandemic, that old stress was a constant drain on my productivity and mood that had been growing throughout my career as I moved (ironically, given the common assumption that seniority leads to more privacy) into larger and larger open plan offices.

Recovering Foucault / Ed Summers

I’ve been enjoying reading David Macey’s biography of Michel Foucault, that was republished in 2019 by Verso. Macey himself is an interesting figure, both a scholar and an activist who took leave from academia to do translation work and to write this biography and others of Lacan and Fanon.

One thing that struck me as I’m nearing the end of Macey’s book is the relationship between Foucault and archives. I think Foucault has become emblematic of a certain brand of literary analysis of “the archive” that is far removed from the research literature of archival studies, while using “the archive” as a metaphor (Caswell, 2016). I’ve spent much of my life working in libraries and digital preservation, and now studying and teaching about them from the perspective of practice, so I am very sympathetic to this critique. It is perhaps ironic that the disconnect between these two bodies of research is a difference in discourse which Foucault himself brought attention to.

At any rate, the thing that has struck me while reading this biography is how much time Foucault himself spent working in libraries and archives. Here’s Foucault in his own words talking about his thesis:

In Histoire de la folie à l’âge classique I wished to determine what could be known about mental illness in a given epoch … An object took shape for me: the knowledge invested in complex systems of institutions. And a method became imperative: rather than perusing … only the library of scientific books, it was necessary to consult a body of archives comprising decrees, rules, hospital and prison registers, and acts of jurisprudence. It was in the Arsenal or the Archives Nationales that I undertook the analysis of a knowledge whose visible body is neither scientific nor theoretical discourse, nor literature, but a daily and regulated practice. (Macey, 2019, p. 94)

Foucault didn’t simply use archives for his research: understanding the processes and practices of archives was integral to his method. Even though the theory and practice of libraries and archives are quite different, given their different functions and materials, they are often lumped together as a convenience in the same buildings. Macey blurs them a little bit, in sections like this where he talks about how important libraries were to Foucault’s work:

Foucault required access to Paris for a variety of reasons, not least because he was also teaching part-time at ENS. The putative thesis he had begun at the Fondation Thiers – and which he now described to Polin as being on the philosophy of psychology – meant that he had to work at the Bibliothèque Nationale and he had already become one of its habitues. For the next thirty years, Henri Labrouste’s great building in the rue de Richelieu, with its elegant pillars and arches of cast iron, would be his primary place of work. His favourite seat was in the hemicycle, the small, raised section directly opposite the entrance, sheltered from the main reading room, where a central aisle separates rows of long tables subdivided into individual reading desks. The hemicycle affords slightly more quiet and privacy. For thirty years, Foucault pursued his research here almost daily, with occasional forays to the manuscript department and to other libraries, and contended with the Byzantine cataloguing system: two incomplete and dated printed catalogues supplemented by cabinets containing countless index cards, many of them inscribed with copperplate handwriting. Libraries were to become Foucault’s natural habitat: ‘those greenish institutions where books accumulate and where there grows the dense vegetation of their knowledge’.

There’s a metaphor for you: libraries as vegetation :) It kind of reminds me of some recent work looking at decentralized web technologies in terms of mushrooms. But I digress.

I really just wanted to note here that the erasure of archival studies from humanities research about “the archive” shouldn’t really be attributed to Foucault, whose own practice centered the work of libraries and archives. Foucault wasn’t just writing about an abstract archive, he was practically living out of them. As someone who has worked in libraries and archives I can appreciate how power users (pun intended) often knew aspects of the holdings and intricacies of their management better than I did. Archives, when they are working, are always collaborative endeavours, and the important thing is to recognize and attribute the various sides of that collaboration.

PS. Writing this blog post led me to dig up a few things I want to read (Eliassen, 2010; Radford, Radford, & Lingel, 2015).

References

Caswell, M. (2016). The archive is not an archives: On acknowledging the intellectual contributions of archival studies. Reconstruction, 16(1). Retrieved from http://reconstruction.eserver.org/Issues/161/Caswell.shtml

Eliassen, K. (2010). Archives of Michel Foucault. In E. Røssaak (Ed.), The archive in motion: New conceptions of the archive in contemporary thought and new media practices. Novus Press.

Macey, D. (2019). The lives of Michel Foucault: A biography. Verso.

Radford, G. P., Radford, M. L., & Lingel, J. (2015). The library as heterotopia: Michel Foucault and the experience of library space. Journal of Documentation, 71(4), 733–751.

Open Knowledge Justice Programme challenges the use of algorithmic proctoring apps / Open Knowledge Foundation

Today we’re pleased to share more details of the Justice Programme’s new strategic litigation project: challenging the (mis)use of remote proctoring software.

What is remote proctoring?

Proctoring software uses a variety of techniques to ‘watch’ students as they take exams. These exam-invigilating software products claim to detect, and therefore prevent, cheating. Whether or not the software can actually do what it claims, there is concern that it breaches privacy, data and equality rights, and that the negative impacts of its use on students are significant and serious.

Case study: Bar Exams in the UK

In the UK, barristers are lawyers who specialise in courtroom advocacy. The Bar Professional Training Course (BPTC) is run by the professional regulatory body: the Bar Standards Board (BSB).

In August 2020, because of COVID-19, the BPTC exams took place remotely and used a proctoring app from the US company Pearson VUE.

Students taking exams had to allow their room to be scanned and an unknown, unseen exam invigilator to surveil them.  Students had to submit a scan of their face to verify their identity – and were prohibited from leaving their seat for the duration of the exam. That meant up to 5 hours (!) without a toilet break.  Some students had to relieve themselves in bottles and buckets under their desks whilst maintaining ‘eye contact’ with their faceless invigilator. Muslim women were forced to remove their hijabs – and at least one individual had to withdraw from sitting the exam rather than, as they felt it,  compromise their faith. The software had numerous errors in functionality, including suddenly freezing without warning and deleting text. One third of students were unable to complete their exam due to technical errors.

Our response

The student reports, alongside our insight into the potential harms caused by public impact algorithms, prompted us to take action. We were of the opinion that what students were subjected to breached data, privacy and other legal rights as follows:

Data Protection and Privacy Rights

  • Unfair and opaque algorithms. The software used algorithmic decision-making in relation to the facial recognition and/or matching identification of students and behavioural analysis during the exams. The working of these algorithms was unknown and undisclosed.
  • The app’s privacy notices were inadequate. There was insufficient protection of the students’ personal data. For example, students were expressly required to confirm that they had ‘no right to privacy at your current location during the exam testing session’ and to ‘explicitly waive any and all claims asserting a right to individual privacy or other similar claims’. Students were asked to consent to these questions just moments before starting an extremely important exam and without being warned ahead of time.
  • The intrusion involved was disproportionate. The software required all students to carry out a ‘room scan’ (showing the remote proctor around their room). They were then surveilled by an unseen human proctor for the duration of the exam. Many students felt this was unsettling and intrusive.
  • Excessive data collection. The Pearson VUE privacy notice reserved a power of data collection of very broad classes of personal data, including biometric information, internet activity information (gleaned through cookies or otherwise), “inferences about preferences, characteristics, psychological trends, preferences, predispositions, behavior, attitudes, intelligence, abilities, and aptitudes” and protected characteristics.
  • Inadequately limited purposes. Students were required to consent to them disclosing to third parties their personal data “in order to manage day to day business needs”, and to consent to the future use of “images of your IDs for the purpose of further developing, upgrading, and improving our applications and systems”.
  • Unlawful data retention. Pearson VUE’s privacy notice states in relation to data retention that “We will retain your Personal Data for as long as needed to provide our services and for such period of time as instructed by the test sponsor.”
  • Data security risks. Given the sensitivity of the data that was required from students in order to take the exam, high standards of data security are required. Pearson VUE gave no assurances regarding the use of encryption. Instead there was a disclaimer that “Information and Personal Data transmissions to this Site and emails sent to us may not be secure. Given the inherent operation and nature of the Internet, all Internet transmissions are done at the user’s own risk.”
  • Mandatory ‘opt-ins’. The consent sought from students was illusory, as it did not enable students to exert any control over the use of their personal data. If they did not tick all the boxes, they could not participate in the exam. Students could not give a valid consent to the invasion of privacy occasioned by online proctoring when their professional qualification depended on it. They were in effect coerced into surrendering their privacy rights. According to the GDPR, consent must be “freely given and not imposed as a condition of operation”.


Equality Rights

Public bodies in the UK have a legal duty to carefully consider the equalities impacts of the decisions they make. This means that a policy, project or scheme must not unlawfully discriminate against individuals on the basis of a ‘protected characteristic’: their race, religion or belief, disability, sex, gender reassignment, sexual orientation, age, marriage or civil partnership and/or pregnancy and maternity.

In our letter to the BSB, we said that the BSB had breached their equality rights duties by using software that featured facial recognition and/or matching processes, which are widely proven to discriminate against people with dark skin.

The facial recognition process also required female students to remove their religious dress, thereby breaching the protections afforded to people to observe their religion. Female Muslim students were unable to choose to be observed by female proctors, despite the negative cultural significance of unknown male proctors viewing them in their homes.

We also raised the fact that some people with disabilities or women who were pregnant were unfairly and excessively impacted by the absence of toilet breaks for the duration of the assessment. The use of novel and untested software, we said, had the potential to discriminate against older students with fewer IT skills.

The BSB’s Reply

After we wrote to express these concerns, the BSB:

  • stopped the planned use of remote proctoring apps for the next round of bar exams
  • announced an inquiry into their use of remote proctoring apps in August 2020, to produce an independent account of the facts, circumstances and reasons why things went wrong. The BSB invited us to make submissions to this inquiry, which we have done. You can read them here.

Next steps

Here at the Open Knowledge Justice Programme, we’re delighted that the BSB has paused the use of remote proctoring and keenly await the publication of the findings of the independent inquiry.

However, we’ve been recently concerned to discover that the BSB has delegated decision-making authority for the use of remote proctoring apps to individual educational providers – e.g. universities and law schools – and that many of these providers are scheduling exams using remote proctoring apps.

We hope that the independent inquiry’s findings will conclusively determine that this must not continue.


Sign up to our mailing list or follow the Open Knowledge Justice Programme on Twitter to receive updates.


SolrWayback 4.0 release! What’s it all about? Part 2 / State Library of Denmark


In this blog post I will go into the more technical details of SolrWayback and the new 4.0 release. The whole frontend GUI was rewritten from scratch to meet the expectations of 2020 web applications, and many new features were implemented in the backend. I recommend reading the frontend blog post first; it has beautiful animated gifs demonstrating most of the features in SolrWayback.

Live demo of SolrWayback

You can access a live demo of SolrWayback here.
Thanks to National Széchényi Library of Hungary for providing the SolrWayback demo site!

Back in 2018…

The open source SolrWayback project was created in 2018 as an alternative to the web archive frontend applications that existed at the time. At the Royal Danish Library we were already using Blacklight as a search frontend. Blacklight is an all-purpose Solr frontend application and is very easy to configure and install by defining a few properties such as the Solr server URL, fields and facet fields. But since Blacklight is a generic Solr frontend, it has no special handling of the rich data structure we had in Solr. Also, binary data such as images and videos are not stored in Solr, so integration with the WARC-file repository can enrich the experience and make playback possible, since Solr holds enough information to work as a CDX server as well.

Another interesting frontend was Shine. It was custom tailored for the Solr index created with the WARC-Indexer and had features such as trend analysis (n-gram) visualization of search results over time. The showstopper was that Shine used an older version of the Play framework, and the latest version of the Play framework was not backwards compatible with the branch Shine was built on. Upgrading was far from trivial and would have required a major rewrite of the application. Adding to that, our frontend developers had years of experience with the larger, more widely used pure JavaScript frameworks. Their weapon of choice for SolrWayback was the VUE JS framework. Both SolrWayback 3.0 and the rewritten SolrWayback 4.0 have frontends developed in VUE JS. If you have skills in VUE JS and an interest in SolrWayback, your collaboration would be appreciated 🙂

WARC-Indexer. Where the magic happens!

WARC files are indexed into Solr using the WARC-Indexer. The WARC-Indexer reads every WARC record, extracts all kinds of information and splits it into up to 60 different fields. It uses Tika to parse the many different MIME types that can be encountered in WARC files. Tika extracts the text from HTML, PDF, Excel and Word documents etc. It also extracts metadata from binary documents if present. The metadata can include created/modified time, title, description, author etc. For images, the metadata can also include width/height or EXIF information such as latitude/longitude. The binary data themselves are not stored in Solr, but for every record in the WARC file there is a record in Solr. This also includes empty records such as HTTP 302 (MOVED) responses, with information about the new URL.
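
To make the flow concrete, here is a minimal sketch of the same record-to-document idea in Python using the warcio library. It is not the actual WARC-Indexer (which is a Java tool built on Tika), and the Solr core name and field names are illustrative assumptions rather than the real ~60-field schema.

# Minimal illustration of indexing WARC records into Solr, one document per record.
# NOT the real WARC-Indexer; core and field names below are assumptions.
import requests
from warcio.archiveiterator import ArchiveIterator

SOLR_UPDATE = "http://localhost:8983/solr/netarchive/update"  # assumed core name

def index_warc(path):
    docs = []
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":      # skip request/metadata records in this sketch
                continue
            headers = record.rec_headers
            docs.append({
                "id": headers.get_header("WARC-Record-ID"),
                "url": headers.get_header("WARC-Target-URI"),
                "crawl_date": headers.get_header("WARC-Date"),
                "content_type": record.http_headers.get_header("Content-Type", ""),
                "source_file": path,
                # The real indexer extracts full text and metadata with Tika;
                # here we just keep the first few KB of payload as a placeholder.
                "content": record.content_stream().read(4096).decode("utf-8", "replace"),
            })
    # Send the batch of documents to Solr and commit.
    requests.post(SOLR_UPDATE, json=docs, params={"commit": "true"}).raise_for_status()

index_warc("example.warc.gz")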

WARC-Indexer. Paying the price up front…

Indexing a large amount of WARC files requires massive amounts of CPU, but it is easily parallelized since the WARC-Indexer takes a single WARC file as input. To give an idea of the requirements: indexing 700 TB of WARC files (5.5M files) took 3 months using 280 CPUs.
Once the existing collection is indexed, it is easier to keep up with its incremental growth. So this is the drawback of using SolrWayback on large collections: the WARC files have to be indexed first.

Solr provides multiple ways of aggregating data, moving common net archive statistics tasks from slow batch processing to interactive requests. Based on input from researchers, the feature set is continuously expanding with aggregation, visualization and extraction of data.
Due to the impressive performance of Solr, a query is often answered in less than 2 seconds in a collection of 32 billion (32*10⁹) documents, and this includes facets. The search results are not limited to the HTML pages where the free text was found; every document that matches the search query is returned. When presenting the results, each document type has a custom display for its MIME type.
HTML results are enriched with thumbnail images from the page, images are shown directly, and audio and video files can be played directly from the results list with an in-browser player, or downloaded if the browser does not support the format.
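
To make the interactive-aggregation claim concrete, here is a small, hedged example of a faceted Solr query over the plain HTTP API. The core name and the facet fields (content_type_norm, crawl_year, domain) are assumptions about the index schema, not a documented part of SolrWayback.

import requests

SOLR = "http://localhost:8983/solr/netarchive/select"   # assumed core name

params = {
    "q": "climate change",                                # free-text query
    "rows": 20,                                           # first page of results
    "facet": "true",
    "facet.field": ["content_type_norm", "crawl_year", "domain"],  # assumed field names
    "facet.limit": 10,
    "wt": "json",
}
resp = requests.get(SOLR, params=params).json()
print(resp["response"]["numFound"], "matching documents")
for field, counts in resp["facet_counts"]["facet_fields"].items():
    # Solr returns each facet as a flat [value, count, value, count, ...] list.
    print(field, list(zip(counts[::2], counts[1::2]))[:5])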

Solr. Reaping the benefits from the Warc-indexer

The SolrWayback Java backend offers a lot more than just sending queries to Solr and returning the results to the frontend. Methods can aggregate data from multiple Solr queries or read WARC entries directly and return the processed data in a simple format to the frontend. Instead of re-parsing the WARC files, which is a very tedious task, the information can be retrieved from Solr, and a task can be done in seconds or minutes instead of weeks.

See the frontend blog post for more feature examples.

Wordcloud
Generating a wordcloud image is done by extracting text from 1000 random HTML pages from the domain and generating a word cloud from the extracted text.
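
A rough, hedged sketch of that flow in Python: pull a random sample of HTML pages for a domain from Solr and feed the extracted text to the wordcloud library. The domain, content_type_norm and content field names, and the random_* dynamic field used for random sorting, are assumptions about the schema.

import random
import requests
from wordcloud import WordCloud   # pip install wordcloud

SOLR = "http://localhost:8983/solr/netarchive/select"        # assumed core name

def wordcloud_for_domain(domain, sample=1000):
    params = {
        "q": f"domain:{domain} AND content_type_norm:html",  # assumed field names
        "rows": sample,
        "fl": "content",                                     # assumed extracted-text field
        "sort": f"random_{random.randint(0, 10**6)} asc",    # random sample (needs a random_* dynamic field)
        "wt": "json",
    }
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]
    text = " ".join(str(doc.get("content", "")) for doc in docs)
    WordCloud(width=800, height=400).generate(text).to_file(f"{domain}_wordcloud.png")

wordcloud_for_domain("example.org")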

Interactive linkgraph
By extracting the domains that link to a given domain (A) and also extracting the outgoing links from that domain (A), you can build a link graph. Repeating this for the newly found domains gives you a two-level local link graph for domain A. Even though this can require hundreds of separate Solr queries, it is still done in seconds on a large corpus. Clicking a domain will highlight its neighbours in the graph.
(Try the demo: interactive linkgraph)
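
Here is a hedged sketch of how such a two-level graph could be assembled with facet queries; the domain and links_domains field names are assumptions about the index schema, and the real SolrWayback implementation is in Java and more refined.

import requests

SOLR = "http://localhost:8983/solr/netarchive/select"   # assumed core name

def facet_values(query, facet_field, limit=25):
    """Return the top facet values for facet_field among documents matching query."""
    params = {"q": query, "rows": 0, "facet": "true", "facet.field": facet_field,
              "facet.limit": limit, "wt": "json"}
    counts = requests.get(SOLR, params=params).json()["facet_counts"]["facet_fields"][facet_field]
    return counts[::2]                                   # flat [value, count, ...] list -> values

def two_level_graph(domain_a):
    outgoing = facet_values(f"domain:{domain_a}", "links_domains")    # domains A links to
    incoming = facet_values(f"links_domains:{domain_a}", "domain")    # domains linking to A
    edges = {(domain_a, d) for d in outgoing} | {(d, domain_a) for d in incoming}
    # Second level: repeat the outgoing lookup for every newly found domain.
    for d in set(outgoing) | set(incoming):
        edges |= {(d, m) for m in facet_values(f"domain:{d}", "links_domains")}
    return edges

print(len(two_level_graph("example.org")), "edges")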

Large scale linkgraph
Extraction of massive linkgraphs with up to 500K domains can be done in hours. Link graph example from the Danish NetArchive.
The exported link-graph data was rendered in Gephi and made zoomable and interactive using Graph presenter.
The link graphs can be exported quickly because all links (a href) for each HTML record are extracted and indexed as part of the corresponding Solr document.

Image search
Freetext search can be used to find HTML documents. The HTML documents in Solr are already enriched with the image links on each page, so there is no need to parse the HTML again. Instead of showing the HTML pages, SolrWayback collects all the images from the matching pages and shows them in a Google-like image search result. Under the assumption that the text on an HTML page relates to its images, you can find images that match the query. If you search for "cats" in HTML pages, the results will most likely show pictures of cats. Those pictures could not have been found by searching the image documents alone if no metadata (or image name) contains "cats".

CSV Stream export
You can export result sets with millions of documents to a CSV file. Instead of exporting all 60 possible Solr fields for each result, you can pick exactly which fields to export. This CSV export has already been used by several researchers at the Royal Danish Library and gives them the opportunity to use other tools, such as RStudio, to perform analysis on the data. The National Széchényi Library demo site has disabled CSV export in the SolrWayback configuration, so it cannot be tested live.
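
A minimal sketch of how such a streaming export can be done against Solr using deep paging (cursorMark) and a hand-picked field list; the core name and fields are examples, and the sort assumes id is the unique key.

import csv
import requests

SOLR = "http://localhost:8983/solr/netarchive/select"          # assumed core name
FIELDS = ["url", "crawl_date", "content_type", "domain"]       # hand-picked example fields

def export_csv(query, out_path, batch=1000):
    with open(out_path, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        cursor = "*"
        while True:
            params = {"q": query, "rows": batch, "fl": ",".join(FIELDS),
                      "sort": "id asc",                        # cursorMark needs a sort on the unique key
                      "cursorMark": cursor, "wt": "json"}
            data = requests.get(SOLR, params=params).json()
            for doc in data["response"]["docs"]:
                writer.writerow({f: doc.get(f, "") for f in FIELDS})
            if data["nextCursorMark"] == cursor:               # cursor did not advance: export finished
                break
            cursor = data["nextCursorMark"]

export_csv("domain:example.org", "export.csv")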

WARC corpus extraction
Besides CSV export, you can also export a result set to a WARC file. The export reads the WARC entry for each document in the result set, copies the WARC header + HTTP header + payload, and creates a new WARC file with all results combined.
Extracting a sub-corpus this easily has already proven extremely useful for researchers. Examples include extracting a domain for a given date range, or a query restricted to a list of defined domains. This export is a 1-1 mapping from the results in Solr to the entries in the WARC files.
SolrWayback can also perform an extended WARC export which includes all resources (JS/CSS/images) for every HTML page in the export.
The extended export ensures that playback will also work for the sub-corpus. Since the exported WARC file can become very large, you can use a WARC splitter tool or simply split the export into smaller batches by adding crawl year/month to the query, etc. The National Széchényi Library demo site has disabled WARC export in the SolrWayback configuration, so it cannot be tested live.
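
Conceptually, the export works from the (source file, offset) pairs stored in the index for every record. Here is a hedged Python sketch of that copy step using the warcio library (SolrWayback itself does this in Java):

from warcio.archiveiterator import ArchiveIterator
from warcio.warcwriter import WARCWriter

def copy_records(entries, out_path):
    """entries: iterable of (warc_path, offset) pairs, e.g. taken from a Solr result set."""
    with open(out_path, "wb") as out:
        writer = WARCWriter(out, gzip=True)
        for warc_path, offset in entries:
            with open(warc_path, "rb") as src:
                src.seek(offset)                              # jump straight to the record
                record = next(iter(ArchiveIterator(src)))     # parse just that record
                writer.write_record(record)                   # WARC header + HTTP header + payload

copy_records([("example.warc.gz", 0)], "subcorpus.warc.gz")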

SolrWayback playback engine

SolrWayback has a built-in playback engine, but using it is optional: SolrWayback can be configured to use any other playback engine that uses the same playback URL API, "/server/<date>/<url>", such as PyWb. It has been a common misunderstanding that SolrWayback forces you to use the SolrWayback playback engine. The demo at the National Széchényi Library has PyWb configured as an alternative playback engine. Clicking the icon next to the title of an HTML result will open playback in PyWb instead of SolrWayback.

Playback quality

The playback quality of SolrWayback is an improvement over OpenWayback for the Danish Netarchive, but not as good as PyWb. The technique used is URL rewriting, just as in PyWb, replacing URLs according to the HTML specification for HTML pages and CSS files. However, SolrWayback does not yet replace links generated from JavaScript, though this is likely to be improved in a coming major release. It has not been a priority since the content for the Danish NetArchive is harvested with Heritrix, and dynamic JavaScript resources are not harvested by Heritrix.

This is only a problem for absolute links, i.e. those starting with http://domain/…, since all relative URL paths are resolved automatically by the playback URL API. Relative links that refer to the root of the playback server are resolved by the SolrWaybackRootProxy application, which has this sole purpose: it calculates the correct URL from the HTTP referer header and redirects back into SolrWayback. Absolute URLs from JavaScript (or dynamically generated JavaScript) can result in live leaks. This can be avoided with an HTTP proxy or by adding a whitelist of URLs to the browser. In the Danish Citrix production environment, live leaks are blocked by sandboxing the environment. Improving playback is in the pipeline.
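
As a toy illustration of the URL-rewriting idea for absolute links in HTML (the real rewriter resolves each URL against the index and handles CSS as well; this sketch only prefixes absolute href/src links with the playback API path):

import re

PLAYBACK = "/server/{date}/{url}"    # the playback URL API described above

def rewrite_absolute_links(html, crawl_date):
    """Prefix absolute http(s) URLs in href/src attributes with the playback path."""
    def repl(match):
        attr, quote, url = match.groups()
        return f"{attr}={quote}{PLAYBACK.format(date=crawl_date, url=url)}{quote}"
    return re.sub(r'(href|src)=(["\'])(https?://[^"\']+)\2', repl, html, flags=re.IGNORECASE)

print(rewrite_absolute_links('<a href="http://example.org/page">link</a>', "20200101000000"))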

The SolrWayback playback has been designed to be as authentic as possible, without showing a fixed toolbar at the top of the browser. Only a small overlay is included in the top left corner, and it can be removed with a click so that you see the page as it was harvested. From the playback overlay you can open the calendar and an overview of the resources included by the HTML page, along with their timestamps compared to the main HTML page, similar to the feature provided by the archive.org playback engine.

The URL replacement is done up front and fully resolved to an exact WARC file and offset. An HTML page can have hundreds of different resources, and each of them requires a URL lookup for the version nearest in time to the crawl time of the HTML page. All resource lookups for a single HTML page are batched into a single Solr query, which improves both performance and scalability.
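
A hedged sketch of that batching idea: one Solr query covering all of a page's resources, collapsed to one document per URL, keeping the version closest in time to the page. The url_norm, crawl_date and source file fields are assumptions about the schema, and the real selection logic in SolrWayback is more involved.

import requests

SOLR = "http://localhost:8983/solr/netarchive/select"   # assumed core name

def lookup_resources(resource_urls, page_crawl_time_ms):
    """One batched lookup for all resources on a page; nearest-in-time version per URL."""
    query = "url_norm:(" + " OR ".join(f'"{u}"' for u in resource_urls) + ")"   # assumed field
    params = {
        "q": query,
        "rows": len(resource_urls),
        "fl": "url_norm,crawl_date,source_file_path,source_file_offset",        # assumed fields
        # Collapse to one document per URL, keeping the one closest to the page's crawl time.
        "fq": f"{{!collapse field=url_norm min=abs(sub(ms(crawl_date),{page_crawl_time_ms}))}}",
        "wt": "json",
    }
    return requests.get(SOLR, params=params).json()["response"]["docs"]

print(lookup_resources(["http://example.org/style.css"], 1577836800000))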

SolrWayback and Scalability

For scalability, it all comes down to the scalability of SolrCloud, which has proven to be one of the leading search technologies and is still improving rapidly with each new version. Storing the indexes on SSDs gives a substantial performance boost as well, but can be costly. The Danish Netarchive has 126 Solr servers running in a SolrCloud setup.

One of the servers is the master and the only one that receives requests. The Solr master has an empty index but is responsible for gathering the data from the other Solr services; if the master server also had an index, there would be an overhead. 112 of the Solr servers have a 900 GB index with an average of ~300M documents each, while the last 13 servers currently have empty indexes, which makes expanding the collection easy without any configuration changes. Even with 32 billion documents, query response times are below 2 seconds. The result query and the facet query are separate, simultaneous calls; the advantage is that the results can be rendered very quickly while the facets finish loading later.

For very large results in the billions, the facets can take 10 seconds or more, but such queries are not realistic and the user should be more precise in limiting the results up front.

Building new shards
New shards (collection pieces) are built outside the production environment and moved onto one of the empty Solr servers when the index reaches ~900 GB. The index is optimized before it is moved, since no more data will be written to it that would undo the optimization. This also gives a small improvement in query times. If the indexing were done directly into the production index, it would also impact response times. The separation of the production and building environments has spared us from dealing with complex problems we would otherwise have faced. It also makes speeding up the index building trivial by assigning more machines/CPUs to the task and creating multiple indexes at once.
You cannot keep indexing into the same shard forever, as this would cause other problems. We found the sweet spot at the time to be an index size of ~900 GB, which fits on the 932 GB SSDs that were available to us when the servers were built. A larger index also requires more memory on each Solr server; we have allocated 8 GB to each. For our large-scale web archive we keep track of which WARC files have been indexed using Archon and Arctika.

Archon is the central server with a database; it keeps track of all WARC files, whether they have been indexed, and which shard number they went into.

Arctika is a small workflow application that starts WARC-Indexer jobs: it queries Archon for the next WARC file to process and reports back when the file has been completed.

SolrWayback – framework

SolrWayback is a single Java web application containing both the VUE frontend and the Java backend. The backend has two REST service interfaces written with JAX-RS. One is responsible for the services called by the VUE frontend, and the other handles the playback logic.

SolrWayback software bundle

SolrWayback comes with an out-of-the-box software bundle release. The release contains a Tomcat server with SolrWayback, a Solr server and a workflow for indexing, all preconfigured. All that is required is unzipping the zip file and copying the two property files to your home directory. Add some WARC files yourself and start the indexing job.

Try: SolrWayback Software bundle

Principles For The Decentralized Web / David Rosenthal

A week ago yesterday the Internet Archive launched both a portal for the Decentralized Web (DWeb) at https://getdweb.net/, designed by a team led by Iryna Nezhynska of Jolocom, and a set of principles for the Decentralized Web, developed with much community input by a team led by Mai Ishikawa Sutton and John Ryan.

Nezhynska led a tour of the new website and the thinking behind its design, including its accessibility features. It looks very polished; how well it functions as a hub for the DWeb community only time will tell.

Brewster Kahle introduced the meeting by stressing that, as I have written many times, if the DWeb is successful it will be attacked by those who have profited massively from the centralized Web. The community needs to prepare for technical, financial and PR attacks.

Below the fold I look at how the principles might defend against some of these attacks.

The fundamental goal of the DWeb is to reduce the dominance of the giant centralized platforms, replacing it with large numbers of interoperable smaller services each implementing its own community's policies. Inevitably, as with cryptocurrencies and social networks such as Parler, Gab, 4chan and 8chan, some of the services will be used for activities generally regarded as malign. These will present an irresistible target for PR attacks intended to destroy the DWeb brand.

The principles are a start on building a defense for the DWeb brand against the attacks. The idea is to deflect the criticism that the DWeb technologies inevitably lead to bad outcomes by allowing the community to point to statements disclaiming them. How effective this will be is open to doubt, but it may well be the best available defense. The principles fall into five groups:
  1. Technology for Human Agency
  2. Distributed Benefits
  3. Mutual Respect
  4. Humanity
  5. Ecological Awareness
For quite a few of the principles I don't have anything useful to add beyond saying that they are aspirational but perhaps idealistic. As a defensive PR move this is reasonable. I have cherry-picked a few on which to comment.

Technology for Human Agency

  • We urge coexistence and interoperability, and discourage walled gardens. Interoperability based on standard protocols is an essential pre-condition for decentralized networks with diverse implementations.

    The canonical response of dominant players to standards-based interoperability strategies is Microsoft's embrace, extend and extinguish. While it is true that Microsoft gradually evolved to a more cooperative approach, their example stands ready to inspire others. The community could learn from the "connectathons" that Sun Microsystems sponsored to ensure interoperability between disparate implementations of NFS. Thinking about structures that encourage "embrace" but discourage "extend" by supporting broad participation in protocol development would be useful.

  • We value open source code as a fundamental building block of an open and inclusive Web. Well, yes, ideally this would be the case.

    Once dominant content owners forced Encrypted Media Extensions on the Web, Google prevented users from disabling the closed-source plugin that implements them. There was a small outcry, but the vast majority of users couldn't care less; "I want my MTV". So, in practice, if you want to reach a wide audience, you can't be religious about open source (see Linux vs. OpenBSD).

  • We aim for peer-to-peer relationships, rather than hierarchical control and power imbalance. Peer-to-peer system architecture can provide resilience and stability that centralized systems, which necessarily have single points of failure, cannot. This is the reason the LOCKSS digital preservation system has a peer-to-peer architecture.

    However, W. Brian Arthur's 1994 Increasing Returns and Path Dependence in the Economy and the subsequent histories of the Web and social media, not to mention peer-to-peer systems, show how difficult it is to push back against the winner-take-all economics of technology. Systems with increasing returns to scale, such as the Web, amplify small "power imbalances" rapidly. Pushing back is particularly difficult in the case of peer-to-peer systems which, despite their many advantages, are inherently slower and less efficient, and thus more expensive, than equivalent centralized systems. Thus even initially small amounts of centralization in the system tend to grow into major "power imbalances".

  • Our technologies must minimize surveillance and manipulation of people’s behavior, and optimize for social benefits and empower individuals to determine how and why their data is used. Three points to support this:
    1. DuckDuckGo has shown that a profitable business model doesn't depend on surveillance.
    2. Much research has shown that targeting advertisements is not effective.
    3. The metrics that Facebook uses to price ads are wildly inflated, part of Facebook's long history of lying to advertisers, politicians and the public.

    But as I wrote in The Data Isn't Yours:
    Most discussions of Internet privacy, for example Jaron Lanier Fixes the Internet, systematically elide the distinction between "my data" and "data about me". In doing so they systematically exaggerate the value of "my data".

    The typical interaction that generates data about an Internet user involves two parties, a client and a server. Both parties know what happened (a link was clicked, a purchase was made, ...). This isn't "my data", it is data shared between the client ("me") and the server. The difference is that the server can aggregate the data from many interactions and, by doing so, create something sufficiently valuable that others will pay for it. The client ("my data") cannot.
    Thus the goal must not be to prevent collection of "data about me", which isn't possible, but to prevent aggregation of "data about me", which is a difficult but perhaps soluble problem.

  • We believe that multiple technical means will be more effective than a single technical solution to achieve ethical and people-centric outcomes. This is an important principle for two reasons. First, a single technical solution, even if it is open source, centralizes control in the implementors of the technology (as we see with Bitcoin), and effectively decreases users' choices (as we increasingly see in browsers). Second, diversity in technology is essential to resisting the likely technical attacks. A technology monoculture will fail abruptly and completely. Avoiding a monoculture has significant costs, which must somehow be paid.

Distributed Benefits

  • We believe that decentralized technologies will be most beneficial to society when the rewards and recognition of their success, monetary or otherwise, are distributed among those who contributed to that success. Ensuring that the benefits of technology development flow to the people actually doing the development is an issue for both closed- and open-source software:
    • In the closed-source world, two developments over the past decades have greatly reduced the proportion of the added value of technology developments flowing to the technologists. First, the VCs funding startups have become much better at transferring risk to the entrepreneurs and employees, and rewards to themselves. Second, the lack of anti-trust enforcement has meant that many fewer startups become large, profitable companies providing many employees with rich rewards. Promising startups are purchased by the big companies before they have many employees to reward. Employees of large companies may be comfortable but few are richly rewarded.
    • How the volunteer contributors to open source projects can be rewarded and motivated has been the topic of a couple of recent posts, Supporting Open Source Software and Open Source Saturation. It is an increasingly difficult and urgent issue.

  • If that is infeasible, proportionate benefit should flow to the community at large. One approach to a compromise between the need for a corporate structure to support significant open source technology with the volunteer ethos is the idea of a Benefit Corporation:
    A benefit corporation's directors and officers operate the business with the same authority and behavior as in a traditional corporation, but are required to consider the impact of their decisions not only on shareholders but also on employees, customers, the community, and local and global environment.
    Although there is no legal requirement for a Benefit Corporation to donate a proportion of its profits to the public good, this requirement can be written into the corporate charter.

  • High concentration of organizational control is antithetical to the decentralized web. A successful peer-to-peer system requires some form of "organizational control" over the development of the protocol and encouragement of interoperability. An example is the way the success of Bram Cohen's 2001 BitTorrent protocol in 2004 spawned BitTorrent Inc., and the company's acquisition in 2018 by TRON.

Ecological Awareness

  • We believe projects should aim to minimize ecological harm and avoid technologies that worsen environmental health. And:

  • We value systems that work towards reducing energy consumption and device resource requirements, while increasing device lifespan by allowing repair, recycling, and recovery. The point about "device resource requirements" is important. I stood out among my engineering colleagues in Sun Microsystems' operating system group because, while most of them had the biggest, fastest machine they could lay their hands on, I always had the oldest, slowest machine the company still supported. Someone had to experience what the least of our customers did. I'm still travelling with old Chromebooks running Linux. Developers rarely live on minimal hardware, a fact that helps them get their jobs done but contributes to rapid hardware obsolescence.

    The cost of coordination in decentralized systems means that they are inherently less efficient than equivalent centralized systems. Their benefits in terms of resilience, personal autonomy, etc. are not free. They don't need to, and should not, be as catastrophically inefficient as proof-of-work cryptocurrencies (Bitcoin alone consumes 124TWh/yr, or 0.5% of the world's electricity). But critics will easily find examples of relative inefficiency.

How we are improving the quality and interoperability of Frictionless Data / Open Knowledge Foundation

As we announced in January, the Open Knowledge Foundation has been awarded funds from the Open Data Institute to improve the quality and interoperability of Frictionless Data. We are halfway through the process of reviewing our documentation and adding new features to Frictionless Data, and wanted to give a status update showing how this work is improving the overall Frictionless experience.

We have already held four feedback sessions and were delighted to meet 16 users from very diverse backgrounds and with different levels of expertise in Frictionless Data, some of whom we knew and some we did not. In spite of the variety of users, it was very interesting to see a widespread consensus on the ways the documentation can be improved. You can have a look at a few of the community PRs here and here.

We are very grateful to all the Frictionless Data users who took part in our sessions – they helped us see all of our guides with fresh eyes. It was very important for us to do this review together with the Frictionless Data community because they (together with those to come) are the ones who will benefit from it, so they are the best placed to flag issues and propose changes.

Every comment is being carefully reviewed at the moment and the new documentation will soon be released.

What are the next steps?

  • We are going to have 8 to 12 more users giving us feedback in the coming month. 
  • We are also adding a FAQ section based on the questions we got from our users in the past.

If you have any feedback and/or improvement suggestions, please let us know on our Discord channel or on Twitter.

More about Frictionless Data

Frictionless Data is a set of specifications for data and metadata interoperability, accompanied by a collection of software libraries that implement these specifications, and a range of best practices for data management. The project is funded by the Sloan Foundation.
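
For readers who have not used the project before, here is a small, hedged example of what the specifications and libraries look like in practice, using the frictionless-py library to describe and validate a CSV file; the file name and its columns are made up for illustration.

# pip install frictionless
from frictionless import describe, validate

# Infer a Frictionless (Table Schema) descriptor for a local CSV file.
resource = describe("capitals.csv")      # hypothetical file
print(resource.schema)                   # inferred field names and types

# Validate the file against the inferred schema and report any problems.
report = validate("capitals.csv")
print(report.valid)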

Upcoming Sprint: Metadata / Islandora


Our very own Metadata Interest Group is running a sprint from March 8th to the 19th, and everyone's invited to participate.  We'll be auditing the default metadata fields that we ship with and comparing them to the excellent metadata profile the MIG has worked so hard to create for us. The goal of the sprint is just to find out where the gaps are so we know the full scope of work needed to implement their recommendations.  If you can navigate the Drupal fields UI (or just want to learn!), contributing is easy and would be super helpful to us. NO PROGRAMMING REQUIRED. And if you don't have an Islandora 8 instance to work on (or are having a hard time installing one), we're making a fresh sandbox just for the sprint. Also, Islandora Foundation staff (a.k.a. me) and representatives from the MIG will be on hand to help out and answer any questions you may have.

You can sign up for the sprint here, and choose a metadata field to audit in this spreadsheet.  As always, commit to as much or as little as you like.  It only takes a couple minutes to check out a field and its settings to see if they line up with the recommendations. If we get enough folks to sign up, then many hands will make light work of this task!

This is yet another sign of the strength of our awesome community.  An interest group is taking it upon themselves to run a sprint to help achieve their goals, and the Islandora Foundation couldn't be happier to help. If you're a member of an interest group and want help engaging the community to make your goals happen, please feel free to reach out on Slack or email me (dlamb@islandora.ca).

We Need to Talk About How We Talk About Disability: A Critical Quasi-systematic Review / In the Library, With the Lead Pipe

By Amelia Gibson, Kristen Bowen, and Dana Hanson

In Brief

This quasi-systematic review uses a critical disability framework to assess definitions of disability, use of critical disability approaches, and hierarchies of credibility in LIS research between 1978 and 2018. We present quantitative and qualitative findings about trends and gaps in the research, and discuss the importance of critical and justice-based frameworks for continued development of a liberatory LIS theory and practice.

Disability and Conditional Citizenship

Much of the mythos of modern American librarianship is grounded in its connections to grand ideas about community, citizenship, rights, common ownership, and, in many cases, stewardship of public goods and information. Despite these ostensibly community-oriented ideals, much of the current practice of librarianship and information science (LIS) in the U.S. is also embedded with and within neoliberal social and political institutions and norms (Bourg, 2014; Ettarh, 2018; Jaeger 2014) that often tie library and information access to selective membership and “productive citizenship” (Fadyl et al., 2020), and marginalize disabled people who are considered “unproductive.” This ongoing internal conflict is reflected in perennial debates about intellectual freedom and social justice (Knox, 2020), as well as commonplace beliefs and practices about belonging and ownership in public space. Under neoliberalism, “tax paying citizens,” “dues paying members,” and “loyal customers” all claim rights and privileges denied to non-contributing “non-citizens” or outsiders. This excludes disabled people, people experiencing homelessness, teens, nonresidents, and anyone else who has not sufficiently “paid their dues.” This framework reflects the medical model of disability, which sites the locus of responsibility for cure or rehabilitation with the individual. This same value system deprioritizes community investments in libraries and other non-revenue generating information infrastructures—framing them as leisure, demanding increasing budget efficiencies and consistent positive return on investment (ROI), while engaging in “benevolent” surveillance and data collection in the name of optimization and revenue generation (Lee & Cifor, 2019; Mathios, 2019). These austerity measures often reinforce the status of disabled people as “sub-citizens” who are often denied access based on cost or inconvenience (Sparke, 2017; Webb & Bywaters, 2018), and who must often trade personal disclosures and dignity for the most basic accommodations. This effect is particularly pronounced for disabled people of color who bear the burden of multiple marginalizations across many social and technical contexts.

The Danger of Single-Axis Definitions of Disability

Without critical self-reflection and examination, and without the willingness to tease out and openly name intersections of ableism, racism, xenophobia, sexism, classism, homophobia, and transphobia masquerading as practice, policy, research and education, we continue to create and reproduce violent (and mediocre) information systems and spaces. This violence might not be explicit or intentional, but intentions are beside the point (and are frequently used to derail discussions about outcomes and the need for systemic changes). Focusing on disability as a single axis of identity effectively codes disability as White (Gibson & Hanson-Baldauf, 2019), and actively ignores the specific information needs and access issues that disabled BIPOC experience, including heightened risk for police violence, lowest rates of employment, and increased discrimination within educational and financial institutions (Goodman, Morris, & Boston, 2019). It also prioritizes White comfort over the needs of disabled BIPOC, and allows White people (disabled and nondisabled) to enjoy the fruits of structural ableism and racism without the guilt of doing explicitly ableist and racist things (Stikkers, 2014).

Disciplinary subdivisions—especially public, academic, and medical librarianship—also betray a one-dimensional understanding of what constitutes “normal” public or academic information and what constitutes special “medical” or “health-related” information. These implicit frameworks for “norms” and “pathologies,” which are based in a medical model of disability that focuses on individual impairment, cure, and accommodation, also impact the qualitative experience of library spaces for community members and library workers.

Over the last quarter of a century, disability activists and movements have challenged the primacy of the medical model, arguing for disability as a social construct, imposed by a society that values conformity over individuality (Oliver 1990; Olkin 1999). According to the social model, one is not disabled by his or her physical embodiment, but rather by a society unwilling to accept and accommodate the wide diversity of humanity. Igniting the disability rights movement, the social model served as an emancipatory force by empowering individuals to assert their rights and fight for the removal of disabling barriers (Barnartt & Altman 2001). More recently, critical disability theory, disability justice frameworks, and other disability frameworks have offered disability rights advocates and scholars a means to construct more authentic representations of disability, physical embodiment, social power structures, and intersecting identities. Both shifts could, if integrated into LIS, have profound implications for the way we conceptualize systems and services for disabled people.

Who Do We Believe? The Hierarchy of Credibility

The near-invisibility of disabled people in the research process can be attributed to what Becker (1966) referred to as a hierarchy of credibility. That is, the tendency of researchers to prioritize and amplify perspectives from those with greater institutional power and social capital over the voices and lived experiences of marginalized people. In academia, this hierarchy assigns the highest authority on information related to disability to medical professionals and other healthcare providers, parents, and additional caretakers; disabled people are placed at the bottom of the hierarchy. The regular exclusion of disabled people in positions of power in research reflects broader societal devaluation. This devaluation is evident in citation practices and research methods, such as those that assign authority to healthcare provider narratives as “expert opinions” while denigrating disabled people’s narratives as unscientific anecdotes. It is a gross understatement to say that the history of disability reflects a systemic lack of justice in American social and political systems (Braddock & Parish, 2001). Although the birth of the scientific method eventually facilitated more sophisticated biological and technical treatments and supports for many disabilities, it also established rigid parameters for “normalcy” and social aberrance, and firmly established lines dividing disability and power. These parameters persist in the current day, informing scholarship and practice that still elevate conformity to the norms established by a nondisabled white center. Simultaneously, that center reserves for itself the institutional, economic, and political power to shape norms, determine policy, establish the boundaries of “ability” and “disability,” and assign value to those categories. This hierarchy of intersectional ableism reflects the hierarchy of privilege and prerogative outlined in Harris’ (1993) Whiteness as property. We see this phenomenon frequently in discussions about disability accommodations, when nondisabled managers decide that an employee is not disabled “enough” to warrant accommodations, or that an accommodation is an “undue burden” (e.g., Pionke, 2017). The same processes that marginalize the employee also make it difficult for them to attain the career advancement needed to make those sorts of decisions themselves. Like whiteness, ableism protects itself by building social, political, and economic structures (policies, norms, institutions, etc.) that reinforce systems of privilege and exclusion. We see this in the elevation of professional training over lived experience, and parents and families over disabled individuals themselves. Research that focuses on disabled people as subjects but ignores their perspectives functions to exclude them from influencing LIS practice and research in any meaningful way. It silences critique in favor of topics, frameworks, and questions designed (and often answered) by nondisabled people.

Hill’s (2013) study offers one of the few expansive reviews of the LIS field’s literature on disability. Employing a content analysis of articles published between 2000 and 2010, Hill reported a scarce, unbalanced, and limited body of disability-related publications. Of the 198 articles identified, the vast majority focused on visual disabilities (41%) and non-specific, general “disabilities” (42%). Practitioner literature made up approximately 65% of the available literature. Hill described the articles as predominantly practitioner-centered expositions of how people with disabilities operate within library spaces, challenges encountered, and interventions. Hill also observed that these articles were most often written from a non-disabled viewpoint, gave minimal consideration to intervening social or attitudinal factors, and offered guidance largely lacking empirical support. Research investigations made up the remaining 35% of the literature reviewed. Of the seventy studies identified, only twenty-five solicited input from disabled people. Research involving accessibility testing commonly recruited non-disabled participants over disabled participants. Hill concluded, “Overall, there appears to be a lot of discussion about people with disabilities, but little direct involvement of these people in the research” (p. 141).

Methods

A systematic literature review was deemed most appropriate to meet the aims of the present study. As a methodological tool, the approach offers a highly structured and rigorously comprehensive process, allowing the research team to capture a bird’s eye view of the available literature, detect variance in the literature, and identify gaps in the recorded knowledge. This methodology also allows for replication and the ongoing pursuance, appraisal, and integration of new knowledge. While the method has been most widely applied within medical and healthcare research, Xu et al. (2015) report a recent rise in systematic (and quasi-systematic) reviews in LIS, and advocate for further use of the method.

Researchers conducted the systematic review between August 2018 to May 2019 using the following steps:

  1. Identifying research questions
  2. Determining criteria for the inclusion and exclusion of literature
  3. Establishing and implementing the search protocol
  4. Screening and extracting non-relevant results using inclusion/exclusion criteria
  5. Conducting a randomized sample selection of the literature
  6. Developing a coding schema and establishing intercoder reliability
  7. Reviewing and applying the coding schema

Research Questions

Building on data presented in Hill’s (2013) review of LIS literature covering the ten-year span between 2000 and 2010, this investigation aimed to establish a baseline understanding of the historical and present-day representation of people with disabilities through a systematic quantitative and qualitative examination of the LIS research and practitioner literature, guided by the following questions:

  1. How many articles about disability in LIS have been published in the 40-year period in question? How do LIS researchers define disability? What disability frameworks have informed LIS research related to disability between 1978 and 2018?
  2. What hierarchies of disability and credibility are evidenced in the literature during that period?
  3. To what extent are critical disability frameworks employed in writing about disability in LIS?

Inclusion Criteria

The next step was to establish distinct parameters for the scope of literature to be reviewed. This involved discussion among the research team to develop criteria for the inclusion and exclusion of specific works. As outlined in Table 1, the criteria covered publication date, peer-review status, presented content (topic and type of article), availability in full (either online or through interlibrary access), printed publication language, and the discipline-explicit subject heading of the journal or conference proceeding in which the article was published.

Table 1: Inclusion and Exclusion Criteria
Screening Category | Included | Excluded
Publication Date | Published between January 1978 and September 2018 | Published prior to January 1978
Language | English | Non-English
Subject Discipline | Library and Information Science (LIS) | Non-Library and Information Science
Review Process | Peer review process for publication | No peer review process for publication
Article Type | Original research, viewpoint, technical paper, conceptual paper, case study, literature review, general review | Book reviews, bibliographies, editorials, posters, announcements and other brief communications
Central Focus | Article content centers on people with disabilities | Article content contains only a perfunctory reference to people with disabilities
Availability | Article can be found online and in full text | Full text article is not available online

*Publication subject discipline verified using Ulrich’s Online Serial Directory

Establish and implement a search protocol

The research team developed a search protocol for the systematic identification and extraction of literature. This included determining relevant sources, database tools, and search terminology. Sources were limited to LIS journals and conferences with established peer-review processes for publication acceptance. The team selected two bibliographic databases to conduct the search based on relevancy, coverage, and reputation: Library and Information Science Source (LISS) and Library and Information Science Abstracts (LISA). Next, the researchers developed a list of keywords by conducting an exploratory search of controlled vocabulary within each database thesaurus. These keywords were intended to capture conditions listed under “disability” within each database and to capture common disabilities that were not specifically associated with variations on “disability” (Search query 1) in each database. This meant that names of specific conditions that might be expected in discussions about disability (e.g., “Deaf” or “Blind”) might not appear in initial searches, because they were included in the results from Search Query 1. Search queries 2 and 3 were intended to expand the possible scope of the initial list, in order to capture literature on specific conditions that were inconsistently indexed as disabilities in an exploratory literature search. From this list, three search queries were constructed (Table 2) comprising: (1) the term “disability” and variations of the term in current and past practice; (2) additional specific categories of disability for expansion of results list; and (3) and disability frameworks.

Table 2: Search Queries
Search Query | Query Content
Search Query 1 | disability OR disabled OR handicapped OR “differently abled” OR “differently-abled” OR disabilities OR “Special Needs” OR retardation OR retard*
Search Query 2 | Autism OR ASD OR ADHD OR “attention deficit hyperactivity disorder” OR “Auditory Processing Disorder” OR “Traumatic Brain Injury” OR dyslexia OR (add AND attention) OR (sensory AND disorder) OR Aphasia OR Agraphia OR “perceptual disturbance”
Search Query 3 | “Social role valorization” OR “complex embodiment”

Search queries 2 and 3 did not include specific categories and frameworks containing the word (or variations of the word) “disability” as it was assumed the queries would yield duplicate results captured under query 1, for example “physical disability” or “social model of disability”.

The three search queries, conducted on each database using the available filtering functions to narrow the scope of literature based on the established inclusion and exclusion criteria, yielded 17,632 references. Next, reference results were exported into a Microsoft Excel spreadsheet and displayed horizontally (one reference per row) and organized vertically into columns (by author, article title, publication date, etc.). Duplicate references were identified and extracted using conditional formatting and customized sort & filter functions. Upon completion of this process, 7859 duplicate articles were removed from the literature set and 9773 references remained. Table 3 presents an aggregate view of this process.
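
The same deduplication step can also be scripted; here is a hedged pandas equivalent of the spreadsheet workflow described above, with an illustrative file name and column names.

import pandas as pd

# Combined reference exports from the databases; file and column names are illustrative.
refs = pd.read_excel("references.xlsx")

# Normalise titles so trivial differences do not hide duplicates,
# then drop duplicate records within and between databases.
refs["title_norm"] = refs["Title"].str.lower().str.strip()
deduped = refs.drop_duplicates(subset=["title_norm", "Publication Year"])

print(len(refs) - len(deduped), "duplicates removed;", len(deduped), "references remaining")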

Table 3: Distribution of downloaded and duplicate references
Database | References Downloaded | Duplicates Identified Within and Between | Prescreened References
Library and Information Science Abstracts (LISA) | 13445 | (7859) | 7955
Library and Information Science Source (LISS) | 3562 | | 1620
Library and Information Science Technology Abstracts (LISTA) | 625 | | 198
Total | 17632 | (7859) | 9773

Screen and extract non-relevant reference data using inclusion/exclusion criteria

The next phase involved screening references for the purpose of extracting all non-relevant references from the literature set. As previously outlined in Table 1, screening categories included the article publication date, printed language, publication discipline, review process, article type, online availability, and central focus. While database filter functions were activated during the initial query search process, manual screening proved necessary to ensure inclusion and exclusion criteria were met. A summary of the process can be found in Table 4, followed by detailed descriptions of the screening categories.

Table 4: Summary of screening process
Screening Category | References Removed | References Remaining
Publication date | 0 | 9773
Language | 223 | 9550
Subject discipline | 2382 | 7168
Review process | 21 | 7147
Article type | 927 | 6220
Topic relevance | 5400 | 820
Online and full text availability | 161 | 659 (remainder requested via Interlibrary Loan)

Publication Date. The first step in the screening process was to organize the literature set by publication year and scan for articles published prior to January 1978. No articles were identified for extraction during this process.

Printed Language. To identify and extract non-English publications, the language column of the literature set was alphabetized, thereby grouping publications by language for bulk removal. In some cases, languages were displayed in brackets following the article’s title (e.g., “Title [Hungarian]”). These titles were identified using Excel’s “Find” function to search for brackets. The remaining non-English titles were identified individually throughout the screening process. The researchers removed a total of 223 non-English language articles, leaving 9550 references in the set.

Subject Discipline. The literature set was then screened to identify and remove publications (i.e., journals and conference reports) from outside of LIS. To accomplish this step, references were grouped by publication title and checked against Ulrich’s Online Serial Directory to verify each publication’s subject heading. Publications not assigned an LIS subject heading (e.g., “Library and Information Science” or “Information Science and Information Theory”) were removed from the literature set. A total of 7168 references remained after the removal of non-LIS sources.

Review Status. In addition to extracting non-LIS publications, Ulrich’s Online Serial Directory was used to determine a publication’s review process. Works published in sources without a peer-review process were removed, accounting for 21 references with 7147 references remaining in the literature set.

Article Type. References were then sorted and categorized by article type, as determined by the established exclusion criteria. This task was completed using Excel’s “Find” function to search the title and abstract columns. Table 5 provides a list of terms employed in the search. References identified through this process were then screened by eye to confirm or reject group identity and color-coded accordingly. Nine hundred twenty-seven references met the criteria for exclusion from the study and were removed. A total of 6220 references remained.

Table 5: Exclusion criteria screening terms
Content Type Excluded Search Terms
Book reviews “book review”, “is reviewed by”, (name of reviewer) “reviews the book”
Bibliographies “bibliograph*”
Posters “poster”
Editorials “editor*”
Introductions “introduction” “special issue”
Brief communication “notes” “memorandum” “announcements” “news brief” “from the field” “article alerts”
Conference Proceeding, Misc. “schedule” “agenda” “overview” “conference” “meeting” “summary” “minutes” “proceedings”

Central focus. Remaining references were then screened to determine topic relevance – that is, whether articles predominantly centered on disability and disabled people. This entailed open coding a subset of the data to create an extensive list of disability-related search terms, including disability classifications, frameworks, degree-of-impact language, intervention-related terms, and people-first language (Table 6). Again, the purpose here was to avoid excluding any article focused on disability while beginning to weed out articles that were not focused on human disabilities. Additionally, we attempted to stick as closely as possible to an emergent characterization of disability drawn from the literature, rather than imposing our own definition.

Table 6: Disability-related search terms
Terms
Disability Classifications ADHD, Alzheimer, apraxia*, asperg*, acquired brain injury, ASD, Attention, autis*, behavior challenges, blind, brain damage, cognitive needs, deaf, delay*, dementia, depress*, development*, disab*, disorder, dwarf*, dyslexi*, epilep*, gait, handicap*, hearing, impair*, little people, low vision, memory, mental illness, mobility, limitations, motor skills, multiple sclerosis, neurodiv*, palsy, PPD-NOS, quadri*, retard*, sensory, special needs, spinal muscular atrophy, stammer, stutter, syndrome, TBI, tic, traumatic brain damage
Degree of Impact mild, moderate, profound, severe, significant
Intervention accessibil*, AAC, adaptive, assisted, assistive, augment, braille, cochlear implants, habil*, inclusion, inclusiv*, sign language, special education, therap*, wheelchair
Population children with, individuals with, people with, young adults with, underserved, marginalized
Disability Frameworks social model, medical model, critical disability, disability studies, social role valorization, ecological model, complex embodiment

Using these terms, the abstract column was searched with Excel’s “Find” function. Abstracts containing one or more search terms were marked for screening and reviewed individually to determine topic relevance and central focus. Abstracts without a central focus on disability or disabled people were extracted from the set one at a time, and additional disability classifications were added to the term list as they emerged after the initial search. Upon completion, 820 relevant works that were available online, in full text, or via Interlibrary Loan remained. Table 4, above, summarizes the screening process used to filter non-relevant references out of the literature set.
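
The same kind of term matching can be scripted; the sketch below flags abstracts containing any of a handful of the Table 6 terms for manual review. It is a Python/pandas stand-in for the Excel “Find” workflow described above, and the file and column names are assumptions.

import re
import pandas as pd

# A small subset of the Table 6 terms; the asterisks in the published list
# are truncation wildcards, represented here as bare prefixes.
terms = ["disab", "autis", "blind", "deaf", "dyslexi", "wheelchair",
         "assistive", "special education", "social model"]
pattern = "|".join(re.escape(t) for t in terms)

refs = pd.read_csv("screened_references.csv")  # hypothetical file name
refs["flagged"] = refs["abstract"].fillna("").str.contains(pattern, case=False, regex=True)

# Flagged abstracts still need to be read individually to confirm that
# disability or disabled people are the article's central focus.
to_review = refs[refs["flagged"]]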

Select a randomized sample from the literature

Given the high volume of articles identified and time limitations of the research team, a representative sample size from the literature set was calculated at 262 with a confidence level of 95% and a confidence interval of 5. In total, 282 references were randomly selected using an online randomizer tool.
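
For readers who want to check the arithmetic, Cochran’s formula with a finite population correction reproduces the reported sample size; the sketch below assumes that is roughly what the team’s calculator did, and uses random.sample as a stand-in for the online randomizer tool.

import math
import random

def sample_size(population, z=1.96, margin=0.05, p=0.5):
    # Cochran's formula for an infinite population, then the finite
    # population correction for the 820 relevant references.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return math.ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size(820))  # -> 262 at 95% confidence and a 5% interval

# Randomly select reference rows (the study used an online randomizer tool).
selected_rows = random.sample(range(820), k=282)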

Develop coding schema and establish intercoder reliability

The primary aim of the study was to gain a better sense of how disability is understood and represented in the LIS literature, and moreover, the extent to which critical disability theory has informed research and practice. To accomplish this goal, the reviewers established a coding schema representing basic determinants of a critical disability approach. Table 7 provides a short list of codes and criteria. While precise definitions of critical disability theory differ, we applied the following common criteria when rating the articles (Alper and Goggin, 2017; Goodley, 2013; Meekosha and Shuttleworth, 2009).

  1. Connects theory with practice/praxis and vice versa (Code: TAP). The article explicitly sits within ongoing conversations among academic research communities, communities of practice (e.g. librarianship, occupational medicine), and the disability communities being discussed. The criteria for connection between theory and practice/praxis refers to connections with ongoing discussions that affect the lived experiences of disabled people, as well as contributions to those discussions.
  2. Recognizes disability as socially and politically constructed (Code: CON). The article goes beyond an individual-medical definition of disability to include the understanding that disability is also a state of being (or an identity) produced through socially constructed definitions of physical, emotional, mental, and social “normalcy” and “fitness;” and reinforced through systems and infrastructure (e.g., information systems and physical structures) that support those understandings of normalcy.
  3. Acknowledges the corporeality of disability and engages with the reality of physical impairment and individual perception of limitations (Code: BOD). The article recognizes that, while disability is socially and politically constructed, it is also situated within the individual, and acknowledges experiences situated in the individual body and mind such as pain and sense of impairment.
  4. Explicitly acknowledges and/or addresses power and/or justice (Code: POW). The article explicitly acknowledges the existence of social and political power in information systems and services and/or frames equity/equality in terms of rights and justice. The writers of the article write from the position that disabled people have the right to equality and equity in information access. The article does not characterize access as optional, extra, or “charitable”.
  5. Builds on an activist historical tradition of fighting the exclusion of disabled people (Code: OPP). The article makes practical recommendations to improve access to systems, information, or data.

Each article was rated for each criterion on a scale from 1 (more true) to 5 (less true).

Table 7: Critical disability coding schema
Code Criteria
TAP Connects theory with praxis and vice versa.
CON Recognizes disability as socially and/or politically constructed.
BOD Acknowledges the corporeality of disability and engages with the reality of physical impairment and limitations.
POW Explicitly acknowledges and/or examines issues related to power, access and justice. Frames access as a right.
OPP Builds on activist historical tradition in response to exclusion of disabled people. Recommends changes to improve on oppressive or exclusionary systems.

To ensure intercoder reliability, the three reviewers independently coded an identical sample of twenty articles and compared findings using the ReCal3 tool (Freelon, 2017; Freelon, 2010). Krippendorff’s alpha for the three coders was .867. Once intercoder reliability was established, each reviewer read and coded the full text of their assigned set of articles (Table 8).
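
A scripted equivalent of this reliability check might look like the following, assuming the third-party krippendorff Python package as a stand-in for the ReCal3 web service; the ratings below are invented for illustration, and the reported alpha of .867 comes from the study’s actual coding, not from this sketch.

import numpy as np
import krippendorff  # third-party package: pip install krippendorff

# Rows are the three coders, columns are the twenty jointly coded articles,
# and values are ratings on one criterion (np.nan would mark missing codes).
ratings = np.array([
    [1, 2, 5, 3, 1, 4, 2, 2, 5, 1, 3, 3, 2, 4, 1, 5, 2, 3, 1, 2],
    [1, 2, 4, 3, 1, 4, 2, 2, 5, 1, 3, 3, 2, 4, 1, 5, 2, 3, 1, 2],
    [1, 3, 5, 3, 1, 4, 2, 2, 5, 1, 3, 2, 2, 4, 1, 5, 2, 3, 1, 2],
], dtype=float)

# The 1-5 rating scale is ordered, so the ordinal difference metric applies.
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(round(alpha, 3))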

Table 8: Intercoder reliability
Coder Sample Size
Coder 1 139
Coder 2 118
Coder 3 25
Total 282

Findings

The following findings begin with a general analysis of the identified 820 relevant references available online and in print, followed by a more nuanced analysis of a representative sample (n=282) of the relevant literature to better understand the ways in which disability has been understood and represented in the LIS research and practice literature over the years.

General overview of the disability-related literature published in LIS

The number of LIS articles focused on disability and disabled people has increased sharply over the past four decades. Figure 1 presents the distribution of disability-related articles published in LIS journals over that period. As Hill (2013) noted, much of the work was practitioner focused, but the percentage of work focused on technology increased over time. Of the 820 relevant articles identified over the 40-year span, 396 (48.29%) center on various types or aspects of technology, including assistive technology, mobile technologies, web design, software development, communication devices, accessibility, and the digital divide.

Figure 1: Distribution of articles published between 1978 and 2018

A focus on technology in original research.

Figure 2 shows, for each decade, how many of the 820 relevant articles focused specifically on technology. Over the 40-year period, the percentage of papers focused on technology grew from 6.45% in 1978-1988 to 53.87% in 2009-2018.

Figure 2: Proportion of LIS disability articles focused on technology between 1978 and 2018

The final sample of 282 articles included original research, technical papers, conceptual works, viewpoint articles, and literature reviews. As highlighted in Figure 3, research papers comprised the largest portion of the literature set (55.1%). Of the 155 original research papers in the final sample, 124 (80.7%) studies predominantly focused on technology. Studies related to website accessibility comprised 44.8% of this research. Other technology-related topics included assistive technology design and evaluation, evaluation and assessment of mobile platforms, e-reader use and program implementations, online behavior, and distance education.

Figure 3: Distribution of articles in the representative sample

What disability communities do we study?

Figure 4 describes the frequency of specific disabilities identified in the representative sample. Most authors addressed disability as a single, broad category, generalizing among a wide range of conditions (41%, n=118). The second category of disability that emerged most frequently in the literature was blindness, vision impairment, and vision loss (19%, n=53), followed by learning disabilities (7.4%, n=21). These two findings echo Hill’s (2013) findings. The least frequently listed were palliative care patients (.3%, n=1), people with multiple sclerosis (.3%, n=1), and people with HIV (.7%, n=2). It should be noted that we did not add chronic illness categories to the initial search for disability topics. It is likely that the number of articles found is more reflective of LIS’ general exclusion of people with chronic illness from disability discourse over the period of study than a reflection of the total LIS research focused on HIV.

Figure 4: Categories of disability identified in the sample literature set

Assessing Disability Models

Articles were assessed and given a single point for each of the five critical framework criteria met (see Table 7 for an explanation of the criteria). This examination of the sample literature set reveals that most works do not engage with critical disability frameworks in any meaningful way (see Figure 5). Approximately 54% (n=153) of the articles explicitly connected the work to some form of theory or built on previous research and broader conversations in disability communities. Of the 282 articles, 79% (n=223) addressed disability as a physical and/or cognitive condition experienced by individuals. Only 36% acknowledged disability as a social and/or political construct. It should be noted that these categories (disability as physical and disability as social) are not mutually exclusive: approximately one-third (32.7%) of the articles acknowledged disability as influenced by a combination of individual/physical and social/political factors. Approximately 44% (n=122) of the sample explicitly recognized information access and accessibility as related to power or as a justice issue. Sixty-seven percent (n=190) of the articles in the sample recommended systemic change, but these changes varied based on authors’ specific definitions of disability. Those who defined disability primarily physically were more likely to suggest changes to technical systems; those who defined it socially were more likely to suggest changes in services, policy, and training.

Figure 5: Frequency of Each Critical Criteria in Sample

Critical criteria points were tallied and each article was given a critical framework score (CFS). The higher the CFS, the more criteria the title fulfilled. Of the articles in the sample, most fell between 2CFS (24.9%, n=70) and 3CFS (26%, n=73). Among the 2CFS group, BOD (recognizing disability as individual/physical impairment) (80%, n=56) and OPP (makes recommendations for change) (55.7%, n=39) were most frequently selected, suggesting that most articles took a strictly medical approach and made suggestions for improvements based on that perspective. CON (disability as socially constructed) was the least popular (15.7%, n=11). Among the 3CFS group, BOD (83.6%, n=61), OPP (82.2%, n=60), and TAP (connects work to theory) (67.1%, n=39) were most frequently selected. POW (addresses power, access, and justice) (n=32) and CON (n=11) trailed behind. Among the 4CFS group, CON (60.5%, n=26) was selected least.
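
The scoring itself is a simple tally; the sketch below shows one way to compute critical framework scores and the per-criterion counts within a score band, using invented article codings for illustration.

from collections import Counter

# Each article is represented by the set of criteria it met; these four
# example articles are invented for illustration.
articles = [
    {"BOD", "OPP"},
    {"TAP", "BOD", "OPP"},
    {"BOD"},
    {"TAP", "CON", "BOD", "POW", "OPP"},
]

# Critical framework score (CFS) = number of criteria met, from 0 to 5.
cfs_distribution = Counter(len(codes) for codes in articles)
for score in range(6):
    print(f"{score}CFS: {cfs_distribution.get(score, 0)} articles")

# Within a score band (e.g. 2CFS), count how often each criterion appears.
band = [codes for codes in articles if len(codes) == 2]
per_criterion = Counter(code for codes in band for code in codes)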

Figure 6: Percentage of Titles with Total Critical Framework Scores (CFS)

Many articles avoided discussing social implications of recommendations made beyond immediate service or technology improvements. Authors largely sidestepped intersections of race, gender, sexuality, and disability, focusing on a single-axis identity (disability only).

Hierarchies of Credibility

Assessment of credibility in articles coded as research studies was conducted qualitatively, as this data is more nuanced than other measures and therefore more difficult to evaluate quantitatively. Because many studies were multi-method, several fit into more than one category. Five strong themes emerged in the sample studies that hint at a framework for assigning credibility and authority in the design and evaluation of LIS disability research (presented in order of relative frequency).

1. Self-referred authority: Author described systems/services/designs proposed, created, and (sometimes) evaluated by the author (without input from disabled people). These included self-generated system designs, and articles describing library programs.

2. Academic/Professional authority: Heuristic evaluations of existing programs and policies based on previous researchers’ and practitioners’ work, reflecting a traditional literature review model and traditional academic citation practices.

3. Academic/Institutional authority: Policy, practice, and curricular analyses – sometimes informed by previous researchers’ work and statements or input from local, national, and international disability organizations.

4. Self-referred authority/Mediated participation: Primarily experiments engaging disabled participants as research subjects in studies designed by (presumably nondisabled) researchers.

5. Experiential authority: Surveys and interviews that asked disabled people for their thoughts/opinions.

Discussion

LIS researchers continue to struggle with operationalizing power and identity in studies related to access, racism, sexism, and ableism (Honma, 2005; Yu, 2006). Many also continue to ignore the social and technical implications of power relationships between and among groups. Kumbier and Starkey (2016) argue that the “conceptualization of access as a matter of equity requires library workers to account not only for the needs of individual users, or of specific groups of users but also for the contexts (social, cultural, historical, material, and economic) that shape our users’ terms of access.” The extent and quality of information services in support of disabled people directly reflects how disability is perceived and understood within LIS literature, practice, and education. An appraisal of the LIS landscape suggests an unprioritized and short-sighted understanding of disability, and much work left to do.

We exist in a time where we are constantly confronted by the importance and value of human-centered and humane, respectful, compassionate information and data systems and services. Libraries are constantly innovating and reinventing themselves. The COVID-19 pandemic has, yet again, given us an opportunity to listen to our disabled and chronically ill colleagues and community members, reflect, and re-imagine a field that values and respects us all equally. What would it mean to decenter nondisabled perspectives in our practice and our research and to lean into the frameworks and values established in disability justice communities and literature? What would it mean to deconstruct our hierarchy of credibility and rely on disability communities to define who is disabled? What would our research and education look like if we regularly engaged with disabled people as research collaborators, rather than subjects?

In our current context, what could co-liberatory information work become? How would we expand our thinking and practice beyond technical accessibility, basic physical accessibility, and segregated programming models? Is this even possible, given the financial and administrative structures of most public and academic libraries? Much of the work sampled framed the discipline’s struggle to provide minimally equivalent access and services as innovation. A large percentage of technology research focused on basic web accessibility and relied on self-referred or professional authority. Few studies actually engaged disabled participants in ways that might influence the scope or design of the research. Fewer still considered intersectionality (Crenshaw, 1990), and the experiences of disabled BIPOC and/or disabled LGBTQIA+ people. This highlights the scarcity of training and education on disability issues, rights, justice, and accessibility in the field.

This gap has, in the past, limited the continued development of substantive theory and praxis around disability and information within LIS. Reassessing our disciplinary and cultural preference for individualism over interdependence could support information systems, services, spaces, and communities built on care webs (Piepzna-Samarasinha, 2018) and institutional responsibility rather than individual access and responsibility, and on wholeness (Berne et al., 2018) rather than clinical cure or rehabilitation. Recent growth in the number of researchers in this area, including BIPOC researchers (Cooke & Kitzie, 2021) and disabled researchers (Brown & Sheidlower, 2019; Copeland et al., 2020; Niebauer, 2020; Schomberg and Hollich, 2019), along with an increase in community-engaged and community-led research, provides a point for optimism. This holds promise for a more radical redesign of systems and services related to information, data, and literacies in more diverse communities.

The COVID-19 pandemic has reminded us how important access to good health information is for public decision-making, but this need for public data and information is not new. The need for national and local community information and data exists for marginalized folks all the time. We envision a public and academic (and perhaps health) librarianship that follows the lead of disabled community members, supporting decision-making on issues of importance to them, such as personal data stewardship and safety, informed refusal, and financial and insurance literacy. Community-led research and practice should move beyond leveling access and centering marginalized people to the work of dismantling margins whenever possible.

No Justice Without Critique

The critical disability framework we developed was not an especially rigorous one, in terms of building capacity for equitable and just information systems and spaces for people with disabilities, and most of the works examined scored only a 2 or 3 out of 5. Disability justice advocates have long pushed for much more transformative policies, disability-led work (rather than participatory work, which is typically led by nondisabled people), anti-capitalism, collectively oriented systems of care and access (rather than individually oriented access), humane and sustainable labor practices and ideologies, and collective liberation (Hosking, 2008; Berne et al., 2018). The framework also excluded explicit acknowledgment of intersectionality and of justice-oriented methodologies (e.g., community-led research and autoethnography). While a critical perspective alone isn’t enough to build just systems, we cannot hope to build just, humane, or useful systems with a timid stance on critique. Additionally, a myopic focus on individual behaviors, intentions, and cognitive perspectives hampers our capacity for rigorous examination and improvement. A critical perspective enables us to contend with the social systems, norms, and frameworks that scaffold our research and practice, the definitions, assumptions, and biases that form their foundations, and the systems of power and control that serve as mortar to hold them together.

Acknowledgments

We would like to thank Jessica Schomberg and Ryan Randall for your work in reviewing this paper. Your insightful feedback helped us clarify and strengthen our arguments and explanations. We would also like to thank Ian Beilin for managing this publication and keeping it on track through a summer of unrest and a year of pandemic life. This publication would not have been possible without your support.

This publication was made possible in part by the Institute of Museum and Library Services RE-07-17-0048-17.

Accessible Equivalents

Figure 1 as a Table

Figure 1. Distribution of articles published between 1978 and 2018
Year Number of Disability-related articles
1978-1988 62
1989-1998 90
1999-2008 280
2009-2018 388

Return to Figure 1 caption.

Figure 2 as a Table

Figure 2: Proportion of LIS disability articles focused on technology between 1978 and 2018
Year Disability with Tech Disability without tech Total articles Percentage with Focus on Technology
1978-1988 4 58 62 6.45%
1989-1998 25 65 90 27.78%
1999-2008 158 122 280 56.43%
2009-2018 209 179 388 53.87%

Return to Figure 2 caption.

Figure 3 as a Table

Figure 3: Distribution of articles in the representative sample
Number of Articles Percentage of Articles
Original research 145 55.13%
General reviews 51 19.39%
Technical papers 23 8.75%
Conceptual works 19 7.22%
Viewpoint articles 16 6.08%
Literature reviews 9 3.42%

Return to Figure 3 caption.

Figure 4 as a Table

Figure 4: Categories of disability identified in the sample literature set
Category of Disability Number of Articles Percentage of Articles (Rounded)
Non-specific and generalized 127 45
Blindness and low vision 59 21
Learning disabilities, including dyslexia and other print disabilities 28 10
Deafness and hearing impairment 19 7
Physical disabilities 13 5
Autism 10 4
General aging 10 4
Intellectual disabilities, including Down syndrome 8 3
Mental illness, including dementia 3 1
HIV 2 1
Multiple sclerosis 2 1
Palliative – end of life 1 0
Totals 282 102

Return to Figure 4 caption.

Figure 5 as a Table

Figure 5: Frequency of Each Critical Criteria in Sample
Critical Criteria Number of Articles
CON (Disability as socially constructed) 102
POW (Addresses power, access, and/or justice) 123
TAP (Connects theory to practice/praxis) 153
OPP (Recommends systemic changes) 190
BOD (Disability as physical/medical model) 223

Return to Figure 5 caption.

Figure 6 as a Table

Figure 6: Percentage of Titles with Total Critical Framework Scores (CFS)
Number of Criteria Fulfilled Percentage of Articles Number of Articles
1 Criterion 17.79% 50
2 Criteria 24.91% 70
3 Criteria 25.98% 73
4 Criteria 15.3% 43
5 Criteria 16.01% 45

Return to Figure 6 caption.

References

Alper, M., Goggin, G. (2017). Digital technology and rights in the lives of children with disabilities. New Media & Society 19, 726–740. https://doi.org/10.1177/1461444816686323

American Library Association. (2009, February 2). Services to People with Disabilities: An Interpretation of the Library Bill of Rights. Advocacy, Legislation & Issues. http://www.ala.org/advocacy/intfreedom/librarybill/interpretations/servicespeopledisabilities

American Library Association. (2015). Standards for Accreditation of Master’s Programs in Library and Information Studies. http://www.ala.org/accreditedprograms/standards/glossary

American Library Association. (2019). Democracy Statement. http://www.ala.org/aboutala/governance/officers/past/kranich/demo/statement

Barnartt, S., and B. M. Altman. (2001). “Exploring Theories and Expanding Methodologies: Where We Are and Where We Need to Go.” Research in Social Science and Disability.

Becker, Howard S. (1966). “Whose Side Are We On.” Social Problems 14. https://heinonline.org/HOL/Page?handle=hein.journals/socprob14&id=245&div=36&collection=journals.

Berne, P., Morales, A. L., Langstaff, D., & Invalid, S. (2018). Ten principles of disability justice. WSQ: Women’s Studies Quarterly, 46(1), 227-230.

Bourg, C. (2014, January 16). The Neoliberal Library: Resistance is not futile. Feral Librarian. https://chrisbourg.wordpress.com/2014/01/16/the-neoliberal-library-resistance-is-not-futile/

Braddock, D. & Parish, S. (2001). An institutional history of disability. In G. L. Albrecht, K. Seelman & M. Bury Handbook of disability studies (pp. 11-68). Thousand Oaks, CA: SAGE. doi:10.4135/9781412976251.

Brown, R., & Sheidlower, S. (2019). Claiming Our Space: A Quantitative and Qualitative Picture of Disabled Librarians. Library Trends 67(3), 471-486. doi:10.1353/lib.2019.0007.

Cooke, N. A., & Kitzie, V. L. (2021). Outsiders-within-Library and Information Science: Reprioritizing the marginalized in critical sociocultural work. Journal of the Association for Information Science and Technology, n/a(n/a). https://doi.org/10.1002/asi.24449

Copeland, C. A., Cross, B., & Thompson, K. (2020). Universal Design Creates Equity and Inclusion: Moving from Theory to Practice. South Carolina Libraries, 4(1), 18. https://doi.org/10.51221/sc.scl.2020.4.1.7

Crenshaw, K. (1990). Mapping the margins: Intersectionality, identity politics, and violence against women of color. Stanford Law Review, 43, 1241.

Ettarh, F. (2018). Vocational Awe and Librarianship: The Lies We Tell Ourselves. In The Library With the Lead Pipe. http://www.inthelibrarywiththeleadpipe.org/2018/vocational-awe/

Fadyl, J. K., Teachman, G., & Hamdani, Y. (2020). Problematizing ‘productive citizenship’ within rehabilitation services: Insights from three studies. Disability and Rehabilitation, 42(20), 2959–2966. https://doi.org/10.1080/09638288.2019.1573935

Freelon, D. (2017). ReCal3. Retrieved from http://dfreelon.org/recal/recal3.php

Freelon, D. (2010). ReCal: Intercoder reliability calculation as a web service. International Journal of Internet Science, 5(1), 20-33.

Gibson, A. N., & Hanson-Baldauf, D. (2019). Beyond Sensory Story Time: An Intersectional Analysis of Information Seeking Among Parents of Autistic Individuals. Library Trends, 67(3), 550–575. https://doi.org/10.1353/lib.2019.0002

Goodley, D. (2013). Dis/entangling critical disability studies. Disability & Society, 28(5), 631–644. https://doi.org/10.1080/09687599.2012.717884

Goodman, N., Morris, M., & Boston, K. (2019). Financial Inequality: Disability, Race, and Poverty in America. A report by the National Disability Institute. Retrieved from https://www.nationaldisabilityinstitute.org/wp-content/uploads/2019/02/disability-race-poverty-in-america.pdf

Harris, C. I. (1993). Whiteness as Property. Harvard Law Review, 106(8), 1707–1791. https://harvardlawreview.org/1993/06/whiteness-as-property/

Hill, H. (2013). Disability and accessibility in the library and information science literature: A content analysis. Library & Information Science Research, 35(2), 137–142. https://doi.org/10.1016/j.lisr.2012.11.002

Honma, T. (2005). Trippin’ Over the Color Line: The Invisibility of Race in Library and Information Studies. InterActions: UCLA Journal of Education and Information Studies, 1(2). http://escholarship.org/uc/item/4nj0w1mp

Hosking, D. L. (2008, September). Critical disability theory. In A paper presented at the 4th Biennial Disability Studies Conference at Lancaster University, UK.

Jaeger, P. T., Gorham, U., Bertot, J. C., & Sarin, L. C. (2014). Public Libraries, Public Policies, and Political Processes: Serving and Transforming Communities in Times of Economic and Political Constraint. Rowman & Littlefield Publishers.

Knox, E.J.M. (2020). Intellectual freedom and social justice: Tensions between core values in librarianship. Open Information Sciences, 4(1). https://doi.org/10.1515/opis-2020-0001

Kumbier, A., & Starkey, J. (2016). Access Is Not Problem Solving: Disability Justice and Libraries. Library Trends, 64(3), 468–491. https://doi.org/10.1353/lib.2016.0004

Lee, J., & Cifor, M. (2019). Evidences, implications, and critical interrogations of neoliberalism in information studies. Journal of Critical Library and Information Studies, 2(1), 1–10. https://journals.litwinbooks.com/index.php/jclis/article/view/122

Mathios, K. (2019, July 31). The commodification of the library patron. EdLab, Teachers College Columbia University. https://edlab.tc.columbia.edu/b/23060

Meekosha, H., & Shuttleworth, R. (2009). What’s so ‘critical’ about critical disability studies? Australian Journal of Human Rights, 15(1), 47–75. https://doi.org/10.1080/1323238X.2009.11910861

Niebauer, A. (2020, February 28). Information studies prof works to address mental illness among librarians. UWM REPORT. https://uwm.edu/news/information-studies-prof-works-to-address-mental-illness-among-librarians/

Oliver, M. (1990). The Ideological Construction of Disability. In M. Oliver (Ed.), The Politics of Disablement (pp. 43–59). Macmillan Education UK. https://doi.org/10.1007/978-1-349-20895-1_4

Olkin, R. (1999). The personal, professional and political when clients have disabilities. Women & Therapy, 22(2), 87–103. https://doi.org/10.1300/J015v22n02_07

Pionke, J. J. (2017). Toward Holistic Accessibility: Narratives from Functionally Diverse Patrons. Reference & User Services Quarterly, 57(1), 48–56. https://doi.org/10.5860/rusq.57.1.6442

Piepzna-Samarasinha, L. L. (2018). Care work: Dreaming disability justice. Arsenal Pulp Press.

Schomberg, J., and Hollich, S. (Eds.). (2019). Disabled Adults in Libraries [Special issue]. Library Trends, 67(3). https://muse.jhu.edu/issue/40307

Sparke, M. (2017). Austerity and the embodiment of neoliberalism as ill-health: Towards a theory of biological sub-citizenship. Social Science & Medicine, 187, 287–295. https://doi.org/10.1016/j.socscimed.2016.12.027

Stikkers, K. (2014). “ . . . But I’m not racist”: Toward a pragmatic conception of “racism.” The Pluralist, 9(3), 1–17. https://doi.org/10.5406/pluralist.9.3.0001

Webb, C. J. R., & Bywaters, P. (2018). Austerity, rationing and inequity: Trends in children’s and young peoples’ services expenditure in England between 2010 and 2015. Local Government Studies, 44(3), 391–415. https://doi.org/10.1080/03003930.2018.1430028

Xu, J., Kang, Q., & Song, Z. (2015). The current state of systematic reviews in library and information studies. Library & Information Science Research, 37(4), 296–310. https://doi.org/10.1016/j.lisr.2015.11.003

Yu, L. (2006). Understanding information inequality: Making sense of the literature of the information and digital divides. Journal of Librarianship and Information Science 38, 229–252. https://doi.org/10.1177/0961000606070600

Vexation After Vexation / William Denton

STAPLR is running a new composition: “Vexation After Vexation,” an interpretation of Erik Satie’s mysterious solo piano work Vexations. You can listen to STAPLR on the site or right here:

I ran “Library Silences” for months and months—it seemed appropriate—but now it’s time for something different. (York University Libraries Ambiences are always available for home listening.)

Photo from Wikipedia

The score of Vexations fits on one page, but instructions say (translated from French), “In order to play the theme 840 times in succession, it would be advisable to prepare oneself beforehand, and in the deepest silence, by serious immobilities.” The piece was ignored until John Cage took it, and the suggestion about 840 repetitions, seriously and organized a performance in 1963 where it was actually played 840 times. (This has been done many times since. In 2010 during Nuit Blanche I watched a performance downtown for quite a while. It was beautiful. If you want to hear it yourself by a real pianist, I recommend buying Stephane Ginsburgh’s 42 Vexations (1893) and making a playlist where the one track is played twenty times.)

One of many Nuit Blanche pianists

In this STAPLR composition, one one-minute iteration of Vexations is played for each minute of help given at any desk at York University Libraries that day. It keeps a running counter of how many more minutes it should play. Let’s say that at 0900 someone checks their email and answers a quick research question that takes them five minutes. They enter that into our reference statistics system, where STAPLR sees it and counts 5. It starts to play five iterations. One minute later, the counter is at 4. One minute later, the counter goes down to 3, but there’s another question in the system, this time a virtual chat that took 10 minutes to answer, so the counter goes up to 13. After one iteration it goes down to 12, then 11, then up again if there’s another question. If the counter reaches 0 it will wait and start back up when there’s another question answered.
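
In pseudocode terms, the composition is just a counter that ticks down once a minute; here is a minimal Python paraphrase of that loop (STAPLR itself is not written this way, and the two helper functions are placeholders).

import time

def fetch_new_help_minutes():
    # Placeholder: poll the reference statistics system and return the
    # minutes of help recorded since the previous check.
    return 0

def start_vexations_iteration():
    # Placeholder: trigger one one-minute iteration of Vexations.
    pass

counter = 0
while True:
    counter += fetch_new_help_minutes()   # a 5-minute question adds 5
    if counter > 0:
        start_vexations_iteration()       # plays for the coming minute
        counter -= 1
    time.sleep(60)                        # tick once per minute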

STAPLR screenshot

This is how it began this morning:

[2021-02-23 07:57:40] Vexation After Vexation 1.0: {} (0 mins)
[2021-02-23 07:58:40] Vexation After Vexation 1.0: {} (0 mins)
[2021-02-23 07:59:40] Vexation After Vexation 1.0: {"AskUs"=>{"1"=>[3]}} (3 mins)
[2021-02-23 08:00:40] Vexation After Vexation 1.0: {} (2 mins)
[2021-02-23 08:01:40] Vexation After Vexation 1.0: {"AskUs"=>{"1"=>[3]}} (4 mins)
[2021-02-23 08:02:40] Vexation After Vexation 1.0: {} (3 mins)
[2021-02-23 08:03:40] Vexation After Vexation 1.0: {} (2 mins)
[2021-02-23 08:04:40] Vexation After Vexation 1.0: {} (1 mins)
[2021-02-23 08:05:40] Vexation After Vexation 1.0: {"AskUs"=>{"1"=>[3]}} (3 mins)
[2021-02-23 08:06:40] Vexation After Vexation 1.0: {} (2 mins)
[2021-02-23 08:07:40] Vexation After Vexation 1.0: {} (1 mins)
[2021-02-23 08:08:40] Vexation After Vexation 1.0: {} (0 mins)
[2021-02-23 08:09:40] Vexation After Vexation 1.0: {} (0 mins)

Close to 1000 it really got going:

[2021-02-23 09:53:40] Vexation After Vexation 1.0: {"Osgoode"=>{"4"=>[40]}} (40 mins)

That ran down for 13 minutes then more activity came in and it’s been going ever since. I’m curious to see when it stops. (The server reboots around 0600, but it could run all night.)

Vexations has a bass theme played in the left hand and two sections (the second a slight variation of the first) played by the right hand, accompanied by the bass theme in the left. It’s usually played thus: bass theme alone, theme A, bass theme alone, theme B, repeat. There are 13 quarter-notes in each section, so one full repetition is 4 × 13 = 52 beats; setting the speed to 52 bpm makes it work out to exactly one minute per repetition. This is faster than it’s normally played, but it still works well.

“Vexation After Vexation” doesn’t tell you how busy the desks at York University Libraries are right now, the way other STAPLR sonifications do, but I think it perfectly combines Satie and STAPLR. I’m looking forward to listening to it through to the end of April, at least.

Press play, turn the volume low, and let it go in the background through your day as a piece of aural furniture.

Engaging with the Canadian Web Archiving Coalition / Archives Unleashed Project

Engaging with the Canadian Web Archiving Coalition

Earlier this year, Archives Unleashed had an opportunity to engage with a Canadian-based group focusing on web archiving support for practitioners and researchers. The Canadian Web Archiving Coalition (CWAC) is a community that brings together libraries, archives, and other memory institutions across Canada that engage with web archiving. CWAC is part of the Canadian Association of Research Libraries (CARL), which focuses on “identifying gaps and opportunities that could be addressed by nationally coordinated strategies, actions, and services, including collaborative collection development, training, infrastructure development.” (1)

The launch of the World Wide Web created a massive shift for research because it dramatically changed the way we produce, interact with, and preserve information. While the Web is a tool used to distribute, present, share, and search for information (2), it has also become, through web archives, a way to explore our modern historical record.

There are both opportunities and challenges in engaging with digital content and digital cultural heritage in the form of web archives. Libraries, archives, and other memory institutions can use web archives to expand collections while adding underrepresented voices and perspectives. In fact, without web archives, our understanding of topics from the mid-1990s onward would have considerable gaps. Yet with opportunities come challenges, and for many, the barrier to using or exploring web archives at scale is too high; such work remains by and large out of reach.

Over the past few years, the Archives Unleashed team has engaged with many institutions, including several Canadian research institutions. Those collaborations have underscored that librarians, archivists, and researchers alike experience a high barrier to working with web archives. As such, our project has developed data analysis tools that help scholars explore and investigate web archives.

Project Manager Samantha Fritz connected with CWAC participants and web archiving professionals to give an overview of the tools developed over the past three years and to share plans for our second phase, which involves a partnership with Archive-It. Joined by project investigators Ian Milligan and Nick Ruest, the group engaged in discussions on how the web archiving community in Canada can be part of developments and supports that improve access to web archives.

If you weren’t able to join in, the Archives Unleashed presentation recording and slides are now available.

CWAC webinars are recorded and made available through the Canadian Association of Research Libraries YouTube channel. For more information about upcoming webinars and general events, please check out CARL’s event page.

References

(1) Canadian Web Archiving Coalition (CWAC). https://www.carl-abrc.ca/advancing-research/digital-preservation/cwac/?cn-reloaded=1

(2) Stephen Robertson, “The Differences between Digital Humanities and Digital History.” Debates in Digital Humanities 2016 (Minneapolis: University of Minnesota Press, 2016). [available online]


Engaging with the Canadian Web Archiving Coalition was originally published in Archives Unleashed on Medium.

Tethering a Ubiquity Network to a Mobile Hotspot / Peter Murray

I saw it happen.

The cable-chewing device

The contractor in the neighbor’s back yard with the Ditch Witch trencher burying a cable. I was working outside at the patio table and just about to go into a Zoom meeting. Then the internet dropped out. Suddenly, and with a wrenching feeling in my gut, I remembered where the feed line was buried between the house and the cable company’s pedestal in the right-of-way between the properties. Yup, he had just cut it.

To be fair, the utility locator service did not mark my cable’s location, and he was working for a different cable provider than the one we use. (There are three providers in our neighborhood.) It did mean, though, that our broadband internet would be out until my provider could come and run another line. It took an hour of moping about the situation to figure out a solution, then another couple of hours to put it in place: an iPhone tethered to a Raspberry Pi that acted as a network bridge to my home network’s UniFi Security Gateway 3P.

Network diagram with tethered iPhone

A few years ago I was tired of dealing with spotty consumer internet routers and upgraded the house to UniFi gear from Ubiquiti. Rob Pickering, a college comrade, had written about his experience with the gear and I was impressed. It wasn’t a cheap upgrade, but it was well worth it. (Especially now with four people in the household working and schooling from home during the COVID-19 outbreak.) The UniFi Security Gateway has three network ports, and I was using two: one for the uplink to my cable internet provider (WAN) and one for the local area network (LAN) in the house. The third port can be configured as another WAN uplink or as another LAN port. And you can tell the Security Gateway to use the second WAN as a failover for the first WAN (or to load-balance across the two). So that is straightforward enough, but how do I get the Personal Hotspot on the iPhone to the second WAN port? That is where the Raspberry Pi comes in.

The Raspberry Pi is a small computer with USB, ethernet, HDMI, and audio ports. The version I had laying around is a Raspberry Pi 2—an older model, but plenty powerful enough to be the network bridge between the iPhone and the home network. The toughest part was bootstrapping the operating system packages onto the Pi with only the iPhone Personal Hotspot as the network. That is what I’m documenting here for future reference.

Bootstrapping the Raspberry Pi

The Raspberry Pi runs its own operating system called Raspbian (a Debian/Linux derivative) as well as more mainstream operating systems. I chose to use the Ubuntu Server for Raspberry Pi instead of Raspbian because I’m more familiar with Ubuntu. I tethered my MacBook Pro to the iPhone to download the Ubuntu 18.04.4 LTS image and follow the instructions for copying that disk image to the Pi’s microSD card. That allows me to boot the Pi with Ubuntu and a basic set of operating system packages.

The Challenge: Getting the required networking packages onto the Pi

It would have been really nice to plug the iPhone into the Pi with a USB-Lightning cable and have it find the tethered network. That doesn’t work, though. Ubuntu needs at least the usbmuxd package in order to see the tethered iPhone as a network device. That package isn’t a part of the disk image download. And of course I can’t plug my Pi into the home network to download it (see first paragraph of this post).

My only choice was to tether the Pi to the iPhone over WiFi with a USB network adapter. And that was a bit of Ubuntu voodoo. Fortunately, I found instructions on configuring Ubuntu to use a WPA-protected wireless network (like the one that the iPhone Personal Hotspot is providing). In brief:

sudo -i
cd /root
wpa_passphrase my_ssid my_ssid_passphrase > wpa.conf
screen -q
wpa_supplicant -Dwext -iwlan0 -c/root/wpa.conf
<control-a> c
dhclient -r
dhclient wlan0

Explanation of lines:

  1. Use sudo to get a root shell
  2. Change directory to root’s home
  3. Use the wpa_passphrase command to create a wpa.conf file. Replace my_ssid with the wireless network name provided by the iPhone (your iPhone’s name) and my_ssid_passphrase with the wireless network passphrase (see the “Wi-Fi Password” field in Settings -> Personal Hotspot).
  4. Start the screen program (quietly) so we can have multiple pseudo terminals.
  5. Run the wpa_supplicant command to connect to the iPhone wifi hotspot. We run this in the foreground so we can see the status/error messages; this program must continue running to stay connected to the wifi network.
  6. Use the screen hotkey to create a new pseudo terminal. This is control-a followed by a letter c.
  7. Use dhclient to clear out any DHCP network parameters
  8. Use dhclient to get an IP address from the iPhone over the wireless network.

Now I was at the point where I could install Ubuntu packages. (I ran ping www.google.com to verify network connectivity.) To install the usbmuxd and network bridge packages (and their prerequisites):

apt-get install usbmuxd bridge-utils

If your experience is like mine, you’ll get an error back:

couldn't get lock /var/lib/dpkg/lock-frontend

The Ubuntu Pi machine is now on the network, and the automatic process to install security updates is running. That locks the Ubuntu package registry until it finishes. That took about 30 minutes for me. (I imagine this varies based on the capacity of your tethered network and the number of security updates that need to be downloaded.) I monitored the progress of the automated process with the htop command and tried the apt-get command when it finished. If you are following along, now would be a good time to skip ahead to Configuring the UniFi Security Gateway if you haven’t already set that up.

Turning the Raspberry Pi into a Network Bridge

With all of the software packages installed, I restarted the Pi to complete the update: shutdown -r now. While it was rebooting, I pulled the USB wireless adapter out of the Pi and plugged in the iPhone’s USB cable. The Pi now saw the iPhone as eth1, but the network did not start until I went to the iPhone to say that I “Trust” the computer it is plugged into. When I did that, I ran these commands on the Ubuntu Pi:

dhclient eth1
brctl addbr iphonetether
brctl addif iphonetether eth0 eth1
brctl stp iphonetether on
ifconfig iphonetether up

Explanation of lines:

  1. Get an IP address from the iPhone over the USB interface
  2. Add a network bridge (the iphonetether is an arbitrary string; some instructions simply use br0 for the zero-ith bridge)
  3. Add the two ethernet interfaces to the network bridge
  4. Turn on the Spanning Tree Protocol (I don’t think this is actually necessary, but it does no harm)
  5. Bring up the bridge interface

The bridge is now live! Thanks to Amitkumar Pal for the hints about using the Pi as a network bridge. More details about the bridge networking software is on the Debian Wiki.

Note! I'm using a hardwired keyboard/monitor to set up the Raspberry Pi. I've heard from someone who was using SSH to run these commands that the SSH connection would break off at brctl addif iphonetether eth0 eth1

Configuring the UniFi Security Gateway

I have a UniFi Cloud Key, so I could change the configuration of the UniFi network with a browser. (You’ll need to know the IP address of the Cloud Key; hopefully you have that somewhere.) I connected to my Cloud Key at https://192.168.1.58:8443/ and clicked through the self-signed certificate warning.

First I set up a second Wide Area Network (WAN—your uplink to the internet) for the iPhone Personal Hotspot: Settings -> Internet -> WAN Networks. Select “Create a New Network”:

  • Network Name: Backup WAN
  • IPV4 Connection Type: Use DHCP
  • IPv6 Connection Types: Use DHCPv6
  • DNS Server: 1.1.1.1 and 1.0.0.1 (CloudFlare’s DNS servers)
  • Load Balancing: Failover only

The last selection is key…I wanted the gateway to use this WAN interface only as a backup to the main broadband interface. If the broadband comes back up, I want to stop using the tethered iPhone!

Second, assign the Backup WAN to the LAN2/WAN2 port on the Security Gateway (Devices -> Gateway -> Ports -> Configure interfaces):

  • Port WAN2/LAN2 Network: WAN2
  • Speed/Duplex: Autonegotiate

Apply the changes to provision the Security Gateway. After about 45 seconds, the Security Gateway failed over from “WAN iface eth0” (my broadband connection) to “WAN iface eth2” (my tethered iPhone through the Pi bridge). These showed up as alerts in the UniFi interface.

Performance and Results

So I’m pretty happy with this setup. The family has been running simultaneous Zoom calls and web browsing on the home network, and the performance has been mostly normal. Web pages do take a little longer to load, but whatever Zoom is using to dynamically adjust its bandwidth usage is doing quite well. This is chewing through the mobile data quota pretty fast, so it isn’t something I want to do every day. Knowing that this is possible, though, is a big relief. As a bonus, the iPhone is staying charged via the 1 amp power coming through the Pi.

Creating Value with Open Access Books / Eric Hellman

Can a book be more valuable if it's free? How valuable? To whom? How do we unlock this value?

a lock with ebooks
I've been wrestling with these questions for over ten years now.  And for each of these questions, the answer is... it depends. A truism of the bookselling business is that "Every book is different" and the same is true of the book freeing "business".

Recently there's been increased interest in academic communities around Open Access book publishing and in academic book relicensing (adding an Open Access License to an already published book). Both endeavors have been struggling with the central question of how to value an open access book. The uncertainty in OA book valuation has led to many rookie mistakes among OA stakeholders. For example, when we first started Unglue.it, we assumed that reader interest would accelerate the relicensing process for older books whose sales had declined. But the opposite turned out to be true. Evidence of reader interest let rights holders know that these backlist titles were much more valuable than sales would indicate, thus precluding any notion of making them Open Access. Pro tip: if you want to pay a publisher to make a book free, don't publish your list of incredibly valuable books!

Instead of a strictly transactional approach, it's more useful to consider the myriad ways that academic books create value. Each of these value mechanisms offer buttons that we can push to promote open access, and point to new structures for markets where participants join together to create mutual value.

First, consider the book's reader. The value created is the reader's increased knowledge, understanding and sometimes, sheer enjoyment. The fact of open access does not itself create the value, but removes some of the barriers which might suppress this value. It's almost impossible to quantify the understanding and enjoyment from books; but "hours spent reading" might be a useful proxy for it.

Next consider a book's creator. While a small number of creators derive an income stream from their books, most academic authors benefit primarily from the development and dissemination of their ideas. In many fields of inquiry, publishing a book is the academic's path to tenure. Educators (and their students!) similarly benefit. In principle, you might assess a textbook's value by measuring student performance.

The value of a book to a publisher can be more than just direct sales revenue. A widely distributed book can be a marketing tool for a publisher's entire business. In the world of Open Access, we can see new revenue models emerging - publication charges, events, sponsorships, even grants and memberships. 

The value of a book to society as a whole can be enormous. In areas of research, a book might lead to technological advances, healthier living, or a more equitable society. Or a book might create outrage, civil strife, and misinformation. That's another issue entirely!

Books can be valuable to secondary distributors as well. Both used book resellers and libraries add value to physical books by increasing their usage. This is much harder to accomplish for paywalled ebooks! Since academic libraries are often considered as potential funding sources for Open Access publishing it's worth noting that the value of an open access ebook to a library is entirely indirect. When a library acts as an Open Access funding source, it's acting as a proxy for the community it serves.

This brings us to communities. The vast majority of books create value for specific communities, not societies as a whole. I believe that community-based funding is the most sustainable path for support of Open Access Books. Community supported OA article publishing has already had plenty of support. Communities organized by discipline have been particularly successful: consider the success that ArXiv has had in promoting Open Access in physics, both at the preprint level and for journals in high-energy physics. A similar story can be told for biomedicine, Pubmed and Pubmed Central. A different sort of community success story has been SciELO, which has used Open Access to address challenges faced by scholars in Latin America.

So far, however, sustainable Open Access has proven to be challenging for scholarly ebooks. My next few posts will discuss the challenges and ways forward for support of ebook relicensing and for OA ebook creation:

Open Access for Backlist Books, Part I: The Slush Pile / Eric Hellman

"Kale emerging from a slush pile"
(CC BY, Eric Hellman)
Book publishers hate their "slush pile": books submitted for publication unsolicited, rarely with literary merit and unlikely to make money for the publisher if accepted. In contrast, book publishers love their backlist; a strong backlist is what allows a book publisher to remain consistently profitable even when most of their newly published books fail to turn a profit. A publisher's backlist typically consists of a large number of "slushy" books that generate negligible income and a few steady "evergreen" earners. Publishers don't talk much about the backlist slush pile, maybe because it reminds them of their inability to predict a book's commercial success.

With the advent of digital books has come new possibilities for generating value from the backlist slush pile. Digital books can be kept "in print" at essentially no cost (printed books need warehouse space) which has allowed publishers to avoid rights reversion in many cases. Some types of books can be bundled in ebook aggregations that can be offered on a subscription basis. This is reminiscent of the way investment bankers created valuable securities by packaging junk bonds with opaque derivatives.

Open access is a more broadly beneficial way to generate value from the backlist slush pile. There is a reason that libraries keep large numbers of books on their shelves even when they don't circulate for years. The myriad ways that books can create value don't have to be tied to book sales, as I wrote in my previous post.

Those of us who want to promote Open Access for backlist ebooks have a number of strategies at our disposal. The most basic strategy is to promote the visibility of these books. Libraries can add listings for these ebooks in their catalogs. Aggregators can make these books easier to find.

Switching backlist books to Open Access licenses can be expensive and difficult. While the cost of digitization has dropped dramatically over the past decade, quality control is still a significant conversion expense. Licensing-related expenses are sometimes large. Unlike journals and journal articles, academic books are typically covered by publishing agreements that give authors royalties on sales and licensing, and give authors control over derivative works such as translations. No publisher would consent to OA relicensing without the consent and support of the author. For older books, a publisher may not even have electronic rights (in the US, the Tasini decision established that electronic rights are separate from print rights), or may need to have a lawyer interpret the language of the original publishing contract. 

While most scholarly publishers obtain worldwide rights to the books they publish, rights for trade books are very often divided among markets. Open-access licenses such as the Creative Commons licenses are not limited to markets, so a license conversion would require the participation of every rights holder worldwide. 

The CC BY license can be problematic for books containing illustrations or figures used by permission from third-party rights holders. "All Rights Reserved" illustrations are often included in Open Access books, but they are carved out of the license by separate rights statements, and to be safe, publishers use the CC BY-ND or CC BY-NC-ND license for the complete book, as the permissions do not cover derivative works. Since the CC BY license allows derivative works, it cannot be used in cases where translation rights have been sold (without also buying out the translation rights). Likewise, a publisher cannot use a CC BY license for a translated work without also having rights to the original work.

The bottom line is that converting a backlist book to OA often requires economic motivation quite apart from any lost sales. Luckily, there's evidence that opening access can lead to increased sales: Nagaraj and Reimers found that digitization and exposure through Google Books increased sales of print editions by 35% for books in the public domain. In addition, a publisher's commercial position and prestige can be enhanced by the attribution requirement in Creative Commons licenses.

Additional motivation for OA conversion of the backlist slush pile has been supplied by programs such as those run by Knowledge Unlatched, where libraries contribute to a fund used for "unlatching" backlist books. (Knowledge Unlatched has programs for frontlist books as well.) While such programs can in principle be applied to the "evergreen" backlist, the incentives currently in place result in the unlatching of books from the "slush pile" backlist. While value for society is being gained this way, the willingness of publishers to "unlatch" hundreds of these books raises the question of how much library funding for Open Access should be allocated to the discount bin, as opposed to the backlist books most used in libraries. That's the topic of my next post!

Notes

This is the second in a series of posts about creating value with Open Access books. The others are:

Open Access for Backlist Books, Part II: The All-Stars / Eric Hellman

Libraries know that a big fraction of their book collections never circulates, even once. The flip side of this fact is that a small fraction of a library's collection accounts for most of the circulation. This is often referred to as Zipf's law; as a physicist I prefer to think of it as another manifestation of log-normal statistics resulting from a preferential attachment mechanism for reading. (English translation: "word-of-mouth".)
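As a quick illustration of that skew, here is a small, hypothetical Python simulation (mine, not from the post) of a "word-of-mouth" checkout process; the number of titles, the mixing probability, and the checkout counts are all invented, but the run reliably concentrates most circulation in a small fraction of titles:

    import random

    def simulate_circulation(n_titles=2000, n_checkouts=50_000, seed=42):
        """Toy preferential-attachment ('word-of-mouth') checkout model:
        each checkout either copies an earlier checkout or picks a title
        uniformly at random."""
        rng = random.Random(seed)
        counts = [0] * n_titles
        history = []                         # every checkout made so far
        for _ in range(n_checkouts):
            if history and rng.random() < 0.8:
                title = rng.choice(history)      # word-of-mouth: copy a past reader
            else:
                title = rng.randrange(n_titles)  # discovery: a random title
            counts[title] += 1
            history.append(title)
        return counts

    counts = sorted(simulate_circulation(), reverse=True)
    total = sum(counts)
    top_decile = sum(counts[: len(counts) // 10])
    print(f"Top 10% of titles: {top_decile / total:.0%} of all checkouts")
    print(f"Titles never checked out: {sum(c == 0 for c in counts)}")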

In my post about the value of Open Access for books, I suggested that usage statistics (circulation, downloads, etc.) are a useful proxy for the value that books generate for their readers. The logical conclusion is that the largest amount of value that can be generated from opening the backlist comes from the books that are most used, the "all-stars" of the library, not the discount rack or the discards. If libraries are to provide funding for Open Access backlist books, shouldn't they focus their resources on the books that create the most value?

The question, of course, is how the library community would ever convince publishers, who have monopolies on these books as a consequence of international copyright laws, to convert these books to Open Access. Although some sort of statutory licensing or fair-use carve-out could eventually do the trick, I believe that Open Access for a significant number of "backlist All-Stars" can be achieved today by pushing ALL the buttons available to supporters of Open Access. Here's where Open Access can learn from the game (and business) of baseball.

"Baseball", Henry Sandham, L. Prang & Co. (1861).
  from Digital Commonwealth


Baseball's best player, Mike Trout, should earn $33.25 million this year, a bit over $205,000 per regular season game. If he's chosen for the All-Star game, he won't get even a penny extra to play unless he's named MVP, in which case he earns a $50,000 bonus. So why would he bother to play for free? It turns out there are lots of reasons. The most important have everything to do with the recognition and honor of being named an All-Star, and with having respect for his fans. But being an All-Star is not without financial benefits, considering endorsement contracts and earning potential outside of baseball. Playing in the All-Star game is an all-around no-brainer for Mike Trout.

Open Access should be an All-Star game for backlist books. We need to create community-based award programs that recognize and reward backlist conversions to OA. If the world's libraries want to spend $50,000 on backlist physics books, for example, isn't it better to spend it on the Mike Trout of physics books than on a team full of discount-rack replacement-level players?

Competent publishers would line up in droves for major-league all-star backlist OA programs. They know that publicity will drive demand for their print versions (especially if NC licenses are used). They know that awards will boost their prestige, and if they're trying to build Open Access publication programs, prestige and quality are a publisher's most important selling points.

The Newbery Medal

Over a hundred backlist books have been converted to open access already this year. Can you name one of them? Probably not, because the publicity value of existing OA conversion programs is negligible. To relicense an All-Star book, you need an all-star publicity program. You've heard of the Newbery Medal, right? You've seen the Newbery Medal sticker on children's books, maybe even special sections for them in bookstores. That prize, awarded by the American Library Association every year to honor the most distinguished contribution to American literature for children, is a powerful driver of sales. The winners get feted with a gala banquet and party (at least they did in the before-times). That's the sort of publicity we need to create for open access books.

If you doubt that "All-Star Open Access" could work, don't discount the fact that it's also the right thing to do. Authors of All-Star backlist books want their books to be used, cherished and remembered. Libraries want books that measurably benefit the communities they serve. Foundations and governmental agencies want to make a difference. Even publishers who look only at their bottom lines can structure a rights conversion as a charitable donation to reduce their tax bills.

And did I mention that there could be Gala Award Celebrations? We need more celebrations, don't you think?

If your community is interested in creating an Open Access program for backlist books, don't hesitate to contact me at the Free Ebook Foundation!

Notes

I've written about the statistics of book usage here, here and here.

This is the third in a series of posts about creating value with Open Access books. The first two are:

Chromebook Linux Update / David Rosenthal

My three Acer C720 Chromebooks running Linux are still giving yeoman service, although for obvious reasons I'm not travelling these days. But it is time for an update to 2017's Travels with a Chromebook. Below the fold, an account of some adventures in sysadmin.

Battery Replacement

The battery in C720 #1, which is over six years old, would no longer hold a charge. I purchased a Dentsing AP13J3K replacement battery from Amazon. I opened the C720, removed the old battery, inserted the new one, closed up the case, and all was well. It was an impressively easy fix.

Sleeping & Waking

Sometime around last October Linux Mint 19 switched from kernels in the 5.0 series to kernels in the 5.4 series. Mint 20 uses 5.4 series kernels. The 5.0 kernels on the C720 went to sleep properly when the lid closed, and woke properly when the lid opened. The 5.4 kernels appeared to go to sleep correctly, but when the lid opened did a cold boot. Because this problem happens immediately on wake, and because sleep appears to work correctly, there is no useful information in the logs; this appears to be a very hard problem to diagnose.

Here is my work-around to use the 5.0.0-65 kernel (the last 5.0 kernel available via updates) on a vanilla Linux Mint installation:
  • Install Linux Mint 20.1 MATE edition.
  • Add repositories using Administration/Software Sources:
    • deb http://archive.ubuntu.com/ubuntu bionic main restricted universe multiverse
    • deb http://archive.ubuntu.com/ubuntu bionic-updates main restricted universe multiverse
    • deb http://security.ubuntu.com/ubuntu/ bionic-security main restricted universe multiverse
  • Install kernel 5.0.0-65-generic:
    sudo apt-get install linux-headers-5.0.0-65 linux-headers-5.0.0-65-generic linux-image-5.0.0-65-generic linux-modules-5.0.0-65-generic linux-modules-extra-5.0.0-65-generic
  • Edit /etc/default/grub to show the menu of kernels:
    GRUB_TIMEOUT_STYLE=menu
    GRUB_TIMEOUT=15
  • Edit /etc/default/grub so that your most recent choice of kernel becomes the default:
    GRUB_SAVEDEFAULT=true
    GRUB_DEFAULT=saved
  • Run update-grub
After you choose the 5.0.0-65 kernel the first time, it should boot by default, and sleep and wake should work properly. The problem with ehci-pci on wakeup has gone away, so there is no need to install the userland files from GalliumOS.

Disk & Home Directory Encryption

Note that you should ideally install Linux Mint 20.1 with full-disk encryption. The release notes explain:
The move to systemd caused a regression in ecryptfs which is responsible for mounting/unmounting encrypted home directories when you login and logout. Because of this issue, please be aware that in Mint 20 and newer releases, your encrypted home directory is no longer unmounted on logout: https://bugs.launchpad.net/ubuntu/+source/gnome-session/+bug/1734541.
Mint 19 with full-disk encryption had this problem, but I haven't been able to reproduce it with Mint 20 and the 5.0.0-32 and 5.0.0-65 kernels, so it is usable. Home directory encryption works, but will leave its contents decrypted after you log out, rather spoiling the point.

Touchpad

As I described here, the touchpad isn't one of the C720's best features, and it is necessary to disable it while typing, or while using a mouse such as the excellent Tracpoint. I use Ataraeo's touchpad-indicator, but this doesn't work on Mint 20.1 out of the box, which uses X.org's libinput driver. The release notes discuss using the synaptics driver instead. I installed it and, after creating the directory ~/.config/autostart, the touchpad-indicator starts on login and works fine.

Update: I updated the post to reflect my mistake about full-disk encryption and to recommend the 5.0.0-65 kernel.

Call for Volunteers: NDSA Task Force on Membership Engagement and Recruitment / Digital Library Federation

The NDSA Leadership group is spinning up a new group around NDSA Membership and invites you to consider volunteering for the Task Force on Membership Engagement and Recruitment. The focus of the Task Force will be to examine membership engagement, benefits/drawbacks of the current model type, and recruitment efforts of the NDSA. Through research and surveying members of the consortium, a primary goal of the Task Force is to provide recommendations that will move the NDSA towards a culture that is more inclusive, collaborative, intentional, and that has well-defined metrics around recruitment and engagement. If interested, please complete the form, which can be found here, by March 5, 2021.

 ~ Jes Neal

The post Call for Volunteers: NDSA Task Force on Membership Engagement and Recruitment appeared first on DLF.

Re-live the excitement of Generous and Open GLAM 2021 / Hugh Rundle

I wrote earlier about the Generous and Open GLAM miniconf I ran at linux.conf.au 2021 with Bonnie Wildie. All our sessions were recorded, as is usually the case at LCA, but they're a little hard to find amongst the rest of the videos posted to YouTube, so I've collected them here in case you missed out the first time or want to watch one of the talks again.

I know I said I wouldn't post my video, but it seems a bit weird to leave it out so I guess I changed my mind.

You can find the Program and abstracts on the conference website.


Weeknote 7 (2021) / Mita Williams

Today the library is closed as is my place of work’s tradition on the last day of Reading Week.

But as I have three events (helping in a workshop, giving a presentation, participating in a focus group) in my calendar, I’m just going to work the day and bank the time for later.

§1

Barbara Fister in The Atlantic!

We are experiencing a moment that is exposing a schism between two groups: those who have faith that there is a way to arrive at truth using epistemological practices that originated during the Enlightenment, and those who believe that events and experiences are portents to be interpreted in ways that align with their personal values. As the sociologist and media scholar Francesca Tripodi has demonstrated, many conservatives read the news using techniques learned through Bible study, shunning secular interpretations of events as biased and inconsistent with their exegesis of primary texts such as presidential speeches and the Constitution. The faithful can even acquire anthologies of Donald Trump’s infamous tweets to aid in their study of coded messages.

While people using these literacy practices are not unaware of mainstream media narratives, they distrust them in favor of their own research, which is tied to personal experience and a high level of skepticism toward secular institutions of knowledge. This opens up opportunities for conservative and extremist political actors to exploit the strong ties between the Republican Party and white evangelical Christians. The conspiracy theory known as QAnon is a perfect—and worrisome—example of how this works. After all, QAnon is something of a syncretic religion. But its influence doesn’t stop with religious communities. While at its core it’s a 21st-century reboot of a medieval anti-Semitic trope (blood libel), it has shed some of its Christian vestments to gain significant traction among non-evangelical audiences.

§2

New to me: Andromeda Yelton’s course reading list dedicated to AI in the Library. Hat-tip to Beck Tench.

§3

I recently suggested that MPOW's next Journal Club should deviate from looking at the library literature and reflect on personal knowledge management. I'm not sure how much take-up there will be on the topic, but I love reading about how other people deliberately set up systems to help them learn.

Case in point: Cecily Walker’s Thoughts Like A Runaway Train: Notes on Information Management with Zettelkasten

Fun fact: I first learned of Zettelkasten from Beck Tench.

Teaching OOP in the Time of COVID / Ed Summers

I’ve been teaching a section of the Introduction to Object Oriented Programming at the UMD College for Information Studies this semester. It’s difficult for me, and for the students, because we are remote due to the Coronavirus pandemic. The class is largely asynchronous, but every week I’ve been holding two synchronous live coding sessions in Zoom to discuss the material and the exercises. These have been fun because the students are sharp, and haven’t been shy about sharing their screen and their VSCode session to work on the details. But students need quite a bit of self-discipline to move through the material, and probably only about 1/4 of the students take advantage of these live sessions.

I’m quite lucky because I’m working with a set of lectures, slides and exercises that have been developed over the past couple of years by other instructors: Josh Westgard, Aric Bills and Gabriel Cruz. You can see some of the public facing materials here. Having this backdrop of content combined with Severance’s excellent (and free) Python for Everybody has allowed me to focus more on my live sessions, on responsive grading, and to also spend some time crafting additional exercises that are geared to this particular moment.

This class is in the College for Information Studies and not in the Computer Science Department, so it’s important for the students to not only learn how to use a programming language, but to understand programming as a social activity, with real political and material effects in the world. Being able to read, understand, critique and talk about code and its documentation is just as important as being able to write it. In practice, out in the “real world” of open source software I think these aspects are arguably more important.

One way I’ve been trying to do this in the first few weeks of class is to craft a sequence of exercises that form a narrative around Coronavirus testing and data collection to help remind the students of the basics of programming: variables, expressions, conditionals, loops, functions, files.

In the first exercise we imagined a very simple data entry program that needed to record results of real-time polymerase chain reaction (RT-PCR) tests. I gave them the program and described how it was supposed to work, and asked them to describe (in English) any problems that they noticed and to submit a version of the program with the problems fixed. I also asked them to reflect on a request from their boss about adding the collection of race, gender and income information. The goal here was to test their ability to read the program and write English about it while also demonstrating a facility for modifying the program. Most importantly I wanted them to think about how inputs such as race or gender have questions about categories and standards behind them, and weren't simply a matter of syntax.

The second exercise builds on the first by asking them to adjust the revised program to save the data in a very particular format. (In the first exercise the data was simply stored in memory and printed to the screen in aggregate at the end.) The scenario here is that the Department of Health and Human Services has assumed responsibility for COVID test data collection from the Centers for Disease Control. Of course this really happened, but the data format I chose was completely made up (maybe we will be working with some real data at the end of the semester if I continue with this theme). The goal in this exercise was to demonstrate their ability to read another program and fit a function into it. The students were given a working program that had a save_results() function stubbed out. In addition to submitting their revised code I asked them to reflect on some limitations of the data format chosen, and the data processing pipeline that it was a part of.
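The post doesn't reproduce the exercise code, so here is a purely hypothetical sketch of the shape such a program might take: a list of in-memory results and a save_results() function filled in to emit a made-up pipe-delimited format. The field names, file name, and delimiter are all invented for illustration and are not the actual course materials.

    import csv

    def save_results(results, path="covid_results.txt"):
        """Write each test result as one pipe-delimited line (invented format)."""
        with open(path, "w", newline="") as f:
            writer = csv.writer(f, delimiter="|")
            writer.writerow(["patient_id", "collected", "ct_value", "result"])
            for r in results:
                writer.writerow([r["patient_id"], r["collected"],
                                 r["ct_value"], r["result"]])

    # In-memory records of the kind the first exercise printed in aggregate.
    results = [
        {"patient_id": "A001", "collected": "2021-02-18", "ct_value": 24.5,
         "result": "positive"},
        {"patient_id": "A002", "collected": "2021-02-18", "ct_value": 41.0,
         "result": "negative"},
    ]
    save_results(results)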

And in the third exercise I asked them to imagine that this lab they were working in had a scientist who discovered a problem with some of the thresholds for acceptable testing, which required an update to the program from Exercise 2, and also a test suite to make sure the program was behaving properly. In addition to writing the tests I asked them to reflect on what functionality was not being tested that probably should be.
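Again hypothetically, and not the actual assignment, the third exercise might pair a small threshold function with a few pytest-style tests like these; the cutoff value, function name, and error handling are invented for illustration.

    import pytest

    def interpret_ct(ct_value, cutoff=38.0):
        """Classify an RT-PCR cycle-threshold (Ct) reading (made-up cutoff)."""
        if ct_value <= 0:
            raise ValueError("Ct value must be positive")
        return "positive" if ct_value < cutoff else "negative"

    def test_clearly_positive():
        assert interpret_ct(22.0) == "positive"

    def test_reading_at_the_cutoff_is_negative():
        assert interpret_ct(38.0) == "negative"

    def test_rejects_impossible_readings():
        with pytest.raises(ValueError):
            interpret_ct(-1.0)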

This alternation between writing code and writing prose is something I started doing as part of a Digital Curation class. I don't know if this dialogical, or perhaps dialectical, approach is something others have tried. I should probably do some research to see. In my last class I alternated week by week: one week reading and writing code, the next week reading and writing prose. But this semester I've stayed focused on code, requiring the reading and writing of code as well as prose about code in the same week. I hope to write more about how this goes, and about these exercises, as I go. I'm not sure if I will continue with the Coronavirus data examples. One thing I'm sensitive to is that my students themselves are experiencing the effects of the Coronavirus, and may want to escape it just for a bit in their school work. Just writing in the open about it here, in addition to the weekly meetings I've had with Aric, Josh and Gabriel, has been very useful.

Speaking of those meetings: I learned today from Aric that tomorrow (February 20th, 2021) is the 30th anniversary of Python's first public release! You can see this reflected in this timeline. This v0.9.1 release was the first release Guido van Rossum made outside of CWI, and was made on the Usenet newsgroup alt.sources, where it is split out into chunks that need to be reassembled. Back in 2009 Andrew Dalke located and repackaged these sources from Google Groups, which acquired alt.sources as part of DejaNews in 2001. But if you look at the time stamp on the first part of the release you can see that it was made February 19, 1991 (not February 20). So I'm not sure if the birthday is actually today.

I sent this little note out to my students with this wonderful two-part oral history that the Computer History Museum did with Guido van Rossum a couple years ago. It turns out both of his parents were atheists and pacifists. His dad went to jail because he refused to be conscripted into the military. That and many more details of his background and thoughts about the evolution of Python can be found in these delightful interviews:

Happy Birthday Python!

Enter the Open Data Day 2021 photo and video competition / Open Knowledge Foundation

Open Data Day 2020: images from round the world

Every year, hundreds of Open Data Day events take place to celebrate open data in communities across the world and it is fantastic to see all the photos and videos shared.

So this year, we are giving away prizes for the best Open Data Day 2021 photographs and videos as well as making sure that all the winners are published under an open license for anyone to use to promote Open Data Day in the future.

Thanks to the generous support of our funding partners, we are able to offer:

  • 10 prizes of $50 USD each for the best Open Data Day 2021 photographs
  • 5 prizes of $100 USD each for the best Open Data Day 2021 videos (max length = 60 seconds)

Open Knowledge Foundation wants more openly licensed images and video of Open Data Day in order to better communicate and share more about all the activities which happen annually to celebrate open data. 

All entries must be submitted by 12pm GMT on Saturday 13th March 2021 via this form and winners will be announced shortly afterwards.

If you are organising an event, please do tell your participants about the prizes. Anyone attending any Open Data Day event or online session anywhere in the world can submit photographs and/or videos to this competition. 

GPT-3 Jam / Ed Summers

One of the joys of pandemic academic life has been a true feast of online events to attend, on a wide variety of topics, some of which are delightfully narrow and esoteric. Case in point was today’s Reflecting on Power and AI: The Case of GPT-3 which lived up to its title. I’ll try to keep an eye out for when the video posts, and update here.

The workshop was largely organized around an exploration of whether GPT-3, the largest known machine learning language model, changes anything for media studies theory, or if it amounts to just more of the same. So the discussion wasn’t focused so much on what games could be played with GPT-3, but rather if GPT-3 changes the rules of the game for media theory, at all. I’m not sure there was a conclusive answer at the end, but it sounded like the consensus was that current theorization around media is adequate for understanding GPT-3, but it matters greatly what theory or theories are deployed. The online discussion after the presentations indicated that attendees didn’t see this as merely a theoretical issue, but one that has direct social and political impacts on our lives.

James Steinhoff looked at GPT-3 using a Marxist media theory perspective, in which he told the story of GPT-3 as a project of OpenAI and as a project of capital. OpenAI started with much fanfare in 2015 as a non-profit initiative where the technology, algorithms and models developed would be kept openly licensed and freely available so that the world could understand the benefits and risks of AI technology. Steinhoff described how in 2019 the project's needs for capital (compute power and staff) transitioned it from a non-profit into a capped-profit company, which is now owned, or at least controlled, by Microsoft.

The code for generating the model, as well as the model itself, is gated behind a token-driven Web API run by Microsoft. You can get on a waiting list to use it, but apparently a lot of people have been waiting a while, so … being a Microsoft employee probably helps. I grabbed a screenshot of the pricing page that Steinhoff shared during his presentation:

I'd be interested to hear more about how these tokens operate. Are they per-request, or are they measured according to something else? I googled around a bit during the presentation to try to find some documentation for the Web API, and came up empty handed. I did find Shreya Shankar's gpt3-sandbox project for interacting with the API in your browser (mostly for iteratively crafting text input in order to generate desired output). It depends on the openai Python package created by OpenAI themselves. The docs for openai then point at a page on the openai.com website which is behind a login. You can create an account, but you need to be pre-approved (made it through the waitlist) to be able to see the docs. There's probably some sense that can be made from examining the Python client, though.
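For what it's worth, the public openai client that gpt3-sandbox wraps exposes a completion call roughly along these lines. This is a sketch pieced together from the client code rather than the gated documentation, and the engine name, parameter values, and placeholder key are assumptions, not verified against OpenAI's docs.

    import openai  # the same client package gpt3-sandbox depends on

    openai.api_key = "sk-..."  # keys are only issued to wait-listed accounts

    # A single completion request; the engine name and parameters here are
    # illustrative guesses, not taken from the gated documentation.
    response = openai.Completion.create(
        engine="davinci",
        prompt="Once upon a time, a language model",
        max_tokens=64,
        temperature=0.7,
    )
    print(response["choices"][0]["text"])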

All of the presentations in some form or another touched on the 175 billion parameters that were used to generate the model. But the API to the model doesn't have that many parameters: it allows you to enter text and get text back. Still, the API surface that the GPT-3 service provides could be interesting to examine a bit more closely, especially to track how it changes over time. In terms of how this model mediates knowledge and understanding, it'll be important to watch.

Steinhoff's message seemed to be that, despite the best of intentions, GPT-3 functions in the service of very large corporations with very particular interests. One dimension that he didn't explore, perhaps because of time, is how the GPT-3 model itself is fed massive amounts of content from the web, or the commons. Indeed, 60% of the data came from the CommonCrawl project.

GPT-3 is an example of an extraction project that has been underway at large Internet companies for some time. I think the critique of these corporations has often been confined to seeing them in terms of surveillance capitalism rather than in terms of raw resource extraction, or the primitive accumulation of capital. The behavioral indicators of who clicked on what are certainly valuable, but GPT-3 and sister projects like CommonCrawl show that just the accumulation of data with modest amounts of metadata can be extremely valuable. This discussion really hit home for me since I've been working with Jess Ogden and Shawn Walker using CommonCrawl as a dataset for talking about the use of web archives, while also reflecting on the use of web archives as data. CommonCrawl provides a unique glimpse into some of the data operations that are at work in the accumulation of web archives. I worry that the window is closing and that CommonCrawl itself will be absorbed into Microsoft.

Following Steinhoff, Olya Kudina and Bas de Boer jointly presented some compelling thoughts about how it's important to understand GPT-3 in terms of sociotechnical theory, using ideas drawn from Foucault and Arendt. I actually want to watch their presentation again because it followed a very specific path that I can't do justice to here. But their main argument seemed to be that GPT-3 is an expression of power, and that where there is power there is always resistance to power. GPT-3 can and will be subverted and used to achieve particular political ends of our own choosing.

Because of my own dissertation research I’m partial to Foucault’s idea of governmentality, especially as it relates to ideas of legibility (Scott, 1998)–the who, what and why of legibility projects, aka archives. GPT-3 presents some interesting challenges in terms of legibility because the model is so complex, the results it generates defy deductive logic and auditing. In some ways GPT-3 obscures more than it makes a population legible, as Foucault moved from disciplinary analysis of the subject, to the ways in which populations are described and governed through the practices of pastoral power, of open datasets. Again the significance of CommonCrawl as an archival project, as a web legibility project, jumps to the fore. I’m not as up on Arendt as I should be, so one outcome of their presentation is that I’m going to read her The Human Condition which they had in a slide. I’m long overdue.

References

Scott, J. C. (1998). Seeing like a state: How certain schemes to improve the human condition have failed. Yale University Press.

Blast Radius / David Rosenthal

Last December Simon Sharwood reported on an "Infrastructure Keynote" by Amazon's Peter DeSantis in AWS is fed up with tech that wasn’t built for clouds because it has a big 'blast radius' when things go awry:
Among the nuggets he revealed was that AWS has designed its own uninterruptible power supplies (UPS) and that there’s now one in each of its racks. AWS decided on that approach because the UPS systems it needed were so big they required a dedicated room to handle the sheer quantity of lead-acid batteries required to keep its kit alive. The need to maintain that facility created more risk and made for a larger “blast radius” - the extent of an incident's impact - in the event of failure or disaster.

AWS is all about small blast radii, DeSantis explained, and in the past the company therefore wrote its own UPS firmware for third-party products.

“Software you don’t own in your infrastructure is a risk,” DeSantis said, outlining a scenario in which notifying a vendor of a firmware problem in a device commences a process of attempting to replicate the issue, followed by developing a fix and then deployment.

“It can take a year to fix an issue,” he said. And that’s many months too slow for AWS given a bug can mean downtime for customers.
This is a remarkable argument for infrastructure based on open source software, but that isn't what this post is about. Below the fold is a meditation on the concept of "blast radius", the architectural dilemma it poses, and its relevance to recent outages and compromises.

In 2013 Stanford discovered that an Advanced Persistent Threat (APT) actor had breached their network and compromised the Active Directory server. The assessment was that, as a result, nothing in the network could be trusted. In great secrecy Stanford built a complete replacement network, switches, routers, servers and all. The secrecy was needed because the attackers were watching; for example nothing about the new network could be mentioned in e-mail. Eventually there was a "flag day" when everyone was cut over to the new network with new passwords. Then, I believe, all the old network hardware was fed into the trash compactor. The authentication technology for a network, in Stanford's case Active Directory, has a huge "blast radius", which is why it is a favorite target for attackers.

Two recent outages demonstrate the "blast radius" of authentication services:
  • Last September, many of Microsoft's cloud services failed. Thomas Claburn's Microsoft? More like: My software goes off reported:
    Microsoft's online authentication systems are inaccessible for at least some customers today, locking those subscribers out of tons of Redmond-hosted services if they are not already logged in.
    ...
    Beyond Microsoft's public and government cloud wobbles ... the authentication system outage has hit its other online services, including Outlook, Office, Teams, and Microsoft Authenticator. If you're not already logged in, it appears, you may be unable to get in and use the cloud-based applications as a result of the ongoing downtime.
    The next day, Tim Anderson's With so many cloud services dependent on it, Azure Active Directory has become a single point of failure for Microsoft had more details of the failed attempts to recover by backing out a recent change, and the lingering after-effects:
    The core service affected was Azure Active Directory, which controls login to everything from Outlook email to Teams to the Azure portal, used for managing other cloud services. The five-hour impact was also felt in productivity-stopping annoyances like some installations of Microsoft Office and Visual Studio, even on the desktop, declaring that they could not check their licensing and therefore would not run. ... If the problem is authentication, even resilient services with failover to other Azure regions may become inaccessible and therefore useless.

    The company has yet to provide full details, but a status report today said that "a recent configuration change impacted a backend storage layer, which caused latency to authentication requests".
    Part of the difficulty Microsoft had in responding was likely that the staff working on the authentication problem themselves needed to authenticate to obtain the access required to make the necessary changes.
  • Then, in December, many of Google's cloud services failed. Again, Anderson reported the details in Not just Microsoft: Auth turns out to be a point of failure for Google's cloud, too:
    Google has posted more details about its 50 minute outage yesterday, though promising a "full incident report" to follow. It was authentication that broke, reminiscent of Microsoft's September cloud outage caused by an Azure Active Directory failure.

    In an update to its Cloud Status dashboard, Google said that: "The root cause was an issue in our automated quota management system which reduced capacity for Google's central identity management system, causing it to return errors globally. As a result, we couldn't verify that user requests were authenticated and served errors to our users."

    Not mentioned is the fact that the same dashboard showed all green during at least the first part of the outage. Perhaps it did not attempt to authenticate against the services, which were otherwise running OK. As so often, Twitter proved more reliable for status information.

    Services affected included Cloud Console, Cloud Storage, BigQuery, Google Kubernetes Engine, Gmail, Calendar, Meet, Docs and Drive.

    "Many of our internal users and tools experienced similar errors, which added delays to our outage external communication," the search and advertising giant confessed.
    The last point is important. Even services that didn't require authentication of external requests, such as the font service, failed because they in turn used internal services to which they had to authenticate.
To be fair, I should also mention that around the same time Amazon's US-EAST-1 region also suffered an outage. The cause had a big "blast radius" but wasn't an authentication failure, so you should follow the link for the fascinating explanation. Returning to authentication, Anderson wrote:
Authentication is a tricky problem for the big cloud platforms. It is critically important and cannot be fudged; security trumps resilience.
The keys to resilient services are replication, independence and (ideally) voting among the replicas so that there are no single points of failure. But the whole point of an authentication service is to centralize knowledge in a single authoritative source. Multiple independent databases of authentication information present the risk of inconsistency and delayed updates.
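As a toy illustration of that resilience pattern (not anything the cloud providers actually run), here is a minimal Python sketch of a majority-vote read across replicas; the replica contents and key names are invented:

    from collections import Counter

    def quorum_read(replicas, key):
        """Return the value a majority of replicas agree on, or None."""
        votes = Counter(r.get(key) for r in replicas)
        value, count = votes.most_common(1)[0]
        return value if count > len(replicas) // 2 else None

    # Three stand-in copies of a credential record; one lags behind an update.
    replicas = [
        {"alice": "hash-v2"},
        {"alice": "hash-v2"},
        {"alice": "hash-v1"},  # stale replica
    ]
    print(quorum_read(replicas, "alice"))  # prints 'hash-v2'

The stale replica is harmlessly outvoted here, but for a password change or a credential revocation the lagging copy is itself the security failure, which is exactly the dilemma described above.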

Another service that must be centralized to be effective is network monitoring. The whole point of network monitoring is to provide a single, holistic view of network activity. Thus it is that the recent SolarWinds and Centreon compromises targeted network monitoring systems.

This all suggests that critical systems such as release build environments that can perform their function without being on-line should be air-gapped, or at least be disconnected from the general authentication system. Of course, most systems cannot perform their functions without being on-line and authenticated. For these systems, authentication and network monitoring systems in particular, networks need contingency plans for when, not if, they are compromised.

Security Releases: Evergreen 3.4.6, 3.5.3, and 3.6.2 / Evergreen ILS

On behalf of the Evergreen contributors, we are pleased to announce the release of Evergreen 3.4.6, 3.5.3, and 3.6.2.

They are available from the downloads page.

THESE RELEASES CONTAIN A SECURITY UPDATE.

It is recommended that all Evergreen sites upgrade as soon as possible.

These releases fix a potential crash of a back end service when requests are made with a null authorization token.

All of these new releases contain additional bug fixes that are not related to the security issue. For more information on the changes in these releases, please consult their release notes:

“The Big Ask”: Securing Recurring Campus Funding for a Research Data Service at the University of Illinois / HangingTogether

Research data management (RDM) has quickly grown in interest in higher education, with significant investment in services, resources, and infrastructure to support researchers’ data management needs. While libraries are natural stakeholders in RDM, the development—and financial support for—research data support services requires significant social interoperability, the creation and maintenance of working relationships across individuals and organizational units that promote collaboration, communication, and mutual understanding.

A recent OCLC Research Library Partnership (RLP) Works in Progress webinar presentation entitled Developing and sustaining RDM services at Arizona and Illinois through partnership with the Office of Research provides two rich case studies for how libraries are optimizing social interoperability to support local RDM services. I urge you to watch the entire webinar—it’s time I promise you won’t want back.

In this blog post I’m focusing on how the University of Illinois Library secured funding from the Office of the Vice Chancellor for Research (VCR) to provide ongoing support for a campus Research Data Service. OCLC Research has previously highlighted the efforts at Illinois in our Realities of Research Data Management report series.

Securing buy-in. . . can take (a lot of) time

There were many efforts at Illinois that laid the groundwork for the establishment of the Research Data Service. As far back as 2010, the campus-wide Stewarding Excellence strategic planning effort noted that while several “stand-alone” activities on campus were supporting data management, there had been “very little campus-level planning for a coordinated data stewardship program,” which would require the addition of capacity, infrastructure, and policies. The subcommittee examining this service had been charged with exploring opportunities for campus-wide coordination, outsourcing, and external funding—in order to achieve savings in an insecure state funding environment.

However, it reported that cost savings were unlikely, instead articulating that

“. . . there is high risk associated with the status quo. Significant concerns include data quality and loss, liability related to funding agency requirements for data sharing and publication, and risk to future grants that require detailed data management plans and compliance.”[1]

As a result, two working groups were formed, including a Data Stewardship Committee, which was comprised of senior level stakeholders from the University Library, Office of the Vice Chancellor for Research (OVCR), Campus IT, National Center for Supercomputing Applications (NCSA), the Graduate College, and the iSchool.

In 2011-2012, this committee organized the Illinois Research Data Initiative, which included a number of events featuring prominent speakers and panels to raise awareness on campus of issues related to RDM practices, explore disciplinary-specific challenges, and to identify infrastructure and service needs.

As this outreach and needs assessment activity was taking place, there were also other external happenings that added urgency to the effort. Beginning in January 2011, the National Science Foundation (NSF) began requiring supplemental data management plans (DMPs) in NSF grant proposals. These DMPs were expected to describe how investigators would responsibly manage and share the results and data from NSF-supported research. In February 2013, the Obama administration’s White House Office of Science and Technology Policy (OSTP) released a public access memo directing federal agencies supporting research to develop a plan to support increasing public access to publicly supported research—including research data.

Timing—use it to your advantage when you can

When a new Vice Chancellor for Research began work in 2012, the committee members of the Illinois Research Data Initiative were ready with a proposal on his desk shortly after he arrived. The proposal was to launch a campus Research Data Service that leveraged existing research infrastructures and added new services that would include:

  • data curation and technical professionals with oversight by a director
  • data storage services
  • a research data repository
  • consultation services
  • close collaboration with central service units
  • interaction with governance groups

Offering concrete solutions to institution-wide problems

The proposal wasn’t cheap, with a total asking cost of $800,000 per annum to cover eight staff members, necessary memberships in DataCite and ORCID, and additional storage, software, and licensing costs. As you can see from the image here, they didn’t receive everything requested, but the Vice Chancellor for Research committed to recurring support in the amount of $400,000 annually to launch and sustain a Research Data Service to be housed in the University Library. Reductions were largely tied to medium-scale storage, which was handled with one-time start-up funds instead of being part of the recurring allocation.

Stewarding relationships and building trust

In the time since, the Illinois Research Data Service has made it a priority to continue to steward the strong relationships and trust established during the initial efforts with the Office of Research, campus IT, and other RDM stakeholders. The library provides the Office of the Vice Chancellor for Research with quarterly updates, along with advice and recommendations related to opportunities of interest or service challenges. In return, the OVCR looks to the library not only for the RDM services it provides, but also for valued expertise on campus committees, support for RFIs, and more. The relationship is reciprocal, trusted, and respectful.

An exemplar of social interoperability

The recent OCLC Research report Social Interoperability in Research Support: Cross-Campus Partnerships and the University Research Enterprise documents strategies and tactics that libraries need to leverage in order to develop relationships, secure buy-in and financial commitments, and sustain resource support services in the highly heterogeneous higher education landscape. This case study from the University of Illinois exemplifies many of the strategies and tactics discussed in that report: securing buy-in, investing time and energy in relationships, offering concrete solutions to institutional problems, and being tactical about timing—patiently waiting for the right moment and being prepared when it comes. I encourage you to read the report after finishing this blog—and put a note in the comments below if you have your own social interoperability story to share!


[1] Report and Recommendations, Stewarding Excellence @ Illinois IT Project team, University of Illinois at Urbana-Champaign, 2010, https://web.archive.org/web/20151228161659/http:/oc.illinois.edu/budget/it_project_team_report.pdf, 41-42. The section on research data is a worthy read, and it could be a resource to buttress your own institutional ask.

The post “The Big Ask”: Securing Recurring Campus Funding for a Research Data Service at the University of Illinois appeared first on Hanging Together.

Hyku 3.0 Release Includes New Customization Features / Samvera

Hyku 3.0 is now available, with new features and improvements. These features add customization options at the institution level, and the improvements provide for easier maintenance of Hyku implementations across all adopters.

Theming Improvements

Now even more theming capability is in the hands of non-technical administrators, offering the ability to create a unique branded repository without reliance on internal technical support or a development team. Admins can customize CSS in the interface, choose additional font and color options, and have the ability to set default images and logos at the tenant level.

Bulkrax Importer and Exporter now Optional

This feature ranked as a top-priority need in the recent community survey of Hyku features conducted by the Advancing Hyku Project. The Bulkrax Importer and Exporter is now optional behind a feature flipper, allowing admins to turn the feature on or off for their instance. It’s also now connected to BrowseEverything, and includes a status dashboard at the tenant level for self service. Bulkrax currently supports OAI-PMH, CSV, XML, and BagIt formats, and increases in scope with each iteration. The ability to toggle the feature on and off behind a feature flipper was supported by the Hyku National Transportation Library 3.0 project for the US Department of Transportation.

Contact Page Customization at the Tenant Level

Each institution can now configure contact information for their repository instance. This moves Hyku toward truly discrete tenants. This feature helps repository-level compliance with existing community frameworks and requirements such as the Core Trust Seal, the TRUST Principles and the COAR Community Framework, which require baseline software information and user support.

Improved User Management Capabilities

This includes a superuser role, tenant-level role improvements, removal of registration requirements when SSO is in place, and groundwork for future permissions updates. These improvements allow individual tenant institutions to manage their own users and repositories across a multiple-institution implementation, such as the Advancing Hyku/British Library multi-tenant implementation or the Hyku for Consortia/PALNI-PALCI multi-tenant consortial pilot. The SSO update allows for smoother adoption by institutions with their own login protocol requirements.

New Embargo and Lease Options

Embargo and Lease features have been upgraded to automatic release with background jobs. This functionality allows multiple options for visibility, such as worldwide visible or institution-only, to expire at designated times. This feature helps repositories to be compliant with publisher policies while ensuring the content is made available via the repository, without requiring administrator intervention when visibility requirements expire or open. This feature is part of the recent Advancing Hyku core code contribution to Hyku 3.0.

Additional improvements in this release include upgrades to Hyrax 2.9, Rails 5.2, and Ruby 2.7, with the removal of redundant code, and Helm chart Kubernetes deployment. Convergence on shared infrastructure and software stacks for Hyku 3.0 core means easier maintenance for Hyku implementations across all adopters. It brings us closer to the goal that all Hyku 3.0 instances will be on the same version, running the same code.

All the release details can be found in the release notes.

What can Hyku 3.0 enable for your project? Interested in learning more about Hyku?

Join us for the next Hyku Interest Group call! All are welcome to attend. You can also ask questions in the Hyku channel on the Samvera Slack workspace, and check out videos of Hyku in action on the Hyku YouTube channel.

The post Hyku 3.0 Release Includes New Customization Features appeared first on Samvera.