Planet Code4Lib

The Distant Reader and its five different types of input / Eric Lease Morgan

The Distant Reader can take five different types of input, and this blog posting describes what they are.

Wall Paper by Eric

The Distant Reader is a tool for reading. It takes an arbitrary amount of unstructured data (text) as input, and it outputs sets of structured data for analysis (that is, for reading). Given a corpus of any size, the Distant Reader will analyze the corpus and output a myriad of reports enabling you to use & understand it. The Distant Reader is intended to supplement the traditional reading process.

At the present time, the Reader can accept five different types of input, and they include:

  1. a file
  2. a URL
  3. a list of URLs
  4. a zip file
  5. a zip file with a companion CSV file

Each of these different types of input is elaborated upon below.

A file

The simplest form of input is a single file from your computer. This can be just about any file available to you, but to make sense, the file needs to contain textual data. Thus, the file can be a Word document, a PDF file, an Excel spreadsheet, an HTML file, a plain text file, etc. A file in the form of an image will not work because it contains zero text. Also, not all PDF files are created equal. Some PDF files are only facsimiles of their originals. Such PDF files are merely sets of images concatenated together. In order for PDF files to be used as input, the PDF files need to have been “born digitally” or they need to have had optical character recognition previously applied against them. Most PDF files are born digitally and do not suffer from being facsimiles.

A good set of use-cases for single file input is the whole of a book, a long report, or maybe a journal article. Submitting a single file to the Distant Reader is quick & easy, but the Reader is designed for analyzing larger rather than small corpora. Thus, supplying a single journal article to the Reader doesn’t make much sense; the use of the traditional reading process probably makes more sense for a single journal article.


A URL

The Distant Reader can take a single URL as input. Given a URL, the Reader will turn into a rudimentary Internet spider and build a corpus. More specifically, given a URL, the Reader will:

  • retrieve & cache the content found at the other end of the URL
  • extract any URLs it finds in the content
  • retrieve & cache the content from these additional URLs
  • stop building the corpus but continue with its analysis

In short, given a URL, the Reader will cache the URL’s content, crawl the URL one level deep, cache the result, and stop caching.
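The one-level crawl described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the Reader's actual code: the names `LinkExtractor` and `plan_crawl` are made up for this example, the sample HTML and example.com addresses are placeholders, and real fetching & caching are left out so the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def plan_crawl(root_url, root_html):
    """Return the URLs a one-level-deep crawl would cache, root first."""
    parser = LinkExtractor(root_url)
    parser.feed(root_html)
    planned = [root_url]
    for link in parser.links:
        # Keep only Web links, and keep each one only once.
        if link.startswith(("http://", "https://")) and link not in planned:
            planned.append(link)
    return planned


# Hypothetical page content; note the navigation-style links.
sample_html = """
<html><body>
  <a href="/about">About this site</a>
  <a href="https://example.org/article.html">An article</a>
  <a href="/about">About (again)</a>
  <a href="mailto:someone@example.org">Contact us</a>
</body></html>
"""
crawl_plan = plan_crawl("https://example.com/", sample_html)
```

Notice that the "about" link ends up in the plan right alongside the article. A spider this simple cannot tell navigation from content, which is exactly the kind of noise to expect from single URL input.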

Like the single file approach, submitting a URL to the Distant Reader is quick & easy, but there are a number of caveats. First of all, the Reader does not come with very many permissions, and just because you are authorized to read the content at the other end of a URL does not mean the Reader has the same authorization. A lot of content on the Web resides behind paywalls and firewalls. The Reader can only cache 100% freely accessible content.

“Landing pages” and “splash pages” represent additional caveats. Many of the URLs passed around the ‘Net do not point to the content itself; instead they point to ill-structured pages describing the content (metadata pages). Such pages may include things like authors, titles, and dates, but these things are not presented in a consistent or computer-readable fashion; they are laid out with aesthetics or graphic design in mind. These pages do contain pointers to the content you want to read, but the content may be two or three more clicks away. Be wary of URLs pointing to landing pages or splash pages.

Another caveat to this approach is the existence of extraneous input due to navigation. Many Web pages include links for navigating around the site. They also include links to things like “contact us” and “about this site”. Again, the Reader is sort of stupid. If found, the Reader will crawl such links and include their content in the resulting corpus.

Despite these drawbacks there are a number of excellent use-cases for single URL input. One of the best is Wikipedia articles. Feed the Reader a URL pointing to a Wikipedia article. The Reader will cache the article itself, and then extract all the URLs the article uses as citations. The Reader will then cache the content of the citations, and then stop caching.

Similarly, a URL pointing to an open access journal article will function just like the Wikipedia article, and this will be even more fruitful if the citations are in the form of freely accessible URLs. Better yet, consider pointing the Reader to the root of an open access journal issue. If the site is not overly full of navigation links, and if the URLs to the content itself are not buried, then the whole of the issue will be harvested and analyzed.

Another good use-case is the home page of some sort of institution or organization. Want to know about Apple Computer, the White House, a conference, or a particular department of a university? Feed the root URL of any of these things to the Reader, and you will learn something. At the very least, you will learn how the organization prioritizes its public face. If things are more transparent than not, then you might be able to glean the names and addresses of the people in the organization, the public policies of the organization, or the breadth & depth of the organization.

Yet another excellent use-case includes blogs. Blogs often contain content at their root. Navigation links abound, but more often than not the navigation links point to more content. If the blog is well-designed, then the Reader may be able to create a corpus from the whole thing, and you can “read” it in one go.

A list of URLs

The third type of input is a list of URLs. The list is expected to be manifested as a plain text file, and each line in the file is a URL. Use whatever application you desire to build the list, but save the result as a .txt file, and you will probably have a plain text file.‡

Caveats? Like the single URL approach, the list of URLs must point to freely available content, and pointing to landing pages or splash pages is probably to be avoided. Unlike the single URL approach, the URLs in the list will not be used as starting points for Web crawling. Thus, if the list contains ten items, then ten items will be cached for analysis.

Another caveat is the actual process of creating the list; I have learned that it is actually quite difficult to create lists of URLs. Copying & pasting gets old quickly. Navigating a site and right-clicking on URLs is tedious. While search engines & indexes often provide some sort of output in list format, the lists are poorly structured and not readily amenable to URL extraction. On the other hand, there are more than a few URL extraction tools. I use a Google Chrome extension called Link Grabber. [1] Install Link Grabber. Use Chrome to visit a site. Click the Link Grabber button, and all the links in the document will be revealed. Copy the links and paste them into a document. Repeat until you get tired. Sort and peruse the list of links. Remove the ones you don’t want. Save the result as a plain text file.‡ Feed the result to the Reader.
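The sort, prune, and save steps can also be scripted. What follows is a minimal sketch under a few assumptions: the function name `build_url_list` and the file name `urls.txt` are invented for illustration, the pasted links are placeholders, and line-feed delimiters are used because that is the only format described as tested.

```python
import tempfile
from pathlib import Path


def build_url_list(raw_links):
    """Return unique http(s) URLs, sorted, one per line, LF-delimited."""
    cleaned = set()
    for link in raw_links:
        link = link.strip()  # also drops a stray carriage return
        if link.startswith(("http://", "https://")):
            cleaned.add(link)
    return "\n".join(sorted(cleaned)) + "\n"


# Hypothetical links, as they might look after copying & pasting.
pasted = [
    "https://example.org/b.html\r",
    "https://example.org/a.html",
    "  https://example.org/a.html  ",
    "javascript:void(0)",
    "",
]
url_list = build_url_list(pasted)

with tempfile.TemporaryDirectory() as work:
    list_path = Path(work) / "urls.txt"
    # newline="\n" stops Python from writing CRLF line endings on Windows.
    with open(list_path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(url_list)
    saved_bytes = list_path.read_bytes()
```

In real use you would write `urls.txt` somewhere permanent rather than to a temporary directory; the temporary directory merely keeps the sketch from leaving files behind.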

Despite these caveats, the list of URLs approach is the most scalable input option. Given a list of five or six items, the Reader will do quite well, but the Reader will operate just as well if the list contains dozens, hundreds, or even thousands of URLs. Imagine reading the complete works of your favorite author or the complete run of an electronic journal. Such is more than possible with the Distant Reader.‡

A zip file

The Distant Reader can take a zip file as input. Create a folder/directory on your computer. Copy just about any file into the folder/directory. Compress the folder/directory into a .zip file. Submit the result to the Reader.

Like the other approaches, there are a few caveats. First of all, the Reader is not able to accept .zip files whose size is greater than 64 megabytes. While we do it all the time, the World Wide Web was not really designed to push around files of any great size, and 64 megabytes is/was considered plenty. Besides, you will be surprised how many files can fit in a 64 megabyte file.
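If you prefer a script to a right-click, compressing the folder and checking the result against the size limit might look like the sketch below. The helper name `zip_folder` and the two sample files are made up for this example, and treating "64 megabytes" as 64 × 1024 × 1024 bytes is an assumption; the exact cutoff is not specified here.

```python
import tempfile
import zipfile
from pathlib import Path

MAX_ZIP_BYTES = 64 * 1024 * 1024  # assumption: "64 megabytes" read as 64 MiB


def zip_folder(folder, zip_path):
    """Compress every file under folder into zip_path; return its size in bytes."""
    folder = Path(folder)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(folder.rglob("*")):
            if path.is_file():
                # Store names relative to the folder so the archive stays tidy.
                archive.write(path, arcname=path.relative_to(folder).as_posix())
    return Path(zip_path).stat().st_size


with tempfile.TemporaryDirectory() as work:
    # Build a tiny hypothetical corpus to compress.
    corpus = Path(work) / "corpus"
    corpus.mkdir()
    (corpus / "melville-moby.txt").write_text("First sample text.\n", encoding="utf-8")
    (corpus / "thoreau-walden.txt").write_text("Second sample text.\n", encoding="utf-8")

    # Keep the .zip file outside the folder being compressed.
    zip_path = Path(work) / "corpus.zip"
    size = zip_folder(corpus, zip_path)
    fits = size <= MAX_ZIP_BYTES
    with zipfile.ZipFile(zip_path) as archive:
        names = sorted(archive.namelist())
```

If `fits` comes back false, splitting the corpus into two or more zip files is the obvious remedy.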

Second, the computer gods never intended file names to contain things other than simple Romanesque letters and a few rudimentary characters. Nowadays our file names contain spaces, quote marks, apostrophes, question marks, back slashes, forward slashes, colons, commas, etc. Moreover, file names might be 64 characters long or longer! While every effort has been made to accommodate file names with such characters, your mileage may vary. Instead, consider using file names which are shorter, simpler, and have some sort of structure. An example might be first word of author’s last name, first meaningful word of title, year (optional), and extension. Herman Melville’s Moby Dick might thus be named melville-moby.txt. In the end the Reader will be less confused, and you will be more able to find things on your computer.
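The naming pattern suggested above is easy to automate. Here is one possible sketch; the function name `simple_name` and the short list of "unmeaningful" words are assumptions made for this example, not anything the Reader requires.

```python
import re
import unicodedata

STOP_WORDS = {"a", "an", "the", "of", "and", "on", "in", "to"}  # assumed list


def _words(text):
    """Reduce text to lower-case runs of plain ASCII letters and digits."""
    text = text.replace("'", "").replace("’", "")
    # Decompose accented letters, then drop the accents (e.g. é becomes e).
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.findall(r"[a-z0-9]+", text.lower())


def simple_name(last_name, title, year=None, extension="txt"):
    """Build a name such as melville-moby.txt from a last name and a title."""
    author_part = _words(last_name)[0]
    title_words = [word for word in _words(title) if word not in STOP_WORDS]
    # Fall back to the first word if the title is made only of stop words.
    title_part = title_words[0] if title_words else _words(title)[0]
    parts = [author_part, title_part]
    if year is not None:
        parts.append(str(year))
    return "-".join(parts) + "." + extension


moby = simple_name("Melville", "Moby Dick")
scarlet = simple_name("Hawthorne", "The Scarlet Letter", year=1850)
accented = simple_name("Lindén", "Stressors & Burnout", extension="pdf")
```

The sketch assumes both the last name and the title contain at least one letter or digit; an empty value raises an `IndexError` rather than producing a nameless file.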

There are a few advantages to the zip file approach. First, you can circumvent authorization restrictions; you can put licensed content into your zip files and it will be analyzed just like any other content. Second, the zip file approach affords you the opportunity to pre-process your data. For example, suppose you have downloaded a set of PDF files, and each page includes some sort of header or footer. You could transform each of these PDF files into plain text and use some sort of find/replace function to remove the headers & footers. Save the result, zip it up, and submit it to the Reader. The resulting analysis will be more accurate.
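As an illustration of that sort of pre-processing, the sketch below removes a known running header and a page-number footer from text that has already been extracted from a PDF file. The function name `strip_running_lines`, the sample header, and the footer pattern are all invented for this example; real headers & footers vary, so expect to adjust the pattern for each collection.

```python
import re


def strip_running_lines(text, header, footer_pattern):
    """Drop lines equal to header and lines that fully match footer_pattern."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == header:
            continue
        if re.fullmatch(footer_pattern, stripped):
            continue
        kept.append(line)
    return "\n".join(kept) + "\n"


# Hypothetical plain text from a two-page PDF file.
sample = (
    "Journal of Examples, vol. 1\n"
    "First page of text.\n"
    "Page 1 of 2\n"
    "Journal of Examples, vol. 1\n"
    "Second page of text.\n"
    "Page 2 of 2\n"
)
cleaned = strip_running_lines(sample, "Journal of Examples, vol. 1", r"Page \d+ of \d+")
```

Matching whole lines, rather than doing a blind find/replace, lowers the odds of deleting a sentence that merely mentions the journal's name.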

There are many use-cases for the zip file approach. Master’s and Ph.D. students are expected to read large amounts of material. Save all those things into a folder, zip them up, and feed them to the Reader. You have been given a set of slide decks from a conference. Zip them up and feed them to the Reader. A student is expected to read many different things for History 101. Download them all, put them in a folder, zip them up, and submit them to the Distant Reader. You have written many things but they are not on the Web. Copy them to a folder, zip them up, and “read” them with the… Reader.

A zip file with a companion CSV file

The final form of input is a zip file with a companion comma-separated value (CSV) file — a metadata file.

As the size of your corpus increases, so does the need for context. This context can often be manifested as metadata (authors, titles, dates, subject, genre, formats, etc.). For example, you might want to compare & contrast who wrote what. You will probably want to observe themes over space & time. You might want to see how things differ between different types of documents. To do this sort of analysis you will need to know metadata regarding your corpus.

As outlined above, the Distant Reader first creates a cache of content — a corpus. This is the raw data. In order to do any analysis against the corpus, the corpus must be transformed into plain text. A program called Tika is used to do this work. [2] Not only does Tika transform just about any file into plain text, but it also does its best to extract metadata. Depending on many factors, this metadata may include names of authors, titles of documents, dates of creation, number of pages, MIME-type, language, etc. Unfortunately, more often than not, this metadata extraction process fails and the metadata is inaccurate, incomplete, or simply non-existent.

This is where the CSV file comes in; by including a CSV file named “metadata.csv” in the .zip file, the Distant Reader will be able to provide meaningful context. In turn, you will be able to make more informed observations, and thus your analysis will be more thorough. Here’s how:

  • assemble a set of files for analysis
  • use your favorite spreadsheet or database application to create a list of the file names
  • assign a header to the list (column) and call it “file”
  • create one or more columns whose headers are “author” and/or “title” and/or “date”
  • to the best of your ability, update the list with author, title, or date values for each file
  • save the result as a CSV file named “metadata.csv” and put it in the folder/directory to be zipped
  • compress the folder/directory to create the zip file
  • submit the result to the Distant Reader for analysis
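For those who would rather script the steps above than use a spreadsheet, the following sketch writes the same sort of file with Python's csv module. The column names come from the list above; the function name `metadata_csv`, the sample rows, and the column order are assumptions made for illustration.

```python
import csv
import io

FIELDS = ["file", "author", "title", "date"]


def metadata_csv(rows):
    """Serialize a list of dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Values you do not know are simply left as empty cells.
        writer.writerow({field: row.get(field, "") for field in FIELDS})
    return buffer.getvalue()


# Hypothetical corpus of two files; the second has no known date.
rows = [
    {"file": "melville-moby.txt", "author": "Melville, Herman",
     "title": "Moby Dick", "date": "1851"},
    {"file": "hawthorne-scarlet.txt", "author": "Hawthorne, Nathaniel",
     "title": "The Scarlet Letter"},
]
csv_text = metadata_csv(rows)

# Read the text back to confirm it parses the way a spreadsheet would see it.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Save `csv_text` as metadata.csv inside the folder/directory before compressing it. The csv module quotes values containing commas, such as "Melville, Herman", so they survive the round trip intact.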

The zip file with a companion CSV file has all the strengths & weaknesses of the plain ol’ zip file, but it adds some more. On the weakness side, creating a CSV file can be both tedious and daunting. On the other hand, many search engines & indexes export lists with author, title, and date metadata. One can use these lists as the starting point for the CSV file.♱ On the strength side, the addition of the CSV metadata file makes the Distant Reader’s output immeasurably more useful, and it leads the way to additional compare & contrast opportunities.


To date, the Distant Reader takes five different types of input. Each type has its own set of strengths & weaknesses:

  • a file – good for a single large file; quick & easy; not scalable
  • a URL – good for getting an overview of a single Web page and its immediate children; can include a lot of noise; has authorization limitations
  • a list of URLs – can accommodate thousands of items; has authorization limitations; somewhat difficult to create list
  • a zip file – easy to create; file names may get in the way; no authorization necessary; limited to 64 megabytes in size
  • a zip file with CSV file – same as above; difficult to create metadata; results in much more meaningful reports & opportunities

Happy reading!

Notes & links

‡ Distant Reader Bounty #1: To date, I have only tested plain text files using line-feed characters as delimiters; such is the format of plain text files in the Linux and Macintosh worlds. I will pay $10 to the first person who creates a plain text file of URLs delimited by carriage-return/line-feed characters (the format of Windows-based text files) and who demonstrates that such files break the Reader. “On your mark. Get set. Go!”

‡ Distant Reader Bounty #2: I will pay $20 to the first person who creates a list of 2,000 URLs and feeds it to the Reader.

♱ Distant Reader Bounty #3: I will pay $30 to the first person who writes a cross-platform application/script which successfully transforms a Zotero bibliography into a Distant Reader CSV metadata file.

[1] Link Grabber –

[2] Tika –

Stewardship of professional FTEs in metadata work and turnover / HangingTogether

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by John Riemer of University of California, Los Angeles and Jennifer Baxmeyer of Princeton. Turnover in a professional position within a cataloging or metadata unit now comes with a significant risk that it will be impossible to convince administrators to retain the position in the unit and repost it. This is particularly true when the outgoing incumbent performed a high proportion of “traditional” work, e.g., original cataloging in MARC. The odds of retaining the position are much greater if careful thought goes into how the position could be reconfigured or re-purposed to meet emerging needs.

Most metadata managers have had to address varying amounts of turnover, either from retirements or staff leaving for other positions. Half needed to reconfigure the positions of outgoing librarians.  Looking at what other institutions are advertising helps in creating an attractive position description.  Many cataloging positions do not require an MLS degree, so recruiting professionals has focused on adaptability, aligning new positions with university priorities, and eagerness to learn and take initiative in areas such as metadata for research output, open access, digital collections, and linked data. Mapping out future strategies and designing ways of making metadata interoperate across systems have been components of recent recruitments. New staff with programming skills are sought after, as they can apply batch techniques to metadata that can compensate for the loss of staff.  Using technology in the service of library service helps catalogers “do more with less”. (Some of these needs have been discussed previously: see New skill sets for metadata management.)  Professionals trail-blaze innovations, which are then routinized for non-professionals.

The impact of losing professional librarians includes:

  • Loss of language skills and original cataloging capacity
  • Longer throughput
  • Increased minimal-level cataloging
  • Reduced time for authority work and training
  • Some cataloging work just doesn’t get done, increasing backlogs

Metadata managers have a variety of strategies for dealing with this impact. Libraries are relying more on shelf-ready books or outsourcing cataloging for which they no longer have staff.  Outsourcing as much of the lower-level work as possible frees resources for higher-level work and preserves librarian-level staff.  As most research libraries batchload a large number of records created by publishers and vendors, professional staff can focus on describing the library’s distinctive collections and working on new initiatives. Some consolidate roles and expectations. For example, catalogers with specific subject expertise can be assigned to describing materials outside their areas of expertise; those with language expertise may handle multilingual descriptions in formats other than MARC. With tools like controlled vocabularies, the need for subject expertise might be less.

Institutional processes may limit the extent libraries can assign some professional tasks to non-professionals. In addition, training non-librarians such as students to comply with library standards can be difficult; they do not always understand why librarians do things a certain way. Student assignments are also time-bounded, which makes outsourcing more attractive. Some would like to experiment with harnessing the knowledge of expert readers like academic staff and post-docs who use their special collections, but the training that would be required can be daunting.

Metadata managers’ experiences have shown that it is easier for librarians to learn programming skills than it is to hire IT specialists to learn the “technical services mind-set.”  Metadata managers also want new staff to be aware of the broader “cataloging world” that the library’s metadata must integrate with, the “shared cataloging community.” Some catalog records may be good enough for in-house use but don’t work in the larger environment. Libraries’ metadata needs to interoperate with the metadata created by others in different settings.

The library environment keeps evolving, and librarians have had to reflect on their priorities moving forward. Metadata managers need to rethink the roles of metadata specialists beyond “traditional cataloging work,” especially with so much cataloging having moved to batch processing of large chunks of data. More work is being conducted at a consortial rather than institutional level. It has become harder to justify “really specialized” catalogers. Potential candidates with more flexible skill sets have become more attractive than those with a traditional cataloging background who may not adapt well to working in new environments. Many cataloging roles and descriptions may need to be rewritten and retooled. Most ideal would be a process of “continuous modernization,” annually reviewing and adjusting existing positions to meet new needs and not waiting to address this just when vacancies occur.

This retooling has resulted in more group cross-training to update staff skills and learning exercises where staff from across a unit or division work together to resolve knotty issues or problems. Talking through workflows and problems and documenting the resolutions help everyone in the group learn new skills and develop a “we’re all in this together” mindset while breaking down change resistance. The documentation also helps with integrating new staff. Some have established standing “study groups” to learn how to apply SPARQL, RDF validation, and scripting skills. Staff prefer in-house training to external training as it strengthens team knowledge and demonstrates internal commitment to their professional development. Perhaps the only activities that will perennially remain professional tasks are those like management, scouting new trends, strategizing, leading and implementing changes, and thinking about the big picture.

Do any of you have success stories about reconfiguring a professional metadata position that you’re willing to share in the comments below?

The post Stewardship of professional FTEs in metadata work and turnover appeared first on Hanging Together.

Be Careful What You Measure / David Rosenthal

"Be careful what you measure, because that's what you'll get" is a management platitude dating back at least to V. F. Ridgway's 1956 Dysfunctional Consequences of Performance Measurements:

"Quantitative measures of performance are tools, and are undoubtedly useful. But research indicates that indiscriminate use and undue confidence and reliance in them result from insufficient knowledge of the full effects and consequences. ... It seems worth while to review the current scattered knowledge of the dysfunctional consequences resulting from the imposition of a system of performance measurements."

Back in 2013 I wrote Journals Considered Harmful, based on Deep Impact: Unintended consequences of journal rank by Björn Brembs and Marcus Munafò, which documented that the use of Impact Factor to rank journals had caused publishers to game the system, with negative impacts on the integrity of scientific research. Below the fold I look at a recent study showing similar negative impacts on research integrity.

Citation gaming induced by bibliometric evaluation: A country-level comparative analysis by Alberto Baccini, Giuseppe De Nicolao and Eugenio Petrovich (summary here) shows that the introduction of a research assessment scheme based on bibliometrics caused Italian researchers to massively game the system. Their study uses:

"a new inwardness indicator able to gauge the degree of scientific self-referentiality of a country. Inwardness is defined as the proportion of citations coming from the country over the total number of citations gathered by the country."

Their Figure 1 shows that for G10 countries inwardness gradually increased over the period 2000-2016. But starting in 2011 Italy (red) started increasing significantly faster, rising over the remaining 6 years from 6th (20% inwardness) to 2nd (30%). They identify the cause as a change in the incentives for researchers:

"The comparative analysis of the inwardness indicator showed that Italian research grew in insularity in the years after the adoption of the new rules of evaluation. While the level of international collaboration remained stable and comparatively low, the research produced in the country tended to be increasingly cited by papers authored by at least an Italian scholar.

"The anomalous trend of the inwardness indicator detected at the macro level can be explained by a generalized change in micro-behaviours of Italian researchers induced by the introduction of bibliometric thresholds in the national regulations for recruitment and career advancement. Indeed, in 2011 research and careers evaluation were revolutionized by the introduction of quantitative criteria in which citations played a central role. In particular, citations started being rewarded in the recruiting and habilitation mechanisms, regardless of their source."

The change in the research and career evaluation:

"created an incentive to inflate those citation scores by means of strategic behaviors, such as opportunistic self-citations and the creation of citation clubs"

The change is even more obvious when they plot inwardness as a function of the rate of international collaboration. Researchers gamed the system so effectively that:

"Before 2010, Italy is close to and moves together with a group of three European countries, namely Germany, UK, and France. Starting from 2010, Italy departs from the group along a steep trajectory, to eventually become the European country with the lowest international collaboration and the highest inwardness."

So what did the Italian authorities expect when they made hiring, promotion and research funding dependent upon citation counts? Similarly, what did academics expect when they made journal prestige, and thus tenure, depend upon citation counts via Impact Factor? Was the goal to increase citation counts? No, but that's what they got.
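For readers who want the arithmetic spelled out, the inwardness indicator quoted above is a plain ratio. The sketch below is only an illustration: the function name is made up, and the citation counts are hypothetical round numbers chosen to reproduce the 20% and 30% levels mentioned above, not data from the study.

```python
def inwardness(domestic_citations, total_citations):
    """Share of a country's citations that come from the country itself."""
    if total_citations <= 0:
        raise ValueError("total_citations must be positive")
    return domestic_citations / total_citations


# Hypothetical counts: 200 of 1,000 citations are domestic, later 300 of 1,000.
before = inwardness(200, 1000)
after = inwardness(300, 1000)
increase = after - before
```

Note that the ratio rises whenever domestic citations grow faster than citations from abroad, whatever the reason. That is why the authors also compare inwardness against the rate of international collaboration before blaming the incentives.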

Jobs in Information Technology: October 16, 2019 / LITA

New This Week

Visit the LITA Jobs Site for additional job listings and information on submitting your own job posting.

When Does Burnout Begin? The Relationship Between Graduate School Employment and Burnout Amongst Librarians / In the Library, With the Lead Pipe

In Brief
Burnout issues are of increasing concern for many service professionals, including Library and Information Science (LIS) workers; however, the majority of articles addressing burnout in the LIS field describe methods of coping with burnout, but do not ascertain trends and preventable factors. The purpose of this study was to identify the percentage of LIS workers (current and former) and students who have experienced burnout. Additionally, this study focused on the correlation between those who work while obtaining their LIS degree and whether or not they later experience burnout. These objectives aim to answer the question: what percentage of future librarians are more susceptible to burnout once they enter the profession because they are currently working while enrolled in classes? The LIS field is competitive, and students are encouraged to gain experience in libraries while pursuing their LIS degree. By identifying the prevalence of burnout within the LIS profession and attempting to identify the earliest causes, we hope to spark a conversation between hiring managers and current or future library professionals about the effects of our profession’s expectations and the high risk of burnout.

By Jade Geary and Brittany Hickey


Burnout is becoming an increasingly prevalent issue in our society. According to a general population survey from Statista, 21% of females and 17% of males age 18 and older in the U.S. suffer from exhaustion related to burnout (2017). Librarianship is not immune to the increase in burnout. In fact, helping professions are particularly vulnerable to burnout (Swanson 1992), and librarianship is a helping profession. It is essential to investigate the causes of burnout and how to prevent burnout. By looking at causes and prevention techniques, Library and Information Science (LIS) educators can help students prepare for the potential of burnout in their future careers and managers can become better informed on how to aid employees. The findings of our study indicate that there is a strong connection between working while in library school and experiencing burnout. Thus it is imperative that burnout prevention techniques are discussed with LIS graduate students. This discussion includes both how to prevent burnout for themselves as well as how to aid others in burnout prevention. The latter is essential, as it is important for future managers to be able to assist those they work with in preventing and coping with burnout in librarianship.

Literature Review

The idea of work-related burnout first appeared in psychological literature in the 1970s (Schaufeli, Leiter & Maslach, 2008). While burnout did not appear in the LIS literature until more recently, there is still an abundance of information available. The LIS dialogue on burnout ranges from coping resources (Bosque & Skarl, 2016; Martin, 2009) and webinars (Rogers-Whitehead, 2018; Singer & Griffith, 2011; Westwood, 2017) to panels (Block, Clasper, Courtney, Hermann, Houghton, & Zulida, 2019) and scholarly literature (Adebayo, Segun-Adeniran, Fagbohun, & Osayande, 2018). Burnout is not the domain of one particular library type; it is pervasive in every type of library, from special libraries to public libraries (Lindén, Salo, & Jansson, 2018; Salyers et al., 2019; Swanson, 1992). In fact, it is common for the literature to focus on particular job functions associated with librarian burnout (Affleck, 1996; Nardine, 2019). Unfortunately, even with the growing popularity of burnout research, there is limited literature focusing on the root causes of burnout. This literature review will focus on the literature currently available on the topic, while our data will help fill a gap in the literature concerning when burnout begins.

Defining burnout

To begin, it is important to explore how burnout is defined and what the symptoms of burnout are. There are numerous definitions of burnout, but this study focuses on the definition provided by Christina Maslach, a leading authority on occupational burnout and the creator of the Maslach Burnout Inventory. Maslach (1982) defines burnout as “a syndrome of emotional exhaustion, depersonalization, and reduced personal accomplishment that can occur among individuals who do ‘people work’ of some kind” (p. 3). The primary factors leading to burnout are an unsustainable workload, role conflict and a lack of personal control at work, insufficient recognition or compensation, lack of social support, a sense of unfairness, and personal values that are at odds with the organization’s values (Maslach & Leiter, 2008). There are three overarching components of burnout: “overwhelming exhaustion, feelings of cynicism and detachment from the job, and a sense of ineffectiveness and lack of accomplishment” (Maslach & Leiter, 2016, p. 103). Maslach & Leiter (2016) describe the physical symptoms of burnout as the following: “headaches, chronic fatigue, gastrointestinal disorders, muscle tension, hypertension, cold/flu episodes, and sleep disturbances” (p. 106).

Methods to Prevent

In the literature, many articles offer tips on how to prevent burnout. Christian (2015) argues that proactive solutions are needed to “reverse the symptoms of a passion deficit” (p. 8). One solution offered by Christian (2015) is for LIS faculty to do a better job of preparing students for the “emotional labor” aspect of librarianship, including the negative side effects. This route takes a preventive approach; unfortunately, for those currently working in the field, precautionary methods do little to alleviate existing problems. In turn, there needs to be more literature on how to reduce burnout within the working profession for everyone from top-level administrators to part-time paraprofessionals. Most articles on preventing burnout focus on steps that individuals can take. DelGuidice (2011), for instance, offers a list of ways for school librarians to avoid burnout after the appearance of symptoms: attend conferences, take your lunch break or “prep” hour, take a sick day, let your aides do more, partner with the public library, or reach out for help. Campbell (2008) has many of the same suggestions, but also adds personalizing your workspace, engaging in meditation, and finding a hobby.

Farrell, Alabi, Whaley, and Jenda (2017) suggest library mentoring, not as a method of prevention but as one of mitigation. They propose that mentors who are aware of the causes (including racial microaggressions and imposter syndrome) and symptoms of burnout are more likely to offer compassion and empathy. However, mentors who are unfamiliar with burnout may fan the flames of burnout by encouraging their mentees to work harder to prove themselves.

Scholarship on burnout in LIS

Much of the scholarship surrounding burnout in libraries focuses solely on academic library settings. Adebayo, Segun-Adeniran, Fagbohun, and Osayande (2018) investigated “perceived causes” of burnout amongst all levels of library staff at academic libraries in Nigeria by asking participants if they felt certain factors caused them to personally experience burnout. The causes included factors such as funding, support, and work environment. Nardine (2019) focused specifically on academic liaison librarians in the Association of Research Libraries (ARL). Kaetrena Davis Kendrick (2017) studied the low morale of academic librarians and identified burnout, along with bullying and workplace toxicity, as contributing factors to issues with morale.

Although the literature is focused heavily on academic libraries, public libraries and librarians are not completely left out of the scholarship on burnout. Lindén, Salo, and Jansson (2018) investigated burnout in public libraries in Sweden. They studied organizational factors that lead to burnout and found that “the most frequently occurring stressors encountered in the library organization was the workload stressor ‘overload’, the job-control stressors ‘technostress’ and ‘patrons’, the reward stressor ‘poor feedback from management’ and the community stressor ‘isolation’” (p. 203). Salyers et al. (2019) note the lack of literature available on burnout in public libraries. Their study gathered data from 171 public librarians about the issue of burnout. Salyers et al. (2019) found,

several job and recovery-related factors to be associated with increased emotional exhaustion and cynicism and decreased professional efficacy. Important job-related variables appear to be work pressure (associated with greater emotional exhaustion) and protective factors of autonomy, role clarity, and coworker support (for emotional exhaustion and to a lesser extent cynicism). (p. 981)

Swanson (1992) focuses on burnout in youth librarians in both public libraries and school libraries. Swanson (1992) emphasizes how burnout is common within the helping professions and focuses on physical exhaustion, emotional exhaustion, and psychological exhaustion. Swanson (1992) found that her pool of participants did not experience a significantly high rate of burnout and exhaustion. The literature on burnout and school media specialists or special librarians is primarily limited to those articles which offer methods of prevention (DelGuidice, 2011; Anzalone, 2015) rather than in-depth research into the prevalence of or specific factors leading to burnout.

Graduate students are virtually forgotten when it comes to scholarship on burnout within Library and Information Science. Multiple database searches failed to reveal any relevant articles. Although research on this topic is limited within the LIS field, burnout among students in other professions, such as Social Work and Psychology, is being investigated. Han, Lee, and Lee (2012) found that incoming Social Work graduate students are more susceptible to the three overarching characteristics of burnout identified by Maslach if their emotions are frequently influenced by the emotions of those around them. Wardle and Mayorga (2016) found that only 14.28% of counseling graduate students were not on the verge of, or already suffering from, burnout.

Research Question

The focus of our study was to answer the following research question:
Are library school students who work through graduate school more likely to leave librarianship due to burnout than their peers who did not work while obtaining an MLS/MLIS degree?


To gather our data, we created a branching survey that adjusted each participant’s questions based on their provided answers (Appendix A). The survey is not an adaptation of the Maslach Burnout Inventory because we were not measuring the degree to which librarians experience burnout. Instead, librarians were asked to determine if they believe they have experienced burnout. Other questions were designed to attempt to identify common experiences of librarians who have or have not experienced burnout.

Before completing the survey, participants were provided with an informed consent statement, an explanation of the purpose of our research, and a definition of burnout adapted from Maslach’s aforementioned definition. Our research focus limited our pool of LIS professionals. To be best aligned with our research question, the survey was targeted to MLS/MLIS students, current librarians, and former librarians. This was noted prior to beginning the survey with the following statement: “the survey is open to MLS/MLIS students, current librarians, and former librarians. We are not seeking responses from paraprofessionals at this time”.

After approval from the Institutional Review Board (IRB) at both of our institutions, we distributed our survey on October 1, 2018. We shared it in two Facebook groups, Library Think Tank – #ALATT and the Library Employee Support Network, as well as on our own personal accounts. The survey was also shared via Twitter, professional listservs, and our home libraries. We posted our call for participation on social media twice, on listservs twice, and at our home libraries once.

The survey remained open on Google Forms for one month. Once the survey closed, all results were extracted from Google Forms via Google Sheets. An original copy of the data was kept and has remained untouched. The data was then cleaned. Anyone who did not meet the criteria was removed, codes were set for the responses, and the data was moved into StatCrunch for statistical analysis.
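A minimal sketch of cleaning steps like these, for readers who want to picture the process. It is illustrative only: the column names (`status`, `experienced_burnout`), the sample rows, and the numeric codes are hypothetical and are not taken from the survey export.

```python
import copy

# Hypothetical rows standing in for a spreadsheet export; not real survey data.
raw_rows = [
    {"status": "Current librarian", "experienced_burnout": "Yes"},
    {"status": "LIS student", "experienced_burnout": "No"},
    {"status": "Paraprofessional", "experienced_burnout": "Yes"},
    {"status": "Former librarian", "experienced_burnout": "Yes"},
]

# Keep an untouched copy of the original data before any cleaning.
original_rows = copy.deepcopy(raw_rows)

# Assumed eligibility criteria, based on the survey's stated audience.
ELIGIBLE = {"LIS student", "Current librarian", "Former librarian"}

# Assumed numeric codes of the kind a statistics package might expect.
BURNOUT_CODES = {"No": 0, "Yes": 1}


def clean(rows):
    """Drop ineligible respondents and code the burnout answer numerically."""
    cleaned = []
    for row in rows:
        if row["status"] not in ELIGIBLE:
            continue  # respondent does not meet the criteria
        coded = dict(row)  # copy so the input row is never modified
        coded["experienced_burnout"] = BURNOUT_CODES[row["experienced_burnout"]]
        cleaned.append(coded)
    return cleaned


cleaned_rows = clean(raw_rows)
```

Working on copies of each row means the original export stays unchanged, which mirrors keeping an untouched original alongside the cleaned data.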


We received responses from 612 people who met the survey requirements (i.e. participants who completed library school or are currently LIS students). The responses encompassed a wide range of library types: public (n=333), academic (n=216), school (n=74), archives (n=39), government (n=3), special (n=40), law (n=2), law firm (n=1), medical (n=2), hospital (n=1), subscription/membership (n=1), military (n=3), state (n=2), and high-density offsite storage (n=1). Participants were able to indicate all of the types of libraries in which they worked, allowing our research to reflect a wide array of library experiences.

Current Librarians

There were 612 total respondents, of which 76.64% are current librarians. Of the current librarians (n=469), 79.10% responded that they have experienced burnout based on the definition of burnout provided at the beginning of the survey. Additionally, 47.33% of current librarians responded that they have considered leaving the profession due to burnout.

An overwhelming majority of current librarians, 94.24%, were employed while they were enrolled in LIS courses. Current librarians who took an average of three credit hours per semester worked an average of 34.13 hours per week. Those enrolled in more than twelve credit hours per semester worked an average of 18.22 hours per week.

Closer inspection reveals that 78.89% of current librarians had a job in a library while taking library school classes. Overall, 74.84% of current librarians worked while enrolled in classes and experienced burnout, 10.87% worked but did not experience burnout, 4.26% did not work but still experienced burnout, and 0.85% neither worked while enrolled in classes nor experienced burnout later in their careers (Fig. 1).

Figure 1. The relationship between working while enrolled in school and experiencing burnout for current and former librarians. Alternate version of this bar chart as a table.

Former Librarians

Of the total respondents (n=612), 5.23% are former librarians. Of the former librarians (n=32), 71.88% responded that they have experienced burnout based on the definition of burnout provided at the beginning of the survey. Burnout was the primary reason that 18.75% of former librarians left the profession and an additional 40.63% reported that burnout was a contributing factor in their decision to leave the profession. Of the former librarians who experienced burnout, 86.96% worked while in library school. Of those, 39.13% worked 31 or more hours per week. Interestingly, 100% of former librarians who never experienced burnout worked in a library while in library school. Overall, 62.50% of former librarians worked while enrolled in classes and experienced burnout, 21.88% worked but did not experience burnout, 9.38% did not work but still experienced burnout, and 0% neither worked while enrolled in classes nor experienced burnout later in their careers (Fig. 1).

LIS Students

LIS students made up 18.14% of our total participant pool. Of the students who responded (n=111), 96.40% are employed. Of today’s LIS students, 61.26% work 31 or more hours per week in addition to taking classes. Further inquiry reveals that 77.47% of student respondents work in a library, including 15.31% who have multiple jobs, at least one of which is in a library. Furthermore, 53.15% of LIS students take six credit hours a semester on average. The majority of those students, 59.32%, work 31-40 hours per week in addition to their class responsibilities (Fig. 2).

Figure 2. The percent distribution of the hours LIS students worked by the average number of credit hours they were enrolled in.
Alternate version of this bar chart as a table.


The results of our study highlight the pervasiveness of burnout in the LIS field. Out of the sample (n=612), 81.86% of librarians reported that they have experienced burnout. With over three-fourths of respondents indicating they have experienced burnout, these results indicate that this topic demands further study within the profession. Additionally, since we are investigating the link between burnout and working while enrolled in graduate courses, the percentage of students working while pursuing their master’s degrees must be taken into account. The discussion section will take a closer look at these numbers to help provide a more comprehensive picture of factors that influence burnout.

Generally speaking, it appears more graduate students are working than before. We do not have a breakdown by decade, but we do know that 96.40% of current students are employed while taking classes, compared to 94.03% of current librarians, and 84.38% of former librarians (Fig. 3).

Figure 3. The percentage of LIS students, current librarians, and former librarians who are working or worked while enrolled in graduate courses.
Alternate version of these pie charts as a table.

Not only does it appear that more of today’s students are working, but they are also working more hours on average than current or former librarians did as students (Fig. 4). As expected, the average hours students spent at their jobs decreased as their average credit hours increased. The only exception was with retired librarians; however, only one retired librarian took an average of three credit hours, and they worked an average of 15 hours, which skewed the results.

Figure 4. Average hours spent working while enrolled.
Full description of this line graph as a table.

Our survey only asked students if they were working for income, experience, or a combination of the two. As depicted in Figure 5, the majority of students work for income and to gain experience. Out of the 111 current LIS students who responded to this question, 69.37% work while enrolled for both income and experience. One student commented that they are working specifically so they can receive benefits, such as insurance. This raises the question: will more students work full-time in the future to ensure they have health insurance, and how will this increase their susceptibility to burnout?

Figure 5. Why current LIS students work
Alternate version of this bar chart as a list.

Contrary to our predictions, as discussed in the results section, 100% of former librarians who never experienced burnout worked in a library while in library school. This differs from our data on current librarians, which shows that 74.84% of current librarians worked while enrolled in classes and experienced burnout. It would be impossible to draw conclusions from this data without talking more in depth with the former librarians that we surveyed. One possible explanation is the changing landscape of both librarianship and graduate work. Even though the changing landscape possibly contributes to burnout, it does not mean that this is the reason that former librarians did not experience burnout. There are many additional factors like the number of working hours, credit hours taken, the rigor of programs, and the type of work schedule they had to maintain both as a student and a professional.


This study had several limitations. First, for roughly an hour after the survey opened, there was an error with the branching in Google Forms. This caused four participants to receive the wrong screen, which presented them with additional, irrelevant questions. Only a small number of participants were affected, and since we were quickly alerted, we were able to fix the issue without it affecting our results. To adjust for this error, we removed the “extra” information that was provided to us via the branching mishap. The second limitation is our pool of participants. Selection bias is a possible concern: it is possible that LIS professionals who have experienced burnout were more likely to complete the survey. Additionally, it is difficult to connect with former librarians. Most are no longer on traditional listservs and/or social media. Thus we had a relatively small pool of former librarians. The third limitation is the definition of “librarian”. According to the Department for Professional Employees (2019), “in 2018, 53.5 percent of librarians held a master’s degree or higher” (p. 3). So, nearly half of those with a title of librarian do not have a master’s degree. We were specifically exploring the relationship between working while in library school and its impact on susceptibility to burnout later in life. Therefore, for the purpose of our research, we limited our data only to library workers who attended and completed library school. We found that some participants took the survey even though they did not meet this requirement. Consequently, they were removed from the pool. We recognize that burnout is an issue for all library employees, regardless of education or title; however, the scope of our study was limited to those who completed library school in order to determine if there is a correlation between burnout and work levels in library school.
Finally, we intended to investigate the relationship between burnout and race and/or gender, but we did not receive enough data to dive into such a complex issue.


Burnout is a complex issue. An overwhelming percentage of librarians experience burnout. The vast majority of librarians work while taking graduate classes. However, based on the relative lack of data from former librarians and librarians who have not experienced burnout, we cannot definitively say that working while in graduate school causes burnout. Burnout is an invasive issue within librarianship. Although this is something that we were aware of before we began the study, the data revealed how pervasive this issue truly is. We, as a profession, are suffering and it is clear that more research, training, and professional development need to be done on this topic.

We have only skimmed the surface of burnout research. There is much more room for analysis surrounding the degree to which people experience burnout within librarianship, rather than the prevalence of burnout in the profession. Feedback from our participants, and on social media, revealed that there is definitely a need to focus on all library workers regardless of education or position. Thus it is essential that more inclusive research is performed on this matter. Further, exploring the potentially additional stressors of paraprofessional library work is a topic that needs to be investigated more in-depth. Finally, it is important to note that our study was not limited to one type of library setting. Many articles that are written about librarian burnout focus on just academic librarians or just school media specialists; however, our study shows that burnout is a risk no matter the library setting.

It is our hope that this article can serve as a call for action for librarians, managers, and LIS educators. Hopefully, this can aid in a culture change for librarians and create greater support for those experiencing burnout. Librarians need to be able to openly discuss burnout and know that they are not alone in dealing with it. We hope that this article will act as a catalyst for such discussions. Perhaps our survey will persuade hiring managers to take another look at the experience requirements for “entry-level” positions. If entry-level positions were truly “entry-level” and didn’t have such lofty experience requirements, then graduate students may not feel so compelled to exhaust themselves while in graduate school, thus possibly reducing the number of people who experience burnout. Even with the overwhelming number of librarians who have experienced burnout, it is not a topic we heard mentioned in lectures or assigned readings in library school. As our findings indicate, our profession is rife with burnout. It is our hope that LIS schools and educators will put more of an emphasis on preparing students to prevent burnout in their lives and the lives of others.


We would like to thank Megan Fratta, our external reviewer, Bethany Radcliffe, our internal reviewer, and Ian Beilin, our publishing editor, for all of their insight, suggestions, and guidance. We would also like to thank Jesika Brooks for proofreading early drafts. Finally, we’d like to thank everyone who took the time to complete our survey.


Adebayo, O., Segun-Adeniran, C. D., Fagbohun, M. O., & Osayande, O. (2018). Investigating occupational burnout in library personnel. Library Philosophy & Practice, 1–15. Retrieved from

Affleck, M. A. (1996). Burnout among bibliographic instruction librarians. Library & Information Science Research 18, 165–83. Retrieved from

Block, C., Clasper, E., Courtney, K. K., Hermann, J., Houghton, S., & Zulida, D. F. (2019, July). Self care is not selfish: Preventing burnout. American Library Association Annual Conference. American Library Association, Washington, D.C. Retrieved from

Bosque, D., & Skarl, S. (2016). Keeping workplace burnout at bay: Online coping and prevention resources. College & Research Libraries News, 77(7), 349-355.

Campbell, K. (2008). The Wonder Woman syndrome: Burnout in libraries. Tennessee Libraries, 58(2). Retrieved from

Christian, L. (2015). A passion deficit: Occupational burnout and the new librarian: A recommendation report. Southeastern Librarian, 62(4), 2–11. Retrieved from

DelGuidice, M. (2011). Avoiding school librarian burnout: Simple steps to ensure your personal best. Library Media Connection, 29(4), 22–23. Retrieved from

Department for Professional Employees. (2019). Library professionals: Facts & figures. Retrieved August 23, 2019, from

Farrell, B., Alabi, J., Whaley, P., & Jenda, C. (2017). Addressing psychosocial factors with library mentoring. Libraries and the Academy, 17(1), 51-69.

Han, M., Lee, S. E., & Lee, P. A. (2012). Burnout among entering MSW students: Exploring the role of personal attributes. Journal of Social Work Education, 48(3), 439–457.

Kendrick, K. D. (2017). The low morale experience of academic librarians: A phenomenological study. Journal of Library Administration, 57(8), 846-878.

Lindén, M., Salo, I., & Jansson, A. (2018). Organizational stressors and burnout in public librarians. Journal of Librarianship & Information Science, 50(2), 199-204.

Maslach, C. (1982). Burnout: The cost of caring. Englewood Cliffs, NJ: Prentice-Hall.

Maslach, C., & Leiter, M. (2008). Early predictors of job burnout and engagement. Journal of Applied Psychology, 93(3), 498-512.

Maslach, C., & Leiter, M. (2016). Understanding the burnout experience: Recent research and its implications for psychiatry. World Psychiatry, 15(2), 103-111.


Martin, C. (2009). Library burnout: Causes, symptoms, solutions. Library Worklife, 6(12). Retrieved from

Nardine, J. (2019). The state of academic liaison librarian burnout in ARL libraries in the United States. College & Research Libraries, 80(4), 508-524. Retrieved July 24, 2019.

Rogers-Whitehead, C. (2018). Self-care: Protecting yourself (and others) from burnout [webinar]. Retrieved from

Salyers, M. P., Watkins, M. A., Painter, A., Snajdr, E. A., Gilmer, L. O., Garabrant, J. M., & Henry, N. H. (2019). Predictors of burnout in public library employees. Journal of Librarianship and Information Science.

Schaufeli, W., Leiter, M., & Maslach, C. (2009). Burnout: 35 years of research and practice. Career Development International, 14(3), 204–220.

Singer, P., & Griffith, G. (2011). Preventing staff burnout [webinar]. Retrieved

Statista Survey. (February 16, 2017). Percentage of adults in the U.S. who suffered at least sometimes from select health symptoms as of February 2017, by gender [Graph]. In Statista. Retrieved August 26, 2019, from

Swanson, C. P. (1992). Assessment of Stress and Burnout in Youth Librarians (Doctoral dissertation). Retrieved from ERIC.

Wardle, E. A., & Mayorga, M. G. (2016). Burnout among the counseling profession: A survey of future professional counselors. Journal of Educational Psychology, 10(1), 9–15. Retrieved from

Westwood, D. (2017). Burnout or bounce back? Building resilience [webinar]. Retrieved from

Appendix A

Link to Survey Questions

Survey Questions

Relationship Between Working While Enrolled in School and Experiencing Burnout

Figure 1. The relationship between working while enrolled in school and experiencing burnout for former and current librarians.
Group | Worked while in school; have experienced burnout | Worked while in school; have not experienced burnout | Did not work while in school; have experienced burnout | Did not work while in school; have not experienced burnout
Former Librarian | 62.5% | 21.88% | 9.38% | 0%
Current Librarian | 74.84% | 10.87% | 4.26% | 0.85%


Percent Distribution of Hours Worked by Credits Taken, LIS Students

Figure 2. The percent distribution of the hours LIS students worked by the average number of credit hours they were enrolled in.
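Rows: average credit hours enrolled. Columns: hours worked per week.
Credit Hours | 0 | 1-10 | 11-20 | 21-30 | 31-40 | More than 40
3 | 0% | 0% | 0% | 12.5% | 62.5% | 25%
6 | 1.69% | 0% | 16.95% | 5.08% | 59.32% | 16.95%
9 | 6.9% | 3.45% | 24.14% | 24.14% | 27.59% | 13.79%
12 | 8.33% | 0% | 25% | 33.33% | 16.67% | 16.67%
More than 12 | 0% | 0% | 33.33% | 66.67% | 0% | 0%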


Percentage Working or Worked While in Graduate Courses

Figure 3. The percentage of LIS students, current librarians, and former librarians who are working or worked while enrolled in graduate courses.
Status | LIS Students | Current Librarians | Former Librarians
Worked | 96.4% | 94.03% | 84.38%
Did not work | 3.6% | 5.97% | 9.38%
No response | 0% | 0% | 6.25%


Average Hours Spent Working While Enrolled

Figure 4. The average number of hours a week LIS students, current librarians, and former librarians work(ed) while enrolled in graduate courses.
Credit Hours | LIS Students (hours/week) | Current Librarians (hours/week) | Former Librarians (hours/week)
3 | 36.75 | 34.13 | 15.5
6 | 32.69 | 32.64 | 37.16
9 | 26.15 | 24.25 | 25.5
12 | 25.88 | 20.81 | 10.5
More than 12 | 22.16 | 18.22 | 10.5


Why Current LIS Students Work

LIS students’ reasons for working while enrolled in graduate school, given in actual numbers, not percentages:

  • Health care benefits: 1
  • Both income and experience: 77
  • Experience: 1
  • Income: 28
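The counts above can be converted into percentages like the one quoted in the discussion of Figure 5. This is a minimal sketch, assuming the denominator is the 111 LIS students reported as respondents; the function and variable names are illustrative.

```python
# Counts copied from the list above.
reason_counts = {
    "Health care benefits": 1,
    "Both income and experience": 77,
    "Experience": 1,
    "Income": 28,
}

# Number of LIS student respondents reported in the article (assumed denominator).
TOTAL_STUDENTS = 111


def percent_of_students(count, total=TOTAL_STUDENTS):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(count / total * 100, 2)


percentages = {
    reason: percent_of_students(count) for reason, count in reason_counts.items()
}

# Sum of the itemized counts, for comparison with the respondent total.
listed_total = sum(reason_counts.values())

for reason, pct in percentages.items():
    print(f"{reason}: {pct}%")
```

With these inputs, "Both income and experience" works out to 69.37%, matching the figure given in the text. The itemized counts sum to 107, so four of the 111 students are not broken out in this list.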


NDSA Announces Winners of 2019 Innovation Awards / Digital Library Federation


For release October 16, 2019

The NDSA established its Innovation Awards in 2011 to recognize and encourage innovation in the field of digital stewardship. Since then, it has honored 34 exemplary individuals, institutions, projects, educators, and future stewards for their efforts in ensuring the ongoing viability and accessibility of valuable digital heritage.

Today, NDSA adds 5 new awardees to that honor roll during the opening plenary ceremony of Critical Junctures, the NDSA’s 2019 Digital Preservation Conference.


Individual Innovation Award

The NDSA Individual Innovation Award honors individuals making significant, innovative contributions to the digital preservation community. Two worthy individuals are recognized this year.

Dr. Dinesh Katre, Senior Director and Head of the Department of Human-Centred Design and Computing (HCDC) Group at the Centre for Development of Advanced Computing (C-DAC) in India, has established a distinguished record leading the development of innovative technological solutions for digital preservation, trustworthy digital repository certification, data repurposing and intelligent archiving. Over the last several years, he has worked to advocate for, develop, and deploy the Indian National Digital Preservation Programme, which provides a robust and comprehensive platform for the effective long-term preservation of digital materials. As Chief Investigator of the Programme’s flagship project to establish a Center of Excellence for Digital Preservation, Dr. Katre led the process to develop a digital preservation standard for India, as well as domain-specific archival systems and automation tools for digital preservation. He also conceptualized, designed and led the development of DIGITĀLAYA, a software framework that comprehensively implements the OAIS reference model. DIGITĀLAYA has been customized for preservation of electronic office records, audiovisual and document archives and e-governance records. Katre’s efforts culminated in the first repository in the world to achieve ISO 16363 certification. His achievements exemplify the growing international reach of concern and practice in the areas of digital stewardship and preservation.

Tim Walsh is a digital archivist and preservation librarian with varied experience at Harvard, Tufts, the University of Wyoming, and currently, Concordia University. He is also a prolific software developer, and in this capacity has created, and made freely available through his BitArchivist website and GitHub, an evolving suite of robust open source tools meeting many core needs of the stewardship community in appraising, processing, and reporting upon born-digital collections. His projects include the Brunnhilde characterization tool; BulkReviewer, for identifying PII and other sensitive information; the METSFlask viewer for Archivematica METS files; SCOPE, an access interface for Archivematica dissemination information packages; and CCA Tools, for creating submission packages from a variety of folder and disk image sources. Taken together, these tools support a very wide gamut of both technical and curatorial activities. The open availability, documentation, support, and community engagement for a growing ecosystem of mature preservation tools is critical to the successful and sustainable stewardship of the digital materials so critical to contemporary and future commerce, culture, science, entertainment, and education. This work also provides an excellent example of how a lone individual can nevertheless make a substantial positive impact on the complex domain of stewardship practice through dedication, skill, enthusiasm, and community spirit.

Organization Innovation Award

The NDSA Organization Innovation Award honors organizations taking an innovative approach to providing support and guidance to the digital preservation community. Today we recognize two organizations.

The Asociación Iberoamericana de Preservación Digital (APREDIG) is a nonprofit Ibero-American association founded at the end of 2017 in Barcelona, Spain, with the intention of promoting the importance of digital preservation in Spanish-speaking countries. Its activity has culminated in projects and activities to disseminate a Spanish translation of the original NDSA Levels of Preservation, opening up significant new opportunities for expanding digital stewardship best practices, and subsequent outcomes, by practitioners in Spain and Latin America. Led by Dr. Miquel Termens and Dr. David Leija (Universitat de Barcelona), this group of volunteers, researchers, and disseminators of best practices for digital preservation has created an online self-assessment tool to help institutions in Spain and Mexico understand recommendations, key concepts, and simple diagnosis of digital preservation practices using the NDSA Levels as a guideline. The critical importance of effective and sustainable solutions for preserving digital materials transcends institutional and national boundaries. APREDIG’s efforts are a vital example of the growing international reach of stewardship and preservation concerns and applications. Furthermore, they evidence the positive contribution to local and global understanding resulting from the expansion of the community of theory and practice to all interested and engaged participants.

The idea for a Software Preservation Network (SPN) originated in 2014. Since then, it has developed into a vibrant grassroots organization of digital preservation practitioners invested in the future of software preservation. Through multiple federal grants and start-up seed funding, SPN has solidified alliances among international stakeholders (both individuals and organizations) with diverse perspectives, including libraries, archives, and museums. Two separate, but complementary, aspects of SPN’s work are particularly noteworthy. First, its innovative efforts to develop effective techniques and programs for the long-term stewardship of the intermediating software upon which preserved digital resources are inextricably dependent, exemplified by publication of the Code of Best Practices for Fair Use in Software Preservation, and the Emulation-as-a-Service Infrastructure (EaaSI) project researching scalable emulation. Second, Jessica and Zach place critical emphasis on issues of community engagement and organizational sustainability. This work provides an extremely useful case study to the stewardship community of the importance of thoughtful and iterative self-reflection and refinement of organizational strategies, goals, processes, and initiatives to ensure the continued relevance, value, and persistence of programmatic efforts. SPN offers a model for digital stewardship that combines steadfast vision with flexibility and an emphasis on the evolving needs of the organization’s constituents. The award was accepted on behalf of the entire SPN organization and its members by Jessica Meyerson (Educopia Institute) and Zach Vowell (California Polytechnic State University).

Project Innovation Award

The NDSA Project Innovation Award honors projects whose goals or outcomes represent an inventive, meaningful addition to the understanding or processes required for successful, sustainable digital preservation stewardship.

Since its inception in 2016, the Great Migration Home Movie Project (GMHMP) at the Smithsonian Institution’s National Museum of African American History and Culture (NMAAHC) has digitized hundreds of hours of African American home movies and thousands of photographs for families who have visited the Museum in Washington and for those who live across the country, in Baltimore, Denver, and Chicago. In its current iteration, families visiting the Museum are invited to drop off their home movies and films, videotapes, and audiotapes when they arrive for the day, and then pick up their original and digital copies (preserved by a team of professionals) at the end of the day, with the added invitation to donate digital copies to the Museum, enriching its growing collection of vernacular home movies. As explained by Walter Forsberg, founder of the NMAAHC’s Media Conservation and Digitization department, the Great Migration Home Movie Project lowers the “technological barriers-to-entry of audiovisual digitization and directly and proactively addresses the historic obfuscation and exclusion of people of color from traditional archives.” It is thanks to the work of the Great Migration Home Movie Project that not only can these memories be gifted back to families and their future descendants, but, also, that “history is being re-written in a very real and immediate way.” The award was accepted on behalf of the entire GMHMP project team by Candace Ming.

The NDSA Innovation Awards Working Group was led by co-chairs Stephen Abrams (Harvard University) and Krista Oldham (Clemson University), with members Samantha Abrams (Ivy Plus Libraries Confederation), Lauren Goodley (Texas State University), Grete Graf (Yale University), and Kari May (University of Pittsburgh). Aliya Reich at CLIR provided administrative support for the entire awards process.

Please join the Working Group in congratulating the 2019 Innovation Award winners!

The post NDSA Announces Winners of 2019 Innovation Awards appeared first on DLF.

Looking Back at Islandoracon 2019 / Islandora

Last week, the Islandora community came together for its third major conference, dubbed Islandoracon and held in Vancouver, BC. Our hosts at Vancouver Public Library and Simon Fraser University put us up in some fantastic locations (see below!), but the real highlights of the week came from what our attendees brought with them to the workshops, general sessions, and Use-a-Thon.

Vancouver, as seen from Vancouver Public Library's 9th floor


The conference opened with four half-day workshops, with "Islandora 101" style introductions to Islandora 7 and Islandora 8, an overview of Islandora ISLE, and training in how to build your own plugins (PDF). We closed out the main conference with more workshops, running two tracks of 90-minute deep-dives into different use cases and tools, such as Working with Linked Data and Ontologies, Islandora as an IR, and Preservation in Islandora 8.


The main conference opened with updates about the Islandora Foundation, the Islandora community, and the exciting new world of Islandora 8 and its much closer connections to Drupal, before opening up the floor to a wide variety of sessions submitted by members of the Islandora community. We have been gathering up slides and linking to them in the conference schedule wherever we can.


Our first-ever Islandora Use-a-Thon was designed to explore what we can achieve by leaning into Drupal and solving common use cases with contributed modules and configuration. Our community rose to the challenge beyond our wildest expectations, and we'll be spending the next few weeks unpacking the results for you. Every one of the eight entries produced something of real and immediate value for Islandora, giving our judges (Alex Kent, Danny Lamb, Rosie Le Faive, and Bethany Seeger) a real challenge to pick these top three winners:

Third Place: Archives & Archives-Adjacent

Goal: gathering specific use-cases to support archival materials in Islandora 8


Second Place: The Islandora 8 Collectioneers

Goal: "Search within this collection".  Stretch goal:  "Search underneath this collection, including subcollections"


First Place: What’s in a Name(space)?

Goal: Defining a path forward for multi-tenancy implementations in Islandora 8



Overall, it was a pretty incredible week and a real celebration of our community and all that we can achieve together. Next year we'll go back to holding smaller Islandora Camps around the world, but stay tuned for Islandoracon to return in 2021!

Hyku Open Source Institutional Repository Development partnership awarded $1M Arcadia grant to improve open scholarship infrastructure / Samvera

The University of Virginia is pleased to announce a two-year award in the amount of $1,000,000 from Arcadia—a charitable fund of philanthropists Lisbet Rausing and Peter Baldwin—in support of the “Advancing Hyku: Open Source Institutional Repository Platform Development” project.

Through this project, the University of Virginia and its partner institutions—Ubiquity Press and the British Library—will support the growth of open access through institutional repositories. Working with the global open infrastructure community, the partners will introduce significant structural improvements and new features to the Samvera Community’s Hyku Institutional Repository platform. Read the entire announcement at this link.

The post Hyku Open Source Institutional Repository Development partnership awarded $1M Arcadia grant to improve open scholarship infrastructure appeared first on Samvera.

csv,conf returns for version 5 in May / Open Knowledge Foundation

Save the date for csv,conf,v5! The fifth version of csv,conf will be held at the University of California, Washington Center in Washington DC, USA, on May 13 and 14, 2020. 


If you are passionate about data and its application to society, this is the conference for you. Submissions for session proposals for 25-minute talk slots are open until February 7, 2020, and we encourage talks about how you are using data in an interesting way (like to uncover a crossword puzzle scandal). We will be opening ticket sales soon, and you can stay updated by following our Twitter account @CSVconference.


csv,conf is a community conference that is about more than just comma-separated-values – it brings together a diverse group to discuss data topics including data sharing, data ethics, and data analysis from the worlds of science, journalism, government, and open source. Over two days, attendees will have the opportunity to hear about ongoing work, share skills, exchange ideas (and stickers!) and kickstart collaborations.



Attendees of csv,conf,v4

First launched in July 2014,  csv,conf has expanded to bring together over 700 participants from 30 countries with backgrounds from varied disciplines. If you’ve missed the earlier years’ conferences, you can watch previous talks on topics like data ethics, open source technology, data journalism, open internet, and open science on our YouTube channel. We hope you will join us in Washington D.C. in May to share your own data stories and join the csv,conf community!


Csv,conf,v5 is supported by the Sloan Foundation through OKF's Frictionless Data for Reproducible Research grant as well as by the Gordon and Betty Moore Foundation, and the Frictionless Data team is part of the conference committee. We are happy to answer all questions you may have or offer any clarifications if needed. Feel free to reach out to us on Twitter @CSVconference or on our dedicated community Slack channel.


We are committed to diversity and inclusion, and strive to be a supportive and welcoming environment to all attendees. To this end, we encourage you to read the Conference Code of Conduct.

Rojo the Comma Llama

While we won’t be flying Rojo the Comma Llama to DC for csv,conf,v5, we will have other mascot surprises in store.

Now Available: The LYRASIS FY2019 Annual Report / DuraSpace News

From Robert Miller, Chief Executive Officer, LYRASIS

On behalf of the entire team at LYRASIS, I am pleased to announce that the LYRASIS  FY 2019 annual report is now available. It was as fun to pull this together as it was to take on so many wonderful things for our community. It started off as a good year and finished as a great year!
Read the LYRASIS Annual Report here.
Three things that stand out:
  • We are a new LYRASIS through the joining of DuraSpace and LYRASIS.
  • We continue to invest in members with another $1.2M going into our member focused R&D and Channel Development efforts. This brings our grand total to $2.0M+ invested from 2016-2020.
  • Next year’s member summit will be in Philadelphia, PA on Oct 27/28. So mark your calendars and please join us.

The post Now Available: The LYRASIS FY2019 Annual Report appeared first on

Sprockets 4 and your Rails app / Jonathan Rochkind

Sprockets 4.0 was released on October 8th 2019, after several years of beta, congratulations and hooray.

There are a couple confusing things that may give you trouble trying to upgrade to sprockets 4 that aren’t covered very well in the CHANGELOG or upgrade notes, although now that I’ve taken some time to understand it, I may try to PR to the upgrade notes. The short version:

  1. If your Gemfile has `gem 'sass-rails', '~> 5.0'` in it (or just '~> 5'), that will prevent you from upgrading to sprockets 4. Change it to `gem 'sass-rails', '~> 6.0'` to get sass-rails that will allow sprockets 4 (and, bonus, will use the newer sassc gem instead of the deprecated end-of-lifed pure-ruby sass gem).
  2. Sprockets 4 changes the way it decides what files to compile as top-level aggregated compiled assets. And Rails (in 5.2 and 6) is generating a sprockets 4 ‘config’ file that configures something that is probably inadvisable and likely to do the wrong thing with your existing app.
      • If you are seeing an error like Undefined variable: $something this is probably affecting you, but it may be doing something non-optimal even without an error. (relevant GH Issue)
      • You probably want to go look at your ./app/assets/config/manifest.js file and turn //= link_directory ../stylesheets .css to //= link 'application.css'.
      • If you are not yet in Rails 6, you probably have a //= link_directory ../javascripts .js; change this to //= link application.js
      • This still might not get you all the way to compatibility with your existing setup, especially if you had additional top-level target files. See details below.

The Gory Details

I spent some hours trying to make sure I understood everything that was going on. I explored both a newly generated Rails 5.2.3 and 6.0.0 app; I didn’t look at anything earlier. I’m known for writing long blog posts, cause I want to explain it all! This is one of them.

Default generated Rails 5.2 or 6.0 Gemfile will not allow sprockets 4

Rails 5.2.3 will generate a Gemfile that includes the line:

gem 'sass-rails', '~> 5.0'

sass-rails 5.x expresses a dependency on sprockets < 4 , so won’t allow sprockets 4.0.0.

This means that a newly generated default rails app will never use sprockets 4.0. And you can’t get to sprockets 4.0 by running any invocation of bundle update, because your Gemfile links to a dependency requirement tree that won’t allow it.

The other problem with sass-rails 5.x is it depends on the deprecated and end-of-lifed pure-ruby sass gem. So if you’re still using it (say with a default generated Rails app), you may be seeing “please don’t use this” deprecation messages too.

So some people may have already updated their Gemfile. There are a couple ways you can do that:

  • You can change the dependency to gem 'sass-rails', '~> 6.0' (or '>= 5.0', which is what an upcoming Rails release will probably do).
  • But sass-rails 6.0 is actually a tiny little wrapper over a different gem, sassc-rails (which itself depends on non-deprecated sassc instead of deprecated pure-ruby sass). So you can also just change your dependency to gem 'sassc-rails', '~> 2.0', which you may have already done when you wanted to get rid of ruby-sass deprecation warnings, but before sass-rails 6 was released. (Not sure why they decided to release sass-rails 6 as a very thin wrapper on sassc-rails 2.x, and have Rails attempt to still generate a Gemfile with sass-rails.)
  • Either way, you will then have a dependency requirement tree which allows any sprockets `> 3.0` (which is still an odd dependency spec; 3.0.0 isn’t allowed, but 3.0.1 and higher are? It probably meant `>= 3.0`? Which is still kind of dangerous for allowing future sprockets 5, 6 or 7 too…). Anyway, that allows sprockets 3 or 4.

Once you’ve done that, if you do a bundle update now that sprockets 4 is out, you may find yourself using it even if you didn’t realize you were about to do a major version upgrade. Same if you do bundle update somegem: if somegem or something in its dependency tree depends on sprockets-rails or sprockets, you may find it upgraded sprockets when you weren’t quite ready to.

Now, it turns out Rails 6.0.0 apps are in exactly the same spot, all the above applies to them too. Rails intended to have 6.0 generate a Gemfile  which would end up allowing sass-rails 5.x or 6.x, and thus sprockets 3 or 4.

It did this by generating a Gemfile with a dependency that looks like ~> 5, which they thought meant `>= 5` (I would have thought so too), but it turns out it doesn’t; it seems to mean the same thing as ~> 5.0, so basically Rails 6 is still in the same boat. That was fixed in a later commit, but not in time for the Rails 6.0.0 release. Rails 6.1 will clearly generate a Gemfile that allows sass-rails 5/6+ and sprockets 3/4+; I’m not sure about a future 6.0.x.
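The version constraints discussed in this section can be checked directly with the requirement classes that ship with RubyGems. This is only a small sketch; the `accepts?` helper name is made up for illustration, and the version numbers are arbitrary examples.

```ruby
require "rubygems" # Gem::Requirement and Gem::Version ship with Ruby

# Hypothetical helper: does a requirement string accept a version string?
def accepts?(spec, version)
  Gem::Requirement.new(spec).satisfied_by?(Gem::Version.new(version))
end

# "~> 5" and "~> 5.0" both stop short of 6.0.0; ">= 5" does not.
["~> 5", "~> 5.0", ">= 5"].each do |spec|
  results = %w[5.0.0 5.1.0 6.0.0].map { |v| "#{v}=#{accepts?(spec, v)}" }
  puts "#{spec.ljust(7)} #{results.join('  ')}"
end

# "> 3.0" rejects 3.0.0 itself but accepts 3.0.1 and any later major version.
%w[3.0.0 3.0.1 4.0.0 5.0.0].each do |v|
  puts "> 3.0   #{v}=#{accepts?('> 3.0', v)}"
end
```

Running this in plain `ruby` (no Bundler needed) is a quick way to sanity-check a Gemfile constraint before relying on it.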

So, Rails 5.2 won’t allow you to upgrade to sprockets 4 without a manual change, and it turns out accidentally Rails 6 won’t either. That might be confusing if you are trying to update to sprockets 4, but it actually (accidentally) saves you from the confusion that comes from accidentally upgrading to sprockets 4 and finding a problem with how top-level targets are determined. (Although if even before sprockets4 came out you were allowing sass-rails 6.x to avoid deprecated ruby-sass… you will be able to get sprockets 4 with bundle update, accidentally or on purpose).

sprockets-rails built-in logic for determining top-level compile targets CHANGES depending on Sprockets 3 or 4

The sprockets-rails gem actually has a conditional for applying different logic depending on whether you are using Sprockets 3 or 4.  Rails 5.2 or 6 won’t matter; but in either Rails 5.2 or 6, changing from Sprockets 3 to 4 will change the default logic for determining top-level compile targets (the files that can actually be delivered to the browser, and will be generated in your public/assets directory as a result of rake assets:precompile).

This code has been in sprockets-rails since sprockets-rails 3.0, released in December 2015(!). The preparations for sprockets 4 are a long time coming.

This means that switching from Sprockets 3 to 4 can mean that some files you wanted to be delivered as top-level targets no longer are, and other files that you did not intend to be are. In some cases, when sprockets tries to compile a file as a top-level target that was not intended as such, the file actually can’t be compiled on its own without an error, and that’s when you get an error like Undefined variable: $something. The file was meant as a sass “partial” to be compiled in a context where that variable was defined, but sprockets is trying to compile it as a top-level target.

sprockets-rails logic for Sprockets 3

If you are using sprockets 3, the sprockets-rails logic supplies a regexp basically saying the files `application.css` and `application.js` should be compiled as top-level targets. (That might apply to such files found in an engine gem dependency too? Not sure).

And it supplies a proc object that says any file that is in your local ./app/assets (or a subdir), and has a file extension, but that file extension is not `.js` or `.css`  => should be compiled as a top-level asset.

  • Actually not just .js and .css are excluded, but anything sprockets recognizes as compiling to .js or .css, so .scss is excluded too.

That is maybe meant to get everything in ./app/assets/images, but in fact it can get a lot of other things, if you happened to have put them there. Say ./app/assets/html/something.html or ./app/assets/stylesheets/images/something.png.
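To make the Sprockets 3 rules above concrete, here is a rough standalone approximation of them. It is my paraphrase, not the sprockets-rails source: the constant names, the `top_level_target?` helper and the `/myapp` path are all invented for illustration, and the "logical path" is assumed to already carry the compiled extension (so `_mixins.scss` shows up as `_mixins.css`).

```ruby
# Hypothetical app location; in a real app this would come from Rails.root.
APP_ASSETS = "/myapp/app/assets"

# Approximation of the proc: a local file whose compiled extension
# exists and is neither .js nor .css.
LOOSE_APP_ASSET = lambda do |logical_path, filename|
  filename.start_with?(APP_ASSETS) &&
    ![".js", ".css", ""].include?(File.extname(logical_path))
end

# Approximation of the regexp: application.css / application.js only.
APPLICATION_ENTRY = /(?:\/|\\|\A)application\.(css|js)\z/

def top_level_target?(logical_path, filename)
  APPLICATION_ENTRY.match?(logical_path) ||
    LOOSE_APP_ASSET.call(logical_path, filename)
end

examples = {
  "application.css"       => "#{APP_ASSETS}/stylesheets/application.scss",
  "_mixins.css"           => "#{APP_ASSETS}/stylesheets/_mixins.scss",
  "cable.js"              => "#{APP_ASSETS}/javascripts/cable.js",
  "logo.png"              => "#{APP_ASSETS}/images/logo.png",
  "html/something.html"   => "#{APP_ASSETS}/html/something.html"
}

examples.each do |logical_path, filename|
  puts "#{logical_path}: #{top_level_target?(logical_path, filename)}"
end
```

Under these assumptions only the `application.*` entry points and the non-JS/CSS files come out as top-level targets, which matches the behavior described above.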

sprockets-rails logic for Sprockets 4

If you are using sprockets 4, sprockets-rails won’t supply that proc or regexp (and in fact proc and regexp args are not supported in sprockets 4, see below), but will tell sprockets to start with one file: manifest.js.

This actually means any file in any subdir of app/assets (maybe files from rails engine gems too?), but the intention is that this refers to app/assets/config/manifest.js.

The idea is that the manifest.js will include the sprockets link, link_directory, and link_tree methods to specify files to treat as top-level targets.

And possibly surprising you, you probably already have that file there, because Rails has been generating that file for new apps for some time. (I am not sure for how long, because I haven’t managed to find what code generates it. Can anyone find it? But I know if you generate a new rails 5.2.3 or rails 6 app, you get this file even though you are using sprockets 3).

If you are using sprockets 3, this file was generated but not used, due to the code in sprockets-rails that does not set it up for use if you are using sprockets 3. (I suppose you could have added it to Rails.application.config.assets.precompile yourself in config/initializers/assets.rb or wherever). But it was there waiting to be used as soon as you switched to sprockets4.

What is in the initial Rails-generated app/assets/config/manifest.js?

In Rails 5.2.3:

//= link_tree ../images
//= link_directory ../javascripts .js
//= link_directory ../stylesheets .css

This means:

  • Anything in your ./app/assets/images, including subdirectories
  • Anything directly in your `./app/assets/javascripts` (not including subdirs) that ends in `.js`.
  • Anything directly in your `./app/assets/stylesheets` (not including subdirs) that ends in `.css`
    • So here’s the weird thing: it actually seems to mean “any file recognized as a CSS file”, so files ending in `.scss` get included too. I can’t figure out how this works or is meant to work. Can anyone find better docs for what the second arg to `link_directory` or `link_tree` does, or figure it out from the code, and want to share?
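As a way of seeing what those three directives select, here is a toy model over a made-up file list. It is not how sprockets resolves anything internally: `link_tree`, `link_directory`, the `FILES` list and the `SOURCES_FOR` table (my guess at which source extensions count as "a CSS file" or "a JS file") are all assumptions for illustration.

```ruby
# Hypothetical files, relative to ./app/assets
FILES = %w[
  images/logo.png
  images/icons/star.svg
  javascripts/application.js
  javascripts/cable.js
  javascripts/channels/chat.js
  stylesheets/application.scss
  stylesheets/_mixins.scss
  stylesheets/admin/admin.scss
].freeze

# Assumed mapping from an output type to source extensions that compile to it.
SOURCES_FOR = {
  ".js"  => %w[.js .coffee],
  ".css" => %w[.css .scss .sass]
}.freeze

# Toy link_tree: everything under dir, including subdirectories.
def link_tree(files, dir)
  files.select { |f| f.start_with?("#{dir}/") }
end

# Toy link_directory: files directly in dir that compile to the given type.
def link_directory(files, dir, ext)
  files.select do |f|
    File.dirname(f) == dir &&
      SOURCES_FOR.fetch(ext, [ext]).include?(File.extname(f))
  end
end

targets = link_tree(FILES, "images") +
          link_directory(FILES, "javascripts", ".js") +
          link_directory(FILES, "stylesheets", ".css")

puts targets
```

In this model `javascripts/cable.js` and `stylesheets/_mixins.scss` end up as top-level targets alongside the `application.*` files, while anything in a subdirectory of `javascripts` or `stylesheets` is left out.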

Some significant differences between sprockets3 and sprockets4 logic

An initially generated Rails 5.2.3 app has a file at ./app/assets/javascripts/cable.js. It is referenced with a sprockets require from the generated application.js; it is not intended to be a top-level target compiled separately. But a default generated Rails 5.2.3 app, once using sprockets 4, will compile the cable.js file as a top-level target, putting it in `public/assets` when you do rake assets:precompile. Which you probably don’t want.

It also means it will take any CSS file (including .scss)  directly (not in subdir) at ./app/assets/stylesheets and try to compile them as top-level targets. If you put some files here that were only intended to be `imported` by sass elsewhere (say, _mixins.scss), sprockets may try to compile them on their own, and raise an error. Which can be a bit confusing, but it isn’t really a “load order problem”, but about trying to compile a file as a top-level target that wasn’t intended as such.

Even if it doesn’t raise an error, it’s spending time compiling them, and putting them in your public/assets, when you didn’t need/want them there.

Perhaps it was always considered bad practice to put something at the top-level `./app/assets/stylesheets` (or ./app/assets/javascripts?)  that wasn’t intended as a top-level target… but clearly this stuff is confusing enough that I would forgive anyone for not knowing that.

Note that the sprockets-rails code activated for sprockets3 will never choose any file ending in .js or .css as a top-level target, they are excluded. While they are specifically included in the sprockets4 code.

(Rails 6 is in an identical situation to the above, except it doesn’t generate a `link_directory` referencing assets/javascripts, because Rails 6 does not expect you will use sprockets for JS, but will use webpacker instead).

I am inclined to say the generated Rails code is a mistake and it probably should be simply

//= link_tree ../images 
//= link application.js # only in Rails 5.2
//= link application.css

You may want to change it to that. If you have any additional things that should be top-level targets compiled, you will have to configure them separately…

Options for configuring additional top-level targets

If you are using Sprockets 3, you are used to configuring additional top-level targets by setting the array at Rails.application.config.assets.precompile. (Rails 6 even still generates a comment suggesting you do this at ./config/initializers/assets.rb).

The array at config.assets.precompile can include filenames (not including paths), a regexp, or a proc that can look at every potential file (including files in engines I think?) and return true or false.

If you are using sprockets4, you can still include filenames in this array. But you can not include regexps or procs. If you try to include a regexp or proc, you’ll get an error that looks something like this:

`NoMethodError: undefined method `start_with?' for #`
...sprockets-4.0.0/lib/sprockets/uri_utils.rb:78:in `valid_asset_uri?'

While you can still include individual filenames, for anything more complicated you need to use sprockets methods in the `./app/assets/config/manifest.js` (and sprockets really wants you to do this even instead of individual filenames).
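The error message suggests that sprockets 4 ends up calling `start_with?` on each precompile entry, which strings answer but regexps and procs do not. A small audit along those lines can flag entries in an older app before upgrading. This is a sketch in plain Ruby: the `precompile` array below is a made-up stand-in for whatever your app has in config.assets.precompile.

```ruby
# Hypothetical contents of config.assets.precompile in a Sprockets 3 app.
precompile = [
  "admin.css",
  "print.js",
  /\.(png|jpg)\z/,
  ->(logical_path, _filename) { logical_path.start_with?("vendor/") }
]

# Entries that answer start_with? (plain filenames) should be fine;
# regexps and procs are the ones that need to move into manifest.js.
supported, unsupported = precompile.partition do |entry|
  entry.respond_to?(:start_with?)
end

puts "OK for sprockets 4:     #{supported.inspect}"
puts "Needs manifest.js work: #{unsupported.map(&:class).inspect}"
```

In a real app you could run the same `partition` against `Rails.application.config.assets.precompile` from a Rails console while still on sprockets 3.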

The methods available are `link`, `link_directory`, and `link_tree`. The documentation isn’t extensive, but there’s some in the sprockets README, and a bit more in source code in a somewhat unexpected spot.

I find the docs a bit light, but from experimentation it seems to me that the first argument to link_directory and link_tree is a file path relative to the manifest.js itself (does not use “asset load path”), while the first argument to link is a file path relative to some dir in “asset load path”, and will be looked up in all asset load paths (including rails engine gems) and first one found used.

  • For instance, if you have a file at ./app/assets/images/foo/bar.jpg, you’d want //= link foo/bar.jpg since all subdirs of ./app/assets/ end up in your “asset load path”.
  • I’m not sure where what I’m calling the “asset load path” is configured/set, but if you include a //= link for some non-existent file, you’ll conveniently get the “asset load path” printed out in the error message!

The new techniques are not as flexible/powerful as the old ones that allowed arbitrary proc logic and regexps (and I think the proc logic could be used for assets in dependent engine gems too). So you may have to move some of your intended-as-top-level-targets source files to new locations, so you can specify them with the link/link_tree/link_directory functions available; and/or refactor how you are dividing things between asset files generally.

What went wrong here? What should be fixed?

Due to conditional logic in sprockets 3/4, very different logic for determining top-level targets will be used when you update to sprockets 4. This has affected a lot of people I know, but it may affect very few people generally and not be disruptive? I’m not sure.

But it does seem like kind of a failure in QA/release management, making the upgrade to sprockets 4 not as backwards compat as intended. While this roadblock was reported to sprockets in a 4.0 beta release back in January, and reported to Rails too in May, sadly  neither issue received any comments or attention from any maintainers before or after sprockets 4.0 release; the sprockets one is still open, the rails one was closed as “stale” by rails-bot in August.

This all seems unfortunate, but the answer is probably just that sprockets continues to not really have enough maintainers/supporters/contributors working on it, even after schneems’s amazing rescue attempt.

If it had gotten attention (or if it does, as it still could) and resources for a fix… what if anything should be done? I think that Rails ought to be generating the ./app/assets/config/manifest.js with eg //= link application.css instead of //= link_directory ../stylesheets .css.

  • I think that would be closer to the previous sprockets3 behavior,  and would not do the ‘wrong’ thing with the Rails 5.2.3 cable.js file. (In Rails 6 by default sprockets doesn’t handle JS, so cable.js not an issue for sprockets).
  • This would be consistent with the examples in the sprockets upgrading guide.

I think/guess it’s basically a mistake, from inconsistent visions for what/how sprockets/rails integration should or would work over many years with various cooks.

Since (by accident) no Rails has yet been released which will use Sprockets 4 (and the generated manifest.js file) without a manual change to the Gemfile, it might be a very good time to fix this before an upcoming Rails release that does. Because it will get even more confusing to change at a later date after that point.

The difficulties in making this so now:

  • I have been unable to find what code is generating this to even make a PR. Anyone?
  • Finding what code is generating it would also help us find commit messages from when it was added, to figure out what they were intending, why they thought this made sense.
  • But maybe this is just my opinion that the generated manifest.js should look this way. Am I wrong? Should (and will) a committer actually merge a PR if I made one for this? Or is there some other plan behind it? Is there anyone who understands the big picture? (As schneems himself wrote up in the Saving Sprockets post, losing the context brought by maintainers-as-historians is painful, and we still haven’t really recovered).
  • Would I even be able to get anyone with commit privs attention to possibly merge a PR, when the issues already filed didn’t get anyone’s attention? Maybe. My experience is when nobody is really sure what the “correct” behavior is, and nobody’s really taking responsibility for the subsystem, it’s very hard to get committers to review/merge your PR, they are (rightly!) kind of scared of it and risking “you broke it you own it” responsibility.

Help us schneems, you’re our only hope?

My other conclusion is that a lot of this complexity came from trying to make sprockets decoupled from Rails, so it can be used with non-Rails projects. The confusion and complexity here is all about the Rails/sprockets integration, with sprockets as a separate and decoupled project that doesn’t assume Rails, so needs to be configured by Rails, etc. The benefits of this may have been large, and it may have been worth it. But one should never underestimate the complexity and added maintenance burden of making an independent decoupled tool, compared to something that can assume a lot more about its context; that decoupling has made it significantly harder to make sprockets predictable, comprehensible, and polished. We’re definitely paying the cost here. I think a new user to Rails is going to be really confused and overwhelmed trying to figure out what’s going on if they run into trouble.


Nanopore Technology For DNA Storage / David Rosenthal

DNA assembly for nanopore data storage readout by Randolph Lopez et al from the UW/Microsoft team continues their steady progress in developing technologies for data storage in DNA.

Below the fold, some details and a little discussion.

Up to now the UW/Microsoft team have used Illumina's "sequencing by synthesis" (SBS) as the technology for reading the sequence of bases from the DNA strand. But this time they used:
Nanopore sequencing, as commercialized by Oxford Nanopore Technologies (ONT), offers a sequencing alternative that is portable, inexpensive and automation-friendly, resulting in a better option for a real-time read-head of a molecular storage system. Specifically, ONT MinION is a four-inch long USB-powered device containing an array of 512 sensors, each connected to four biological nanopores, ... Each nanopore is built into an electrically resistant artificial membrane. During sequencing, a single strand of DNA passes through the pore resulting in a change in the current across the membrane. This electrical signal is processed in real time to determine the sequence identity of the DNA strand.
Nanopore has two key advantages over SBS for reading data from DNA:
In the context of DNA storage, real-time sequencing enables the ability to sequence until sufficient coverage has been acquired for successful decoding without having to wait for an entire sequencing run to be completed. Moreover, nanopore sequencing offers a clear single-device throughput scalability roadmap via increased pore count, which is very important for viability of DNA data storage.
But there are significant problems too:
Nanopore sequencing presents unique challenges to decoding information stored in synthetic DNA. In addition to a significantly higher error rate compared to SBS, nanopore sequencing of short DNA fragments results in lower sequencing throughput due to inefficient pore utilization. ... existing scalable approaches for writing synthetic DNA rely on parallel synthesis of millions of short oligonucleotides (i.e., 100–200 bases in length) where each oligonucleotide contains a fraction of an encoded digital file. We find that sequencing of such short fragments results in significantly lower sequencing throughput in the ONT MinION. This limitation hinders the scalability of nanopore sequencing for DNA storage applications.
So, for nanopore to be effective, the data needs to be encoded in much longer strands. The detailed reason is that:
The overall yield and quality of a nanopore sequencing run is dependent on the molecular size of the DNA to be sequenced. DNA molecules translocate through the pore at a rate of 450 bases/sec while it can take between 2–4 s for a pore to capture and be occupied by the next DNA molecule. Therefore, short DNA molecules result in a higher number of unoccupied pores over time, which increases the rate of electrolyte utilization above the membrane. This results in a faster loss in polarity and lower sequencing capacity and overall throughput. ONT estimates that the optimal DNA size to maximize sequencing yield is around 8 kilobases while the minimum size is 200 bases. Below 200 bases, event detection and basecalling is not possible.
Their approach is to concatenate many of their short strands:
To achieve this, we implement a strategy that enables both random access and molecular assembly of a given DNA file stored in short oligonucleotides (150 bp) into large DNA fragments containing up to 24 oligonucleotides (~5000 bp). We evaluate Gibson Assembly and Overlap-Extension Polymerase Chain Reaction (OE-PCR) as suitable alternatives to iteratively concatenate and amplify multiple oligonucleotides in order to generate large sequencing reads.
They preferred Gibson Assembly for this task. The remaining problem is the high error rate:
To decode the files, we implement a consensus algorithm capable of handling high error rates associated with nanopore sequencing.
Figure 3a
The data is encoded into "files", strands with an ID tag at each end and the data payload, consisting of a chunk of data and its address, in the middle:
Each file in the oligo pool consists of a set of 150-bases oligonucleotides with unique 20-nucleotide sequences at their 5′ and 3′ ends for PCR-based random-access retrieval (i.e., file ID) and a 110-nucleotide payload encoding the digital information.
The system repeatedly reads many copies of these strands (at least 36 in their previous work) and applies a decoding algorithm to the resulting base sequences to identify and correct errors:
In our previous decoding algorithm, the consensus sequence is recovered by a process where pointers for payload sequences are maintained and moved from left to right, and at every stage of the process the next symbol of the sequence is estimated via a plurality vote. For payload sequences that agree with plurality, the pointer is moved to the right by 1. But for the sequences that do not agree with plurality, the algorithm classifies whether the reason for the disagreement is a single deletion, an insertion, or a substitution. This is done by looking at the context around the symbol under consideration. Once this is estimated, the pointers are then moved to the right accordingly.
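To give a feel for the pointer-and-plurality idea in that description, here is a deliberately simplified toy in Ruby. It is not the team's implementation: the one-symbol lookahead used to tell deletions from substitutions, the function names, and the sample reads are all my own assumptions.

```ruby
# Most common non-nil symbol among the candidates (ties broken arbitrarily).
def plurality(symbols)
  counts = Hash.new(0)
  symbols.each { |s| counts[s] += 1 unless s.nil? }
  counts.max_by { |_, n| n }&.first
end

# Toy consensus: one pointer per read, moved left to right.
def consensus(reads, length)
  pointers = Array.new(reads.size, 0)
  result = +""

  length.times do
    current = reads.each_with_index.map { |read, i| read[pointers[i]] }
    winner = plurality(current)
    break if winner.nil?

    # Context: what the agreeing reads expect to see next.
    upcoming = plurality(
      reads.each_with_index.map { |read, i| read[pointers[i] + 1] if current[i] == winner }
    )
    result << winner

    reads.each_with_index do |read, i|
      if current[i] == winner || current[i].nil?
        pointers[i] += 1            # agrees (or read is exhausted)
      elsif read[pointers[i] + 1] == winner
        pointers[i] += 2            # insertion: skip the extra base
      elsif current[i] == upcoming
        next                        # deletion: read lacks this base, stay put
      else
        pointers[i] += 1            # substitution
      end
    end
  end

  result
end

reads = [
  "ACGTAC",  # clean
  "ACGTAC",  # clean
  "ACTTAC",  # substitution
  "AGTAC",   # deletion
  "ACGGTAC"  # insertion
]
puts consensus(reads, 6)
```

A real decoder would need much more careful handling of ambiguous cases and of the error-correcting code layered on top.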
Nanopore has a higher error rate than SBS, so this algorithm would need even more than 36 copies. By improving the algorithm they instead reduced the minimum number of copies to 22:
The key difference between our new algorithm and our previous implementation is that in cases when disagreements cannot be classified, we do not drop respective payload sequences from further consideration. Instead, we label such payload sequences as being out of sync and attempt to bring them back at later stages. Specifically, every sequence that is out of sync is ignored for several next steps of the algorithm. However, after those steps, we perform search for a match between the last few bases of the partially constructed consensus sequence and appropriately located short substrings of the payload sequence.

If we discover a match, we move the payload sequence pointer to the corresponding location and drop the out of sync label. This allows the payload sequence pointer to circumvent small groups of adjacent incorrect bases, a feature that was not present in the earlier algorithm. This modification to the consensus algorithm allows us to successfully decode from notably lower coverages because more information from the sequencing reads is used in the process.
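The pointer-and-vote idea described in these quotes can be sketched in a few lines of Python. This is only an illustration of the technique, not the authors' implementation: the function names, the classification rules based on one symbol of look-ahead, and the `skip`, `k` and `window` parameters are all assumptions of mine.

```python
from collections import Counter

def plurality(symbols):
    """Return the most common symbol, or None if there are no symbols."""
    return Counter(symbols).most_common(1)[0][0] if symbols else None

def consensus(reads, length, skip=3, k=3, window=2):
    """Hypothetical sketch: skip, k and window are illustrative knobs."""
    ptr = [0] * len(reads)    # current position in each read
    wait = [0] * len(reads)   # > 0 while a read is labelled out of sync
    out = []
    for _ in range(length):
        live = [i for i, r in enumerate(reads) if wait[i] == 0 and ptr[i] < len(r)]
        if not live:
            break
        winner = plurality([reads[i][ptr[i]] for i in live])
        agree = [i for i in live if reads[i][ptr[i]] == winner]
        # Context: the symbol the agreeing reads expect to see next.
        nxt = plurality([reads[i][ptr[i] + 1] for i in agree if ptr[i] + 1 < len(reads[i])])
        out.append(winner)

        # Try to bring back reads whose time-out has just expired.
        for i, r in enumerate(reads):
            if wait[i] == 0:
                continue
            wait[i] -= 1
            if wait[i] > 0:
                continue
            tail = "".join(out[-k:])             # last few consensus bases
            guess = ptr[i] + skip - len(tail)    # where the tail should start
            for s in range(max(0, guess - window), guess + window + 1):
                if r[s:s + len(tail)] == tail:
                    ptr[i] = s + len(tail)       # match: back in sync
                    break
            else:
                ptr[i] += skip                   # no match: keep waiting
                wait[i] = skip

        # Classify each disagreement and move the pointers accordingly.
        for i in live:
            r, p = reads[i], ptr[i]
            has_next = p + 1 < len(r)
            if r[p] == winner:
                ptr[i] = p + 1                   # agrees with plurality
            elif has_next and r[p + 1] == winner:
                ptr[i] = p + 2                   # insertion: skip extra base
            elif r[p] == nxt:
                pass                             # deletion: pointer stays put
            elif has_next and r[p + 1] == nxt:
                ptr[i] = p + 1                   # substitution
            else:
                ptr[i] = p + 1                   # unclassified: out of sync
                wait[i] = skip
    return "".join(out)
```

For example, `consensus(["ACGTACGTAC"] * 3 + ["ACTGTACGTAC", "ACTACGTAC", "ACCTACGTAC", "ACAAACGTAC"], 10)` votes across three clean reads plus one read each with an insertion, a deletion, a substitution and a two-base burst. The burst read is the one that gets labelled out of sync and is later brought back by matching the tail of the consensus.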
One pore can decode a 5000bp strand in 11s, plus 2-4s recovery time. Let's say 14s. That's 308s for 22 reads, which will generate consensus on 2640 bases of data. At the theoretical maximum of 2bit/base this is 5280 bits of data. So the read bandwidth per pore is about 17bit/s.

For comparison, Seagate's Exos 7E8 drives have a sustained transfer rate of 215Mb/s. Matching the read performance of a single hard drive would need about 13M pores, or about $6.2M worth of MinIONs. (ONT does have higher-throughput products; using the PromethION48 would need 88 units at about $52M.) This doesn't account for Reed-Solomon overhead, or the difficulty of ensuring that each strand was read only 22 times among the 13M pores. Although this statement is accurate:
nanopore sequencing offers a clear single-device throughput scalability roadmap via increased pore count, which is very important for viability of DNA data storage.
it rather understates the engineering difficulties involved in scaling up to match competing media.
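The back-of-envelope arithmetic above is easy to reproduce. The sketch below uses only the figures quoted in the text; the constant and function names are mine, and the disk rate is taken as 215 megabits per second, as the text's numbers imply.

```python
# Figures taken from the text above; names are illustrative only.
SECONDS_PER_READ = 14          # 11s decode plus 2-4s recovery, rounded
READS_FOR_CONSENSUS = 22       # minimum coverage with the new algorithm
PAYLOAD_BASES = 2640           # bases of data recovered per strand
BITS_PER_BASE = 2              # theoretical maximum
DISK_BITS_PER_SECOND = 215e6   # assumed: 215Mb/s read as megabits per second

def pore_bandwidth():
    """Bits per second of decoded data from a single pore."""
    total_seconds = SECONDS_PER_READ * READS_FOR_CONSENSUS
    total_bits = PAYLOAD_BASES * BITS_PER_BASE
    return total_bits / total_seconds

def pores_to_match(disk_bits_per_second):
    """Number of pores needed to equal a given sustained read rate."""
    return disk_bits_per_second / pore_bandwidth()

if __name__ == "__main__":
    print(f"per-pore bandwidth: {pore_bandwidth():.1f} bit/s")
    print(f"pores needed: {pores_to_match(DISK_BITS_PER_SECOND) / 1e6:.1f}M")
```

Without rounding the per-pore figure down to 17bit/s the pore count comes out a little under 13M, which is well within the precision of an estimate like this.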

MarcEdit 7 Updates / Terry Reese

This version of MarcEdit 7 includes a wide range of updates.  I’ve documented them below.

You can pick up the download at:

Recover Settings from Backup

If you use MarcEdit’s automated configuration backup tool (which creates backups of all local user files) – restoring from a backup used to be a manual process.  That has been updated.  You can now find an option to restore from backup on the Main Screen, under Help/Troubleshooting Options/Configuration Settings/Restore Configuration Data from Backup


When you use this option, the tool will provide a list of all available backups.  Select the date and restore to that point.


MarcEditor: Delete Record(s) Option

This is a tool that has been in MarcEdit for a while, but not public.  The tool enables users to select individual or ranges of records and delete them from the current file.  Deleted Records correspond to their position (record #) inside the file.


Task Action: Sort By

Tasks now can utilize sorts within them.


Field Insertion/Sort Rules

When using the global editing tools, when new fields are added via the Add Field/Swap Field/Copy Field/Build Field tools – they are added within the numeric group of the record.   If records include fields that are wildly out of order (like US Library of Congress records, which include a 9xx field in the first 2-3 fields) – this can cause insertion problems.  This setting allows users to tell MarcEdit to ignore specific fields or groups of fields for purposes of inserting new field data.


Z39.50/SRU Query Expansion

To support more granular queries – Z39.50/SRU has been expanded to allow searching of multiple indexes, as well as mixed searches of both names and raw queries in both Z39.50 and SRU.


Z39.50/SRU – Automatically Convert to MARC

A new option has been added to the Z39.50/SRU options that allows MarcEdit to automatically convert record data to MARC if it can identify a conversion path.  Presently, the tool is specifically evaluating data for information in MODS or MARCXML.  Will look to expand as necessary.


Moving Data into a Task from a global Edit Screen

I was giving a workshop, and someone asked why it wasn’t easier to move data from a global edit screen into a task.  It’s a good question.  I’ve updated the tool so global edit windows now have the following option: Add to Task


Clicking this button will copy all data to a special clipboard that can then be pasted into a task.

Task Manager: Copy Items between tasks Clipboard expansion

Updated the clipboard that allows data to be moved between tasks.

Replace Function: Task UI Changes

Updated the Replace Function to disable the Find button (as it’s not applicable in the Task Processing)

Saxon.NET, NewtonSoft.Json component updates

Updating core components related to XML, XSLT, XQuery, and JSON processing.

Bug Fix: Batch Processing Tool

Corrected an error that was occurring when processing XML data. 

Bug Fix: Updates to the Harvester Command-Line processing

Updated the command line tool – it appears that the harvester was failing in a way that would make it difficult to understand why data wasn’t being processed.

Bug Fix: Installer

The installer may select the wrong installer version via automatic update if the user is running an Admin version of the software.  This should correct the issue for all installations after this one.

The Library of the Living and the Library of the Dead / Mita Williams

I am in the process of re-organizing my Google Drive and in doing so, I stumbled upon a bit of writing from 2013 that would have been a perfect addition to the post Haunted libraries, invisible labour, and the librarian as an instrument of surveillance which I wrote earlier this year:

When I was a child, the walls of books in the adult section of our modest public library always filled me with unease and even dread. So many books that I would never read. So many books I suspected – even then – that were never read. I was under the impression that all the books were so old that the authors must all be dead. Unlike my refuge – the children’s section of the library, partitioned by a glass door set in a glass wall – this section of the library was dark and largely silent. The books were ghosts.

I am imagining a library that is made up of two distinct sections. These sections may be on separate floors. They may be in separate buildings. But these sections must be separated and distinct.

One of these sections would be ‘The Library of the Living’. It would be comprised of works by authors who still walked on the earth, somewhere, among us. The other section would be ‘The Library of the Dead’.

When an author passes from the earthly realm, a librarian would take their work from the Library of the Living and bring it, silently, to the Library of the Dead.

And at the end of this text was this:

“We don’t have much time, you know. We need to find the others. We need to find mentors. We need to be mentors. We don’t have much time.”

Reflections on the evolving research library: a report from the OCLC Library Futures Conference / HangingTogether

Last week the Research Library Partnership led a pre-conference workshop at the OCLC Library Futures: Community Catalysts conference in Phoenix, Arizona, in conjunction with the meeting of the OCLC Americas Regional Council (ARC).

Convening “Two Loops”

As the theme of the conference was about catalyzing change, we also focused our pre-meeting workshop on how systems change, using a model called the “Two Loops Model” developed by Margaret Wheatley and the Berkana Institute. This model is one of many models and exercises the RLP uses in our in-person gatherings to provide structure for library leaders to take a break from day-to-day concerns and reflect deeply upon the macro scale changes taking place in libraries today.

In Two Loops, Wheatley has created a metaphor that is built not on a mechanistic view of how systems work, but one grounded in how processes, systems, and organizations arise, gain and lose momentum, and finally die. Wheatley’s model suggests a natural arc for all lives, systems, and processes, and also honors and respects the variety of leadership capabilities needed at all stages of that arc.

Debbie Frieze offers an excellent overview of the Two Loops model in a video on the Berkana Institute’s web page, and I highly recommend it as seven minutes well spent.

During the course of our event, we explained the Two Loops model and invited participants to identify and physically represent where their work or organization fit into the model.

Event participants described themselves as leading in a variety of roles

Attendees then gathered into a variety of small group configurations to discuss change, leadership roles, and to share experiences. Participants found this exercise helped them to think differently about how things work (or don’t work), sharing insights such as:

Stewarding is one of the leadership capabilities discussed in the Two Loops model

  • Many different types of leadership are needed to optimally manage change. One participant said, “I need to be spending more time supporting [leadership for processes that are in decline] and less on some other parts of my organization.”
  • Some participants initially responded negatively to one or more of the roles, particularly those that involved caring for processes in their final decline. However, several shared that through this exercise they have a new respect for the significant leadership required in all roles.
  • Participants frequently identified themselves in multiple roles
  • Emotion was a recurring theme in our group discussions, as change can ignite fear, anxiety, and conflict.

This workshop was a great way to kick off the Library Futures conference, tying into the larger themes of library change emphasized throughout that event program. We look forward to reprising this activity in Europe at the gathering of the EMEA Regional Council (EMEARC) at the OCLC Library Futures: Community Catalysts conference in Vienna in March 2020.

The post Reflections on the evolving research library: a report from the OCLC Library Futures Conference appeared first on Hanging Together.

September 2019 ITAL Issue Now Available / LITA

The September 2019 issue of Information Technology and Libraries (ITAL) is available now. In this issue, ITAL Editor Ken Varnum announces six new members of the ITAL Editorial Board. Our content includes a recap of Emily Morton-Owens’ President’s Inaugural Message, “Sustaining LITA,” discussing the many ways LITA strives to provide a sustainable member organization. In this edition of our “Public Libraries Leading the Way” series, Thomas Lamanna discusses ways libraries can utilize their current resources and provides ideas on how to maximize effectiveness and roll new technologies into operations in “On Educating Patrons on Privacy and Maximizing Library Resources.”

Featured Articles:

“Library-Authored Web Content and the Need for Content Strategy,” by Courtney McDonald and Heidi Burkhardt

Increasingly sophisticated content management systems (CMS) allow librarians to publish content via the web and within the private domain of institutional learning management systems. “Libraries as publishers” may bring to mind roles in scholarly communication and open scholarship, but the authors argue that libraries’ self-publishing dates to the first “pathfinder” handout and continues today via commonly used, feature-rich applications such as WordPress, Drupal, LibGuides, and Canvas. Although this technology can reduce costly development overhead, it also poses significant challenges. Read more.

“Use of Language-Learning Apps as a Tool for Foreign Language Acquisition by Academic Libraries Employees,” by Kathia Ibacache

Language-learning apps are becoming prominent tools for self-learners. This article investigates whether librarians and employees of academic libraries have used them and whether the content of these language-learning apps supports foreign language knowledge needed to fulfill library-related tasks. The research is based on a survey sent to librarians and employees of the University Libraries of the University of Colorado Boulder (UCB), two professional library organizations, and randomly selected employees of 74 university libraries around the United States. Read more.

“Is Creative Commons A Panacea for Managing Digital Humanities Intellectual Property Rights?,” by Yi Ding

Digital humanities is an academic field applying computational methods to explore topics and questions in the humanities field. Digital humanities projects, as a result, consist of a variety of creative works different from those in traditional humanities disciplines. Born to provide free, simple ways to grant permissions to creative works, Creative Commons (CC) licenses have become top options for many digital humanities scholars to handle intellectual property rights in the US. Read more.

“Am I on the library website?,” A LibGuides Usability Study by Suzanna Conrad and Christy Stevens

In spring 2015, the Cal Poly Pomona University Library conducted usability testing with ten student testers to establish recommendations and guide the migration process from LibGuides version 1 to version 2. This case study describes the results of the testing as well as raises additional questions regarding the general effectiveness of LibGuides, especially when students rely heavily on search to find library resources. Read more.

“Assessing the Effectiveness of Open Access Finding Tools,” by Teresa Auch Schultz, Elena Azadbakht, Jonathan Bull, Rosalind Bucy, and Jeremy Floyd

The open access (OA) movement seeks to ensure that scholarly knowledge is available to anyone with internet access, but being available for free online is of little use if people cannot find open versions. A handful of tools have become available in recent years to help address this problem by searching for an open version of a document whenever a user hits a paywall. This project set out to study how effective four of these tools are when compared to each other and to Google Scholar, which has long been a source of finding OA versions. Read more.

“Creating and Developing USB Port Covers at Hudson County Community College,” by Lotta Sanchez and John DeLooper

In 2016, Hudson County (NJ) Community College (HCCC) deployed several wireless keyboards and mice with its iMac computers. Shortly after deployment, library staff found that each device’s required USB receiver (a.k.a. dongle) would disappear frequently. As a result, HCCC library staff developed and deployed 3D printed port covers to enclose these dongles. This, for a time, proved very successful in preventing the issue. This article will discuss the development of these port covers, their deployment, and what worked and did not work about the project. Read more.

Submit Your Ideas

Contact ITAL Editor Ken Varnum at with your proposal. Current formats are generally:

  • Articles – original research or comprehensive and in-depth analyses, in the 3000-5000 word range.
  • Communications – brief research reports, technical findings, and case studies, in the 1000-3000 word range.

Questions or Comments?

For all other questions or comments related to LITA publications, contact us at (312) 280-4268 or

Library Workers and Resilience: More Than Self-Care / Shelley Gullikson

An article in the Globe and Mail this spring about resilience was a breath of fresh air—no talk about “grit” or bootstraps or changing your own response to a situation. It was written by Michael Ungar, the Canada Research Chair in Child, Family, and Community Resilience at Dalhousie University and leader of the Resilience Research Centre there. The research shows that what’s around us is much more important than what’s inside us when it comes to dealing with stress.

The article was adapted from Ungar’s book, the now-published Change Your World: The Science of Resilience and the True Path to Success. I know, the title is a little cringey. And honestly, some of the book veers into self-help-style prose even as it decries the self-help industry. But on the whole, there is quite a lot that is interesting here. I was looking at it for an upcoming project on help-seeking, but it keeps coming to mind during discussions about self-care and burnout among library workers.

Ungar writes of the myth of the “rugged individual” who can persevere through their own determination and strength of character. We get fed a lot of stories about rugged individuals, but Ungar has found that when you look closely at them, what you find instead are “resourced individuals”—people who have support from the people and environment around them.

“Resilience is not a do-it-yourself endeavor. Striving for personal transformation will not make us better when our families, workplaces, communities, health care providers, and governments provide us with insufficient care and support.” (p.14)

Ungar is mostly focused on youth but also writes about workplaces, even though this is not his direct area of research. Two passages in particular caught my eye: “Every serious look at workplace stress has found that when we try and influence workers’ problems in isolation, little change happens. … Most telling, when individual solutions are promoted in workplaces where supervisors do not support their workers… resilience training may actually make matters worse, not better.” (p.109)

A now-removed article in School Library Journal explained how one library worker changed herself to deal with her burnout. The reaction to this article was swift and strong. Many of us know that individual stories of triumph over adversity are bullshit, particularly when we have seen those same efforts fail in our own contexts. I have found it validating to find research backs that up.

Ungar does allow that there are times when changing oneself can work—either a) when stress is manageable and we already have the resources (if you can afford to take two weeks off to go to a meditation retreat, why not), or b) when there is absolutely nothing else you can do to change your environment or circumstances (your job is terrible but you can’t leave it and you’ve tried to do what you can to improve things, so sure take some time to meditate at your desk to get you through your day). But most of us live somewhere between perfectly-resourced and completely hopeless. So what needs to be fixed is our environment, not ourselves.

I have noticed resilience has been coming up as a theme in my own university over the last year or so—workshops on becoming more resilient or fostering resilient employees. Ungar says “To be resilient is to find a place where we can be ourselves and be appreciated for the contributions that we make.” That’s not something individuals can do by themselves. People in leadership positions would do well to better understand the research behind resilience rather than the self-help inspired, grit-obsessed, bootstraps version. Workshops and other initiatives that focus on individuals will not fix anything. At best, they are resources for people who are already doing pretty well. At worst, they add to the burden of people already struggling by making them feel like their struggles are caused by their own insufficiency.

Anyway, these are just some thoughts based on a single book; I’m nowhere in the realm of knowledgeable on this subject. But I thought it might be helpful to share that there is research that backs up the lived experience of the many library workers who struggle in their organizations, despite their own best efforts.


DLF Featured Sponsor – ZONTAL / Digital Library Federation

Featured Blog Post by 2019 DLF Forum Sponsor ZONTAL

Hi, I am Dennis, a physics professor and data management expert representing ZONTAL. I am excited to become part of your digital library ecosystem.

For the last 4 years I have focused on data preservation strategies for global pharmaceutical enterprises. 

After running a successful proof of concept with Brigham Young University library, I am convinced that our pharma solution also yields great benefits for other libraries.

I am attending the DLF to learn more about the current challenges your IT department faces and the best solutions that are available. I would be thrilled to meet you at the ZONTAL Space vendor table. Shoot me an email at: to schedule a slot ahead, or just drop by the table. See you in Tampa!

Best regards,


The post DLF Featured Sponsor – ZONTAL appeared first on DLF.

“Future Proofing” of Cataloging / HangingTogether

Jessie Eastland, Moon in Sunrise Sky, Wikimedia Commons CC-BY-SA-3.0

That was the topic discussed recently by OCLC Research Library Partners metadata managers, initiated by Melanie Wacker of Columbia, Daniel Lovins of Yale and Roxanne Missingham of Australian National University. Metadata departments not only need to focus on current requirements for their metadata in the library catalog or repositories, but also need to ensure that they look ahead to future uses of their metadata in emerging services. The work of the PCC Task Group on URIs in MARC and the PCC ISNI Pilot are network-level efforts; involving metadata staff in academic projects, research data, or identity management tasks are examples taking place on the local level. As technologies change there will be new opportunities to unleash the power of our metadata in legacy records for future, different interactions and uses. Our cataloging heritage equips us to use metadata for revealing collections in new ways beyond our current systems.

Our discussions focused on identifiers, viewed as a transition bridge from legacy and current metadata to future applications. Although few identifiers are now leveraged as they could be, many institutions are adding ISNIs and FAST headings to their catalog records, and for records describing materials in Institutional Repositories, ORCIDs and DOIs. Incentives for researchers to use ORCIDs include facilitating the population of faculty research profiles, peer-to-peer networking, and automatically compiling their lists of publications regardless of their institutional affiliations at the time they were published. Australian National University views ORCIDs as the “glue” that holds together four arms of scholarly work: publishing, repository, library catalog, and researchers. Identifiers from the Library of Congress (e.g., lccn and are commonly used in library catalog records; a number of institutions are also including identifiers for digital collections, photo archives, and archives.

Some institutions have contracted with third parties to add URIs to their MARC records but have discovered that if this is done by string-matching, there can be many false matches. Even if library systems do not yet make use of identifiers, third parties such as Google Books and HathiTrust rely on identifiers for service integration. Embedding geo-coordinates in metadata or URIs supporting API calls to GeoNames can support map visualizations. FAST includes geographic names that link to their geo-coordinates. The University of Minnesota is contributing to the Big Ten Academic Alliance’s Geospatial Data Project, which supports creating and aggregating metadata describing geospatial data resources and making them discoverable through an open source Geoportal.

Some metadata managers have been experimenting with creating entities in Wikidata, thereby minting Wikidata identifiers. Wikidata can be viewed as an identifier hub, which aggregates different identifier schemes pointing to the same object. In a recent OCLC Research Works in Progress webinar, Case Studies from Project Passage Focusing on Wikidata’s Multilingual Support, Xiaoli Li compared creating identifiers in Wikidata and the LC/NACO authority file and concluded that Wikidata offers more opportunities for providing richer information with more references and links to related entities.

One appealing use case for identifiers is to bridge the different systems (referred to as an “archipelago of systems”) used across an institution. Although we will likely continue to live in a world of multiple identifiers with varying degrees of overlap, hubs that can show “same as” relationships among different identifiers could support an infrastructure that brings together resources described by different data content standards. We see disciplinary differences in the way that scholars want to interact with and navigate data and outputs, and no one “identifier hub” will ever be comprehensive. A few are considering setting up local “data stores” to aggregate the identifiers and other metadata used across their different systems, such as Harvard’s Library Cloud. The British Library is working on a metadata model that would support the flow of metadata across all its systems.

In the meantime, libraries struggle to weigh the future potential benefits of identifiers with current workloads.  Publishers serve as a key player in the metadata workstream, but publisher data does not currently include identifiers. The British Library is working with five UK publishers to add ISNIs to their metadata as a promising proof-of-concept. (The BL’s Cataloging-in-Publication records, created before a work’s publication, are disseminated with ISNIs.)  The ability to batch-load or algorithmically add identifiers in the future is on metadata managers’ wish-list.  

Everyone hopes that wider use of identifiers will provide the means for future systems to provide a richer user experience and increased discoverability on the semantic web. Let’s not just “get the web to come to us”: as libraries become more a part of the semantic web, metadata specialists will be freed from having to re-describe things already described elsewhere.

The post “Future Proofing” of Cataloging appeared first on Hanging Together.

Invitation to participate in a new project: Help open journals’ deep backfiles / John Mark Ockerbloom

As I’ve noted here previously, there’s a wealth of serial content published in the 20th century that’s in the public domain, but not yet freely available online, often due to uncertainty about its copyright (and the resulting hesitation to digitize it).  Thanks to IMLS-supported work we did at Penn, we’ve produced a complete inventory of serials from the first half of the 20th century that still have active copyright renewals associated with them. And I’ve noted that there was far more serial material without active copyright, as late as the 1960s or even later.  We’ve also produced a guide to determining whether particular serial content you may be interested in is in the public domain.

Now that we’ve spent a lot of time surveying what is still in copyright though, it’s worth turning more focused attention to serial content that isn’t in copyright, but still of interest to researchers.  One way we can identify journals whose older issues (sometimes known as their “deep backfiles”) are still of interest to researchers and libraries is to see which ones are included in packages that are sold or licensed to libraries.   Major vendors of online journals publish spreadsheets of their backfile offerings, keyed by ISSN.  And now, thanks to an increasing amount of serial information in Wikidata (including links to our serials knowledge base) it’s possible to systematically construct inventories of serials in these packages that include, or might include, public domain and other openly accessible content.

We’ve now done that, for packages from some of the big online journal publishers (as of today, including Elsevier, SAGE, Springer, Taylor & Francis, and Wiley).  We’ve also included inventories for the JSTOR journal platform, which has deep backfiles of sought-after journals from many publishers.  The inventories can be found here, and they include information we’ve gathered about first copyright renewals, when known, and also about free online volumes and issues that we know about.

Many of the journals in our inventory tables have “Unknown” under the “first renewal” column.  That’s where you can help.  If you find journals of interest to you in any of these tables, you can research their copyrights and send us what you find out.  Thanks to our prior inventory work, the process can be boiled down to a few basic steps:

  1. Find a journal of interest to you in one of the tables.
  2. If its first renewal is reported in the table as “Unknown”, look to see if it’s mentioned in our “first copyright renewals for periodicals” inventory.  (While most journals listed there are now linked with Wikidata, those that are not may still report “Unknown” in our tables.  We’re trying to fill those in as soon as we can manage.)
  3. If you don’t find the journal listed there, then search for it in the Copyright Office’s registered works database.  (We describe how to search that database in Appendix A of our “Determining copyright status of serial issues” guide.)
  4. Use the “contact us” link in the journal’s row in our inventory table to go to a form where you can tell us the earliest renewed item you found for the journal in steps 2 and/or 3.  Also tell us, if you know, whether the journal was published in the United States, or only elsewhere.   (Works not published in the US might be exempt from copyright renewal requirements.)

After you send this information, we’ll update our knowledge base and tables accordingly (possibly after doing our own verification).  You can also use the “contact us” form to inform us of free online issues we don’t yet mention, or contribute more detailed copyright information about a particular journal if you’re so inclined.  All copyright and listings data we publish will be put in the public domain via a CC0 dedication.

If you’re willing and able to contribute information for a large number of serials, there may be more efficient ways for us to exchange information, and I invite you to get in touch with me.  Also, if time permits, we can do our own research on serials that you’re interested in, so if you find a favorite in the tables and aren’t confident about researching it yourself, just follow its “contact us” link, and click on the “submit” button near the bottom of the form that comes up.

All of the journal providers mentioned above (none of whom have had any involvement with this project to date) also offer extensive archives of content that’s still under copyright.   Their newer journals from the era of automatic copyright renewals (1964 and later) aren’t generally represented in the tables we currently provide, which focus on the mostly-older content that’s openly or potentially openly available.  You can follow the links on the provider names above to get information on the complete offerings of those providers available for subscription or purchase.

Our Deep Backfile project is very much a work in progress, but I think it’s far enough along now to be useful for folks interested in finding and sharing information about serials in the public domain of research interest. I’m looking forward to hearing others’ thoughts, questions, and suggestions, and to making our knowledge base more populated and useful.



Real-Time Gross Settlement / David Rosenthal

Cryptocurrency advocates appear to believe that the magic of cryptography makes the value of trust zero, but they’re wrong. Follow me below the fold for an example that shows this.

There are two fundamentally different types of systems financial institutions use to implement transactions. In a net system bank A batches up outgoing and incoming transactions with bank B, typically for a day, then the bank with more outgoing than incoming funds sends the other bank a single transaction for the difference. In a gross system bank A executes each outgoing transaction to bank B separately.

If, at the end of the day in a net system, bank B owes bank A money, bank A has in effect made bank B an interest-free day-long loan. So the participants in a net system need to trust each other to repay these loans. In a gross system each transaction stands alone; there are no such loans and thus no need for trust. Instead there is a need for cash reserves. Capital must be available to cover the worst-case sequence of outgoing transactions. In practice this is much greater than the end-of-day difference between outgoing and incoming transactions. The advantage of lack of trust comes at the cost of the foregone interest on the greater cash reserves. The trust in a net system creates a pool of zero-cost liquidity.
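A toy example makes the difference between the two reserve requirements concrete. This is a sketch under simplifying assumptions: a single pair of banks, one day, and payment amounts that I have made up for illustration. The function names are hypothetical too.

```python
def net_settlement(payments):
    """End-of-day net position of bank A.

    `payments` is a list of signed amounts from bank A's point of view:
    negative for outgoing, positive for incoming.
    """
    return sum(payments)

def gross_reserve_needed(payments):
    """Cash bank A must hold to settle every payment as it happens.

    This is the deepest point the running balance reaches if the
    payments are executed one by one in the given order.
    """
    balance = 0
    lowest = 0
    for amount in payments:
        balance += amount
        lowest = min(lowest, balance)
    return -lowest

if __name__ == "__main__":
    day = [-100, -50, +120, -30, +70]   # made-up amounts
    print("net position at end of day:", net_settlement(day))
    print("reserve needed for gross settlement:", gross_reserve_needed(day))
```

With these made-up numbers bank A ends the day 10 ahead, so under net settlement a single payment of 10 from bank B closes the books. Under gross settlement bank A needs 150 on hand to get through the first two outgoing payments before anything comes back.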

In the days when transactions involved paper checks, daily net systems were natural. But when electronic systems became common, it became obvious that a real-time, and thus necessarily gross, system was possible. Wikipedia explains:
Real-time gross settlement (RTGS) systems are specialist funds transfer systems where the transfer of money or securities takes place from one bank to any other bank on a "real time" and on a "gross" basis. Settlement in "real time" means a payment transaction is not subjected to any waiting period, with transactions being settled as soon as they are processed. "Gross settlement" means the transaction is settled on one-to-one basis without bundling or netting with any other transaction. "Settlement" means that once processed, payments are final and irrevocable.

RTGS systems are typically used for high-value transactions that require and receive immediate clearing. In some countries the RTGS systems may be the only way to get same day cleared funds and so may be used when payments need to be settled urgently. However, most regular payments would not use a RTGS system, but instead would use a national payment system or automated clearing house that allows participants to batch and net payments. RTGS payments typically incur higher transaction costs and usually operated by a country's central bank.
Wikipedia is wrong about the “high-value” part. Using my UK bank’s website, I can make small real-time payments at zero cost.

When central banks introduced RTGS, it had unanticipated results. David Gerard writes:
Izabella Kaminska takes you through how real-time gross settlement works in the real-life banking system — and how instant settlement turned out to be deadly to liquidity, and what this means for central bank digital currencies. “Banks needed funding not credit because without such funds in situ, real-time settlement could not be contractually achieved. That pre-funding need, however, would heighten the system’s sensitivity to logjams imposed by single institutions, and with it threaten total system gridlock.” See also Izzy’s Twitter thread of research — “What i find most interesting about all this is how it ultimately relates to the cost of pre-funding tx in any real-time system, and the degree such a set up will always be more expensive than a netting-based alternative that operates on trust.”
Kaminska's history of RTGS is a fascinating read.

The whole point of cryptocurrency systems is to eliminate the need for trust. Thus they have to be gross rather than net systems. Thus Bitcoin is gross, but not real-time because of the need to wait (typically six block times or one hour) to be certain that the transaction is irreversible. This delay, and the uncertain and potentially large transaction fees, have made Bitcoin useless for retail transactions. The Lightning Network was intended to fix this problem, by providing cheap, near-real-time gross transactions. But it is another example of the cost of lack of trust, because of the need to provide the channels with liquidity. David Gerard asks:
"How good a business is running a Lightning Network node? LNBig provides 49.6% ($3.7 million in bitcoins) of the Lightning Network’s total channel liquidity funding — that just sits there, locked in the channels until they’re closed. They see 300 transactions a day, for total earnings on that $3.7 million of … $20 a month. They also spent $1000 in channel-opening fees."

So $20/month earned by a $3.7M investment makes it worth the risk of being indicted for violating the Bank Secrecy Act? Because decentralization!
$20/month for 300 transactions/day is about 0.2c per transaction. To cover 1% interest on $3.7M would need about 34c per transaction. But raising fees by a factor of around 170 would certainly reduce the number of transactions significantly, which would require raising fees further, driving an economic death spiral.
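The arithmetic behind these figures can be reproduced in a few lines. This is only a sketch: the 30-day month and the reading of "1% interest" as an annual rate are assumptions, and the variable names are invented here.

```python
MONTHLY_EARNINGS_USD = 20.0       # from the quoted LNBig figures
TX_PER_DAY = 300                  # from the quoted LNBig figures
LOCKED_LIQUIDITY_USD = 3_700_000  # from the quoted LNBig figures
DAYS_PER_MONTH = 30               # assumption
ANNUAL_INTEREST_RATE = 0.01       # assumption: the 1% is an annual rate

tx_per_month = TX_PER_DAY * DAYS_PER_MONTH

# What each transaction earns today, in cents.
current_fee_cents = 100 * MONTHLY_EARNINGS_USD / tx_per_month

# What each transaction would have to earn to cover the interest, in cents.
monthly_interest_usd = LOCKED_LIQUIDITY_USD * ANNUAL_INTEREST_RATE / 12
break_even_fee_cents = 100 * monthly_interest_usd / tx_per_month

# How many times larger the break-even fee is than the current fee.
fee_multiple = break_even_fee_cents / current_fee_cents

print(f"current fee:    {current_fee_cents:.2f}c per transaction")
print(f"break-even fee: {break_even_fee_cents:.2f}c per transaction")
print(f"multiple:       {fee_multiple:.0f}x")
```

Under these assumptions the unrounded multiple comes out near 154; dividing the rounded 34c by the rounded 0.2c gives the factor of 170. Either way the conclusion is the same: fees would have to rise by more than two orders of magnitude.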

If running a Lightning Network node were to be even a break-even business, the transaction fees would have to more than cover the interest on the funds providing the channel liquidity. But this would make the network un-affordable compared with conventional bank-based electronic systems, which can operate on a net basis because banks trust each other.

Intra-campus collaboration around research support: Formal or informal? / HangingTogether

Photo by Aubrey Rose Odom on Unsplash

Think about intra-campus collaboration around research support at your institution – how does it take place? How are campus stakeholders brought together around these services?

  • Does it occur through “top down” standing committees, task forces, working groups, policies, or other types of formalized inter-unit arrangements?
  • Or is it catalyzed through “informal” channels, driven by personal relationships, influential champions, and serendipitous discovery of mutual interests?

Research support services are a growing part of the academic library mission, but the provision of these services is often the joint responsibility of multiple campus units, working in partnership with the library. Consider, for example, a research data repository that is maintained through campus IT services; data documentation, description, and deposit support offered through the library; and repository metadata funneled into the institutional research information management system operated by the Research Office. In our recent Realities of RDM report series, we noted the prevalence of multi-unit responsibility for research data management (RDM) services, with RDM capacity often branded at the institutional level rather than under the aegis of any single campus unit.

Recent discussions by the OCLC Research Library Partnership (RLP) Research Support Interest Group gave us an opportunity to hear about the ways intra-campus collaborations around research support services played out. Some key observations from the discussions include:

Collaborative relationships on campus are both formal and informal

In the discussion, participants supplied examples of both formal and informal collaborations enabling research support services.

One participant offered an interesting example of a hybrid approach: she built relationships with colleagues on campus in order to be included on formal committees that afforded opportunities for collaboration. In this case, cultivating relationships through informal channels led to more formal collaborative arrangements. Another participant pointed out that while collaborations can begin informally, strengthening them through more formal arrangements, such as an MOU, might become important over time.

Who you know is important … build your networks

Although participants acknowledged the importance of both formal and informal channels for catalyzing collaboration, the discussion seemed to return again and again to the more informal approach of proactively building cross-unit personal relationships, and having those relationships drive collaboration.

This can be especially important for librarians, whom campus stakeholders may not readily imagine as prospective partners for research support services. Building relationships across campus can raise the profile of the library and its resources in communities that historically have had little contact with librarians. For example, one participant noted the value of an invitation to speak at the National Organization of Research Development Professionals (NORDP) conference, which led to many positive interactions within that stakeholder community.

Sustainability can be an issue for informal collaborations

Good collaborations can only realize their full value if they are sustainable. In this case, formal collaborative arrangements, with clearly delineated responsibilities, accountability, and explicit budget lines, may have the advantage.

Several participants noted concerns about the persistence of initiatives undertaken through informal collaborations, where there is a collective (cross-unit) stake in a project, but no one necessarily treats it as a priority.

Sustainability can also be an issue when the provision of a service is too tightly linked with a particular individual – if that person leaves, the service may disappear. For example, one participant noted that an RDM “office hours” service at their institution ended abruptly when the person responsible for it changed jobs.

Appeal to partners’ missions

Several participants emphasized that successful collaborations on campus – formal or informal – were usually the result of tying the effort into the various partners’ parochial missions. In other words, in advocating for the benefits of a given collaboration, it is not enough to say, “we can collaborate on a service”, but rather, “we can collaborate on a service that, if successful, will advance your organizational mission to the campus”.

It is important to keep in mind that each campus partner – including the library – brings a set of organizational interests, special expertise and perspectives, and established ways of getting things done. An important aspect of successful collaboration around research support services is finding complementarities while at the same time managing differences.

While these observations were raised in the context of discussions of campus collaboration around research support services, it is worth noting that they are sound advice for any kind of collaboration, both on campus and beyond.

Provision of research support services is often something that cannot be addressed by a single campus unit; rather, it is a task that must be parceled out to different campus units according to their special expertise and resources. Because of this, effective cross-unit collaboration, whether formal or informal, is an essential ingredient for making research support services robust and sustainable. The importance of this topic, and its increasing relevance in the library service environment, is the motivation behind a new OCLC Research project, in which we are taking a close look at intra-campus collaborations to build, deploy, and sustain research support services. Stay tuned for updates on this project! And if you are affiliated with a Research Library Partnership member institution, please consider joining the Research Support Interest Group.

Thanks to my colleague Rebecca Bryant for helpful advice on improving this post!

The post Intra-campus collaboration around research support: Formal or informal? appeared first on Hanging Together.

DLF Forum Featured Sponsor – OCLC / Digital Library Federation

Featured Blog Post by 2019 DLF Forum Sponsor, OCLC

Open New Research Opportunities with Your Digital Collections

by Jullianne Ballou
Project Librarian, Harry Ransom Center, The University of Texas at Austin
Unidentified photographer. Gabriel García Márquez with Fidel Castro, undated. Courtesy Harry Ransom Center.

The Harry Ransom Center at The University of Texas (UT) at Austin holds more than 28,000 pages of writing, research, and photographs from the collection of Colombian-born writer Gabriel García Márquez, and it has made many of these available online in a digital archive. As Project Librarian Jullianne Ballou explained, “A lot of our researchers are textual editors curious about differences across versions of a manuscript or about the writer’s process.”

The digital collection on CONTENTdm® makes use of integrated IIIF APIs and a Mirador viewer to give researchers the chance to make these comparisons. She described IIIF as a “standard way to seamlessly transport images across browsers—regardless of software or computer system—without the interruption of having to upload and download images.”

“As a part of our grant application to CLIR (the Council on Library and Information Resources),” Jullianne said, “we proposed integrating IIIF and the Mirador image viewer with our digital collection pages, primarily to make the García Márquez online archive even more usable.” The IIIF Image and Presentation APIs make it easy for researchers to zoom in and conduct side-by-side comparisons of images, even images held by different libraries in different repositories. “People would love to see a comprehensive collection of García Márquez’s work,” she said, and the IIIF integration with CONTENTdm and the Mirador viewer brings this closer to reality. “Scholars imagine being able to join dispersed manuscripts, letters, and books in a single viewer. They’re excited by the potential of IIIF.”
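For readers curious how a viewer asks for a zoomed-in detail, here is a rough sketch of building an IIIF Image API request URL, which follows the pattern {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The server address, identifier, and helper name are hypothetical and are not taken from the Ransom Center's CONTENTdm site.

```python
from urllib.parse import quote

def iiif_image_url(base, identifier, region=None, width=None):
    """Build an IIIF Image API URL for a whole image or a cropped detail.

    region is an optional (x, y, w, h) tuple in pixels; width optionally
    scales the result to that many pixels wide.
    """
    region_part = "full" if region is None else ",".join(str(n) for n in region)
    size_part = "max" if width is None else f"{width},"
    # Identifiers must be URL-encoded, including any slashes they contain.
    encoded_id = quote(identifier, safe="")
    return f"{base}/{encoded_id}/{region_part}/{size_part}/0/default.jpg"

# Hypothetical server and identifier, for illustration only.
whole_page = iiif_image_url("https://example.org/iiif", "ms 12/p1")
detail = iiif_image_url("https://example.org/iiif", "ms 12/p1",
                        region=(100, 200, 300, 400), width=600)
```

Because every compliant server answers the same URL pattern, a viewer such as Mirador can request matching details from images held in different repositories and show them side by side.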

“Our researchers envision IIIF as the key to unlocking the question of what a universal collection could look like.”

In addition to the Mirador viewer, many researchers use the OpenSeadragon viewer that comes standard with CONTENTdm. “We like that we have the opportunity to offer our audiences both viewers,” Jullianne said. “OpenSeadragon allows our patrons to zoom in to a manuscript page or a photograph without leaving the original CONTENTdm record. I’ve gotten great feedback on that.” In the reading room, García Márquez scholars are limited to viewing one folder of documents at a time and enlarging details with a magnifying glass. But with the Center’s digital library, they can pan, zoom, and compare pages from different folders within their browsers, without even downloading images. “The images just generally tend to be easier to access and use,” she added.

To help researchers and other special collections staff understand the potential of this technology, Jullianne and her team have developed training opportunities. “The Ransom Center is the first library at UT to implement IIIF APIs, with OCLC’s help,” she explained. “We’re holding tutorials for students and staff on how to make the most of it.” As she reaches out to other campus libraries, she has found exciting opportunities for collaboration. “We have amazing map collections at UT,” she said. “We’ve had some discussions about combining our knowledge resources to do something really cool with IIIF and maps.” As more digital collections implement the IIIF standard, she added, “the conversation about what collections of the future—including digital collections—will look like becomes richer.”

The post DLF Forum Featured Sponsor – OCLC appeared first on DLF.

GeoCities and the spacer.gif / Nick Ruest

Originally posted here.

Trevor Owens and Grace Thomas recently had their article, “The invention and dissemination of the spacer gif: implications for the future of access and use of web archives” published in the International Journal of Digital Humanities. It’s a great look at the history of the spacer.gif, how it proliferated in the early web, and a case study of digging into web archives and doing a whole lot of analysis. After reading it this past spring, it really got me motivated to round out the DataFrame implementation in the Archives Unleashed Toolkit so that more people could do this kind of inspirational work.

In late August 2019, the Archives Unleashed team released version 0.18.0 of the Archives Unleashed Toolkit. If you check out the release notes, you’ll see a lot of new functionality was added, along with some bug fixes. In those notes and user documentation you’ll see the new and expanded functionality with our DataFrame implementation. We now have functions for a variety of binary types (images, audio, video, pdf, spreadsheets, presentation program files, word processor files, and text files) that allow a user to extract binaries of a given type — all the images from a web collection for instance — or extract information about those binaries into a DataFrame that is output as a CSV file. For images you can extract:

  • the url of an image;

  • the filename;

  • the extension;

  • the MimeType provided by the web server and MimeType as identified by Apache Tika;

  • the width and height of the image;

  • the md5 hash of the image;

  • and the raw bytes of the image.

Similarly, for the other binary extraction and analysis functions (audio, video, pdf, etc.), you are able to have columns as described above minus the height and width.

So, how do you use this new functionality? We do have it all documented here, but for posterity this is how it works if you had a collection of web archives handy:

import io.archivesunleashed._
import io.archivesunleashed.df._

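// Load the web archives and extract one row of details per image.
val images = RecordLoader
  .loadArchives("/path/to/web/archive/collection", sc)
  .extractImageDetailsDF()

// Keep the descriptive columns (leaving out the raw bytes) and write them as CSV.
// The output path is a placeholder; point it at your own output directory.
images
  .select($"url", $"filename", $"extension", $"mime_type_web_server",
    $"mime_type_tika", $"width", $"height", $"md5")
  .write
  .csv("/path/to/output/directory")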

Once the script is run across the web archive collection, it will produce a bunch of part-234432.csv files in your named output directory. You can cat (i.e. by typing cat part* > all-files.csv) those files together into a single one. In the end you should end up with something like this:

,camera_blu_001.jpg,jpg,image/jpeg,image/jpeg,112,150,fffffef31a159782b97876b7a17eab92
,06_small.jpg,jpg,image/jpeg,image/jpeg,100,143,fffffd5fe6d986c04f028854bbd4a20a
,DSC01219.jpg,jpg,image/jpeg,image/jpeg,510,768,fffffc7244d39657dd286547fda3fd0d
,favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e
,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
,logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07
,merlin0.jpg,jpg,image/jpeg,image/jpeg,129,140,fffff077e30e213fa08cecc389a60bdb
,MENU7_r11_c21.jpg,jpg,image/jpeg,image/jpeg,68,10,ffffe91beaf231ea8b5fc46a1c6b7f32
,qudes.jpg,jpg,image/jpeg,image/jpeg,55,24,ffffd381a8c0ae2e6a7d63d8af6b893c
,dad_brendon_lighthouse.jpg,jpg,image/jpeg,image/jpeg,300,226,ffffc83f77a1558222f40d7a44b1d464

So, how is this relevant to the work of Owens and Thomas? Well, a few years ago, our team secured a copy of the GeoCities dataset from the Internet Archive. We’ve done a fair bit of analysis on the collection for a variety of research projects. If you’re curious about some of that research, check out Ian Milligan’s new book, “History in the Age of Abundance? How the Web is Transforming Historical Research.” It features a great case study on GeoCities. Also check out Sarah McTavish’s work, such as this presentation at the Michigan State Global Digital Humanities conference.

We ran an image analysis job across the entire 4 terabyte GeoCities dataset, and ended up with a 16 gigabyte CSV which held data for 121,371,844 images that the toolkit was able to identify!


MD5                               COUNT MD5   COUNT FILENAME  URL
c4746081d66bc2abc269f22ca27ebb46  2,705       373,198
b4682377ddfbe4e7dabfddb2e543e842  3,336       18,685
fc94fb0c3ed8a8f909dbc7630a0987ff  69,625      747
55fade2068e7503eae8d7ddf5eb6bd09  2,551       13,852
b4682377ddfbe4e7dabfddb2e543e842  3,336       1,780
fc94fb0c3ed8a8f909dbc7630a0987ff  69,625      747
4f59788bde58d15d541a9c116d0e850d  2,729,121   2,731,243
325472601571f31e1bf00674c368d335  18,537,796  39
accba0b69f352b4c9440f05891b015c5  1,341       26,292
325472601571f31e1bf00674c368d335  18,537,796  1,888,058       ;-)/n.gif

It looks like 325472601571f31e1bf00674c368d335 (n.gif or spaceball.gif) was the most prolific of the spacer.gif files in the GeoCities dataset.

15.27% of the 121,371,844 images that we identified with the toolkit are 325472601571f31e1bf00674c368d335 (n.gif or spaceball.gif)!!
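Counting how often each MD5 hash occurs in the combined CSV takes only a few lines of post-processing. The sketch below is one possible approach in Python, not the method the team used; it assumes the CSV has no header row and that its columns are in the order selected in the script earlier in the post. The sample rows are copied from the example output (with an empty url field).

```python
import csv
import io
from collections import Counter

# Assumed column order, matching the select() in the extraction script.
COLUMNS = ["url", "filename", "extension", "mime_type_web_server",
           "mime_type_tika", "width", "height", "md5"]

def md5_counts(csv_file):
    """Count how many image rows share each MD5 hash."""
    reader = csv.DictReader(csv_file, fieldnames=COLUMNS)
    return Counter(row["md5"] for row in reader)

def share(count, total):
    """Percentage of all identified images accounted for by count."""
    return 100.0 * count / total

# Three rows from the sample output; a real run would open all-files.csv instead.
sample = io.StringIO(
    ",favor.gif,gif,image/gif,image/gif,71,20,fffff8a7566c250585fb4453594b9c3e\n"
    ",logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07\n"
    ",logo.gif,gif,image/gif,image/gif,168,49,fffff72ef7571cf00d0717ac96bfad07\n"
)
counts = md5_counts(sample)
most_common_md5, occurrences = counts.most_common(1)[0]

# The headline figure: 18,537,796 occurrences out of 121,371,844 images.
spaceball_share = share(18_537_796, 121_371_844)
```

For the full 16 gigabyte file, streaming the rows through a Counter like this keeps only the distinct hashes in memory rather than the whole CSV.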

325472601571f31e1bf00674c368d335 is represented with 3,130 different filenames. The full list is available here.

The top 10 occurrences are:

  1. 18507222 serv

  2. 7461 serv.gif

  3. 3981 spacer.gif

  4. 1660 mm_spacer.gif

  5. 953 blank.gif

  6. 629 hbpix

  7. 603 clear.gif

  8. 541 pixel.gif

  9. 513 px1.gif

  10. 448 trans.gif

Pie chart of top 500 filenames for 325472601571f31e1bf00674c368d335 with serv removed.

In total, 17.58% of the images the toolkit identified in the web archive come from the list Owens and Thomas identified. So, nearly 20% of the images in the 4 terabyte GeoCities dataset are 1x1 images 🤯🤯🤯.
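As a rough check on that total, the MD5 counts from the table above can be summed once per distinct hash and divided by the number of identified images. This sketch assumes those table counts are the inputs behind the reported figure; the variable names are made up here.

```python
TOTAL_IMAGES = 121_371_844

# MD5 counts from the table, one entry per distinct hash.
SPACER_MD5_COUNTS = {
    "c4746081d66bc2abc269f22ca27ebb46": 2_705,
    "b4682377ddfbe4e7dabfddb2e543e842": 3_336,
    "fc94fb0c3ed8a8f909dbc7630a0987ff": 69_625,
    "55fade2068e7503eae8d7ddf5eb6bd09": 2_551,
    "4f59788bde58d15d541a9c116d0e850d": 2_729_121,
    "325472601571f31e1bf00674c368d335": 18_537_796,
    "accba0b69f352b4c9440f05891b015c5": 1_341,
}

spacer_total = sum(SPACER_MD5_COUNTS.values())
spacer_share = 100.0 * spacer_total / TOTAL_IMAGES
print(f"{spacer_total:,} of {TOTAL_IMAGES:,} images ({spacer_share:.2f}%)")
```

The result lands at roughly 17.6%, in line with the figure reported above.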

If this type of analysis interests or inspires you, please check out the current call for participation for our fourth Archives Unleashed Datathon in NYC next March.

For the latest news and project updates subscribe to our quarterly newsletter.

Keeping in touch with the Archives Unleashed Project is so easy!

Belated update / Coral Sheldon-Hess

Right now I should be grading or preparing for classes, but honestly I’m three blog posts behind where I wanted to be by now (I haven’t forgotten my WisCon promise to make a post about tabletop roleplaying games) and fighting a pretty nasty headache. So what if I take a short break to make a blog post about the very many things going on in my life?

New job!

First things first: I work full-time now! There was an opening for a full-time Computer Information Technology and Data Analytics professor at the community college where I was teaching as an adjunct (in Data Analytics) in the spring. I knew I liked the students here, my departmental colleagues, and my bosses, so I applied. It’s academia, so the process took all semester and well into the summer, but eventually I was offered and accepted the position! I’m an Assistant Professor! (Hah, yes, again. But as teaching faculty, rather than library faculty, this time.) I get to help build our fledgling Data Analytics program, which is a rare thing for a community college and an exciting development for Southwestern Pennsylvania.

This first semester has been a lot, I won’t lie. I’m teaching five 3-credit classes and co-teaching a sixth with a colleague. Of those, only Python 1 and Data Analytics 1 (the one I’m co-teaching) are courses I’ve (we’ve) taught before, and I had the brilliant idea to go and change the textbook for Python. So for now, my whole life is given over to preparing lectures to go with new-to-me textbooks, wrangling Blackboard, planning assignments/projects/tests, and grading.

Next semester looks a little easier, in some ways: provided things don’t change much between now and then (heh), I’ll solo teach Data Analytics 1, plus a brand new course (Data Analytics 2: Forecasting and Regressions and Stuff … uh, that isn’t the official title, but if that sounds like something you’ve taught or taken, I’d love to see a syllabus!), Python again (without changing the book this time), intro to computers (which I taught this semester), and our HTML/CSS course (which a colleague is generously willing to share materials for). There’s an experimental course in the works, which, if I end up involved, I’ll split with something like three other faculty members for a very small overage; if that doesn’t run, I’ll split an independent study with a colleague, so we can graduate the first student from the Data Analytics program(!!!).

I’m excited to be here! I’m very tired, right now, because it’s so much work to put this many courses together and because I didn’t optimize well for sleep when I was planning my schedule for this semester. But I really enjoy teaching, and I have an excellent officemate (we share offices, here, and I had a lot of anxiety about where I’d be placed, so this is huge). Most of my department has been really welcoming and helpful, and they seem genuinely pleased that I’m here, which feels good. I’ve made some friends outside of the department, too, so coming in to work is always pleasant, even if I happen to be tired and/or headachy now and then. And it isn’t too hard to imagine a future when I’ve taught most of my courses a couple of times and they aren’t quite so overwhelming to put together.

Non-work things

Dale and I have reason to believe we’re going to have to move in the near future. (We haven’t been told anything official by anyone official, but we’ve received some news unofficially that, combined with our current month-to-month lease, is pretty clear in its implications.) The timing isn’t great, given everything I just said about being super busy, but then again, my commute is half an hour without traffic, right now. So we’ve started the process of putting an offer on a house, much closer to my work and still not far from Dale’s work. (It’s so perfect! I don’t want to lovingly describe it right now, in case it falls through, but wow, it would be so fantastic if it all works out!)

We also adopted a bird. Well, two birds since the last time I posted. One was found flying around wild in town. Her name is Pepper, and she is a beautiful girl cockatiel. (The animal rescue named her “Matthew McCockatiel,” which is a great name, but she’s definitely a girl.) She hates hands, so we don’t get to pick her up or anything–we hold out hope that she’ll come to trust us with time–but she likes having us around, and she and Phoebe get along pretty well. The other, someone on the internet posted about needing to find a home for, and I really needed a people-loving bird, since neither Phoebe nor Pepper are especially social with humans. His name is Oliver Scribbner, but we just call him “Mr. Scribbner” or “Scribbner” or “Scribbs” or “buddy” … you know how it is with pets. He wants to hang out with us pretty much the whole time we’re home and he’s awake. He makes microwave sounds when he goes into the kitchen and tries to steal our food and is just generally a great little bird. 💛🧡

Here are three of our four birds (we also have a parakeet who is a cranky old man, but still beautiful and deeply invested in bothering the cockatiels), from left to right, Phoebe, Pepper, and Scribbner:

In my minimal spare time (read: when I am too tired to do something productive but too wound up to sleep) I’ve been watching Star Trek: Deep Space 9 and Star Trek: Voyager with Dale, in preparation for the new Picard series. And also because I’ve never watched them all the way through. We watched Star Trek: The Next Generation this summer. Without bringing this post too far down, I feel like I should acknowledge aloud that I wish I’d done this a couple of years ago, so I could talk about all of my Star Trek feelings with my dad, who was a Trekkie and who would have enjoyed arguing about whether Neelix or Tom Paris is worse and other vital issues. Secretly, I kind of like DS9 better than Voyager, but Janeway is forever my captain, given her science nerd heart and deep need for coffee.

I also finally broke down and bought a Nintendo Switch so that I could play Untitled Goose Game (and ABZÛ and Pokémon and Katamari Damacy and probably Let’s Dance and …). I’ve been enjoying that, albeit somewhat sporadically. I’m looking forward to (fingers crossed! we hope!) playing Switch games for hours at a time in a comfortable reading chair in our new house, or maybe even on a porch(!), this May.

WisCon will hopefully get its own post(s), but (speaking of May?) I do want to say that I plan to attend again next year, and if anyone wants to make any plans around that (panels? fun Madison things? roadtripping or taking the train together?), I’m open to it!

Jobs in Information Technology: October 9, 2019 / LITA

New This Week

Visit the LITA Jobs Site for additional job listings and information on submitting your own job posting. New vacancy listings are posted on Wednesday afternoons.

DLF Forum Featured Sponsor – LIBNOVA / Digital Library Federation

Featured Blog Post by 2019 DLF Forum Sponsor LIBNOVA

LIBNOVA, the DLF FORUM and researching for the future.

LIBNOVA is honoured to be taking part in the DLF 2019! As a digital preservation company rooted in research, we understand the importance of gearing towards the future by developing technologies and sharing methodologies; and that is what the DLF Forum represents to us: a great arena to spread the word, find common ground to disseminate new ideas, and project our thoughts.

Let me explain a little. Right from the start, LIBNOVA focused on research as its core. It has slowly but confidently expanded the LIBSAFE digital preservation platform into one of the most advanced platforms. It understands that progress is made by many at the same time, although there is always a first step. Many a time that first step has been taken at the DLF Forum.

We all know that protecting assets from the worst of today and preserving them for the future has become an obligation for institutions of all sizes across the globe. We all know that ensuring digital heritage is kept readable and accessible for the long-term is critical. The need to protect content while keeping it alive and usable has become a priority. We all share the same essence and participate in the same experience. We take part in the same journey.

LIBNOVA Research Labs is a fundamental piece of this forward-thinking technology that is LIBSAFE. One example of this is our machine learning and neural networks project, which achieved impressive results and was awarded a European Research Grant in 2018. That research has made its way into LIBSAFE.

Over the next week, during the DLF Forum, we will be running through some of our research programs, so please keep an eye out and join us. Otherwise you can always come around to our booth and exchange ideas and views.


The post DLF Forum Featured Sponsor – LIBNOVA appeared first on DLF.

Tune in to OA Week “Equity in Open Knowledge” Highlight–Open Data Activism in Search of Algorithmic Transparency / DuraSpace News

From Verletta Kern, Digital Scholarship Librarian, University of Washington Libraries

Take a break during Open Access Week and join the ACRL Open Research and Digital Collections Discussion Groups on Tuesday, October 22nd from 11:00am-12:00pm Central for “Open Data Activism in Search of Algorithmic Transparency: Algorithmic Awareness in Practice”.  Tying in with this year’s OA Week theme of “Open for Whom? Equity in Open Knowledge,” this online learning opportunity will begin with an introduction to algorithms, touching on definitions, implications, and the link to digital transparency initiatives followed by a chance for a simple, hands-on experience with pseudocode (no previous coding knowledge required).   Presenters Jason Clark and Julian Kaptanian (Montana State University Libraries)  will facilitate a discussion on programmer bias, the complexity behind algorithmic decisions, and integrating the concept of algorithmic awareness into teaching and advocacy.

This learning opportunity has grown in part out of the robust research conducted by Clark and Kaptanian as a part of an IMLS grant and the positive reception of a previous collaboration between the ACRL Open Research and Digital Collections Discussion Groups, presented earlier in 2019. Recording available here. We hope to continue the learning here in this hands-on webinar that integrates algorithmic awareness into teaching and advocacy. 

Please register in advance for this event. After registering, you will receive a confirmation email containing information about joining the meeting. A recording will be available after the event for those who register. We hope you will join us for this Open Access Week event!

The post Tune in to OA Week “Equity in Open Knowledge” Highlight–Open Data Activism in Search of Algorithmic Transparency appeared first on

Ethiopia adopts a national open access policy / Open Knowledge Foundation

In September, Ethiopia adopted a national open access policy for higher education institutions. In a guest blog on the EIFL website, Dr Solomon Mekonnen Tekle, librarian at Addis Ababa University Library, organiser of the Open Knowledge Ethiopia group and EIFL Open Access Coordinator in Ethiopia, celebrates the adoption of the policy. This is a repost of the original blog at EIFL (Electronic Information for Libraries) – a not-for-profit organization that works with libraries to enable access to knowledge in developing and transition economy countries in Africa, Asia Pacific, Europe and Latin America.

The new national open access policy adopted by the Ministry of Science and Higher Education of Ethiopia (MOSHE) will transform research and education in our country. The policy comes into effect immediately. It mandates open access to all published articles, theses, dissertations and data resulting from publicly-funded research conducted by staff and students at universities that are run by the Ministry – that is over 47 universities located across Ethiopia.

In addition to mandating open access to publications and data, the new policy encourages open science practices by including ‘openness’ as one of the criteria for assessment and evaluation of research proposals. All researchers who receive public funding must submit their Data Management Plans to research offices and to university libraries for approval, to confirm that data will be handled according to international FAIR data principles. (FAIR data are data that meet standards of Findability, Accessibility, Interoperability and Reusability.)

EIFL guest blogger, Dr Solomon Mekonnen Tekle: “And now the work begins!”

We will have to adapt quickly!

Our universities and libraries will have to adapt quickly to comply with the new policy. Each university will have to develop an open access policy to suit its own institutional context, and which is also aligned with the national policy. We have a long way to go – at present, only three of the 47 universities that fall under the Ministry (Hawassa, Jimma and Arba Minch universities) have adopted open access policies.

The policy requires universities to ensure that all publications based on publicly-funded research are deposited in the National Academic Digital Repository of Ethiopia (NADRE) as well as in an institutional repository, if the university has one. NADRE is supported by MOSHE, and also harvests and aggregates deposits from institutional repositories.

Right now, about 13 universities under the Ministry have institutional repositories, but only four are openly available because of policy and technical issues, so there is work to be done.

Ministry support for universities

To speed up and support compliance with the new policy, MOSHE has launched a project in partnership with Addis Ababa University.

I am managing the project, and I am very happy to have the support of the Consortium of Ethiopian Academic and Research Libraries (CEARL) through its chairperson, Dr Melkamu Beyene, who has been named chairperson of the project board. We can also draw on the expertise and experience of Iryna Kuchma, Manager of the EIFL Open Access Programme, who serves as an international advisor for the project.

The project has just started. It will ensure that all public universities that do not have institutional repositories establish them as soon as possible. The project will also strengthen the Ethiopian Journals Online (EJOL) open access journals platform that currently includes 24 journals and is hosted by Addis Ababa University. And the project will support researchers in making their research available through the EthERNet platform for research data, and through NADRE.

There is a strong capacity building component to the project to train repository managers and administrators to manage their new institutional repositories and open access journals.

Positive impact of the new policy

When we first began making research outputs openly available in Ethiopia there was fierce resistance from the academic community who were worried that their work would be plagiarized. But now, researchers and students come to my office in the library and ask for their research to be published in open access so that others, like potential employers, for example, can find and read it. They see the benefits.

The main impact of the new policy will be to increase the visibility of Ethiopian research, within the national and international research communities. The quality of our research will improve, because researchers will be able to see and verify each others’ work, and to comment on the integrity of the methodology and results. Practitioners in organizations will have access to our research and will be able to base their work on it and so our research will have real impact. Sharing of research and data through open access will minimize duplication, thereby saving costs, time and effort.

The journey to achieving the new policy was long. We began reaching out to MOSHE over three years ago, and they formed a Working Group to draft a national open access policy based on a model that had been developed by CEARL and Addis Ababa University, with support from EIFL. Success resulted from a collective effort by many colleagues and partners. I am proud to have been part of the process and I am now looking forward to working with our partners to achieve full implementation of the policy through the Ministry’s project.

CALL for Archiving 2020 Proposals / DuraSpace News

Join an international community of technical experts, managers, practitioners, and academics from cultural heritage institutions, universities, and commercial enterprises at Archiving 2020 from May 18-21, 2020 to explore and discuss the digitization, preservation, and access of 2-dimensional, 3-dimensional, and audio-visual materials, including documents, photographs, books, paintings, videos, and born digital works.

Authors are invited to submit abstracts describing original work in technical areas related to 2D, 3D, and AV materials by November 1, 2019. Call for proposals here.


  • New developments in digitization technologies and workflows
  • Advanced imaging techniques and image processing, e.g., multispectral imaging, 3D imaging
  • Large scale/mass digitization and workflow management systems
  • Quality assurance and control of digitization workflow, e.g., targets, software, automation, integration

Preservation / Archiving

  • Formats, specifications, and systems
  • Management of metadata
  • Standards and guidelines
  • Archival models and workflows

Access of 2D, 3D, and AV materials

  • Dissemination and use of digitized materials, e.g., rights management, crowdsourcing, data mining, data visualizations
  • Formats for preservation and access
  • Deep learning algorithms to improve search results; AI, machine learning, etc.
  • Open access and open data strategies
  • Integration of linked open [usable] data (LOD/LOUD)/Open source solutions/APIs (application programming interfaces, e.g., IIIF)

Management and Partnerships

  • Policies, strategies, plans, and risk management; repository assessment
  • Business and cost models
  • Collaborations and partnership best practices/lessons learned/case studies

The post CALL for Archiving 2020 Proposals appeared first on

The Archives Unleashed Toolkit as a Finding Aid Utility / Nick Ruest

Originally posted here.

I’ve been thinking a lot lately about how the Archives Unleashed Toolkit is a great finding aid utility for web archive collections, and should be in the toolbox for any web archivist.

So, a finding aid. How could we create a finding aid for a web archive collection with relatively minimal labour?

The Society of American Archivists defines a finding aid as:

  1. A tool that facilitates discovery of information within a collection of records.

  2. A description of records that gives the repository physical and intellectual control over the materials and that assists users to gain access to and understand the materials.

What are current practices? Generally speaking, most web archive collections from institutions such as national libraries, government organizations, or universities will have a set of descriptive metadata about a collection. This is often framed around Dublin Core, and is arguably a “minimal finding aid.” It is useful, and usually what can be done given time, budget, and labour constraints.

With the Archives Unleashed Toolkit, we can create a more robust finding aid. But, what should it be? Currently, we have four sets of what we’ve called scholarly derivatives of web archives that are produced with the Archives Unleashed Cloud. With each of the derivatives, we also provide extensive tutorials that walk through use cases for scholarly analysis.

Full Text

We provide a txt file that’s comma delimited (basically a CSV file 😄). Each website within the web archive collection will have its full text presented on one line, along with information around when it was crawled, the name of the domain, and the full URL of the content.
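
As a minimal sketch of working with this derivative, the Python below reads comma-delimited rows and groups the page text by domain. The column order (crawl date, domain, URL, text) and the sample rows are assumptions made for illustration; check them against an actual file before relying on this.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the column order is an assumption.
SAMPLE = (
    "20190815,example.org,http://example.org/,Welcome to the example site\n"
    '20190815,example.org,http://example.org/about,"About us, and more"\n'
    "20190816,example.com,http://example.com/,Another site\n"
)

def text_by_domain(handle):
    """Group (crawl date, URL, text) tuples by domain."""
    grouped = defaultdict(list)
    for crawl_date, domain, url, text in csv.reader(handle):
        grouped[domain].append((crawl_date, url, text))
    return dict(grouped)

pages = text_by_domain(io.StringIO(SAMPLE))
for domain, rows in sorted(pages.items()):
    print(domain, len(rows))
```

The csv module is used so that text containing commas survives, provided such fields are quoted in the file; if they are not, the unpacking into four columns will fail and a different split strategy would be needed.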

Sample full-text aut output

Domains CSV

Sample full-urls aut output

Text by Domain

We provide a ZIP file containing ten text files (see Full Text derivative above) corresponding to the full-text of the top ten domains.

Sample extracted output

Gephi & Raw Network Files:

Two different network files (described below) are also provided. These files are useful for discovering how websites link to each other.

  1. Gephi file, which can be loaded into Gephi. It will have basic characteristics already computed and a basic layout.

  2. Raw Network file, which can also be loaded into Gephi. You will have to use the network program to lay it out yourself.
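
To illustrate the kind of question these network files help answer, here is a small sketch that totals inbound links per domain from an edge list. The three-column layout (source, target, link count) and the sample data are hypothetical, not the toolkit's documented format.

```python
from collections import Counter

# Hypothetical edge list: (source domain, target domain, number of links).
EDGES = [
    ("a.example", "b.example", 12),
    ("c.example", "b.example", 3),
    ("b.example", "a.example", 1),
]

def inbound_links(edges):
    """Sum the number of links pointing at each target domain."""
    totals = Counter()
    for _source, target, count in edges:
        totals[target] += count
    return totals

# Domains ordered from most to least linked-to.
ranking = inbound_links(EDGES).most_common()
print(ranking)
```

A program such as Gephi computes richer measures and a layout; this only shows that the underlying data is a plain list of weighted links.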

Hyperlink Diagram of University of Victoria’s Anarchist Archives collection

In addition to the set of scholarly derivatives the toolkit can produce, we have also improved and expanded upon our DataFrame implementation in the most recent release of the Archives Unleashed Toolkit. If you’re familiar with CSV files, then you’ll be right at home with DataFrames!

You can now produce a variety of DataFrames of a web archive collection with the toolkit, that can be exported directly to CSV files.

You can produce a list of domains or a hyperlink network.

List of domains example output in a DataFrame
Hyperlink network example output in a DataFrame

You can also gather information on a variety of binary types in a web archive collection. We currently support audio, images, pdf, presentation program files, spreadsheets, text files, videos, and word processor files. Each of these methods allows you to extract:

  • the url of the binary;

  • the filename of the binary;

  • the extension of the binary;

  • the MimeType identified by the web server, and by Apache Tika;

  • height and width of an image binary;

  • md5 hash of the binary;

  • and the raw bytes of the binary.
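
One use for the md5 hash and raw bytes is spotting the same binary served from several URLs. The following is a rough sketch using only Python's standard library; the (URL, bytes) pairs are made-up sample data, not toolkit output.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the hexadecimal md5 digest of some raw bytes."""
    return hashlib.md5(data).hexdigest()

def find_duplicates(binaries):
    """Map each md5 hash seen more than once to the URLs that share it."""
    by_hash = {}
    for url, data in binaries:
        by_hash.setdefault(md5_hex(data), []).append(url)
    return {h: urls for h, urls in by_hash.items() if len(urls) > 1}

# Hypothetical (url, raw bytes) pairs.
SAMPLE = [
    ("http://example.org/logo.png", b"\x89PNG-fake-bytes"),
    ("http://example.org/copy/logo.png", b"\x89PNG-fake-bytes"),
    ("http://example.org/other.png", b"different"),
]

dupes = find_duplicates(SAMPLE)
for digest, urls in dupes.items():
    print(digest, urls)
```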

Image analysis example output in a DataFrame

What does all this mean?

If a given institution has an Archive-It subscription, they could generate a set of derivatives for their web archive collections with the Archives Unleashed Cloud. If they don’t have an Archive-It subscription, or want to do the analysis on their own with the Toolkit, and have the storage and compute resources available to run analysis on their web archive collections, we have a ready set of documented scripts available in the “cookbook” section of our documentation now.

I’m curious from an archivist’s perspective, what should make up a finding aid for a web archive collection? Is descriptive metadata for a collection sufficient? Does a combination of descriptive metadata for a collection, plus some combination of analysis with the Archives Unleashed Toolkit cover it? If so, what output from the Toolkit is useful in creating a finding aid for a web archive collection? Is there more that can be done? If so what?

I’d love to hear from you in our Slack (I’ve set up a channel #aut-finding-aid), or on GitHub if you’d like to create an issue.

An additional option outside the scope of the Archives Unleashed Toolkit, but closely related to it, is to index a web archive collection (all the ARC/WARC files) with UK Web Archive’s webarchive-discovery, and build a discovery interface around the Solr index with Warclight.

A Thought on Digitization / Harvard Library Innovation Lab

Although it is excellent, and I recommend it very highly, I had not expected Roy Scranton's Learning to Die in the Anthropocene to shed light on the Caselaw Access Project. Near the end of the book, he writes,

The study of the humanities is nothing less than the patient nurturing of the roots and heirloom varietals of human symbolic life. This nurturing is a practice not strictly of curation, as many seem to think today, but of active attention, cultivation, making and remaking. It is not enough for the archive to be stored, mapped, or digitized. It must be worked.

The value of the Caselaw Access Project is not primarily in preservation, saving space, or the abstraction of being machine-readable; it is in its new uses, the making of things from the raw material. We at LIL and others have begun to make new things, but we're only at the beginning. We hope you will join us, and surprise us.

DLF Forum Featured Sponsor – Data Curation Experts / Digital Library Federation

Featured Blog Post by 2019 DLF Forum Sponsor Data Curation Experts (DCE)

This year, as part of our conference sponsorship, three of my DCE colleagues and I will be attending DLF in Tampa. I’m personally excited for the opening keynote “Beautiful Data: Justice, Code, and Architectures of the Sublime” by Dr. Marisa Duarte. At DCE, we have strongly held opinions about the aesthetics of well constructed data and transparent interfaces, so any talk that promises an exploration of how to make navigation seamless and unencumbered gets us hooked. I expect that Dr. Duarte’s talk will set the stage for many more discussions over the week about how we all can help support the role of digital libraries as institutions for change. As people who care about social justice, we try to be intentional about DCE’s duty to build repositories for the future that don’t just unthinkingly perpetuate the patterns (and potential problems) of the past.

We’re excited to be supporting DLF this year, because this is the conference where we can have conversations about these kinds of ideas, not just in abstract visionary terms, but in how we do our day-to-day work and how we collaborate with our colleagues across the community. DLF is where researchers, librarians, and coders talk to each other about how to build digital libraries and digital humanities projects.

In particular, my DCE colleagues and I want to thank the staff and leadership at DLF for the way they have worked to broaden the conversation to include more participants every year: DLF’s fellowship programs provide travel funding for underrepresented groups, students and new professionals; DLF’s commitment to providing childcare makes the conference more accessible for people with childcare responsibilities; and the DLF Organizers’ Toolkit provides a space and a framework for taking action. DCE is proud to support DLF and we are excited for the 2019 Forum! Please stop by our table sometime during the conference; Rachel, Jamie, Mark, and I would love to say hello.

–Bess Sadler on behalf of the DCE Team

The post DLF Forum Featured Sponsor – Data Curation Experts appeared first on DLF.

The Data Isn't Yours / David Rosenthal

Most discussions of Internet privacy, for example Jaron Lanier Fixes the Internet, systematically elide the distinction between "my data" and "data about me". In doing so they systematically exaggerate the value of "my data".

The typical interaction that generates data about an Internet user involves two parties, a client and a server. Both parties know what happened (a link was clicked, a purchase was made, ...). This isn't "my data", it is data shared between the client ("me") and the server. The difference is that the server can aggregate the data from many interactions and, by doing so, create something sufficiently valuable that others will pay for it. The client ("my data") cannot.

DLF Forum Featured Sponsor – Quartex powered by Adam Matthew Digital / Digital Library Federation

Featured Blog Post by 2019 DLF Forum Sponsor – Quartex by Adam Matthew Digital

At this year’s DLF Forum, learn how your organisation can make its archival content fully searchable with Quartex, powered by Adam Matthew Digital

Our team at Adam Matthew Digital is looking forward to speaking with attendees about how Quartex enables organisations to easily publish their digital collections.

Adam Matthew Digital is an award-winning publisher of primary source content with 30 years’ experience curating and showcasing archival collections. It is through this experience that we have developed Quartex, a platform designed to help libraries, archives, and other heritage institutions showcase, share, and celebrate their digital collections.

Quartex is a fully hosted solution developed by a dedicated and skilled team of engineers. It is designed as a simple but powerful resource with functionality that requires no technical expertise, and which allows customers the flexibility to establish customized workflows based on their unique needs and collections. The Quartex platform benefits from a continual program of investment and development, as new functionality is added to support the requirements of customers.

At this year’s DLF Forum, we’re going to be asking:

  • What search functionality is key for your published collections? Handwritten Text Recognition (HTR) technology, OCR, A/V transcription, federated search across collections? What issues have you come across when using other platforms?

Our existing customers include the University of Toronto Mississauga Library, Shakespeare’s Globe, Loyola Marymount University, the Royal Horticultural & Agricultural Society of South Australia, and Baylor University.

For a demonstration of Quartex, visit our team in the Grand Ballroom Foyer at the DLF Forum – they will be happy to answer your questions.

The post DLF Forum Featured Sponsor – Quartex powered by Adam Matthew Digital appeared first on DLF.

Introducing D-CRAFT / Digital Library Federation

Assessing the impact, value, and usability of digital collections is a major and persistent challenge for Cultural Heritage and Knowledge Organizations (CHKO) of all forms and sizes. Some of the complications stem from a limited number of methods and approaches for assessing digital libraries. Traditional digital library assessment analytics focus almost entirely on usage statistics, which do not provide the information needed to develop a nuanced picture of how users engage with, repurpose, or transform unique materials and data from digital libraries, archives, and repositories. This lack of distinction, combined with a lack of standardized assessment approaches for digital libraries, makes it difficult for CHKOs to develop user-responsive collections and to highlight the value of their materials. This in turn presents significant challenges for developing the appropriate staffing, system infrastructure, and long-term funding models needed to support digital collections.

The Content Reuse Working Group, one of several that make up DLF’s Assessment Interest Group (AIG), is here to help. Thanks to the generosity of the Institute of Museum and Library Services (IMLS), which awarded the working group a National Leadership Grant for Libraries (LG-36-19-0036-19), CLIR/DLF, and the members’ home institutions, the working group will spend two-and-a-half years developing the Digital Content Reuse Assessment Framework Toolkit (D-CRAFT). The D-CRAFT project, which started in July 2019, will complete two phases of work to develop, solicit feedback on, and launch the toolkit.


About the Team

The D-CRAFT Project Team pulls together the established experience of the AIG Content Reuse Working Group. Members Elizabeth Joan Kelly (Loyola University New Orleans), Ayla Stein Kenfield (University of Illinois Urbana Champaign), Caroline Muglia (University of Southern California), Santi Thompson (University of Houston), and Liz Woolcott (Utah State University) have spent several years focused on understanding the scope and challenges for assessing digital collection reuse. In 2017, this group, along with Genya O’Gara (VIVA), conducted the IMLS-funded Developing a Framework for Measuring Reuse of Digital Objects (Measuring Reuse), which provided the D-CRAFT team with prioritized use cases and functional requirements for the toolkit.

The D-CRAFT Project Team is excited to add Kinza Masood (Mountain West Digital Library) and Ali Shiri (University of Alberta) to the team. Their expertise in digital library management and research will add another dimension to the toolkit.

Beyond the core team members, the project will benefit from the shared experience of five project consultants. Focused in targeted areas that will enhance the development of the toolkit, consultants will add additional expertise in key themes including library assessment, privacy, diversity and equity, instructional design, and accessibility. Finally, the project team will draw upon the collective wisdom of an eight-member advisory group. The advisors represent diverse institutions and will offer constructive critiques on every toolkit facet and deliverable as they are created. 

About the Project

Phase One (July 2019 – May 2021)

Activities will focus on building the major components of the toolkit, including:

Code of Ethics for Assessing Reuse

The Code of Ethics for Assessing Reuse will address key principles for responsibly assessing reuse, including indigenous and underrepresented communities’ concerns and ideas, as well as user privacy. The need for a Code of Ethics for Assessing Reuse was identified in Measuring Reuse focus group discussions centered around privacy, cultural sensitivity, and controversial reuse of digital collection materials. 

Reuse Assessment Recommended Practices

Recommended Practices will aggregate tools and resources and compile existing strategies for assessing various facets of digital object reuse. The recommended practices will address prominent reuse assessment use cases derived during the Measuring Reuse project. Additionally, gaps in current assessment capabilities will be detailed.

Education and Engagement Tools

Based on community input, the Recommended Practices for the prioritized use cases will be accompanied by the development of Education and Engagement Tools. Tutorials and quick start guides will provide examples and templates to demonstrate practical implementation of the recommended assessment practices. 


The final activities of Phase One include building and populating a living space for D-CRAFT. The Project Team envisions D-CRAFT to be one of a suite of digital library planning and assessment services hosted on the DLF Dashboard (which currently hosts the Digitization Cost Calculator, a tool created by the AIG’s Cost Assessment working group).

Phase Two (April – December 2021)

Phase Two will translate the main deliverables of Phase One (Code of Ethics, Recommended Practices, and Education and Engagement tools for each of the use cases) into training opportunities for the digital library community. This phase will take place over the last nine months of the project and will feature the development of in-person and online trainings.

Check the D-CRAFT website for announcements on training opportunities in 2021.

Getting Involved

The D-CRAFT project will seek the input of communities throughout the span of the project. In particular, the project team will institute open comment and feedback periods to ensure that the broadest perspectives are incorporated into the Code of Ethics for Assessing Reuse, Recommended Practices, and Education and Engagement Tools. That valuable exchange will ensure that the deliverables from the project can be utilized by the CHKO community to address topics related to reuse of digital assets. The D-CRAFT project team will release calls for public comment on prominent CHKO listservs as well as via its webpage. 

Beyond public comment, individuals can take advantage of either in-person or virtual trainings on D-CRAFT in the closing months of the grant. More details will be made available via the project website.

Learning More

The Project Team will maintain the D-CRAFT website as a way to efficiently communicate information about the project, as well as provide access to project updates, blog posts, conference presentations, and publication announcements. The permanent repository for all final project deliverables will be a dedicated Open Science Framework project repository.

In the coming weeks, D-CRAFT team members will be presenting on the project at several conferences, including the DLF Forum and the Charleston Conference. Details on additional conferences can be found on the project website.

The D-CRAFT Project Team would like to thank the IMLS for its generous financial support, the staff of CLIR/DLF for their continued support and guidance in making this grant application (and funded award) possible, colleagues in other DLF AIG Working Groups for their critical input and feedback on the design of our project, and their home institutions for giving the group the opportunity to engage in this exciting work. They look forward to collaborating with you all in the coming years to build, refine, and release D-CRAFT.



Project Team

  • Elizabeth Joan Kelly, Loyola University New Orleans
  • Ayla Stein Kenfield, University of Illinois
  • Kinza Masood, Mountain West Digital Library
  • Caroline Muglia, University of Southern California
  • Ali Shiri, University of Alberta
  • Santi Thompson, University of Houston
  • Liz Woolcott, Utah State University


  • Joyce Chapman, Duke University
  • Derrick Jefferson, American University
  • Myrna E. Morales, Massachusetts Coalition of Domestic Workers and PhD candidate, University of Illinois

The post Introducing D-CRAFT appeared first on DLF.

Guest post: Ilya Kreymer's Client-Side Replay Technology / David Rosenthal

Ilya Kreymer gave a brief description of his recent development of client-side replay for WARC-based Web archives in this comment on my post Michael Nelson's CNI Keynote: Part 3. It uses Service Workers, which Matt Gaunt describes in Google's Web Fundamentals thus:
A service worker is a script that your browser runs in the background, separate from a web page, opening the door to features that don't need a web page or user interaction. Today, they already include features like push notifications and background sync. In the future, service workers might support other things like periodic sync or geofencing. The core feature discussed in this tutorial is the ability to intercept and handle network requests, including programmatically managing a cache of responses.
Client-side replay was clearly an important advance, so I asked him for a guest post with the details. Below the fold, here it is.

Introducing wabac.js: Viewing Web Archives Directly in the Browser

As the web has evolved, many types of digital content can be opened and viewed directly in modern web browsers. A user can view images, watch video, listen to audio, read a book or publication, and even interact with 3D models, directly in the web browser. All of this content can be delivered as discrete (or streaming) files and rendered neatly by the browser. This lends well to web-based digital repositories, which can store these objects (and associated metadata) and serve them via the web.

But what about archiving web content itself and presenting it in a browser? The web is oriented around a network protocol (HTTP), not particular media formats. Thus, web archiving involves the capture of web (HTTP) traffic from multiple web servers, and later replay of that HTTP traffic in a different context. Web archiving has always been a form of ‘self-emulation’ of the web in the same medium. Accurately replaying previously captured web content requires not just a web browser, but also a web server, which can ‘emulate’ any number of web servers, and a system that can ‘emulate’ any number of sites when rendered in the browser. For this reason, a browser’s ‘Save Page As...’ functionality never really quite worked for anything but the simplest of sites: the browser does not store the network traffic and does not have the ability to emulate itself!

But what if it were possible to ‘render’ an arbitrary web archive, directly in the browser, just as easily as it is to view a video or a PDF?

Thanks to a technology originally developed for offline browsing, called Service Workers, this is now possible! Service workers allow a browser to also act as a web server for a given domain. This feature was designed for offline caching to speed up complex web applications, but turns out to have a significant impact for web archives. Using service workers, it is now possible to “emulate” a web server in the browser, thereby loading a web archive using the browser itself.

Initial research on using service workers for web archive replay has been published by Sawood Alam et al of ODU's WS-DL group (Client-Side Reconstruction of Composite Mementos Using ServiceWorker). I have taken the idea a step further, porting aspects of Webrecorder's existing python-based software to run directly in the browser as a service worker.

The result is the following prototype, Web Archive Browsing Advanced Client (WABAC), available on our github as wabac.js and as a static site. Much like opening a PDF in the browser, the static site allows opening and rendering WARC or HAR files directly in the browser (the files are not uploaded anywhere and do not leave the user's machine).

The implications of this technology could be far reaching for web archives.

First, the infrastructure necessary to support web archives could be reduced to primarily the cost of storage. Any system that can store and serve static files over HTTP, including github, S3, or any institutional repository, can thus function as a web archive. The costs associated with running a web archive can thus be reduced to the cost of existing storage, making storing web archives no different than storing other types of digital content.

For example, the following links load a web archive (via a WARC file) on-demand from github, and render a blog post or Twitter feed.

Since github provides free hosting for small files, the archived page is available entirely for free.

The trade-off currently is slightly longer load time as the full WARC needs to be downloaded, and more CPU processing in the users’ browser. (The download time can likely be reduced by using a pre-computed lookup index, which has not yet been implemented.)
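
To make the idea of a pre-computed lookup index concrete, here is a rough Python sketch that records the byte offset and length of each record in a simplified, uncompressed WARC-style byte string. It is not the approach wabac.js takes (the text above says no index is implemented yet). The records built by `make_record` omit fields a real WARC requires, and real WARC files are usually compressed, so treat every name here as illustrative.

```python
def make_record(uri: str, body: bytes) -> bytes:
    """Build a minimal, uncompressed WARC-style record (illustration only)."""
    head = (
        "WARC/1.0\r\n"
        "WARC-Type: response\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    # Each record is followed by two CRLF pairs.
    return head + body + b"\r\n\r\n"

def build_index(warc: bytes) -> dict:
    """Map each target URI to the (offset, length) of its record."""
    index = {}
    pos = 0
    while pos < len(warc):
        head_end = warc.index(b"\r\n\r\n", pos) + 4
        headers = {}
        # Skip the version line, then parse "Name: value" header lines.
        for line in warc[pos:head_end].decode("utf-8").split("\r\n")[1:]:
            if ": " in line:
                name, value = line.split(": ", 1)
                headers[name] = value
        end = head_end + int(headers["Content-Length"]) + 4
        index[headers["WARC-Target-URI"]] = (pos, end - pos)
        pos = end
    return index

WARC = make_record("http://example.org/", b"hello") + make_record(
    "http://example.org/a", b"second page"
)
INDEX = build_index(WARC)
offset, length = INDEX["http://example.org/a"]
record = WARC[offset:offset + length]
```

With offsets like these stored alongside the archive, a client could in principle ask the server for just the bytes of one record (for instance with an HTTP Range request) rather than downloading the whole file.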

Second, this approach could increase trust in web archives by increasing transparency. Current web archive replay infrastructures involve complex server-side operations, such as URL rewriting and banner insertion, which could raise questions about the reliability of web archives.

Due to how web archives currently work, web archive replay requires modifying the original content (to ‘emulate’ the original context) and serving a modified version in order for it to appear as expected, or with the archive’s branding, in the browser. These modifications happen on the web archive server and are thus opaque to the user: one could always claim the web archive has been improperly modified.
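
As a simplified illustration of the kind of modification involved, the Python below rewrites absolute `href` and `src` URLs so they point back into an archive. The `/replay/<timestamp>/<url>` layout is a hypothetical prefix chosen for this sketch, and a regular expression over HTML covers only the easiest case; real replay systems also have to deal with scripts, stylesheets, relative URLs and more.

```python
import re

# Matches href="..." or src="..." attributes holding absolute http(s) URLs.
ATTR_URL = re.compile(r'(?P<attr>\b(?:href|src))="(?P<url>https?://[^"]+)"')

def rewrite_links(html: str, timestamp: str, prefix: str = "/replay") -> str:
    """Point absolute href/src URLs at a hypothetical archive prefix."""
    def repl(match):
        return f'{match.group("attr")}="{prefix}/{timestamp}/{match.group("url")}"'
    return ATTR_URL.sub(repl, html)

page = (
    '<a href="http://example.org/about">About</a>'
    '<img src="https://example.org/logo.png">'
)
rewritten = rewrite_links(page, "20180428000000")
print(rewritten)
```

Whether this step runs on a server or in the user's browser is exactly the difference the surrounding paragraphs discuss; the transformation itself is the same.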

The service worker approach does not eliminate the complexity but it increases transparency, by moving all of the modifications necessary for rendering to the users’ browser. Thus, the browser must receive the raw content and render it in the browser. With the web archive replay happening in the browser and not on a remote server, it is possible to verify the rendering process, especially if the replay software is fully open source.

Of course, trusting the raw content becomes even more imperative, and more work is needed there. The WARC format, the standard format for archived web content, does not itself contain a way to verify that the data has not been tampered with. There have been various suggestions for how to verify that raw web archive data received from another server has not been tampered with, including by also storing certificates, signing the WARC files themselves, or possibly exploring the new signed exchange standard that is being proposed. Similar to other mediums, some form of verification of WARC data will be necessary to avoid ‘deep-fake’ WARCs that misrepresent, and much more work in this area is still needed.
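
The general shape of such a check can be sketched in a few lines of Python. This example uses a keyed hash (HMAC with SHA-256) from the standard library as a stand-in for the public-key signatures or certificates mentioned above; it is not any of those proposals, and the key and data are placeholders.

```python
import hashlib
import hmac

def sign(data: bytes, key: bytes) -> str:
    """Compute a keyed SHA-256 tag over the raw archive bytes."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()

def verify(data: bytes, key: bytes, signature: str) -> bool:
    """Check the tag in constant time; False means the bytes changed."""
    return hmac.compare_digest(sign(data, key), signature)

KEY = b"hypothetical-shared-secret"           # placeholder key
original = b"WARC/1.0 ... archived bytes ..."  # placeholder data
tag = sign(original, KEY)

print(verify(original, KEY, tag))              # untouched data
print(verify(original + b"tampered", KEY, tag))  # modified data
```

The important design point is that the tag (or signature) has to come from somewhere the person modifying the file cannot also modify; a checksum stored inside the same file does not give that guarantee.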

However, the client-side rendering approach clearly separates the rendering of a web archive from the raw data itself.

The clear distinction between the raw web archive data (e.g. WARC files) and the software needed to replay the web archive data has a number of useful properties for digital preservation, aligned with other types of content.

For example, it certainly makes sense that a PDF file or video is separate from the PDF viewer or video player that is used to access or view these files. (The two can be associated with each other in various ways.) When a video or PDF does not work in a particular player or PDF viewer, we can try a different viewer to determine whether the issue is with the raw content or the software.

But no such distinction currently exists for web archives, obfuscating issues and reducing trust in web archives. Web archives are usually referenced by a URL, such as:

but this simultaneously implies a reference to a particular capture of the page from 2018-04-28, available from the Internet Archive, AND the Internet Archive’s rendering of the archived page using a particular version of its (non-open-source) web archive replay software. In this case, the rendering does not work. Does it mean the capture is invalid, or that the rendering software does not work? We do not know based on the results, because the raw content is intermingled with the replay software, and so trust in web archives is lessened. If IA has issues with this page, what other issues do they have? Maybe they aren’t able to capture the data at all?

It turns out, the issue in this case is with the rendering, not the capture. IA does indeed have a capture of this page, but their current replay software is unable to render it properly. Fortunately, using the service worker replay system, it is possible to apply an ‘alternative’ rendering system to the same content!

The service-worker based replay system itself consists simply of HTML and JavaScript files, and thus can be added to an existing web archive, and then used to provide an alternative replay. It turns out that it is possible, as a proof-of-concept, to add an experimental, alternative client-side replay system directly to Internet Archive’s Wayback Machine. An example of this is as follows, which should yield a better replay of the same complex page:

This approach works because the client-side service-worker based replay system is itself just a web page, and thus can itself be archived and versioned as needed. The improved replay reconstructs the page in the client browser, without costing IA anything, although browser CPU and memory usage may increase and it may not always work (again, this is a prototype!).

Just like PDFs or videos, a web archive could be associated with a particular replay software and if a better version of the web archive software is developed, it could be associated with an improved version. This approach should make web archives more comprehensible and future-proof, as particular versions of replay software can be associated with raw content, while new web archives could be required to use a new version.

In the above example, the first timestamp (20191001020034) is the timestamp of the replay software, while the second timestamp is the actual timestamp of the content to be replayed (20180428140459). In fact, all of the versions of the client-side replay can be viewed by loading:*/
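The two timestamps mentioned above follow a 14-digit YYYYMMDDhhmmss pattern, so they can be turned into ordinary dates for comparison. The short sketch below does this in Python; the function name is hypothetical, and treating the values as UTC is an assumption.

```python
from datetime import datetime, timezone


def parse_wayback_timestamp(ts: str) -> datetime:
    """Parse a 14-digit YYYYMMDDhhmmss timestamp (assumed to be UTC)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


# Timestamp of the archived replay software
replay_software = parse_wayback_timestamp("20191001020034")
# Timestamp of the archived content being replayed
content = parse_wayback_timestamp("20180428140459")

print(replay_software.isoformat())  # 2019-10-01T02:00:34+00:00
print(content.isoformat())          # 2018-04-28T14:04:59+00:00
print(replay_software > content)    # True: the replay software is newer
```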

This is sort of an elegant hack onto an existing system, and not necessarily the ideal way to do this, but one way of illustrating the distinction! For better reliability, the replay software should be versioned and stored separately from the archived content.

Applying an alternative replay system to IA’s existing replay works because IA’s Wayback Machine provides an ‘unofficial’ way of retrieving the raw content. While IA does not make all the WARCs publicly accessible for download, it is possible to load the raw content by using a special modifier in the URL. The ability to load raw unmodified content was also discussed as a possible extension to the Memento protocol along with a more general preference system.

However, the final Memento extension proposal, with multiple 'dimensions of rawness', was perhaps overly complex: what is needed is a simple way to transmit the raw contents of a WARC, either as a full WARC or per WARC record (memento), without any modifications. Then, all necessary transformations and rewriting can be applied in the browser.

Why is per-memento access to raw content still useful if the WARC files can be downloaded directly? Ideally, the system would always work with raw WARC files, available as static files. In practice, large archives rarely make the raw WARC files available, for security and access control reasons. For example, a typical WARC from a crawl might be 1GB in size and may contain multiple resources, but an archive may choose to exclude or embargo a particular resource, say a 1MB file 200MB into the WARC. This would not be possible if the entire WARC is made available for download. For archives that require fine-grained access control, providing per-memento access to raw data is the best option for browser-based replay.
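To illustrate why per-record access allows fine-grained control, here is a small Python sketch. It assumes the common practice of compressing each record as its own gzip member, so that a record can be read from its byte offset and length alone. The record contents, the index layout, and the `serve` function are all hypothetical; this is not how any particular archive implements access control.

```python
import gzip
from typing import List, Optional, Set, Tuple

# Hypothetical stand-ins for three archived records
records = [b"record one", b"embargoed record", b"record three"]

# Build a WARC-like blob: each record is compressed as its own gzip member
blob = b""
index: List[Tuple[int, int]] = []  # (offset, length) of each compressed record
for rec in records:
    member = gzip.compress(rec)
    index.append((len(blob), len(member)))
    blob += member


def read_record(data: bytes, offset: int, length: int) -> bytes:
    """Decompress a single record using only its byte range."""
    return gzip.decompress(data[offset:offset + length])


def serve(data: bytes, idx: List[Tuple[int, int]], position: int,
          blocked: Set[int]) -> Optional[bytes]:
    """Return the raw record at position, or None if the archive excludes it."""
    if position in blocked:
        return None
    offset, length = idx[position]
    return read_record(data, offset, length)


blocked = {1}  # the archive embargoes the second record only
print(serve(blob, index, 0, blocked))
print(serve(blob, index, 1, blocked))
print(serve(blob, index, 2, blocked))
```

Serving the whole blob would expose the embargoed record as well, which is the situation the paragraph above describes.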

Thus far, the service worker browser-based replay proof-of-concept has been tested in two different ways: on small-scale, object-bound archives, such as a single blog post or a social media page, where the entire WARC is available and small enough to be downloaded at once; and embedded into very large existing web archives, such as IA’s Wayback Machine, where the client-side replay system sits on top of an existing replay mechanism capable of serving raw archived content or mementos.

While more research and testing is needed, the approach presents a compelling path to browser-based web archive rendering for archives of all sizes and could significantly improve web archive transparency and trust, reduce operational costs of running web archives, and improve digital preservation workflows.

A collective perspective on print books in the US and Canada / HangingTogether

According to a new study just released by the Pew Research Center, print books are still the most popular format for American readers. That’s good news, because according to a new OCLC Research position paper, US – and Canadian – libraries have a lot of them. More than 59 million distinct publications, based on nearly a billion print book holdings, in fact. That ought to keep everyone busy!

In The US and Canadian Collective Print Book Collection: A 2019 Snapshot, OCLC Research provides an overview of the collective print book holdings – as represented in WorldCat – of US and Canadian libraries. The paper also includes an updated version of our familiar illustration of US and Canadian “mega-regional” print book collections. Overall, the paper traces the contours of US and Canadian collective print book holdings at two scales: a “grand scale” combining the holdings of all US and Canadian libraries into a single collective collection, and a smaller regional scale that focuses on geographically-clustered print book holdings as collective collections.

A key finding from the paper is that the US and Canadian collective print book collection is growing more “dilute”: that is, growth in the number of distinct print book publications exceeds growth in total print book holdings, which means that the average number of holdings per print book publication is falling. This suggests that duplication, or overlap, across local print book collections may be lessening.
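The arithmetic behind “dilution” can be shown with a small Python sketch. The figures below are hypothetical and are NOT taken from the paper; they only show that when distinct publications grow faster than total holdings, the average number of holdings per publication falls.

```python
def holdings_per_publication(holdings: float, publications: float) -> float:
    """Average number of library holdings per distinct publication."""
    return holdings / publications


# Hypothetical starting figures (not from the OCLC Research paper)
publications_before = 50_000_000
holdings_before = 900_000_000

# Hypothetical growth: publications +10%, holdings +5%
publications_after = publications_before * 1.10
holdings_after = holdings_before * 1.05

before = holdings_per_publication(holdings_before, publications_before)
after = holdings_per_publication(holdings_after, publications_after)

print(round(before, 2))
print(round(after, 2))
print(after < before)  # the collection has become more "dilute"
```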

Collective collections are an important frame for collection analysis, and increasingly, the operational scale for services like shared print management, group-scale discovery, and resource sharing. OCLC Research has produced an extensive (and ongoing) program of work on collective collections, of which this position paper is the latest contribution. Be sure to also check out the recently published OCLC Research report Operationalizing the BIG Collective Collection: A Case Study of Consolidation vs Autonomy. Prepared in collaboration with the Big Ten Academic Alliance Library Initiatives, the report offers recommendations on how this consortium of academic libraries can expand the coordination of their collective collection.   

The post A collective perspective on print books in the US and Canada appeared first on Hanging Together.

Join #Hacktoberfest 2019 with Frictionless Data / Open Knowledge Foundation

The Frictionless Data team is excited to participate in #Hacktoberfest 2019! Hacktoberfest is a month-long event where people from around the world contribute to open source software (and – you can win a t-shirt!). How does it work? All October, the Frictionless Data repositories will have issues ready for contributions from the open source community. These issues will be labeled with ‘Hacktoberfest’ so they can be easily found. Issues will range from beginner level to more advanced, so anyone who is interested can participate. Even if you’ve never contributed to Frictionless Data before, now is the time! 

To begin, sign up on the official website and then read the OKF project participation guidelines + code of conduct and coding standards. Then find an issue that interests you by searching through the issues on the main Frictionless libraries (found here) and also on our participating Tool Fund repositories here. Next, write some code to help fix the issue, and open a pull request for the Frictionless Team to review. Finally, celebrate your contribution to an open source project!

We value and rely on our community, and are really excited to participate in this year’s #Hacktoberfest. If you get stuck or have questions, reach out to the team via our Gitter channel, or comment on an issue. Let’s get hacking!

Creating a Data Analysis Workspace with Voyant and the CAP API / Harvard Library Innovation Lab

This tutorial is an introduction to creating a data analysis workspace with Voyant and the Caselaw Access Project API. Voyant is a computational analysis tool for text corpora.

Import a Corpus

Let’s start by retrieving all full cases from New Mexico:

Copy and paste that API call into the Add Texts box and select Reveal. Here’s more on how to create your own CAP API call.

Create Stopwords

You’ve just created a corpus in Voyant! Nice 😎. Next we’re going to create stopwords to minimize noise in our data.

In Voyant, hover over a section header and select the sliding bar icon to define options for this tool.

Blue sliding bar icon shown displaying text "define options for this tool".

From the Stopwords field shown here, select Edit List. Scroll to the end of default stopwords, and copy and paste this list of common metadata fields, OCR errors, and other fragments:


Once you’re ready, Save and Confirm.

Your stopwords list is done! Here’s more about creating and editing your list of stopwords.
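To see what a stopword list does to word counts, here is a minimal Python sketch. It is not how Voyant is implemented; the stopwords, the sample sentence, and the `term_counts` function are hypothetical and for illustration only.

```python
import re
from collections import Counter

# Hypothetical stopwords: common words plus metadata-like fragments
stopwords = {"the", "of", "and", "a", "id", "url"}

# Hypothetical sample text
text = "The court of appeals affirmed the judgment of the district court."


def term_counts(text: str, stopwords: set) -> Counter:
    """Count lowercase word tokens, skipping anything in the stopword list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords)


counts = term_counts(text, stopwords)
print(counts.most_common(1))  # [('court', 2)]
```

Without the stopword list, "the" would be the most frequent term and would crowd out the more meaningful words.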

Data Sandbox

Let’s get started. Voyant has out-of-the-box tools for analysis and visualization to try in your browser. Here are some examples!

Summary: "The Summary provides a simple, textual overview of the current corpus, including (as applicable for multiple documents) number of words, number of unique words, longest and shortest documents, highest and lowest vocabulary density, average number of words per sentence, most frequent words, notable peaks in frequency, and distinctive words."

Here’s our summary for New Mexico case law.

TermsBerry: "The TermsBerry tool is intended to mix the power of visualizing high frequency terms with the utility of exploring how those same terms co-occur (that is, to what extend they appear in proximity with one another)."

Here's our TermsBerry.

Collocates Graph: "Collocates Graph represents keywords and terms that occur in close proximity as a force directed network graph."

Here's our Collocates Graph.
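The idea of terms occurring "in close proximity" can be sketched in a few lines of Python. This is only an illustration of counting collocates within a fixed window of words; it is not Voyant's implementation, and the function name, window size, and sample tokens are hypothetical.

```python
from collections import Counter
from typing import List


def collocates(tokens: List[str], keyword: str, window: int = 2) -> Counter:
    """Count the words appearing within `window` positions of each keyword hit."""
    counts: Counter = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical sample tokens
tokens = "district court affirmed judgment supreme court reversed judgment".split()

print(collocates(tokens, "court", window=1))
print(collocates(tokens, "court", window=2)["judgment"])  # 3
```

Each pair of keyword and collocate, with its count, would correspond to a weighted edge in a network graph.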

Today we created a data analysis workspace with Voyant and the Caselaw Access Project API.

To see how words are used in U.S. case law over time, try Historical Trends. Share what you find with us at

Evergreen 3.4.0 and OpenSRF 3.2.0 released / Evergreen ILS

The Evergreen community is proud to announce the release of Evergreen 3.4.0. Evergreen is highly scalable software for libraries that helps library patrons find library materials and helps libraries manage, catalog, and circulate those materials, no matter how large or complex the libraries.

Evergreen 3.4.0 is a major release that includes the following new features of particular note:

  • Integrated carousels for the public catalog
  • Billings and payments can now be aged, allowing the link between a patron and their billings to be severed when a circulation is aged.
  • A new feature for generating print content via a server-side web service is now available that allows for centralized management of print templates.
  • The Booking module has been redesigned with most of its interfaces rewritten in Angular. This redesign adds
    • The ability to edit reservations in many screens
    • A new notes field for the reservations record
    • A calendar view in the Create Reservations Interface
  • A number of improvements in cataloging, including
    • A new ‘Cancel Edit’ button in the record merge interface
    • The ability to export records from a staff catalog record basket
    • Additional options for controlling the overlay of items during MARC Batch Import/Export
    • The Physical Characteristics Wizard now displays both code and label for coded values
  • A number of circulation improvements, including
    • Enhancements to the Mark Item functions
    • A new Action Trigger notification is available for when a patron exceeds their fines and fees limit
    • A new permission to control whether a staff user can create pre-cats
    • Improvements to the Billing Details screen
  • Various improvements to the experimental Angular staff catalog, including
    • A new record holds tab
    • Support for call number browsing
    • Support for storing a list of recent searches
    • Support for named catalog search templates, allowing staff to create predefined searches
    • A new flat text MARC editor
  • A variety of staff client interfaces have been rewritten in Angular, including
    • The Organizational Units and Org Unit Types administration pages
    • The Permission Group administration page
    • The Standing Penalties administration page
    • A number of Local Administration pages
  • A number of improvements have been made to Angular grids in the staff interface, including
    • Support for filtering grid contents per-column
    • A sticky grid header for long/tall grids
  • It is now possible to configure LDAP authentication to allow a user’s institutional/single-sign-on username to be different from their Evergreen username.
  • A new mechanism for defining configurable APIs for patron authentication, allowing external services to more flexibly and securely use Evergreen to authenticate users.

The release is available on the Evergreen downloads page. For more information on what’s included in Evergreen 3.4.0, please consult the release notes or the tabular release notes summary.

Evergreen 3.4.0 requires PostgreSQL 9.6 or later and OpenSRF 3.2.0 or later.

The Evergreen community is also pleased to announce the release of OpenSRF 3.2.0, which adds support for Debian 10 Buster and removes support for the deprecated Apache WebSockets backend. For more information on the OpenSRF release, please consult its release notes.

Evergreen 3.4.0 and OpenSRF 3.2.0 include contributions from at least 32 individuals and 16 institutions.

Fedora 6 Code Sprint 1 Summary / DuraSpace News

The first code sprint to develop Fedora 6 ran from September 16-27 with thirteen participants:

  • Danny Bernstein, LYRASIS
  • Andrew Woods, LYRASIS
  • Ben Pennell, University of North Carolina, Chapel Hill
  • Jared Whiklo, University of Manitoba
  • Mohamed Mohideen Abdul Rasheed, University of Maryland
  • Peter Eichman, University of Maryland
  • Aaron Birkland, Johns Hopkins University
  • Youn Noh, Yale University
  • Dan Field, National Library of Wales
  • Jenny A’Brook, National Library of Wales
  • Richard Williams, National Library of Wales
  • Michal Dulinski, National Library of Wales
  • Remigiusz Malessa, National Library of Wales

The sprint had two primary goals:

  1. Implement basic resource management for Containers and Binaries in accordance with Oxford Common File Layout (OCFL) persistence
  2. Demonstrate a basic Fedora 3 to OCFL conversion using migration-utils and the OCFL client developed by Peter Winckles at the University of Wisconsin-Madison

Two teams were established to pursue these goals in parallel, one team per goal. The resource management team spent much of the first week engaged in detailed discussions on how to map Fedora’s resource management functionality to the OCFL, which led to a number of open questions that were answered over the course of the sprint. This team now has a clear path forward that they can follow in the next sprint in November.

The migration team members were relatively new to Fedora development, so this sprint served as an opportunity to get up to speed with the technology, development practices, and tools. The Fedora technical team facilitated this onboarding by assigning discrete tasks and guiding the new participants through the code review process. The updated migration-utils can now migrate Fedora 3 resources to an OCFL-compliant structure; the next step will be to align this structure with the expectations of Fedora 6. 

By the conclusion of the sprint, the teams had finalized the design work and gotten a good start on implementation. This will set up the next sprint in November, which will focus more on implementation with a goal of producing a functional Fedora 6 prototype by the end of 2019.

The post Fedora 6 Code Sprint 1 Summary appeared first on