Planet Code4Lib

tei2json: Summarizing the structure of Early English poetry and prose / Eric Lease Morgan

This posting describes a hack of mine – a Perl program to summarize the structure of Early English poetry and prose. [0]

In collaboration with Northwestern University and Washington University, the University of Notre Dame is working on a project whose primary purpose is to correct (“annotate”) the Early English corpus created by the Text Creation Partnership (TCP). My role in the project is to do interesting things with the corpus once it has been corrected. One of those things is the creation of metadata files denoting the structure of each item in the corpus.

Some of my work is really an effort to reverse engineer good work done by the late Sebastian Rahtz. For example, Mr. Rahtz cached a version of the TCP corpus, transformed each item into a number of different formats, and put the whole thing on GitHub. [1] As a part of this project, he created metadata files enumerating what TEI elements were in each file and what attributes were associated with each element. The result was an HTML display allowing the reader to quickly see how many bibliographies an item may have, what languages may be present, how long the document was as measured in page breaks, etc. One of my goals is/was to do something very similar.

The workings of the script are really very simple: 1) configure and denote what elements to count & tabulate, 2) loop through each configuration, 3) keep a running total of the result, 4) convert the result to JSON (a specific data format), and 5) save the result to a file. Here are (temporary) links to a few examples:
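The five steps above might look like the following in Python (a hypothetical sketch; the actual tei2json is a Perl program, and the element list here simply mirrors the sample matrix further below):

```python
import json
from xml.etree import ElementTree

# The elements to count & tabulate; in the real script this list is configurable
ELEMENTS = ['bibl', 'figure', 'l', 'lg', 'note', 'p', 'q']

def summarize(tei_file):
    """Loop through the configured elements, keeping a running total of each."""
    tree = ElementTree.parse(tei_file)
    counts = dict.fromkeys(ELEMENTS, 0)
    for element in tree.iter():
        name = element.tag.split('}')[-1]  # strip any TEI namespace prefix
        if name in counts:
            counts[name] += 1
    return counts

def save(counts, filename):
    """Convert the result to JSON and save it to a file."""
    with open(filename, 'w') as handle:
        json.dump(counts, handle, indent=2)
```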

JSON files are not really very useful in & of themselves; JSON files are designed to be transport mechanisms allowing other applications to read and process them. This is exactly what I did. In fact, I created two different applications. [2, 3] The former script takes a JSON file and creates an HTML file whose appearance is very similar to Rahtz’s. Using the JSON files (above), the following HTML files have been created.

The second script allows the reader to compare & contrast structural elements between items. It reads many JSON files and outputs a matrix of values. This matrix is a delimited file suitable for analysis in spreadsheets, database applications, statistical analysis tools (such as R or SPSS), or programming language libraries (such as Python’s numpy or Perl’s PDL). In its present configuration, the script outputs a matrix looking like this:

id      bibl  figure  l     lg   note  p    q
A00002  3     4       4118  490  8     18   3
A00011  3     0       2     0    47    68   6
A00089  0     0       0     0    0     65   0
A00214  0     0       0     0    151   131  0
A00289  0     0       0     0    41    286  0
A00293  0     1       189   38   0     2    0
A00395  2     0       0     0    0     160  2
A00749  0     4       120   18   0     0    2
A00926  0     0       124   12   0     31   7
A00959  0     0       2633  9    0     4    0
A00966  0     0       2656  0    0     17   0
A00967  0     0       2450  0    0     3    0

Given such a file, the reader could then ask & answer questions such as:

  • Which item has the greatest number of figures?
  • What is the average number of lines per line group?
  • Is there a statistical correlation between paragraphs and quotes?
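Questions like these can be answered with a few lines of stdlib Python run against the matrix (a sketch only; the rows are copied from the sample output above):

```python
import csv
import io
import statistics

# A few rows from the sample matrix above, as a tab-delimited string
matrix = """id\tbibl\tfigure\tl\tlg\tnote\tp\tq
A00002\t3\t4\t4118\t490\t8\t18\t3
A00293\t0\t1\t189\t38\t0\t2\t0
A00749\t0\t4\t120\t18\t0\t0\t2
"""

rows = list(csv.DictReader(io.StringIO(matrix), delimiter='\t'))

# Which item has the greatest number of figures? (ties broken by first occurrence)
most_figures = max(rows, key=lambda row: int(row['figure']))['id']

# What is the average number of lines (l) per line group (lg)?
ratios = [int(r['l']) / int(r['lg']) for r in rows if int(r['lg']) > 0]
lines_per_group = statistics.mean(ratios)
```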

Additional examples of input & output files are temporarily available online. [4]

My next steps include at least a couple of things. One, I need/want to evaluate whether or not to save my counts & tabulations in a database before (or after) creating the JSON files. The data may prove to be useful there. Two, as a librarian, I want to go beyond qualitative description of narrative texts. The counting & tabulating of structural elements moves in that direction, but it does not really address the “aboutness”, “meaning”, or “allusions” found in a corpus. Sure, librarians have applied controlled vocabularies and bits of genre to metadata descriptions, but such things are not quantitative and consequently elude statistical analysis. For example, using sentiment analysis one could measure and calculate the “lovingness”, “war mongering”, “artisticness”, or “philosophic nature” of the texts. One could count & tabulate the number of times family-related terms are used, assign the result a score, and record the score. One could then amass all documents and sort them by how much they discussed family, love, philosophy, etc. Such is on my mind, and more than half-way baked. Wish me luck.


Synonymizer: Using Wordnet to create a synonym file for Solr / Eric Lease Morgan

This posting describes a little hack of mine, Synonymizer — a Python-based CGI script to create synonym files suitable for use with Solr and other applications. [0]

Human language is ambiguous, and computers are rather stupid. Consequently computers often need to be explicitly told what to do (and how to do it). Solr is a good example. I might tell Solr to find all documents about dogs, and it will dutifully go off and look for things containing d-o-g-s. Solr might think it is smart by looking for d-o-g as well, but such is a heuristic, not necessarily a real understanding of the problem at hand. I might say, “Find all documents about dogs”, but I might really mean, “What is a dog, and can you give me some examples?” In which case, it might be better for Solr to search for documents containing d-o-g, h-o-u-n-d, w-o-l-f, c-a-n-i-n-e, etc.

This is where Solr synonym files come in handy. There are one or two flavors of Solr synonym files, and the one created by my Synonymizer is a simple line-delimited list of concepts, where each line is a comma-separated list of words or phrases. For example, the following is a simple Solr synonym file denoting four concepts (beauty, honor, love, and truth):

  beauty, appearance, attractiveness, beaut
  honor, abide by, accept, celebrate, celebrity
  love, adoration, adore, agape, agape love, amorousness
  truth, accuracy, actuality, exactitude

Creating a Solr synonym file is not really difficult, but it can be tedious, and the human brain is not always very good at multiplying ideas. This is where computers come in. Computers do tedium very well. And with the help of a thesaurus (like WordNet), multiplying ideas is easier.

Here is how Synonymizer works. First, it reads a configured database of previously generated synonyms.† In the beginning this file is empty, but it must be readable and writable by the HTTP server. Second, Synonymizer reads the database and offers the reader three options: 1) create a new set of synonyms, 2) edit an existing synonym, or 3) generate a synonym file. If Option #1 is chosen, then input is garnered and looked up in WordNet. The script then enables the reader to disambiguate the input through the selection of apropos definitions. Upon selection, both WordNet hyponyms and hypernyms are returned. The reader then has the opportunity to select desired words/phrases as well as enter any of their own design. The result is saved to the database. The process is similar if the reader chooses Option #2. If Option #3 is chosen, then the database is read, reformatted, and output to the screen as a stream of text to be used by Solr or anything else that may require similar functionality. Because Option #3 is generated with a single URL, it is possible to programmatically incorporate the synonyms into your Solr indexing pipeline.
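Option #3 is the simplest of the three, and a sketch of it fits in a few lines. The database layout here — one concept per line, fields separated by tabs — is an assumption on my part; per the footnote, the "database" is really just a delimited text file:

```python
def generate_synonym_file(database):
    """Read the delimited database and reformat it as Solr synonym lines."""
    lines = []
    with open(database) as handle:
        for record in handle:
            # one concept per record; the fields are the synonymous words/phrases
            terms = [term.strip() for term in record.split('\t') if term.strip()]
            if terms:
                # each concept becomes one comma-separated line, as Solr expects
                lines.append(', '.join(terms))
    return '\n'.join(lines)
```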

The Synonymizer is not perfect.‡ For example, it only creates one of the two different types of Solr synonym files. Second, while Solr can use the generated synonym file, search results implement phrase searches poorly, and this is a well-known issue. [1] Third, editing existing synonyms does not really take advantage of previously selected items; data entry is tedious, but not as tedious as writing the synonym file by hand. Fourth, the script is not fast, and I blame this on Python and WordNet.

Below are a couple of screenshots from the application. Use and enjoy.

Synonymizer home

Synonymizer output

[0] –

[1] “Why is Multi-term synonym mapping so hard in Solr?” –

† The “database” is really a simple delimited text file. No database management system required.

‡ Software is never done. If it were, then it would be called “hardware”.

An increasing role for libraries in research information management / HangingTogether

It’s no secret that the research ecosystem has been experiencing rapid change in recent years, driven by complex political, technological, and network influences. One component of this complicated environment is the adoption of research information management (RIM) practices by research institutions, and particularly the increasing involvement of libraries in this development.

Research information management is the aggregation, curation, and utilization of information about research. Research universities, research funders, as well as individual researchers are increasingly looking for aggregated, interconnected research information to better understand the relationships, outputs, and impact of research efforts as well as to increase research visibility.

Efforts to collect and manage research information are not new but have traditionally emphasized the oversight and administration of federal grants. Professional research administrative oversight within universities emerged in the 20th century, rapidly accelerating in the United States following Sputnik and exemplified through the establishment of professional organizations concerned primarily with grants administration & compliance, such as the Society for Research Administrators (SRA) in 1957 and the National Council of University Research Administrators (NCURA) in 1959.

Today research information management efforts seek to aggregate and connect a growing diversity of research outputs that encompass more than grants administration, and significantly for libraries, includes the collection of publications information. In addition, both universities and funding agencies have an interest in reliably connecting grants with resulting publications–as well as to researchers and their institutional affiliations.

Not long ago the process for collecting the scholarly publications produced by a campus’s researchers would have been a manual one, possible only through the collection of each scholar’s curriculum vitae. The resources required to collect this information at institutional scale would have been prohibitively expensive, and few institutions made such an effort. Institutions have instead relied upon proxies of research productivity–such as numbers of PhDs awarded or total dollars received in federal research grants–to demonstrate their research strengths. However, recent advances in scholarly communications technology and networked information offer new opportunities for institutions to collect the scholarly outputs of their researchers. Indexes of journal publications like Scopus, PubMed, and Web of Science provide new sources for the discovery and collection of research outputs, particularly for scientific disciplines, and a variety of open source, commercial, and locally-developed platforms now support institutional aggregation of publications metadata. The adoption of globally accepted persistent identifiers (PIDs)–like DOIs for digital publications and datasets, and ORCID and ISNI identifiers for researchers–provides essential resources for reliably disambiguating unique objects and people, and the incorporation of these identifiers into scholarly communications workflows provides growing opportunities for improved metadata quality and interoperability.

Institutions may now aggregate research information from numerous internal and external sources, including information such as:

• Individual researchers and their institutional affiliations
• Publications metadata
• Grants
• Patents
• Awards & honors received by a researcher
• Citation counts and other measures of research impact

Depending upon institutional needs, the RIM system may also capture additional internal information about faculty, such as:
• Courses taught
• Students advised
• Committee service

National programs to collect and measure the impact of sponsored research have accelerated the adoption of research information management in some parts of the world, such as through the Research Excellence Framework (REF) in the UK and Excellence in Research for Australia (ERA). The effort to collect, quantify, and report on a broad diversity of research outputs has been happening for some time in Europe, where RIM systems are more commonly known as Current Research Information Systems (CRIS), and where efforts like CERIF (the Common European Research Information Format) provide a standard data model for describing and exchanging research entities across institutions.

Here in the US, research information management is emerging as a part of scholarly communications practice in many university libraries, in close collaboration with other campus stakeholders. In the absence of national assessment exercises like REF or ERA, RIM practices are following a different evolution, one with greater emphasis on reputation management for the institution, frequently through the establishment of public research expertise profile portals such as those in place at Texas A&M University and the University of Illinois. Libraries such as Duke University’s are using RIM systems to support open access efforts, and others are implementing systems that convert a decentralized and antiquated paper-based system of faculty activity reporting and annual review into a centralized process with a single cloud-based platform, as we are seeing at the University of Arizona and Virginia Tech.

I believe that support for research information management will continue to grow as a new service category for libraries, as Lorcan Dempsey articulated in 2014. Through the OCLC Research Library Partnership and in collaboration with partners from EuroCRIS, I am working with a team of enthusiastic librarians and practitioners from three continents to explore, research, and report on the rapidly evolving RIM landscape, building on previous RLP outputs exploring the library’s contribution to university ranking and researcher reputation. 

One working group is dedicated to conducting a survey of research institutions to gauge RIM activity:

• Pablo de Castro, EuroCRIS
• Anna Clements, University of St. Andrews
• Constance Malpas, OCLC Research
• Michele Mennielli, EuroCRIS
• Rachael Samberg, University of California-Berkeley
• Julie Speer, Virginia Tech University

And a second working group is engaged with qualitative inquiry into institutional requirements and activities for RIM adoption:
• Anna Clements, University of St. Andrews
• Carol Feltes, Rockefeller University
• David Groenewegen, Monash University
• Simon Huggard, La Trobe University
• Holly Mercer, University of Tennessee-Knoxville
• Roxanne Missingham, Australian National University
• Malaica Oxnam, University of Arizona
• Annie Rauh, Syracuse University
• John Wright, University of Calgary

Our research efforts are just beginning, and I look forward to sharing more about our findings in the future.

About Rebecca Bryant

Rebecca Bryant is Senior Program Officer at OCLC where she leads research and initiatives related to research information management in research universities.

VuFind - 3.1.2 / FOSS4Lib Recent Releases

Release Date: Monday, January 16, 2017

Last updated January 15, 2017. Created by Demian Katz on January 15, 2017.

Minor bug fix / expanded translation release.

Towards Reproducibility of Microscopy Experiments / D-Lib

Article by Sheeba Samuel, Frank Taubert and Daniel Walther, Institute for Computer Science, Friedrich Schiller University Jena; Birgitta Koenig-Ries and H. Martin Buecker, Institute for Computer Science, Friedrich Schiller University Jena, Michael Stifel Center Jena for Data-driven and Simulation Science

RepScience2016 / D-Lib

Guest Editorial by Amir Aryani, Australian National Data Service; Oscar Corcho, Departamento de Inteligencia Artificial, Universidad Politecnica de Madrid; Paolo Manghi, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche; Jochen Schirrwagen, Bielefeld University Library

The Scholix Framework for Interoperability in Data-Literature Information Exchange / D-Lib

Article by Adrian Burton and Amir Aryani, Australian National Data Service; Hylke Koers, Elsevier; Paolo Manghi and Sandro La Bruzzo, Istituto di Scienza e Tecnologie dell'Informazione, Consiglio Nazionale delle Ricerche; Markus Stocker, Michael Diepenbroek and Uwe Schindler, PANGAEA, MARUM Center for Marine Environmental Sciences, University of Bremen; Martin Fenner, DataCite

Google's "Crypto-Cookies" are tracking Chrome users / Eric Hellman

Ordinary HTTP cookies are used in many ways to make the internet work. Cookies help websites remember their users. A common use of cookies is for authentication: when you log into a website, the reason you stay logged is because of a cookie that contains your authentication info. Every request you make to the website includes this cookie; the website then knows to grant you access.

But there's a problem: someone might steal your cookies and hijack your login. This is particularly easy for thieves if your communication with the website isn't encrypted with HTTPS. To address the risk of cookie theft, the security engineers of the internet have been working on ways to protect these cookies with strong encryption. In this article, I'll call these "crypto-cookies", a term not used by the folks developing them. The Chrome user interface calls them Channel IDs.

Development of secure "crypto-cookies" has not been a straight path. A first approach, called "Origin Bound Certificates" has been abandoned. A second approach "TLS Channel IDs" has been implemented, then superseded by a third approach, "TLS Token Binding" (nicknamed "TokBind"). If you use the Chrome web browser, your connections to Google web services take advantage of TokBind for most, if not all, Google services.

This is excellent for security, but might not be so good for privacy; 3rd party content is the culprit. It turns out that Google has not limited crypto-cookie deployment to services like GMail and Youtube that have log-ins. Google hosts many popular utilities that don't get tracked by conventional cookies. Font libraries such as Google Fonts, javascript libraries such as jQuery, and app frameworks such as Angular are all hosted on Google servers. Many websites load these resources from Google for convenience and fast load times. In addition, Google utility scripts such as Analytics and Tag Manager are delivered from separate domains so that users are only tracked across websites if so configured. But with Google Chrome (and Microsoft's Edge browser), every user that visits any website using Google Analytics, Google Tag Manager, Google Fonts, jQuery, Angular, etc. is subject to tracking across websites by Google. According to Princeton's OpenWPM project, more than half of all websites embed content hosted on Google servers.
Top 3rd-party content hosts. From Princeton's OpenWPM. Note that most of the hosts labeled "Non-Tracking Content" are at this time subject to "crypto-cookie" tracking.

While using 3rd party content hosted by Google was always problematic for privacy-sensitive sites, the impact on privacy was blunted by two factors – caching and statelessness. If a website loads fonts from, or style files from, the files are cached by the browser and only loaded once per day. Before the rollout of crypto-cookies, Google had no way to connect one request for a font file with the next – the request was stateless; the domains never set cookies. In fact, Google says:
Use of Google Fonts is unauthenticated. No cookies are sent by website visitors to the Google Fonts API. Requests to the Google Fonts API are made to resource-specific domains, such as or, so that your requests for fonts are separate from and do not contain any credentials you send to while using other Google services that are authenticated, such as Gmail. 
But if you use Chrome, your requests for these font files are no longer stateless. Google can follow you from one website to the next, without using conventional tracking cookies.

There's worse. Crypto-cookies aren't yet recognized by privacy plugins like Privacy Badger, so you can be tracked even though you're trying not to be. The TokBind RFC also includes a feature called "Referred Token Binding" which is meant to allow federated authentication (so you can sign into one site and be recognized by another). In the hands of the advertising industry, this will get used for sharing of the crypto-cookie across domains.

To be fair, there's nothing in the crypto-cookie technology itself that makes the privacy situation any different from the status quo. But as the tracking mechanism moves into the web security layer, control of tracking is moved away from application layers. It's entirely possible that the parts of Google running services like and have not realized that their infrastructure has started tracking users. If so, we'll eventually see the tracking turned off.  It's also possible that this is all part of Google's evil master plan for better advertising, but I'm guessing it's just a deployment mistake.

So far, not many companies have deployed crypto-cookie technology on the server-side. In addition to Google and Microsoft, I find a few advertising companies that are using it.  Chrome and Edge are the only client side implementations I know of.

For now, web developers who are concerned about user privacy can no longer ignore the risks of embedding third party content. Web users concerned about being tracked might want to use Firefox for a while.


  1. This blog is hosted on a Google service, so assume you're being watched. Hi Google!
  2. OS X Chrome saves the crypto-cookies in an SQLite file at "~/Library/Application Support/Google/Chrome/Default/Origin Bound Certs". 
  3. I've filed bug reports/issues for Google Fonts, Google Chrome, and Privacy Badger. 
  4. Dirk Balfanz, one of the engineers behind TokBind has a really good website that explains the ins and outs of what I call crypto-cookies.

DPLA to Expand Access to Ebooks with Support from the Alfred P. Sloan Foundation / DPLA

The Digital Public Library of America is thrilled to announce that the Alfred P. Sloan Foundation has awarded DPLA $1.5 million to greatly expand its efforts to provide broad access to widely read ebooks. The grant will support improved channels for public libraries to bolster their ebook collections, and for millions of readers nationwide to access those works easily.

DPLA will leverage its extensive connections to America’s libraries through its national network to pilot new ways of acquiring ebook collections. In the same way that DPLA has worked with its hubs in states from coast to coast to improve access to digitized materials from America’s archives, museums, and libraries, DPLA will collaborate with other institutions to improve access to ebooks through market-based methods.

As part of the grant, DPLA will also develop an expansive, open collection of popular ebooks, formatted in the EPUB format for smartphones and tablets, and curated so that readers can find works of interest. Together, these programs will increase substantially the number of ebooks that are readable by all Americans, on the devices that are now broadly held throughout society.

“From its inception, DPLA has sought to maximize access to our shared culture,” Dan Cohen, DPLA’s Executive Director, said at the announcement of the new Sloan grant. “Books are central to that culture, and the means through which everyone can find knowledge and understanding, multiple viewpoints, history, literature, science, and enthralling entertainment. We deeply appreciate the Sloan Foundation’s support to help us connect the most people with the most books, which are now largely in digital formats.”

“The Sloan Foundation is delighted to support the Digital Public Library of America’s efforts to create new channels for better ebook access,” said Doron Weber, Vice President and Program Director at the Alfred P. Sloan Foundation. “Sloan was the founding funder of DPLA and its mission, enabling a nationwide, grassroots and non-profit collaboration that to date has provided access to over 15 million digitized items from over 2,000 cultural heritage institutions across the U.S. With its timely new focus on ebooks, DPLA will leverage its national network to expand reading opportunities for thousands of schools and libraries and millions of students, scholars, and members of the public.”

The Sloan grant will help DPLA build upon its existing successful ebook work, such as in the Open eBooks Initiative, which has provided thousands of popular and award-winning books to children in need. Recently, DPLA announced with its Open eBooks partners the New York Public Library, First Book, Baker & Taylor, and Clever that well over one million books were read through the Sloan-supported program in 2016.

Truth-seeking institutions and strange bedfellows / Galen Charlton

I was struck just now by the confluence of two pieces that are going around this morning. One is Barbara Fister’s Institutional Values and the Value of Truth-Seeking Institutions:

Even if the press fails often, massively, disastrously, we need it. We need people employed full-time to seek the truth and report it on behalf of the public. We need to defend the press while also demanding that they do their best to live up to these ethical standards. We need to call out mistakes, but still stand up for the value of independent public-interest reporting.

Librarians . . . well, we’re not generally seen as powerful enough to be a threat. Maybe that’s our ace in the hole. It’s time for us to think deeply about our ethical commitments and act on them with integrity, courage, and solidarity. We need to stand up for institutions that, like ours, support seeking the truth for the public good, setting aside how often they have botched it in the past. We need to apply our values to a world where traditions developed over years for seeking truth – the means by which we arrive at scientific consensus, for example – are cast aside in favor of nitpicking, rumor-mongering, and self-segregation.

The other is Eric Garland’s Twitter thread on how the U.S. intelligence community gathers and analyzes information:

In particular,

Of course, if it is easy nowadays to be cynical about the commitment of the U.S. press to truth-seeking, such cynicism is an even easier pose to adopt towards the intelligence community. At the very least, spreading lies and misinformation is also in the spy’s job description.

But for the purpose of this post, let’s take the latter tweet at face value, as an expression of an institutional value held by the intelligence community (or at least by its analysts).

I’m left with a couple of inchoate observations. First, a hallmark of social justice discourse at its best is a radical commitment to centering the voices of those who hitherto have been ignored. Human nature being what it is, at least a few folks who understood this during their college days will end up working for the likes of the CIA. On the one hand, that sort of transition feels like a betrayal. On the other hand, I’m not Henry L. Stimson: not only is it inevitable that governments will read each other’s mail, my imagination is not strong enough to imagine a world where they should not. More “Social Justice Intelligence Analysts” might be a good thing to have — as a way of mitigating a certain kind of intellectual weakness.

However, one of the predicaments we’re in is that the truth alone will not save us; it certainly won’t do so quickly, not for libraries, and not for the people we serve. I wonder if the analyst side of the intelligence community, for all their access to ways of influencing events that are not available to librarians, is nonetheless in the same boat.

Tracking Changes With diffengine / Ed Summers

Our most respected newspapers want their stories to be accurate, because once the words are on paper, and the paper is in someone's hands, there's no changing them. The words are literally fixed in ink to the page, and mass produced into many copies that are near impossible to recall. Reputations can rise and fall based on how well newspapers are able to report significant events.

But of course physical paper isn't the whole story anymore. News on the web can be edited quickly as new facts arrive and more is learned. Typos can be quickly corrected--but content can also be modified for a multitude of purposes. Often these changes instantly render the previous version invisible. Many newspapers use their website as a place for their first drafts, which allows them to craft a story in near real time, while being the first to publish breaking news. News travels *fast* in social media as it is shared and reshared across all kinds of networks of relationships. What if that initial, perhaps flawed version goes viral, and it is the only version you ever read? It's not necessarily fake news, because there's no explicit intent to mislead or deceive, but it may not be the best, [most accurate] news either. Wouldn't it be useful to be able to watch how news stories shift in time to better understand how the news is produced? Or as Jeanine Finn memorably put it: how do we understand the news [before truth gets its pants on]?

As part of [MITH]'s participation in the [Documenting the Now] project we've been working on an experimental utility called [diffengine] to help track how news is changing. It relies on an old and quietly ubiquitous standard called [RSS]. RSS is a data format for syndicating content on the Web. In other words, it's an automated way of sharing what's changing on your website. News organizations use it heavily, and if you've ever subscribed to a podcast you're using RSS.
If you have a blog or write on [Medium], an RSS feed is quietly generated for you whenever you write a new post. So what diffengine does is really quite simple. First it subscribes to one or more RSS feeds, for example the Washington Post, and then it watches to see if any articles change their content over time. If a change is noticed, a representation of the change, or a "[diff]", is generated, archived at the [Internet Archive] and (optionally) tweeted. We've been experimenting with an initial version of diffengine by having it track the Washington Post, the Guardian and Breitbart News, which you can see on the following Twitter accounts: [wapo_diff], [guardian_diff] and [breitbart_diff]. Here's an example of what a change looks like when it is tweeted: the text highlighted in red has been deleted and the text highlighted in green has been added.

But you can't necessarily take diffengine's word for it that the text has been changed, right? Bots are [sending] all kinds of fraudulent and intentionally misleading information out on the web, and in particular in social media. So when diffengine notices new or changed content it uses the Internet Archive's [save page now] functionality to take a snapshot of the page, which it then references in the tweet, so you can see the original and changed content there. You can see those links in the tweet above.

diffengine draws heavily on the inspiration of two previous projects, [NYTDiff] and [NewsDiffs], which did very similar things. [NYTdiff] is able to create presentable diff images and [tweet them] for the New York Times, but it was designed to work specifically with the NYTimes API. NewsDiffs provides a comprehensive framework for watching changes on multiple sites (Washington Post, New York Times, CNN, BBC, etc.), but you need to be a programmer to add a [parser module] for a website that you want to monitor. It is also a fully functional web application, which requires some commitment to install and run.
With the help of [feedparser], diffengine takes a different approach that works with any site publishing an RSS feed of changes. This covers many news organizations, but also personal blogs and organizational websites that put out regular updates. And with the [readability] module, diffengine is able to automatically extract the primary content of pages, without requiring special parsing to remove boilerplate material.

To do its work diffengine keeps a small database of feeds, feed entries and version histories that it uses to notice when content has changed. If you know your way around a SQLite database you can query it to see how content has changed over time. The database could be a valuable source of research data if you are studying the production of the news, or the way organizations or people communicate online. One possible direction we are considering is creating a simple web frontend for this database that allows you to navigate the changed content without requiring SQL chops. If this sounds useful, please get in touch with the DocNow project by joining our [Slack] channel or emailing us.

[Installation] of diffengine is currently a bit challenging if you aren't already familiar with installing Python packages from the command line. If you are willing to give it a try, let us know how it goes over on [GitHub]. Ideas for sites for us to monitor as we develop diffengine are also welcome!

*Special thanks to [Matthew Kirschenbaum] and [Gregory Jansen] at the University of Maryland for the initial inspiration behind this idea of showing rather than telling what news is.
The [Human-Computer Interaction Lab] at UMD hosted an informal workshop after the recent election to see what possible responses could be, and diffengine is one outcome from that brainstorming.*

Equinox Transitions to NonProfit to Benefit Libraries / Equinox Software

Equinox Transitions to Nonprofit to Benefit Libraries


Duluth, Georgia, January 12, 2017. On January 1, 2017, Equinox Software, Inc., the premier support and service provider for the Evergreen Integrated Library System, became Equinox Open Library Initiative Inc., a nonprofit corporation serving libraries, archives, museums, and other cultural institutions. This change comes after several years of consideration, evaluation of community needs, planning, and preparation.  The change allows Equinox to better serve its customers and communities by broadening its mission of bringing more open source technology to a wide array of institutions dedicated to serving the public good.

About the conversion from for-profit to nonprofit, Mike Rylander, president of the new Equinox Open Library Initiative said, “Everyone at Equinox is dedicated to the mission of helping libraries of all types adopt and use open source software.  We have been involved in this work for ten years now, and our move to become a nonprofit helps us further that mission.  Importantly, this change also matches more closely the cooperative, community-focused ethos of the open source technologies with which we work.  We could not be more excited to move forward in this new direction.”

Jason Etheridge, an Equinox founder, added, “In 2009, we wrote an open letter to the community called the Equinox Promise, where we pledged to adhere to ideas such as transparency, code sharing, maintaining a single code set, and, in general, working with and within the Evergreen and Koha communities.  This built on the original vision of Evergreen as software that should be open source for both philosophical and pragmatic reasons.  Equinox becoming a nonprofit is another promise, one with legal teeth, where our charitable purpose is put front and center.  I see no better way to participate in the gift culture known as open source, and in our Evergreen and Koha communities.”

While daily operations at Equinox will not change, company leaders highlight that going forward there will be new opportunities for service expansion and enhancement, as well as creative funding options for projects that enhance library services.  Grace Dunbar, Equinox Vice President, pointed out, “By becoming a nonprofit organization, Equinox will actually be able to do more and grow our service offerings to the library community.  I think it’s important to note we’re not changing our services—we still offer a complete suite of services for seamless migration, support, and development for open source library software. However, by making the change to nonprofit we will be able to grow in a way that does not require a merger or acquisition with a proprietary software company and will allow us to integrate more resources into our mission.”

For more information, please visit our FAQ.

About Equinox Open Library Initiative Inc.
Equinox Open Library Initiative Inc. is a nonprofit company engaging in literary, charitable, and educational endeavors serving cultural and knowledge institutions.  As the successor to Equinox Software, Inc., the Initiative carries forward a decade of service and experience with Evergreen and other open source library software.  At Equinox OLI we help you empower your library with open source technologies.

CSV,Conf is back in 2017! Submit talk proposals on the art of data collaboration. / Open Knowledge Foundation


CSV,Conf,v3 is happening! This time the community-run conference will be in Portland, Oregon, USA on the 2nd and 3rd of May 2017. It will feature stories about data sharing and data analysis from science, journalism, government, and open source. We want to bring together data makers/doers/hackers from backgrounds like science, journalism, open government and the wider software industry to share knowledge and stories.

csv,conf is a non-profit community conference run by people who love data and sharing knowledge. This isn’t just a conference about spreadsheets. CSV Conference is a conference about data sharing and data tools. We are curating content about advancing the art of data collaboration, from putting your data on GitHub to producing meaningful insight by running large scale distributed processing on a cluster.

Talk proposals for CSV,Conf close Feb 15, so don’t delay, submit today! The deadline is fast approaching and we want to hear from a diverse range of voices from the data community.

Talks are 20 minutes long and can be about any data-related concept that you think is interesting. There are no rules for our talks, we just want you to propose a topic you are passionate about and think a room full of data nerds will also find interesting. You can check out some of the past talks from csv,conf,v1 and csv,conf,v2 to get an idea of what has been pitched before.

If you are passionate about data and the many applications it has in society, then join us in Portland!


Speaker perks:

  • Free pass to the conference
  • Limited number of travel awards available for those unable to pay
  • Did we mention it’s in Portland in the Spring????

Submit a talk proposal today at 

Early bird tickets are now on sale here.

If you have colleagues or friends who you think would be a great addition to the conference, please forward this invitation along to them! CSV,Conf,v3 is committed to bringing a diverse group together to discuss data topics.

For questions, please email, DM @csvconference or join the public slack channel.

– the csv,conf,v3 team

Preserving Email Report Summary / Archival Connections

Earlier today, I provided a summary of Preserving Email, a Technology Watch Report I wrote back in 2011. I'll leave it to others to judge how well that report holds up, but I had the following takeaways when re-reading it:

Update to WMS Circulation API / OCLC Dev Network

The latest release of the WMS Circulation API includes new operations for forwarding holds, pulling items, and inventory and in-house check-ins of items.

DNA experts claim that Cherokees are in the Middle East / Equinox Software

We were holding some superior hints for ending newcomers, which you are capable to use in nearly any article or speech. It must be inviting to your very own audience, also it would do you fantastic to begin your article that’s a great anecdote. Now there is no need to visit fantastic lengths to purchase composition. Decide what compartmentalization of position you’ll be using for your own essay. The judgment is only to refresh your composition within the audience’s head. Topic phrase must certanly be created in the best saying the primary topic area of an essay. It’s possible to be equally as innovative as you choose to be, s O long as your composition carries the appropriate info to the reader. There are many ways about how to compose an essay. Ourpany provides to purchase documents on line. Only the ideal writers, simply the perfect high quality Essay on Love expert essay tok documents 2008 solutions for cheap.

She was interested and also re-read part of the section (i examine her it while she drove) later.

Ergo, you should choose the beginning of your own reflective composition seriously. This list relates to a number of the simple to compose article subjects. Consequently, follow this advice to compose an excellent article in easy method. Your article needs to be up to-date with all the details, particularly the performance data of the gamers. Your satirical article may make extra brownie points with a suitable title. To be able to compose a high-quality dissertation composition you might have to be convincing and can show your claim regardless of what. Once, you have your name on you, you are able to truly start seeking pertinent information all on your own article. Allow your first-hand experience be placed into phrases, whenever you’re writing a reflective essay. Writing this type of essay is not a easy job.

This may cause sales that is more shut with each group.

Writing a suitable cover for an article you’ve created isn’t an extremely tough job whatsoever, nevertheless it truly is the most discounted. You may also attempt to locate specialist essay writing solutions which is able enough to finish your writing needs. Certainly, custom paper writing services aren’t free. Your thesis statement should educate your readers exactly what the paper is about, as well as assist guide your writing. Writing a paper is only a speciality that needs writing gift. Web is really an professional article writing service available on the net to anybody who requires an article document. Merely be sure that your essay WOn’t seem just informative. That is all you will need to understand so as to compose a great dissertation composition.” Thanks so much, it’s really a decent article! To start, make an outline or prewriting of your own article when preparing the initial write.

Length mba in india could be the right option for these aspirants.

One ought to comprehend the 3 conventional sections of the article. Purchase essays that absolutely trust your demands. GW graduates utilize company to generate favorable, where to purchase essays alter. The finest component about creating an enlightening essay may be the big assortment of topics you are able to select from. Should you be confident with the manner you’ve written your relative article and you also really believe you haven’t left actually one level found then you’ve all the chances of developing a fantastic impact on the readers. The kind of theme you determine on is going to count on the purpose why you’re composing the article in the very first affordable papers spot. The prime thought that you simply have to concentrate up on initially, is the objective of creating this essay. The very first step to creating a flourishing school essay is selecting the best theme.

ALA urges Senators to probe Sessions on privacy / District Dispatch

ALA, together with a baker’s dozen of allied organizations, has written to the members of the Senate Judiciary Committee on the eve of its hearings on the confirmation of Sen. Jeff Sessions (R-AL) to serve as the nation’s next Attorney General. Detailing concerns about Sen. Sessions’ record on a host of issues – including expressly his opposition to the special protection of library patron records – the letter calls on Committee members to use the hearings to “carefully investigate Senator Sessions’ record on privacy and seek assurances that he will not pursue policies that undermine Americans’ privacy and civil liberties.”

Senator Jeff Sessions at his confirmation hearing. Source: 13newsnow

Orchestrated by the Center for Democracy & Technology, the American Association of Law Libraries and Association of Research Libraries also signed the letter, as did other prominent national groups, including: Access Now, Amnesty International USA, the Constitutional Alliance and Electronic Frontier Foundation.

The post ALA urges Senators to probe Sessions on privacy appeared first on District Dispatch.

HTTPS Everywhere -- Promise Fulfilled / Library Tech Talk (U of Michigan)

green lock icon and text that says https at the start of a URL bar

Over Fall 2016, the University of Michigan Library updated most of its web sites to operate exclusively on a secure, HTTPS, protocol. Along the way, we learned a few lessons.

Equipping librarians to code: part 2 / District Dispatch

I know, I know, you just put down Increasing CS Opportunities for Young People, the Libraries Ready to Code final report, and are already saying to yourself, “What’s next?” and “How can I get involved?” Here’s the answer:

Today we officially launched Ready to Code 2 (RtC2), Embedding RtC Concepts in LIS Curricula. Building on findings from last year’s work, RtC2 focuses on ensuring pre-service and in-service librarians are prepared to facilitate and deliver youth coding activities that foster the development of computational thinking skills—skills critical for success in education and career. Like Phase 1, this will be a yearlong project and is also supported by Google, Inc.

library professionals demonstrating coding in a public library computer lab

Photo Credit: Los Angeles Public Library

Several of the findings from Phase 1 led us to consider the potential impact that focusing on librarian preparation and professional development could have on increasing the pool of librarians and library staff with the skills necessary to design and implement coding programs that spark the curiosity and creativity of our young patrons and help them connect coding to their own interests and passions, which can lie outside computer science-specific domains.

RtC2 will include a carefully selected LIS Faculty cohort of seven that will redesign and then pilot their new tech/media courses at their institutions. Results of the pilot courses will then be synthesized, and course models will be disseminated nationally. Faculty and their students will provide input throughout the project to the project team through faculty documentation, regular virtual meetings, a survey, student products and other outreach mechanisms. An outside evaluator will also work with the project team to identify the impacts of project activities and outcomes. This input will provide content for the final synthesis and recommendations for scaling in other LIS institutions.

Working along with me, the RtC2 project team includes Dr. Mega Subramaniam, Associate Professor and Associate Director of the Information Policy and Access Center at the University of Maryland’s College of Information Studies; Linda Braun, Learning Consultant, LEO: Librarians and Educators Online; and Dr. Alan S. Inouye, Director, OITP. OITP Youth and Technology Fellow Christopher Harris will provide overall guidance throughout the project. Can you tell how excited this makes me?!

Curious? Read the RtC2 Summary.

Are you LIS faculty? You can

  • Read the RtC2 Call for Applications.
  • Attend an in-person information session at the 2017 ALISE conference on Wednesday, January 18 at 6:00pm (meet Dr. Subramaniam in the Sheraton Atlanta hotel lobby).
  • Attend a virtual information session on January 27, 2017 at noon EST via Adobe Connect. Please complete this form if you are interested in attending the information session or would like to receive a recording of the session.

Yes, there will be a Libraries Ready to Code Website, where all this and more will live. In the meantime, if you have questions, please contact me directly at

The post Equipping librarians to code: part 2 appeared first on District Dispatch.

The Keepers Registry: Ensuring the Future of the Digital Scholarly Record / Library of Congress: The Signal

Photo of a paper-strewn desk.

Humanités Numériques, on Wikimedia by Calvinius:

This is a guest post by Ted Westervelt, section head in the Library of Congress’s US Arts, Sciences & Humanities Division.

Strange as it now seems, it was not that long ago that scholarship was not digital. Writing a dissertation in the 1990s was done on a computer and took full advantage of the latest suite of word-processing tools available (that a graduate student could afford). And it certainly was a world away from the typewritten dissertations of the 1950s, requested at the university library and pored over in reading rooms.

Yet these were tools to create physical items not much different than those dissertations of the previous forty, fifty or a hundred years. That sense of completion and accomplishment came with the bound copy of the dissertation taken from the bookbinders or with the offprints of the article sent in the post by the journal publisher.

Now, instead of using digital tools to create a physical item, we create a digital item. From this digital item we might make a physical copy but that is no longer a necessary endpoint. Once we have created the digital item encompassing our scholarly work, it can be complete. Now when we talk of the scholarly record, we talk of the digital scholarly record, for they are almost entirely one and the same.

The advantages of this near-complete overlap are evident to anyone who works with scholarly works or with any creative works for that matter. The challenges, on the other hand, can be less immediately apparent, though they are not hidden too deeply.

A library gutted by fire.

“The library at Holland House in Kensington, London, extensively damaged by a Molotov ‘Breadbasket’ fire bomb.” On Flickr by Musgo Dumio_Momio.

The most immediate of these challenges relate to managing and preserving the digital scholarly record as we have done for the scholarly record for centuries and millennia (if we draw a curtain over the destruction of the Library of Alexandria). We have had those centuries to learn how to manage, keep and preserve the textual part of the scholarly record, to use it while also keeping it safe and usable for future generations (e.g. keep it away from fire).

With digital content, we do not have those centuries of knowledge; the sharp shift to digital creation and to a digital scholarly record has not come with a history of experiences in keeping that record safe and secure.

Which is not to say that preserving, protecting and ensuring the ongoing use and value of that digital scholarly record are hopeless dreams or that there is not a lot of work being dedicated to accomplishing these ends. This is a concern of anyone with an interest in scholarly works and in the scholarly record as a whole and, as any who delve into this at all know, productive work is being undertaken by a variety of groups using a variety of means in order to ensure its survival.

The Keepers Registry, based at the University of Edinburgh, is an important effort to preserve the digital scholarly record. The Keepers Registry brings together institutions and organizations that are committed to the preservation of electronic serials and enables those institutions to share titles, volumes and issues they have preserved.

In doing so, The Keepers Registry allows us to identify which parts of the digital scholarly record are being preserved, which institutions have taken on this responsibility and, just as important, which parts of the digital scholarly record are not being preserved and are therefore at higher risk of being lost.

There is a clear general benefit in sharing the names, missions and holdings of the institutions and organizations (those Keepers), thus committing themselves and their resources to the preservation of these parts of the digital scholarly record. But there is also a very clear benefit to those individual Keepers to better know their fellows who have similarly committed themselves to serve as Keepers of the digital scholarly record.

Since all are committed to the same end, the staff behind The Keepers Registry organized meetings of The Keepers and other similar organizations in Edinburgh in September 2015 and in Paris in June 2016. The organizations and institutions – and the individuals who represented them at these meetings – can and in some cases do meet and interact with each other in other forums. But the meetings arranged by The Keepers Registry allowed for a focus on the preservation of the digital scholarly record and how that can be accomplished collectively.

The preservation of the digital scholarly record cannot happen except through collaboration and cooperation. And none know this better than the individual Keepers and their colleagues at these meetings who have committed resources to the issue. The very existence of The Keepers Registry is an admission that the preservation of the scholarly record is larger than any one institution and in fact cannot be entrusted in any one institution alone.

But as much as this is known, the answer to how best to work, collaboratively and cooperatively, is less readily apparent.  We benefit from the opportunities to discuss this in person.

Images of interconnected human silhouettes.

CC0 Public Domain.

These meetings and discussions, while valuable, were not intended as an end unto themselves.  The meetings helped crystallize in the minds of the participants that, because of what they do and because of their participation in The Keepers Registry, they form a Keepers network.

As such, they have a shared commitment and a shared idea of how they can do for the digital scholarly record what they have managed for the scholarly record in centuries past: ensure its preservation and ongoing use.  This vision has been encapsulated in the joint statement that representatives from the Keepers issued this past August, “Ensuring the Future of the Digital Scholarly Record.”  It sets out a plan of engagement with other stakeholders –- especially publishers, research libraries and national libraries -– who also have a role in this mission.

To this end, the Keepers network welcomes any institution, organization or consortium that wishes to endorse the statement, as some have already.  And it encourages any stakeholders that wish to begin working with the Keepers network to let them know.

The Keepers network is committed to making all parties aware of their roles in the preservation of the digital scholarly record and have already begun reaching out to those stakeholders, such as at the Fall Meeting of the Coalition for Networked Information.  This is a shared need and a shared responsibility.  No one institution has to do it alone. We cannot succeed in preserving the digital scholarly record unless we do it together.

Registration Now Open for DPLAfest 2017 / DPLA

We’re pleased to announce that registration for DPLAfest 2017 — taking place on April 20-21 in Chicago, Illinois — has officially opened. We invite all those interested from public and research libraries, cultural organizations, the educational community, the creative community, publishers, the technology sector, and the general public to join us for conversation and community building as we celebrate our fourth annual DPLAfest.

The two-day event is open to all and advance registration is required. Registration for DPLAfest 2017 is $150 and includes access to all DPLAfest events including a reception on April 20. Coffee, refreshments, and a boxed lunch will be provided on April 20 and 21. Register today.


Participants collaborate at DPLAfest 2016. Photo by Jason Dixson.

DPLAfest 2017 will take place on April 20-21, 2017 in Chicago at Chicago Public Library’s Harold Washington Library Center. The hosts for DPLAfest 2017 include Chicago Public Library, the Black Metropolis Research Consortium, Chicago Collections, and the Reaching Across Illinois Library System (RAILS).


We are currently seeking session proposals for DPLAfest 2017. The deadline to submit a session proposal is January 17, 2017. Click here to review submission terms and submit a session proposal.

We will be posting a full set of activities and programming for DPLAfest 2017 in February. Until then, to review topics and themes from previous fests, check out the agendas from DPLAfest 2016 and 2015.

Travel and Logistics

Click here for logistical and travel information about DPLAfest and our host city, Chicago.


Should you have any questions, please do not hesitate to reach out to us. We look forward to seeing you in Chicago!

Register for DPLAfest 2017

Gresham's Law / David Rosenthal

Jeffrey Beall, who has done invaluable work identifying predatory publishers and garnered legal threats for his pains, reports that:
Hyderabad, India-based open-access publisher OMICS International is on a buying spree, snatching up legitimate scholarly journals and publishers, incorporating them into its mega-fleet of bogus, exploitative, and low-quality publications. ... OMICS International is on a mission to take over all of scholarly publishing. It is purchasing journals and publishers and incorporating them into its evil empire. Its strategy is to saturate scholarly publishing with its low-quality and poorly-managed journals, aiming to squeeze out and acquire legitimate publishers.
Below the fold, a look at how OMICS demonstrates the application of Gresham's Law to academic publishing.

Following John Bohannon's 2013 sting against predatory publishers with papers that were superficially credible, in 2014 Tom Spears wrote a paper "that absolutely shouldn’t be published by anyone, anywhere", submitted it to 18 journals, and had it accepted by 8 of them. None of them could even have understood the title:
“Acidity and aridity: Soil inorganic carbon storage exhibits complex relationship with low-pH soils and myeloablation followed by autologous PBSC infusion.”

Look more closely. The first half is about soil science. Then halfway through it switches to medical terms, myeloablation and PBSC infusion, which relate to treatment of cancer using stem cells.

The reason: I copied and pasted one phrase from a geology paper online, and the rest from a medical one, on hematology.

I wrote the whole paper that way, copying and pasting from soil, then blood, then soil again, and so on. There are a couple of graphs from a paper about Mars. They had squiggly lines and looked cool, so I threw them in.

Footnotes came largely from a paper on wine chemistry. The finished product is completely meaningless.

The university where I claim to work doesn’t exist. Nor do the Nepean Desert or my co-author. Software that catches plagiarism identified 67% of my paper as stolen (and that’s missing some). And geology and blood work don’t mix, even with my invention of seismic platelets.
Among the publishers recently acquired by OMICS were two previously legitimate Canadian companies. Tom Spears tried and scored again:
OMICS has publicly insisted it will maintain high standards. But now the company has published an unintelligible and heavily plagiarized piece of writing submitted by the Citizen to test its quality control. The paper is online today in the Journal of Clinical Research and Bioethics — not one of the original Canadian journals, but now jointly owned with them. And it’s awful. OMICS claims this paper passed peer review, and presents useful insights in philosophy, when clearly it is entirely fake.
Bryson Masse's Fake Science News Is Just As Bad As Fake News explains how Spears came to submit the paper:
This summer, OMICS reached out to Spears, who has previously demonstrated how to game the scientific publishing system, and now gets a lot of spam from journal publishers. This time, he decided he might have some fun with them.

“I'd sent test submissions to a couple of predators in the past and had kind of moved on, but then I got this request to write for what looked like a fake journal—of ethics,” Spears wrote me in an email. “Something about that attracted me so I just thought: Why not? And one morning in late August when I woke up early I made extra coffee and banged out some drivel and sent it to them.”
And voila, his minutes of toil paid off.

It got published without him paying. He did get invoiced, though, and the publisher was not afraid to haggle, said Spears.
At the New York Times, Kevin Carey's A Peek Inside the Strange World of Fake Academia reveals that OMICS is following the lead of less deplorable academic publishers who have observed that in the Internet era, conferences are less vulnerable to disintermediation than journals:
The caller ID on my office telephone said the number was from Las Vegas, but when I picked up the receiver I heard what sounded like a busy overseas call center in the background. The operator, “John,” asked if I would be interested in attending the 15th World Cardiology and Angiology Conference in Philadelphia next month.

“Do I have to be a doctor?” I said, because I’m not one. I got the call because 20 minutes earlier I had entered my phone number into a website run by a Hyderabad, India, company called OMICS International.

“You can have the student rate,” the man replied. With a 20 percent discount, it would be $599. The conference was in just a few weeks, I pointed out — would that be enough time for the academic paper I would be submitting to be properly reviewed? ... It would be approved on an “expedited basis” within 24 hours, he replied, and he asked which credit card I would like to use.

If it seems that I was about to be taken, that’s because I was. OMICS International is a leader in the growing business of academic publication fraud. It has created scores of “journals” that mimic the look and feel of traditional scholarly publications, but without the integrity. This year the Federal Trade Commission formally charged OMICS with “deceiving academics and researchers about the nature of its publications and hiding publication fees ranging from hundreds to thousands of dollars.”

OMICS is also in the less well-known business of what might be called conference fraud, which is what led to the call from John. Both schemes exploit a fundamental weakness of modern higher education: Academics need to publish in order to advance professionally, get better jobs or secure tenure. Even within the halls of respectable academia, the difference between legitimate and fake publications and conferences is far blurrier than scholars would like to admit.
Carey goes into considerable detail about OMICS and its competitors in the fraudulent conference business and concludes:
There are real, prestigious journals and conferences in higher education that enforce and defend the highest standards of scholarship. But there are also many more Ph.D.-holders than there is space in those publications, and those people are all in different ways subject to the “publish or perish” system of professional advancement. The academic journal-and-conference system is subject to no real outside oversight. Standards are whatever the scholars involved say they are.

So it’s not surprising that some academics have chosen to give one another permission to accumulate publication credits on their C.V.’s and spend some of the departmental travel budget on short holidays. Nor is it surprising that some canny operators have now realized that when standards are loose to begin with, there are healthy profits to be made in the gray areas of academe.
Carey's right, but that isn't the fundamental problem. Two years ago I wrote Stretching the "peer reviewed" brand until it snaps, responding to a then-current outbreak of concern about this issue:
These recent examples, while egregious, are merely a continuation of a trend publishers themselves started many years ago of stretching the "peer reviewed" brand by proliferating journals. If your role is to act as a gatekeeper for the literature database, you better be good at being a gatekeeper. Opening the gate so wide that anything can get published somewhere is not being a good gatekeeper.
Gresham's Law states "Bad money drives out good". The major publishers can hardly complain if others more enthusiastically follow their example by proliferating journals (and conferences) and lowering reviewing standards. Their value-added was supposed to be "peer review", but the trend they started has devalued it to the point where peer-reviewed science no longer influences public policy. It is true that industries such as tobacco and fossil fuels have funded decades-long campaigns pushing invented reasons to doubt published research. But at the same time academic publishers were providing real reasons for doing so.

LC Name Authority File Analysis: Where are the Commas? / Mark E. Phillips

This is the second in a series of blog posts on some analysis of the Name Authority File dataset from the Library of Congress. If you are interested in the setup of this work and a bit more background, take a look at the previous post.

The goal of this work is to better understand how personal and corporate names are formatted so that I can hopefully train a classifier to automatically sort a new name into either category.

In the last post we saw that commas seem to be important in differentiating between corporate and personal names.  Here is a graphic from the previous post.

Distribution of Commas in Name Strings

You can see that the majority of personal names (99%) have commas, while a much smaller share of corporate names (14%) have a comma present.

The next thing I was curious about is whether the placement of the comma in the name string reveals anything about the kind of name it is.

How Many?

The first thing to look at is just counting the number of commas per name string.  My initial thought is that there are going to be more commas in the Corporate Names than in the Personal Names.  Let’s take a look.

Name Type | Total Name Strings | Names With Comma | min | 25% | 50% | 75% | max | mean | std
Personal  | 6,362,262 | 6,280,219 | 1 | 1 | 1 | 2 | 8  | 1.309 | 0.471
Corporate | 1,499,459 | 213,580   | 1 | 1 | 1 | 1 | 11 | 1.123 | 0.389
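As a rough sketch (not the author's actual code; the function and sample names here are mine), the per-string comma counts behind a table like this could be tabulated as follows:

```python
# Count commas per name string and summarize, using only names
# that contain at least one comma (as in the table above).
import statistics

def comma_stats(names):
    counts = sorted(name.count(",") for name in names if "," in name)
    def pct(p):  # crude nearest-rank percentile
        return counts[min(len(counts) - 1, int(p * len(counts)))]
    return {
        "names_with_comma": len(counts),
        "min": counts[0],
        "25%": pct(0.25),
        "50%": pct(0.50),
        "75%": pct(0.75),
        "max": counts[-1],
        "mean": statistics.mean(counts),
        "std": statistics.stdev(counts),
    }

print(comma_stats(["Phillips, Mark Edward",
                   "Worldwide Documentaries, Inc.",
                   "University of North Texas"]))
```

In practice a library like pandas (`Series.describe()`) would produce the same summary columns directly.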

The overall statistics for the number of commas in the name strings indicate that there are more commas in the Personal Names than in the Corporate Names. The Corporate Name with the most commas, in this case eleven, is International Monetary Fund. Office of the Executive Director for Antigua and Barbuda, the Bahamas, Barbados, Belize, Canada, Dominica, Granada, Ireland, Jamaica, St. Kitts and Nevis, St. Lucia, and St. Vincent and the Grenadines; you can view the name record here.

The Personal Name with the most commas had eight of them: Seu constante leitor, hum homem nem alto, nem baixo, nem gordo, nem magro, nem corcunda, nem ultra-liberal, que assistio no Beco do Proposito, e mora hoje no Cosme-Velho. You can view the name record here.

I can figure out the Corporate Name but needed a little help with the Personal Name, so Google Translate to the rescue. From what I can tell it translates to His constant reader, a man neither tall, nor short, nor fat, nor thin, nor hunchback, nor ultra-liberal, who attended in the Alley of the Purpose, and lives today in Cosme-Velho, which I think is a pretty cool-sounding Personal Name.

I was surprised when I made a histogram of the values and saw that it was actually pretty common for Personal Names to have more than one comma. Very common, actually.

Number of Commas in Personal Names

And while there are instances of more overall commas in Corporate Names, you generally are only going to see one comma per string.

Number of Commas in Corporate Names

Which Half?

The next thing that I wanted to look at is the placement of the first comma in the name string.

The numbers below represent the stats for just the name strings that contain a comma. The values give the position of the first comma as a percentage of the overall number of characters in the name string.

Name Type | Names With Comma | min  | 25%   | 50%   | 75%   | max   | mean  | std
Personal  | 6,280,219 | 1.9% | 26.7% | 36.4% | 46.7% | 95.7% | 37.3% | 13.8%
Corporate | 213,580   | 2.2% | 60.5% | 76.9% | 83.3% | 99.0% | 69.6% | 19.3%
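The relative-placement metric in this table can be sketched in a few lines; the function name and examples are mine, not from the original scripts:

```python
# Position of the first comma as a percentage of the string's length.
def first_comma_placement(name):
    if "," not in name:
        return None
    return 100 * name.index(",") / len(name)

print(round(first_comma_placement("Phillips, Mark Edward"), 1))          # 38.1 -- first half
print(round(first_comma_placement("Worldwide Documentaries, Inc."), 1))  # 79.3 -- second half
```

These two toy values line up with the medians in the table: personal-name commas fall early in the string, corporate-name commas fall late.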

If we look at these as graphics we can see some trends a bit better.  Here is a histogram of the placement of the first comma in the Personal Name strings.

Comma Percentage Placement for Personal Name

It shows that the bulk of the names with a comma have that comma occurring in the first half (50%) of the string.

This looks a bit different with the Corporate Names as you can see below.

Comma Percentage Placement for Corporate Name

You will see that the placement of that first comma trends very strongly to the right side of the graph, definitely over 50%.

Let’s be Absolute

Next up I wanted to take a look at the absolute distance from the first comma to the first space character in the name string.

My thought is that a Personal Name is going to have a lower absolute distance overall than a Corporate Name. Two examples will hopefully help you see why.

For a Personal Name string like “Phillips, Mark Edward” the absolute distance from the first comma to the first space is going to be one.

For a Corporate Name string like “Worldwide Documentaries, Inc.” the absolute distances from the first comma to the first space is fourteen.
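Those two worked examples translate directly into code. This is my own sketch of the metric, not the script used for the analysis:

```python
# Absolute character distance between the first comma and the first space.
def comma_space_distance(name):
    if "," not in name or " " not in name:
        return None
    return abs(name.index(",") - name.index(" "))

print(comma_space_distance("Phillips, Mark Edward"))          # 1
print(comma_space_distance("Worldwide Documentaries, Inc."))  # 14
```

In a "Surname, Forename" string the first space immediately follows the first comma, which is why the personal-name distance is so often exactly one.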

I’ll jump right to the graphs here.  First is the histogram of the Personal Name strings.

Personal Name: Absolute Distance Between First Space and First Comma

You can see that the vast majority of the name strings have an absolute distance from the first comma to the first space of 1 (that’s the value for the really tall bar).

If you compare this to the Corporate Name strings in graph below you will see some differences.

Corporate Name: Absolute Distance Between First Space and First Comma

Compared to the Personal Names, the Corporate Name graph has quite a bit more variety in the values.  Most of the values are higher than one.

If you are interested in the data tables they can provide some additional information.

Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std
Personal  | 6,280,219 | 1 | 1  | 1  | 1  | 131 | 1.4  | 1.8
Corporate | 213,580   | 1 | 18 | 27 | 37 | 270 | 28.9 | 17.4

Absolute Tokens

This next section is very similar to the previous one, but this time I am interested in the placement of the first comma in relation to the first token in the string. I have a feeling it will be similar to the absolute first-space distance we saw above, but it should normalize the data a bit because we are dealing with tokens instead of characters.

Name Type | Names With Comma | min | 25% | 50% | 75% | max | mean | std
Personal  | 6,280,219 | 1 | 1 | 1 | 1 | 17 | 1.1 | 0.3
Corporate | 213,580   | 1 | 3 | 4 | 6 | 35 | 4.8 | 2.4
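My guess at how the token-based version of this metric works (a sketch; the original post doesn't show its code) is to count tokens up to and including the first one that contains a comma:

```python
# Distance, in tokens, from the first token to the token containing
# the first comma (1 means the comma is attached to the first token).
def comma_token_distance(name):
    for i, token in enumerate(name.split(), start=1):
        if "," in token:
            return i
    return None

print(comma_token_distance("Phillips, Mark Edward"))          # 1
print(comma_token_distance("Worldwide Documentaries, Inc."))  # 2
```

Under this definition the personal-name median of 1 and corporate-name median of 4 in the table above make intuitive sense: surnames are usually the first token.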

And now to round things out with graphs of both of the datasets for the absolute distance from first comma to first token.

Personal Name: Absolute Distance Between First Token and First Comma

Just as we saw in the section above the Personal Name strings will have commas that are placed right next to the first token in the string.

Corporate Name: Absolute Distance Between First Token and First Comma

The Corporate Names are a bit more distributed away from the first token.


Here are some observations I have now that I’ve spent a little more time with the LC Name Authority File while working on this post and the previous one.

First, it appears that the presence of a comma in a name string is a very good indicator that it is going to be a Personal Name. Second, if the first comma occurs in the first half of the name string it is most likely a Personal Name, and if it occurs in the second half of the string it is most likely a Corporate Name. Finally, the absolute distance from the first comma to either the first space or the first token is a good indicator of whether the string is a Personal Name or a Corporate Name.

If you have questions or comments about this post, please let me know via Twitter.

One Week Left! Apply to Present at DPLAfest 2017 / DPLA

DPLA is seeking session proposals for DPLAfest 2017, an annual conference that brings together librarians, archivists, and museum professionals, developers and technologists, publishers and authors, educators, and many others to celebrate DPLA and its community of creative professionals.

Proposals should be related to digital libraries, broadly defined, including topics at the intersection of digital libraries and social justice, copyright and rights management, public engagement, metadata, collaboration, and more. Learn more.

The deadline to submit a session proposal is Tuesday, January 17, 2017.

See you in Chicago!

View full Call for Proposals and Submission Form

#NoFilter: Social Media Content Ideas for Libraries / LITA

In my previous blog entry, I introduced the #NoFilter series, which will explore some of the challenges and concerns pertaining to social media and its use in the library. For this post, let’s consider a topic that can be simultaneously fun and perplexing: generating quality content for social media! Thoughtful, consistent, and varied content is one of the keys to cultivating a meaningful social media presence for a library, i.e., opening up channels of communication with patrons and encouraging enthusiasm for the library’s materials, services, and staff. Where does one look for social media content ideas? Keeping in mind that the intricacies of each platform necessitate different presentations of content, below are three suggestions for where those in charge of a library’s social media may find some inspiration.

damaged chemistry textbook with stain

Image accompanying a Tumblr post about the behind-the-scenes process of evaluating donations at the Othmer Library.

  • Behind-the-scenes – The day-to-day operations in a library may not seem like the most riveting subject matter for a social media post. However, in my experience, posts that feature behind-the-scenes work at the library often do very well. Think of it this way: isn’t it exciting when you get a sneak peek of what is to come or a look into processes with which you are not familiar? In terms of social media content, this could mean providing patrons with a photo of the library preparing to open, new acquisitions being processed, a book being repaired, a recent donation to the library still in boxes, a new addition being built, a new technology being installed, or a new fish tank being set up. For this type of content, consider consulting staff throughout the library such as those in technical services, collection development, or interlibrary loan. Not sure how a post about an ILL would look? Check out this great Instagram post from The Frick Collection.
  • Reference Questions – What questions have the library staff recently answered for patrons or for one another? What information was unearthed? What resources were consulted? What steps were taken to track down an answer? You may want to consider working with reference staff to compose social media posts that not only share the findings of research, but also the research process. Chances are that such information will be of interest to others. Additionally, this type of post highlights the expertise and talents of library staff. Individuals who may never have thought to consult your library before on such topics may find themselves reconsidering after seeing your post. One example is this “From the Othmer Library Reference Desk” post on my library’s Tumblr.
  • Events – Event-driven content is one of the most commonly employed types on institutional social media outlets. There is an event coming up (e.g., an open house, a movie night, a special guest lecturer, an edible book festival) and the library wants to get the word out about it. It’s not a guarantee of higher attendance at the actual event, but such a post, when written in a personable tone, does alert patrons to the fact that the library is a dynamic place, not just a repository of materials in varying formats. Taking this type of post one step further, a library’s social media manager may want to consider sharing stories that come about from the event. Did the library debut a new gadget at the event? Did a quote from the lecturer stand out? Did the cake you ordered for your National Library Week celebration arrive with the library’s name misspelled – e.g., the “Othmer Library of Chemical History” became the “Other Library of Chemical History”? The fun moments, the serious moments, the quirky moments – all can have a place on social media, all are demonstrations of what patrons can take away from participating in a library event.
    topsell theater of insects illustration

    The History of Four Footed Beasts and Serpents (1658) on display at the 2015 Othmer Library Open House. An iPad next to the book displays a GIF made from one of the book’s illustrations.

Whether you are new to social media or an established presence on a platform(s), I hope the above suggestions have provided some creative inspiration for your library’s future content.

Where do you look for social media content ideas? What types of content seem to do the best on your library’s social media? Share your thoughts in the comments below!

Fight for Email Privacy Act passage begins now . . . again / District Dispatch

It’s a pretty sure bet that, when James Madison penned the Fourth Amendment to assure the right of all Americans to be “secure in their persons, houses, papers, and effects against unreasonable searches and seizures,” he didn’t have protecting emails, texts, tweets and cloud-stored photo and other files in mind. Fortunately, Congress attempted to remedy that understandable omission 197 years later by passing the Electronic Communications Privacy Act (ECPA) to require that authorities obtain a search warrant, based on probable cause, to access the full content of such material. But given the difficulty, expense and thus unlikelihood of storing digital information for extended periods in 1986, ECPA’s protections were written to sunset 180 days after a communication had been created.

Portrait of James Madison wearing Google Glass

Credit: Mike McQuade

Thirty years later, in an age of essentially limitless, cheap storage – and the routine “warehousing” of our digital lives and materials – this anachronism has become a real, clear and present danger to Americans’ privacy. ALA, in concert with the many other members of the Digital Due Process coalition, has been pushing hard in every Congress since early 2010 to update ECPA for the digital age by requiring authorities to obtain a “warrant for content” for access to any electronic communications from the moment that they’re created. Last year, in the 114th Congress, we got tantalizingly close as the Email Privacy Act (H.R. 699) passed the House of Representatives unanimously (yup, you read that right) by a vote of 419 – 0: a margin unheard of for any bill not naming a post office or creating “National Remember a Day” day. Sadly, action on the bill then stalled in the Senate.

Undaunted, the bill’s diehard sponsors – Reps. Kevin Yoder (R-KS3) and Jared Polis (D–CO2) – have come roaring back in the first full week of the new, 115th Congress to reintroduce exactly the same version of the Email Privacy Act that passed the House last year without opposition. Look for similar action to get the ball rolling again in the Senate in the very near future and, not long after that, for a call to action to help convince Congress to heed ALA President Julie Todaro’s call for immediate action on this critical bill. As she put it in a January 9 statement:

ALA calls on both chambers of Congress to immediately enact H.R. 387 and send this uniquely bipartisan and long-overdue update of our laws to the President in time for him to mark Data Privacy Day on January 28, 2017, by signing it into law.

The post Fight for Email Privacy Act passage begins now . . . again appeared first on District Dispatch.

A scholar’s pool of tears, Part 2: The pre in preprint means not done yet / Karen G. Schneider

Half-baked

Note: for two more days, January 10 and 11, you (as in all of you) have free access to my article, To be real: Antecedents and consequences of sexual identity disclosure by academic library directors. Then it drops behind a paywall and sits there for a year.

When I wrote Part 1 of this blog post in late September, I had keen ambitions of concluding this two-part series by discussing “the intricacies of navigating the liminal world of OA that is not born OA; the OA advocacy happening in my world; and the implications of the publishing environment scholars now work in.”

Since then, the world, and my priorities have changed. My goals are to prevent nuclear winter and lead our library to its first significant building upgrades since it opened close to 20 years ago. But at some point I said on Twitter, in response to a conversation about posting preprints, that I would explain why I won’t post a preprint of To be real. And the answer is very simple: because what qualifies as a preprint for Elsevier is a draft of the final product that presents my writing before I incorporated significant stylistic guidance from the second reviewer, and that’s not a version of the article I want people to read.

In the pre-Elsevier draft, as noted before, my research is present, but it is overshadowed by clumsy style decisions that Reviewer 2 presented far more politely than the following summary suggests: quotations that were too brief; rushing into the next thought without adequately closing out the previous thought; failure to loop back to link the literature review to the discussion; overlooking a chance to address the underlying meaning of this research; and a boggy conclusion. A crucial piece of advice from Reviewer 2 was to use pseudonyms or labels to make the participants more real.

All of this advice led to a final product, the one I have chosen to show the world. That’s really all there is to it. It would be better for the world if my article were in an open access publication, but regardless of where it is published, I as the author choose to share what I know is my best work, not my work in progress.

The OA world–all sides of it, including those arguing against OA–has some loud, confident voices with plenty of “shoulds,” such as the guy (and so many loud OA voices are male) who on a discussion list excoriated an author who was selling self-published books on Amazon by saying “people who value open access should praise those scholars who do and scorn those scholars who don’t.” There’s an encouraging approach! Then there are the loud voices announcing the death of OA when a journal’s submissions drop, followed by the people who declare all repositories are Potemkin villages, and let’s not forget the fellow who curates a directory of predatory OA journals that is routinely cited as an example of what’s wrong with scholarly publishing.

I keep saying, the scholarly-industrial complex is broken. I’m beyond proud that the Council of Library Deans for the California State University–my 22 peers–voted to encourage and advocate for open access publishing in the CSU system. I’m also excited that my library has its first scholarly communications librarian who is going to bat on open access and open educational resources and all other things open–a position that in consultation with the library faculty I prioritized as our first hire in a series of retirement/moving-on faculty hires. But none of that translates to sharing work I consider unfinished.

We need to fix things in scholarly publishing and there is no easy, or single, path. And there are many other things happening in the world right now. I respect every author’s decision about what they will share with the world and when and how they will share it. As for my decision–you have it here.

Critical Librarianship in the Design of Libraries / LibUX

Sarah Houghton and Andy Woodworth announced Operation 451, a movement intentionally invoking Fahrenheit 451 as a

symbolic affirmation of our librarian values of knowledge, service of others, and free expression of ideas. [Operation 451] stands in direct opposition to the forces of intolerance and ignorance that seek to divide neighbors, communities, and the country.

Articles 4 and 5 of the Library Bill of Rights, as well as the First Amendment, make up the 4-5-1.

“Call it luck, fate, or serendipity, but we noticed that that individual numbers matched up with the fourth and fifth articles of the Library Bill of Rights and the First Amendment to the Constitution. These were the values that we want to promote.”

I want to be a part of this. This resonates with me.

I said as much in “The Election as a Design Problem”: at a moment defined by fake news and the echo chambers that blossom as a result of the user-experience zeitgeist, these — and the racist, sexist, other-ist fires they start — might be assuaged by that same ethos of deliberate design core to the UX boom.

Bear with me.

Fake news proliferates because of the more-useful-than-not algorithms that tailor our time online to our tastes, our friends, and family. We don’t complain about our connectivity — for example — to Facebook, nor do we complain about its uptime, because we massage the obvious kinks out of the web that interfere with access to and engagement with our feed.

These work for us – except when they don’t.

In Facebook’s case, the mechanisms to report bullshit already exist, sort of, but they’re not obvious. The interaction cost is high.

A hard-to-find reporting option in Facebook

It takes four steps just to get to this point.

And even though you’re one of the good guys, your experience is substantially better — because you’re either walking between meetings, or sitting in traffic, being a parent, or whatever — by scrolling and letting it slip by like a piece of trash in the stream.

Because the features to train the Facebook algorithm are designed poorly, scrolling is the straightest path back to the content you’re interested in. Even the hardiest fist-shakers don’t want to be on the job all the time.

But, now, fake news et al. actively impede your access to and engagement with Facebook, and the pain of reporting this bullshit is now for many becoming too great to even deal with the feed whatsoever. Fake news sucks for Facebook.

Approached as a design challenge, the answer seems to be to make these reporting tools painless to use.

The user experience is a net value, so negative features pull all ships down with the low tide.

What’s more, there is an opportunity for institutions that are positioned — either actively or by reputation — as intellectual and moral community cores (libraries) to exert greater if not just more obvious influence on the filters through which patrons access content.

We take for granted that these filters are wide open.

Librarianship, more than other disciplines, is wrapped up in deep worldviews about information freedom that lend themselves in practice to objectivity in the journalistic sense. But whereas I believe the commitment to objectivity in journalism was rooted — at one time, but no longer — in good business sense, our commitment is moral.

Even so, I am not sure objectivity is good for the success of libraries, either, although I didn’t really have the vocabulary to communicate this before reading Meredith Farkas’s column in American Libraries, “Never Neutral,” where she defines “critical librarianship”:

Critical librarianship supports the belief that, in our work as librarians, we should examine and fight attempts at social oppression. … Many librarians are thinking about how they can fight for social justice in their work, which raises the question of whether that work reflects the neutrality that has long been a value in our profession. Meredith Farkas

For me it’s been just out of earshot, although as Meredith mentions #critlib’s been edging into conversation around algorithmic bias and — in my own way — when I make the point about not blindly trying to serve all users but deliberately identifying and eschewing non-adopters.

I imply in the library interface that the best design decisions for libraries are those that get them out of the way, not as an end unto itself but to optimize the user experience so that libraries are primed to strategically exert control over that interface.

A flow chart showing how publishers, performers, and other parties want access to the audience libraries have, and the audience once access to the services libraries aggregate.

The library is the interface

When I usually talk about this, I tend to refer to controlling negotiations with vendors who want access to the audience libraries attract. We want to force vendors to commit to a user experience that suits the library mission. Users don’t discern between the services libraries control and those that libraries don’t, so it behooves libraries to wrest control and aggressively negotiate. We have the leverage, after all.

That said, these design decisions also position libraries to more deliberately influence the user experience in other ways – such as communicating moral or social values.

For gun-shy administrations, values do not have to be in policy or in the mission statement – although I suspect that’s better marketing than not. Service design communicates these values just as effectively.

Meredith again:

Librarians may not be able to change Google or Facebook, but we can educate our patrons and support the development of the critical-thinking skills they need to navigate an often-biased online world. We can empower our patrons when we help them critically evaluate information and teach them about bias in search engines, social media, and publishing.

This is the opportunity I mean for librarians to embrace.

Reframing the #Operation451 pledge as a design challenge to integrate critical librarianship

Let’s approach support for #Operation451 as a design challenge. We’ll start with participants’ pledge to

  • work towards increasing information access, especially for vulnerable populations;
  • establish your library as a place for everyone in the community, no matter who they are;
  • ensure and expand the right of free speech, particularly for minorities’ voices and opinions.

We can rally behind these philosophically but as problems to be solved they’re too big to approach practically.

First, to prime even stubborn obstacles for brainstorming ideas that can be tested, you can reword these as “how might we” notes. It’s a little workshop trick that shifts focus from how daunting the challenge is to how it can be solved.

  • How might we increase information access in general?
  • How might we increase information access for vulnerable populations?
  • How might we establish the library as a place for everyone?
  • How might we ensure the right of free speech?
  • How might we encourage patrons to share their voices and opinions?
  • How might we amplify our patrons’ voices and opinions?
  • How might we encourage and amplify the voices and opinions of minorities?

What we then want to do is identify as many of the ways these challenges manifest as we can.

These could just be bullet points, but you can step in the user’s shoes and phrase ideas as job stories. These are madlib-style statements meant to approach motivations as actionable problems. “When _____, I want to _____, so that _____.”

And what may seem counterintuitive is that unlike a user story — “As a _____, I want to _____, so that _____” — the job story shifts the focus away from the persona to tasks that are independent of demographic.

A list of job stories in the spirit of #Operation451

I think this could be a living list that can serve to inspire design solutions that are intended to bake-in the spirit of critical librarianship into our practice. I would love your help.

You can contribute by either leaving a comment below, or triple hashtagging a job story #libux #op451 #critlib on twitter.

  • When I don’t have a stable place to live, I want to be able to get a library card, so that I can still take advantage of library services.
  • When I am concerned about privacy and my personal information, I want to get a library card without feeling like I’m giving too much information, so that I can take advantage of library services.
  • When I can’t pay library fines, I want to continue to be able to use the library services, so that I can get what I need done.
  • When I owe library fines, I want to use the library services without feeling like I’m in trouble, so that I still feel welcome. Okay, this could use some wordsmithing
  • When I see fake news or fake information in the library collection, I want to report it, so that the item gets reevaluated for its place in the collection.
  • When I am looking for non-fiction in the library collection, I want to see that content is untrustworthy or questionable, so that I can make an informed choice.
  • When I am looking for non-fiction in the library collection, I want to be able to filter results by most factual, so that I can make an informed choice.

This is just a start.