Planet Code4Lib

Going on a bike tour in Quebec with my kids. What will follow for Lila. / John Miedema

I’m going on a bike tour in Quebec with my kids. That and other priorities mean I need to shift
my focus away from Lila for a bit. But here’s what’s coming up when I get back.

  1. Convert unread content into notes. Lila will take unread articles and books and convert them into slips for embedded reading. I may have been over-thinking the technology required here. After all, most unread content is already organized into slip-sized shapes: short units of thought, i.e., paragraphs.
  2. Compute association between notes. I have done a rough cut at calculating association between slips, based on keyword queries. Keyword queries are not enough: a cognitive technology should operate more on the level of questions, i.e., something that digs into meaning. I have been thinking about other methods of computing association based on word properties, most recently topic analysis and statistical clustering. I need to dive into this latter approach and pick the best method.
  3. Demonstrate the use of concreteness to order notes. I believe the use of word concreteness could become the most interesting feature of Lila. I did some tests back in December before I started blogging. Fascinating stuff. I will post some demonstrations.
  4. Draft the user interface.  I have been cutting a few drafts of Lila’s user interface, but I am stuck a bit as I wrestle with item 2.
  5. Post an updated and unified version of the solution architecture. I have been blogging my way through a solution architecture since January, but this process has been as much discovery for me as articulation for interested readers. Blogs are helpful that way. I have been reconsidering several things along the way, trimming here, extending there. Once I complete items 1-4, I will likely blow away everything I have written so far and post an updated and unified version of the solution architecture.

When this is complete, I expect I will choose one piece of Lila to code, likely item 2. I am also considering a deep dive into Digital Humanities research. Stick around.
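As a concrete illustration of item 2, here is a minimal sketch of one candidate approach, TF-IDF weighting with cosine similarity. The slip texts and the choice of method are my own illustrative assumptions, not part of Lila’s design:

```python
# A minimal sketch of computing association between slips with TF-IDF
# weighting and cosine similarity -- one candidate among the approaches
# mentioned above (keyword queries, topic analysis, clustering).
# Pure stdlib; the slip texts are invented examples.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} sparse vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for tokens in tokenized:
        df.update(set(tokens))
    n = len(docs)
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    nu, nv = norm(u), norm(v)
    return dot / (nu * nv) if nu and nv else 0.0

slips = [
    "reading notes on word concreteness and meaning",
    "notes on concreteness used to order slips",
    "bike tour planning for quebec",
]
vecs = tfidf_vectors(slips)
# The two concreteness slips associate more strongly with each other
# than either does with the bike-tour slip.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Topic analysis and statistical clustering would replace the raw term vectors with something more meaning-aware, but the pairwise-association scaffolding stays the same.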

Fuck Documents / Alf Eaton, Alf

This post from just over a year ago is the clearest statement I’ve seen of Google’s intentions for Blink, the browser rendering engine which Google forked from WebKit and which powers Chrome, the “web browser”/“application platform”. The intense focus is on performance of Blink as a platform for mobile applications, and not at all on document rendering features.

But really, who needs documents anyway?

The web is no longer a desktop publishing platform, it’s most often a networked medium for machine-machine communication. Hardly anyone writes English (though a lot of people, and some machines, can read it to some extent). Data is best communicated as objects, tables, graphs, games, worlds. Processes are best communicated as scripts.

All the old “features” that came part and parcel with printed documents are relics of an age where information was fixed in stone. In a hyperlinked network with history of where code/data/information was forked/included from, there’s no need to explicitly cite previous work. Data analysis is cool, and infographics are viral (particularly when they’re slightly inaccurate), but the raw data is enough, and machines/people can (literally) draw their own conclusions. Authorship is immaterial (jk, partly), and when is anything ever authored by a single person, anyway?

So: no more making points without data, no more explanation without exploration, no more statements without reproducibility: no more documents.

Bookmarks for April 18, 2015 / Nicole Engard

Today I found the following resources and bookmarked them on Delicious.

    WASSAIL is a database-driven, web-based application employing PHP, MySQL, and Javascript/AJAX technologies. It was created to manage question and response data from the Augustana Library’s library instruction sessions, pre- and post-tests from credit-bearing information literacy (IL) courses, and user surveys. It has now expanded beyond its original function and is being used to manage question and response data from a variety of settings. Its most powerful feature is the ability to generate sophisticated customized reports.

Digest powered by RSS Digest

The post Bookmarks for April 18, 2015 appeared first on What I Learned Today....

What's New with the CrossRef Staff / CrossRef

Lisa Hart, CrossRef's Director of Finance and Operations, celebrated her 15-year CrossRef anniversary on April 1! She was also appointed to the American Society of Association Executives (ASAE) Finance & Business Council for a one-year term, and serves on ORCID's audit committee.

Other anniversaries celebrated in April include: Chuck Koscher, CrossRef's Director of Technology, celebrating his 13th year; Patricia Feeney, Product Support Manager, celebrating her 8th year; and Amy Bosworth, Accounts Receivable Administrator, celebrating her 3rd year at CrossRef.

Additionally, Lindsay Russell, CrossRef's Payroll and Benefits Coordinator, will be recognized at the New England Society of Association Executives (NESAE) annual meeting as a Rising Professional, having completed its Rising Professional program.

Lastly, Susan Collins, CrossRef's Member Services Coordinator is running in the Boston Marathon this Monday, April 20th. We wish her the best of luck.

Congratulations to all and go Susan!

CCSC South Central: Teaching Open Source / Nicole Engard

Today I had the pleasure of attending a talk by Gina Likins from Red Hat at the 2015 Consortium for Computing Sciences in Colleges (CCSC): South Central conference about teaching open source.

Gina started by asking the audience how many people in the room taught open source already – and no one raised their hands!! That means Gina had to start with the background of what open source is. Gina says open source is a cookie (yum). When you bake a cookie you can share the cookies and the recipe with your friends and family. If one of the people you share with is allergic to nuts and the recipe calls for nuts then that person can alter the recipe to make it so that it doesn’t kill them. There is also the potential for people to take the recipe and improve upon it. Now – of course you can go to the store and buy some cookies – but you don’t really know what’s in them and you can’t replicate them. You can try … but then you get sued for sharing those proprietary cookies.

Another open source example – you wouldn’t buy a car with the hood welded shut … so why do we buy proprietary software? If you can’t see what’s going on and see what’s happening under the hood then you’re stuck with the car exactly the way it is and that might not be so great. While some people are fine with that, computer geeks shouldn’t be. We should want to get in there and tinker with it.

Next, some legal terminology. It’s important to understand copyright. Gina shared a pretty flower drawing, and now that it’s up there it’s copyrighted (so of course she added a Creative Commons license to the picture).

So what’s the difference between open source and free and open source software? The difference is that the free licenses always require that you share what you do under the free license. So if you’re under an open source license and you make a modification you can change the license. If you’re under a free license you don’t have that option – the license must stay the same.

So now, a bit of history, because it’s important to know where the magic comes from. In the 50s software and hardware came bundled together. In the 60s that changed because the DOJ thought that bundling hardware and software was monopolistic. In 1983 Richard Stallman launched the GNU project, which was the beginning of the open source movement as a thing. In 1989 the first GPL was released. This history and intro is the minimum that every computer science graduate should know! Especially the licensing part, because students need to know what rights people have to use their software.

90% of Fortune 500 companies use open source software!! The governments of all 50 states use open source software. 76% of today’s developers use open source. Students need to know about this so that they’re ready when they’re looking for a job. By learning open source you learn to code from others by working in a virtual team and collaborating. By working on an open source project you learn how to learn – because no one is there to sit down and teach you, you have to learn a lot yourself – this is how students learn how to problem solve, ask smart questions and read documentation.

By teaching open source and using open source, students get to work on real code, fork that code, and talk about why that was a good/bad idea. As a side note – I personally don’t remember any of the programs I wrote in my computer science classes – none of them had any benefit to me or were saved for me to go back and look at. Students working on open source get to know that they’re working on real code being used by real people. If you’re looking for a project, take a look at Humanitarian Free and Open Source Software (HFOSS) projects because these attract a more diverse audience – this is a great way to get more women in your classes.

Working on a project is an important skill to teach students because you’re never ever going to work alone in the real world. Furthermore, the likelihood that you’ll be writing your own code from scratch is very tiny!! Usually you’ll have to add to a project that exists already – and learning how to communicate with other developers is key for this. Working on open source will also allow students to make actual industry connections that they can use when it’s time to find a job. It’s a way for students to prove themselves!

Given all that – how do we differentiate open source from proprietary software? We already talked about licenses, but there are other things to know about. First is the open source principles and second is the community!

The principles include:

  • Open exchange: Communication is transparent
  • Participation: When we are free to collaborate, we create
  • Rapid prototyping: Can lead to rapid failures, but that leads to better solutions
  • Meritocracy: The best ideas win
  • Community: Together, we can do more

All that sounds awesome right? Well, there are some ‘gotchas’.

First off, as academics you’re used to knowing everything about the thing you’re teaching. Open source projects are scary because you’re not going to know them inside and out. There’s an opportunity here, though: by putting yourself in this role you teach your students that it’s okay to not know everything, and you show them how to ask the right questions and learn how to learn. This is how we grow – even if our code isn’t accepted, you grow. Learning that will make it so that students can learn any system.

Next you’ll be a stranger in a strange land. There is no manager or single person in charge – it’s a bit of the wild wild west. This is not an environment you can control – you will be a guest. It won’t be like stepping in to a classroom and saying this is what we’re doing today.

Open source can occasionally be aggressive. With freedom and transparency comes opinions – and sometimes those opinions are not expressed politely. If there were an HR department in open source then some of these things wouldn’t happen – but that’s not how open source works. It’s the Internet – it’s all open and anyone can say anything. The good thing is that instructors are helping students in these situations – hopefully to tell them what is proper etiquette and what isn’t. Hopefully teaching open source in schools will prevent some of this. Learn more about etiquette in open source projects from Gina’s ApacheCon Keynote.

Even with all that it’s extremely important!! Students need to learn what open source is and how to contribute.

Quote from Gina: “It’s amazing how wonderful scary things can be”

How do we move forward? Check out POSSE, which teaches professors what they need to know so they can teach open source in their classes. You can also look at and sign up for the mailing list. Finally, be sure to look into OpenHatch, which provides tools for building your curriculum and/or learning what open source is like.

The post CCSC South Central: Teaching Open Source appeared first on What I Learned Today....

The Digital Public Library of America Announces New Partnerships, Initiatives, and Milestones at DPLAfest 2015 / DPLA

Indianapolis, IN — On the second anniversary of the Digital Public Library of America’s launch, DPLA announced a number of new partnerships, initiatives, and milestones that highlight its rapid growth, and prepare it to have an even larger impact in the years ahead. At DPLAfest 2015 in Indianapolis, hundreds of people from DPLA’s expanding community gathered to discuss DPLA’s present and future. Announcements included:

Content milestones

Over 10 Million Items from 1,600 Contributing Institutions

On the second anniversary of its launch, the Digital Public Library of America surpassed a remarkable 10,000,000 items in its aggregated collection of openly available books, photographs, maps, artworks, manuscripts, audio, video, and material culture. This represents a quadrupling of the original collection at launch, which stood at 2.4 million items.

DPLA now has 1,600 contributing institutions from across the country, including libraries, archives, museums, and cultural heritage sites. Included within this wide-ranging collaboration are small rural public libraries and historical societies, large universities and community colleges, federal, state, and local government agencies, corporations, independent collections, and many more organizations of all stripes. In April 2013 there were just 500 contributing institutions.

This tremendous growth can be attributed, in part, to existing partners whose collections are newly available this week, including the Empire State Digital Network and the California Digital Library. Minnesota Digital Library, a partner since DPLA’s inception, is making available nearly half a million new records, an incredible 900% increase in just the past few months.

New Hub Partnerships

With Indiana’s bicentennial coming up in 2016, DPLA is delighted to announce that close to 50,000 items from Indiana Memory were added to DPLA’s collection in the last week, including postcards, photographs, and other unique and compelling documents from Indiana’s rich history.

Joining Indiana as newly covered states in early 2015 are Tennessee, Maine, and Maryland, which are forming Service Hubs for collections in their states. DPLA expects to have new content from those states, as well as ongoing contributions from our many other states, in the coming months. In addition, DPLA added the Digital Library of the Caribbean as a hub partner, which will be contributing a vast array of materials from that region.

DPLA now has 15 Service Hubs, covering 19 states. Recent grants from the National Endowment for the Humanities and the Institute of Museum and Library Services are targeted toward coverage of additional states in a succession of application phases that has already begun.


PBS-DPLA Partnership

The Digital Public Library of America and PBS LearningMedia are excited to announce today a major collaboration, bringing together the complementary strengths, networks, and content of our two nationwide organizations to better serve teachers, students, and the general public. By interweaving PBS’s unparalleled media resources and connections to the world of education and lifelong learning with DPLA’s vast and growing storehouse of openly available materials and community of librarians, archivists, and curators, the partners hope to make rich new resources accessible and discoverable for all.

In support of our respective organizations’ mutual interests in education, DPLA and PBS plan to work together to bring the high-quality DPLA digital content to as many teachers and students as possible. In the future, PBS and DPLA will explore additional, related ideas, such as professional development resources for teachers, the possible inclusion of PBS media within DPLA, and fostering local relationships between PBS’s affiliates and DPLA’s state-based service hubs.

Learning Registry Collaboration

Beginning today, the Digital Public Library of America’s exhibitions will be discoverable through the Learning Registry, which distributes top educational resources to states and schools around the country. The U.S. Departments of Education and Defense launched the Learning Registry in 2011 as an open source community and technology designed to improve the quality and availability of learning resources in education. By connecting DPLA’s metadata with the Learning Registry’s digital platform, schools, teachers, and students will more easily find the rich and open resources within DPLA’s collections. Since they cover major themes in American history and culture, DPLA’s exhibitions are already widely used in education, and this partnership will ensure an even broader audience for them, and set the stage for other DPLA resources to be more widely discoverable in the future.

Ebooks-related announcements and remarks

Sloan Foundation-funded Work on Ebooks

DPLAfest marks a key moment in the Digital Public Library of America’s work on improving access to ebooks. With generous funding from the Alfred P. Sloan Foundation, dozens of librarians, authors, publishers, and readers are gathered in Indianapolis to discuss the current landscape of this complex challenge. Our goals are to identify community leaders, scalable infrastructure, and avenues of participation that have the potential to transform libraries’ and librarians’ contributions and roles. The expectation is that we will come away with the framework for a possible demonstration effort, as well as a means to more closely unite strong contributors in the space toward a common goal.

Collaboration with HathiTrust for Open Ebooks

In a related development, the Digital Public Library of America and HathiTrust plan to highlight how they will work together to help exciting new initiatives that open up access to books. The Humanities Open Book grant program, a joint initiative of the National Endowment for the Humanities and the Andrew W. Mellon Foundation, for instance, will award grants to publishers to identify select previously published books and acquire the appropriate rights to produce an open access ebook edition available under a Creative Commons license. Participants in the program must deposit an EPUB version of the book in a trusted preservation service to ensure future access. DPLA and HathiTrust are well-prepared to accept these books and provide a wider distribution point for them.


New Board Chair and New Board Member Announced

In advance of DPLAfest 2015, the Board of Directors of the Digital Public Library of America announced the appointment of Amy Ryan, President of the Boston Public Library, as its next chair, effective for two years. Ryan will succeed the current chair, John Palfrey, Head of School at Phillips Academy in Andover, Massachusetts. Palfrey has been a central figure in DPLA’s history, from his co-leadership of the Secretariat during DPLA’s planning phase, and subsequently as founding chair of the DPLA Board.

Ryan has over 35 years of public library experience. Before being named to lead the Boston Public Library, she was the director of the nationally recognized Hennepin County Library in Minnesota, and prior to that Ryan served in leadership positions for over 28 years with Minneapolis Public Library.

In addition, at DPLAfest the Board announced the appointment of Jennifer 8. Lee as a new member, effective July 2015. A former New York Times reporter, Jennifer 8. Lee is an author, journalist and digital media entrepreneur. She is the co-founder and CEO of Plympton, a publisher of serialized fiction on digital platforms. Lee is the author of the New York Times-bestselling book, The Fortune Cookie Chronicles, and serves on the boards of the Nieman Foundation, the Center for Public Integrity, the Asian American Writers’ Workshop, Hacks/Hackers, Awesome Foundation and the Robert F. Kennedy journalism awards. She is a member of the New York Public Library Young Lions Committee. Jenny graduated with a degree in Applied Math and Economics from Harvard, where she was vice president of The Harvard Crimson.

Lee will be replacing Cathy Casserly on the board. Like Palfrey, Casserly has been enormously helpful to DPLA in its inception and growth as an organization. Her unparalleled experience with Creative Commons and Open Educational Resources, and her keen sense of nonprofit management, have been a boon to the young organization.



The Digital Public Library of America (DPLA), Stanford University, and the DuraSpace organization announced this week that their collaboration has been awarded a $2 million National Leadership Grant from the Institute of Museum and Library Services (IMLS). Nicknamed Hydra-in-a-Box, the project aims to foster a new national library network through a community-based repository system, enabling discovery, interoperability, and reuse of digital resources by people from this country and around the world.

The partners will engage with libraries, archives, and museums nationwide, especially current and prospective DPLA hubs and the Hydra community, to systematically capture the needs for a next-generation, open source, digital repository. They will collaboratively extend the existing Hydra project codebase to build, bundle, and promote a feature-complete, robust digital repository that is easy to install, configure, and maintain—in short, a next-generation digital repository that will work for institutions large and small, and is capable of running as a hosted service. Finally, starting with DPLA’s own metadata aggregation services, the partners will work to ensure that these repositories have the necessary affordances to support networked aggregation, discovery, management and access to these resources, producing a shared, sustainable, nationwide platform.

For more information, please see the full press release.

DPLA Becomes an Official Hydra Project Partner

In concert with the Hydra-in-a-box project, the Digital Public Library of America became an official Hydra Project partner. Hydra is a repository solution that is being used by institutions worldwide to provide access to their digital content. A large, multi-institutional collaboration,  the project gives like-minded institutions a mechanism to combine their individual repository development efforts into a collective solution with breadth and depth that exceeds the capacity of any individual institution to create, maintain or enhance on its own. The motto of the project partners is “if you want to go fast, go alone. If you want to go far, go together.” Hydra is open source, and enables advanced, modern, flexible user and administrative interfaces. Mark A. Matienzo, DPLA’s Director of Technology, notes that “by becoming a Hydra partner, DPLA is expressing its commitment to contributing and furthering a vibrant open source community.” The Hydra project has over 25 partners, including academic libraries, public libraries, and non-profit organizations.


The Digital Public Library of America wishes to thank its generous DPLAfest Sponsors:

  • The Alfred P. Sloan Foundation
  • Anonymous Donor
  • Bibliolabs
  • OCLC
  • Digital Library Federation
  • Digital Library Systems Group at Image Access

DPLA also wishes to thank its gracious hosts for DPLAfest 2015:

  • Indianapolis Public Library
  • Indiana State Library
  • Indiana Historical Society
  • IUPUI University Library

unicode normalization in ruby 2.2 / Jonathan Rochkind

Ruby 2.2 finally introduces a #unicode_normalize method on strings. Defaults to :nfc, but you can also normalize to other unicode normalization forms such as :nfd, :nfkc, and :nfkd.


Unicode normalization is something you often have to do when dealing with unicode, whether you knew it or not. Prior to ruby 2.2, you had to install a third-party gem to do this, adding another gem dependency. Of the gems available, some monkey-patched String in ways I wouldn’t have preferred, some worked only on MRI and not jruby, some had unpleasant performance characteristics, etc. Here are some benchmarks I ran a while ago on available gems giving unicode normalization and performance, although since I did those benchmarks new options appeared and performance characteristics changed. But now we don’t need to deal with it, just use the stdlib.

One thing I can’t explain is that the only ruby stdlib documentation I can find on this suggests the method should be called just `normalize`. But nope, it’s actually `unicode_normalize`. Okay. Can anyone explain what’s going on here?

`unicode_normalized?` (not just `normalized?`) is also available, also taking a normalization form argument.
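A quick demonstration of both methods; the string literals are just illustrative:

```ruby
# "é" can be a single codepoint (NFC) or 'e' plus a combining accent
# (NFD); the two forms compare as unequal until normalized.
# Ruby >= 2.2, no gems needed.
decomposed = "e\u0301"                      # 'e' + COMBINING ACUTE ACCENT
composed   = decomposed.unicode_normalize(:nfc)

puts decomposed.length                      # 2
puts composed.length                        # 1
puts composed == decomposed                 # false -- different bytes
puts decomposed.unicode_normalized?(:nfd)   # true
puts composed.unicode_normalize(:nfd) == decomposed  # true
```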

The next major release of Rails, Rails 5, is planned to require ruby 2.2. I think a lot of other open source will follow that lead. I’m considering switching some of my projects over to require ruby 2.2 as well, to take advantage of some of the new stdlib like this. Although I’d probably wait until JRuby 9k comes out, planned to support the 2.2 stdlib and other changes. Hopefully soon. In the meantime, I might write some code that uses #unicode_normalize when it’s present, and otherwise monkey-patches in a #unicode_normalize method implemented with some other gem, although that still requires making the other gem a dependency. I’ll admit there are some projects of mine that really should be unicode normalizing in some places, but could barely get away without it, and skipped it because I didn’t want to deal with the dependency. Or I could require MRI 2.2 or the latest jruby, and just monkey-patch a simple pure-java #unicode_normalize if JRuby and not String.instance_methods.include? :unicode_normalize.
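That conditional monkey-patch might look something like this; the fallback body is a stub, since the real implementation would come from whichever gem or Java class you pick:

```ruby
# Feature-detect the stdlib method and only patch when it's missing
# (i.e. on rubies older than 2.2). The fallback here is a stub; real
# code would delegate to a gem, or to java.text.Normalizer on JRuby.
unless String.instance_methods.include?(:unicode_normalize)
  class String
    def unicode_normalize(form = :nfc)
      raise NotImplementedError, "no unicode normalization backend on ruby < 2.2"
    end
  end
end

puts "caf\u00e9".unicode_normalize(:nfd).length  # 5 on ruby >= 2.2
```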

Filed under: General

Client-side XML validation in JavaScript / Alf Eaton, Alf


Emscripten comes with its own SDK, which bundles the specific versions of clang and node that it needs.

Install the Emscripten SDK and follow the instructions for setting it up.

Run ./ to set the PATH variable (you need to do this each time you want to use Emscripten).


xml.js is an Emscripten port of libxml2’s xmllint command, making it usable in a web browser.

Clone xml.js (and set up the submodules, if not done automatically).

Run npm install to install gulp.

Compile xmllint.js:

gulp clean
gulp libxml2 # compile libxml2
gulp compile # compile xmllint.js

Start a web server in the xml.js directory and open test/test.html to test it.

Importing multiple schema files

I’ve made a fork of xml.js which a) allows all the command-line arguments to be specified, so it can be used for validating against a DTD rather than an XML schema, and b) allows a list of files to be specified, which are imported into the pseudo-filespace so that xmllint can access them. This makes running xmllint in the browser much more like running xmllint on the command line.

There is one caveat, which is that this version of xmllint still seems to try to fetch the DTD from the URL in the XML’s doctype declaration rather than that specified with the --dtdvalid argument, so the doctype needs to be edited to match the local file path to the DTD.
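For reference, the native command-line invocation that the browser port mirrors looks like this; the file names and the tiny DTD are invented for the example:

```shell
# Validate an XML document against a local DTD with native xmllint,
# the behaviour the forked xml.js reproduces in the browser.
cat > note.dtd <<'EOF'
<!ELEMENT note (#PCDATA)>
EOF

cat > note.xml <<'EOF'
<?xml version="1.0"?>
<note>hello</note>
EOF

# --noout suppresses the parsed tree; the exit status signals validity.
xmllint --noout --dtdvalid note.dtd note.xml && echo valid
```

Note that with --dtdvalid the document itself needs no doctype declaration, which sidesteps the caveat above about xmllint fetching the DTD from the doctype URL.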

New CrossRef Members / CrossRef

Updated April 13, 2015

Voting Members
Agricultural Faculty
Aletheia - Associacao Cientifica e Cultural
Bowen Publishing Company
EDICE Programme
Indonesian Economist Association
International Neuroscience Institute
KIMS Foundation and Research Center
NPP Polis (Political Studies)
SciELO Paraguay
W.E. Upjohn Institute for Employment Research
Wydawnictwo Uniwersytetu Marii Curie-Sklodowskiej w Lublinie

Represented Members
Association Culturelle Franco-Coreenne
Journal of Istanbul Faculty of Medicine
Journal of Natural Sciences
Kesit Akademi
Korean Association for Political Economy
Kyung Hee University Management Research Institute
Russian Ilizarov Scientific Centre Restorative Traumatology and Orthopaedics
The Institute for Legal Studies
The Institute for Northeast Asia Research
The Institute of the History of Christianity in Korea
The Korean Association for Saramdaum Education
The Korean Society for Early Childhood Education

Last Update April 6, 2015

Voting Members
Academic and Educational Forum on International Relations
Alexander Graham Bell Association for the Deaf and Hard of Hearing
American Academy of Insurance Medicine
Australasian Association for Information and Communication Technology
Austrian Statistical Society
Cancer Research Frontiers
Evolve Publishing
Faculdades Catolicas
Friends Science Publishers
Ginekologia Polska
Indonesian Society Fisheries Product Processing
Journal of Experimental and Agricultural Sciences
Orthopaedic Section, APTA, Inc.
Penerbit Universiti Kebangsaan Malaysia (UKM Press)
Prompt Scientific Publishing
STEM Fellowship
Tobacco Regulatory Science Group
Universidade da Coruna
University of Dubrovnik

Represented Members
Adiyaman Universitesi Egitim Bilimleri Dergisi
Aufklarung Journal of Philosophy
Bulletin of Legal Medicine
CBCD Colegio Brasileiro de Cirugia Digestiva
International Cardiovascular Forum Journal
International Journal of Academic Research in Education
Journal of Computer and Education Research
Korea Research Academy of Distribution and Management
Korean Association for Psychodrama and Sociodrama
Korean Society for Environmental Education
Korean Society for Medical Mycology
Korean Society of Mechanical Technology
PE Polunina Elizareta Gennadievna
Scientia Primar
Society for Korea Classical Chinese Education
The Center for Social Welfare Research Yonsei University
The English Language Linguistics Society of Korea
The Korean Liver Cancer Study Group
The Korean Society for German History
The Korean Society of Christian Religious Education
The Phonology-Morphology Circle of Korea
The Study of History Education
Universidade Estadual Paulista - Campus de Tupa

CrossRef Indicators / CrossRef

Updated April 13, 2015

Total no. participating publishers & societies 6087
Total no. voting members 3308
% of non-profit publishers 57%
Total no. participating libraries 1943
No. journals covered 38,609
No. DOIs registered to date 73,195,928
No. DOIs deposited in previous month 483,190
No. DOIs retrieved (matched references) in previous month 59,784,568
DOI resolutions (end-user clicks) in previous month 124,765,975

What's New with the CrossRef Staff / CrossRef

Congratulations to Ed Pentz who celebrated 15 years in January as Executive Director of CrossRef.

In addition to Ed, Applications Developer Jon Stark celebrated 11 years, and Susan Collins, our Member Services Coordinator, her 7th anniversary.

Paula Dwyer, our Controller, and Vaishali Patel, our Technical Support Analyst, both celebrate 4 years at CrossRef.

Chris Cocci, our Staff Accountant, Amy Kelley, our Operations Administrator, and Penny Martin, our Part-Time UK Office Manager, have all been with us for 1 year.

Congratulations to all!

CrossRef staff at upcoming conferences / CrossRef

CrossRef International Workshop, April 29, Shanghai, China - Ed Pentz and Pippa Smart presenting.

CSE 2015 Annual Meeting, May 15-18, Philadelphia, PA - CrossCheck User Group Meeting and Breakfast, Rachael Lammey and Chuck Koscher presenting.

MLA '15 "Librarians Without Limits", May 15-20, Austin, TX. Exhibiting at booth number 234.

2015 SSP 37th Annual Meeting, May 27-29, Arlington, VA. Exhibiting at table 6.

CrossRef European Workshop, June 11, Vilnius, Lithuania - Ed Pentz and Pippa Smart presenting.

PKP Scholarly Publishing Conference 2015, August 11-14, Vancouver, BC - Karl Ward attending.

ISMTE 8th Annual North American Conference, August 20-21, Baltimore, MD - Rachael Lammey presenting.

ALPSP Conference, 9-11 September, London, UK - CrossRef staff attending.

Link roundup April 17, 2015 / Harvard Library Innovation Lab

It’s spring! Sit at the picnic table and read some rounded up links.

The best icon is a text label

Making Furniture by Molding Growing Trees Into Chairs, Tables, and More | Mental Floss

Fantasy Frontbench – giving the public a way to compare politicians / Open Knowledge Foundation

This is a guest blog post by Matt Smith, who is a learning technologist at UCL. He is interested in how technology can be used to empower communities.


Fantasy Frontbench is a not-for-profit and openly licensed project aimed at providing the public with an engaging and accessible platform for directly comparing politicians.

A twist on the popular fantasy football concept, the site uses open voting history data from Public Whip and They Work For You. This allows users to create their own fantasy ‘cabinet’ by selecting and sorting politicians on how they have voted in Parliament on key policy issues such as EU integration, Updating Trident, Same-sex marriage and NHS reform.

Once created, users can see how their fantasy frontbench statistically breaks down by gender, educational background, age, experience and voting history. They can then share and debate their selection on social media.
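The statistical breakdown described above is, at its core, a simple count over the attributes of a selected group. A minimal sketch of the idea (the politician records and attribute values here are invented for illustration; the real site draws on Public Whip and TheyWorkForYou data):

```python
from collections import Counter

# Hypothetical politician records; names and attribute values are invented
# for illustration only.
frontbench = [
    {"name": "MP A", "gender": "F", "education": "state"},
    {"name": "MP B", "gender": "M", "education": "private"},
    {"name": "MP C", "gender": "F", "education": "state"},
]

def breakdown(selection, attribute):
    """Count how a fantasy frontbench splits on a single attribute."""
    return Counter(member[attribute] for member in selection)

print(breakdown(frontbench, "gender"))     # Counter({'F': 2, 'M': 1})
print(breakdown(frontbench, "education"))  # Counter({'state': 2, 'private': 1})
```

The same function serves every breakdown the site offers (gender, education, age band, and so on) by switching the attribute name.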

The site is open licensed and we hope to make datasets of user selections available via figshare for academic inquiry.

A wholly state educated frontbench, from our gallery.

Aim of the project

Our aim is to present political data in a way that is engaging and accessible to those who may traditionally feel intimidated by political media. We wish to empower voters through information and provide them with the opportunity to compare politicians on the issues that most matter to them. We hope the tool will encourage political discourse and increase voter engagement.


Uses in education

The site features explanations of the electoral system and will hopefully help learners to easily understand how the cabinet is formed, the roles and responsibilities of cabinet ministers and the primary processes of government. Moreover, we hope as learners use the site, it will raise questions surrounding the way in which MPs vote in Parliament and the way in which bills are debated and amended. Finally, we host a gallery page which features a number of frontbenches curated by our team. This allows learners to see how different groups and demographics of politicians would work together. Such frontbenches include an All Female Frontbench, Youngest Frontbench, Most Experienced Frontbench, State Educated Frontbench, and a Pro Same-sex Marriage Frontbench, to name but a few.

Users can see how their frontbench in Parliament has voted on 75 different policy issues.


Over the coming weeks, we will continue to develop the site, introducing descriptions of the main political parties, adding graphs which will allow users to track or ‘follow’ how politicians are voting, as well as adding historical frontbenches to the gallery e.g. Tony Blair’s 1997 Frontbench, Margaret Thatcher’s 1979 Frontbench and Winston Churchill’s Wartime Frontbench.

For further information or if you would like to work with us, please contact or tweet us at @FantasyFbench.


Fantasy Frontbench is a not-for-profit organisation and is endorsed and funded by the Joseph Rowntree Reform Trust Ltd.

Javiera Atenas provided advice on open licensing and open data for the project.

Progress for School Libraries! / District Dispatch

Yellow school bus

By H. Michael Miley

This week, the Senate Committee on Health, Education, Labor and Pensions (aka “HELP Committee”) met to mark-up (debate, amend and vote on) the Every Child Achieves Act of 2015, a bill that would reauthorize the Elementary and Secondary Education Act (ESEA), formerly known as No Child Left Behind.

The American Library Association (ALA) sought amendments to require that every student have access to an “effective school library program,” defined in statute to require that every school library be staffed by a certified librarian, equipped with up-to-date materials and technology, and enriched by a curriculum jointly developed by a grantee school’s librarians and classroom teachers, and to codify the currently funded Innovative Approaches to Literacy (IAL) program under ESEA.

While we did not get all we had hoped for, the Committee did adopt Sen. Sheldon Whitehouse’s (with co-sponsors: Sens. Bob Casey, Susan Collins, and Elizabeth Warren) amendment to amend Title V of ESEA establishing “effective school library programs” as an eligible use of funds under a program for literacy and arts education. Passed by unanimous consent as part of Chairman Sen. Lamar Alexander’s “manager’s amendment” package, this provision would allow grants to be awarded to low-income communities for “developing and enhancing effective school library programs, which may include providing professional development for school librarians, books, and up-to-date materials to low-income schools.”

The bill that the Committee marked up and passed will next be taken up by the full Senate, although we don’t yet know when. Our champion, Senator Jack Reed, intends to propose a stronger amendment on the Senate floor than the one adopted by the HELP Committee to broadly provide dedicated funding for school libraries and librarians in ESEA.

We would like to thank all of the library advocates who reached out to their senators and representatives to demand that Congress support effective school library programs. As we move forward in the advocacy process, there is more work to do. Stay tuned as we await further word!

The post Progress for School Libraries! appeared first on District Dispatch.

Format Migrations at Harvard Library: An NDSR Project Update / Library of Congress: The Signal

The following is a guest  post by Joey Heinen, National Digital Stewardship Resident at Harvard University Library.

Joey Heinen

As has been famously outlined by the Library of Congress on their website on sustainability factors for digital formats, digital material is just as susceptible to obsolescence as analog formats. Within digital preservation there are a number of strategies that can be employed in order to protect your data including refreshing, emulation or migration, to name a few. As the National Digital Stewardship Resident at Harvard Library, I am responsible for developing a format migration framework which can be continuously adapted for migration projects at Harvard.

In order to test the viability of this framework, I am also planning for migration of three obsolete formats within the Digital Repository Service (DRS) – Kodak PhotoCD, SMIL playlists and RealAudio. While each format will have its own challenges for a standard workflow, there are certain processes which will always be incorporated into the overall migration framework. In a sense I am helping to create a series of incantations that must be uttered in order to raise these much-cherished digital materials back from the dead. No sage-burning necessary.

Migration is the chosen digital preservation strategy for this project since the aim of migration is to move content from its previously tenuous origins to a format with much greater promise in terms of support and usage. Our overall goal is to continue to provide remote access on modern platforms in a way that best matches the original format.

A Framework Emerges – First Steps

I began my residency by performing a broad literature review on the status of migration projects across the library field. This was a great way to acquaint myself with the terrain, but greater depth would be needed by using some real examples and understanding the institutional context of Harvard – its staff structure, its resources, its policies and its digital repository. Bouncing back and forth between the broader framework and the individual format plans, some patterns began to emerge. After further processing, we have arrived upon some core attributes that will inform the overall framework. The specifics of this framework are still in development and are much too large to narrate here, but I’ll discuss some of the most distinct themes.

Stakeholder involvement

The mention of “stakeholder involvement” first is deliberate – without gaining a sense for the “who,” the project cannot commence. Depending on the type of content, the exact cast of characters may vary but the types of roles will stay somewhat consistent. For the framework, we identified the following key areas of responsibility and corresponding responsible parties:

  • Project Management (that’s me!).
  • Technical Guidance/Format Experts (those who understand the format best).
  • Documentation (that’s me too! Though gathering provenance and creation of documentation throughout the migration may originate from other departments, depending).
  • Quality Assurance/Plan Approval (that’s pretty much everyone but at different points in the process).
  • Systems Conformance/Technical Infrastructure (this is almost always our friends in Library IT staff and Metadata who inform us of how the plan does or does not comply with current technological procedures and infrastructure).
  • Content Ownership (curators or collection managers, involvement is generally just to be informed of major decisions).

Defined Project Phases

In general, our migration plans can be broken down into these essential phases:

  • Planning for the Test.
  • Testing.
  • Refining the Plan.
  • Executing the Plan.
  • Verifying Results and Project Wrap-Up.

From these project phases, we then defined the following within each phase:

  • Workflow Activities – essential steps in the migration workflow.
  • Workflow Components – ways of grouping the more granular activities.
  • Project Deliverables – this could take on the form of: the migrated content itself; documentation or metadata generated along the way; diagrams of the workflow and the migration path (e.g. how the content in relation to the Harvard repository will change from pre- to post-migration); or new revelations in digital preservation policies e.g. storage and retention plans.
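The phases, activities, and deliverables above can be modeled as a small data structure. A minimal sketch, where the activity and deliverable entries are my own shorthand rather than Harvard's actual plan:

```python
from dataclasses import dataclass

@dataclass
class Phase:
    name: str
    activities: list      # Workflow Activities: granular steps in the migration
    deliverables: list    # Project Deliverables produced during this phase

# Illustrative skeleton following the five phases named in the post.
plan = [
    Phase("Planning for the Test",
          ["format/tools research", "confirm migration criteria"],
          ["test plan"]),
    Phase("Testing", ["run sample migration"], ["test results"]),
    Phase("Refining the Plan", ["adjust tools and settings"],
          ["revised workflow diagram"]),
    Phase("Executing the Plan", ["batch migration"], ["migrated content"]),
    Phase("Verifying Results and Project Wrap-Up",
          ["quality assurance checks"], ["final documentation"]),
]

for phase in plan:
    print(phase.name, "->", ", ".join(phase.deliverables))
```

Recording the plan in this explicit form makes it straightforward to reuse the same skeleton for each new format while swapping in format-specific activities.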

Last but not least, we want to consider how other projects within the library might impact the migration plan, whether in terms of timing and staff availability, as well as projects that might impact the infrastructure upon which migration is supported. For example, the metadata from Harvard’s DRS is being migrated to a new version of the DRS which includes changes to how relationships between files and objects are described. The relationship structure of still image objects will be completely different before and after this metadata migration so a plan to migrate the Kodak PhotoCD files will need to take this into consideration.

Format Specifics – Examples

In terms of how this framework has been used on the actual formats, we have made the most progress on Kodak PhotoCD, mostly because it’s less complex and less staff intensive than the SMIL/RealAudio formats. So far we have completed the analysis, creation of the test, and the testing itself, and are beginning to define how the old image objects will be changed relative to the inclusion of migrated content, additional artifacts (e.g. metadata) and the new content model structuring. The details of our decisions around successfully migrating PhotoCD content are too lengthy for this post (though more information can be found on the NDSR blog). However, the Migration Workflow and Migration Pathway diagrams shown here help to show “how the sausage is made.”

Migration Workflow

The Migration Workflow demonstrates every step of the process from gathering documentation for initial analysis to ingest of the migrated content into the repository. In the example at left, we see the first two components of Phase 1 of the Migration Workflow – Format/Tools Research and Confirming Migration Criteria. As is shown in the corresponding legend, stakeholder involvement is determined based on a colored box which names the stakeholder group within each component. These roles were designed based on the RACI Responsibility Assignment Matrix, which defines four levels of responsibility.
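As a concrete illustration of the RACI idea, one workflow component's stakeholder assignments could be recorded like this (a sketch only; the role-to-level mapping below is invented, not Harvard's actual matrix):

```python
# The four RACI levels of responsibility.
RACI_LEVELS = {
    "R": "Responsible",
    "A": "Accountable",
    "C": "Consulted",
    "I": "Informed",
}

# Hypothetical assignments for the Format/Tools Research component.
format_tools_research = {
    "Project Management": "A",
    "Technical Guidance/Format Experts": "R",
    "Systems Conformance/Technical Infrastructure": "C",
    "Content Ownership": "I",
}

# Every stakeholder must be assigned one of the four RACI levels.
assert all(level in RACI_LEVELS for level in format_tools_research.values())
```

One mapping like this per workflow component yields the full matrix that the diagram's colored boxes encode visually.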

Migration Pathway

The Migration Pathway diagram (at right) shows how content will be transformed by a migration. A diagram is produced for each “bucket” of content for which the same tools, settings and outputs can be applied uniformly based on shared technical characteristics. This example, from the Horblit Collection, a collection of daguerreotypes initially digitized in PhotoCD form, shows the ways in which the original PhotoCD content as found within the DRS will be converted and newly packaged and ingested into the repository. It considers how the image objects look now (DRS1), how they will look after the metadata migration (DRS2) and how the object will look after the content is migrated.

In the two months remaining for my residency I will be completing the overall framework, and working on the Kodak PhotoCD and SMIL/RealAudio plans (though execution of these plans will certainly fall outside of this timeline). After planning for the format-specific migration and going through several passes at the overall framework, we are getting closer to an actionable model for ongoing migration projects.

It has been fascinating to oscillate between deep analysis of the technical and infrastructural challenges faced with each format and finding ways to abstract these processes into a template that can be continuously adapted. The result will certainly be of use to Harvard, and our hope is that in sharing it with the larger digital preservation field that it will be useful to others as well. For the finalized spells and incantations, check the NDSR blog or Harvard website at the end of May. Presto Change-o!

DPLAFest Attendees: Support LGBTQ Youth in Indiana! / DPLA

After the passage of SEA 101 (the Indiana Religious Freedom Restoration Act), many scheduled attendees of DPLAFest were conflicted about its location in Indianapolis. Emily Gore, DPLA Director for Content, captured both this conflict and the opportunity the location provides when she wrote:

We should want to support our hosts and the businesses in Indianapolis who are standing up against this law… At DPLAfest, we will also have visible ways to show that we are against this kind of discrimination, including enshrining our values in our Code of Conduct.  We encourage you to use this as an opportunity to let your voice and your dollars speak.

As DPLAFest attendees, patronizing businesses identifying themselves with Open for Service is an important start, but some of us wanted to do more. During our visit to Indianapolis, we are donating money to local charities supporting the communities and values that SEA 101 threatens.

One such local charity is the Indiana Youth Group (IYG). The IYG “provides safe places and confidential environments where self-identified lesbian, gay, bisexual, transgender, and questioning youth are empowered through programs, support services, social and leadership opportunities and community service. IYG advocates on their behalf in schools, in the community and through family support services.” IYG was written up as a direct-action donation option in the New Civil Rights Movement, and they provide services and support in parts of the state with a more hostile legal environment than Indianapolis.

This kind of local, direct action effort needs our support in Indiana right now.  If you can, please consider donating to the Indiana Youth Group while in Indiana for DPLAFest. There is an existing GoFundMe campaign that IYG recommended linked below. If you choose to donate via GoFundMe, please consider tagging your donation with #DPLAFest so that we can communicate the goodwill of DPLAFest attendees as a group to the charity. The GoFundMe campaign sends money directly to IYG regardless of fundraising goals.

GoFundMe for Indiana Youth Group:

You can also donate via PayPal through IYG’s website. If you choose to donate through PayPal, please consider mentioning DPLAFest in the related forms on IYG site. IYG has offered to collate those responses with donations to again communicate the positive support DPLAFest attendees give to the charity and to LGBTQ youth in the state of Indiana.

Thank you for considering joining us and other DPLAFest attendees in supporting LGBTQ communities in Indiana. We look forward to seeing you in Indianapolis.

Honouring the memory of leading Open Knowledge community member Subhajit Ganguly / Open Knowledge Foundation

It is with great sadness that we have learned that Mr. Subhajit Ganguly, an Open Knowledge Ambassador in India and a leading community member in the entire region, has suddenly and tragically passed away.

Following a short period of illness Subhajit Ganguly, who was only 30 years old, passed away on the morning of April 7, local time, in the hospital in his hometown of Kolkata, India. His demise came as a shock to his family and loved ones, as well as to his colleagues and peers in the global open data and open knowledge community.

Subhajit was known as a relentless advocate for justice and equality, and a strong proponent and community builder around issues such as open data, open science and open education, which were all areas to which he devoted a large part of both his professional and personal time. Most recently he was the main catalyst and organiser of the India Open Data Summit, and he successfully contributed as project lead for the Indian Local City Census as well as being a submitter and reviewer of datasets in the Global Open Data Index, a global community-driven project that compares the openness of datasets worldwide in pursuit of another issue most pressing to him: political transparency and accountability.

Subhajit was also instrumental in building the Open Knowledge India Local Group over the past two years, alongside also volunteering his time to coordinate other groups and initiatives within the open data landscape. Just last summer he attended the Open Knowledge Festival in Berlin to join his fellow community leaders to plan the future of open knowledge and open data in India, regionally in AsiaPAC, and globally.

Ever since the news passed across the globe during the last few days, messages and praise of Subhajit’s being and work have been pouring in from community leaders and members from near and far. He will be tremendously missed, and we join the many voices across the world mourning his loss.

Our thoughts and condolences go out to his family and loved ones. We hope that his work and vision will continue to stand as a significant example to follow for people around the world. May Subhajit rest in peace.

Subhajit (holding the sign) among his Open Knowledge community peers at OK Festival in Berlin, 2014 (Photo: Burt Lum, CC BY-NC-ND)

A Second Collaborative Technology / LITA

In September, I wrote a post about new collaborative technology from Crestron. We installed AirMedia in our library, and we are now looking at AirTame as a possible next generation version of collaborative technology.


Licensed under a Creative Commons Attribution Share-Alike 3.0 License by

Airtame works on all mobile devices. AirMedia does this too, but the tablet features have been less than ideal. Airtame was able to raise more money than expected and is currently working to scale its production.

My university is also considering how collaborative technologies can be used in the classroom. This type of technology will allow for enhanced group work, enhanced presentations, and the instructor being able to move around the classroom to work with different students instead of being tied to the front of the classroom.

As technology continues to move toward mobile and wearable, the ability to show a group what is on a small screen will become more important in both education and the business world.

How is your library using collaborative technology?

How can libraries support new communication methods using collaborative technology?

Recordings Available: “Integrating ORCID Persistent Identifiers with DSpace, Fedora and VIVO.” / DuraSpace News

DuraSpace launched its 11th Hot Topics Community Webinar Series, “Integrating ORCID Persistent Identifiers with DSpace, Fedora and VIVO” last month.  Curated by ORCID’s European Regional Director, Josh Brown, this series provided detailed insights into how ORCID persistent digital identifiers can be integrated with DSpace and Fedora repositories and with the VIVO open source semantic

April 17 Web Services Maintenance Canceled / OCLC Dev Network

The Web service maintenance scheduled for Friday, April 17 has been canceled and will be rescheduled. Stay tuned to Developer Network for updates on future maintenance windows. 

We apologize for any inconvenience. 

Technology and Youth Services Programs / LITA

Technology and Youth Services Programs: Early Literacy Apps and More

Tuesday May 20, 2015
1:00 pm – 2:00 pm Central Time
Register now for this webinar

A brand new LITA Webinar on youth and technology.

In this digital age it has become increasingly important for libraries to infuse technology into their programs and services. Youth services librarians are faced with many technology routes to consider and app options to evaluate and explore. Join Claire Moore from the Darien Public Library to discuss innovative and effective ways the library can create opportunities for children, parents and caregivers to explore new technologies.

Claire Moore

Claire Moore is the Head of Children’s Services at Darien Library in Connecticut. She is a member of ALSC’s School Age Programs and Services Committee and the Digital Content Task Force. Claire earned her Masters in Library and Information Science at Pratt Institute in New York. Claire currently lives in Brooklyn, NY.

Then register for the webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.

LITA Member: $45
Non-Member: $105
Group: $196
Registration Information

Register Online page arranged by session date (login required)
Mail or fax form to ALA Registration
Call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4269 or Mark Beatty,

Open Access Attitudes of Computer Science Professors / Peter Murray

My Communications of the ACM came in the mail recently, and in an article about the future of scholarly publishing in computer science (in general — and what the ACM Publications Board is thinking about doing), there was this paragraph about the attitudes of a subset of ACM members towards open access publishing.

Open access models are an area of broad interest, and we could fill a dozen columns on different issues related to open access publishing. Based on actions taken by certain research funders (primarily governmental, but also foundations), we have been looking at whether and how to incorporate author-pays open access into ACM’s journals. We asked ACM Fellows about author-pays Gold OA journals, and specifically whether they preferred a Gold “umbrella” journal across computer science vs. Gold-only specialty journals vs. Gold editions of current journals. Gold-only specialty journals were preferred by 15% of Fellows; a Gold umbrella journal by 29%; and Gold editions of existing journals by 45%. Ten percent were against author-pays open access generally, preferring the current model or advocating for non-author-pays open access models.

Note that the ACM Fellows are “the top 1% of ACM members [recognized] for their outstanding accomplishments in computing and information technology and/or outstanding service to ACM and the larger computing community.” So it is hardly a representative sample of computer science professors. They do have a survey form for broader, if still unscientific, input on the topic.

I’ve tangled with the ACM editor-in-chief before about the cost of the ACM digital library subscriptions cross-subsidizing other ACM activities. There have been others that have taken the ACM to task for their open access policies. It is good to see the publications committee learning from past missteps, educating and then listening to its members, and being willing to consider change in this area.

Research across the Curriculum / Dan Scott

The following post dates back to January 15, 2007, when I had been employed at Laurentian for less than a year and was getting an institutional repository up and running.... I think old me had some interesting thoughts!


The author advocates an approach to university curriculum that re-emphasizes the student's role in the search for truth and knowledge by providing essential critical thinking skills and treating undergraduate students as full participants in the academic discussion.


The academy is a place to develop critical thinking skills, and a place to develop those skills by participating in discussions seeking truth and knowledge. These conversations may occur between students in informal spaces; they may be facilitated by a professor and take place during a single class session or over multiple sessions during a course; or they may take place over centuries (most commonly through the medium of the written word).

As a university, we recognize the value of all of these conversations in developing citizens with well-honed critical thinking skills. However, I would argue that our focus (at least at the undergraduate level) has been on the level of single and multiple class discussions. Students are often assigned course work for which the only intended audience is the professor or marking T.A.; the audience for presentations is normally just the rest of the class. A typical unit of work is the “essay” (from the French: essayer, meaning “to try”).

(Rhetorical question alert!) But what are the students trying for? Typically, they are trying for grades; some for an A, some simply to pass. But are they trying to contribute to the greater academic discussions? Where do those essays go in a month, or a year? Do students see their papers as parts of a greater continuum of the academic discussion, or do they see them as a means to an end? Are students exhorted to aspire to publishing their papers on any scale? What effect does the treatment of course work as an ephemeral entity, rather than a permanent contribution to the field of knowledge, have on the motivation of students to excel in the application of their critical thinking skills, to be creative, to write high quality papers? Does the knowledge that their days and nights of hard work are going to be quickly consigned to the trash bin cause students to treat the work of the intellectual giants that preceded them with a similar disregard?


I initially started worrying about this because of a third-year assignment that simply cited “Google” as its sole source. The sad confusion of search tool with source immediately raised my concern about the student's ability to evaluate alternative sources of information and opinion for authority. I doubted that this student had completed the Library's introductory tutorial on searching and citing sources, and that reinforced my desire to encourage programs to make this course a mandatory requirement. During a casual conversation with Dr. David Robinson, he disclosed that he assigned basic literature research tasks to every one of his courses because he could not guarantee that his students had learned those skills outside of his courses. I continued to reflect on this problem in the attempt to develop an approach to motivating the student to want to participate in the overarching discussions – and that is where the idea of “research across the curriculum” came to mind.

I will credit Dr. Laurence Steven with the idea of motivating higher quality undergraduate work through the expectation of publication. In his fourth-year Literary Criticism course in 1996, he told students at the outset of the class that he planned to compile and publish the complete set of our final assignments. Even though the press run was undoubtedly under 100, the commitment to taking our work seriously positively influenced our efforts to produce high-quality assignments.

Emphasizing the academic discussion

The overarching message we can send to students is: “We take your effort seriously, and will help you contribute to your chosen discipline.”

Publishing offers the carrot of fame and the stick of exposure. I cannot help but think that the expectation of publishing your work will improve the quality of that work.

We obviously cannot expect a first year student to publish their work in a traditional academic journal. However, the Web has given us an alternative publishing method that can be controlled to meet the student's comfort level: publishing visibility could be limited to the author herself, to the professor, to the class, to the program, to the university, and to the world. If we created a simple Web-based repository, we could allow a student to first work on drafts of their assignment, then open it up to their professor or a TA for initial review, then open it up to the class to exchange their work with their classmates and participate in peer review. Outstanding work could be surfaced at wider levels of availability. Of course, given that the student retains copyright over their work, they would be free to republish their work as they see fit (on a personal Web log, on a discipline-related mailing list, to an academic journal, etc). This opens up an opportunity to discuss intellectual property issues and the characteristics of various publishing mechanisms.
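The graduated visibility levels described above form a natural ordering, from the author alone out to the world. A minimal sketch of how a repository might model them (the level names come from the post; the access check itself is my own assumption about the intended semantics):

```python
from enum import IntEnum

# Widening circles of visibility, narrowest to widest.
class Visibility(IntEnum):
    AUTHOR = 1
    PROFESSOR = 2
    CLASS = 3
    PROGRAM = 4
    UNIVERSITY = 5
    WORLD = 6

def can_view(viewer_scope: Visibility, paper_visibility: Visibility) -> bool:
    """A viewer sees a paper only if it has been opened up at least as far
    as that viewer's circle (e.g. a CLASS-level paper is visible to the
    professor and classmates, but not to the wider program)."""
    return viewer_scope <= paper_visibility

print(can_view(Visibility.PROFESSOR, Visibility.CLASS))  # True
print(can_view(Visibility.WORLD, Visibility.CLASS))      # False
```

Because the levels are strictly nested, a single integer comparison suffices; a student widening their paper's visibility never has to re-grant access to the narrower circles.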

Through the course of a student's career, this Web-based publishing mechanism would serve as an electronic portfolio of their work. If a student chose to make their work visible outside of the class, they would be able to track citations to that work over time - particularly if professors chose to surface the work of previous students in a given class as optional or required references in addition to traditional sources. We know that one of the primary uses of the Laurentian University Archives today is by students seeking the fourth-year papers of previous students in their disciplines so that they can find work to build upon.

At the fourth-year level, we could strongly encourage (to the point of making it an unstated assumption) that fourth-year work should be published in some fashion. The publishing schedule of traditional journals makes it unlikely that a student could achieve publication within the normal class schedule, however we could commit some resources to assisting those alumni who want to polish their fourth-year papers for journal publication (without necessarily requiring a complete graduate program). Assuming that the J.N. Desmarais Library goes forward with the Laurentian University Institutional Repository, we could offer that as a venue for publishing fourth year work (or exceptional work from previous years).

If there are doubts that fourth-year work is of publishable quality, I would like to refer back to an evaluation (???) of the fourth-year papers that are held by the Laurentian University archives. Many of these papers were found to be of a quality comparable to Master's theses (the hypothesis was that the lack of graduate programs resulted in higher-quality undergraduate work).

Jobs in Information Technology: April 15 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Associate University Librarian for Digital Strategies, Northwestern University, Evanston, IL

Data Curator, DST Systems, Kansas City, MO

Head Librarian, Systems & Applications #12530, Boston College, Chestnut Hill, MA

Learning & Assessment Designer, Harvard Library, Cambridge, MA

Systems Librarian, Hobart and William Smith Colleges, Geneva, NY

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

The Life of a Lowly MARC Subfield / HangingTogether

Science uses the art of observation to unearth truth. Sometimes the observation is minutely focused on a small constituent of a much larger ecosystem. By doing this, it can be possible to detect larger truths from such minutely focused observation. This brings me to my latest metadata investigation, which is about as minutely focused within the library metadata world as it is possible to be.

I decided to look at the life of a single MARC subfield, in this case the lowly 034 $2. The 034 field, “Coded Cartographic Mathematical Data”, was proposed and adopted in 2006. The $2 subfield is where one can record the source of the data in the 034, with values to come from a specified list of potential values.

From my “MARC Usage in WorldCat” work, I already knew that as of last January there were about 2.4 million records with an 034 field. I also knew that the $2 subfield of the 034 only appeared 1,976 times. Of course a year had passed so that figure was likely low.

So the first thing I did was to grab all of the 034 $2 subfields and count how many times each source code had been used. Since the point of my exercise was not to show errors, I combined entries that had typos with what they should have been, and counted as “errors” only entries that were clearly in the wrong place in the field:

3868 bound
2539 gooearth
1069 geoapn
215 geonet
157 geonames
129 pnosa2011
46 other
26 gnis
17 cga
5 local
3 gnrnsw
3 aadcg
1 wikiped
1 gettytgn
1 geoapn geonames
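The typo-folding tally described above can be sketched in Python with a Counter; the raw values and the typo-to-canonical mapping below are illustrative stand-ins, not the actual extracted data:

```python
from collections import Counter

# Hypothetical raw 034 $2 values as extracted; the typo variants and the
# typo -> canonical mapping below are illustrative, not the actual data.
raw_codes = ["gooearth", "gooearth", "goearth", "bound", "bnd", "geoapn", "bound"]
canonical = {"goearth": "gooearth", "bnd": "bound"}

# Fold each typo variant into its canonical code, then count occurrences
counts = Counter(canonical.get(code, code) for code in raw_codes)
print(counts.most_common())
```

The same Counter, run over the 040 $a values instead, produces the per-agency totals shown further down.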

I then wanted to find out who was using this subfield, so I ran a job to extract the 040 $a, the “original cataloging agency” and totaled the occurrences. It turns out the vast majority come from five institutions:

2471 National Library of Israel (J9U)
1632 Libraries Australia (AU@)
1076 British Library (UKMGB)
885 Pennsylvania State University (UPM)
799 Cambridge University (UkCU)

Then it drops off rather precipitously from there:

213 Agency for the Legal Deposit Libraries (Scotland) (StEdALDL)
206 New York Public (NYP)
117 Commonwealth Libraries, Bureau of State Library, Pennsylvania (PHA)
101 Yale University, Beinecke Rare Book and Manuscript Library (CtY-BR)

Curious about how the main user of this element was using it, I contacted the National Library of Israel. They were kind enough to reply to my odd query:

We have added geographic coordinates to records that describe ketubot, Jewish marriage contracts. The contracts almost always include the geographic location where the wedding takes place.

Using google earth ($2 gooearth), we added the coordinates with the intention of enabling the display of a google map in this website.

I don’t believe that the site is fully functional as to their intended goal, but you can at least start to get an idea as to how this data is going to be used. So even a lowly subfield can have higher aspirations for impact than may seem warranted at first.

About Roy Tennant

Roy Tennant works on projects related to improving the technological infrastructure of libraries, museums, and archives.

The role of Role in Schema.org / Richard Wallis

Schema.org is basically a simple vocabulary for describing stuff on the web.  Embed it in your HTML and the search engines will pick it up as they crawl, and add it to their structured data knowledge graphs.  They even give you three formats to choose from — Microdata, RDFa, and JSON-LD — when doing the embedding.  I’m assuming, for this post, that the benefits of being part of the knowledge graphs that underpin so-called Semantic Search, and hopefully triggering some Rich Snippet enhanced results display as a side benefit, are self evident.

The vocabulary itself is comparatively easy to apply once you get your head around it — find the appropriate Type (Person, CreativeWork, Place, Organization, etc.) for the thing you are describing, check out the properties in the documentation, and code up the ones you have values for.  Ideally provide a URI (a URL, in Schema.org terms) for a property that references another thing, but if you don’t have one a simple string will do.

There are a few strangenesses that hit you when you first delve into using the vocabulary.  For example, there is no problem in describing something that is of multiple types — a LocalBusiness is both an Organization and a Place.  This post is about another unusual, but very useful, aspect of the vocabulary — the Role type.

At first look at the documentation, Role looks like a very simple type with a handful of properties.  On closer inspection, however, it doesn’t seem to fit in with the rest of the vocabulary.  That is because it is capable of fitting almost anywhere.  Anywhere there is a relationship between one type and another, that is.  It is a special case type that allows a relationship, say between a Person and an Organization, to be given extra attributes.  Some might term this as a form of annotation.

So what need is this satisfying, you may ask.  It must be a significant need to cause the creation of a special case in the vocabulary.  Let me walk through a case, drawn from a blog post, to explain a need scenario and how Role satisfies that need.

Starting With American Football

Say you are describing members of an American Football Team.  Firstly you would describe the team using the SportsOrganization type, giving it a name, sport, etc. Using RDFa:

<div vocab="http://schema.org/" typeof="SportsOrganization" resource="">
    <span property="name">Touchline Gods</span>
    <span property="sport">American Football</span>
</div>

Then describe a player using a Person type, providing name, gender, etc.:

<div vocab="http://schema.org/" typeof="Person" resource="">
    <span property="name">Chucker Roberts</span>
    <span property="birthDate">1989</span>
</div>

Now let’s relate them together by adding an athlete relationship to the Person description:

<div vocab="http://schema.org/" typeof="SportsOrganization">
    <span property="name">Touchline Gods</span>
    <span property="sport">American Football</span>
    <span property="athlete" typeof="Person" src="">
       <span property="name">Chucker Roberts</span>
       <span property="birthDate">1989</span>
    </span>
</div>

Let’s take a look at the data structure we have created, using Turtle – not an HTML markup syntax, but an excellent way to visualise the data structures isolated from the HTML:

@prefix schema: <http://schema.org/> .
<> a schema:SportsOrganization;
    schema:name "Touchline Gods";
    schema:sport "American Football";
    schema:athlete <> .
<> a schema:Person;
    schema:name "Chucker Roberts";
    schema:birthDate "1989".

So we now have Chucker Roberts described as an athlete on the Touchline Gods team.  The obvious question then is how do we describe the position he plays in the team.  We could have extended the SportsOrganization type with a property for every position, but scaling that across every position for every team sport type would have soon ended up with far more properties than would have been sensible, and beyond the maintenance scope of a generic vocabulary such as Schema.org.

This is where Role comes in handy.  Regardless of the range defined for any property in Schema.org, it is acceptable to provide a Role as a value.  The convention then is to use a property with the same name, that the Role is a value for, to remake the connection to the referenced thing (in this case the Person).  In simple terms, we have just inserted a Role type between the original two descriptions.


This indirection has not added much, you might initially think, but Role has some properties of its own (startDate, endDate, roleName) that can help us qualify the relationship between the SportsOrganization and the athlete (Person).  For organizations there is a subtype of Role (OrganizationRole) which allows the relationship to be qualified slightly more.



<div vocab="http://schema.org/" typeof="SportsOrganization" resource="">
      <span property="name">Touchline Gods</span>
      <span property="sport">American Football</span>
      <span property="athlete" typeof="OrganizationRole">
          <span property="startDate">01072014</span>
          <span property="roleName">Quarterback</span>
          <span property="number">11</span>
          <span property="athlete" typeof="Person" src="">
              <span property="name">Chucker Roberts</span>
              <span property="birthDate">1989</span>
          </span>
      </span>
</div>

and in Turtle:

@prefix schema: <http://schema.org/> .
<> a schema:SportsOrganization;
    schema:name "Touchline Gods";
    schema:sport "American Football";
    schema:athlete [
        a schema:OrganizationRole;
        schema:roleName "Quarterback";
        schema:startDate "01072014";
        schema:number "11";
        schema:athlete <>
    ] .
<> a schema:Person;
    schema:name "Chucker Roberts";
    schema:birthDate "1989".
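For completeness, the same qualified relationship can also be expressed in JSON-LD, the third embedding format Schema.org accepts; a sketch of the structure above (nesting the Role and Person as inline objects):

```json
{
  "@context": "http://schema.org",
  "@type": "SportsOrganization",
  "name": "Touchline Gods",
  "sport": "American Football",
  "athlete": {
    "@type": "OrganizationRole",
    "roleName": "Quarterback",
    "startDate": "01072014",
    "number": "11",
    "athlete": {
      "@type": "Person",
      "name": "Chucker Roberts",
      "birthDate": "1989"
    }
  }
}
```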

Beyond American Football

So far I have just been stepping through the example provided in the blog post on this.  Let’s take a look at an example from another domain – the one I spend my life immersed in – libraries.

There are many relationships between creative works that libraries curate and describe (books, articles, theses, manuscripts, etc.) and people & organisations that are not covered adequately by the properties available (author, illustrator,  contributor, publisher, character, etc.) in CreativeWork and its subtypes.  By using Role, in the same way as in the sports example above,  we have the flexibility to describe what is needed.

Take a book (How to be Orange: an alternative Dutch assimilation course) authored by Gregory Scott Shapiro that has a preface written by Floor de Goede. As there is no writerOfPreface property we can use, the best we could do is to put Floor de Goede in as a contributor.  However, by using Role we can qualify the contribution he made to be that of the writer of preface.


In Turtle:

@prefix schema: <http://schema.org/> .
@prefix relators: <http://id.loc.gov/vocabulary/relators/> .
@prefix viaf: <http://viaf.org/viaf/> .
<> a schema:Book;
    schema:name "How to be orange : an alternative Dutch assimilation course";
    schema:author viaf:305830120; # Gregory Scott Shapiro
    schema:exampleOfWork <> ;
    schema:contributor [
        a schema:Role;
        schema:roleName relators:wpr; # Writer of preface
        schema:contributor viaf:283191359 # Floor de Goede
    ] .

and in RDFa:

<div vocab="http://schema.org/" typeof="Book" resource="">
      <span property="name">How to be orange : an alternative Dutch assimilation course</span>
      <span property="author" src="">Gregory Scott Shapiro</span>
      <span property="exampleOfWork" src=""></span>
      <span property="contributor" typeof="Role">
          <span property="roleName" src="">Writer of preface</span>
          <span property="contributor" src="http://">Floor de Goede</span>
      </span>
</div>

You will note in this example I have made use of URLs to external resources – VIAF for identifying the Persons and the Library of Congress relator codes – instead of defining them myself as strings.  I have also linked the book to its Work definition so that someone exploring the data can discover other editions of the same work.

Do I always use Role?
In the above example I relate a book to two people: the author and the writer of preface.  I could have linked to the author via another Role with the roleName being ‘Author’ or <>.  Although possible, it is not a recommended approach.  Wherever possible, use the properties defined for a type; this is what data consumers such as search engines are going to be initially looking for.

One last example

To demonstrate the flexibility of using the Role type here is the markup that shows a small diversion in my early career:

@prefix schema: <http://schema.org/> .
<> a schema:PerformingGroup;
    schema:name "Gentle Giant";
    schema:employee [
        a schema:Role;
        schema:roleName "Keyboards Roadie";
        schema:startDate "1975";
        schema:endDate "1976";
        schema:employee [
            a schema:Person;
            schema:name "Richard Wallis"
        ]
    ] .

This demonstrates the ability of Role to provide added information about most relationships between entities, in this case the employee relationship. Often Role itself is sufficient, and the vocabulary can be extended with subtypes of Role to provide further use-case-specific properties.

Whenever possible use URLs for roleName
In the above example, it is exceedingly unlikely that there is a citeable definition on the web I could link to for the roleName, so it is perfectly acceptable to just use the string “Keyboards Roadie”.  However, to help the search engines understand unambiguously what role you are describing, it is always better to use a URL.  If you can’t find one, for example in the Library of Congress Relator Codes or in Wikidata, consider creating one yourself in Wikipedia or Wikidata for others to share. Another spin-off benefit of using URIs (URLs) is that they are language independent: regardless of the language of the labels in the data, the URI always means the same thing.  Sources like Wikidata often have names and descriptions for things defined in multiple languages, which can be useful in itself.

Final advice
This very flexible mechanism has many potential uses when describing your resources in Schema.org. There is always a danger in overusing useful techniques such as this. Be sure that there is not already a way to describe your relationship within Schema.org, or one worth proposing to those that look after the vocabulary, before using it.

Good luck in your role in describing your resources and the relationships between them using Schema.org.

Thai librarians visit ALA Washington Office / District Dispatch



Last week, the American Library Association (ALA) Washington Office hosted librarians from Thailand who are visiting the United States to learn about library practices and futures. Our visitors, Supawan Ardkhla and Nusila Yumaso, are participants in the U.S. State Department’s International Visitor Leadership Program. Through short-term visits to the U.S., foreign leaders in a variety of fields experience our country firsthand and cultivate professional relationships. They were accompanied by interpreter Montanee Anusas-amornkul.

The visitors’ agenda was wide-ranging. Topics included ebooks, digital literacy, libraries as place, employment and entrepreneurship, and many more. After Washington, the Thai librarians visited libraries in several other U.S. cities.

ALA Washington Office Executive Director Emily Sheketoff and I represented ALA. Hosting visitors from abroad is a regular responsibility of the Office, and we’ve met with librarians from many other countries around the world, from Lebanon to Colombia.

The post Thai librarians visit ALA Washington Office appeared first on District Dispatch.

Ambitious “Hydra-in-a-Box” Effort Funded by IMLS / Roy Tennant

hydraThose who have been paying attention to the cutting edge of digital libraries no doubt know about the Hydra project headed up by Stanford. Hydra is a digital repository system that is built using Ruby and is designed to accept the full range of digital object types that a large research library must manage. Built on top of Fedora and Solr, with Blacklight as the default front-end, one doesn’t normally associate ease of installation with a stack like that. Heck, you could spend a week just getting all of the dependencies installed, configured, and up and running.

So color me surprised when it was announced that the Digital Public Library of America, Stanford University, and the Duraspace organization announced that IMLS had awarded them a $2 million National Leadership Grant to develop “Hydra-in-a-Box”. Just as it sounds, the goal is to “build, bundle, and promote a feature-complete, robust digital repository that is easy to install, configure, and maintain—in short, a next-generation digital repository that will work for institutions large and small, and is capable of running as a hosted service.”

That is no small goal, and a laudable one at that. But…gosh. What a distance there is to travel to get there. The project has it pegged at 30 months, so nearly three years. That sounds about right, and so far Tom Cramer has built one of the most broad-based coalitions I’ve seen in academic libraries around Hydra, so you won’t find me betting against him. Especially since he just landed $2 million to help him build out his pet project. So as much as it pains this Cal Bear to say it, Go Stanford!

Bend Your Mind…and the Laws of the Universe: Adult Summer Reading 2015 / LITA

Summer is right around the corner, and a long-held tradition in the public library community is the summer reading program. Synonymous with youth and young adult services, summer reading is worth revisiting for adults.

Texas State Library and Archives Commission (2009). Flickr



Science fiction is a gateway

I believe there is a positive correlation between reading science fiction novels and genuine interest in emerging technology. When I was younger, I loved science fiction and fantasy. My interests ranged from A Princess of Mars to The Hitchhiker’s Guide to the Galaxy, and The Twilight Zone was a mark of my childhood. What I read and watched informed my psyche and furthered my interest in futuristic technology that modern humans could only dream of. The bottom line is that these books sparked an interest. Almost all tech heads I know love science fiction and fantasy. Not everyone is into books, but most science fiction films are based on alternate worlds created by authors like Isaac Asimov and Philip K. Dick. Authors of science fiction and fantasy push the envelope on physics, technology, psychology and history. These novels take place in the “future”, a fictional past, or serve as social commentary. They can be cautionary tales or an impetus for the reader to become proactive in current affairs. I’m sure no one wants to live in a world similar to Pat Frank’s Alas, Babylon.

A few suggestions for your reading list

In 2011 NPR published a fan-selected list of the top 100 science-fiction and fantasy books for summer reading. While selecting the best science-fiction/fantasy book of all time may be a point of contention amongst staunch fans, attempting a definitive ranking here would be impractical.

I went ahead and selected my favorites from NPR’s list as suggestions for summer reading. There are a few that are on my personal reading wish list and many are on my re-read wish list. Which eager reader doesn’t have a wish list?


The classics:

If you went to high school in the United States, you were probably forced to read these. You probably had to analyze the themes, tone, characters, etc. As a result the mere mention of them is trite, but they more than deserve their place on this list.

1984 by George Orwell

Fahrenheit 451 by Ray Bradbury

Brave New World by Aldous Huxley

Slaughterhouse-Five by Kurt Vonnegut

Frankenstein by Mary Shelley


The epics:

Some of the best science-fiction/fantasy books are set in universes so expansive that they require reader commitment and the ability to lift a ten-pound book. Though your eyes may be weary, you won’t be at a loss for the possibilities that are illuminated through the text.

The Lord of the Rings by J.R.R. Tolkien

Dune by Frank Herbert

Foundation by Isaac Asimov

A Game of Thrones by George R.R. Martin

The Giver by Lois Lowry (not on NPR’s list)

A Princess of Mars by Edgar Rice Burroughs (not on NPR’s list)


Notable mention:

Do Androids Dream of Electric Sheep? by Philip K. Dick

The Andromeda Strain by Michael Crichton

The Gunslinger (The Dark Tower Series) by Stephen King

Outlander by Diana Gabaldon

1632 by Eric Flint

The Body Snatchers by Jack Finney


Now that I’ve performed my reader’s advisory, what’s on your summer reading list? If you have any recommendations, reply to this post to share with others.

VIAF RDF Changes / Thom Hickey

Here's a contribution from Jeff Young, who manages the RDF aspects of VIAF:

Since Wikidata’s introduction to the Linked Data Web in 2014 and subsequent integration of Freebase, it has become a premier example of how to publish and manage Linked Data. Like VIAF, Wikidata uses Schema.org as its core RDF vocabulary, and both datasets publish using Linked Data best practices. This consistency should allow applications to treat both datasets as complementary. The main difference will be in the coverage of entities and information, based on their respective sources.

The VIAF RDF changes outlined on the Developer Network blog are intended to further enrich the data and align it with this common purpose. Some of the changes provide additional information to help disambiguate entities, such as schema:location and schema:description. Where possible, schema:name values are now language-tagged, which should make it easier for applications to select a language-appropriate label for display.

The biggest change, though, is in the “shape of the data” that gets returned via Linked Data requests. Previously, this was a record-oriented view rather than a concise description of the entity. Like Wikidata, the new response will focus on the entity itself and depend on the related entities to describe themselves.

Alignment with Wikidata is a major step in the evolution of VIAF, which started with RDF/XML representations of name authority clusters in 2009 and transitioned to “primary entities” in 2011.  The introduction of VIAF as Schema.org data in 2014 extends the audience, and integration with Wikidata further strengthens industry-standard practices. These steps should help ensure that VIAF remains an authoritative source of entity identifiers and information in the linked web of data.


Note: We expect these RDF changes to be visible on April 16, 2015.  The bulk distribution will follow shortly after that.


Far-reaching “Hydra-in-a-Box” Joint Initiative Funded by IMLS / DPLA

Boston, MA – The Digital Public Library of America (DPLA), Stanford University, and the DuraSpace organization are pleased to announce that their joint initiative has been awarded a $2M National Leadership Grant from the Institute of Museum and Library Services (IMLS). Nicknamed Hydra-in-a-Box, the project aims to foster a new, national library network through a community-based repository system, enabling discovery, interoperability and reuse of digital resources by people from this country and around the world.

This transformative network is based on advanced repositories that not only empower local institutions with new asset management capabilities, but also interconnect their data and collections through a shared platform.

“At the core of the Digital Public Library of America is our national network of hubs, and they need the systems envisioned by this project,” said Dan Cohen, DPLA’s executive director. “By combining contemporary technologies for aggregating, storing, enhancing, and serving cultural heritage content, we expect this new stack will be a huge boon to DPLA and to the broader digital library community. In addition, I’m thrilled that the project brings together the expertise of DuraSpace, Stanford, and DPLA.”

Each of the partners will fulfill specific roles in the joint initiative. Stanford will use its existing leadership in the Hydra Project to develop core components, in concert with the broader Hydra community. DPLA will focus on the connective tissue between hubs, mapping, and crosswalks to DPLA’s metadata application profile, and infrastructure to support metadata enhancement and remediation. DuraSpace will use its expertise in building and serving repositories, and doing so at scale, to construct the back-end systems for Hydra hosting.

“DuraSpace is excited to provide the infrastructure for this project,” said Debra Hanken Kurtz, DuraSpace CEO. “It aligns perfectly with our mission to steward the scholarly and cultural heritage records and make them accessible for current and future generations. We look forward to working with DPLA and Stanford to support their work and that of the community to ensure a robust and sustainable future for Hydra-in-a-Box.”

Over the project’s 30-month time frame, the partners will engage with libraries, archives, and museums nationwide, especially current and prospective DPLA hubs and the Hydra community, to systematically capture the needs for a next-generation, open source, digital repository. They will collaboratively extend the existing Hydra project codebase to build, bundle, and promote a feature-complete, robust digital repository that is easy to install, configure, and maintain—in short, a next-generation digital repository that will work for institutions large and small, and is capable of running as a hosted service. Finally, starting with DPLA’s own metadata aggregation services, the partners will work to ensure that these repositories have the necessary affordances to support networked aggregation, discovery, management and access to these resources, producing a shared, sustainable, nationwide platform.

“The Hydra Project has already demonstrated enormous traction and value as a best-in-class digital repository for institutions like Stanford,” said Tom Cramer, Chief Technology Strategist at the Stanford University Libraries. “And yet there is so much more to do. This grant will provide the means to rapidly accelerate Hydra’s rate of development and adoption–expanding its community, features and value all at once.”

To find out more about the Hydra-in-a-Box initiative, contact Dan Cohen, Tom Cramer, or Debra Hanken Kurtz.

About DPLA

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated over 8.5 million items from over 1,700 institutions. The DPLA is a registered 501(c)(3) non-profit.

About DuraSpace

DuraSpace is an independent 501(c)(3) not-for-profit organization providing leadership and innovation for open technologies that promote durable, persistent access to digital data. We collaborate with academic, scientific, cultural, and technology communities by supporting projects (DSpace, Fedora, VIVO) and creating services (DuraCloud, DSpaceDirect, ArchivesDirect) to help ensure that current and future generations have access to our collective digital heritage. Our values are expressed in our organizational byline, “Committed to our digital future.”

About Stanford University Libraries

The Stanford University Libraries is internationally recognized as a leader among research libraries, and in leveraging digital technology to support scholarship in the age of information. It is a founder of both the Hydra Project and the Fedora 4 repository effort, and a leading institution in the International Image Interoperability Framework (IIIF).

About the Hydra Project

The Hydra Project is both an open source community and a suite of software that provides a flexible and robust framework for managing, preserving, and providing access to digital assets. The project motto, “One body, many heads,” speaks to the flexibility provided by Hydra’s modern, modular architecture, and the power of combining a robust repository backend (the “body”) with flexible, tailored user interfaces (“heads”). Co-designed and developed in concert with Fedora 4, the extensible, durable, and widely used repository software, the Hydra/Fedora stack is the centerpiece of a thriving and rapidly expanding open source community poised to deliver an easy-to-implement solution.

Extended Date Time Format (EDTF) use in the DPLA: Part 3, Date Patterns / Mark E. Phillips


Date Values

I wanted to take a look at the date values that had made their way into the DPLA dataset from the various Hubs.  The first thing that I was curious about was how many unique date strings are present in the dataset; it turns out that there are 280,592.

Here are the top ten date strings, their instance counts, and whether each string is valid EDTF.

Date Value Instances Valid EDTF
[Date Unavailable] 183,825 FALSE
1939-1939 125,792 FALSE
1960-1990 73,696 FALSE
1900 28,645 TRUE
1935 – 1945 27,143 FALSE
1909 26,172 TRUE
1910 26,106 TRUE
1907 25,321 TRUE
1901 25,084 TRUE
1913 24,966 TRUE

It looks like “[Date Unavailable]” is a value used by the New York Public Library to denote that an item does not have an available date.  It should be noted that NYPL also has 377,664 items in the DPLA that have no date value present at all, so this isn’t a default behavior for items without a date; most likely it is the practice of a single division for denoting unknown or missing dates.  The value “1939-1939” is used heavily by the University of Southern California Libraries and seems to come from a single set of WPA Census Cards in their collection.  The value “1960-1990” is used primarily for items from the J. Paul Getty Trust.

Date Length

I was also curious as to the length of the dates in the dataset.  I was sure that I would find large numbers of date strings that were four digits in length (1923), ten digits in length (1923-03-04) and other lengths for common highly used date formats.  I also figured that there would be instances of dates that were either less than four digits and also longer than one would expect for a date string.  Here are some example date strings for both.
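The length screening can be sketched with a Counter and a couple of list comprehensions over the extracted date strings; the sample list here is made up for illustration:

```python
from collections import Counter

# Hypothetical sample of extracted date strings
dates = ["1923", "1923-03-04", "*", "1935 - 1945", "circa 1900"]

# Distribution of string lengths across the sample
length_counts = Counter(len(d) for d in dates)

short_dates = [d for d in dates if len(d) < 4]   # suspiciously short
long_dates = [d for d in dates if len(d) > 50]   # suspiciously long

print(length_counts.most_common())
print(short_dates, long_dates)
```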

Top ten date strings shorter than four characters

Date Value Instances
* 968
昭和3 521
昭和2 447
昭和4 439
昭和5 391
昭和9 388
昭和6 382
昭和7 366
大正4 323
昭和8 322

I’m not sure what “*” means as a date value, but the other values appear to be Japanese-era dates (this is what Google Translate tells me).  There are 14,402 records that have date strings shorter than four characters, with a total of 522 unique date strings present.

Top ten date strings longer than fifty characters.

Date Value Instances
Miniature repainted: 12th century AH/AD 18th (Safavid) 35
Some repainting: 13th century AH/AD 19th century (Safavid 25
11th century AH/AD 17th century-13th century AH/AD 19th century (Safavid (?)) 15
1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939 13
10th century AH/AD 16th century-12th century AH/AD 18th century (Ottoman) 10
late 11th century AH/AD 17th century-early 12th century AH/AD 18th century (Ottoman) 8
5th century AH/AD 11th century-6th century AH/AD 12th century (Abbasid) 7
4th quarter 8th century AH/AD 14th century (Mamluk) 5
L’an III de la République française … [1794-1795] 5
Began with 1st rept. (112th Congress, 1st session, published June 24, 2011) 3

There are 1,033 items with 894 unique values that are over fifty characters in length.  The longest “date string” is 193 characters, with a value of “chez W. Innys, J. Brotherton, R. Ware, W. Meadows, T. Meighan, J. & P. Knapton, J. Brindley, J. Clarke, S. Birt, D. Browne, T. Dongman, J. Shuckburgh, C. Hitch, J. Hodges, S. Austen, A. Millar,” which appears to be a misplacement of another field’s data.

Here is the distribution of these items with date strings with fifty characters in length or more.

Hub Name | Items with Date Strings 50 Characters or Longer
United States Government Printing Office (GPO) | 683
HathiTrust | 172
ARTstor | 112
Mountain West Digital Library | 31
Smithsonian Institution | 25
University of Illinois at Urbana-Champaign | 3
J. Paul Getty Trust | 2
Missouri Hub | 2
North Carolina Digital Heritage Center | 2
Internet Archive | 1

It seems that a large portion of these 50+ character date strings are present in the Government Printing Office records.

Date Patterns

Another way of looking at dates that I experimented with for this project was to convert each date string into what I’m calling a “date pattern”. For this I take an input string, say “1940-03-22”, and map it to 0000-00-00: I convert all digits to zero, convert all letters to the letter a, and leave all non-alphanumeric characters as they are.

Below is the function that I use for this.

def get_date_pattern(date_string):
    if date_string is None:
        return None
    pattern = []
    for c in date_string:
        if c.isalpha():
            pattern.append("a")
        elif c.isdigit():
            pattern.append("0")
        else:
            pattern.append(c)
    return "".join(pattern)

By applying this function to all of the date strings in the dataset, I’m able to look at the overall date patterns (and features) used throughout the dataset while ignoring the specific values.
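
Tallying the patterns across the dataset can be sketched as follows; the function is repeated here for completeness, and the sample values are hypothetical stand-ins for the full DPLA date field:

```python
from collections import Counter

def get_date_pattern(date_string):
    # Same mapping as above: digits -> "0", letters -> "a",
    # everything else kept as-is.
    if date_string is None:
        return None
    pattern = []
    for c in date_string:
        if c.isalpha():
            pattern.append("a")
        elif c.isdigit():
            pattern.append("0")
        else:
            pattern.append(c)
    return "".join(pattern)

# Hypothetical sample values standing in for the full dataset
dates = ["2004", "2004-10-23", "2005-2006", "03/04/2006", "1999", "Jan 2000"]
pattern_counts = Counter(get_date_pattern(d) for d in dates)
```

`pattern_counts.most_common(10)` then yields tables like the ones below.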

There are a total of 74 different date patterns among the date strings that are valid EDTF. For the date strings that are not valid EDTF, there are a total of 13,643 distinct date patterns. I’ve pulled the top ten date patterns for both valid and non-valid EDTF date strings and presented them below.

Valid EDTF Date Patterns

Valid EDTF Date Pattern | Instances | Example
0000 | 2,114,166 | 2004
0000-00-00 | 1,062,935 | 2004-10-23
0000-00 | 107,560 | 2004-10
0000/0000 | 55,965 | 2004/2010
0000? | 13,727 | 2004?
[0000-00-00..0000-00-00] | 4,434 | [2000-02-03..2001-03-04]
0000-00/0000-00 | 4,181 | 2004-10/2004-12
0000~ | 3,794 | 2003~
0000-00-00/0000-00-00 | 3,666 | 2003-04-03/2003-04-05
[0000..0000] | 3,009 | [1922..2000]

You can see that the basic date formats yyyy, yyyy-mm-dd, and yyyy-mm are very popular in the dataset. Following those, intervals in the format yyyy/yyyy and uncertain dates in the form yyyy? are common.

Non-Valid EDTF Date Patterns

Non-Valid EDTF Date Pattern | Instances | Example
0000-0000 | 1,117,718 | 2005-2006
00/00/0000 | 486,485 | 03/04/2006
[0000] | 196,968 | [2006]
[aaaa aaaaaaaaaaa] | 183,825 | [Date Unavailable]
00 aaa 0000 | 143,423 | 22 Jan 2006
0000 – 0000 | 134,408 | 2000 – 2005
0000-aaa-00 | 116,026 | 2003-Dec-23
0 aaa 0000 | 62,950 | 3 Jan 2000
0000] | 58,459 | 1933]
aaa 0000 | 43,676 | Jan 2000

Many of the date strings represented by these patterns could be “cleaned up” by simple transforms if that was of interest. I would imagine that converting 0000-0000 to 0000/0000 would be a fairly lossless transform that would suddenly make over a million items valid EDTF. Converting the format 00/00/0000 to 0000-00-00 is also a straightforward transform, provided you know whether 00-00 is mm-dd (US) or dd-mm (non-US). Removing the brackets around four digit years [0000] seems to be another easy fix that would convert a large number of dates. Of the top ten non-valid EDTF date patterns, it might be possible to convert nine of them with simple transformations into valid EDTF date strings. This would give the DPLA 2,360,113 additional dates that are valid EDTF date strings. The values for the date pattern [aaaa aaaaaaaaaaa], with a date string value of [Date Unavailable], might be best removed from the dataset altogether in order to reduce some of the noise in the field.
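
Three of those transforms could be sketched like this; note that this is my own illustrative sketch of the idea, not a DPLA cleanup routine, and it assumes the 00/00/0000 values are US-style mm/dd/yyyy:

```python
import re

def normalize_date(date_string):
    """Rewrite a few common non-valid patterns into valid EDTF."""
    # 0000-0000 -> 0000/0000 (hyphenated range to EDTF interval)
    m = re.fullmatch(r"(\d{4})-(\d{4})", date_string)
    if m:
        return "{}/{}".format(m.group(1), m.group(2))
    # 00/00/0000 -> 0000-00-00, assuming US-style mm/dd/yyyy input
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", date_string)
    if m:
        return "{}-{}-{}".format(m.group(3), m.group(1), m.group(2))
    # [0000] -> 0000 (strip brackets from a bare four digit year)
    m = re.fullmatch(r"\[(\d{4})\]", date_string)
    if m:
        return m.group(1)
    return date_string  # leave anything else untouched
```

Using `re.fullmatch` keeps the rules conservative: only strings that consist entirely of one of the three patterns are rewritten, so anything more complicated passes through unchanged.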

Common Patterns Per Hub

One last thing that I wanted to do was to see if there are any commonalities between the Hubs when you look at their most frequently used date patterns. Below I’ve created tables for both valid EDTF date patterns and non-valid EDTF date patterns.

Valid EDTF Patterns

Hub Name Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5
ARTstor 0000 0000-00 0000? 0000/0000 0000-00-00
Biodiversity Heritage Library 0000 -0000 0000/0000 0000-00 0000?
David Rumsey 0000
Digital Commonwealth 0000-00-00 0000-00 0000 0000-00-00a00:00:00a
Digital Library of Georgia 0000-00-00 0000-00 0000/0000 0000 0000-00-00/0000-00-00
Harvard Library 0000 00aa 000a aaaa
HathiTrust 0000 0000-00 0000? -0000 00aa
Internet Archive 0000 0000-00-00 0000-00 0000? 0000/0000
J. Paul Getty Trust 0000 0000?
Kentucky Digital Library 0000
Minnesota Digital Library 0000 0000-00-00 0000? 0000-00 0000-00-00?
Missouri Hub 0000-00-00 0000 0000-00 0000/0000 0000?
Mountain West Digital Library 0000-00-00 0000 0000-00 0000? 0000-00-00a00:00:00a
National Archives and Records Administration 0000 0000?
North Carolina Digital Heritage Center 0000-00-00 0000 0000-00 0000/0000 0000?
Smithsonian Institution 0000 0000? 0000-00-00 0000-00 00aa
South Carolina Digital Library 0000-00-00 0000 0000-00 0000?
The New York Public Library 0000-00-00 0000-00 0000 -0000 0000-00-00/0000-00-00
The Portal to Texas History 0000-00-00 0000 0000-00 [0000-00-00..0000-00-00] 0000~
United States Government Printing Office (GPO) 0000 0000? aaaa -0000 [0000, 0000]
University of Illinois at Urbana-Champaign 0000 0000-00-00 0000? 0000-00
University of Southern California. Libraries 0000-00-00 0000/0000 0000 0000-00 0000-00/0000-00
University of Virginia Library 0000-00-00 0000 0000-00 0000? 0000?-00

I tried to color code the five most common EDTF date patterns from above in the following image.

Color-coded date patterns per Hub.

I’m not sure whether that makes it clear where the common date patterns fall.

Non Valid EDTF Patterns

Hub Name Pattern 1 Pattern 2 Pattern 3 Pattern 4 Pattern 5
ARTstor 0000-0000 aa. 0000 aaaaaaa 0000a aa. 0000-0000
Biodiversity Heritage Library 0000-0000 0000 – 0000 0000- 0000-00 [0000-0000]
David Rumsey
Digital Commonwealth 0000-0000 aaaaaaa 0000-00-00-0000-00-00 0000-00-0000-00 0000-0-00
Digital Library of Georgia 0000-0000 0000-00-00 0000-00- 00 aaaaa 0000 0000a
Harvard Library 0000a-0000a a. 0000 0000a 0000-0000 0000 – a. 0000
HathiTrust [0000] 0000-0000 0000] [a0000] a0000
Internet Archive 0000-0000 0000-00 0000- [0—] [0000]
J. Paul Getty Trust 0000-0000 a. 0000-0000 a. 0000 [000-] [aa. 0000]
Kentucky Digital Library
Minnesota Digital Library 0000 – 0000 0000-00 – 0000-00 0000-0000 0000-00-00 – 0000-00-00 0000 – 0000?
Missouri Hub a0000 0000-00-00 aaaaaaaa 00, 0000 aaaaaaa 00, 0000 aaaaaaaa 0, 0000
Mountain West Digital Library 0000-0000 aa. 0000-0000 aa. 0000 0000? – 0000? 0000 aa
National Archives and Records Administration 00/00/0000 00/0000 a'aa. 0000'-a'aa. 0000' a'00/0000'-a'00/0000' a'00/00/0000'-a'00/00/0000'
North Carolina Digital Heritage Center 0000-0000 00000000 00000000-00000000 aa. 0000-0000 aa. 0000
Smithsonian Institution 0000-0000 00 aaa 0000 0000-aaa-00 0 aaa 0000 aaa 0000
South Carolina Digital Library 0000-0000 0000 – 0000 0000- 0000-00-00 0000-0-00
The New York Public Library 0000-0000 [aaaa aaaaaaaaaaa] 0000 – 0000 0000-00-00 – 0000-00-00 0000-
The Portal to Texas History a. 0000 [0000] 0000 – 0000 [aaaaaaa 0000 aaa 0000] a.0000 – 0000
United States Government Printing Office (GPO) [0000] 0000-0000 [0000?] aaaaa aaaa 0000 00aa-0000
University of Illinois at Urbana-Champaign 0-00-00 a. 0000 00/00/00 0-0-00 00-00-00
University of Southern California. Libraries 0000-0000 aaaaa 0000/0000 aaaaa 0000-00-00/0000-00-00 0000a aaaaa 0000-0000
University of Virginia Library aaaaaaa aaaa a0000 aaaaaaa 0000 aaa 0000? aaaaaaa 0000 aaa 0000 00–?

With the non-valid EDTF date patterns you can see that some date patterns are much more common across the various Hubs than others.

I hope you have found these posts interesting. If you’ve worked with metadata, especially aggregated metadata, you will no doubt recognize much of this from your own datasets. If you are new to this area, or haven’t worked with the wide range of date values you can come in contact with in large metadata collections, have no fear: it is getting better. The EDTF is a very good specification for cultural heritage institutions to adopt for their digital collections. It provides both a machine- and human-readable format for encoding and notating the complex dates we have to work with in our field.

If there is another field that you would like me to take a look at in the DPLA dataset,  please let me know.

As always feel free to contact me via Twitter if you have questions or comments.


Special Issue on Diversity in Library Technology Guest Editorial Committee / Code4Lib Journal

The guest editorial committee for Code4Lib Journal’s Special Issue on Diversity in Library Technology (issue 28) was developed in order to include new voices and perspectives on the journal’s practices and how they support inclusivity. The committee is composed of eight guest editors and two regular editorial committee members. More information on the development of […]

Feminism and the Future of Library Discovery / Code4Lib Journal

This paper discusses the various ways in which the practices of libraries and librarians influence the diversity (or lack thereof) of scholarship and information access. We examine some of the cultural biases inherent in both library classification systems and newer forms of information access like Google search algorithms, and propose ways of recognizing bias and applying feminist principles in the design of information services for scholars, particularly as libraries re-invent themselves to grapple with digital collections.

How to Hack it as a Working Parent / Code4Lib Journal

The problems faced by working parents in technical fields in libraries are not unique or particularly unusual. However, the cross-section of work-life balance and gender disparity problems found in academia and technology can be particularly troublesome, especially for mothers and single parents. Attracting and retaining diverse talent in work environments that are highly structured or with high expectations of unstated off-the-clock work may be impossible long term. (Indeed, it is not only parents that experience these work-life balance problems but anyone with caregiver responsibilities such as elder or disabled care.) Those who have the energy and time to devote to technical projects for work and fun in their off-work hours tend to get ahead. Those tied up with other responsibilities or who enjoy non-technical hobbies do not get the same respect or opportunities for advancement. Such problems mirror the experiences of women on the tenure track in academia, particularly women working in libraries, and they provide a useful corollary for this discussion. We present some practical solutions for those in technical positions in libraries. Such solutions involve strategic use of technical tools, and lightweight project management applications. Technical workarounds are not the only answer; real and lasting change will involve a change in individual priorities and departmental culture such as sophisticated and ruthless time management, reviewing workloads, cross-training personnel, hiring contract replacements, and creative divisions of labor. Ultimately, a flexible environment that reflects the needs of parents will help create a better workplace culture for everyone, kids or no kids.

But Then You Have to Make It Happen / Code4Lib Journal

Librarianship as a profession has a strong commitment to diversity and tends to attract professionals ethically inclined to champion inclusion. The authors, both from historically underrepresented populations in library information technology, have a half-century of combined experience in the field and have held positions ranging from technician, systems librarian, instructional technologist, head of circulation, and digital scholarship and services librarian to associate dean in an academic library. The authors share their experiences and discuss how diversity and inclusion must be embraced at the individual level in order to develop a culture of diversity within an organization and to attract and retain diverse technology teams. Internal commitments to supporting a diverse environment are ultimately critical to recognizing, assessing, and fulfilling the needs of patrons. The authors identify and detail individual and grassroots efforts that have led to library technology programming for underserved populations, including programs involving outreach to diverse student and prospective student communities over the course of their careers. They reflect on strategies to create and retain a diverse technology group within the library and to advance and support diversity within the day-to-day work environment. They posit that a mix of experiences is necessary to advocate for access to underrepresented patron populations and to negotiate and implement a truly diverse environment with regard to ethnicity, gender, age, and socioeconomic background.

Code as Code: Speculations on Diversity, Inequity, and Digital Women / Code4Lib Journal

All technologies are social. Taking this socio-technological position becomes less a political stance than a necessity when considering the lived experience of digital inequity, divides, and -isms as they are encountered in everyday library work spheres. Personal experience as women and women of color in our respective technological and leadership communities provides both fore- and background to explore the private-public lines delineating definitions of “diversity”, “inequity”, and digital literacies in library practice. We suggest that by not probing these definitions at the most personal level of lived experience, we in the LIS and technology professions will remain well-intentioned, but ineffective, in genuine inclusion.

User Experience is a Social Justice Issue / Code4Lib Journal

When we're building services for people, we often have a lot more practice seeing from the computer's point of view than seeing from another person's point of view. The author asks the library technology community to consider several case studies in this problem, including their root causes, and the negative impact of this problem on achieving our mission as library technologists. The author then recommends specific actions that we, as individual contributors and organizations, can take to increase our empathy and improve the user experience we provide to patrons.

Recognizing Cultural Diversity in Library Interface Development / Code4Lib Journal

The rapid increase in complex library digital infrastructures has enabled a more full-featured set of resources to become accessible by autonomous users, whether onsite or remote. However, this trend also necessitates careful consideration of the usability of new interfaces for populations with increasing cultural, geographic, and socioeconomic diversity. Researcher Aaron Marcus has become an authority on how cultural principles affect interface perceptions and inform their development. This article will explore Marcus’ work to contextualize diversity issues within usability before exploring the redevelopment strategy for the New York University Libraries’ web presence, which serves a broad and global set of users.