Planet Code4Lib

BC open textbook Accessibility Toolkit: generosity as a process / Tara Robertson


Last week we published The BC Open Textbook Accessibility Toolkit. I’m really excited and proud of the work that we did and am moved by how generous people have been with us.

Since last fall I’ve been working with Amanda Coolidge (BCcampus) and Sue Doner (Camosun College) to figure out how to make the open textbooks produced in BC accessible from the start.  This toolkit was published using Pressbooks, a publishing plugin for WordPress. It is licensed with the same Creative Commons license as the rest of the open textbooks (CC-BY). This whole project has been a fantastic learning experience and it’s been a complete joy to experience so much generosity from other colleagues.

We worked with students with print disabilities to user test some existing open textbooks for accessibility. I rarely get to work face-to-face with students. It was such a pleasure to work with this group of well-prepared, generous and hardworking students.

Initially we were stumped about how to get faculty, who would be writing open textbooks, to care about print disabled students who may be using their books. Serendipitously, I came across this awesome excerpt from Sarah Horton and Whitney Quesenbery’s book A Web for Everyone. User personas seemed like the way to explain some of the different types of user groups. A blind student is likely using different software, and possibly different hardware, than a student with a learning disability. Personas seemed like a useful tool to create empathy and explain why faculty should write alt text descriptions for their images.

Instead of rethinking these from the beginning Amanda suggested contacting them to see if their work was licensed under a Creative Commons license that would allow us to reuse and remix their work. They emailed me back in 5 minutes and gave their permission for us to reuse and repurpose their work. They also gave us permission to use the illustrations that Tom Biby did for their book. These illustrations are up on Flickr and clearly licensed with a CC-BY license.

While I’ve worked on open source software projects this is the first time I worked on an open content project. It is deeply satisfying for me when people share their work and encourage others to build upon it. Not only did this save us time but their generosity and enthusiasm gave us a boost. We were complete novices: none of us had done any user testing before. Sarah and Whitney’s quick responses were really encouraging.

This is the first version and we intend to improve it. We already know that we’d like to add some screenshots of ZoomText and we need to provide better information on how to make formulas and equations accessible. It’s difficult for me to put work out that’s not 100% perfect and complete but other people’s generosity has helped me to relax.

I let our alternate format partners across Canada know about this toolkit. Within 24 hours of publishing this our partner organization in Ontario offered to translate it into French. They had also started working on a similar project and loved our approach. So instead of writing their own toolkit they will use or adapt ours. As it’s licensed under a CC-BY license they didn’t even need to ask us to use it or translate it.

Thank you to Mary Burgess at BCcampus who identified accessibility as a priority for the BC open textbook project.

Thank you to Bob Minnery at AERO for the offer of a French translation.

Thank you to Sarah Horton and Whitney Quesenbery for your generosity and enthusiasm. I really feel like we got to stand on the shoulders of giants.

Thank you to the students who we worked with. This was an awesome collaboration.

Thank you to Amanda Coolidge and Sue Doner for being such amazing collaborators. I love how we get stuff done together.

February Library Tech Roundup / LITA

Image courtesy of Flickr user paloetic (CC-BY)


We’re debuting a new series this month: a roundup inspired by our friends at Hack Library School! Each month, the LITA bloggers will share selected library tech links, resources, and ideas that resonated with us. Enjoy – and don’t hesitate to tell us what piqued your interest recently in the comments section!

Brianna M.

Get excited: This month I discovered some excellent writing related to research data management.

Bryan B.

The lion’s share of my work revolves around our digital library system, and lately I’ve been waxing philosophical about what role these systems play in our culture. I don’t have a concrete answer yet, but I’m getting there.

John K.

I’m just unburying myself from a major public computer revamp (new PCs, new printers, new reservation/printing system, mobile printing, etc.) so here are a few things I’ve found interesting:

Lauren H.

This month my life is starting to revolve around online learning.  Here’s what I’ve been reading:

Leanne O.

I’ve been immersed in metadata and cataloguing, so here’s a grab bag of what’s intrigued me lately:

Lindsay C.

Hey, LITA Blog readers. Are you managing multiple projects? Have you run out of Post-it (R) notes? Are the to-do lists not cutting it anymore? Me too. The struggle is real. Here is a set of totally unrelated links to distract all of us from the very pressing tasks at hand. I mean inspire us to finish the work.

DPLAfest 2015 agenda now available / DPLA

The agenda for DPLAfest 2015 is now available! Featuring dozens of sessions over two days, DPLAfest 2015 will bring together hundreds from across the cultural heritage sector to discuss everything from technology and metadata, to (e)books, law, genealogy, and education. The events will take place on April 17-18, 2015 in Indianapolis, Indiana.

The second iteration of the fest–set to coincide with DPLA’s 2nd birthday–will appeal to teachers and students, librarians, archivists, museum professionals, developers and technologists, publishers and authors, genealogists, and members of the public alike who are interested in an engaging mix of interactive workshops, hands-on activities, hackathons, discussions with community leaders and practitioners, and more.

For DLF member organizations that are interested in attending DPLAfest 2015 but are in need of travel support, please note that today (March 5) is the final day to apply for a DPLA + DLF Cross-Pollinator Travel Grant.

See you in Indy!

View the full DPLAfest agenda and get updates about the fest.

Archiving Storage Tiers / David Rosenthal

Tom Coughlin uses Hetzler's touch-rate metric to argue for tiered storage for archives in a two-part series. Although there's good stuff there, I have two problems with Tom's argument. Below the fold, I discuss them.

First, Tom's just wrong about Facebook's optical storage when he writes:
"Finally let’s look at why a company like Facebook is interested in optical archives. The figure below shows the touch rate vs. response time for an optical storage system with a goal of <60 seconds response time, which can be met at a range of block sizes with 12 optical drives per 1 PB rack in an optical disc robotic library."
The reason Facebook gets very low cost by using optical technology is, as I wrote here, that they carefully schedule the activities of the storage system to place a hard cap on the maximum power draw, and to provide maximum write bandwidth. They don't have a goal of <60s random read latency. Their goals are minimum cost and maximum write bandwidth. The design of their system assumes that reads almost never happen, because they disrupt the write bandwidth. As I understand it, reads have to wait while a set of 12 disks is completely written. Then all 12 disks of the relevant group are loaded, read and the data staged back to the hard disk layers above the optical storage. Then a fresh set of 12 disks is loaded and writing resumes.

Facebook's optical read latency is vastly longer than 60s. The system Tom is analysing is a hypothetical system that wouldn't work nearly as well as Facebook's given their design goals. And the economics of such a system would be much worse than Facebook's.

Second, it is true that Facebook gains massive advantages from their multi-tiered long-term storage architecture, which has a hot layer, a warm layer, a hard-disk cold layer and a really cold optical layer. But you have to look at why they get these advantages before arguing that archives in general can benefit from tiering. Coughlin writes:
"Archiving can have tiers. ... In tiering content stays on storage technologies that trade off the needs (and opportunities) for higher performance with the lower costs for higher latency and lower data rate storage. The highest value and most frequently accessed content is kept on higher performance and more expensive storage and the least valuable or less frequently accessed content is kept on lower performance and less expensive storage."
Facebook stores vast amounts of data, but a very limited set of different types of data, and their users (who are not archival users) read those limited types of data in highly predictable ways. Facebook can therefore move specific types of data rapidly to lower-performing tiers without imposing significant user-visible access latency.

More normal archives, and especially those with real archival users, do not have such highly predictable access patterns and will therefore gain much less benefit from tiering. More typical access patterns to archival data can be found in the paper at the recent FAST conference describing the two-tier (disk+tape) archive at the European Center for Medium-Range Weather Forecasting. Note that these patterns come from before the enthusiasm for "big data" drove a need to data-mine from archived information, which will reduce the benefit from tiering even more significantly.

Fundamentally, tiering like most storage architectures suffers from the idea that in order to do anything with data you need to move it from the storage medium to some compute engine. Thus an obsession with I/O bandwidth rather than what the application really wants, which is query processing rate. By moving computation to the data on the storage medium, rather than moving data to the computation, architectures like DAWN and Seagate's and WD's Ethernet-connected hard disks show how to avoid the need to tier and thus the need to be right in your predictions about how users will access the data.

Factors to prioritize (IT?) projects in an academic library / Jonathan Rochkind

  • Most important: Impact vs. Cost
    • Impact is how many (what portion) of your patrons will be affected, and how profound the benefit may be to their research, teaching, learning.
    • Cost may include hardware or software costs, but for most projects we do, the primary cost is staff time.
    • You are looking for the projects with the greatest impact at the lowest cost.
    • If you want to try and quantify, it may be useful to simply estimate three qualities:
      • Portion of userbase impacted (1-10 for 10% to 100% of userbase impacted)
      • Profundity of impact (estimate on a simple scale, say 1 to 3 with 3 being the highest)
      • “Cost” in terms of time. Estimate with only rough granularity knowing estimates are not accurate. 2 weeks, 2 months, 6 months, 1 year. Maybe assign those on a scale from 1-4.
      • You could then simply compute (portion * profundity) / cost, and look for the largest values. Or you could plot on a graph with (benefit = portion * profundity) on the x-axis, and cost on the y-axis. You are looking for projects near the lower right of the graph — high benefit, low cost.
  • Demographics impacted. Will the impact be evenly distributed, or will it be greater for certain demographics? Discipline/school/department? Researcher vs grad student vs undergrad?
    • Are there particular demographics which should be prioritized, because they are currently under-served or because focusing on them aligns with strategic priorities?
  • Types of services or materials addressed. Print items vs digital items? Books vs journal articles? Other categories? Again, are there service areas that have been neglected and need to be brought to par? Or service areas that are strategic priorities, and others that will be intentionally neglected?
  • Strategic plans. Are there existing library or university strategic plans? Will some projects address specific identified strategic focuses? These plans can also be used to determine prioritized demographics or service areas from above.
    • Ideally all of this is informed by strategic vision, where the library organization wants to be in X years, and what steps will get you there. And ideally that vision is already captured in a strategic plan. Few libraries may have this luxury of a clear strategic vision, however.
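The impact-versus-cost scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Project` class, the `rank_projects` helper, and the three sample projects with their numbers are all hypothetical, invented to show the arithmetic of (portion * profundity) / cost.

```python
from dataclasses import dataclass


@dataclass
class Project:
    """A candidate project scored on the rough scales from the list above."""
    name: str
    portion: int      # 1-10, i.e. 10% to 100% of the userbase impacted
    profundity: int   # 1-3, with 3 the most profound impact
    cost: int         # 1-4, e.g. 2 weeks, 2 months, 6 months, 1 year

    @property
    def benefit(self) -> int:
        # benefit = portion * profundity (the x-axis of the suggested plot)
        return self.portion * self.profundity

    @property
    def score(self) -> float:
        # (portion * profundity) / cost; larger is better
        return self.benefit / self.cost


def rank_projects(projects):
    """Return projects ordered from highest to lowest score."""
    return sorted(projects, key=lambda p: p.score, reverse=True)


# Hypothetical sample data, purely for illustration.
projects = [
    Project("Fix link resolver errors", portion=8, profundity=2, cost=1),
    Project("Build new digital repository", portion=3, profundity=3, cost=4),
    Project("Redesign library hours page", portion=9, profundity=1, cost=2),
]

ranked = rank_projects(projects)
for p in ranked:
    print(f"{p.name}: benefit={p.benefit}, cost={p.cost}, score={p.score:.2f}")
```

The numbers are deliberately coarse, matching the advice to estimate with rough granularity; the ranking is a conversation starter for the demographic and strategic questions, not a replacement for them.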


Hydra Europe Digital Repository Events, Spring 2015 / DuraSpace News

From Chris Awre, on behalf of the Hydra Europe Planning Team

London, UK  The Hydra Project is pleased to announce two Hydra Europe events for 2015, taking place this coming April, at LSE Library, London.

Intellectual Freedom Beyond Books / Tara Robertson

I was invited to speak on a panel with three other speakers:  Christopher Kevlahan, Branch Head, Joe Fortes – Vancouver Public Library,  Miriam Moses, Acquisitions Manager, Burnaby Public Library, and Greg Mackie, Assistant Professor, UBC Department of English.

I think that libraries do a great job of promoting Freedom to Read Week with events and book displays, but could be doing a better job in advocating for intellectual freedom in the digital realm.

Public library examples

I spoke about how Fraser Valley Regional Library filters all their internet, how Vancouver Public Library changed their internet use policy to single out “sexually explicit images”, and how most public library internet policies don’t appear to have been updated since the 90s.

Bibliocommons is a product with a beautiful and well-designed interface that is used by a lot of public libraries to sit over their public-facing catalogues. It is a huge improvement over the traditional OPAC interface, and I like that there’s a small social component, with user tagging and comments, as well. However, Bibliocommons allows patrons to flag content for: Coarse Language, Violence, Sexual Content, Frightening or Intense Scenes, or Other. This functionality that allows users to flag titles for sexual content or coarse language is not in line with our core value of intellectual freedom.

Devon Greyson, a local health librarian-researcher and PhD candidate said on BCLA’s Intellectual Freedom Committee’s email list:

Perhaps the issue is a difference in the understanding of what is “viewpoint neutral.” From an IF standpoint, suggesting categories of concern is non-neutral. Deciding that sex, violence, scary and rude are the primary reasons one should/would be setting a notice to warn other users is non-neutral. Why not racism, sexism, homophobia & classism as the categories with sex, violence & swearing considered “other”?

Academic library example

I also talked about the Feminist Porn Archive, a SSHRC-funded research project at York University. Before the panel I chatted with Lisa Sloniowski, who was really generous in sharing some of the hypothetical issues that she imagines the project might encounter. She wondered if campus IT, the university’s legal department or university administration might be more conservative than the library. What would happen if they digitized porn and hosted it on university servers? Would they need to have a login screen in front of their project website?

This session was recorded and I’d love to hear your thoughts. How can libraries support or defend intellectual freedom online?

Bookmarks for March 4, 2015 / Nicole Engard

Today I found the following resources and bookmarked them:

    An open source platform for working with collections of texts. It enables students, researchers and teachers to share and collaborate around texts using a simple and intuitive interface.


The post Bookmarks for March 4, 2015 appeared first on What I Learned Today....

Free webinar: IRS officials to discuss library tax form program / District Dispatch

Photo by AgriLifeToday via Flickr


Want to comment on the Internal Revenue Service’s (IRS) tax form delivery service? Discuss your experiences obtaining tax forms for your library during “Talk with the IRS about Tax Forms,” a no-cost webinar that will be hosted by the American Library Association (ALA).

The session will be held from 2:30–3:30 p.m. Eastern on Tuesday, March 10, 2015. To register, send an email to Emily Sheketoff, executive director of the ALA Washington Office, at esheketoff[at]alawash[dot]org. Register now as space is limited.

Leaders from the IRS’ Tax Forms Outlet Program (TFOP) will lead the webinar. The TFOP offers tax forms and products to the American public primarily through participating libraries and post offices. Carol Quiller, the newly appointed TFOP relationship manager, will join Dietra Grant, director of the agency’s Stakeholder, Partnership, Education and Communication office, in answering questions during the webinar from the library community about tax forms and instructional pamphlet distribution.

The webinar will not be archived.

Webinar Details
Date: Tuesday, March 10, 2015
Time: 2:30–3:30 p.m. Eastern
Register now: Email: esheketoff[at]alawash[dot]org

The post Free webinar: IRS officials to discuss library tax form program appeared first on District Dispatch.

Evergreen 2.8-beta released / Evergreen ILS

The beta release of Evergreen 2.8 is available to download and test!

New features and enhancements of note in Evergreen 2.8 include:

  • Acquisitions improvements to help prevent the creation of duplicate orders and duplicate purchase order names.
  • In the select list and PO view interfaces, beside the line item ID, the number of catalog copies already owned is now displayed.
  • A new Apache access handler that allows resources on an Evergreen web server, or which are proxied via an Evergreen web server, to be authenticated using the user’s Evergreen credentials.
  • Copy locations can now be marked as deleted. This allows information about disused copy locations to be retained for reporting purposes without cluttering up location selection drop-downs.
  • Support for matching authority records during MARC import. Matches can be made against MARC tag/subfield entries and against a record’s normalized heading and thesaurus.
  • Patron message center: a new mechanism via which messages can be sent to patrons for them to read while logged into the public catalog.
  • A new option to stop billing activity on zero-balance billed transactions, which will help reduce the incidence of patron accounts with negative balances.
  • New options to void lost item and long overdue billings if a loan is marked as claims returned.
  • The staff interface for placing holds now offers the ability to place additional holds on the same title.
  • The active date of a copy record is now displayed more clearly.
  • A number of enhancements have been made to the public catalog to better support discoverability by web search engines.
  • There is now a direct link to “My Lists” from the “My Account” area in the upper-right part of the public catalog.
  • There is a new option for TPAC to show more details by default.

For more information about what’s in the release, check out the draft release notes.

Note that the release was built yesterday before 2.7.4 existed, so the DB upgrade script applies to a 2.7.3 database. To apply to a 2.7.4 test database, remove updates 0908, 0913, and 0914 from the upgrade file, retaining the final commit. The final 2.8.0 DB upgrade script will be built from 2.7.4 instead.

LITA Webinar: Beyond Web Page Analytics / LITA

Or how to use Google tools to assess user behavior across web properties.

Tuesday March 31, 2015
11:00 am – 12:30 pm Central Time
Register now for this webinar

This brand new LITA Webinar shows how Marquette University Libraries have installed custom tracking code and meta tags on most of their web interfaces including:

  • Digital Commons
  • Ebsco EDS
  • ILLiad
  • LibCal
  • LibGuides
  • WebPac, and the
  • General Library Website

The data retrieved from these interfaces is gathered into Google’s

  • Universal Analytics
  • Tag Manager, and
  • Webmaster Tools

When used in combination these tools create an in-depth view of user behavior across all these web properties.

For example, Google Tag Manager can grab search terms which can be related to a specific collection within Universal Analytics and related to a particular demographic. The current versions of these tools make systems setup an easy process with little or no programming experience required. Making sense of the volume of data retrieved, however, is more difficult.

  • How does Google data compare to vendor stats?
  • How can the data be normalized using Tag Manager?
  • Can this data help your organization make better decisions?


Join

  • Ed Sanchez, Head, Library Information Technology, Marquette University Libraries
  • Rob Nunez, Emerging Technologies Librarian, Marquette University Libraries and
  • Keven Riggle, Systems Librarian & Webmaster, Marquette University Libraries

in this webinar as they explain their new processes and explore these questions. Check out their program outline:

Then register for the webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.

  • LITA Member: $39
  • Non-Member: $99
  • Group: $190

Registration Information

Register Online page arranged by session date (login required)
Mail or fax form to ALA Registration
Call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4269 or Mark Beatty,

digilib / FOSS4Lib Updated Packages

Last updated March 4, 2015. Created by Peter Murray on March 4, 2015.

  • digilib is a web based client/server technology for images. The image content is processed on-the-fly by a Java Servlet on the server side so that only the visible portion of the image is sent to the web browser on the client side.
  • digilib supports a wide range of image formats and viewing options on the server side while only requiring an internet browser with Javascript and a low bandwidth internet connection on the client side.
  • digilib enables very detailed work on an image as required by scholars with elaborate viewing features like an option to show images on the screen in their original size.
  • digilib facilitates cooperation of scholars over the internet and novel uses of source material by image annotations and stable references that can be embedded in URLs.
  • digilib facilitates federation of image servers through a standards compliant IIIF image API.
  • digilib is Open Source Software under the Lesser General Public License, jointly developed by the Max Planck Institute for the History of Science, the Bibliotheca Hertziana, the University of Bern and others.
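As a rough illustration of how a client might ask an IIIF-compliant image server for only the visible portion of an image, here is a minimal Python sketch that builds a request URL following the general IIIF Image API pattern of identifier/region/size/rotation/quality.format. The server base URL, the image identifier, and the `iiif_image_url` helper are hypothetical; this listing does not give digilib's actual endpoint path, so treat the details as assumptions rather than digilib's documented interface.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an image request URL in the general IIIF Image API shape.

    base       -- hypothetical service base URL (an assumption, not digilib's)
    identifier -- image identifier; percent-encoded so slashes stay one segment
    region     -- "full" or "x,y,w,h" in pixels of the full-size image
    size       -- e.g. "512," scales to 512 px wide; "max" means unscaled
                  (Image API 2.0 servers use "full" for this instead)
    """
    encoded_id = quote(identifier, safe="")
    return (f"{base.rstrip('/')}/{encoded_id}/"
            f"{region}/{size}/{rotation}/{quality}.{fmt}")


# Request only a 512x512 pixel tile starting at (1024, 2048),
# delivered 512 pixels wide. All values are illustrative.
url = iiif_image_url(
    "https://images.example.org/iiif",   # hypothetical server
    "ms-123/page 1",                     # hypothetical identifier
    region="1024,2048,512,512",
    size="512,",
)
print(url)
```

Because the region and size travel in the URL itself, a particular view of an image can be referenced by a stable link, which is the property the list above highlights.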

Mirador / FOSS4Lib Updated Packages

Last updated March 4, 2015. Created by Peter Murray on March 4, 2015.

An open-source, web-based 'multi-up' viewer that supports zoom-pan-rotate functionality, ability to display/compare simple images, and images with annotations.


IIPMooViewer / FOSS4Lib Updated Packages

Last updated March 5, 2015. Created by Peter Murray on March 4, 2015.

IIPMooViewer is a high-performance, lightweight, HTML5 Ajax-based JavaScript image streaming and zooming client designed for the IIPImage high resolution imaging system. It is compatible with Firefox, Chrome, Internet Explorer (Versions 6-10), Safari and Opera as well as mobile touch-based browsers for iOS and Android. Although designed for use with the IIP protocol and IIPImage, it has multi-protocol support and is additionally compatible with the Zoomify, Deepzoom, Djatoka (OpenURL) and IIIF protocols.

Version 2.0 of IIPMooViewer is HTML5/CSS3 based and uses the MooTools JavaScript framework (version 1.5+).


Sharing Images of Global Cultural Heritage / FOSS4Lib Upcoming Events

Tuesday, May 5, 2015 - 08:30 to 17:00

Last updated March 4, 2015. Created by Peter Murray on March 4, 2015.

The International Image Interoperability Framework community is hosting a one-day information-sharing event about the use of images in and across Cultural Heritage institutions. The day will focus on how museums, galleries, libraries and archives, or any online image service, can take advantage of a powerful technical framework for interoperability between image repositories.

Hydra Camp London / FOSS4Lib Upcoming Events

Monday, April 20, 2015 - 08:00 to Thursday, April 23, 2015 - 13:00

Last updated March 4, 2015. Created by Peter Murray on March 4, 2015.

Hydra Camp London - a training event enabling technical staff to learn about the Hydra technology stack so they can establish their own implementation

Monday 20th April - lunchtime Thursday 23rd April 2015

Hydra Europe Symposium / FOSS4Lib Upcoming Events

Thursday, April 23, 2015 - 10:30 to Friday, April 24, 2015 - 15:30

Last updated March 4, 2015. Created by Peter Murray on March 4, 2015.

Hydra Europe Symposium - an event for digital collection managers, collection owners and their software developers that will provide insights into how Hydra can serve your needs

Thursday 23rd April - Friday 24th April 2015

This event is free of charge. Lunch and refreshments will be provided on both days

Agile Development: Estimation and Scheduling / LITA

Image courtesy of Wikipedia


In my last post, I discussed the creation of Agile user stories. This time I’m going to talk about what to do with them once you have them. There are two big steps that need to be completed in order to move from user story creation to development: effort estimation and prioritization. Each poses its own problems.

Estimating Effort

Because Agile development relies on flexibility and adaptation, creating a bottom-up effort estimation analysis is both difficult and impractical. You don’t want to spend valuable time analyzing a piece of functionality up front only to have the implementation details change because of something that happens earlier in the development process, be it a change in another story, customer feedback, etc. Instead, it’s better to rely on your development team’s expertise and come up with top-down estimates that are accurate enough to get the development process started. This may at times make you feel uncomfortable, as if you’re looking for groundwater with a stick (it’s called dowsing, by the way), but in reality it’s about doing the minimum work necessary to come up with a reasonably accurate projection.

Estimation methods vary, but the key is to discuss story size in relative terms rather than assigning a number of hours of development time. Some teams find a story that is easy to estimate and calibrate all other stories relative to it, using some sort of relative “story points” scale (powers of 2, the Fibonacci sequence, etc.). Others create a relative scale and tag each story with a value from it: this can be anything from vehicles (this story is a car, this one is an aircraft carrier, etc.), to t-shirt sizes, to anything that is intuitive to the team. Another method is planning poker: the team picks a set of sizing values, and each member of the team assigns one of those values to each story by holding up a card with the value on it; if there’s significant variation, the team discusses the estimates and comes up with a compromise.  What matters is not the method, but that the entire team participate in the estimation discussion for each story.

Learn more about Agile estimation here and here.

Prioritizing User Stories

The other piece of information we need in order to begin scheduling is the importance of each story, and for that we must turn to the business side of the organization. Prioritization in Agile is an ongoing process (as opposed to a one-time ranking) that allows the team to understand which user stories carry the biggest payoff at any point in the process. Once they are created, all user stories go into the product backlog, and each time the team plans a new sprint it picks stories off the top of the list until its capacity is exhausted, so it is very important that the Product Owner maintain a properly ordered backlog.

As with estimation, methods vary, but the key is to follow a process that evaluates each story on the value it adds to the product at any point. If I just rank the stories numerically, that does not provide any clarity as to why that is, which will be confusing to the team (and to me as well as the backlog grows). Most teams adopt a ranking system that scores each story individually; here’s a good example. This method uses two separate criteria: urgency and business value. Business value measures the positive impact of a given story on users. Urgency provides information about how important it is to complete a story earlier rather than later in the development process, taking into account dependencies between user stories, contractual obligations, complexity, etc. Basically, Business Value represents the importance of including a story in the finished product, and Urgency tells us how much it matters when that story is developed (understanding that a story’s likelihood of being completed decreases the later in the process it is slotted). Once the stories have been evaluated along the two axes (a simple 1-5 scale can be used for each) an overall priority number is obtained by multiplying the two values, which gives us the final priority score. The backlog is then ordered using this value.
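To make the scoring concrete, here is a minimal Python sketch of the two-axis ranking described above: each story gets a Business Value and an Urgency on a 1-5 scale, the two are multiplied, and the backlog is sorted by the result. The `Story` class, the `order_backlog` helper, and the sample stories are hypothetical and exist only to illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class Story:
    """A user story scored on the two axes, each on a 1-5 scale."""
    title: str
    business_value: int  # importance of including the story in the product
    urgency: int         # how much it matters that the story is done early

    def __post_init__(self):
        # Reject scores outside the simple 1-5 scale.
        for field_name in ("business_value", "urgency"):
            value = getattr(self, field_name)
            if not 1 <= value <= 5:
                raise ValueError(f"{field_name} must be 1-5, got {value}")

    @property
    def priority(self) -> int:
        # Overall priority score = business value * urgency
        return self.business_value * self.urgency


def order_backlog(stories):
    """Return stories sorted from highest to lowest priority score."""
    return sorted(stories, key=lambda s: s.priority, reverse=True)


# Hypothetical stories, for illustration only.
backlog = order_backlog([
    Story("Export reading list to CSV", business_value=3, urgency=2),
    Story("Renew loans online", business_value=5, urgency=5),
    Story("Search catalogue by ISBN", business_value=5, urgency=4),
])

for story in backlog:
    print(f"{story.priority:>2}  {story.title}")
```

Keeping the two inputs alongside the product preserves the "why" behind each position in the backlog, which a bare numeric ranking would lose.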

As the example in the link shows, a Product Owner can also create priority bands that describe stories at a high level: must-have, nice to have, won’t develop, etc. This provides context for the priority score and gives the team information about the PO’s expectations for each story.

I’ll be back next month to talk about building an Agile culture. In the meantime, what methods does your team use to estimate and prioritize user stories?

New research project to map the impact of open budget data / Open Knowledge Foundation

I’m pleased to announce a new research project to examine the impact of open budget data, undertaken as a collaboration between Open Knowledge and the Digital Methods Initiative at the University of Amsterdam, supported by the Global Initiative for Financial Transparency (GIFT).

The project will include an empirical mapping of who is active around open budget data around the world, and what the main issues, opportunities and challenges are according to different actors. On the basis of this mapping it will provide a review of the various definitions and conceptions of open budget data, arguments for why it matters, best practises for publication and engagement, as well as applications and outcomes in different countries around the world.

As well as drawing on Open Knowledge’s extensive experience and expertise around open budget data (through projects such as Open Spending), it will utilise innovative tools and methods developed at the University of Amsterdam to harness evidence from the web, social media and collections of documents to inform and enrich our analysis.

As part of this project we’re launching a collaborative bibliography of existing research and literature on open budget data and associated topics which we hope will become a useful resource for other organisations, advocates, policy-makers, and researchers working in this area. If you have suggestions for items to add, please do get in touch.

This project follows on from other research projects we’ve conducted around this area – including on data standards for fiscal transparency, on technology for transparent and accountable public finance, and on mapping the open spending community.

Financial transparency field network with the Issuecrawler tool based on hyperlink analysis starting from members of Financial Transparency Coalition, 12th January 2015. Open Knowledge and Digital Methods Initiative.

The Inter[mediate]face / LibUX

This post, The Battle Is For The Customer Interface by Tom Goodwin, captured my imagination. The fastest-growing companies in the world occupy the space between the product and the person. Uber doesn’t own any vehicles; Facebook doesn’t create any media; Airbnb doesn’t own any real estate. What they control is the interface.

They facilitate access — just like us.

The Library Interface

The trumped-up value of the library isn’t the dog-eared six-dollar paperbacks in its collection, nor can we squander the credit for the research behind the vendor paywall. Instead, its value continues to be what it has always been – as gatekeeper, the access point. A library is the intermediary touchpoint between the user and the content the user seeks.

We have talked before about how one of the most important features for a library website is that it stays out of the way; the most successful are — as Tom wrote — “thin layers that sit on top of vast supply systems.” In this way, libraries curate access points which are desirable to patrons because

  • they eliminate paywalls,
  • curate the signals from the noise,
  • and are delightful.

These are the core features of the library interface. Libraries absorb the community-wide cost to access information curated by knowledge-experts that help sift through the Googleable cruft. They provide access to a repository of physical items users want and don’t want to buy (books, tools, looms, 3d printers, machines). A library is, too, where community is accessed. In the provision of this access anywhere on the open web and through human proxies, the library creates delight.

The post The Inter[mediate]face appeared first on LibUX.

How to Create (and Keep Creating) a Digitization Workflow / Library Tech Talk (U of Michigan)

Workflow for Proposing and Producing Digital Projects: Overview

It’s possible we should have written this blog post years ago, when we first created our workflow for how we shepherd digitization projects through our Digital Library. Well, we were busy creating it, that’s our excuse. Three years later, we’re on our third iteration.

VIVO Strategic Plan Lays Foundation for 2015-2016 / DuraSpace News

Winchester, MA  During the past two and a half months, the VIVO Strategic Planning Group has developed a prioritized written strategy document for the VIVO project. The plan highlights key goals and recommendations that specifically focus on increasing the engagement of the VIVO community, hiring a full-time VIVO Technical Lead to make the open source development process more inclusive and transparent, and implementing a framework to increase productivity.

SECURITY RELEASES: Evergreen 2.7.4, 2.6.7, and 2.5.9 / Evergreen ILS

On behalf of the Evergreen contributors, the 2.7.x release maintainer (Ben Shum) and the 2.6.x and 2.5.x release maintainer (Dan Wells), we are pleased to announce the release of Evergreen 2.7.4, 2.6.7, and 2.5.9.

The new releases can be downloaded from:

THESE RELEASES CONTAIN SECURITY UPDATES, so you will want to upgrade as soon as possible.

In particular, the following security issues are fixed:

  • Bug 1424755: This bug allows unauthorized remote access to the value of certain library settings that are meant to be confidential.
  • Bug 1206589: This bug allows unauthorized remote access to the log of changes to library settings, including ones meant to be confidential.

All prior supported releases are vulnerable to these bugs.

All three of these new releases also contain bugfixes that are not related to the security issues. For more information on the changes in these releases, please consult their change logs:

Please note that 2.5.9 is the last release expected in the 2.5.x series.

It is recommended that all Evergreen sites upgrade to one of the new releases as soon as possible.

If you cannot do a full upgrade at this time, it is extremely important that you patch your Evergreen system to protect against these exploits. To that end, two patches are available, one for bug 1424755 and one for bug 1206589, that you can download and apply to a running system.

In order to secure your system, you must download the two patches and copy them to each of your Evergreen servers — in particular, any that run the and/or open-ils.pcrud services. You will need to perform the following steps on each server to completely patch your system.

First, you must find where the module is located. This is usually under /usr/local somewhere. The following command will find it for you:

find /usr/local -name

On an Ubuntu 12.04 system, the above prints out /usr/local/share/perl/5.14.2/OpenILS/Application/ so we will use that as our example. Just be sure that when you do this for real, you use the actual path printed by the above command. If it prints nothing, you will need to check other locations.

Once you have the path, you can run the patch command. Assuming that you are in the directory where you put the patch file, the following command should apply the patch:

sudo patch -b /usr/local/share/perl/5.14.2/OpenILS/Application/ lp1424755.patch

Unless you have made local edits to the affected file, the patch should apply cleanly.

Next, you will need to apply the patch for bug 1206589. This can be done as the opensrf user:

patch -b /openils/conf/fm_IDL.xml lp1206589.patch

After you have applied the patches, you will need to restart the and open-ils.pcrud services. You do this by running osrf_control with the appropriate options:

osrf_control [--localhost] --restart --service
osrf_control [--localhost] --restart --service open-ils.pcrud

The --localhost is in brackets because you may or may not need it. Your system administrator should know if you do or not. If you do need it, remove the brackets. If you don’t need it, then omit the option entirely.

Now Accepting WHCLIST Nominations! / District Dispatch

The White House Conference on Library and Information Services Taskforce (WHCLIST) and the ALA Washington Office are calling for nominations for the WHCLIST Award. Each year, the award is granted to a non-librarian participant in National Library Legislative Day (NLLD). The winner receives a stipend of $300 and two free nights at the Liaison hotel.

The Washington Monument with flowers in the foreground.

Photo by Poco a poco

WHCLIST has been an effective force in library advocacy nationally, statewide and locally since the White House Conferences on Library and Information Services in 1979 and 1991. WHCLIST has provided its assets to the ALA Washington Office to transmit its spirit of dedicated, passionate library support to a new generation of advocates. Both ALA and WHCLIST are committed to ensuring the American people get the best library services possible.

The criteria for the WHCLIST Award are:

  • The recipient should be a library supporter (trustee, friend, general supporter) and not a professional librarian.
  • The recipient should be a first-time attendee of NLLD.

Representatives of WHCLIST and the ALA Washington office will choose the recipient. The ALA Washington Office will contact the recipient’s senators and representatives to announce the award. The winner of the WHCLIST Award will be announced at NLLD. The deadline for applications is April 1, 2015.

To apply for the WHCLIST Award, please submit a completed NLLD registration form; a letter explaining why you should receive the award; and a letter of reference from a library director, school librarian, library board chair, Friends group chair, or other library representative to:

Lisa Lindle
Grassroots Communications Coordinator
American Library Association
1615 New Hampshire Ave., NW
First Floor
Washington, DC 20009
202-628-8419 (fax)

Note: Applicants must register for NLLD and pay all associated costs. Applicants must make their own travel arrangements. The winner will be reimbursed for two nights in the NLLD hotel in D.C. and receive the $300 stipend to defray the costs of attending the event.

The post Now Accepting WHCLIST Nominations! appeared first on District Dispatch.

IDCC15 / David Rosenthal

I wasn't able to attend IDCC2015 two weeks ago in London, but I've been catching up with the presentations on the Web. Below the fold, my thoughts on a few of them.

Tony Hey's opening keynote is an 84-slide tour through the last decade of e-science, which punctuates the normal optimistic gee-whizzery of e-science talks with some cautionary observations. Many of them are in the form of well-chosen quotes. Three that are particularly relevant are:
  • Michael Kurtz (ADS): "The problem with curation is that the funding is almost entirely local but in the digital world the use is mainly global. Leads to tragedy of the commons where no one will assume long-term obligation to curate and manage data which is mainly not from local sources."
  • James Frew (UCSB): "Frew’s first law: scientists don’t write metadata. Frew’s second law: any scientist can be forced to write bad metadata."
  • Michael Lesk: "Most of the cost of archiving is spent at the start, before we know whether the articles will be read or the data used. With data, with no emotional investment in peer review, it might be easier to do a simpler form of deposit, where as much as possible is postponed till the data are called for. There is of course some risk that a just-in-time system will leave us, some years down the road, with a data set which we wish we had curated while the creator was still alive. However, the longer the data has gone unused, the more likely it is to never be used."

My favorite presentations were from the British Library's Web archiving team. Helen Hockx-Yu's closing keynote was an overview of the first ten years of the program, including the start of non-print digital legal deposit. I've always liked the way the BL's repository strategy leveraged the distributed nature of the UK's print legal deposit system to implement Lots Of Copies (well, four, but that's way more than most).

Some of the BL's Web archive, the part for which they have website owner's permission, is freely available. The major part, including the 2013 and 2014 UK domain crawls, is available only on-site. Both feature faceted full-text search.

Andy Jackson's brief talk explained that although the BL is restricted by copyright from making most of its Web collections freely available, they can make, and have made (as they have always done), their metadata freely available as Open Data. He showed this example of the link data from 1996, and Helen's slides are full of many other interesting examples of the way archived web data can be analysed and used by scholars.

Ł. Bolikowski, A. Nowiński, and W. Sylwestrzak from the University of Warsaw presented another potential use for blockchain technology, to mint persistent identifiers. Although their proposal is technically feasible, their presentation does not address any of the many reasons I'm skeptical of the idea that blockchains are the Solution to Everything.

Matthew Addis gave a great marketing pitch for the use of the Arkivum service for research data management. Arkivum is the supplier until 2023 for the UK's Janet Data Archiving Framework. The service is interesting, and unusual, in that they accept liability for the data they preserve. I hope to find time to blog about this aspect of their service soon.

Board Governance Committee Open Call: March 11, 2015, 1:00 PM Eastern / DPLA

The DPLA Board of Directors’ Governance Committee will hold a conference call on Wednesday, March 11, 2015 at 1:00 PM Eastern. The call is open to the public.


Public session

  • Rethinking DPLA open committee calls
  • Questions/comments from the public

Executive session

  • Update and next steps for Board Nominating Committee


3D printing technologies in libraries: intellectual property right issues / District Dispatch

3D Printer

Photo by Subhashish Panigrahi

Join us for our next installment of CopyTalk, March 5th at 2pm Eastern Time. In the past, the use of photocopying, printing, scanning and related technologies in libraries raised copyright issues alone. A new technology is making its way into libraries: 3D printing now allows a patron to create (print) three-dimensional objects as well. Patrons can now “print” entire mechanical devices or components of other devices, from something as simple as a corkscrew to parts of a prosthetic body part. Objects of all sorts can be created in library maker spaces. These technologies raise not only copyright issues but also patent (including design patent) and trademark (including trade dress) issues. Learn about the legal issues involved, how the library can protect itself from liability when patrons use these technologies in library spaces, and how to raise awareness of such issues among patrons.


Professor Tomas Lipinski completed his Juris Doctor (J.D.) from Marquette University Law School, Milwaukee, Wisconsin, received the Master of Laws (LL.M.) from The John Marshall Law School, Chicago, Illinois, and the Ph.D. from the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Mr. Lipinski has worked in a variety of legal settings including the private, public and non-profit sectors. He is the author of numerous articles and book chapters and has been a visiting professor in summers at the University of Pretoria-School of Information Technology (Pretoria, South Africa) and at the Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign. Professor Lipinski was the first named member of the Global Law Faculty, Faculty of Law, University of Leuven (Katholieke Universiteit Leuven), Belgium, in Fall of 2006 where he continues to lecture annually at its Centers for Intellectual Property Rights and Interdisciplinary Center for Law and ICT. In October he returned to the University of Wisconsin—Milwaukee to serve as Professor and Dean of its i-School, the School of Information Studies. He serves as a member of the IFLA Copyright and other Legal Matters Committee and an IFLA delegate to the WIPO Standing Committee on Copyright and Other Rights. His current project is a book on legal issues in maker spaces in libraries with Mary Minow and Gretchen McCord that should be available this summer or fall.

As OITP’s Information Policy Analyst, Charlie Wapner provides analytical, organizational, and logistical support to the ALA Washington Office as part of a team developing and implementing a national information policy agenda for America’s public libraries. He also leads OITP’s work on the policy implications of 3D printing. Prior to working at ALA, Charlie spent two-and-a-half years providing policy and communications support to members of the U.S. House of Representatives. He worked first for Congressman Mark Critz of Pennsylvania and then for Congressman Ron Barber of Arizona. Charlie holds a B.A. in diplomatic history from the University of Pennsylvania and an M.S. in public policy and management from Carnegie Mellon University.

There is no need to pre-register! Just show up on March 5, 2015, at 2:00 p.m. Eastern by clicking here.

The post 3D printing technologies in libraries: intellectual property right issues appeared first on District Dispatch.

Join LITA’s Imagineering IG at ALA Annual / LITA

Editor’s note: This is a guest post by Breanne Kirsch.

During the upcoming 2015 ALA Annual Conference, LITA’s Imagineering Interest Group will host the program “Unknown Knowns and Known Unknowns: How Speculative Fiction Gets Technological Innovation Right and Wrong.” A panel of science fiction and fantasy authors will discuss their work and how it connects with technological developments that were never invented and those that came about in unimagined ways. Tor is sponsoring the program and bringing authors John Scalzi, Vernor Vinge, Greg Bear, and Marie Brennan. Baen Books is also sponsoring the program by bringing Larry Correia to the author panel.


John Scalzi wrote the Old Man’s War series and more recently, Redshirts, which won the 2013 Hugo Award for Best Novel. Vernor Vinge is known for his Realtime/Bobble and Zones of Thought Series and a number of short fiction stories. Greg Bear has written a number of series, including Darwin, The Forge of God, Songs of Earth and Power, Quantum Logic, and The Way. He has also written books for the Halo series, short fiction, and standalone books, most recently, War Dogs as well as the upcoming novels Eternity and Eon. Marie Brennan has written the Onyx Court series, a number of short stories, and more recently the Lady Trent series, including the upcoming Voyage of the Basilisk. Larry Correia has written the Monster Hunter series, Grimnoir Chronicles, Dead Six series, and Iron Kingdoms series. These authors will consider the role speculative fiction plays in fostering innovation and bringing about new ideas.

Please plan to attend the upcoming ALA Annual 2015 Conference and add the Imagineering Interest Group program to your schedule! We look forward to seeing you in San Francisco.

Breanne A. Kirsch is the current Chair of the Imagineering Interest Group as well as the Game Making Interest Group within LITA. She works as a Public Services Librarian at the University of South Carolina Upstate and is the Coordinator of Emerging Technologies. She can be contacted at or @breezyalli.

New Open Knowledge Local Groups in Macedonia, Pakistan, Portugal and Ukraine / Open Knowledge Foundation


It’s once again time for us to proudly announce the establishment of a new batch of Open Knowledge Local Groups, founded by community leaders in Macedonia, Pakistan, Portugal and Ukraine, which we hereby welcome warmly into the ever-growing family of Local Groups. This brings the total number of Local Groups and Chapters up to a whopping 58!

In this blog post we would like to introduce the founders of these new groups and invite everyone to join the community in these countries.


In Macedonia, the Local Group has been founded by Bardhyl Jashari, who is the director of Metamorphosis Foundation. His professional interests are mainly in the sphere of new technologies, media, civic activism, e-government and participation. Previously he worked as Information Program Coordinator of the Foundation Open Society – Macedonia. In both capacities, he has run national and international-scope projects, involving tight cooperation with other international organizations, governmental bodies, the business and the civic sector. He is a member of the National Council for Information Society of Macedonia and National Expert for Macedonia of the UN World Summit Award. In the past he was a member of the Task Force for National Strategy for Information Society Development and served as a commissioner at the Agency for Electronic Communication (2005-2011). Bardhyl holds a master’s degree from Paris 12 University-Faculty of Public Administration (France) and an Information System Designer Degree from University of Zagreb (Croatia).

To get in touch with Bardhyl and connect with the community in Macedonia, head here.


The new Local Group in Pakistan is founded by Nouman Nazim. Nouman has worked for 7+ years with leading Public Sector as well as Non Government Organizations in Pakistan and performed a variety of roles related to Administration, Management, Monitoring etc. He has worn many other hats too in his career, including programmer, writer, researcher, manager, marketer and strategist. As a result, he has developed unique abilities to manage multi-disciplinary tasks and projects as well as to navigate complex challenges. He has a Bachelor’s degree in Information Sciences and is currently pursuing a Master’s degree in Computer Science besides working on his own startup outside of class. He believes open data lets us achieve what we could normally never be able to and that it has the potential to positively change millions of lives.

In the Open Knowledge Pakistan Local Group Nouman is supported by Sher Afgun Usmani and Shaigan Rana. Sher has studied Computer sciences and is an entrepreneur, co-founder of Yum Solutions and Urducation (an initiative to promote technical education in Urdu). He has been working for 4+ years in the field of software development. Shaigan holds an MBA degree in Marketing, and is now pursuing a Post-Graduate degree in internet marketing from Iqra University Islamabad, Pakistan. His research focuses on entrepreneurship, innovation and open access to international markets. He is co-founder of and Yum Solutions. He has an interest and several years’ experience in internet marketing, content writing, business development and direct sales.

To get in touch with Nouman, Sher and Shaigan and connect with the community in Pakistan, head here.


Open Knowledge Portugal is founded in unison by Ricardo Lafuente and Olaf Veerman.

Ricardo co-founded and facilitates the activities of Transparência Hackday Portugal, Portugal’s open data collective. Coming from a communications design background and an MA in Media Design, he has been busy developing tools and projects spanning the fields of typography, open data, information visualization and web technologies. He also co-founded the Porto office of Journalism++, the data-driven journalism agency, where he takes the role of designer and data architect along with Ana Isabel Carvalho. Ana and Ricardo also run the Manufactura Independente design research studio, focusing on libre culture and open design.

Olaf Veerman leads the Lisbon office of Development Seed and their efforts to contribute to the open data community in Europe, concretely by leading project strategy and implementation through full project cycles. Before joining Development Seed, Olaf lived throughout Latin America where he worked with civil society organizations to create social impact through the use of technology. He came over from Flipside, the Lisbon based organization he founded after returning to Portugal from his last stay in the Southern hemisphere. Olaf is fluent in English, Dutch, Portuguese, and Spanish.

To get in touch with Ricardo and Olaf – and connect with the community in Portugal, head here.


Denis Gursky is the founder of the new Open Knowledge Local Group in Ukraine. He is also the founder of SocialBoost, a set of innovative instruments, including the open data movement in Ukraine, that improves civic engagement and makes government more digitalized and thus more accountable, transparent and open. He is furthermore a digital communications and civic engagement expert and works on complex strategies for government and the commercial sector. He is one of the leaders of the open government data movement in Ukraine, supported by government and hacktivists, and is currently developing the Official Open Government Data Portal of Ukraine and Open Data Law.

To get in touch with Denis and connect with the community in Ukraine, head here.

Photo by, CC BY-SA.

Epub linkrot / Raffaele Messuti

Linkrot also affects epub files (who would have thought! :)).
How to check the health of external links in epub books (required tools: a shell, atool, pup, gnu parallel).
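For readers without those tools, here is a rough Python equivalent using only the standard library. It is not the shell pipeline referred to above, just a sketch of the same idea: an epub is a zip archive, so the script reads its (X)HTML files, collects the external links and asks each server for a status code. The function names and the User-Agent string are made up for the example.

```python
import sys
import zipfile
from html.parser import HTMLParser
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that point to http(s) URLs."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.startswith(("http://", "https://")):
                self.links.add(value)


def external_links(epub_path):
    """Return the sorted external links found in the (X)HTML files of an epub."""
    links = set()
    with zipfile.ZipFile(epub_path) as epub:
        for name in epub.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = LinkCollector()  # fresh parser per file
                parser.feed(epub.read(name).decode("utf-8", errors="replace"))
                parser.close()
                links |= parser.links
    return sorted(links)


def link_status(url, timeout=10):
    """Return the HTTP status code for url, or a short error description."""
    request = Request(url, method="HEAD", headers={"User-Agent": "epub-linkcheck-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code
    except (URLError, OSError) as error:
        return f"error: {error}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python epub_linkcheck.py book.epub
    for link in external_links(sys.argv[1]):
        print(link_status(link), link)
```

Some servers reject HEAD requests, so a non-200 status here is a hint to look closer, not proof that the link is dead.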

Library and Archives Canada: Planning for a new union catalogue / Dan Scott

Update 2015-03-03: Clarified (in the Privacy section) that only NRCan runs Evergreen.

I attended a meeting with Library and Archives Canada today in my role as an Ontario Library Association board member to discuss the plans around a new Canadian union catalogue based on OCLC's hosted services. Following are some of the thoughts I prepared in advance of the meeting, based on the relatively limited materials to which I had access. (I will update this post once those materials have been shared openly; they include rough implementation timelines, perhaps the most interesting of which is that the replacement system is not expected to be in production until August 2016.) Let me say at the outset that there were no solid answers on potential costs to participating libraries, other than that LAC is striving to keep the costs as low as possible.

Basic question: What form does LAC envision the solution taking?

Will it be:

  • "Library and Archives Canada begins adding records and holdings to WorldCat" as listed for many other countries in;
  • Or a separate, standalone but openly searchable WorldCat Local catalogue that Canadians can use like the Dutch or United Kingdom union catalogues (which lack significant functionality that standard WorldCat possesses, like the integrated discovery markup)?
  • Or a separate, standalone but closed catalogue like the Dutch union catalogue GGC and the Combined Regions UnityUK that require a subscription to access?

The answer was "yes, we will be adding records and holdings to WorldCat, and yes, you will be able to search a WorldCat Local instance for both LAC-specific and AMICUS as a whole" - but they're still working out the exact details. Later we determined that it will actually be WorldCat Discovery--essentially a rewrite of WorldCat Local--which assuaged some of my concerns about the current examples we can see of other OCLC-based union catalogues.

Privacy of Canadian citizens

The "Canadian office and data centre locations" requirement does not mean that usage data is exempt from Patriot Act concerns. Specifically, OCLC is an American company and thus the USA Patriot Act "allows US authorities to obtain records from any US-linked company operating in Canada" (per a 2004 brief submitted to the BC Privacy Commissioner by CIPPIC). Canadians should not be subject to this invasion of their privacy by the agents of another nation simply to use their own national union catalogue.

The response: The Justice, Agricultural, and NRCan agencies use US-hosted library systems (the latter running the open-source Evergreen, by Equinox). However, one of the other participants from a federal agency reported that they had been trying to update to Sierra from their Millennium instance but have been stalled for two years because whatever policy allowed them to go live with US-hosted Millennium is not being allowed now.

LAC claimed that, due to NAFTA, they are not allowed to insist that data be held in Canada unless it is for national security reasons. They noted that any usage data collected wouldn't be the same volume of patron data that would be seen in public libraries. They did point out that the Netherlands sends anonymized data to OCLC, but that costs money and impacts response time. Apparently OCLC claims on its web site not to have had a request under the Patriot Act.

Privacy of Canadian citizens, part 2

I didn't get the chance to bring this up during the call...

LAC noted in their background that modern systems have links to social media, and apparently want this as part of a new AMICUS. This would also open up potential privacy leaks; see Eric Hellman on this topic, for example; it is also an area of interest for the recently launched ALA Patron Privacy Technologies Interest Group.

Open data

Opening up access to data is part of the federal government's stated mission. Canada's Action Plan on Open Government 2014-16 says "Open Government Foundation - Open By Default" is a keystone of its plan; "Eligible data and information will be released in standardized, open formats, free of charge, and without restrictions on reuse" under the Open Government Licence - Canada 2.0. I therefore asserted:

  • A relaunched National Union Catalogue should therefore support open data per the federal initiative from launch.
  • The open data should include bibliographic, authority, and holdings records. Guy Berthiaume's reply to CLA and CAPAL that libraries can use the Z39.50 protocol to try to access records from individual library's Z39.50 servers ignores one of the primary purposes of a union catalogue, which is to avoid that time-consuming search across the various Z39.50 servers of the institutions that contributed their data to the union catalogue in the first place.

The response: The ACAN requirements document indicated a requirement that the data be made available under an ODC-BY license (matching OCLC's general WorldCat license); and LAC needs to get the data back to support their federated search tool.

I asked if they had checked to see if ODC-BY and Open Government License - Canada 2.0 licenses are compatible; they responded that that was something they would need to look into. Happily, the CLIPol tool indicates that the ODC-BY 1.0 and Open Government License - Canada 2.0 licenses are mostly compatible.

Contemporary features: are we achieving the stated goals?

The backgrounder benefits/objectives section stated: "In the current AMICUS-based context, the NUC has not kept pace with new technological functions, capabilities, and client needs. Contemporary features such as a user-oriented display and navigation, user customization, links to social media, and linked open data output were not available when AMICUS was implemented in the 1990s."

Canadian resource visibility

To preserve and promote our unique national culture, we want Canadian library resources to be as visible as possible on the web. This is generally accomplished by publishing a sitemap (a list of the web pages for a given web site, along with when each page was last updated) and allowing search engines like Google, Bing, and Yahoo to crawl those web pages and index their data.
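As a rough illustration of what such a sitemap contains, the following Python sketch builds one with the standard library. The namespace is the one defined by the sitemaps.org protocol; the `build_sitemap` helper and the catalogue URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a date string such as "2015-03-03".
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, last_modified in pages:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = last_modified
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical record pages, for illustration only.
print(build_sitemap([
    ("https://catalogue.example.ca/record/1", "2015-03-01"),
    ("https://catalogue.example.ca/record/2", "2015-03-03"),
]))
```

A search engine that fetches this file learns which record pages exist and which have changed since its last visit, without having to crawl the whole site to find out.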

To maximize the visibility of Canadian library resources on the open web, we need our union catalogue to generate a sitemap that points to only the actual records with holdings for Canadian libraries, not just in general. For example, simply points to the generic, not a specific sitemap for the Dutch union catalogue.

Our union catalogue should publish metadata to improve the discoverability of our resources in search engines (which initiated the standard for that purpose). WorldCat includes metadata, but WorldCat Local instances do not.

The response: There was some confusion about, and they asked if I didn't think that OCLC's syndication program was sufficient for enabling web discoverability. I replied in the negative.

Standards support (MARC21, RDA, ISO etc.)

I didn't get a chance to raise these questions.

What standards, exactly, are meant by this?

"Technical requirements including volumetrics and W3C compliance" is also very broad and vague. With respect to "W3C compliance", W3C Standards is just the start of many standards.

  • Presumably there will be WCAG compliance for accessibility - but to what extent?
  • Both the adamnet and fablibraries instances landing pages state that their canonical URL is, which effectively hides them from search engines.

Mobile support

The W3C Standards page mentions mobile friendliness as part of its standards. itself is not mobile friendly. It uses a separate website with different URLs to serve up mobile web pages, and does not automatically detect mobile browsers; the onus is on the user to find the "WorldCat Mobile" page, and that has been in a "Beta" state since 2009. The "beta" contravenes the stated requirements for the AMICUS replacement service to not be an alpha or beta, unless you choose to ignore the massive adoption of mobile devices for searching and browsing purposes, and the beta mobile experience lacks functionality compared to the desktop version.

The adamnet and fablibraries WorldCat Local instances don't advertise the mobile option, which is slightly different than the standard WorldCat Mobile version (for example, it offers record detail pages), but the navigation between desktop and mobile is sub-par. If you have bookmarked a page on the desktop, then open that bookmark on your synchronized browser on a mobile device, you can only get the desktop view.

Linked open data

Linked open data around records, holdings, and participating libraries has arguably been a standard since the W3C Library Linked Data Incubator Group issued its final report in 2011.

  • Data--including library holdings--should be available both as bulk downloads and as linked open data
  • Records need to be linked to libraries and holdings. For humans, that missing link in WorldCat is supplied by a JavaScript lookup based on geographic location info that the human supplies. This prevents other automated services from aggregating the data and creating new services based on it (including entirely Canadian-built and hosted services which would then protect Canadians from USA Patriot Act concerns).
  • MARC records should be one of the directly downloadable formats via the web. Currently download options are limited to experimental & incomplete N-Triples, Turtle, JSON-LD, and RDF/XML formats.

Application programming interface (API)

I didn't get the chance to bring this up during the call...

OCLC offers the xID API in a very limited fashion to non-members, which is one of the only ways to match ISBN, LCCN, and OCLC numbers. LAC should ensure that Canadian libraries have access to some similarly efficient means of finding matching records without having to become full OCLC Cataloguing members.

Updating the NUC

I didn't get the chance to bring this up during the call...

In an ideal world, the NUC would adopt the standard web indexing practice of checking sitemaps (for those libraries that produce them) on a regular (daily or weekly) basis and add or replace any new or modified records & holdings from the contributing libraries accordingly, rather than requiring libraries to upload their own records & holdings on an irregular basis.
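To make the harvesting idea concrete, here is a minimal Python sketch of the "check the sitemap, fetch only what changed" step. The sample sitemap, its URLs, and the `modified_since` helper are assumptions for illustration; how the NUC would then fetch and merge the records is left out.

```python
from datetime import date
from xml.etree import ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def modified_since(sitemap_xml, last_harvest):
    """Return the URLs whose <lastmod> is on or after `last_harvest`.

    Entries without a <lastmod> are included, since we cannot tell
    whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # <lastmod> may carry a time component; compare on the date part only.
        if lastmod is None or date.fromisoformat(lastmod[:10]) >= last_harvest:
            changed.append(loc)
    return changed

# Hypothetical sitemap from a contributing library.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/record/1</loc><lastmod>2015-02-01</lastmod></url>
  <url><loc>https://example.org/record/2</loc><lastmod>2015-03-02</lastmod></url>
</urlset>"""

to_fetch = modified_since(SAMPLE, date(2015, 3, 1))
print(to_fetch)
```

Run daily or weekly, a loop like this would only re-fetch the pages a library reports as new or modified since the previous harvest.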

Thoughts on “Search vs. Discovery” / SearchHub

“Search vs discovery” is a common dichotomy in discussions about search technology, where the former is about finding specific things that are either known or assumed to exist, and the latter is about using the search/browse interface to discover what content is available. A single user session may include both of these “agendas”, especially if a user’s assumption that a certain piece of content exists is not quickly verified by finding it. Findability is impaired when there are too many irrelevant or noise hits (false positives), which obscure or camouflage the intended results. This happens when metadata is poorly managed, when search relevance is poorly tuned, or when the user’s query is ambiguous and the application provides no feedback (such as autocomplete, recommendations or “did you mean”) to help improve it.

Content Visibility

Content visibility is important because a document must first be included in the result set to be found (obviously), but it is also critical for discovery, especially with very large content sets. User experience has shown that faceted navigation is one of the best ways to provide this visualization, especially if it includes dimensions that focus on “aboutness” and “relatedness”. However, if a document is not appropriately tagged, it may become invisible to the user once the facet that it should be included in (but is not) is selected. Data quality really matters here! (My colleague Mark Bennett has authored a Data Quality Toolkit to help with this. The venerable Lucene Index Toolbox or “Luke”, which can be used to inspect the back end Lucene index, is also very useful. The LukeRequestHandler is bundled with Solr.) Without appropriate metadata, the search engine has no way of knowing what is related to what. Search engines are not smart in this way – the intelligence of a search application is built into its index.
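A toy Python sketch of the visibility problem, using plain lists rather than a real search engine. The documents, the `topics` field and both helper functions are made up for illustration; a Solr facet filter behaves analogously but is not what is shown here.

```python
# Three hypothetical documents; document 2 is about search but was never tagged.
docs = [
    {"id": 1, "title": "Solr faceting guide", "topics": ["search"]},
    {"id": 2, "title": "Tuning Solr relevance", "topics": []},
    {"id": 3, "title": "Cafeteria menu", "topics": ["food"]},
]

def keyword_search(docs, term):
    """Very naive keyword match on the title."""
    return [d for d in docs if term.lower() in d["title"].lower()]

def apply_facet(results, field, value):
    """Keep only results whose facet field contains the selected value."""
    return [d for d in results if value in d[field]]

hits = keyword_search(docs, "solr")
narrowed = apply_facet(hits, "topics", "search")

print([d["id"] for d in hits])      # documents 1 and 2 match the keyword
print([d["id"] for d in narrowed])  # only document 1 survives the facet click
```

Document 2 is relevant and was in the original result set, yet it disappears the moment the user clicks the facet it should have been tagged with.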

Search and Content Curation

Findability and visibility are also very important when the search application is used as a tool for content curation within an organization. Sometimes, the search agenda is to see if something has been created before, as a “due diligence” activity before creating it. Thus, the phrase “out of sight, out of mind” becomes important, because content that can’t be found tends to be re-created. This leads to unnecessary duplication, which is wasteful but also counter-productive to search, both by adding to the repository size and by increasing the possibility of obfuscation by similarity. Applying “deduplication” processes after the fact is a band-aid – we should make it easier to find things in the first place so we don’t have to do more work later to clean up the mess. We also need to be confident in our search results, so that if we don’t find it, it is likely that it doesn’t exist – see my comments on this point in Introducing Query Autofiltering. Note that this is always a slippery slope. In science, absence of evidence does not equate to evidence of absence – hence “Finding Bigfoot”! (If they ever do find “Squatch” then no more show – or they have to change the title to “Bigfoot Found!” – which would be very popular but also couldn’t be a series! That’s OK, I only watched it once to discover that they don’t actually “find” Bigfoot – hence the ‘ing’ suffix. I suppose that “Searching for” sounds too futile to tune it in even once.)

Auto-classification Tuning

Auto-classification technology is a potential cure in all of the above cases, but can also exacerbate the problem if not properly managed. Machine learning approaches, or the use of ontologies and associated rules, provide ways to enhance the relevance of important documents and to organize them in ways that improve both search and discovery. However, in the early phases of development it is likely that an auto-classification system will make two types of errors that, if not fixed, can lead to problems of both findability and visibility. First, it will tag documents erroneously, leading to the camouflage or noise problem, and second, it will not tag documents that it should, leading to a problem with content visibility. We call these “precision” and “recall” errors respectively. The recall error is especially insidious because, if not detected, it will cause documents to be dropped from consideration when a navigation facet is clicked. Also, errors of omission are more difficult to detect, and require the input of people who understand the content set well enough to know what the autoclassifier “should” do. Manual tagging, while potentially more accurate, is simply not feasible in many cases because Subject Matter Experts are difficult to outsource. Data quality analysis/curation is the key here. Many times the problem is not the search engine’s fault. Garbage-In-Garbage-Out, as the saying goes.
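For readers who want the two error types pinned down, here is a small Python sketch that scores one tag produced by an auto-classifier against a hand-checked list. The document ids are invented, and `precision_recall` is an illustrative helper, not part of any particular toolkit.

```python
def precision_recall(predicted, expected):
    """Compare the set of documents the classifier tagged (`predicted`)
    with the set a subject matter expert says should carry the tag
    (`expected`).

    precision = correct tags / all tags applied   (low means noise)
    recall    = correct tags / all tags expected  (low means invisibility)
    """
    true_pos = len(predicted & expected)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(expected) if expected else 0.0
    return precision, recall

# Hypothetical example: four documents should carry the tag.
expected = {"doc1", "doc2", "doc3", "doc4"}
# The classifier tagged two of them correctly, plus one it should not have.
predicted = {"doc1", "doc2", "doc9"}

precision, recall = precision_recall(predicted, expected)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here "doc9" is the noise hit (a precision error), while "doc3" and "doc4" are the omissions (recall errors) that would vanish when the facet is clicked. Note that measuring recall at all requires the expert-supplied `expected` set, which is why those errors are harder to detect.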

Data Visualization – Search-Driven Analytics

I think that one of the most exciting usages of search as a discovery tool is the combination of the search paradigm with analytics. This used to be the purview of the relational database model, which is at the core of what we call “Business Intelligence” or BI. Reports generated by analysts from relational data go under the rubric of OLAP (online analytical processing), which typically involves a Data Analyst who designs a set of relational queries, the output of which is then input to a graphing engine to generate a set of charts. When the data changes, the OLAP “cube” is re-executed and a new report emerges. Generating new ways to look at the data requires the development, testing, etc. of new cubes. This process by its very nature leads to stagnation – cubes are expensive to create, and this may stifle new ideas since some expert labor is required to bring these ideas to fruition.

Search engines and relational databases are very different animals. Search engines are not as good as RDBMS at several things – ACID transactions, relational joins, etc. – but they are much better at dealing with complex queries that include both structured and unstructured (textual) components. Search indexes like Lucene can include numerical, spatial and temporal data alongside textual information. Using facets, they can also count things that are the output of these complex queries. This enables us to ask more interesting questions about data – questions that get to “why” something happened rather than just “what”. Furthermore, recent enhancements to Solr have added statistical analyses to the mix – we can now develop highly interactive data discovery/visualization applications which remove the data analyst from the loop. While there is still a case for traditional BI, search-driven discovery will fill the gap by allowing any user – technical or not – to ask the “what if” questions. Once an important analysis has been discovered, it can be encapsulated as an OLAP cube so that the intelligence of its questions can be productized/disseminated.

Since this section is about visualization and there are no pictures in this post, you may want to “see” examples of what I am talking about. First, check out Chris Hostetter (aka “Hoss”)’s blog post “Hey, You Got Your Facets in My Stats! You Got Your Stats In My Facets!!”, and his earlier post on pivot facets. Another way cool demonstration of this capability comes from Sam Mefford when he worked at Avalon Consulting – this is a very compelling demonstration of how faceted search can be used as a discovery/visualization tool. Bravo Sam! This is where the rubber meets the road folks!

The post Thoughts on “Search vs. Discovery” appeared first on Lucidworks.

Free webinar: Expanding immigrant access through libraries / District Dispatch

Hartford Public Library

Library services to immigrants are extensive and include world language collections, multicultural programming, ESL, citizenship, computer classes, and information brokering. Learn how your library can better support immigrants in “We Belong Here: Expanding Immigrant Access to Government and Community,” a free webinar hosted by e-government service Lib2gov from the American Library Association’s Washington Office and University of Maryland’s iPAC.

This webinar will focus on e-government services that open access for immigrants, using the Hartford Public Library’s American Place Initiative as a national model for immigrant services, resources, and engagement through public libraries.

Homa Naficy, chief adult learning officer for the Hartford Public Library, will lead the interactive webinar. Homa Naficy joined Hartford Public Library in 2000 to design and direct The American Place program for Hartford’s immigrants and refugees. Born in Paris, a native of Iran and now an American citizen, Multicultural Services Director Homa Naficy began her library career as a reference librarian at Newark Public Library. Before joining the staff of Hartford Public Library, she served as a reference librarian at Yonkers Public Library and later as librarian for Adult Services and Outreach for the Westchester Library System.

The American Place has become a magnet for new arrivals seeking immigration information, resources for learning English and preparing for United States citizenship. In 2010, the program was awarded two major grants, a citizenship education grant from the United States Citizenship and Immigration Services (the only library in the nation to receive such funding), and a National Leadership grant from the Institute of Museum and Library Services designed to promote immigrant civic engagement. On completion, this project will serve as a model for other libraries nationally. The American Place program is also the only library in the state to receive funding for adult basic education from the Connecticut Department of Education. In 2001, Ms. Naficy received the Connecticut Immigrant of the Year Award, and in 2013 she was chosen a “Champion of Change” by The White House.

Webinar title: We Belong Here: Expanding Immigrant Access to Government and Community
Date: March 11, 2015
Time: 2:00-3:00 p.m. EST
Register now

The webinar will be archived.

The post Free webinar: Expanding immigrant access through libraries appeared first on District Dispatch.

Islandora/Fedora 4 Project Update II / Islandora

On Friday, February 27th, the Fedora 4 Interest Group met for the second time to discuss the progress of our big upgration (the first meeting was back at the end of January). The full notes from the meeting are here, but I'll summarize some of the highlights:

Project Updates

The project has entered its second month with plenty accomplished. Nick was sent to Code4Lib 2015 in Portland, Oregon to work with our Technical Lead, Danny Lamb. The two worked on the proof-of-concept, and it was presented as a lightning talk (video demo). Additionally, Nick and Danny worked with the Hydra and Fedora communities on a shared data model, Hydra Works, which evolved into the Fedora Community Data Model.

After Code4Lib 2015, Nick and Danny focused on updating the Technical Design document, which provides:

  1. an understanding of the Islandora 7.x-2.x design rationale
  2. the importance of using an integration framework
  3. the use of Camel
  4. inversion of control and Camel
  5. Camel and scripting languages
  6. Islandora Sync
  7. Solr and Triple store indexing
  8. Islandora (Drupal).

Or, to sum up the new ways of Islandora in one image:

Nick and Danny also focused on the development virtual environment (DevOps) for the project. Nick decided to move away from using Chef and Berkshelf due to dependency support. The DevOps setup was moved to basic bash scripts and Vagrant. Contributors to the project can now spin up a virtual development environment (which includes the proof-of-concept) in about 5 minutes with a single command: vagrant up. Instructions here.

Nick also focused on project documentation and documentation deployment. All documentation for the project resides in the project's git repository, in Markdown format. The documentation can be generated into a static site with MkDocs and then deployed to GitHub Pages. The documentation for the project can be viewed here, and information about how the documentation is built and deployed can be found here. There is also an outline of how you can contribute to the project here (regardless of your background; we need far more than programmers).

A new use case template makes bringing your ideas to the table much easier. Check out some of the existing use cases for examples - and add yours!

Nick, Danny, and Melissa also did an interview for Duraspace.


The upgration portion of the project is dependent on a couple of sub-items of the project to play out, but continues in tandem.

The first sub-item is the Fedora Audit Service. The Islandora community makes use of the audit service in Fedora 3.x for PREMIS and other provenance services. It currently does not exist in Fedora 4.x, so the community has come together to plan out the service over two conference calls that will outline use cases and functional requirements, which will then translate to JIRA tickets for a Fedora code sprint in late March. Notes from the first meeting are here. Nick has been tasked with identifying whether the community should use the PROV-O ontology, the PREMIS ontology, or a combination of both. The second item is bridging the work of Mike Durbin’s migration-utils and Danny’s Apache Camel work in the Islandora & Fedora 4 project. While Nick was working to create test fixtures for Mike and Danny, he discovered a bug in Fedora 3.8.0, which will need to be resolved before any test fixtures can come out of York University's upgration pilot.

Nick and Danny will most likely focus on migration work and community contributed developer tasks in March.


The Islandora Foundation is pleased to welcome Simon Fraser University as a Partner for their support of the Fedora 4 project. Longtime member PALS has also earmarked some of their membership dues to help out the upgration. If you or your institution are interested in being financial supporters, please drop me a line.

Other News

Contributor Kevin Bowrin wrote up an account of his experience installing and trying out the work our team has done so far. Check it out.



Join March 6 free webinar on mapping inclusion: Public library technology and community needs / District Dispatch

As economic, education, health and other disparities grow, equitable access to and participation in the online environment is essential for success. And yet, communities and individuals find themselves at differing levels of readiness in their ability to access and use the Internet, engage a range of digital technologies and get and create digital content.

The Digital Inclusion Survey examines the efforts of public libraries to address these readiness gaps by providing free access to broadband, public access technologies, digital content, digital literacy training and a range of programming that helps build digitally inclusive communities. A new interactive mapping tool places these library resources in a community context, including unemployment and education rates.

Join researchers and data visualization experts at a free webinar on March 6, 1-2 p.m. EST, to explore the intersections of public access technologies and education, employment, health & wellness, digital literacy, e-government and inclusion. Speakers will share new tools and demonstrate how to locate and interpret national and state-level results from the survey for planning and advocacy purposes, as well as present cases for the interactive mapping tool, with suggestions for creating a digital inclusion snapshot of your public library.

The survey is funded by the Institute of Museum and Library Services (IMLS) and conducted by the ALA Office for Research & Statistics and the Information Policy & Access Center (iPAC) at the University of Maryland. The International City/County Management Association and the ALA Office for Information Technology Policy are grant partners.

Learn more about the webinar and speakers from iPAC, Community Attributes, IMLS and OITP here.

The post Join March 6 free webinar on mapping inclusion: Public library technology and community needs appeared first on District Dispatch.

Rising to the newest (Knight) challenge / District Dispatch

DC Public Library in Washington, D.C. Photo by Maxine Schnitzer Photography.

It has been said that “libraries are the cornerstone of our democracy” so the newest Knight News Challenge on Elections should be right up our alley. From candidate forums to community conversations, about half of all public libraries report to the Digital Inclusion Survey that they host community engagement events. What is your library doing that you might want to expand or what new innovative idea would you like to seed? Knight is inviting all kinds of ideas: “We see democratic engagement as more than just the act of voting. It should be embedded in every part of civic life…”

So—what’s your best idea for: How might we better inform voters and increase civic participation before, during and after elections?

There are several ways you can participate and learn more:

  1. Check out and comment on the growing number of applications. Which of these could best help address issues you see and hear in your community and your library? On a quick scan, I could definitely see a library or libraries as partners for the Knowledge Swap Market, or a similar project, for instance. Also—how might an application be made stronger and more useful? You don’t have to be an applicant to contribute to the conversation, and comments are accepted through April 13.
  2. BUT—you should definitely consider applying! With more than $3 million available, a wide-open invitation to interpret the question as you see fit, encouragement to partner with others, and the opportunity to get feedback from others to improve your application, there’s a lot to be gained in participating.
  3. Learn more about the whole process at “virtual office hours” open Tuesday, March 3, from 1-2 p.m. Eastern Time and on Tuesday, March 17, from 1-2 p.m. ET. Information about these virtual office hours and in-person events in cities across the country can be accessed here. I attended the event in D.C., and it was a great opportunity to meet people and make connections for possible collaboration.

The challenge is a collaboration between the John S. and James L. Knight Foundation, a leading funder of news and media innovation, and three other foundations: the Democracy Fund, the Rita Allen Foundation and the William & Flora Hewlett Foundation. Winners will receive a share of more than $3 million, which includes up to $250,000 from the Democracy Fund.

This news challenge and the recent NetGain challenge are great opportunities to gain visibility and support for library projects working to address community needs and challenges in innovative ways. These invitations to engage with other community and national stakeholders also resonate with the emerging national policy agenda for libraries and the Aspen Institute report (pdf) on re-envisioning public libraries.

I hope you’ll consider joining the conversation. If so, please leave a note here in comments, so others can look for your proposal.

The post Rising to the newest (Knight) challenge appeared first on District Dispatch.

Community Contributor Kudos: Diego Pino / Islandora

It has been a while since we have done a Community Contributor Kudos post, but if anyone is worthy of reviving the feature, it is this week's subject: Diego Pino.

Diego is a freelance developer who specializes in addressing the needs of the scientific community with open source solutions. Right now he is also working as an IT Project Manager for a project that aims to build a national biodiversity network, funded by the Chilean government. If you have gone to the listserv with a question in the past several months, you will also recognize him as one of the most helpful troubleshooters in the Islandora community - pretty remarkable given that he only started using the software about a year ago:

Islandora is still new for me and still amazes me. All started about a year ago. I was given the task to find a way of storing and sharing Biodiversity occurrence records, and thus build a federated network that could help scientists to collaborate and share research data. The primary need was to move data to GBIF for storage, described with Darwin Core metadata, so I started researching what was going on in terms of preserving digital content for science. Until then I thought everything could be solved using a relational database and some custom coding (how wrong I was!)
He started by exploring eSciDoc (created by Matthias Razum), a project based on Fedora 3.x. It was designed to address a need that Diego had been working on for some time: how to involve researchers and scientists directly in the process of sharing and curating their own data. This, and the project's own documentation, sold Diego on Fedora 3.x, but he wanted more - not only the ability to ingest and preserve digital content, but a fully working framework/API that would allow him to focus on the user experience.
And then I found the Islandora's google forum and it was exactly what I needed: A big and nice community of human beings, with problems similar to mine, and with an incredible piece of software, a.k.a. Islandora.
I must admit the learning curve was hard; some needed things were not developed and I had to add to my new knowledge Drupal, Solr and Web Semantics (my favourite subject right now), but the community was great and helpful, and meeting Giancarlo Birello was an inspiration to keep working and also to help other users on the forum. I have received so much; giving a little back is a must.

Currently Diego is developing and managing a four repo configuration, with each running a stock Islandora 7.x-1.4/Fedora 3.7.1, using an external Tomcat and other goodies, but sharing a common Solr Cloud index. As Diego describes it, "one collection, many shards, many replicas." He had to fine tune the way objects were indexed to avoid duplicated PIDs and to be able to distinguish during search which repo the object lives in. The repos are also running his Redbiodiversidad Solution Pack, which handles Darwin Core based objects, maps, EML, and GBIF DC archives; and the Ontologies Solution Pack, which allows objects to be related by multiple overlapping ontologies, and of which Diego is particularly proud.

My favourite thing about this configuration is that I can search across all existing repos and their collections, use existing solution packs like PDF or Scholar to describe publications and people, relate local objects to remote ones, and build nice linked data graphs. These expand the notion of plain, independent metadata records encapsulated in objects, to a fully new dimension for us (maybe exaggerating here!) that is helping local scientists to understand their data in a more ample context: in my opinion the needed transition from information to knowledge.

A very simple and trivial example. A Chilean scientist can now discover what other biological occurrences (associated species) are found near a place where they made a discovery; who found them, when, under which method, and filter by many parameters in a few steps or clicks,  thanks to Solr search module + linked data. They can expand their knowledge, collaborate, and  manage their own research data in ways their previous workflows (excel?) did not allow. And my favourite part: if something is not working as expected I can fix it using Islandora's API. There are some many nice hooks available and more to come.

As for projects coming down the pipeline, Diego is working on a new visual workflow to ingest and manage relationships between objects, reusing the way the Ontologies SP currently displays a linked object graph. The end goal is to allow people to interactively add new objects, connect them using rules present in multiple OWLs, and finally save this new "knowledge" representation as a whole. Essentially, every ontology becomes a programmable workflow. Using this system will maintain a consistent network of repositories with well-related objects, while still giving users control of their data. He has promised the community an OCR editor, which remains high on his TODO list. As an active member of the Fedora 4 Interest Group, Diego is also involved in planning and developing the next generation of Islandora (and taking a stand for those who don't want to see XML Forms vanish into the night).

Diego does all of this amazing work from his home office in a little village named Pucón in southern Chile, nestled next to an active volcano and a lake. He credits this environment with giving him the peace to code - that, and his small herd of dogs:

Lastly, none of this work using Islandora could have be done without the great support of the community and the also very important support  and patience of my wife and my 4 Dogs, who by this time already hate ontologies.

His Red Biodiversidad repo is still in development, but a beta site is online, showing Solr results from their cloud, fetched from the real repos' collections. And here is one of those collections, full of biological data and growing all the time. You can find more of Diego's work on his GitHub page, and you can usually find him making the Islandora community better one solution at a time on our listserv (it's quite remarkable how many search results for 'diego' in our Google Group turn up some variation of the phrase "thanks, Diego").

Someone in Diego's family is a remarkable photographer, so when I asked him to send along a photo I could use with this blog so the community could put a face to all of those awesome listserv posts, it was difficult to choose. I leave it to the community to decide which image best suits Diego Pino: Programmer on a Mountain or Man Hugs Dog:

Programmer on a Mountain         Man Hugs Dogs

WordPress for Libraries / LibUX

Amanda and Michael are teaching simultaneous online classes on WordPress for Libraries – at least sixty hours worth of tutorial for beginners and developers. Back to back, these classes take you from using WordPress out-of-the-box to create and manage a library website through the custom development of an event management plugin.

Using WordPress to Build Library Websites

WordPress is an open-source content management system that helps you create, design, and maintain a website. Its intuitive interface means that there’s no need to learn complex programming languages, and because it’s free, you can do away with purchasing expensive web development software. This course will guide you in applying WordPress tools and functionality to library content. You will learn the nuts and bolts of building a library website that is both user friendly and easy to maintain. Info

Advanced WordPress

WordPress is an incredible out-of-the-box tool, but libraries with ambitious web services will find it needs to be customized to meet their unique needs. This course is built around a single project: the ground-up development of an event management plugin, which will provide a thorough understanding of WordPress as a framework–hooks, actions, methods–that can be used to address pressing and ancillary issues like content silos and the need to C.O.P.E. – create once, publish everywhere. Info


American Library Association eCourses are asynchronous with mixed-media materials available online and at no additional cost. So, you don’t have to get a text book. You can usually proceed at your own pace and submit material through the forums, unless the facilitator changes it up – and we probably won’t, unless it makes sense to keep the class proceeding together. Both of our courses are six weeks, beginning March 16, 2015 – but we want you to squeeze as much as you can out of these classes, so we are available to explain, walk through, and answer questions for as long as you need. We really want you to walk away with real-world applicable skills.

The post WordPress for Libraries appeared first on LibUX.

Repetition / Ed Summers

To be satisfied with repeating, with traversing the ruts which in other conditions led to good, is the surest way of creating carelessness about present and actual good.

John Dewey in Human Nature and Conduct (p. 67).

DPLA Metadata Analysis: Part 4 – Normalized Subjects / Mark E. Phillips

This is yet another post in the DPLA Metadata Analysis series, which already has three parts; here are links to part one, two and three.

This post looks at the effect of basic normalization of subjects on the various metrics mentioned in the previous posts.


One of the things that happens in library land is that subject headings are often constructed by connecting various broader pieces into a single subject string that becomes more specific.  For example the heading “Children–Texas.” is constructed from two different pieces,  “Children”, and “Texas”.  If we had a record that was about children in Oklahoma it could be represented as “Children–Oklahoma.”.

The analysis I did earlier took the subject exactly as it occurred in the dataset and used that for the analysis.  I had a question asked about what would happen if we normalized the subjects before we did the analysis on them,  effectively turning the unique string of “Children–Texas.” into two subject pieces of “Children” and “Texas” and then applied the previous analysis to the new data. The specific normalization includes stripping trailing periods, and then splitting on double hyphens.
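The normalization rule described above (strip trailing periods, then split on double hyphens) could be sketched in Python roughly as follows. This is an illustration of the stated rule, not the code used for the analysis; `normalize_subject` is a made-up name, and the whitespace trimming is an added assumption.

```python
def normalize_subject(heading):
    """Split one subject string into its component pieces.

    Steps, following the rule stated in the post:
      1. strip trailing periods
      2. split on double hyphens
    Surrounding whitespace is also trimmed and empty pieces are dropped
    (an assumption beyond what the post states).
    """
    stripped = heading.strip().rstrip(".")
    return [part.strip() for part in stripped.split("--") if part.strip()]

print(normalize_subject("Children--Texas."))
print(normalize_subject("Children--Oklahoma."))
print(normalize_subject("Education"))
```

One side effect worth knowing about: stripping every trailing period also removes the final period from headings that legitimately end in an abbreviation.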

Note: Because this conversion can introduce quite a bit of duplication among the subjects within a record, I am making the normalized subjects unique before adding them to the index. I also apply this same method to the un-normalized subjects. In doing so I noticed that the item that previously had the most subjects, at 1,476, was reduced to 1,084 because there were 347 values that were in the subject list more than once. Because of this, the average and total subject numbers in the resulting tables will differ slightly from those in the first three posts; each of these values should go down.
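The normalization and de-duplication described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the function names are made up for this example, and it assumes the raw subject strings use a literal double hyphen as the separator.

```python
def normalize_subject(subject):
    """Split one subject string into its component pieces."""
    # Strip surrounding whitespace and trailing periods, then split on "--".
    cleaned = subject.strip().rstrip(".")
    pieces = [piece.strip() for piece in cleaned.split("--")]
    # Drop empty pieces left behind by stray separators.
    return [piece for piece in pieces if piece]


def normalize_record_subjects(subjects):
    """Normalize every subject in a record, keeping each piece only once."""
    seen = set()
    unique = []
    for subject in subjects:
        for piece in normalize_subject(subject):
            if piece not in seen:
                seen.add(piece)
                unique.append(piece)
    return unique


# Hypothetical record with overlapping headings.
record = ["Children--Texas.", "Children--Oklahoma.", "Texas"]
print(normalize_record_subjects(record))
```

Note that stripping every trailing period this way would also trim headings that legitimately end in an abbreviation, which a more careful implementation might want to handle.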


My predictions before the analysis are that we will see an increase in the number of unique subjects,  a drop in the number of unique subjects per Hub for some Hubs, and an increase in the number of shared subjects across Hubs.


Normalization changed the number of unique subject headings from 1,871,884 to 1,162,491, a reduction of 38%.

In addition to the 38% reduction in unique subject headings, the distribution of subjects across the Hubs changed significantly, in one case increasing by 443%. The table below displays these numbers before and after normalization as well as the percentage change.

# of Hubs with Subject | # of Subjects | # of Normalized Subjects | % Change
1 | 1,717,512 | 1,055,561 | -39%
2 | 114,047 | 60,981 | -47%
3 | 21,126 | 20,172 | -5%
4 | 8,013 | 9,483 | 18%
5 | 3,905 | 5,130 | 31%
6 | 2,187 | 3,094 | 41%
7 | 1,330 | 2,024 | 52%
8 | 970 | 1,481 | 53%
9 | 689 | 1,080 | 57%
10 | 494 | 765 | 55%
11 | 405 | 571 | 41%
12 | 302 | 453 | 50%
13 | 245 | 413 | 69%
14 | 199 | 340 | 71%
15 | 152 | 261 | 72%
16 | 117 | 205 | 75%
17 | 63 | 152 | 141%
18 | 62 | 130 | 110%
19 | 32 | 77 | 141%
20 | 20 | 55 | 175%
21 | 7 | 38 | 443%
22 | 7 | 23 | 229%
23 | 0 | 2 | N/A

The two subjects that are shared across all 23 Hubs once normalized are “Education” and “United States”.
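A distribution like the one in the table above could be computed along these lines. This is only a sketch under assumptions: the hub names and subjects are toy data, and the function names are invented for illustration rather than taken from the actual analysis.

```python
from collections import Counter


def hubs_per_subject(subjects_by_hub):
    """Map each subject to the number of hubs that use it."""
    counts = Counter()
    for subjects in subjects_by_hub.values():
        # set() makes sure a subject is counted once per hub.
        counts.update(set(subjects))
    return counts


def hub_count_distribution(subjects_by_hub):
    """Map 'number of hubs' to 'number of subjects shared by that many hubs'."""
    return Counter(hubs_per_subject(subjects_by_hub).values())


# Toy data standing in for the real per-hub subject sets.
subjects_by_hub = {
    "hub_a": {"Education", "Children", "Texas"},
    "hub_b": {"Education", "Texas"},
    "hub_c": {"Education", "Maps"},
}

dist = hub_count_distribution(subjects_by_hub)
for hub_count in sorted(dist):
    print(hub_count, dist[hub_count])
```

Running the same function on the un-normalized and the normalized subject sets would give the two columns that the table compares.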

The high level stats for all 8,012,390 records are available in the following table.

Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change
8,012,390 | 23,860,080 | 28,644,188 | 2.98 | 3.57 | 20.05%

You can see the total number of subjects went up 20% after they were normalized, and the number of subjects per record increased from just under three per record to a little over three and a half normalized subjects per record.
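As a quick check of the arithmetic, the averages and the percent change can be recomputed from the totals in the table. The helper name percent_change is just an illustrative choice.

```python
def percent_change(before, after):
    """Percent change from 'before' to 'after'."""
    return (after - before) / before * 100


records = 8_012_390
total_subjects = 23_860_080
total_normalized = 28_644_188

avg_before = total_subjects / records
avg_after = total_normalized / records
change = percent_change(total_subjects, total_normalized)

print(round(avg_before, 2))  # 2.98
print(round(avg_after, 2))   # 3.57
print(round(change, 2))      # 20.05
```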

Results by Hub

The table below presents data for each hub in the DPLA. The columns are the number of records, total subjects, total normalized subjects, the average number of subjects per record, the average number of normalized subjects per record, and finally the percent change in total subjects after normalization.

Hub | Records | Total Subject String Count | Total Normalized Subject String Count | Average Subjects Per Record | Average Normalized Subjects Per Record | Percent Change
ARTstor | 56,342 | 194,883 | 202,220 | 3.46 | 3.59 | 3.76
Biodiversity Heritage Library | 138,288 | 453,843 | 452,007 | 3.28 | 3.27 | -0.40
David Rumsey | 48,132 | 22,976 | 22,976 | 0.48 | 0.48 | 0
Digital Commonwealth | 124,804 | 295,778 | 336,935 | 2.37 | 2.7 | 13.91
Digital Library of Georgia | 259,640 | 1,151,351 | 1,783,884 | 4.43 | 6.87 | 54.94
Harvard Library | 10,568 | 26,641 | 36,511 | 2.52 | 3.45 | 37.05
HathiTrust | 1,915,159 | 2,608,567 | 4,154,244 | 1.36 | 2.17 | 59.25
Internet Archive | 208,953 | 363,634 | 412,640 | 1.74 | 1.97 | 13.48
J. Paul Getty Trust | 92,681 | 32,949 | 43,590 | 0.36 | 0.47 | 32.30
Kentucky Digital Library | 127,755 | 26,008 | 27,561 | 0.2 | 0.22 | 5.97
Minnesota Digital Library | 40,533 | 202,456 | 211,539 | 4.99 | 5.22 | 4.49
Missouri Hub | 41,557 | 97,111 | 117,933 | 2.34 | 2.84 | 21.44
Mountain West Digital Library | 867,538 | 2,636,219 | 3,552,268 | 3.04 | 4.09 | 34.75
National Archives and Records Administration | 700,952 | 231,513 | 231,513 | 0.33 | 0.33 | 0
North Carolina Digital Heritage Center | 260,709 | 866,697 | 1,207,488 | 3.32 | 4.63 | 39.32
Smithsonian Institution | 897,196 | 5,689,135 | 5,686,107 | 6.34 | 6.34 | -0.05
South Carolina Digital Library | 76,001 | 231,267 | 355,504 | 3.04 | 4.68 | 53.72
The New York Public Library | 1,169,576 | 1,995,817 | 2,515,252 | 1.71 | 2.15 | 26.03
The Portal to Texas History | 477,639 | 5,255,588 | 5,410,963 | 11 | 11.33 | 2.96
United States Government Printing Office (GPO) | 148,715 | 456,363 | 768,830 | 3.07 | 5.17 | 68.47
University of Illinois at Urbana-Champaign | 18,103 | 67,954 | 85,263 | 3.75 | 4.71 | 25.47
University of Southern California. Libraries | 301,325 | 859,868 | 905,465 | 2.85 | 3 | 5.30
University of Virginia Library | 30,188 | 93,378 | 123,405 | 3.09 | 4.09 | 32.16

The number of unique subjects before and after subject normalization is presented in the table below.  The percent of change is also included in the final column.

Hub | Unique Subjects | Unique Normalized Subjects | % Change Unique
ARTstor | 9,560 | 9,546 | -0.15
Biodiversity Heritage Library | 22,004 | 22,005 | 0
David Rumsey | 123 | 123 | 0
Digital Commonwealth | 41,704 | 39,557 | -5.15
Digital Library of Georgia | 132,160 | 88,200 | -33.26
Harvard Library | 9,257 | 6,210 | -32.92
HathiTrust | 685,733 | 272,340 | -60.28
Internet Archive | 56,911 | 49,117 | -13.70
J. Paul Getty Trust | 2,777 | 2,560 | -7.81
Kentucky Digital Library | 1,972 | 1,831 | -7.15
Minnesota Digital Library | 24,472 | 24,325 | -0.60
Missouri Hub | 6,893 | 6,757 | -1.97
Mountain West Digital Library | 227,755 | 172,663 | -24.19
National Archives and Records Administration | 7,086 | 7,086 | 0
North Carolina Digital Heritage Center | 99,258 | 79,353 | -20.05
Smithsonian Institution | 348,302 | 346,096 | -0.63
South Carolina Digital Library | 23,842 | 17,516 | -26.53
The New York Public Library | 69,210 | 36,709 | -46.96
The Portal to Texas History | 104,566 | 97,441 | -6.81
United States Government Printing Office (GPO) | 174,067 | 48,537 | -72.12
University of Illinois at Urbana-Champaign | 6,183 | 5,724 | -7.42
University of Southern California. Libraries | 65,958 | 64,021 | -2.94
University of Virginia Library | 3,736 | 3,664 | -1.93

The table below presents the number and percentage of subjects and normalized subjects that are unique to a given hub, that is, used by that hub and no other.

Hub | Subjects Unique to Hub | Normalized Subjects Unique to Hub | % Subjects Unique to Hub | % Normalized Subjects Unique to Hub | % Change
ARTstor | 4,941 | 4,806 | 52 | 50 | -4
Biodiversity Heritage Library | 9,136 | 6,929 | 42 | 31 | -26
David Rumsey | 30 | 28 | 24 | 23 | -4
Digital Commonwealth | 31,094 | 27,712 | 75 | 70 | -7
Digital Library of Georgia | 114,689 | 67,768 | 87 | 77 | -11
Harvard Library | 7,204 | 3,238 | 78 | 52 | -33
HathiTrust | 570,292 | 200,652 | 83 | 74 | -11
Internet Archive | 28,978 | 23,387 | 51 | 48 | -6
J. Paul Getty Trust | 1,852 | 1,337 | 67 | 52 | -22
Kentucky Digital Library | 1,337 | 1,111 | 68 | 61 | -10
Minnesota Digital Library | 17,545 | 17,145 | 72 | 70 | -3
Missouri Hub | 4,338 | 3,783 | 63 | 56 | -11
Mountain West Digital Library | 192,501 | 134,870 | 85 | 78 | -8
National Archives and Records Administration | 3,589 | 3,399 | 51 | 48 | -6
North Carolina Digital Heritage Center | 84,203 | 62,406 | 85 | 79 | -7
Smithsonian Institution | 325,878 | 322,945 | 94 | 93 | -1
South Carolina Digital Library | 18,110 | 9,767 | 76 | 56 | -26
The New York Public Library | 52,002 | 18,075 | 75 | 49 | -35
The Portal to Texas History | 87,076 | 78,153 | 83 | 80 | -4
United States Government Printing Office (GPO) | 105,389 | 15,702 | 61 | 32 | -48
University of Illinois at Urbana-Champaign | 3,076 | 2,322 | 50 | 41 | -18
University of Southern California. Libraries | 51,822 | 48,889 | 79 | 76 | -4
University of Virginia Library | 2,425 | 1,134 | 65 | 31 | -52
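For readers who want to reproduce this kind of table, here is a rough sketch of one way to count the subjects that only a single hub uses. The toy data and the function name are assumptions for illustration; the percentage is taken relative to the hub's own unique subjects, which appears to match the figures above (for ARTstor, 4,941 of 9,560 is about 52%).

```python
from collections import Counter


def unique_to_hub(subjects_by_hub):
    """Return {hub: (count unique to hub, percent of the hub's subjects)}."""
    # Count how many hubs use each subject.
    hub_counts = Counter()
    for subjects in subjects_by_hub.values():
        hub_counts.update(set(subjects))

    results = {}
    for hub, subjects in subjects_by_hub.items():
        subjects = set(subjects)
        # A subject is unique to this hub if no other hub uses it.
        only_here = [s for s in subjects if hub_counts[s] == 1]
        percent = round(100 * len(only_here) / len(subjects)) if subjects else 0
        results[hub] = (len(only_here), percent)
    return results


# Toy data standing in for the real per-hub subject sets.
subjects_by_hub = {
    "hub_a": {"Education", "Children", "Texas"},
    "hub_b": {"Education", "Texas"},
    "hub_c": {"Education", "Maps"},
}

for hub, (count, percent) in sorted(unique_to_hub(subjects_by_hub).items()):
    print(hub, count, percent)
```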


Overall there was an increase (20%) in the total occurrences of subject strings in the dataset when subject normalization was applied. The total number of unique subjects decreased significantly (38%) after subject normalization. It is easy to identify Hubs that are heavy users of LCSH subject headings because the percent change in their number of unique subjects before and after normalization is quite high; examples include the HathiTrust and the Government Printing Office. For many of the Hubs, normalization of subjects significantly reduced the number and percentage of subjects that were unique to that hub.

I hope you found this post interesting. If you want to chat about the topic, hit me up on Twitter.

Organising, building and deploying static web sites/applications / Alf Eaton, Alf

Build remotely

At the simplest end of the scale is GitHub Pages, which uses Jekyll to build the app on GitHub’s servers:

  • The config files and source code are in the root directory of a gh-pages branch.

  • Jekyll builds the source HTML/MD, CSS/SASS and JS/CS files to a _site directory - this is where the app is served from.

  • For third-party libraries, you can either download production-ready code manually to a lib folder and include them, or install with Bower to a bower_components folder and include them directly from there.

The benefit of this approach is that you can edit the source files through GitHub’s web interface, and the site will update without needing to do any local building or deployment.

Jekyll will build all CSS/SASS files (including those pulled in from bower_components) into a single CSS file. However, it doesn’t yet have something similar for JS/CoffeeScript. If this were available it would be ideal, as the bower_components folder could then be left out of the built app.

Directory structure of a Jekyll GitHub Pages app

Build locally, deploy the built app as a separate branch

If the app is being built locally, there are several steps that can be taken to improve the process:

  • Keep the config files in the root folder, but move the app’s source files into an app folder.

  • Use Gulp to build the Bower-managed third-party libraries alongside the app’s own styles and scripts.

  • While keeping the source files in the master branch, use Gulp to deploy the built app in a separate gh-pages branch.

A good example of this is the way that the Yeoman generator for scaffolding a Polymer app structures a project (other Yeoman generators are similar):

  • In the master branch, install/build-related files are in the root folder (run npm install and bower install to fetch third-party components, use bower link for any independent local components).

  • The actual app source files (index.html, app styles, app-specific elements) are in the app folder.

  • gulp builds all the HTML, CSS/SASS and JS source files to the dist folder; gulp serve makes the built files available over HTTP and reloads on changes; gulp deploy pushes the dist folder to a remote gh-pages branch.

Directory structure of a Polymer app


Reminder: Last chance to apply for Google summer fellowship / District Dispatch

Google Policy Fellows

The American Library Association’s Washington Office is calling for graduate students, especially those in library and information science-related academic programs, to apply for the 2015 Google Policy Fellows program. Applications are due by March 12, 2015.

For the summer of 2015, the selected fellow will spend 10 weeks in residence at the ALA policy office in Washington, D.C., to learn about national policy and complete a major project. Google provides the $7,500 stipend for the summer, but the work agenda is determined by the ALA and the selected fellow. Throughout the summer, Google’s Washington office will provide an educational program for all of the fellows, such as lunchtime talks and interactions with Google Washington staff.

The fellows work in diverse areas of information policy that may include digital copyright, e-book licenses and access, future of reading, international copyright policy, broadband deployment, telecommunications policy (including e-rate and network neutrality), digital divide, access to information, free expression, digital literacy, online privacy, the future of libraries generally, and many other topics.

Margaret Kavaras, a recent graduate of the George Washington University, served as the 2014 ALA Google Policy Fellow. Kavaras was appointed as an OITP Research Associate shortly after participating in the Google Fellowship program.

Further information about the program and host organizations is available at the Google Public Policy Fellowship website.

The post Reminder: Last chance to apply for Google summer fellowship appeared first on District Dispatch.

Readings & Past Exams/Reserve Online/Masterfile access issues / James Cook University, Library Tech

Not of interest to anyone outside of JCU, just using my blog to list the workarounds for a local issue:

Don't Panic / David Rosenthal

I was one of the crowd of people who reacted to Wednesday's news that Argonne National Labs would shut down the NEWTON Ask A Scientist service, on-line since 1991, this Sunday by alerting Jason Scott's ArchiveTeam. Jason did what I should have done before flashing the bat-signal. He fed the URL into the Internet Archive's Save Page Now, to be told "relax, we're all over it". The site has been captured since 1996 and the most recent capture before the announcement was Feb 7th. Jason arranged for captures Thursday and today.

As you can see by these examples, the Wayback Machine has a pretty good copy of the final state of the service and, as the use of Memento spreads, it will even remain accessible via its original URL.