Planet Code4Lib

2017 Patterson Copyright Award Winner: Jonathan Band / District Dispatch

Jonathan Band, recipient of the 2017 Patterson Award

We are pleased to announce that Jonathan Band is the 2017 recipient of the L. Ray Patterson Copyright Award. The award recognizes contributions of an individual or group that pursues and supports the Constitutional purpose of the U.S. copyright law, fair use and the public domain.

ALA President James Neal had this to say:

“Jonathan Band has guided the library community over two decades through the challenges of the copyright legal and legislative battles,” Neal said. “His deep understanding of our community and the needs of our users, in combination with his remarkable knowledge and supportive style, has raised our understanding of copyright and our commitment to balanced interpretations and applications of the law. The 2017 L. Ray Patterson Copyright Award appropriately celebrates Jonathan’s leadership, counsel and dedication.”

Band, a copyright attorney at policybandwidth and adjunct professor at Georgetown Law School, has represented libraries and technology associations before Congress, the Executive Branch and the Judicial Branch and has written public comments, testimony, amicus briefs, and countless statements supporting balanced copyright. Band’s amicus brief on behalf of the Library Copyright Alliance was quoted in the landmark Kirtsaeng v. Wiley case, where the U.S. Supreme Court ruled that the first sale doctrine applied to books printed abroad, enabling libraries to buy and lend books manufactured overseas. He also represented libraries throughout the Authors Guild v. Google litigation, whose ruling advanced the concept of transformative fair use. Band’s Google Book Settlement/Litigation flow chart, developed to explain the complexity of the case to the public, is widely cited and used worldwide.

The scope of Band’s work extends internationally as well. He has argued for balanced provisions in international trade agreements, including the Trans-Pacific Partnership, and treaties that protect users’ rights and the open internet. He represented U.S. libraries at the World Intellectual Property Organization, which after several years of negotiation adopted the Marrakesh Treaty, mandating enactment of copyright exceptions permitting the making of accessible format copies for print disabled people to help address the worldwide book famine.

(Left to right) Former ALA Washington Office Executive Director Emily Sheketoff, Jonathan Band, Brandon Butler and Mary Rasenberger.

Mr. Band has written extensively on intellectual property and electronic commerce matters, including several books and over 100 articles. He has been quoted as an authority on intellectual property and internet matters in numerous publications, including The New York Times, The Washington Post, USA Today and Forbes, and has been interviewed on National Public Radio, MSNBC and CNN.

The Patterson Award will be presented to Band by ALA President Jim Neal at a reception in Washington, D.C., in October. Several members of the D.C.-based technology policy community also will provide comments on Band’s influential career in advocating for balanced copyright policy.

The post 2017 Patterson Copyright Award Winner: Jonathan Band appeared first on District Dispatch.

Libraries Ready to Code adds capacity / District Dispatch

[[Guest post by Linda Braun, CE coordinator for the Young Adult Library Services Association. Braun has been involved with ALA’s Libraries Ready to Code initiative since its start, serving as project researcher and now assisting with the administration of the Phase III grant program.]]

Student and instructor participating in Ready to Code programming

The Libraries Ready to Code (RtC) team is growing by one! Dr. Mega Subramaniam, Associate Professor, College of Information Studies, University of Maryland, will serve as Ready to Code Fellow during 2017-2018. Dr. Subramaniam began her involvement with RtC as an advisory committee member and currently serves as co-principal investigator for RtC Phase II. She will contribute overall guidance based on her professional expertise as well as her role as an ALA member.

We are also happy to announce the RtC Selection Committee, whose members will contribute their expertise (and time!) to select a cohort of school and public libraries as part of the next phase of our multi-year initiative. The committee includes representatives of the Association for Library Service to Children (ALSC), the American Association of School Librarians (AASL), the Young Adult Library Services Association (YALSA), and the Office for Information Technology Policy (OITP). We are extremely pleased to have such a strong collaboration across different ALA units. Committee members are:

  • Michelle Cooper, White Oak Middle School Media Center, White Oak, TX (AASL)
  • Dr. Colette Drouillard, Valdosta State University, Valdosta, GA (ALSC)
  • Dr. Aaron Elkins, Texas Woman’s University, Denton, TX (AASL)
  • Shilo Halfen (Chair), Chicago Public Library, Chicago, IL (ALSC)
  • Christopher Harris, Genesee Valley Educational Partnership, Le Roy, NY (OITP)
  • Kelly Marie Hincks, Detroit Country Day School, Bloomfield Hills, MI (AASL)
  • Peter Kirschmann, The Clubhouse Network, Boston, MA (YALSA)
  • Dr. Rachel Magee, University of Illinois, Urbana Champaign, Champaign, IL (YALSA)
  • Carrie Sanders, Maryland State Library, Baltimore, MD (YALSA)
  • Conni Strittmatter, Harford County Public Library, Belcamp, MD (ALSC)

The committee is reviewing over 300 applications from across the country to design and implement learning activities that foster youth acquisition of computational thinking and/or computer science (CS) skills. Awards up to $25,000 will be made to as many as 50 libraries. Awardees will form a cohort that will provide feedback for the development of a toolkit of resources and implementation strategies for libraries to use when integrating computational thinking and CS into their activities with and for youth. The resulting toolkit will be made widely available so any library can use it at no cost. The program is sponsored by Google as part of its ongoing commitment to ensure library staff are prepared to provide rich coding/CS programs for youth.

This project is Phase III of the RtC ALA-Google collaboration. The work began with an environmental scan of youth-focused library coding activities. “Ready to Code: Connecting Youth to CS Opportunity Through Libraries,” published as a result of that work, highlights what libraries and library staff need in order to provide high-quality youth-centered computational thinking and computer science activities. Phase II of the project provides faculty at LIS programs across the United States with the opportunity to redesign a syllabus in order to integrate computational thinking and computer science into teaching and learning.

Learn more about the Libraries Ready to Code initiative and access an advocacy video, the Ready to Code report, and infographics on the project website.

The post Libraries Ready to Code adds capacity appeared first on District Dispatch.

DuraSpace Migration/ Upgrade Survey: Call for Participation / Islandora

From Erin Tripp, Business Development Manager at DuraSpace:

I’m collecting anecdotes from people who have undertaken a migration or a major upgrade in the recent past. I hope to collect stories about how the project went, what resources were used or developed during the process, and whether it turned into an opportunity to update skills, re-engage stakeholders, normalize data, re-envision the service, etc.

The data will be used by DuraSpace and affiliate open source communities to develop resources that will fill gaps identified by participants. It will also be used in presentations, blog posts, or other communications that will highlight what we can learn from each other to make migration and upgrade projects a more positive experience in the future.

The data collection will be done through mediated surveys (interview-style) with me (in person, on the phone, or via Skype). Please express your interest in participating by emailing me. Or, if you prefer, you can also fill out the survey online by yourself. The survey will close on Tuesday, October 17, 2017.

Please note: interviewee names are collected for administrative purposes only and will not appear in any published work unless the permission of the interviewee has been obtained in writing.

Here are the survey questions (* denotes a mandatory response):

  • Name
  • Role*
  • Institution
  • What repository software(s) have you migrated from?*
  • What repository software(s) have you migrated to?*
  • Is/are your repository(ies) customized?*
  • If yes, can you tell us how?
  • When did you last undertake a major migration/update?
  • Did customization impact your migration/upgrade?*
  • If yes, tell us how.
  • What were the most significant challenges of the migration/ upgrade process?*
  • Can you elaborate on the challenges faced?
  • What were the most significant benefits of the migration/ upgrade process?*
  • Can you elaborate on the benefits?
  • Were there tools or resources that helped you during the process?
  • What element(s) of the project surprised you?
  • What do you wish you knew when you started the project? *
  • What advice would you offer to others who are planning a migration/ upgrade?
  • Is there anything else you’d like to add?


Why Facets are Even More Fascinating than you Might Have Thought / Lucidworks

I just got back from another incredible Lucene/Solr Revolution, this year in Sin City (aka Las Vegas), Nevada. The problem is that there were so many good talks that I now can’t wait for the video tape to be put up on U-Tube, because I routinely had to make Hobson’s choices about which one to see. I was also fortunate to be among those presenting, so my own attempt at cramming well over an hour’s worth of material into 40 minutes will be available for your amusement and hopefully edification as well. In the words of one of my favorite TV comedians from my childhood, Maxwell Smart, I “Missed It by that much”. I ran 4 minutes, 41 seconds over the 40 minutes allotted, to be exact. I know this because I was running a stopwatch on my cell phone to keep me from doing just that. I had done far worse in my science career, cramming my entire Ph.D. thesis into a 15 minute slide talk at a Neurosciences convention in Cincinnati – but I was young and foolish then. I should be older and wiser now. You would think.

But it was in that week in Vegas that I reached the synthesis that I’m describing here – and since then have refined it a bit more, which is also why I am writing this blog post. When I conceived of the talk about a year ago, the idea was to do a sort of review of some interesting things that I had done and blogged about concerning facets. At the time, there must have been a “theme” somewhere in my head – because I remember having been excited about it, but by the time I got around to submitting the abstract four months later and finally putting the slide deck together nearly a year later, I couldn’t remember exactly what that was. I knew that I hadn’t wanted to do a “I did this cool thing, then I did this other cool thing, etc.” about stuff that I had mostly already blogged about, because that would have been a waste of everyone’s time. Fortunately the lens of pressure to get “something” interesting to say after my normal lengthy period of procrastination, plus the inspiration from being at Revolution and the previous days’ answers to “So Ted, what is your talk going to be about?” led to the light-bulb moment, just in the nick of time, that was an even better synthesis than I had had the year before (pretty sure, but again don’t remember, so maybe not – we’ll never know).

My talk was about some interesting things I had done with facets that go beyond traditional usages such as faceted navigation and dashboards. I started with these to get the talk revved up. I also threw in some stuff about the history of facet technologies both to show my age and vast search experience and to compare the terms different vendors used for faceting. At the time, I thought that this was merely interesting from a semantic standpoint, and it also contained an attempt at humor which I’ll get to later. But with my new post-talk improved synthesis – this facet vocabulary comparison is in fact even more interesting, so I am now really glad that I started it off this way (more on this later). I was then planning to launch into my Monty Python “And Now for Something Completely Different” mad scientist section. I also wanted to talk about search and language, which is one of my more predictable soapbox issues. This led up to a live performance of some personal favorite tracks from my quartet of Query Autofilter blogs (1,2,3,4), featuring a new and improved implementation of QAF as a Fusion Query Pipeline Stage (coming soon to Lucidworks Labs) and some new semantic insights gleaned from my recent eCommerce work for a large home products retailer. I also showed an improved version of the “Who’s In The Who” demo that I had attempted 2 years prior in Austin, based on cleaner, slicker query patterns (formerly Verb Patterns). I used a screenshot for Vegas to avoid the ever present demo gods which had bitten me 2 years earlier. I was not worried about the demo per se with my newly improved and more robust implementation, just boring networking issues and login timeouts and such in Fusion – I needed to be as nimble as I could be. But as I worked on the deck in the week leading up to Revolution – nothing was gelin’ yet.

The Epiphany

I felt that the two most interesting things that I had done with facets were the dynamic boosting typeahead trick from what I like to call my “Jimi Hendrix Blog” and the newer stuff on Keyword Clustering in which I used facets to do some Word-2-Vec’ish things. But as I was preparing to explain these slides – I realized that in both cases, I was doing exactly the same thing at an abstract level!! I had always been talking about “context” as being important – remembering a slide from one of my webinars in which the word CONTEXT was the only word on the slide in bold italic 72 Pt font – a slide that my boss Grant Ingersoll would surely have liked (he had teased me about my well known tendency for extemporizing at lunch before my talk) – I mean, who could talk for more than 2 minutes about one word? As one of my other favorite TV comics from the 60’s and 70’s, Bob Newhart, would say – “That … ah … that … would be me”. (But actually not in this case – I timed it – but I’m certainly capable of it.) Also, I had always thought of facets as displaying some kind of global result-set context that the UI displayed.

I had also started the talk with a discussion about facets and metadata as being equivalent, but what I realized is that my “type the letter ‘J’ into typeahead, get back alphabetical stuff starting with ‘J’ then search for “Paul McCartney”, then type ‘J’ again and get back ‘John Lennon’ stuff on top” and my heretically mad scientist-esque “facet on all the tokens in a big text field, compute some funky ratios of the returned 50,000 facet values for the ‘positive’ and ‘negative’ queries for each term and VOILA get back some cool Keyword Clusters” examples were based ON THE SAME PRINCIPLE!!! You guessed it: “context”!!!

So, what do we actually mean by “context”?

Context is a word we search guys like to bandy around as if to say, “search is hard, because the answer that you get is dependent on context” – in other words it is often a hand-waving, i.e. B.S. term for “it’s very complicated”. But seriously, what is context? At the risk of getting too abstractly geeky – I would say that ‘context’ is some place or location within some kind of space. Googling the word got me this definition:

“the circumstances that form the setting for an event, statement, or idea, and in terms of which it can be fully understood and assessed.”

Let me zoom in on “setting for an event” as being roughly equivalent to my original more abstract-mathematical PhD-ie (pronounced “fuddy”) “space” notion. In other words, there are different types of context – personal, interpersonal/social/cultural, temporal, personal-temporal (aka personal history), geospatial, subject/categorical – and you can think of them as some kind of “space” in which a “context” is somewhere within that larger space – i.e. some “subspace” as the math and Star Trek geeks would say (remember the “subspace continuum” Trek fans?) – I love this geeky stuff of course, but I hope that it actually helps ‘splain stuff too … The last part, “in terms of which it can be fully understood and assessed”, is also key and resonates nicely with the Theorem that I am about to unfold.

In my initial discussion on facets as being equivalent to metadata, the totality of all of the facet fields and their values in a Solr collection constitutes some sort of global “meta-informational space”. This led to the recollection/realization that this was why Verity called this stuff “Parametric Search” and why Endeca called these facet things “Dimensions”. We are dealing with what Math/ML geeks would call an “N-Dimensional hyperspace” in which some dimensions are temporal, some numerical and some textual (whew!). Don’t try to get your head around this – again, just think of it as a “space” in which “context” represents some location or area within that space. Facets then represent vectors or pointers into this “meta-informational” subspace of a collection based on the current query and the collected facet values of the result set. You may want to stop now, get something to drink, watch some TV, take a nap, come back and read this paragraph a few more times before moving on. Or not. But to simplify this a bit (what me? – I usually obfuscate) – let’s call the set of facets and their values returned from a query the “meta-informational context” for that query. So that is what facets do, in a kinda-sorta geeky descriptive way. Works for me and hopefully for you too. In any case, we need to move on.

So, getting back to our example – throw in a query or two and for each, get this facet response which we are now calling the result set’s “meta-informational context” and take another look at the previous examples. In the first case, we were searching for “Paul McCartney” – storing this entity’s meta-informational context and then sending it back to the search engine as a boost query and getting back “John Lennon” related stuff. In the second case, we were searching for each term in the collection, getting back the meta-informational context for that term and then comparing that term’s context with that of all of the other terms that the two facet queries return and computing a ratio, in which related terms have more contextual overlap for the positive than the negative query – so that two terms with similar contexts have high ratios and those with little or no contextual overlap would have low ratio values hovering around 1.0.
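
The first case above (store an entity’s facet context, then send it back as a boost query) can be sketched roughly as follows. This is a minimal illustration, not the implementation from the “Jimi Hendrix Blog”: the field name `memberOfGroup_ss`, the boost value and the `top_n` cutoff are all hypothetical, and the sketch only builds the string that would be passed as the `bq` parameter of a dismax/edismax request.

```python
def boost_query_from_facets(facet_context, boost=10.0, top_n=3):
    """Turn a stored facet response into a boost-query string.

    facet_context maps a field name to {facet value: count}, i.e. the
    "meta-informational context" captured from an earlier search.
    Field names, boost and top_n are illustrative assumptions.
    """
    clauses = []
    for field, counts in facet_context.items():
        # Keep only the most frequent values for each facet field.
        top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        for value, _count in top:
            # Escape backslashes and quotes so the value is a safe phrase.
            escaped = value.replace("\\", "\\\\").replace('"', '\\"')
            clauses.append(f'{field}:"{escaped}"^{boost}')
    return " ".join(clauses)


# Hypothetical context captured after searching for "Paul McCartney".
paul_context = {"memberOfGroup_ss": {"The Beatles": 120, "Wings": 45}}
bq = boost_query_from_facets(paul_context)
```

Sending `bq` along with the next typeahead request for “J” would then favor documents that share those facet values, which is how “John Lennon” can rise above other entries starting with “J”.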

Paul McCartney and John Lennon are very similar entities in my Music Ontology and two words that are keywords in the same subject area also have very similar contexts in a subject-based “space” – so these two seemingly different tricks appear to be doing the same thing – finding similar things based on the similarity of their meta-informational contexts – courtesy of facets! Ohhhh Kaaaaay … Cool! – I think we’re on to something here!!
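
The keyword-clustering ratio can be sketched in a similar spirit. The post does not spell out the exact formula, so the version below is one plausible reading and should be taken as an assumption: for each candidate term, compare how often it appears among documents matching the “positive” query with how often it appears among documents matching the “negative” query. Unrelated terms then land near 1.0 and related terms score well above it. The counts and the threshold are made up for illustration.

```python
def context_ratios(pos_counts, pos_total, neg_counts, neg_total):
    """Ratio of a term's relative frequency in the positive result set
    to its relative frequency in the negative result set.

    pos_counts / neg_counts map term -> facet count; the totals are the
    number of documents matched by each query. Assumed formula, not
    taken from the original post.
    """
    ratios = {}
    for term, pos in pos_counts.items():
        neg = neg_counts.get(term, 0)
        if neg == 0 or pos_total == 0 or neg_total == 0:
            continue  # ratio undefined; simply skipped in this sketch
        ratios[term] = (pos / pos_total) / (neg / neg_total)
    return ratios


def related_terms(ratios, threshold=2.0):
    """Terms whose ratio is at or above an (arbitrary) threshold, best first."""
    keep = [t for t, r in ratios.items() if r >= threshold]
    return sorted(keep, key=lambda t: ratios[t], reverse=True)


# Made-up facet counts for a seed term: 100 docs contain it, 900 do not.
pos = {"guitar": 40, "the": 100}
neg = {"guitar": 10, "the": 900}
ratios = context_ratios(pos, 100, neg, 900)
```

Here “the” occurs everywhere and so sits at 1.0, while “guitar” is heavily concentrated in the positive set and would be kept as a related keyword.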

The Facet Theorem

So to boil all of this to an elevator speech – single takeaway slide, I started to think of it as a Theorem in Mathematics – a set of simple, hopefully self-evident assumptions or lemmas that when combined give a cool, and hopefully surprising result. So here goes.

Lemma 1: Similar things tend to occur in similar contexts

Nice. Kinda obvious, intuitive and I added the “tend to” part to cover any hopefully rare contrary edge cases but as this is a statistical thing we are building, that’s OK. Also, I want to start slow with something that seems self-evident to us like “the shortest distance between two points is a straight line” from Euclidean Geometry.

Lemma 2: Facets are a tool for exploring meta-informational contexts

OK, that is what we have just gone through space and time warp explanations to get to, so let’s put that in as our second axiom.

In laying out a Theorem we now go to the “it therefore follows that”:

Theorem: Facets can be used to find similar things.

Bingo, we have our Theorem and we already have some data points – we used Paul McCartney’s meta-informational context to find John Lennon, and we used facets to find related keywords that are all related to the same subject area (part 2 document clustering blog is coming soon, promise). So it seems to be workable. We may not have a “proof” yet, but we can use this initial evidence to keep digging for one. So let’s keep looking for more examples and in particular for examples that don’t seem to fit this model. I will if you will.

Getting to The Why

So this seems to be a good explanation for why all of the crazy but disparate-seeming stuff that I have been doing with facets works. To me, that’s pretty significant, because we all know that when you can explain “why” something is happening in your code, you’ve essentially got it nailed down, conceptually speaking. It also gets us to a point where we can start to see other use cases that will further test the Facet Theorem (remember, a Theorem is not a Proof – but it’s how you need to start to get to one). When I think of some more of them, I’ll let you know. Or maybe some optimizations to my iterative, hard to parallelize method.

Facets and UI – Navigation and Visualization

Returning to the synonyms search vendors used for facets – Fast ESP first called these things ‘Navigators’ which Microsoft cleverly renamed to ‘Refiners’. That makes perfect sense for my synthesis – you navigate through some space to get to your goal, or you refine that subspace which represents your search goal – in this case, a set of results. Clean, elegant, it works, I’ll take it. The “goal” though is your final metadata set which may represent some weird docs if your precision sucks – so the space is broken up like a bunch of isolated bubbles. Mathematicians have a word for this – disjointed space. We call it sucky precision. I’ll try to keep these overly technical terms to a minimum from now on, sorry.

As to building way cool interactive dashboards, that is facet magic as well, where you can have lots of cool eye candy in the form of pie charts, bar charts, time-series histograms, scatter plots, tag clouds and the super way cool facet heat maps. One of the very clear advantages of Solr here is that all facet values are computed at query time and are computed wicked fast. Not only that, you can facet on anything, even stuff you didn’t think of when you designed your collection schema, through the magic of facet and function queries and ValueSource extensions. Endeca could do some of this too, but Solr is much better suited for this type of wizardry. This is “surfin’ the meta-informational universe” that is your Solr collection. “Universe” is apt here because you can put literally trillions of docs in Solr and it also looks like the committers are realizing Trey Grainger’s vision of autoscaling Solr to this order of magnitude, thus saving many intrepid DevOps guys and gals their nights and weekends! (Great talk as usual by our own Shalin Mangar on this one. Definitely a must-see on the Memorex versions of our talks if you didn’t see his excellent presentation live.) Surfin’ the Solr meta-verse rocks baby!

Facets? Facets? We don’t need no stinkin’ Facets!

To round out my discussion of what my good friend the Search Curmudgeon calls the “Vengines” and their terms for facets, I ended that slide with an obvious reference to everyone’s favorite tag line from the John Huston/Humphrey Bogart classic The Treasure of the Sierra Madre, with the original subject noun replaced with “Facet”. As we all should know by now, Google uses Larry’s page ranking algorithm, also known as Larry Page’s ranking algorithm – to wit PageRank, which is a crowdsourcing algorithm that works very well with hyperlinked web pages but is totally useless for anything else. Google’s web search relevance ranking is so good (and continues to improve) that most of the time you just work from the first page so you don’t need no stinkin’ facets to drill in – you are most often already there and what’s the difference between one or two page clicks vs one or two navigator clicks?

I threw in Autonomy here because they also touted their relevance as being auto-magical (that’s why their name starts with ‘Auto’) and to be fair, it definitely is the best feature of that search engine (the configuration layer is tragic). This marketing was especially true before Autonomy acquired Verity, who did have facets, after which it was much more muddled/wishy washy. One of the first things they did was to create the Fake News that was Verity K2 V7 in which they announced that the APIs would be “pin-for-pin compatible” to K2 V6 but that the core engine would now be IDOL. I now suspect that this hoax was never really possible anyway (nobody could get it to work) because IDOL could not support navigation, aka facet requests – ’cause it didn’t have them anywhere in the index!! Maybe if they had had Yonik … And speaking of relevance, like the now historical Google Search Appliance “Toaster”, relevance that is autonomous as well as locked down within an intellectual property protection safe is hard to tune/customize. Given that what is relevant is highly contextual – this makes closed systems such as Autonomy and GSA unattractive compared to Solr/Lucene.

But it is interesting that the two engines that consider relevance to be their best feature, eschew facets as unnecessary – and they certainly have a point – facets should not be used as a band-aid for poor relevance in my opinion. If you need facets to find what you are looking for, why search in the first place? Just browse.  Yes Virginia, user queries are often vague to begin with and faceted navigation provides an excellent way to refine the search, but sacrificing too much precision for recall will lead to unhappy users.  This is especially true for mobile apps where screen real estate issues preclude extensive use of facets. Just show me what I want to see, please! So sometimes we don’t want no stinkin’ facets but when we do, they can be awesome.

Finale – reprise of The Theorem

So I want to leave you with the take home message of this rambling, yet hopefully enlightening blog post, by repeating the Facet Theorem I derived here: Facets can be used to find similar things. And the similarity “glue” is one of any good search geek’s favorite words: context. One obvious example that we have always known before, just as Dorothy instinctively knew how to get home from Oz, is in faceted navigation itself – all of the documents that are arrived at by facet queries must share the metadata values that we clicked on – so they must therefore have overlapping meta-informational contexts along our facet click’s navigational axes! The more facet clicks we make, the “space” of remaining document context becomes smaller and their similarity greater! We can now add this to our set of use cases that support the Theorem, along with the new ones I have begun to explore such as text mining, dynamic typeahead boosting and typeahead security trimming. Along these lines, a dashboard is just a way cooler visualization of this meta-informational context for the current query + facet query(ies) within the global collection meta-verse, with charts and histograms for numeric and date range data and tag clouds for text.

So to conclude, facets are fascinating, don’t you agree? And the possibilities for their use go well beyond navigation and visualization. Now to get the document clustering blog out there – darn day job!!!

The post Why Facets are Even More Fascinating than you Might Have Thought appeared first on Lucidworks.

Apache Solr 7 Ready for Download / Lucidworks

While we lived it up in Vegas at Lucene/Solr Revolution 2017, the Lucene PMC announced the release of Apache Solr 7.0.0. Download.

Here’s a webinar walking through what’s new in Solr 7.

From the release announcement:

Highlights for this Solr release include:

  • Replica Types – Solr 7 supports different replica types, which handle updates differently. In addition to pure NRT operation where all replicas build an index and keep a replication log, you can now also add so called PULL replicas, achieving the read-speed optimized benefits of a master/slave setup while at the same time keeping index redundancy.
  • Auto-scaling. Solr can now allocate new replicas to nodes using a new auto scaling policy framework. This framework will in future releases enable Solr to move shards around based on load, disk etc.
  • Indented JSON is now the default response format for all APIs, pass wt=xml and/or indent=off to use the previous unindented XML format.
  • The JSON Facet API now supports two-phase facet refinement to ensure accurate counts and statistics for facet buckets returned in distributed mode.
  • Streaming Expressions adds a new statistical programming syntax for the statistical analysis of sql queries, random samples, time series and graph result sets.
  • Analytics Component version 2.0, which now supports distributed collections, expressions over multivalued fields, a new JSON request language, and more.
  • The new v2 API, exposed at /api/ and also supported via SolrJ, is now the preferred API, but /solr/ continues to work.
  • A new ‘_default’ configset is used if no config is specified at collection creation. The data-driven functionality of this configset indexes strings as analyzed text while at the same time copying to a ‘*_str’ field suitable for faceting.
  • Solr 7 is tested with and verified to support Java 9.
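
A couple of the highlights above translate directly into request parameters. The sketch below only builds parameters and URLs (it sends nothing): one helper asks for the previous unindented XML output via `wt=xml` and `indent=off`, and another builds a JSON Facet API terms facet with refinement switched on. The host, the collection name `products`, the field name `cat` and the facet label are hypothetical, and `refine` is written here as the per-facet option as I understand it from the Solr 7 documentation.

```python
import json
from urllib.parse import urlencode

SOLR_BASE = "http://localhost:8983"  # assumed local Solr 7 node
COLLECTION = "products"              # hypothetical collection name


def select_url(params):
    """Build a /select URL on the classic /solr/ path (not sent anywhere)."""
    return f"{SOLR_BASE}/solr/{COLLECTION}/select?{urlencode(params)}"


def legacy_xml_params(query):
    """Request the pre-7.0 style response: XML, without indentation."""
    return {"q": query, "wt": "xml", "indent": "off"}


def refined_terms_facet(field, limit=10):
    """JSON Facet API terms facet with refinement enabled for distributed
    collections. The facet label 'by_<field>' is an arbitrary choice."""
    facet = {
        f"by_{field}": {
            "type": "terms",
            "field": field,
            "limit": limit,
            "refine": True,
        }
    }
    return {"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)}


xml_url = select_url(legacy_xml_params("*:*"))
facet_url = select_url(refined_terms_facet("cat"))
```

Either URL could then be fetched with any HTTP client; leaving out `wt` and `indent` falls back to the new indented JSON default.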

Full release notes. 

The post Apache Solr 7 Ready for Download appeared first on Lucidworks.

Second Evergreen 3.0 beta released + OpenSRF 3.0.0-alpha / Evergreen ILS

The second beta release of Evergreen 3.0 is now available for testing from the downloads page.

The second beta includes the following changes since the first beta release:

  • Support for Debian Stretch
  • Various improvements to the translation system.
  • Various bug fixes; the complete list can be found on Launchpad.

This release now requires OpenSRF 3.0.0. An alpha release of OpenSRF 3.0 is available today as well.

OpenSRF 3.0.0-alpha also adds support for Debian Stretch. OpenSRF also changes how services written in the C programming language are loaded, and as a consequence, testers of the second Evergreen 3.0 beta release must plan on installing or upgrading to OpenSRF 3.0.0-alpha.  If you are upgrading a test system that used an older version of OpenSRF, please note the instructions for updating opensrf.xml.

Evergreen 3.0 will be a major release that includes:

  • community support of the web staff client for production use
  • serials and offline circulation modules for the web staff client
  • improvements to the display of headings in the public catalog browse list
  • the ability to search patron records by date of birth
  • copy tags and digital bookplates
  • batch editing of patron records
  • better support for consortia that span multiple time zones
  • and numerous other improvements

For more information on what’s coming in Evergreen 3.0.0, please read the updated draft of the release notes.

Users of Evergreen are strongly encouraged to use the beta release to test new features and the web staff client; bugs should be reported via Launchpad. A release candidate is scheduled to be made on 27 September.

Evergreen admins installing the beta or upgrading a test system to the beta should be aware of the following:

  • The minimum version of PostgreSQL required to run Evergreen 3.0 is PostgreSQL 9.4.
  • Evergreen 3.0 requires that the open-ils.qstore service be active.
  • SIP2 bugfixes in Evergreen 3.0 require an upgrade of SIPServer to be fully effective.
  • There is no database upgrade script to go from 3.0-beta1 to 3.0-beta2. While we recommend testing an upgrade by starting from a 2.12 test system, if you want to go from beta1 to beta2, you can apply the 1076.function.copy_vis_attr_cache_fixup.sql schema update.

Evergreen 3.0.0 will be a large, ambitious release; testing during the beta period will be particularly important for a smooth release on 3 October.

On the Road to 3.0: Webby on Mobile / Evergreen ILS

The second in our series of videos highlighting features of 3.0 is now available on our YouTube channel: Webby on Mobile! Instead of highlighting a distinct feature, this video shows off the 3.0 staff client on a mobile device and its high degree of parity with how it can be used on a desktop or laptop computer.

While there, please subscribe to our YouTube channel; we only need 28 more subscriptions to qualify for a custom URL!


#evergreen #evgils

Evergreen 2.11.9 and 2.12.6 released / Evergreen ILS

The Evergreen community is pleased to announce two maintenance releases of Evergreen: 2.11.9 and 2.12.6.

Evergreen 2.12.6 has the following changes improving on Evergreen 2.12.5:

  • Removes the option to add a title to My List from Group Formats and Editions searches where the option never worked correctly due to a bad id.
  • Removes deleted shelving locations from the web client’s volume/copy editor.
  • Adds the patron opt-in check in the web client whenever a patron is retrieved by barcode scan, patron search, or item circ history.
  • Fixes a bug where the price and acquisitions cost fields did not display their values.
  • Fixes a bug where a patron’s circulation history no longer moved to the lead account when merging patron accounts.
  • Now hides the ebook tabs in My Account for sites that have not yet enabled the Ebook API service.
  • Trims spaces from patron barcodes in the web client check out interface.
  • Makes a string in the holds validation alert translatable.
  • Fixes a bug that prevented the web client patron registration screen from loading when an opt-in action trigger, such as the email checkout receipt, is set to be a registration default.
  • Fixes a bug where barcode validation in the web client patron editor was using the incorrect regular expression.
  • Replaces an empty string in the mobile carrier dropdown menu with a Please select your mobile carrier label to improve usability and resolve a problem with translations.
  • Restores the ability to display a photo in the web client patron editor for accounts that have an actor.usr.photo_url.
  • Fixes a Firefox display issue in the web client that occurred when retrieving a bib record by TCN when the MARC Edit tab was set as the default view.
  • Fixes a bug where setting a patron’s default pickup location in the web client patron editor inadvertently changed the home library. It also disables any locations that are not viable pickup locations.
  • Fixes a bug where a misscan in a copy bucket failed silently.

Evergreen 2.11.9 has the following changes improving on 2.11.8:

  • The option to add a title to My List is removed from Group Formats and Editions searches where the option never worked correctly due to a bad id.

Please visit the downloads page to view the release notes and retrieve the server software and staff clients.

Starting the Conversation with the LIT Tech Diversity Reading Club / Library Tech Talk (U of Michigan)


In line with the University of Michigan Library's strategic plan to support diversity, individuals in the Library Information Technology division started a Diversity Reading Club where colleagues can come together to learn about and discuss readings on the subject. The Reading Club has been going for over a year and a half, and we discuss what it is and why we think it works.

It's the 7.x-1.10 Code Freeze! / Islandora

Hello, collaborators,
As of last night / this morning, we have officially frozen code for the 7.x-1.10 release. This means:
  • Any bug fixes or documentation tickets now need two pull requests, one on the 7.x branch and one on 7.x-1.10. 
  • Any improvement or new feature pull requests can go on the 7.x branch, and will be part of the next (early 2018) release. 
  • If you like, you can check out the 7.x-1.10 branches and start to test! A full release candidate (RC) Virtual Machine will be prepared over the next ten days and will be available on Sept 29. 
Kudos and thank you 
To all of our intrepid auditors: Janice, Melissa, Mark, Neil, Matthew, Chris, Rachel, Caleb, Bayard, Keila, Charles, and Devin, thank you for getting the README and LICENSE audits done before code freeze. Thanks especially to those who jumped in as first-time GitHub contributors. GitHub ain't easy, and a step-by-step tutorial on how to make a PR is forthcoming. Thanks also to the community who've been working on improvements and bug fixes all along and especially preparing for code freeze. The full list of improvements and new features that are marked as 7.x-1.10 are here, though there may be more that haven't yet had the fix version applied. (Friendly reminder because we all forget: when you merge a pull request, close the Jira ticket too and include a fix version.)
What kept us up:
The last improvement to make it into the release was a ticket from June 2015 that allows other modules to hook into the results of the checksum checker. We also fought Travis over the last few weeks because an update to their default distributions broke our testing framework. Thank you to Jonathan and Jared who helped non-stop with that.
Next steps
There are 157 open bugs and documentation tickets so let's see what we can get fixed for this release! 
Your 7.x-1.10 Release Team
Rosie and Diego

Jobs in Information Technology: September 20, 2017 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Yale University, Director of Monographic Processing Services, New Haven, CT

Yale University, Director of Resource Discovery Services, New Haven, CT

Yale University, Director of E-Resources and Serials Management, New Haven, CT

Marquette University Libraries, Systems Librarian, Milwaukee, WI

LYRASIS, DevOps Specialist, Atlanta, GA

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

The Islandora Roadmap Committee is now the Islandora Coordinating Committee / Islandora

The Islandora Roadmap Committee is changing its name to the Islandora Coordinating Committee. This change is intended to address some confusion about the role of this important group, which meets up every other week to work on how our community works together. They are the body primarily responsible for our Code of Conduct, workflows, and for overseeing and liaising with our many Interest Groups. Their full Terms of Reference are here.

They do a lot of important work helping this community to support our mutual goals for Islandora. The term “Roadmap,” however, can imply a level of oversight over technical direction that the group has not practiced since before the Islandora Foundation was formed in 2013. The new name better reflects the role they play in Islandora's governance now, and was driven in particular by looking at this definition of “coordinate”: 

Bring the different elements of (a complex activity or organization) into a harmonious or efficient relationship. (Oxford Dictionary)

The Islandora Coordinating Committee has been working for years to help bring the many elements of the Islandora community together into a harmonious and efficient relationship. We hope that the new name helps to make that easier to see.

The active members of the Islandora Coordinating Committee are:

  • Donald Moses, UPEI (legacy)
  • Melissa Anez, Islandora Foundation
  • Danny Lamb, Islandora Foundation
  • David Wilcox, DuraSpace (legacy)
  • Anna St.Onge, York University
  • Kirsta Stapelfeldt, University of Toronto Scarborough (legacy)
  • Greg Colati, University of Connecticut
  • Jonathan Green, LYRASIS
  • Will Panting, discoverygarden Inc.
  • Gabriela Mircea, McMaster University
  • Gavin Morris, Born-Digital
  • Rosie Lefaive, UPEI
  • Mark Baggett, University of Tennessee Knoxville
  • Mark Jordan, Simon Fraser University
  • Bryan Brown, Florida State University
  • Jordan Fields, Marmot Library Network
  • Diego Pino, METRO
  • Kim Pham, University of Toronto Scarborough
  • David Keiser-Clark, Islandora Collaboration Group

Welcoming Jer Thorp as Innovator-in-Residence / Library of Congress: The Signal

This week, acclaimed data artist Jer Thorp began his tenure as the 2017 Library of Congress Innovator-in-Residence. He will spend six months with the National Digital Initiatives team exploring the Library’s digital collections and creating an art piece that will be displayed in the Library’s public spaces.

Jer Thorp speaking at the Collections as Data symposium, September 27, 2016. Photo by Shawn Miller.

Jer Thorp is an artist and educator from Vancouver, Canada, currently living in New York. Coming from a background in genetics, his digital art practice explores the many-folded boundaries between science, data, art and culture. His work has appeared in a wide variety of publications, including Scientific American, The New Yorker, Popular Science, Fast Company, Business Week, Discover, WIRED and The Harvard Business Review. From 2010 to 2012, Jer was the Data Artist-in-Residence at the New York Times.

Jer’s data-inspired artwork has been shown around the world, including most recently in New York’s Times Square, at the Museum of Modern Art in Manhattan, at the Ars Electronica Center in Austria and at the National Seoul Museum in Korea. In 2009, Jer designed a custom algorithm which was used to place the nearly 3,000 names on the 9/11 Memorial in Manhattan.

Jer’s talks have been watched by more than a half-million people. He is a frequent speaker at high profile events such as PopTech and The Aspen Ideas Festival. Recently, he has spoken about his work at MIT’s Media Lab, The American Museum of Natural History, MoMA and NASA’s Jet Propulsion Lab (JPL) in Pasadena.

Jer is a National Geographic Fellow, a Rockefeller Foundation Fellow and an alumnus of the World Economic Forum’s Global Agenda Council on Design and Innovation. He is an adjunct Professor in New York University’s renowned Interactive Telecommunications Program (ITP) and is the Co-Founder of The Office for Creative Research. In 2015, Canadian Geographic named Jer one of Canada’s Greatest Explorers.

Jer will be sharing his goals, progress and process in blogs and podcasts over the coming months, so please stay tuned to The Signal and the @LC_Labs feed to learn more about what he will create. “I am immensely excited for the chance to work at the Library of Congress. To dig deep into its astounding archives, to collaborate with its extraordinary librarians, researchers and technologists, and to engage in meaningful ways with its millions of visitors – it’s a dream come true,” adds Jer.

The Library is establishing a broad Innovator-in-Residence Program to support new and creative uses of digital collections that showcase how the Library enriches the work, life, and imagination of the public. Last year’s inaugural Innovators were Library staff members Tong Wang and Chris Adams. During Tong’s time with NDI, he created the Beyond Words crowdsourcing application which has just launched as a pilot. NDI targeted artists for the 2017 Innovator-in-Residence program to create public, engaging and informative pieces from Library of Congress collections and data. In 2018, look for an announcement about the first open call to apply for a residency. We are especially interested in hearing from journalists and writers who want to utilize the Library’s historical data sets for in-depth reporting. The Innovators and their work embody the kind of exploration that the program is meant to encourage. We are thrilled to welcome Jer as our first artist Innovator-in-Residence and look forward to sharing with you the beautiful and thought-provoking work we are sure he will produce.

Dear Congress, here’s how to ensure public access to government information / District Dispatch

As previously reported, Congress’ Committee on House Administration is currently examining Title 44 of the U.S. Code. That’s the law that governs the Federal Depository Library Program (FDLP) and the Government Publishing Office (GPO).

On Sept. 15, ALA President Jim Neal sent the committee a letter highlighting the vital role of libraries and the FDLP in providing equitable and long-term access to a wealth of information resources created by the federal government. As President Neal put it:

Libraries help the public find, use and understand government information. Through their decades-long partnership with the FDLP, libraries collect, catalog, preserve and provide reference services to support a wide array of users, including business owners, lawyers, researchers, students of every age and citizens.

However, the law designed to support those important activities has become less and less in tune with changing library and information practices. Many provisions of Title 44 were last revised in the 1960s and are understandably but badly out of step with the way Americans use, and libraries want and need to provide, information in the 21st century.

In order to ensure the public’s continued access to government information, ALA has made specific and detailed recommendations to Congress that it modernize Title 44 to:

  1. Strengthen library partnerships for public access to federal publications;
  2. Ensure the long-term preservation of federal publications; and
  3. Improve the collection and distribution of digital publications.

On Sept. 26, the committee will hold a hearing to discuss the FDLP – Congress’ first hearing on the program in 20 years. ALA looks forward to this discussion and appreciates the committee’s interest in exploring these issues, with particular thanks to chairman Rep. Gregg Harper (R-MS3) and ranking member Rep. Robert Brady (D-PA1). We hope this hearing will identify directions in which reform legislation might productively go to strengthen the program and help libraries better connect Americans to their government.


Creating Connections: How Libraries Can Use Exhibits to Welcome New Students / In the Library, With the Lead Pipe

In Brief: 

Feelings of loneliness are common among first-year college students during the start of the academic year. Academic and social integration into the campus community—both factors that can positively affect student retention—are critical yet difficult for any one group to manage. Grand Valley State University Libraries expanded its reach to help foster student engagement through an immersive, multifaceted exhibit showcasing personal stories of students through illustrations and audio recordings. Participants also had an opportunity to contribute to a mural. The exhibit, which ran for the first six weeks of the fall semester, provided students with novel ways to connect and identify with their peers. We will highlight an innovative approach to cultivating student belonging and detail how an exhibit can strengthen the library’s institutional relevance.

by Emily Frigo, Erin Fisher, Gayle Schaub, and Cara Cadena

The Experience

You’re a college freshman living away from home for the first time. You don’t know many people. It’s the week before school, and you’ve been through several campus orientations. You decide to go to the library with your laptop, get a cup of coffee, and plan your first week. On the way in, you pass a glass-enclosed gallery and see a large mural of colored dots. Text on the door reads, “Connected: An Exhibit of Shared Laker Experiences.” Curious, you decide to check it out.

A wall of the Connected exhibit included paper dots that participants painted with watercolors to portray their emotions.

Copyright Grand Valley State University, 2016

Upon entering you see a poster which reads, “We’re all human. We all have stories. In an increasingly noisy world, we may not always truly hear each other. Through shared stories, we can find connections, find community, and find ourselves.” Watercolor portraits of eight current students hang on the walls. You read quotes from these students and learn something personal about each one. Matthew, an international student, shares that it’s been difficult learning to cook for himself. “I usually just eat cereal,” his story explains. Another student mentions how her relationship with her mother has improved since she moved away from home. Each anecdote is vastly different from the next, but they all read as authentic. The emotive power of watercolor brings each person to life in a unique way.

Exhibit visitors could listen to stories of first year experiences.

Copyright Grand Valley State University, 2016

Then you see two iPads mounted on the wall, each with a set of headphones, inviting you to hear stories from more Grand Valley State University students. You put on the headphones and meet Elyse, 28, who just finished her first year at GVSU. After dropping out of high school and taking classes at a local community college in her early twenties, Elyse looks at education differently than she once did. She is a highly motivated, successful student who, after graduating, plans to pursue a master’s degree in journalism and someday ride a motorbike through Vietnam.

Next you listen to Vanessa’s story. She graduated in 2017 with a degree in allied health sciences and a minor in criminal justice. She plans to pursue a master’s degree in public health. Growing up bilingual in a small town made Vanessa’s transition to GVSU a bit of a challenge at first. Seeking out (and receiving) grants, getting involved with student support services, and being her own strongest advocate, Vanessa has become an amazing example for others on how to succeed at college, no matter what your background.

Through each of these stories, you start to realize that GVSU is more than just 25,000 students; it’s 25,000 individuals who also didn’t know what they were doing when they were freshmen, but who persisted, asked questions, and eventually met goals they didn’t realize they had.

A table in the exhibit space was dedicated to painting emotion dots.

Copyright Grand Valley State University, 2016

You hang up the headphones and find a large table inviting you to express how you’re feeling about the new school year by painting a white paper dot in watercolor. Tips for painting in watercolor and a color wheel of emotions sit upright on the table—nervous is orange, hopeful is light green. Ten other emotions span the wheel. You tape your dot to the collective mural. You notice that many others have filled their dots with similar colors, and you begin to feel less alone. You’ve made your mark on campus, one of many you will make, signaling the beginning of your college experience.


“Connected: An Exhibit of Shared Laker Experiences” was deliberately designed to support students at a key transition point—the start of the school year—by fostering social engagement and cultivating a sense of belonging, both of which can ease their acclimatization to college.1 The exhibit, designed and curated by Erin Fisher, Gayle Schaub, Cara Cadena, and Emily Frigo, signaled to students that the Mary Idema Pew Library Learning and Information Commons is full of dynamic and accessible spaces, all intended to help them thrive. It proved to be a novel and meaningful way to reinforce the University’s mission of supporting students. This article describes the exhibit itself and details the collaborative and participatory strategies used to engage visitors and build community through creative expression. While the exhibit has a student-centric focus, the design strategies and overarching philosophy can be adapted in all types of libraries.

Background & Rationale

Grand Valley State University (GVSU) is a comprehensive university committed to providing students with a broad-based liberal education. The University Libraries demonstrates its student-centered focus with continual study of space usage in its buildings, robust patron-driven acquisitions, a peer research consultant program, responsive web design, and a curricular-based library instruction program. Faculty and staff strive to identify and provide support for students at their points of need.

The Mary Idema Pew Library, which opened in 2013, exemplifies a student-centered focus through both form and function. It was designed based on research of student study habits, preferences, and needs. The physical spaces accommodate students’ desire for flexibility and comfort; the furniture is moveable, outlets are never more than a few feet away, ample natural light fills the space, and there is a wide range of seating options.

The building also includes dedicated spaces for events and exhibits in the hopes that students from all disciplines engage in moments of learning outside of the classroom. Library programming is also intended to enliven the atmosphere and signal that the library is a vibrant community gathering space. This includes the Gary and Joyce DeWitt Exhibition Space, the installation space for the Connected exhibit; it is centrally located and glass enclosed to encourage drop-in viewing. However, observations showed that few students visited the gallery outside of formally scheduled programming. Anecdotally, students have said they are unsure whether they are allowed in.

More broadly, some students and faculty report that the Mary Idema Pew Library can be an intimidating place. We wondered if library anxiety related to the building may be a factor inhibiting students from fully engaging with our spaces and thus our services. Libraries have long participated in orientation programs, summer bridge programs, and more, to raise awareness of the library and help students transition to college. Common across all these programs is the goal to create a positive experience with the library and thus help alleviate library anxiety.2

Erin, Library Program Manager, was searching for a creative and compelling way to show students that the gallery, like all other spaces in the library, belongs to them. Emily, First Year Initiatives Coordinator, wanted a unique way to welcome first-year students to campus and to the library.

Student Focus

The start of college is a crucial transition point for freshmen. Fisher and Hood’s assertion, the most frequently cited to date in the literature, is that homesickness sets in after the first couple of weeks of the term.3 Feelings of ineptitude and isolation can negatively impact a student’s ability to succeed in college.4 According to the GVSU MapWorks5 2013 and 2014 survey results,6 Grand Valley students tend to score lower than students from peer institutions in the areas of academic and social integration, both factors that can impact student retention.7 Engaging with peers is integral to a student’s successful transition.8 A participatory exhibit was an innovative way for students to connect with their peers and help normalize the emotions that accompany the start of the school year.

While GVSU has a robust library instruction program and information literacy is integrated into the General Education curriculum, it does not have a First Year Experience (FYE) program. Without an FYE program, support services are distributed across the Division of Student and Academic Affairs, making it challenging for GVSU Libraries to collaborate and integrate on campus. One of the goals of the exhibit was to raise awareness of the library’s support services among other campus units. To see the true value of the library, the campus community needed to see beyond the beautiful, light-filled building to appreciate the staff and services that undergird it.

Exhibit Execution

In August 2015, Erin and Emily’s outreach efforts manifested in an exhibit titled “Letters for Lakers.” The exhibit encouraged visitors to take a letter, leave a letter, send a letter. Approximately 50 unique letters containing encouraging messages, reflections, and memories related to the college experience were written by GVSU faculty, students, and staff. These letters were reproduced to fill 300 or so envelopes that hung on the gallery walls. In total, 220 letters were taken. Mailboxes were set up and blank letterhead sat on a table for students to write letters to their future selves. 165 students participated in this activity. All 1,000 postcards printed were taken. Student visitors were also encouraged to use sticky notes to leave encouraging messages for one another. 183 sticky notes were contributed. Student participation with the exhibit exceeded expectations, so plans were made to create a subsequent exhibit in 2016.

The Connected exhibit took place in a room students typically avoided because they were unsure whether they were allowed in the space.

Copyright Grand Valley State University, 2016

Colleagues Cara Cadena and Gayle Schaub, Liaison Librarians, joined Erin and Emily to form a working group in January 2016. Cara brought a needed and different perspective working with professional programs on our downtown campus; Gayle was invited because of her outreach efforts and her passion for supporting students. The group met several times to brainstorm ideas. We considered the space constraints, costs, technical expertise, and other elements while keeping in mind the stated goals:

  • Entice students to enter and explore the exhibition space to signal to them that they can take ownership of the space.
  • Invite students to join each other in collective expression through a participatory element.
  • Generate a sense of welcoming to assuage feelings of homesickness.
  • Signal to students that the library is a safe and welcoming space where students’ voices are heard and valued.

Together, we decided that the 2016 exhibit would include student stories paired with watercolor portraits and audio stories. Through stories, we hoped to illustrate that no one is alone in their trepidation, happiness, and exasperation, and that the campus community works collaboratively to welcome and support them. A participatory component where visitors could directly contribute would also be included. The exhibit was inspired by many different creative influences, most notably the Oak Park Public Library’s Idea Box, Humans of New York, Wendy MacNaughton, Damien Hirst, and StoryCorps.

Watercolor Portraits

In 2014, a dedicated group of GVSU students began taking photographs and gathering stories from fellow students in the style of the widely popular project Humans of New York, which pairs photographs of everyday people with person-on-the-street interviews. We approached Humans of Grand Valley (HoGV) as collaborators on this project because this style is one imbued with an overwhelming sense of authenticity. The student group gathered a special collection of stories specifically for the exhibit. Of the twenty stories they collected, eight were selected for display, chosen to represent a range of experiences and connect with our diverse student body.

We approached art and design students to find an artist to create the watercolor portraits. Alumna Ellie Lubbers was hired to create original illustrations of students based on photographs taken by HoGV. This resulted in eight stunning watercolor illustrations. Below each portrait was an excerpt from the full-length interview conducted by HoGV.

Some of the first-person stories were accompanied by watercolor portraits.

Copyright Grand Valley State University, 2016

Ellie had been a resident assistant in the campus dorms. Her skills were paramount in making our vision truly come to life. She assisted with the overall exhibit design and installation, was instrumental in shaping the participatory component, and created promotional materials. Ellie also provided other critical feedback on how we could best reach students to accomplish the stated goals.

Audio Stories

The impetus behind the collection of audio stories was the desire to make the exhibit as inclusive as possible, not just its content but also the modes of interaction with the content. The inclusion of recorded stories, separate from any visual representation, added another dimension to the peer-to-peer interactive nature of the exhibit. Audio stories were longer and more in-depth than the stories that accompanied the watercolors. The stories are digitally archived, with transcripts, keeping them accessible long after the exhibit’s run.

As avid listeners of the weekly broadcasts of StoryCorps, heard on National Public Radio’s Morning Edition, we understood the power of stories to inspire, unite, and comfort. In fact, StoryCorps’s mission expressed perfectly one of our primary goals: “…to remind one another of our shared humanity, to strengthen and build the connections between people, to teach the value of listening, and to weave into the fabric of our culture the understanding that everyone’s story matters.”9 Students narrating their stories for others to hear was an intriguing addition to the primarily visual exhibits visitors had experienced thus far in the library.

To solicit a wide variety of stories, we reached out to the directors, organizers, and faculty advisors of groups at GVSU that offer support, resources, and guidance to students of various backgrounds with differing needs. The invitation to participate didn’t make any specific demands; students were simply asked if they’d be willing to tell a story or two about their experiences at GVSU.

We received responses from a number of student organizations representing students of varying ages and from a spectrum of gender, social, economic, and cultural backgrounds. Ten students shared their stories. The original, full-length recordings were transcribed and then edited into shorter sound bites for the online collection used in the exhibit. For most of the students, it was the only time they had been offered the chance to talk at length about themselves, to articulate their unique educational challenges and successes, and to be truly heard. For those who participated, the process of storytelling was as important as the stories themselves.10

One student’s recording session included a highly emotional recounting of a racially charged conversation with a professor. Afterward, she recognized aloud that not only had she not intended to tell that particular story, but she also felt an extraordinary sense of relief and empowerment at having done so. In telling her story, she realized her experiences shaped who she was, helped her find strength, affected her career choices, and defined her self-worth. Another participant, a returning veteran student, military wife, and pregnant mother of a toddler, found the storytelling process unexpectedly cathartic. As she spoke, she came to terms with the incredible amount of work and stress she faced, breaking down more than once. For all ten participants, the exhibit creation process gave as much to them as it did to the visitors, if not more.

Participatory Mural

Approximately 900 cut circles of white vinyl stickers were affixed to a wall of the gallery to create the canvas for the temporary, participatory mural. Corresponding three-inch circles cut from watercolor paper sat on a nearby table along with paint, brushes, water, and instructions for the activity. A color wheel detailing a range of emotions was prominently displayed to guide visitors in creating a watercolor dot that was unique to their experience. Watercolor was an ideal medium for representing emotions and it provided a low-threshold way for anyone, no matter their artistic ability, to participate.

The Value of Arts Programming

The exhibit provided visitors with a visual, auditory, and tactile experience that was multivocal and interactive. No other means would have provided such capabilities; art historian Mark Getlein explains that artmaking has the power to “Create places for some human purpose; create extraordinary versions of ordinary objects; give tangible form to the unknown; give tangible form to feelings and ideas; and refresh our vision to help us see the world in new ways.”11 The non-prescriptive nature of art also means that individuals can interpret work based on their own unique experiences. Through art, we created an accessible space where students could connect with their peers, the library, and the University at large in a novel way.

Participatory Techniques

The exhibit’s participatory artmaking element deepened students’ experience by allowing them to not only consume its content but contribute to it as well. We were first introduced to the concept of participatory exhibits through the work of Nina Simon, author of The Participatory Museum. In the book, Simon explains that participation enables visitors to “create, share and connect with each other around content.”12 The techniques popularized by Simon have been widely adopted by museums, libraries, and other cultural institutions as a way to more actively engage visitors while still honoring the mission, vision, and values of an institution. Claire Bishop writes about participation in the realm of contemporary art in the book Participation. In the introduction, she lists three reasons why artists typically employ participatory techniques: they give the audience agency; they are less hierarchical than other modes of artistic production; and they create social bonds through collective expression.13

The Mary Idema Pew Library strives to create a learning environment that “supports the whole student through the academic journey.”14 Participatory exhibits are an exemplary way to build students’ affinity for the library. We believe they lead to deeper engagement with our spaces and services, while also allowing students to make connections with their peers. Social connections are critical to students’ overall success. Alexander Astin states that “peers are the single most potent source of influence,” affecting virtually every aspect of students’ development.15 Our exhibit goes far beyond traditional library orientations by acknowledging that social needs are as important as academic ones, and it does so at a crucial time in students’ college journey. Even more, exhibits like “Connected” give students the opportunity to actively engage in creative expression, a key tenet of a liberal arts education.

Exhibit Evaluation

Libraries of all types still struggle to find the appropriate means to evaluate cultural programming.16 With each new exhibit, we consider new or revised ways to measure reach and impact more concretely. Our quantitative evidence is sparse but the qualitative evidence gathered suggests that the exhibit accomplished its intended goals.

We do not know how many people in total attended the exhibit because the space does not include sensors to count visitors. Our target audience comprised 4,380 first-time students. Almost 300 dots were painted as part of the participatory mural. Although attendance numbers do not directly correlate with value, the metric would be helpful to evaluate reach.

A small table near the mural wall included a comment box and slips of paper with a prompt asking students to “Tell us what element(s) of the exhibit you connected with most.” Feedback was uniformly positive. We received 60 responses, including the following statements:

  • “I loved that we can connect with the community with art, colors, and how we feel.”
  • “I connected with a few of the stories. I love how there is always someone out there feeling the same emotions.”
  • “So good! I never thought so many other students were as nervous all the time as I was.”
  • “Awesome! I enjoyed listening & reading others’ stories. I can relate.”
  • “This was a beautiful opportunity! I loved being able to be creative which is something I don’t get to do often!”
  • “Thank you so much for bringing this here. It was a great outlet to silently release my emotions creatively.”


Our organization supports a culture of innovation and informed risk-taking, which allows us to try new methods of engaging and supporting our students. The exhibit is a good example of that culture in action. Conceptualizing and designing a project of this scope and magnitude was not easy, yet creating spaces for discovery is worth doing. Like many art exhibits, ours was designed to elicit contemplation and creativity; we wanted visitors to listen and learn. Student stories were honest and insightful. The stories, the portraits, and the wall of emotions were intended to make visitors feel more connected to a place (Grand Valley) they would call home for the next several years, and a space (the library) where they would spend a lot of their time. They also reminded us, the exhibit organizers, that everyone has their own set of difficulties and triumphs.

The components of the exhibit were unique to GVSU. Your participatory exhibit will be unique to your library community, but the message will be the same: the library is a place where stories matter and individual voices are heard.

Acknowledgements: Thank you to our internal reviewer, Bethany Messersmith; our external reviewer, Jamie Vander Broek; and publishing editor, Amy Koester. “Connected: An Exhibit of Shared Laker Experiences” was created in collaboration with Humans of Grand Valley, especially Jaclyn Ermoyan. Grand Valley alumna Ellie Lubbers created original artwork for the exhibit. Thank you to Len O’Kelly Ph.D., who patiently edited the audio files for exhibit and archiving. Our gratitude to Matthew Reidsma for wrangling the website for the audio stories. Endless thanks to the students whose stories were featured in the exhibit.


Astin, A. W. (1993). What matters in college?: Four critical years revisited. San Francisco: Jossey-Bass.

Batty, P. (2014). MAP-Works executive summary. Retrieved from

Bishop, C. (2010). Participation. London: Whitechapel.

Fisher, S., & Hood, B. (1987). The stress of the transition to university: A longitudinal study of psychological disturbance, absent-mindedness and vulnerability to homesickness. British Journal Of Psychology, 78(4), 425.

Fraser, J., Sheppard, B., & Norlander, R. J. (2014). National Impact of the Library Public Programs Assessment (NILPPA): Meta-Analysis of the American Library Association Public Programs Office Archives. (NewKnowledge Publication #IMLS.74.83.02). New York: New Knowledge Organization Ltd.

Getlein, M. (2008). Living with art. New York: McGraw-Hill Higher Education.

GVSU Office of Institutional Analysis. (2013). Results from the MAP-Works survey of first-year undergraduates. Retrieved from

Jiao, Q., & Onwuegbuzie, A. (1997). Antecedents of library anxiety. The Library Quarterly: Information, Community, Policy, 67(4), 372-389. Retrieved from

Kuh, G., Kinzie, J., Buckley, J., Bridges, B., & Hayek, J. (2006). What Matters to Student Success: A Review of the Literature. Commissioned Report for the National Symposium on Post-Secondary Student Success: Spearheading a Dialog on Student Success. National Postsecondary Education Cooperative (NPEC).

Makowski, M. (2016). Mary Idema Pew Library named landmark library. GV Now. Retrieved from

Simon, N. (2010). The Participatory Museum. Santa Cruz, Calif: Museum 2.0.

Stanton, B. (2017). Humans of New York. Retrieved from

StoryCorps. (2017). Mission statement. Retrieved from

Thompson, N. (2015). Seeing Power: Art and Activism in the 21st Century. Brooklyn, N.Y: Melville House.

Thurber, C. A., & Walton, E. A. (2012). Homesickness and adjustment in university students. Journal of American College Health, 60(5), 415-419.

Tinto, V. (1975). Dropout from higher education: A theoretical synthesis of recent research. Review of Educational Research, 45(1), 89–125. Retrieved from

Tinto, V. (2010). From theory to action: Exploring the institutional conditions for student retention. In J.C. Smart (Ed.), Higher Education: Handbook of Theory and Research, Vol. 25 (pp. 51-90). DOI:10.1007/978-90-481-8598-6_2

  1. Tinto, 1975; 2010
  2. Jiao, 1997
  3. 1987; Thurber & Walton, 2012
  4. Tinto, 1975; 2010
  5. MAP-Works is an online student retention tool administered by Skyfactor.
  6. Batty, 2014; GVSU Office of Institutional Analysis, 2013
  7. Tinto, 1975; 2010
  8. Kuh et al., 2006
  9. StoryCorps, 2017
  10. Audio stories featured students from: WISE (Women in Science and Engineering); TRIO Student Support Services, a federally-funded support program for first-generation and limited-income students; GVSU Crew (club sport); GVSU Veterans Network; Milton E. Ford Lesbian, Gay, Bisexual & Transgender (LGBT) Resource Center; Padnos International Center; DeVos Center for Entrepreneurship and Innovation.
  11. 2008, pp. 7-10
  12. Simon, 2010, p. ii
  13. Bishop, 2010
  14. Makowski, 2016, in an interview with retired Dean of University Libraries Lee Van Orsdel
  15. 1993, p. 398; Kuh et al., 2006
  16. Fraser et al., 2014

Public money? Public code! / Open Knowledge Foundation

If taxpayers pay for something, they should have access to the results of the work they paid for. This seems a very logical basic premise that no-one would disagree with, but there are many cases where this is not common practice. For example, in various countries Freedom of Information laws do not fully apply to cases where governments outsource services. This would prevent you from finding out how your tax money has been spent. Or think about the cost of access to academic outputs resulting from public money: while much university research is paid for by the public, the academic outputs are locked away in academic journals, university libraries pay a lot of money to have access to these outputs, and the general public has no access at all unless they pay up.

But there is another important area where taxpayers’ money is used to lock away results. In our increasingly digitised societies, more and more software is being built by governments, or commissioned from external parties. The result of that work is in most cases proprietary software, which continues to be owned by the supplier. As a result, governments suffer from vendor lock-in, which means they rely fully on the external supplier for anything related to the software. No-one else is able to provide any adaptations or additions to the software or test the software properly to make sure there are no vulnerabilities, and the government cannot easily move to a different supplier if it is unhappy with the software provided. An easy solution for these issues exists: mandate that all software developed using public money is public code, and stipulate in all contracts with external suppliers that the software they develop is released under a Free and Open Source Software license.

This issue forms the heart of the Public Money? Public Code! campaign that the Free Software Foundation Europe launched recently. The ultimate aim of the campaign is to make sure Free and Open Source Software will be the default option for publicly financed software everywhere. Open Knowledge International wholeheartedly supports this movement and we add our voice to the creed: If it is public money, it should be public code! Together with all signatories, we call on our representatives to take the necessary steps to require that publicly financed software developed for the public sector be made publicly available under a Free and Open Source Software licence.

This topic is dear to us at Open Knowledge International. As the original developers and one of the main stewards of the CKAN software, we try to do our bit to make sure there is trustworthy, high quality open source software available for governments to deploy. CKAN is currently used by many governments worldwide (including the US, UK, Canada, Brazil, Germany, and Australia, to name a few) to publish data. As many governments have similar needs in publishing data on their websites, it would be a waste of public money if each government commissioned the development of its own platform, or even paid a commercial supplier for a proprietary product. If a good open source solution is available, governments do not have to pay license fees for the software: they use it for free. They can still contract an external company to deploy the open source software for them, and make any adaptations that they might want. But as long as these adaptations are also released as open source, the government is not tied to this one supplier: since the software is freely accessible, they can easily take it to a different supplier if they’re unhappy. In practice though, this is not the case for most software in use by governments, and they continue to rely on suppliers for whom a vendor lock-in model is attractive.

But we know change is possible. We have seen some successes in the last few years in the area of academic publishing, as the open access movement has gathered steam: increasingly, funders of academic research stipulate that grant recipients are expected to publish the results of their work under an open access license, which means that anyone can read and download it.

We hope a similar transformation is possible for publicly funded software, and we urge you all to add your signature to the campaign now!

Copyright Office releases draft bill to change Section 108 / District Dispatch

Thirteen years. That’s arguably how long the U.S. Copyright Office and many in industry and the library universe (notably including ALA president Jim Neal) have wrestled with difficult practical, legal and political questions about whether or how to modernize section 108 of the Copyright Act (17 USC §108). That’s the provision that creates a “safe harbor” from copyright infringement liability under specific circumstances for libraries and archives copying or distributing copyrighted material to preserve, secure and replace their collections (and other limited purposes). In a 70+ page “Discussion Document” released last Friday, and building on the work of the Section 108 Study Group’s own 2008 recommendations, the Copyright Office recounts the history of that effort and makes detailed proposals for updating section 108. It also provides “model” statutory language for Congress’ consideration.

ALA and its coalition colleagues are analyzing Friday’s Discussion Document and accompanying legislative text. While drafted in legislative form, the proposals have not yet been introduced (in whole or in part) as actual legislation and it is not yet clear whether they will be. More on that front as information becomes available. For the moment, as summarized by the Copyright Office, its principal findings and recommendations for changes to section 108 include:

Organization and Scope

  • Reorganize section 108 to make it easier to understand and apply in practice;
  • Add museums to the statute in order to increase the reach of section 108 and ensure that more works can be preserved and made available to scholars and researchers;
  • Add exceptions to the rights of public display and performance where appropriate; and
  • Add common-sense conditions for libraries, archives, and museums to meet in order to be eligible for section 108 coverage, so as to balance the significant expansion of the exceptions.

Preservation, Research, and Replacement Copies

  • Replace the current published/unpublished distinction with a new publicly disseminated/not publicly disseminated distinction, to better reflect the ways in which commercialized works are made available;
  • Allow preservation copies to be made of all works in an eligible entity’s collections, with expanded access for copies of works that were not disseminated to the public, a “dark archive” for publicly disseminated works, and replacement of the three-copy limit with a “reasonably necessary” standard;
  • Expand the limits of what is allowed to be copied for research use in another institution, and replace the three-copy limit with a limit of what is “reasonably necessary” to result in a single end-use copy; and
  • Add “fragile” to the list of conditions that may trigger a replacement copy, expand off-premises access for replacement copies, and replace the three-copy limit with a limit of what is “reasonably necessary” to result in a single end-use copy.

Copies for Users

  • Clarify that digital distributions, displays, and performances are allowed to be made of copies made at the request of users, under certain conditions;
  • Add a requirement for copies for users of an entire work or a substantial part of a work, that not only must a usable copy of the work not be available for purchase, but the user must not be able to license the use of the work; and
  • Eliminate the exclusion of musical works; pictorial, graphic, or sculptural works; and motion pictures or other audio-visual works from the provisions permitting copies to be made at the request of users, under certain conditions.

Audio-visual News Programs, Last 20 Years of Protection, and Unsupervised Reproducing Equipment

  • Expand the means through which copies of audio-visual news programs may be distributed;
  • Expand the provision concerning exceptions in the last 20 years of copyright protection to cover all works, not only published works; and
  • Clarify that the limitation of liability for patron use of unsupervised reproducing equipment includes equipment brought onto the premises by users, such as smart phones and portable scanners, and require copyright warnings be posted throughout the institution’s public areas.

Licenses and Outsourcing

  • Provide that eligible institutions do not infringe a work if they make preservation or security reproductions in violation of contrary, non-bargained-for, contractual language; and
  • Allow eligible institutions to contract with third parties to perform any of the reproduction functions under section 108, under specific conditions.

2017 LITA Forum – Programs, Schedule Available / LITA

Check out the 2017 LITA Forum website now for the preliminary schedule, program tracks, posters, and speakers. You’re sure to find plenty of sessions that you really want to attend.

Register Now!

Denver, CO
November 9-12, 2017

Join your LITA and library technology colleagues for excellent networking opportunities at the 2017 LITA Forum.

And don’t forget the other conference highlights including the Keynote speakers and the Preconferences.

Keynote Speakers:

Casey Fiesler, University of Colorado Boulder

Armed with a PhD in Human-Centered Computing from Georgia Tech and a JD from Vanderbilt Law School, Casey Fiesler primarily researches social computing, law, ethics, and fan communities (occasionally all at the same time). Find out more on her website at:

Vivienne Ming, Scientist and Entrepreneur

Dr. Vivienne Ming, named one of 10 Women to Watch in Tech by Inc. Magazine, is a theoretical neuroscientist, technologist and entrepreneur. Her speaking topics address her philosophy to maximize human potential, emphasizing education and labor markets, diversity, and AI and cybernetics. Find out more on her website at

The Preconference Workshops:

IT Security and Privacy in Libraries: Stay Safe From Ransomware, Hackers & Snoops

Blake Carver will help participants tackle security myths, passwords, tracking, malware, and more, covering a range of tools and techniques, making this session ideal for any library staff that works with IT.

Improving Technology Services with Design Thinking: A Workshop

Michelle Frisque will guide participants in using the Design Thinking Toolkit for Libraries, an open source, step-by-step guide created by IDEO with Bill and Melinda Gates Foundation support. Participants will learn how to leverage design strategies to better understand and serve library patrons.

Full Details

Join us in Denver, Colorado, at the Embassy Suites by Hilton Denver Downtown Convention Center, for the 2017 LITA Forum, a three-day education and networking event featuring 2 preconferences, 2 keynote sessions, more than 50 concurrent sessions and 15 poster presentations. It’s the 20th annual gathering of the highly regarded LITA Forum for technology-minded information professionals. Meet with your colleagues involved in new and leading edge technologies in the library and information technology field. Registration is limited in order to preserve the important networking advantages of a smaller conference. Attendees take advantage of the informal Friday evening reception, networking dinners, game night, and other social opportunities to get to know colleagues and speakers.

Get the latest information, register and book a hotel room at the 2017 Forum Web site.

We thank our LITA Forum Sponsors:

ExLibris, Google, Aten, BiblioCommons

Questions or Comments?

Contact LITA at (312) 280-4268 or Mark Beatty,

See you in Denver.

D-Lib Magazine / D-Lib

D-Lib Magazine ceased publishing new issues in July 2017. This RSS Feed will no longer be updated.

Library Launches / Library of Congress: The Signal

Today the Library of Congress launched Library of Congress Labs, a new online space designed to empower exploration and discovery in digital collections. Labs will host a changing selection of experiments, projects, events and resources designed to encourage creative use of the Library’s digital collections. To help demonstrate what exciting discoveries are possible, the new site will also feature a gallery of projects from innovators-in-residence and challenge winners, blog posts, and video presentations from leaders in the field.

Labs will enable users at every level of technical knowledge to engage with the Library’s digital collections. Visitors will have the opportunity to try experimental applications; crowdsourcing programs will allow the public to add their knowledge to the Library’s collections; and tutorials will provide a stepping stone for new computational discovery.

Library of Congress Labs homepage at

“We’re excited to see what happens when you bring together the largest collection of human knowledge ever assembled with the power of 21st century technology,” said Kate Zwaard, the chief of the Library’s National Digital Initiatives office, which manages the new website. “Every day, students, researchers, journalists, and artists are using code and computation to derive new knowledge from library collections. With Labs, we hope to create a community dedicated to using technology to expand what’s possible with the world’s creative and intellectual treasures.”

Currently featured on the site are the following:

  • LC for Robots – a collection of machine-readable data sources and APIs for Library of Congress collections.
  • Visual experiments with collections data
  • A new open source crowdsourcing app called Beyond Words that features World War I-era newspapers
  • Presentations, papers, videos, and other releases about what the National Digital Initiatives group is doing
  • Upcoming and past events

LC for Robots: Library of Congress API

To maximize the potential for creative use of its digital collections, the Library has leveraged industry standards to create application programming interfaces (APIs) to various digital collections. These windows to the Library will make our collections and data more amenable to automated access, via scripting and software, and will empower developers to explore new ways to use the Library’s collections. Information about each API is available in a section of the site called LC for Robots, dedicated to helping people explore the Library’s APIs and data sets.

Newly available is a JSON API for, which is released as a work in progress that is subject to change as the Library of Congress learns more about the needs of its scholarly and technical user communities. The Library is releasing the API as a minimum viable product so that feedback from early adopters can help drive design and development for further enhancements.

The public can anticipate more opportunities to explore Library collections in the coming months. As Kate Zwaard explains, “We don’t think of labs as a product, we think of it as a promise. We’re excited about the projects that we’re launching with, but the purpose is to create a space that encourages creative work with the digital collections.” We invite you to make, discover, and tell us your experience on Twitter @LC_labs using the hashtag #BuiltwithLC and at

See the Library of Congress press release about the launch. 

Attacking (Users Of) The Wayback Machine / David Rosenthal

Right from the start, nearly two decades ago, the LOCKSS system assumed that:
Alas, even libraries have enemies. Governments and corporations have tried to rewrite history. Ideological zealots have tried to suppress research of which they disapprove.
The LOCKSS polling and repair protocol was designed to make it as difficult as possible for even a powerful attacker to change content preserved in a decentralized LOCKSS network, by exploiting excess replication and the lack of a central locus of control.

Just like libraries, Web archives have enemies. Jack Cushman and Ilya Kreymer's (CK) talk at the 2017 Web Archiving Conference identified seven potential vulnerabilities of centralized Web archives that an attacker could exploit to change or destroy content in the archive, or mislead an eventual reader as to the archived content.

Now, Rewriting History: Changing the Archived Web from the Present by Ada Lerner et al (L) identifies four attacks that, without compromising the archive itself, caused browsers using the Internet Archive's Wayback Machine to view pages that look different to the originally archived content. It is important to observe that the title is misleading, and that these attacks are less serious than those that compromise the archive. Problems with replaying archived content are fixable; loss of or damage to archived content is not.

Below the fold I examine L's four attacks and relate them to CK's seven vulnerabilities.

To review, CK's seven vulnerabilities are:
  1. Archiving local server files, in which resources local to the crawler end up in the archive.
  2. Hacking the headless browser, in which vulnerabilities in the execution of Javascript by the crawler are exploited.
  3. Stealing user secrets during capture, a vulnerability of user-driven crawlers which typically violate cross-domain protections.
  4. Cross site scripting to steal archive logins:
    When replaying preserved content, the archive must serve all preserved content from a different top-level domain from that used by users to log in to the archive and for the archive to serve the parts of a replay page (e.g. the Wayback machine's timeline) that are not preserved content. The preserved content should be isolated in an iframe.
  5. Live web leakage on playback:
    Especially with Javascript in archived pages, it is hard to make sure that all resources in a replayed page come from the archive, not from the live Web. If live Web Javascript is executed, all sorts of bad things can happen. Malicious Javascript could exfiltrate information from the archive, track users, or modify the content displayed.
  6. Show different page contents when archived:
    It is possible for an attacker to create pages that detect when they are being archived, so that the archive's content will be unrepresentative and possibly hostile. Alternately, the page can detect that it is being replayed, and display different content or attack the replayer.
  7. Banner spoofing:
    When replayed, malicious pages can overwrite the archive's banner, misleading the reader about the provenance of the page.
Vulnerabilities CK1 through CK4 are attacks on the archive itself, possibly leading to corruption and loss. The remaining three are attacks on the eventual reader, similar to L's four. You need to read the paper to get the full details of their attacks, but in summary they are:
  1. Archive-Escape Abuse: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner, which they show is common. The resource was not present in the archive, so when they obtained control of the domain they were able to serve from it malicious JavaScript that the page served from the Wayback Machine would include. This is a version of vulnerability CK5.
  2. Same-Origin Escape Abuse: The attackers identified an archived victim page that, in an iframe from a third-party domain, included malicious JavaScript. On the live Web the Same-Origin policy prevented it from executing, but when served from the Wayback Machine the page and the iframe had the same origin. This is related to vulnerability CK4. It requires foresight, since the iframe code must be present at ingest time.
  3. Same-Origin Escape + Archive-Escape: The attackers combined L1 and L2 by including in the iframe code that deliberately generated archive escapes. It again requires foresight, since the escape-generating code must be present at ingest time.
  4. Anachronism-Injection: The attackers identified an archived victim page that embedded a JavaScript resource from a third-party domain that had no owner. The resource was not present in the archive, so when they obtained control of the domain they could use the Wayback Machine's "Save Page Now" facility to create an archived version of the resource. Now when the Wayback Machine served the page, the attackers' version of the resource would be served from the archive. The only way to defend against this attack, since the attacker's version of the resource will always be the closest in time to the victim page, would be to restrict searches for nearest-in-time resources to a small time range.
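The defense mentioned for Anachronism-Injection, restricting nearest-in-time lookups to a small window, can be sketched as follows. This is a minimal illustration and not the Wayback Machine's actual resolution logic; the function name `nearest_capture` and the 30-day default window are assumptions chosen for the example.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def nearest_capture(
    page_time: datetime,
    capture_times: Sequence[datetime],
    max_skew: timedelta = timedelta(days=30),  # assumed window, not a real default
) -> Optional[datetime]:
    """Return the capture closest in time to page_time, or None if every
    capture of the embedded resource lies outside the allowed window."""
    candidates = [t for t in capture_times if abs(t - page_time) <= max_skew]
    if not candidates:
        # Refuse to substitute a capture made long before or after the page.
        return None
    return min(candidates, key=lambda t: abs(t - page_time))


# Hypothetical timeline: the victim page was archived in 2012, and the only
# capture of its embedded script was created by an attacker in 2017.
page = datetime(2012, 3, 1)
attacker_capture = datetime(2017, 6, 1)
print(nearest_capture(page, [attacker_capture]))  # None: outside the window
```

The trade-off is that a narrow window also rejects legitimate captures taken long after the page, so some replayed pages would lose embedded resources that an unrestricted search would have found.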
Unlike L, CK note that Web archives could prevent leaks to the live Web:
Injecting the Content-Security-Policy (CSP) header into replayed content could mitigate these risks by preventing compliant browsers from loading resources except from the specified domain(s), which would be the archive's replay domain(s).
Web archives should; browsers have supported the CSP header for at least 4 years. The version of the Wayback Machine used by the Internet Archive's ArchiveIt service uses CSP to prevent live Web leakage, but the main Wayback Machine currently doesn't. If it did, L1 through L3 would be ineffective.
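To make the CSP suggestion concrete, here is a minimal sketch of injecting the header into a replayed response. It is not the code used by any real replay system; the function name `add_replay_csp`, the origin `https://replay.example.org`, and the exact policy string are assumptions for illustration. The sketch allows inline scripts and `eval`, on the assumption that archived pages often depend on them, while restricting where resources may be loaded from.

```python
from typing import Dict


def add_replay_csp(headers: Dict[str, str], replay_origin: str) -> Dict[str, str]:
    """Return a copy of the response headers with a Content-Security-Policy
    that limits resource loads to the archive's replay origin."""
    # Drop any CSP captured with the original page: it was written for the
    # live site's domains, not for the archive's.
    cleaned = {
        name: value
        for name, value in headers.items()
        if name.lower() != "content-security-policy"
    }
    # Compliant browsers will refuse to fetch scripts, images, frames and
    # other resources from any origin not listed here.
    cleaned["Content-Security-Policy"] = (
        f"default-src 'unsafe-inline' 'unsafe-eval' {replay_origin} data: blob:"
    )
    return cleaned


archived_headers = {
    "Content-Type": "text/html",
    "content-security-policy": "default-src https://original.example.com",
}
replayed = add_replay_csp(archived_headers, "https://replay.example.org")
print(replayed["Content-Security-Policy"])
```

With a policy like this, a script tag in the replayed page that points at a live third-party domain is blocked by the browser rather than fetched, which is the behaviour the archive-escape attacks depend on not happening.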

All this being said, there are some important caveats that users of preserved Web content should bear in mind. It is extremely likely that the payload of a URL delivered by the Wayback Machine is the same as that its crawler collected at the specified time. However, this does not mean that the rendered page in your browser looks the same as it would have had you visited the page when the Wayback Machine's crawler did:
  • If the Web archive's replay system does not use CSP, all bets are off.
  • Browsers evolve, rendering pages differently. This problem can be mitigated, but not eliminated, as I wrote in The Internet Is for Cats.
  • The embedded resources, such as images, CSS files, and JavaScript libraries, may not have been collected at the same time as the page itself, so may be different, as in the L4 attack.
  • At collection time, the owner of the page's domain, or the domain of any of the embedded resources, or even someone who had compromised the Web servers of the page or any of its embedded resources, could be malicious. As in the CK6 vulnerability, they could detect that the page was being archived and deliver to the crawler a payload different from that they would have delivered to a browser.
The bottom line is that all critical uses of preserved Web content, such as legal evidence, should be based on the source of the payload, not on a rendered page image.

Schema.org for Tourism / Richard Wallis

The latest release of Schema.org (v3.3 August 2017) included enhancements proposed by The Tourism Structured Web Data Community Group.

Although fairly small, these enhancements have significantly improved the capability for describing Tourist Attractions, and will hopefully enable more tourist discoveries like the one pictured here.

The TouristAttraction Type

The TouristAttraction type has been around, as a subtype of Place, from the earliest days back in 2011.  However it did not have any specific properties of its own or examples for its use.  For those interested in promoting and sharing data to help with tourism and the discovery of tourist attractions, the situation was a bit limiting.

As a result of the efforts by the group, TouristAttraction now has two properties — availableLanguage (A language someone may use with or at the item, service or place.) and touristType (Attraction suitable for type(s) of tourist. eg. Children, visitors from a particular country, etc.).   So now we can say, in data, for example that an attraction is suitable for Spanish & French speaking tourists who are interested in wine.
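To make the wine example concrete, here is a small sketch that builds such a description as JSON-LD. The attraction name and the `tourist_attraction_jsonld` helper are invented for illustration; only the type and the two property names come from the text above.

```ruby
require 'json'

# Build a JSON-LD description of a TouristAttraction using the two new
# properties. The helper and the sample values are hypothetical.
def tourist_attraction_jsonld(name, languages, tourist_types)
  description = {
    '@context' => 'http://schema.org',
    '@type' => 'TouristAttraction',
    'name' => name,
    'availableLanguage' => languages,
    'touristType' => tourist_types
  }
  JSON.pretty_generate(description)
end

puts tourist_attraction_jsonld('Example Vineyard Tour',
                               %w[Spanish French],
                               ['Wine tourism'])
```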

Application and Examples

At first glance the addition of a couple of general purpose properties does not seem much of an advancement.  However the set of examples that the Group has provided demonstrates the power and flexibility of this enhanced type for use with tourism.

The principle behind their approach, as demonstrated in the examples, is that most anything can be of interest to a tourist (a beach, mountain, coastal highway, ancient building, work of art, war cemetery, winery, amusement park); the list is endless.  It was soon clear that it would not be practical to add tourist relevant properties to all such relevant types within Schema.org.

Multi-Typed Entities (MTEs)
The Multi-Type Entity is a powerful feature of Schema.org which has often caused confusion and has a bit of a reputation for being complex.  When describing a thing (an entity in data speak), Schema.org has the capability to indicate that it is of more than one type.

For example you could be describing a Book, with an author, subject, isbn, etc., and in order to represent a physical example of that book you can also say it is a Product with weight, width, purchaseDate, itemCondition, etc.  In Schema.org you simply achieve that by indicating in your mark-up that the thing you are describing is both a Book and a Product.

Utilising the MTE principle for tourism, you would firstly describe the thing itself, using the appropriate Types (Mountain, Church, Beach, AmusementPark, etc.).  Having done that you then add the TouristAttraction type. 

In Microdata it would look like this:
  <div itemtype="" itemscope>
    <link itemprop="additionalType" href="" />
    <meta itemprop="name" content="Villers–Bretonneux Australian National Memorial" />
  </div>

In JSON-LD:
<script type="application/ld+json">
{
  "@context": "",
  "@type": ["Cemetery", "TouristAttraction"],
  "name": "Villers–Bretonneux Australian National Memorial"
}
</script>

If there is no specific type for the thing you are describing, you would just identify it as being a TouristAttraction, using all the properties inherited from Place to describe it as best as you can.

Following the ‘anything can be a tourist attraction principle’ it is perfectly possible for a Person to also be a TouristAttraction.  Take the human statue street performer who dresses up as the Statue of Liberty and stands in Times Square New York on most days (or at least seems to be there every time I visit).  

Using Schema.org, he/she can be described as a Person with name, email, jobTitle (Human Statue of Liberty), etc., and in addition as a TouristAttraction.  Sharing this information in Schema.org on a page describing this performer would possibly increase their discoverability to those looking for tourist things in New York.

Public Access

In addition to the properties and examples for TouristAttraction, the Group also proposed a new publicAccess property.  It was felt that this had wider benefit than just for tourism and hence it was added to Place, and hence inherited by TouristAttraction.  publicAccess is a boolean property enabling you to state if a Place is or is not accessible to the public.  There is no assumed default value for this property if omitted from markup.


(Tourist image by Lain
Statue image by Bruce Crummy)



Metadata Quality Interfaces: Cluster Dashboard (OpenRefine Clustering Baked Right In) / Mark E. Phillips

This is the last of the updates from our summer’s activities in creating new metadata interfaces for the UNT Libraries Digital Collections.  If you are interested in the others in this series you can view the past few posts on this blog where I talk about our facet, count, search, and item interfaces.

This time I am going to talk a bit about our Cluster Dashboard.  This interface took a little bit longer than the others to complete.  Because of this, we are just rolling it out this week, but it is before Autumn so I’m calling it a Summer interface.

I warn you that there are going to be a bunch of screenshots here, so if you don’t like those, you probably won’t like this post.

Cluster Dashboard

For a number of years I have been using OpenRefine for working with spreadsheets of data before we load them into our digital repository.  This tool has a number of great features that help you get an overview of the data you are working with, as well as identifying some problem areas that you should think about cleaning up.  The feature that I have always felt was the most interesting was their data clustering interface.  The idea of this interface is that you choose a facet, (dimension, column) of your data and then group like values together.  There are a number of ways of doing this grouping and for an in-depth discussion of those algorithms I will point you to the wonderful OpenRefine Clustering documentation.

OpenRefine is a wonderful tool for working with spreadsheets (and a whole bunch of other types of data) but there are a few challenges that you run into when you are working with data from our digital library collections.  First of all our data generally isn’t rectangular.  It doesn’t easily fit into a spreadsheet.  We have some records with one creator, we have some records with dozens of creators.  There are ways to work with these multiple values but things get complicated. The bigger challenge we generally have is that while many systems can generate a spreadsheet of their data for exporting, very few of them (our system included) have a way of importing those changes back into the system in a spreadsheet format.  This means that while you could pull data from the system, clean it up in OpenRefine, when you were ready to put it back in the system you would run into the problem that there wasn’t a way to get that nice clean data back into the system. A way that you could use OpenRefine was to identify records to change and then have to go back into the system and change records there. But that is far from ideal.

So how did we overcome this? We wanted to use the OpenRefine clustering but couldn’t get data easily back into our system.  Our solution?  Bake the OpenRefine clustering right into the system.  That’s what this post is about.

The first thing you see when you load up the Cluster Dashboard is a quick bit of information about how many records, collections, and partners you are going to be working on values from.  This is helpful to let you know the scope of what you are clustering, both to understand why it might take a while to generate clusters, and because it is generally better to run these clustering tools over the largest sets of data that you can, since that pulls in variations from many different records.  Other than that you are presented with a pretty standard dashboard interface from the UNT Libraries’ Edit System. You can limit to subsets of records with the facets on the left side and the number of items you cluster over will change accordingly.

Cluster Dashboard

The next thing that you will see is a little help box below the clustering stats. This is a help interface that helps to explain how to use the clustering dashboard and a little more information about how the different algorithms work.  Metadata folks generally like to know the fine details about how the algorithms work, or at least be able to find that information if they want to know it later.

Cluster Dashboard Help

The first thing you do is select a field/element/facet that you are interested in clustering. In the example below I’m going to select the Contributor field.

Choosing an Element to Cluster

Once you make a selection you can further limit it to a qualifier, in this case you could limit it to just the Contributors that are organizations, or Contributors that are Composers.  As I said above, using more data generally works better so we will just run the algorithms over all of the values. You next have the option of choosing an algorithm for your clustering.  We recommend to people that they start with the default Fingerprint algorithm because it is a great starting point.  I will discuss the other algorithms later in this post.

Choosing an Algorithm

After you select your algorithm, you hit submit and things start working.  You are given a screen that will have a spinner that tells you the clusters are generating.

Generating Clusters

Depending on your dataset size and the number of unique values of the selected element, you could get your results back in a second or in dozens of seconds.  The general flow of data after you hit submit is to query the Solr backend for all of the facet values and their counts.  These values are then processed with the chosen algorithm, which creates a “key” for each value.  Another way to think about it is that the values are placed into buckets that group similar values together.  There are some calculations that are performed on the clusters and then they are cached for about ten minutes by the system.  After you wait for the clusters to generate the first time, they are much quicker for the next ten minutes.
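The key-and-bucket flow can be sketched in a few lines. This is an illustration only, not the code running in our system: the `fingerprint` and `cluster` helpers, the shape of the facet counts, and the sample values are assumptions, and the fingerprint here is a simplified version (lowercase, punctuation to whitespace, sorted unique tokens) that skips steps such as folding accented characters.

```ruby
# Simplified fingerprint key: lowercase, turn punctuation into whitespace,
# then sort and de-duplicate the tokens.
def fingerprint(value)
  value.downcase.gsub(/[[:punct:]]/, ' ').split.uniq.sort.join(' ')
end

# facet_counts maps each distinct value to the number of records using it,
# roughly what a facet query returns. Values that share a key land in the
# same bucket; buckets with a single member are not interesting.
def cluster(facet_counts, &key_fn)
  facet_counts
    .group_by { |value, _count| key_fn.call(value) }
    .select { |_key, members| members.size > 1 }
    .map { |key, members|
      { key: key, members: members.to_h, records: members.sum { |_value, count| count } }
    }
end

# Made-up sample data.
counts = {
  'Shostakovich, Dmitri' => 7,
  'Dmitri Shostakovich' => 2,
  'shostakovich dmitri.' => 1,
  'Mozart, Wolfgang Amadeus' => 4
}

cluster(counts) { |value| fingerprint(value) }.each do |c|
  puts "#{c[:key]}: #{c[:members].size} members, #{c[:records]} records"
end
```

Each cluster carries its member values with their record counts, which is the information the sort options described below operate on.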

In the screen below you can see the results of this first clustering.  I will go into detail about the values and options you have to work with the clusters.

Contributor Clusters with Fingerprint Key Collision Hashing

The first thing that you might want to do is sort the clusters in a different way.  By default they are sorted with the value of the cluster key.  Sometimes this makes sense, sometimes it doesn’t make sense as to why something is in a given order.  We thought about displaying the key but found that it was also distracting in the interface.

Different ways of sorting clusters

One of the ways that I like to sort the clusters is by the number of cluster Members.  The image below shows the clusters with this sort applied.

Contributor Field sorted by Members

Here is a more detailed view of a few clusters.  You can see that the name of the Russian composer Shostakovich has been grouped into a cluster of 14 members.  This represents 125 different records in the system with a Contributor element for this composer.  Next to each Member Value you will see a number in parentheses; this is the number of records that use that variation of the value.

Contributor Cluster Detail

You can also sort based on the number of records that a cluster contains.  This brings up the most frequently used values.  Generally there are a large number that have a value and then a few records that have a competing value.  Usually pretty easy to fix.

Contributor Element sorted by Records

Sorting by the Average Length Variation can help find values that are strange duplications of themselves.  Repeated phrases, a double copy and paste, strange things like that come to the surface.

Contributor Element sorted by Average Length Variation

Finally sorting by Average Length is helpful if you want to work with the longest or shortest values that are similar.

Contributor Element sorted by Average Length

Different Algorithms

I’m going to go through the different algorithms that we currently have in production.  Our hope is that as time moves forward we will introduce new algorithms or slight variations of algorithms to really get at some of the oddities of the data in the system.  First up is the Fingerprint algorithm.  This is a direct clone of the default fingerprint algorithm used by OpenRefine.

Contributor Element Clustered using Fingerprint Key Collision

A small variation we introduced was that instead of replacing punctuation with a whitespace character, the Fingerprint-NS (No Space) algorithm just removes the punctuation without adding whitespace.  This groups F.B.I with FBI, where the other Fingerprint algorithm wouldn’t group them together.  This small variation surfaces different clusters.  We had to keep reminding ourselves when we created the algorithms that there wasn’t such a thing as “best” or “better”; instead they were just “different”.

Contributor Element Clustered using Fingerprint (No Space) Key Collision

One thing that is really common for names in bibliographic metadata is that they have many dates.  Birth, death, flourished, and so on.  We have a variation of the Fingerprint algorithm that removes all numbers in addition to punctuation.  We call this one Fingerprint-ND (No Dates).  This is helpful for grouping names that are missing dates with versions of the name that have dates.  In the second cluster below I pointed out an instance of Mozart’s name that wouldn’t have been grouped with the default Fingerprint algorithm.  Remember, different, not better or best.

Contributor Element Clustered using Fingerprint (No Dates) Key Collision

From there we branch out into a few simpler algorithms.  The Caseless algorithm just lowercases all of the values and you can see clusters that only differ in ways that are related to upper case or lower case values.

Contributor Element Clustered using Caseless (lowercase) Key Collision

Next up is the ASCII algorithm which tries to group together values that only differ in diacritics.  So for instance the name Jose and José would be grouped together.

Contributor Element Clustered using ASCII Key Collision

The final algorithm is just a whitespace normalization called Normalize Whitespace; it collapses consecutive whitespace characters to group values.

Contributor Element Clustered using Normalized Whitespace Key Collision

You may have noticed that the number of clusters went down dramatically from the Fingerprint algorithms to the Caseless, ASCII, or Normalize Whitespace algorithms.  We generally want people to start with the Fingerprint algorithms because they will be useful most of the time.
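Side by side, the key functions for these algorithms might look something like the sketch below. It follows the descriptions in this post rather than the production code; the `cluster_key` helper and its symbol names are made up, and details such as whether Fingerprint-ND replaces or deletes the characters it strips are assumptions.

```ruby
# One key function per algorithm described above. Values that produce the
# same key end up in the same cluster. Names and details are illustrative.
def cluster_key(algorithm, value)
  tokens = ->(text) { text.split.uniq.sort.join(' ') }
  case algorithm
  when :fingerprint    then tokens.call(value.downcase.gsub(/[[:punct:]]/, ' '))
  when :fingerprint_ns then tokens.call(value.downcase.gsub(/[[:punct:]]/, ''))
  when :fingerprint_nd then tokens.call(value.downcase.gsub(/[[:punct:][:digit:]]/, ' '))
  when :caseless       then value.downcase
  when :ascii          then value.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
  when :whitespace     then value.strip.gsub(/\s+/, ' ')
  else raise ArgumentError, "unknown algorithm: #{algorithm}"
  end
end

# F.B.I and FBI only collide once punctuation is removed without a space.
p cluster_key(:fingerprint, 'F.B.I') == cluster_key(:fingerprint, 'FBI')       # => false
p cluster_key(:fingerprint_ns, 'F.B.I') == cluster_key(:fingerprint_ns, 'FBI') # => true

# Dropping digits lets a dated name meet its undated form.
with_dates = 'Mozart, Wolfgang Amadeus, 1756-1791'
p cluster_key(:fingerprint_nd, with_dates) ==
  cluster_key(:fingerprint_nd, 'Mozart, Wolfgang Amadeus')                     # => true
```

Because every algorithm is just a different key function, the same grouping step can be reused for all of them, which is what makes "different, not better" cheap to offer.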

Other Example Elements

Here are a few more examples from other fields.  I’ve gone ahead and sorted them by Members (High to Low) because I think that’s the best way to see the value of this interface.  First up is the Creator field.

Creator Element clustered with Fingerprint algorithm and sorted by Members

Next up is the Subject field.  We have so so many ways of saying “OU Football”

Subject Element clustered with Fingerprint algorithm and sorted by Members

The real power of this interface is when you start fixing things.  In the example below I’m wanting to focus in on the value “Football (O U )”.  I do this by clicking the link for that Member Value.

Subject Element Cluster Detail

You are taken directly to a result set that has the records for that selected value.  In this case there are two records with “Football (O U )”.

Selected Records

All you have to do at this point is open up a record, make the edit and publish that record back. Many of you will say “yeah but wouldn’t some sort of batch editing be faster here?”  And I will answer “absolutely,  we are going to look into how we would do that!” (but it is a non-trivial activity due to how we manage and store metadata, so sadface 🙁 )

Subject Value in the Record

There you have it, the Cluster Dashboard and how it works.  The hope is to empower our metadata creators and metadata managers to better understand and, if needed, clean up the values in our metadata records.  By doing so we are improving the ability for people to connect different records based on common values between the records.

As we move forward we will introduce a number of other algorithms that we can use to cluster values.  There are also some other metrics that we will look at for sorting records to try and tease out “which clusters would be the most helpful to our users to correct first”.  That is always something we are keeping in the back of our head,  how can we provide a sorted list of things that are most in need of human fixing.  So if you are interested in that sort of thing stay tuned, I will probably talk about it on this blog.

If you have questions or comments about this post,  please let me know via Twitter.

WorldShare APIs and SSL3 / OCLC Dev Network

On September 10, OCLC disabled access to via the SSL3 protocol. This change did not impact access to via web browsers, but it has affected some community applications that call WorldShare APIs via SSL3.

Islandora Camp HRM Coming in 2018 / Islandora

Islandora Camp is coming to Atlantic Canada! July 18 - 20, 2018, we will be gathering in historic Halifax, Nova Scotia, on the campus of Mount Saint Vincent University. The Halifax Regional Municipality is a city of around 400,000 with seven post-secondary institutions, a vibrant waterfront, and easy transport to some of the best attractions that east-coast Canada has to offer in the summer.

Registration will open in late 2017 at our usual CAD rates of $450 Early Bird and $499 Regular, so stay tuned and save the date.

A Case Study on the Path to Resource Discovery / Information Technology and Libraries

A meeting in April 2015 explored the potential withdrawal of valuable collections of microfilm held by the University of Maryland, College Park Libraries. This resulted in a project to identify OCLC record numbers (OCN) for addition to OCLC’s Chadwyck-Healey Early English Books Online (EEBO) KBART file.[i] Initially, the project was an attempt to adapt cataloging workflows to a new environment in which the copy cataloging of e-resources takes place within discovery system tools rather than traditional cataloging utilities and MARC record set or individual record downloads into online catalogs. In the course of the project, it was discovered that the microfilm and e-version bibliographic records contained metadata which had not been utilized by OCLC to improve its link resolution and discovery services for digitized versions of the microfilm resources. This metadata may be advantageous to OCLC and to others in their work to transition from MARC to linked data on the Semantic Web. With MARC record field indexing and linked data implementations, this collection and others could better support scholarly research.

[i] A KBART file is a file compliant with the NISO recommended practice, Knowledge Bases and Related Tools (KBART). See KBART Phase II Working Group, Knowledge Bases and Related Tools (KBART): Recommended Practice: NISO RP-9-2014 (Baltimore, MD: National Information Standards Organization (NISO), 2014), accessed March 14, 2017,

Bibliographic Classification in the Digital Age: Current Trends & Future Directions / Information Technology and Libraries

Bibliographic classification is among the core activities of Library & Information Science that brings order and proper management to the holdings of a library. Compared to printed media, digital collections present numerous challenges regarding their preservation, curation, organization, and resource discovery and access. In this regard, a truly native perspective needs to be adopted for bibliographic classification in digital environments. In this research article, we have investigated and reported different approaches to bibliographic classification of digital collections. The article also contributes two evaluation frameworks that evaluate different classification schemes and elaborate different approaches that exist in theory, in manual practice and automatically in digital environments. The article presents a bird’s-eye view for researchers in reaching a generalized and holistic approach towards bibliographic classification research, where new research avenues have been identified.

Managing Metadata for Philatelic Materials / Information Technology and Libraries

Stamp collectors frequently donate their stamps to cultural heritage institutions. As digitization becomes more prevalent for other kinds of materials, it is worth exploring how cultural heritage institutions are digitizing their philatelic materials. This paper begins with a review of the literature about the purpose of metadata, current metadata standards, and metadata that are relevant to philatelists. The paper then examines the digital philatelic collections of four large cultural heritage institutions, discussing the metadata standards and elements employed by these institutions. The paper concludes with a recommendation to create international standards that describe metadata management explicitly for philatelic materials.

Consider TTY::Command for all your external process/shell out needs in ruby / Jonathan Rochkind

When writing a ruby app, I regularly have the need to execute and wait for an external non-ruby “command line” process. Sometimes I think of this as a “shell out”, but in truth depending on how you do it a shell (like bash or sh) may not be involved at all, the ruby process can execute the external process directly.  Typical examples for me are the imagemagick/graphicsmagick command line.

(Which is incidentally, I think, what the popular ruby minimagick gem does, just execute an external process using IM command line. As opposed to rmagick, which tries to actually use the system C IM libraries. Sometimes “shelling out” to command line utility is just simpler and easier to get right).

There are a few high-level ways built into ruby to execute external processes easily, including the simple system and backticks (`), which are usually what most people start with; they’re simple and right there for you!

But I think many people end up finding what I have: the most common patterns I want in a “launch and wait for external command line process” function are difficult with system and backticks.  I definitely want the exit value; I usually am going to want to raise an exception if the exit value isn’t 0 (unix for “success”).   I usually want to suppress stdout/stderr from the external process (instead of having it end up in my own process’s stdout/stderr and/or logs), but I want to capture them in a string (sometimes separate strings for stdout/stderr), because in an error condition I do want to log them and/or include them in an exception message. And of course there’s making sure you are safe from command injection vulnerabilities.

Neither system nor backticks will actually give you all this.  You end up having to do Open3.popen3 to get full control. And it ends up pretty confusing, verbose, and tricky to make sure you’re doing what you want, without accidentally deadlocking for some of the weirder combinations. In part because popen3 is just an old-school low-level C-style OS API being exposed to you in ruby.
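For comparison, here is roughly what the do-it-yourself version looks like for the simplest case, using the standard library’s Open3.capture3 (a convenience wrapper over popen3). The `run_captured` helper is a name made up for this sketch, and it only covers “capture both streams, raise on failure”; anything fancier gets more involved.

```ruby
require 'open3'

# Run a command with its arguments passed as separate strings (so they are
# not interpolated into a shell command line), capture stdout and stderr as
# strings instead of letting them through, and raise on a non-zero exit.
# The helper name is hypothetical.
def run_captured(*argv)
  stdout, stderr, status = Open3.capture3(*argv)
  unless status.success?
    raise "#{argv.first} exited #{status.exitstatus}\n" \
          "stdout: #{stdout}\nstderr: #{stderr}"
  end
  [stdout, stderr]
end

# Usage would look like:
#   out, err = run_captured('vips', 'dzsave', input_file_path_string)
```

Even this small helper has to decide how to format the error and what to return, which is the kind of boilerplate that adds up fast.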

The good news is @piotrmurrach’s TTY::Command will do it all for you. It’s got the right API to easily express the common use-cases you actually have, succinctly and clearly, and taking care of the tricky socket/blocking stuff for you.

One common use case I have is:  execute an external process. Do not let it output to stderr/stdout, but do capture the stderr/stdout in string(s). If the command fails,  raise with the captured stdout/stderr included (that I intentionally didn’t output to logs, but I wanna see it on error). Do it all with proper protection from command injection attack, of course.

require 'tty-command'
TTY::Command.new(printer: :null).run('vips', 'dzsave', input_file_path_string)

Woah, done! run will already:

If the command fails (with a non-zero exit code), a TTY::Command::ExitError is raised. The ExitError message will include: the name of command executed; the exit status; stdout bytes; stderr bytes

Does exactly what I need, cause, guess what, what I need is a very common use case and piotr recognized that, prob from his own work.

Want to not raise on the error, but still detect it and log stdout/stderr? No problem.

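# run! (with the bang) returns the result instead of raising on failure
result = TTY::Command.new(printer: :null).run!("vips", "dzsave", whatever)
if result.failed?
  $stderr.puts("Our vips thing failed!!! with this output:\n #{result.stdout} #{result.stderr}")
end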

Whether you want to not raise on error but still detect it, pass ENV, or do a bunch of other things, TTY::Command has got ya. Supply stdin too? No prob.  Supply a custom output formatter, so stuff goes to stdout/stderr but properly colorized/indented for your own command line utility, to look all nice and consistent with your other output? Yup. You even get a dry-run mode!

Ordinary natural rubyish options for just about anything I can think of I might want to do, and some things I hadn’t realized I might want to do until I saw em doc’d as options in TTY::Command. Easy-peasy.

In the past, I sometimes ended up writing bash scripts when I was writing something that calls a lot of external processes, cause bash seems like the suitable fit for that; it can be annoying and verbose to do a lot of that how you want in a ruby script. Inevitably the bash script grows to the point that I’m looking up non-trivial parts of bash (I’m not an expert), and fighting with them, and regretting that I used bash.  In the future, when I have the thought “this might be best in bash”, I plan to try using just ruby with TTY::Command; I think it’ll lessen the pain of lots of external processes in ruby to where there’s no reason to even consider using bash.


Describing Global Corporations Local Cafés And Everything In-between / Richard Wallis

There have been discussions in the Schema.org GitHub Issues about the way Organizations, their offices, branches and other locations can be marked up.  (The trail is here: #1734)  It started with a question about why the property hasMap was available for a LocalBusiness but not for Organization.  The simple answer being that LocalBusiness inherits the hasMap property from Place, one of its super-types, and not from Organization, its other super-type.

As was commented in the thread…

there are plenty of organizations/corporations that have a head office which nobody would consider a LocalBusiness yet for which it would be handy to be able to express it can be found on a map.

The following discussion exposed a lack of clarity in the way to structure descriptions of Organizations and their locations, offices, branches, etc.  It is also clear that the current structure, when applied correctly, can deliver the functionality required.  This is not to say that the descriptions of terms and associated examples could not be improved, and they may be helped by some tweaking to the current structure; the discussion continues!

To address that lack of clarity I thought it would be useful to share some examples here.

The LocalBusiness simple case
As a combination of Organization and Place, LocalBusiness gives you most anything you would need to describe a local shop, repair garage, café, etc.  Thus:
<script type="application/ld+json">
{
  "@context": "",
  "@type": "LocalBusiness",
  "address": {
    "@type": "PostalAddress",
    "addressLocality": "Mexico Beach",
    "addressRegion": "FL",
    "streetAddress": "3102 Highway 98"
  },
  "description": "A superb collection of fine gifts and clothing to accent your stay in Mexico Beach.",
  "name": "Beachwalk Beachwear & Giftware",
  "telephone": "850-648-4200"
}
</script>
There are plenty of other properties available, so for example if you have a link to a map you could add:
"hasMap": ",-85.4207543,17z",

If you had other information about the business such as business numbers or tax identifiers you could add:
"vatID": "1234567890abc",

What about a group of LocalBusinesses
This is also a fairly simple case. By using the parentOrganization & subOrganization properties (inherited from the Organization super-type) you can build a hierarchy of relationships as complex as you would ever need:
<script type="application/ld+json">
{
  "@context": "",
  "@type": "LocalBusiness",
  "@id": "",
  "name": "Localshops",
  "subOrganization": ""
}
</script>
<script type="application/ld+json">
{
  "@context": "",
  "@type": "LocalBusiness",
  "@id": "",
  "name": "Localshops",
  "parentOrganization": ""
}
</script>

Location and POS
There are a couple of properties inherited from Organization (location, hasPOS) which may help you link to things that might not be obvious local businesses — the warehouse location for your group of hardware stores, or the beach kiosk for your café for example.

Banks, Libraries, and Hotels
There are many local examples of larger organizations that appear on our high streets; for some of the more common ones there are ready-made subtypes of LocalBusiness – BankOrCreditUnion, Library, Hotel, etc.  If you can’t find a suitable one, default to LocalBusiness.

This possibility sometimes causes some confusion.  For instance Wells Fargo, the global finance company, could in no way be considered a local business; however their branch in your local city can indeed be considered one.  For example:
<script type="application/ld+json">
{
  "@context": "",
  "@type": "BankOrCreditUnion",
  "name": "Wells Fargo – Smalltown Branch",
  "parentOrganization": ""
}
</script>

Governments, Global Corporations, Online Groups
Picking up on that confusion about non-LocalBusiness-ness brings me to the prime use of the Organization type.  That prime use, in my experience, is to describe the legal entity, corporate organization, cooperation group, government, international body, etc.  From a dictionary definition: “an organized group of people with a particular purpose, such as a business or government department.”

My approach to this would be to describe the organization first: name, description, email, foundingDate, leiCode, logo, taxID, etc.  If it has a registered or main address, use the address property; if it has channels for specific contact methods etc., use contactPoint to describe areaServed, contactType etc.

Some of these organizations operate out of various locations — head offices, regional/country based main/local offices, factories, warehouses, distribution centres, research and development centres, laboratories, etc.  The list is a substantial one.  This is where the location property comes into play.

Each of those locations can then be described using a Place type or one of its subtypes…

<script type="application/ld+json">
{
  "@context": "",
  "@type": "Organization",
  "@id": "",
  "name": "Global-Corp Company",
  "location": [
    {
      "@type": "Place",
      "name": "Global-Corp Company HQ",
      "address": "Future City Development, Main Street, Anytown, NY",
      "hasMap": ""
    },
    {
      "@type": "Place",
      "name": "Global-Corp Company Research Laboratory",
      "address": "Lab1, Future Park, Anytown, NY",
      "hasMap": ""
    }
  ]
}
</script>

This is not a perfect solution. It would be nice to have specific subtypes of Place for Office, Factory, etc., or a placeType property on Place.  I am sure there will be continued discussion on this.  In the meantime, what is already available goes a long way towards describing what is needed, from Global Corporations to Local Cafés and everything in between.
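For anyone generating this markup from a database of offices rather than writing it by hand, here is a minimal Python sketch of the Organization-with-locations pattern described above. The function name and the `https://schema.org` context value are my assumptions for illustration (the listing above leaves the context empty), and only properties already mentioned in this post are used.

```python
import json

def organization_jsonld(name, locations, context="https://schema.org"):
    """Build an Organization description with one Place per location.

    `context` is an assumed value; the listing in the post leaves it empty.
    `locations` is a list of (place name, address) pairs.
    """
    return {
        "@context": context,
        "@type": "Organization",
        "name": name,
        "location": [
            {"@type": "Place", "name": place_name, "address": address}
            for place_name, address in locations
        ],
    }

doc = organization_jsonld(
    "Global-Corp Company",
    [
        ("Global-Corp Company HQ",
         "Future City Development, Main Street, Anytown, NY"),
        ("Global-Corp Company Research Laboratory",
         "Lab1, Future Park, Anytown, NY"),
    ],
)

# The printed JSON is what would go inside the ld+json script element.
print(json.dumps(doc, indent=2))
```

Building the structure as a dictionary and serializing it with `json.dumps` avoids the missing-brace and missing-comma mistakes that are easy to make when typing JSON-LD by hand.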

(Image: Simon)

Prepare Now for Topical Storm Chrome 62 / Eric Hellman

Sometime in October, probably the week of October 17th, version 62 of Google's Chrome web browser will be declared "stable". When that happens, users of Chrome will get their software updated to version 62 when they restart.

One of the small but important changes that will occur is that many websites that have not implemented HTTPS to secure their communications will be marked in a subtle way as "Not Secure". When such a website presents a web form, typing into the form will change the appearance of the website URL. Here's what it will look like:

Unfortunately, many libraries, and the vendors and publishers that serve them, have not yet implemented HTTPS, so many library users that type into search boxes will start seeing the words "Not Secure" and may be alarmed.

What's going to happen? Here's what I HOPE happens:
  • Libraries, Vendors, and Publishers that have been working on switching their websites for the past two years (because usually it's a lot more work than just pushing a button) are motivated to fix the last few problems, turn on their secure connections, and redirect all their web traffic through their secure servers before October 17.
          So instead of this:

           ... users will see this:

  • Library management and staff will be prepared to answer questions about the few remaining problems that occur. The internet is not a secure place, and Chrome's subtle indicator is just a reminder not to type in sensitive information, like passwords, personal names and identifiers, into "not secure" websites.
  • The "Not Secure" animation will be noticed by many users of libraries, vendors, and publishers that haven't devoted resources to securing their websites. The users will file helpful bug reports and the website providers will acknowledge their prior misjudgments and start to work carefully to do what needs to be done to protect their users.
  • Libraries, vendors, and publishers will work together to address many interactions and dependencies in their internet systems.

Here's what I FEAR might happen:
  • The words "Not Secure" will cause people in charge to think their organizations' websites "have been hacked". 
  • Publishing executives seeing the "Not Secure" label will order their IT staff to "DO SOMETHING" without the time or resources to do a proper job.
  • Library directors will demand that Chrome be replaced by Firefox on all library computers because of a "BUG in CHROME". (creating an even worse problem when Firefox follows suit in a few months!) 
  • Library staff will put up signs instructing patrons to "ignore security warnings" on the internet. Patrons will believe them.
Back here in the real world, libraries are under-resourced and struggling to keep things working. The industry in general has been well behind the curve of HTTPS adoption, needlessly putting many library users at risk. The complicated technical environment, including proxy servers, authentication systems, federated search, and link servers has made the job of switching to secure connections more difficult.

So here's my forecast of what WILL happen:
  • Many libraries, publishers and vendors, motivated by Chrome 62, will finish their switch-over projects before October 17. Users of library web services will have better security and privacy. (For example, I expect OCLC's WorldCat, shown above in secure and not secure versions, will be in this category.)
  • Many switch-over projects will be rushed, and staff throughout the industry, both technical and user-facing, will need to scramble and cooperate to report and fix many minor issues.
  • A few not-so-thoughtful voices will complain that this whole security and privacy fuss is overblown, and blame it on an evil Google conspiracy.

Here are some notes to help you prepare:
  1. I've been asked whether libraries need to update links in their catalog to use the secure version of resource links. Yes, but there's no need to rush. Website providers should be using HTTP redirects to force users into the secure connections, and should use HSTS headers to make sure that their future connections are secure from the start.
  2. Libraries using proxy servers MUST update their software to reasonably current versions, and update proxy settings to account for secure versions of provider services. In many cases this will require acquisition of a wildcard certificate for the proxy server.
  3.  I've had publishers and vendors complain to me that library customers have asked them to retain the option of insecure connections ... because reasons. Recently, I've seen reports on listservs that vendors are being asked to retain insecure server settings because the library "can't" update their obsolete and insecure proxy software. These libraries should be ashamed of themselves - their negligence is holding back progress for everyone and endangering library users. 
  4. Chrome 62 is expected to reach beta next week. You'll then be able to install it from the beta channel. (Currently, it's in the dev channel.) Even then, you may need to set the #mark-non-secure-as flag to see the new behavior. Once Chrome 62 is stable, you may still be able to disable the feature using this flag.
  5. A screen capture using Chrome 62 now might help convince your manager, your IT department, or a vendor that a website really needs to be switched to HTTPS.
  6. Mixed content warnings are the result of embedding not-secure images, fonts, or scripts in a secure web page. A malicious actor can insert content or code in these elements, endangering the user. Much of the work in switching a large site from HTTP to HTTPS consists of finding and addressing mixed content issues.
  7. Google's Emily Schechter gives an excellent presentation on the transition to HTTPS, and how the Chrome UI is gradually changing to more accurately communicate to users that non-HTTPS sites may present risks: (discussion of Chrome 62 changes starts around 32:00)
  8. (added 9/15/2017) As an example of a company that's been working for a while on switching, Elsevier has informed its ScienceDirect customers that ScienceDirect will be switching to HTTPS in October. They have posted instructions for testing proxy configurations.
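
Since note 6 says much of the switch-over work is finding mixed content, here is a rough Python sketch of what such a check could look like for a single page. It is only an illustration under simplifying assumptions: the class and function names are made up, the list of tag/attribute pairs is far from complete, and real audits also need to cover CSS, fonts, and resources loaded by scripts.

```python
from html.parser import HTMLParser

class MixedContentFinder(HTMLParser):
    """Collect http:// URLs that a page would load as embedded resources."""

    # A deliberately small, assumed list of (tag, attribute) pairs.
    # Treating every <link href> as a loaded resource is a simplification.
    RESOURCE_ATTRS = {
        ("img", "src"),
        ("script", "src"),
        ("iframe", "src"),
        ("link", "href"),
    }

    def __init__(self):
        super().__init__()
        self.insecure = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if (tag, name) in self.RESOURCE_ATTRS and value \
                    and value.lower().startswith("http://"):
                self.insecure.append((tag, value))

def find_mixed_content(html):
    """Return (tag, url) pairs for insecure embedded resources in `html`."""
    finder = MixedContentFinder()
    finder.feed(html)
    return finder.insecure

page = """
<html><head>
<link rel="stylesheet" href="https://example.org/site.css">
<script src="http://example.org/search.js"></script>
</head><body>
<img src="http://example.org/logo.png">
<a href="http://example.org/about">About</a>
</body></html>
"""

# Ordinary links (<a href>) are not embedded content, so they are not reported.
print(find_mixed_content(page))
```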

OpenSRF 2.5.2 released / Evergreen ILS

We are pleased to announce the release of OpenSRF 2.5.2, a message routing network that offers scalability and failover support for individual services and entire servers with minimal development and deployment overhead.

OpenSRF 2.5.2 fixes a significant bug that was introduced in 2.5.1. All users of OpenSRF 2.5.1, including testers of the Evergreen 3.0 beta release, are advised to upgrade as soon as possible.  In particular, 2.5.2 fixes bug 1717350, where an error in the splitting of large messages could result in characters getting dropped at the division points.

To download OpenSRF 2.5.2, please visit the downloads page. The release notes can be viewed here.

We would also like to thank the following people who contributed to the release:

  • Bill Erickson
  • Dan Wells
  • Galen Charlton
  • Jason Stephenson
  • Jeff Davis
  • Mike Rylander

Washington Office’s copyright expert, puzzle master recognized / District Dispatch

Carrie Russell is being recognized by her alma mater, the School of Information Studies at the University of Wisconsin-Milwaukee, this weekend at their 50th Anniversary Gala.

Here’s something you may not know about the Washington Office’s in-house copyright expert Carrie Russell: Carrie loves puzzles.

On the first floor of the Washington Office, Carrie always has a jigsaw puzzle in progress. And, in her office on the second floor, she’s spent the last 18 years trying to untangle the field’s bigger knots.

Carrie, who is being recognized by her alma mater, the School of Information Studies at the University of Wisconsin-Milwaukee (UWM), this weekend at their 50th Anniversary Gala, thinks there is no better feeling than completing a puzzle.

“When I was a student at UWM, I worked in the library and loved cataloging,” said Carrie. “I remember we had to carry around these huge Sears List of Subject Headings manuals everywhere and I loved it. To me, cataloging was a puzzle that’s a lot like the public policy I work on now.”

Carrie joined the Washington Office in 1999 as a copyright specialist, developing copyright education programs for librarians and analyzing the expansion of copyright law in the digital environment. Carrie’s work broadened in 2007 when she became director of the Program on Public Access to Information, covering international copyright, accessibility, e-books and more.

“This honor is certainly well-deserved given Carrie’s many contributions to the field,” said Office for Information Technology Policy’s Director Alan Inouye. “She has been a driving force behind ALA’s involvement in big copyright issues like the Marrakesh treaty, the Google books lawsuit, HathiTrust, the ongoing Georgia State e-reserves case, Section 108 study group and our work with LCA and the newer Re:Create coalition. We are grateful for her vision, dedication and partnership in these significant areas.”

UWM is honoring Carrie alongside 49 other alumni, including ALA’s Kristin Pekoll, assistant director of the Office for Intellectual Freedom, who, through their lives and careers, have exemplified UWM’s focus on critical inquiry and leadership to address the needs of a diverse and global information society.

Carrie didn’t start out thinking she would become an international copyright expert. When she graduated from UWM in 1985, she took a position as a serials cataloger in the library at the University of Arizona in Tucson. There she progressed from serials to media.

“After I graduated from UWM I became really interested in the concept of information as a commodity,” explained Carrie. “I started to spend time with the political economists in the University’s Media Arts Department.”

Carrie, who studied film (ask her about Italian neorealism!) during her undergraduate years, was no stranger to Media Arts—and brought a unique perspective.

“To me, serials pricing was the perfect political economy puzzle. Each journal title publishes unique research findings and, as a result, creates these unique commodities,” explained Carrie. “But, at U of A, rising subscription costs often prevented us from being able to afford certain subscriptions. We had to try to contain costs while simultaneously maximizing access for our users.”

After getting her second master’s degree, Carrie moved into a position as the University’s Copyright Librarian, consulting faculty regarding curriculum-related copyright issues and developing an advocacy program for faculty on scholarly communication and alternative publishing models.

Today, Carrie continues to work directly with librarians on copyright, hosting training and webinars, analyzing legislation, presenting at conferences and writing books. Through her work with ALA’s Copyright Education Subcommittee, she became known for her creative communication on copyright, creating the spinner, the slider, the “foldy thingy” and Fair Use coasters. She was the recipient of the 2013 ABC-CLIO/Greenwood Award for Best Book in Library Literature for Complete Copyright: An Everyday Guide for K-12 Librarians and Educators. She is also the author of Complete Copyright: An Everyday Guide for Librarians, now in its second edition.

“I know that all of my colleagues at the Washington Office and ALA broadly join me in applauding both Carrie and Kristin for this impressive acknowledgment of their outstanding leadership and commitment to the library field,” said Associate Executive Director Kathi Kromer. “ALA is fortunate to have such talented issues experts on staff to help us serve members and the public and we look forward to their future contributions.”

The post Washington Office’s copyright expert, puzzle master recognized appeared first on District Dispatch.

New OCLC System Status Dashboard / OCLC Dev Network

OCLC has launched a new way to check the real-time and planned status of OCLC systems and applications, including APIs.

House approves full IMLS, LSTA and IAL funding for FY 2018 / District Dispatch

Today, the full House of Representatives voted as part of a large spending package (H.R. 3354) not to make any cuts in federal funding for the Institute of Museum and Library Services (IMLS), including all funding for its programs under the Library Services and Technology Act, and for the Department of Education’s Innovative Approaches to Literacy program. Notably, the package also increased funding for the National Library of Medicine by $6 million.


The House approved full funding for FY 2018 key federal library programs. Photo credit:

With today’s vote, the House has now finished work for the FY 2018 appropriations cycle, at least until it must reconcile its bill to one eventually passed by the Senate. The full Senate is not likely to take up its own appropriations bill* until late this year.

Once the Senate has acted on a spending bill and the process of resolving differences with the House bill begins, ALA and library supporters everywhere may need to again push hard to retain funding gains made in the Senate. ALA’s Washington Office will put out an alert when your voice can be most effective.

Congratulations to everyone who contacted their representative. Your advocacy has put libraries in a great position at this point in the process; your persistence later this year will give us a strong finish.


* Read last week’s post on the Senate Appropriations Committee vote to increase IMLS’ FY 2018 budget.

The post House approves full IMLS, LSTA and IAL funding for FY 2018 appeared first on District Dispatch.

ALA appoints Jon Peha and Sari Feldman as senior fellows / District Dispatch


ALA has appointed Sari Feldman and Jon Peha as senior fellows. Photo credit: American Library Association

I am pleased to announce the appointment of Jon Peha and Sari Feldman as senior fellows of the American Library Association’s (ALA) Office for Information Technology Policy (OITP). As senior fellows, they will provide strategic advice on our national policy advocacy.

Jon Peha is a professor in the Department of Engineering and Public Policy and the Department of Electrical and Computer Engineering at Carnegie Mellon University. He has served as the chief technologist at the Federal Communications Commission, assistant director in the White House Office of Science and Technology Policy, legislative fellow on the House Energy & Commerce Committee and team leader and fellow in the U.S. Agency for International Development. In industry, Peha has been chief technical officer for three high-tech companies and a member of technical staff at SRI International, AT&T Bell Laboratories and Microsoft. He is a fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the American Association for the Advancement of Science (AAAS). Peha will provide ALA with counsel on the broad range of telecommunications issues from net neutrality and network engineering to spectrum and universal service.

Sari Feldman is executive director of the 27-branch Cuyahoga County Public Library in Ohio. She has served as the president of the American Library Association and the Public Library Association. As part of her service to ALA, Feldman served as co-chair of the ALA Digital Content Working Group that successfully advocated for library access to e-books from the largest publishers. She was named one of Publishers Weekly’s “Notable Publishing People of 2014.” She has held other leadership posts in libraries that include deputy director at the Cleveland Public Library. She also served as an adjunct faculty member at the School of Information Studies at Syracuse University. Feldman’s local civic involvement has included serving as president of the board of Cuyahoga Arts and Culture as well as an appointed member of the City-County Workforce Development Board. She currently serves as Chair of DigitalC, formerly OneCommunity. Feldman will provide guidance on e-book and digital content issues as well as library broadband policy and implementation.

“ALA is fortunate to have such capable individuals sharing their knowledge with us,” said Marc Gartler, chair of the OITP Advisory Committee. “Libraries have many policy concerns, so the breadth and depth provided by the senior fellows is essential to our ability to adequately address the myriad issues before us. On behalf of ALA, I extend our sincere thanks for their willingness to serve.”

Feldman and Peha join senior fellow Robert Bocher, senior counsel Alan Fishel and senior advisor Roger Rosen as strategic advisors to ALA.


The post ALA appoints Jon Peha and Sari Feldman as senior fellows appeared first on District Dispatch.

Solr Payloads / Lucidworks

Before we delve into the technical details, what’s the big picture?  What real-world challenges are made better with these new Solr capabilities?  Here are some use cases where payloads can help:

  • per-store pricing
  • weighted terms, such as the confidence or importance of a term
  • weighting term types, like factoring synonyms lower, or verbs higher

Now on to the technical details, starting with how payloads are implemented in Lucene, and then to Solr’s integration.

Payloads in Lucene

The heart of Solr is powered by our favorite Java library of all-time, Lucene. Lucene has had this payload feature for a while, but hasn’t seen much light of day partly because until now it hasn’t been supported natively in Solr.

Let’s take a moment to refresh ourselves on how Lucene works, and then show where payloads fit in.

Lucene Index Structure

Lucene builds an inverted index of the content fed to it. An inverted index is, at a basic level, a straightforward dictionary of words from the corpus alphabetized for easy finding later. This inverted index powers keyword searches handily. Want to find documents with “cat” in the title? Simply look up “cat” in the inverted index, and report all of the documents listed that contain that term – very much like looking up words in the index at the back of books to find the referring pages.

Finding documents super fast based off words in them is what Lucene does.  We may also require matching words in proximity to one another, and thus Lucene optionally records the position information to allow for phrase matching, words or terms close to one another. Position information provides the word number (or position offset) of a term: “cat herder” has “cat” and “herder” in successive positions.

For each occurrence of an indexed word (or term) in a document, the positional information is recorded. Additionally, and also optionally, the offsets (the actual character start and end offset) can be encoded per term position.


Available alongside the positionally related information is an optional general purpose byte array. At the lowest-level, Lucene allows any term in any position to store whatever bytes it’d like in its payload area. This byte array can be retrieved as the term’s position is accessed.
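
To make that structure concrete, here is a toy Python sketch of an inverted index that records, for each term and document, the positions where the term occurs and an optional payload byte string per position. It is a teaching model only, nothing like Lucene’s actual on-disk format; the function names and the sample documents are made up for illustration.

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> doc id -> list of (position, payload bytes or None)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, tokens in docs.items():
        for position, (term, payload) in enumerate(tokens):
            index[term][doc_id].append((position, payload))
    return index

def phrase_match(index, first, second):
    """Return doc ids where `second` occurs right after `first`."""
    hits = set()
    for doc_id, postings in index.get(first, {}).items():
        second_positions = {
            pos for pos, _ in index.get(second, {}).get(doc_id, [])
        }
        if any(pos + 1 in second_positions for pos, _ in postings):
            hits.add(doc_id)
    return hits

# Each document is a list of (term, payload) tokens in position order.
docs = {
    1: [("cat", None), ("herder", b"\x01")],
    2: [("herder", None), ("of", None), ("cat", None)],
}
index = build_index(docs)

print(sorted(index["cat"]))                  # documents containing "cat"
print(phrase_match(index, "cat", "herder"))  # documents with the phrase
print(index["herder"][1])                    # position and payload in doc 1
```

Both documents contain “cat”, but only document 1 has “cat” and “herder” in successive positions, and only its “herder” occurrence carries payload bytes.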

These per-term/position byte arrays can be populated in a variety of ways using some esoteric built-in Lucene TokenFilters, a few of which I’ll de-cloak below.

A payload’s primary use case is to affect relevancy scoring; there are also other very interesting ways to use payloads, discussed here later. Built-in at Lucene’s core scoring mechanism is float Similarity#computePayloadFactor() which until now has not been used by any production code in Lucene or Solr; though to be sure, it has been exercised extensively within Lucene’s test suite since inception. It’s hardy, just under-utilized outside custom expert-level coding to ensure index-time payloads are encoded the same way they are decoded at query time, and to hook this mechanism into scoring.

Payloads in Solr

One of Solr’s value-adds is providing rigor to the fields and field types used, keeping index and query time behavior in sync. Payload support followed along, linking index-time payload encoding with query time decoding through the field type definition.

The payload features described here were added to Solr 6.6, tracked in SOLR-1485.

Let’s start with an end-to-end example…

Solr 6.6 Payload Example

Here’s a quick example of assigning per-term float payloads and leveraging them:

bin/solr start
bin/solr create -c payloads
bin/post -c payloads -type text/csv -out yes -d $'id,vals_dpf\n1,one|1.0 two|2.0 three|3.0\n2,weighted|50.0 weighted|100.0'

If that last command gives you any trouble, navigate to <http://localhost:8983/solr/#/payloads/documents>, change the `Document Type` to CSV, and paste this CSV into the “Document(s)” area:

id,vals_dpf
1,one|1.0 two|2.0 three|3.0
2,weighted|50.0 weighted|100.0

Two documents are indexed (id 1 and 2) with a special field called vals_dpf.  Solr’s default configuration provides *_dpf, the suffix indicating it is of “delimited payloads, float” field type.

Let’s see what this example can do, and then we’ll break down how it worked.

The payload() function returns a float computed from the numerically encoded payloads on a particular term. In the first document just indexed, the term “one” has a float of 1.0 encoded into its payload, and likewise “two” with the value of 2.0, “three” with 3.0. The second document has the same term, “weighted” repeated, with a different (remember, payloads are per-position) payload for each of those terms’ positions.

Solr’s pseudo-fields provide a useful way to access payload function computations. For example, to compute the payload function for the term “three”, we use payload(vals_dpf,three). The first argument is the field name, and the second argument is the term of interest.


The first document has a term “three” with a payload value of 3.0. The second document does not contain this term, and the payload() function returns the default 0.0 value.

Using the above indexed data, here’s an example that leverages all the various payload() function options:



There’s a useful bit of parameter substitution indirection to allow the field name to be specified as f=vals_dpf once and referenced in all the functions.  Similarly, the term weighted is specified as the query parameter t and substituted in the payload functions.

Note that this query limits to q=id:2 to demonstrate the effect with multiple payloads involved.  The fl expression def:payload($f,not_there,37) finds no term “not_there” and returns the specified fall-back default value of 37.0, and avg:payload($f,$t,0.0,average) takes the average of the payloads found on all the positions of the term “weighted” (50.0 and 100.0) and returns the average, 75.0.
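
The request described above can be assembled programmatically. Here is a minimal Python sketch, assuming the local Solr instance and `payloads` collection from the earlier commands and the `/query` handler; only the two `fl` expressions spelled out in the text are included, and the variable names are mine.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

params = {
    "q": "id:2",        # limit to the document with the repeated term
    "f": "vals_dpf",    # field name, referenced as $f in the functions
    "t": "weighted",    # term of interest, referenced as $t
    "fl": "id,"
          "def:payload($f,not_there,37),"
          "avg:payload($f,$t,0.0,average)",
}

# Host, port, and handler are assumptions based on the other URLs in this post.
url = "http://localhost:8983/solr/payloads/query?" + urlencode(params)
print(url)
```

`urlencode` takes care of escaping the `$`, commas, and parentheses, which is easy to get wrong when pasting function queries into a URL by hand.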

Indexing terms with payloads

The default (data_driven) configuration comes with three new payload-using field types. In the example above, the delimited_payloads_float field type was used, which is mapped to a *_dpf dynamic field definition, making it handy to use right away. This field type is defined with a WhitespaceTokenizer followed by a DelimitedPayloadTokenFilter. Textually, it’s just a whitespace tokenizer (case and characters matter!). If the token ends with a vertical bar (|) delimiter followed by a floating point number, the delimiter and number are stripped from the indexed term and the number is encoded into the payload.

Solr’s analysis tool provides introspection into how these delimited payloads field types work.  Using the first document in the earlier example, keeping the output simple (non-verbose), we see the effect of whitespace tokenization followed by delimited payload filtering, with the basic textual indexing of the term being the base word/token value, stripping off the delimiter and everything following it.  Indexing-wise, this means the terms “one”, “two”, and “three” are indexed and searchable with standard queries, just as if we had indexed “one two three” into a standard text field.

delimited payloads, float – analysis terms

Looking a little deeper into the indexing analysis by turning on verbose view, we can see in the following screenshot a hex dump view of the payload bytes assigned to each term in the last row labeled “payload”.

delimited payloads, float – verbose analysis
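
For readers without the screenshots, here is a small Python sketch that mimics what this analysis chain does to the first example document: split on whitespace, strip the delimiter and number from each token, and encode the number as payload bytes. I am assuming the float is stored as a 4-byte big-endian IEEE 754 value; treat that detail, and the function name, as assumptions rather than a statement about Lucene’s internals.

```python
import struct

def analyze(text, delimiter="|"):
    """Return (term, payload bytes or None) for each whitespace token."""
    tokens = []
    for raw in text.split():
        # Split at the first delimiter; tokens without one get no payload.
        term, sep, number = raw.partition(delimiter)
        payload = struct.pack(">f", float(number)) if sep else None
        tokens.append((term, payload))
    return tokens

for term, payload in analyze("one|1.0 two|2.0 three|3.0"):
    print(term, payload.hex())
# one 3f800000
# two 40000000
# three 40400000
```

The terms come out as plain “one”, “two”, and “three”, which is why they remain searchable with standard queries.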

Payloaded field types

These new payloaded field types are available in Solr’s data_driven configuration:

field type                 payload encoding   dynamic field mapping
delimited_payloads_float   float              *_dpf
delimited_payloads_int     integer            *_dpi
delimited_payloads_string  string, as-is      *_dps

Each of these is whitespace tokenized with delimited payload filtering, the difference being the payload encoding/decoding used.

payload() function

The payload() function, in the simple case of unique, non-repeating terms with a numeric (integer or float) payload, effectively returns the actual payload value. When the payload() function encounters terms that are repeated, it will either take the first value it encounters, or iterate through all of them returning the minimum, maximum, or average payload value.

The payload() function signature is this:

payload(field,term[,default, [min|max|average|first]])

where the default value defaults to 0.0 and the function defaults to average, which averages the payload values across the term’s positions.
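
The semantics are easy to model outside of Solr. Here is a Python sketch of the behavior as described in this post, taking the list of decoded payloads found at a term’s positions in one document; it is an emulation for illustration, not Solr’s implementation, and the Python function name is simply borrowed from the Solr function.

```python
def payload(position_payloads, default=0.0, func="average"):
    """Emulate payload(field,term[,default,[min|max|average|first]]).

    `position_payloads` holds the decoded numeric payloads for every
    position of the term in one document (empty if the term is absent).
    """
    if not position_payloads:
        return default
    if func == "first":
        return position_payloads[0]
    if func == "min":
        return min(position_payloads)
    if func == "max":
        return max(position_payloads)
    if func == "average":
        return sum(position_payloads) / len(position_payloads)
    raise ValueError(f"unknown function: {func}")

weighted = [50.0, 100.0]          # document 2 from the earlier example

print(payload(weighted))          # average of both positions
print(payload(weighted, 0.0, "min"))
print(payload(weighted, 0.0, "max"))
print(payload(weighted, 0.0, "first"))
print(payload([], 37))            # term not present: fall-back default
```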

Back to the Use Cases

That’s great, three=3.0, and the average of 50.0 and 100.0 is 75.0.  Like we needed payloads to tell us that.  We could have indexed a field, say words_t with “three three three” and done termfreq(words_t,three) and gotten 3 back.  We could have fields min_f set to 50.0 and max_f set to 100.0 and used div(sum(min_f,max_f),2) to get 75.0.

Payloads give us another technique, and it opens up some new possibilities.

Per-store Pricing

Business is booming, we’ve got stores all across the country!  Logistics is hard, and expensive.  The closer a widget is made to the store, the less shipping costs; or rather, it costs more for a widget the further it has to travel.  Maybe not so contrived rationale aside, this sort of situation with per-store pricing of products is how it is with some businesses.  So, when a customer is browsing my online store they are associated with their preferred or nearest physical store, where all product prices seen (and faceted and sorted don’t forget!) are specific to the pricing set up for that store for that product.

Let’s whiteboard that one out and be pragmatic about the solutions available: if you’ve got five stores, maybe have five Solr collections, even with everything the same but the prices?  What if there are 100 stores and growing? Managing that many collections becomes a whole new level of complexity, so then maybe have a field for each store, on each product document?  Both of those work, and work well… to a point.  There are pros and cons to these various approaches.  But what if we’ve got 5000 stores?  Things get unwieldy with lots of fields due to Solr’s caching and per-field machinery; consider one user from each store region doing a search with sorting and faceting, multiplying a traditional numeric sorting requirement times 5000.  Another technique that folks implement is to cross products with stores and have a document for every store-product, which is similar to a collection per store but very quickly explodes to lots of documents (num_stores * num_products can be a lot!).  Let’s see how payloads give us another way to handle this situation.

Create a product collection with bin/solr create -c products and then CSV import the following data; using the Documents tab in CSV mode is easiest. Paste this in and submit:

 SB-X,Snow Blower,350.37,STORE_FL|275.99
 AC-2,Air Conditioner,499.50,STORE_AK|312.99


Products Documents with Payloaded Prices

I stuck with dynamic field mappings to keep things working easily out of the box in this example, but for real I’d personally use cleaner names, such as default_price instead of default_price_f and store_prices instead of store_prices_dpf.

Let’s find all products, sorted by price, first by default_price_f:  http://localhost:8983/solr/products/browse?fl=*&sort=default_price_f%20asc

In Alaska, though, that’s not the correct sort order.  Let’s associate the request with STORE_AK, using &store_id=STORE_AK, and see the computed price based on the payload associated with the store_id for each product document with &computed_price=payload(store_prices_dpf,$store_id,default_price_f).  Note that those two parameters are ours, not Solr’s.  With a function defined as a separate parameter, we can re-use it where we need it.  To see the field, add it to fl with &fl=actual_price:${computed_price}, and to sort by it, use &sort=${computed_price} asc.
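
To see why the sort order changes per store, here is a Python sketch that models the computed price outside of Solr: look up the store’s payload and fall back to the default price when the product has no price for that store. The dictionaries mirror the two product rows above; I am assuming the third column is default_price_f and the fourth is store_prices_dpf, and the helper names are made up.

```python
products = [
    {"id": "SB-X", "default_price_f": 350.37,
     "store_prices_dpf": {"STORE_FL": 275.99}},
    {"id": "AC-2", "default_price_f": 499.50,
     "store_prices_dpf": {"STORE_AK": 312.99}},
]

def computed_price(product, store_id):
    """Model payload(store_prices_dpf,$store_id,default_price_f)."""
    return product["store_prices_dpf"].get(
        store_id, product["default_price_f"]
    )

def ids_sorted_by_price(store_id):
    """Model sort=${computed_price} asc for one store."""
    ordered = sorted(products, key=lambda p: computed_price(p, store_id))
    return [p["id"] for p in ordered]

print(ids_sorted_by_price("STORE_FL"))  # ['SB-X', 'AC-2']
print(ids_sorted_by_price("STORE_AK"))  # ['AC-2', 'SB-X']
```

In Alaska the air conditioner has a store price of 312.99, which undercuts the snow blower’s default 350.37, so the order flips relative to the default-price sort.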


Circling back to the approaches with per-store pricing as if we had 5000 stores.  5000*number_of_products documents versus 5000 collections versus 5000 fields versus 5000 terms.  Lucene is good at lots of terms per field, and with payloads along for the ride it’s a good fit for this many-value-per-document scenario.

Faceting on numeric payloads

Faceting is currently a bit trickier with computed values, since facet.range only works with actual fields, not pseudo ones.  In this price case, since there aren’t usually many price ranges needed, we can use facet.query parameters along with {!frange} on the payload().  With the example data, let’s facet on (computed/actual) price ranges.  The following two parameters define two price ranges:

  • facet.query={!frange key=up_to_400 l=0 u=400}${computed_price} (includes price=400.00)
  • facet.query={!frange key=above_400 l=400 incl=false}${computed_price} (excludes price=400.00, with “include lower” incl=false)

Depending on which store_id we pass, we either have both products in the up_to_400 range (STORE_AK) or one product in each bucket (STORE_FL).  The following link provides the full URL with these two price range facets added: /products/query?…facet.query={!frange%20key=above_400%20l=400%20incl=false}${computed_price}

Here’s the price range facet output with store_id=STORE_AK:

 facet_queries:  {
   up_to_400: 2,
   above_400: 0
 }
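
Those counts can be checked by hand. Here is a Python sketch of the two range tests, with the computed prices for each store written out from the example data (store price where one exists, default price otherwise). The `frange` helper is my own emulation of the l, u, incl, and incu options, not Solr code.

```python
def frange(value, l=None, u=None, incl=True, incu=True):
    """True if value lies in the range; bounds are inclusive by default."""
    if l is not None and (value < l or (value == l and not incl)):
        return False
    if u is not None and (value > u or (value == u and not incu)):
        return False
    return True

def facet_counts(prices):
    """Count prices in the two facet.query ranges defined above."""
    return {
        "up_to_400": sum(frange(p, l=0, u=400) for p in prices),
        "above_400": sum(frange(p, l=400, incl=False) for p in prices),
    }

# Computed prices per store for the two example products (SB-X, AC-2).
computed = {
    "STORE_AK": [350.37, 312.99],
    "STORE_FL": [275.99, 499.50],
}

print(facet_counts(computed["STORE_AK"]))  # {'up_to_400': 2, 'above_400': 0}
print(facet_counts(computed["STORE_FL"]))  # {'up_to_400': 1, 'above_400': 1}
```

A price of exactly 400.00 lands only in up_to_400, because the second range excludes its lower bound.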

Weighted terms

This particular use case is implemented exactly like the pricing example, using whatever terms are appropriate instead of store identifiers.  This could be useful, for example, for weighting the same words differently depending on the context in which they appear: words parsed from an <H1> HTML tag could be assigned a greater payload weight than other terms.  Or perhaps, during indexing, entity extraction can assign weights reflecting the confidence of each entity choice.

To assign payloads to terms using the delimited payload token filtering, the indexing process will need to craft the terms in the “term|payload” delimited fashion.
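
Here is a minimal Python sketch of that crafting step, turning a mapping of terms to weights into the delimited string that a *_dpf field expects. The function name is made up, and the check for whitespace or the delimiter inside a term is a precaution of mine, since either would confuse the whitespace tokenizer and delimiter filter.

```python
def to_delimited(weights, delimiter="|"):
    """Join {term: weight} into 'term|weight term|weight ...'."""
    parts = []
    for term, weight in weights.items():
        if delimiter in term or any(ch.isspace() for ch in term):
            raise ValueError(f"term cannot be tokenized safely: {term!r}")
        parts.append(f"{term}{delimiter}{float(weight)}")
    return " ".join(parts)

print(to_delimited({"one": 1, "two": 2, "three": 3}))
# one|1.0 two|2.0 three|3.0
```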

Synonym weighting

One technique many of us have used is the two-field copyField trick: one field has synonyms enabled and another does not, and query fields (edismax qf) weight the non-synonym field higher than the synonym field, giving closer-to-exact matches a relevancy boost.

Instead, payloads can be used to down-weight synonyms within a single field.  Note this is an index-time technique with synonyms, not query-time.  The secret behind this comes from a handy analysis component called NumericPayloadTokenFilterFactory – this handy filter assigns the specified payload to all terms matching the token type specified, “SYNONYM” in this case.  The synonym filter injects terms with this special token type value; token type is generally ignored and not indexed in any manner, yet being useful during the analysis process to key off of for other operations like this trick of assigning a payload to only certain tagged tokens.

For demonstration purposes, let’s create a new collection to experiment with: bin/solr create -c docs

There’s no built-in field type that has this set up already, so let’s add one:

curl -X POST -H 'Content-type:application/json' -d '{
 "add-field-type": {
   "name": "synonyms_with_payloads",
   "stored": "true",
   "class": "solr.TextField",
   "positionIncrementGap": "100",
   "indexAnalyzer": {
     "tokenizer": {
       "class": "solr.StandardTokenizerFactory"
     },
     "filters": [
       {
         "class": "solr.SynonymGraphFilterFactory",
         "expand": "true",
         "ignoreCase": "true",
         "synonyms": "synonyms.txt"
       },
       {
         "class": "solr.LowerCaseFilterFactory"
       },
       {
         "class": "solr.NumericPayloadTokenFilterFactory",
         "payload": "0.1",
         "typeMatch": "SYNONYM"
       }
     ]
   },
   "queryAnalyzer": {
     "tokenizer": {
       "class": "solr.StandardTokenizerFactory"
     },
     "filters": [
       {
         "class": "solr.LowerCaseFilterFactory"
       }
     ]
   }
 },

 "add-field" : {
   "name": "synonyms_with_payloads",
   "type": "synonyms_with_payloads",
   "stored": "true",
   "multiValued": "true"
 }
}' http://localhost:8983/solr/docs/schema

With that field, we can add a document that will have synonyms assigned (the out of the box synonyms.txt contains Television, Televisions, TV, TVs), again adding it through the Solr admin Documents area, for the docs collection just created using Document Type CSV:


Using the {!payload_score} query parser this time, we can search for “tv” like this: http://localhost:8983/solr/docs/select?q={!payload_score f=synonyms_with_payloads v=$payload_term func=max}&debug=true&fl=id,score&wt=csv&payload_term=tv

which returns:


Changing &payload_term=television reduces the score to 0.1.

This term type to numeric payload mapping can be useful beyond synonyms – there are a number of other token types that various Solr analysis components can assign, including <EMAIL> and <URL> tokens that UAX29URLEmailTokenizer can extract.

Payload-savvy query parsers

There are two new query parsers available that leverage payloads, payload_score and payload_check.  The following table details the syntax of these parsers:

query parser description specification
{!payload_score} SpanQuery/phrase matching, scores based on numerically encoded payloads attached to the matching terms
{!payload_check} SpanQuery/phrase matching on terms that have a specific payload at the given position, scores based on SpanQuery/phrase scoring

Both of these query parsers tokenize the query string based on the field type’s query time analysis definition (whitespace tokenization for the built-in payload types) and formulate an exact phrase (SpanNearQuery) query for matching.

{!payload_score} query parser

The {!payload_score} query parser matches on the phrase specified, scoring each document based on the payloads encountered on the query terms, using the min, max, or average.  In addition, the natural score of the phrase match, based on the usual index statistics for the query terms, can be multiplied into the computed payload scoring factor using includeSpanScore=true.
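The scoring rule just described can be sketched in Python as follows. This models the arithmetic only, not Solr's code. The function name and the span_score argument are illustrative stand-ins.

```python
def payload_score(payloads, func="max", span_score=1.0, include_span_score=False):
    """Combine the payloads found on the matched terms into one score.

    payloads: numeric payloads decoded from the matching term positions.
    func: "min", "max", or "average".
    span_score: stand-in for the natural phrase-match score.
    """
    if not payloads:
        raise ValueError("no payloads on the matched terms")
    if func == "min":
        factor = min(payloads)
    elif func == "max":
        factor = max(payloads)
    elif func == "average":
        factor = sum(payloads) / len(payloads)
    else:
        raise ValueError(f"unsupported func: {func}")
    # With includeSpanScore=true the phrase score is multiplied in.
    return factor * span_score if include_span_score else factor


print(payload_score([1.0, 3.0], func="average"))
print(payload_score([0.5], func="min", span_score=2.0, include_span_score=True))
```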

{!payload_check} query parser

So far we’ve focused on numeric payloads; however, strings (or raw bytes) can be encoded into payloads as well.  These non-numeric payloads are not usable with the payload() function, which is intended solely for numerically encoded payloads, but they can be used for an additional level of matching.

Let’s add another document to our original “payloads” collection using the *_dps dynamic field to encode payloads as strings:

99,taking|VERB the|ARTICLE train|NOUN

The convenient command-line to index this data is:

bin/post -c payloads -type text/csv -out yes -d $'id,words_dps\n99,taking|VERB the|ARTICLE train|NOUN'

We’ve now got three terms, payloaded with their part of speech.   Using {!payload_check}, we can search for “train” and only match if it was payloaded as “NOUN”:

q={!payload_check f=words_dps v=train payloads=NOUN}

If instead payloads=VERB, this document would not match.  Scoring from {!payload_check} is the score of the SpanNearQuery generated, using payloads purely for matching.  When multiple words are specified for the main query string, multiple payloads must be specified, and they must match in the same order as the specified query terms.   The payloads specified must be space separated.   We can match “the train” in this example when those two words are, in order, an ARTICLE and a NOUN:

q={!payload_check f=words_dps v='the train' payloads='ARTICLE NOUN'}

whereas payloads='ARTICLE VERB' does not match.
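The matching behavior above can be mimicked with a small Python model. This is a sketch of the logic, not of the SpanNearQuery machinery, and the function names are made up for illustration.

```python
def parse_delimited(text, delimiter="|"):
    """Split "term|PAYLOAD" tokens into (term, payload) pairs."""
    tokens = []
    for part in text.split():
        term, _, payload = part.partition(delimiter)
        tokens.append((term, payload or None))
    return tokens


def payload_check(field_value, phrase, payloads):
    """True if the phrase occurs with exactly the given payloads, in order."""
    tokens = parse_delimited(field_value)
    terms = phrase.split()
    wanted = payloads.split()  # payloads are space separated, one per query term
    if len(terms) != len(wanted):
        raise ValueError("one payload is required per query term")
    n = len(terms)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(t == term and p == want
               for (t, p), term, want in zip(window, terms, wanted)):
            return True
    return False


doc = "taking|VERB the|ARTICLE train|NOUN"
print(payload_check(doc, "train", "NOUN"))
print(payload_check(doc, "the train", "ARTICLE VERB"))
```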


The payload feature provides per-term instance metadata, available to influence scores and provide an additional level of query matching.

Next steps

Above we saw how to range facet using payloads.  This is less than ideal, but there’s hope for true range faceting over functions.  Track SOLR-10541 to see when this feature is implemented.

Just after this Solr payload work was completed, a related useful addition was made to Lucene to allow term frequency overriding, which is a short-cut to the age-old repeated keyword technique.  This was implemented for Lucene 7.0 at LUCENE-7854.  Like the payload delimiting token filters described above, there’s now also a DelimitedTermFrequencyTokenFilter.   Payloads, remember, are encoded per term position, increasing the index size and requiring an additional lookup per term position to retrieve and decode them.  Term frequency, however, is a single value for a given term.  It’s limited to integer values and is more performantly accessible than a payload.  The payload() function can be modified to transparently support integer encoded payloads and delimited term frequency overrides (note: the termfreq() function would work in this case already).  Track SOLR-11358 for the status of the transparent term frequency / integer payload() implementation.
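To see why a term frequency override is a short-cut for the repeated keyword technique, compare the two in a toy Python model. This only counts tokens and does not use Lucene. The “|” delimiter and the function names are assumptions for illustration.

```python
from collections import Counter


def repeated_keyword_freqs(text):
    """Term frequencies when a keyword is simply repeated in the field value."""
    return Counter(text.split())


def delimited_freqs(text, delimiter="|"):
    """Term frequencies when each term may carry an integer override, e.g. "solr|3"."""
    freqs = Counter()
    for part in text.split():
        term, sep, count = part.partition(delimiter)
        freqs[term] += int(count) if sep else 1
    return freqs


# Both field values describe the same term frequencies.
print(repeated_keyword_freqs("solr solr solr lucene"))
print(delimited_freqs("solr|3 lucene"))
```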

Also, alas, there was a bug reported with debug=true mode when using the payload() function with assertions enabled.  A fix is in the patch provided at SOLR-10874.


The post Solr Payloads appeared first on Lucidworks.

An update from Open Burkina / Open Knowledge Foundation

Energy is fundamental to any development. The National Electricity Company of Burkina Faso (SONABEL), whose task is the generation, transmission, and distribution of electricity to the Burkinabè population, works hard to enable citizens to benefit from this important resource. However, it is clear that SONABEL hardly fulfills this mission: hardly a day goes by without a power failure in Ouagadougou. After multiple complaints, which can be found on social networks, citizens have become resigned and passively endure the cuts.

A tweet that compares the electricity supply to light effect

Among these cuts, there are load-sheddings (the deliberate shutdown of electric power in parts of a system to prevent the failure of the entire system), which are due to an insufficient capacity of SONABEL. Other cuts are due to incidents on the transmission or distribution networks.

Regarding load-shedding, SONABEL produces a weekly program, but that is not legible for the citizen. It is therefore difficult for them to know if they should be concerned or not. This decreases the value of the program to the citizens.

Load-shedding program as it is published by the electricity company

On the other hand, there is no data on cuts, such as their numbers or their locations, which makes citizen advocacy to improve service delivery more difficult.

For better service delivery, the Open Knowledge International local group in Burkina Faso, called Open Burkina, has been reflecting on this issue since 2015. The idea is to provide citizen support to the efforts of the state. From this reflection, a project with three components was born.

Mapping components

Through the mapping, the project intends to represent the load-shedding program on a map to make it more readable. A notification system can be set up to send an email or SMS to the residents of areas affected by load-shedding.

Data Collection Components

In this component, domestic sensors are designed to record power cuts and the return of current. The data will then be centralized and made available as open data. The sensors are built on Arduino boards, drawing on the Waziup and Open IoT projects.

Notifying threshold

For load-shedding cuts, a system will be provided that notifies residents of an area when consumption approaches the threshold that can lead to load-shedding. These users will be invited to reduce their consumption to avoid reaching the threshold. We hope that this system will help regulate the consumption of electricity and avoid outages due to load-shedding.

A nurse, using her phone light to receive her patients during a power cut in Ouagadougou. Photograph: Aoua Ouédraogo

Our project was presented for a competitive grant for open data innovators in Africa, launched by our partner ODI in June 2017. Out of more than 80 candidate projects from across Africa, ours is one of the three winners. Thanks to this recognition, the project will receive £6,000 (about 4.2 million FCFA) in funding to achieve its objectives.

The project is expected to last three months, and Open Burkina will work closely with SONABEL, the IGB, the ANPTIC, Nos3S and the city of Ouagadougou for its success.


Long-Lived Scientific Observations / David Rosenthal

By BabelStone, CC BY-SA 3.0
Keeping scientific data, especially observations that are not repeatable, for the long term is important. In our 2006 Eurosys paper we used an example from China. During the Shang dynasty:
astronomers inscribed eclipse observations on animal bones. About 3200 years later, researchers used these records to estimate that the accumulated clock error was about 7 hours. From this they derived a value for the viscosity of the Earth's mantle as it rebounds from the weight of the glaciers.
Last week we had another, if only one-fifth as old, example of the value of long-ago scientific observations. Korean astronomers' records of a nova in 1437 provide strong evidence that:
1437 nova remains
"cataclysmic binaries"—novae, novae-like variables, and dwarf novae—are one and the same, not separate entities as has been previously suggested. After an eruption, a nova becomes "nova-like," then a dwarf nova, and then, after a possible hibernation, comes back to being nova-like, and then a nova, and does it over and over again, up to 100,000 times over billions of years.
How were these 580-year-old records preserved? Follow me below the fold.

The nova was recorded in the sillok, the Annals of the Joseon Dynasty. Because they were compiled over 200 years after Choe Yun-ui's (최윤의) 1234 invention of bronze movable type, the final versions of each reign's Annals, from:
the Annals of Sejong (r. 1418–1450) onwards, were printed with movable metal and wooden type, which was unprecedented in the making of annals in Japan and China.
And Lots Of Copies were made to Keep Stuff Safe using geographical diversity, regular audit, and replacement of lost copies:
Four separate repositories were established in Chunchugwan, Chungju County, Jeonju County, and Seongju County to store copies of the Annals. All but the repository in Jeonju were burned down during the Imjin wars. After the war, five more copies of the Annals were produced and stored in Chunchugwan and the mountain repositories of Myohyang-san, Taebaeksan, Odaesan, and Mani-san.
A good way to preserve information, which the LOCKSS Program implemented! The story of their preservation is told in Shin Byung Ju's Dedicated Efforts to Preserve the Annals of the Joseon Dynasty:
Although the Annals of the Joseon Dynasty (Joseonwangjosillok) have been duly recognized as an incomparable documentary treasure, this would not have been possible without its elaborate and scientific system of maintenance and preservation. This included the building of archives in remote mountainous regions, where the Annals could be safely stored for future generations, along with the development of nearby guardian temples to protect the archives during times of crisis. The Annals would be stored in special boxes, together with medicinal herbs to ward off insects and absorb moisture. Also, the Annals were aired out once every two years as part of a continuous maintenance and preservation process. As such, it was the rigid adherence to these painstaking procedures that enabled the Annals of the Joseon Dynasty to be maintained in their original form after all these centuries.
The details are fascinating, go read! Similar care was taken at Haeinsa:
most notable for being the home of the Tripitaka Koreana, the whole of the Buddhist Scriptures carved onto 81,350 wooden printing blocks, which it has housed since 1398.
Winston Smith in "1984" was an editor for the Ministry of Truth; he "rewrites records and alters photographs to conform to the state's ever-changing version of history itself". George Orwell wasn't a prophet. Throughout history, governments of all stripes have found the need to employ Winston Smiths and the Joseon dynasty was no exception. But the Koreans of that era even defended against their Winston Smiths:
In the Later Joseon period when there was intense conflict between different political factions, revision or rewriting of sillok by rival factions took place, but they were identified as such, and the original version was preserved.
Today's eclipse records would be on the Web, not paper or bone. Will astronomers 3200 or even only 580 years from now be able to use them?

New IMLS-funded project: Opening access to 20th century public domain serials / John Mark Ockerbloom

I’m happy to report that over the next year, I and others at Penn will be working on a project that the Institute of Museum and Library Services has just funded to help open access to the vast public domain of 20th century serials.  We’ll be developing and demonstrating data sets and procedures to make it much easier to verify public domain status for content in scholarly journals, magazines, newspapers, newsletters, and special interest periodicals published in the United States.  We hope that all kinds of libraries can take advantage of the resources we provide to make materials like this in their collections available online to all, and digitally preserve them for posterity.

As I’ve noted previously, US publications prior to 1964 had to have their copyrights renewed or they would enter the public domain, and most serial content from that era was not renewed.  Projects like HathiTrust and JSTOR have lots of public domain serial content digitized, but they generally don’t provide access to it past 1922, since it’s difficult to verify the status of the post-1922 volumes they have digitized.  HathiTrust’s Copyright Review Program has investigated and opened access to many post-1922 books, but to date has not moved into serials, where one needs to verify the public domain status not only of the serial itself, but of the individual contributions (articles, stories, etc.) in the serials.  To date, that’s been too complicated to do at scale for them and for many other projects.

We aim to make it simple and practical to do so for many serials.  Here’s how we plan to do it:

  • First, we’ll complete and openly publish online an inventory, long in the works, of all the serials that have an active issue or contribution copyright renewal filed with the US Copyright Office, and the dates of the first such renewals, up to when the Copyright Office’s online registration catalog picks up.
  • Second, with the help of legal experts, we’ll draft suggested procedures for using this inventory, along with other online resources, to quickly identify and check public domain serial content.  We plan to develop procedures along the lines of those described in Michigan’s Finding the Public Domain Toolkit, and to be similarly useful to libraries and other digitizers of serials.
  • Finally, we’ll demonstrate and publicize our procedures and data sets, by using them to copyright-clear and digitize some sample serial content, by publishing our resources and reports online, and by reaching out to librarians and others who can use them.

We’d love to have you join us in this work.  Here are some of the things you might do:

  • Help spread the word about this project.   I’ll be doing a brief presentation about it at the upcoming Digital Library Federation Forum in Pittsburgh, and will post my slides after the event.  I’m happy to talk about it further and answer questions there and anywhere else that’s practical.
  • Give us feedback.  What are the best ways for us to provide our inventories and procedures?  How can we make them easier to understand, use, and (as appropriate) automatically process and repurpose?  What kinds of serials deserve particular attention in our procedures and sample digitizations?  How can we best reach, and get contributions and suggestions from, the people and institutions that could put these resources to good use?
  • Try using the data and procedures, when we have them, to copyright-clear and digitize public domain serial content you’d like to share with the world.  If you let us know what you’ve done, and how it went, we might be able to use what you tell us to improve what we provide; and we can also publicize your digitizations in places like The Online Books Page.
  • Help enhance the data we have.  Our basic inventory will list all the serials in the time period it covers that have renewals, and list the earliest active renewals.  But many of those serials also have public domain content after those first renewals.  We don’t have time ourselves to list all of those renewals for all the serials we list, but it’s possible for motivated folks to do so for material they’d like to digitize or see digitized.  I’m very interested in sharing or linking to any such lists that are made for particular serials.  (I hope to share an example list of this sort in an upcoming post.)

I’m excited about this project, and about the content that I hope this project will help make available to the world online.  If you find it of interest as well, I’d love to hear from you.



LITA Forum 2017 – Call for Library School Student Volunteers / LITA

2017 LITA Forum
Denver, Colorado
November 9 – 12, 2017

Student registration rate available – 50% off registration rate – $180

The Library and Information Technology Association (LITA), a division of the American Library Association, is offering a discounted student registration rate for the 2017 LITA Forum. This offer is limited to graduate students enrolled in ALA-accredited programs. In exchange for the lower registration cost, these graduate students will be asked to assist the LITA organizers and Forum presenters with onsite operations. This is a great way to network and meet librarians active in the field.

The selected students will be expected to attend the full LITA Forum, Friday noon through Sunday noon. Attendance during the preconferences on Thursday afternoon and Friday morning is not required. While you will be assigned a variety of duties, you will be able to attend the Forum programs, which include 2 keynote sessions, over 50 concurrent sessions, and poster presentations, as well as many opportunities for social engagement.

The student rate is $180 – half the regular registration rate for LITA members. A real bargain, this rate includes a Friday night reception, breakfasts, and Saturday lunch.

The Forum will be held November 9-12, 2017 at the Embassy Suites by Hilton Denver Downtown Convention Center in Denver, Colorado. For more information, visit the Forum website. We anticipate an attendance of 300+ decision makers and implementers of new information technologies in libraries.

To apply to be a student volunteer, complete and submit this form by September 29, 2017.

You will be asked to provide the following:

  1. Contact information, including email address and cell phone number
  2. Name of the school you are attending
  3. Statement of 150 words (or less) explaining why you want to attend the LITA National Forum

Those selected to be volunteers registered at the student rate will be notified no later than Friday, October 13, 2017.

Questions should be forwarded to Ali VanDoren.

Jobs in Information Technology: September 13, 2017 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week

Cornell University Library, Web Application Developer, – the Next Generation, Ithaca, NY

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

An Invitation to an Adventure in Data Management and Curation at the National Transportation Library / Library of Congress: The Signal

The following is a guest post by Laura Farley, Fellow, National Transportation Library, Bureau of Transportation Statistics.

If you’re looking to build skills as a data manager and have a lasting impact on federal data curation, come join the National Transportation Library (NTL) Fellowship program. As a current Fellow, I can attest that your experiences here will sharpen your technical and reasoning skills, challenge you to think broadly and creatively, and introduce you to a range of professionals working together to make government data accessible and sustainable.

NTL is looking for their next Data Management/Data Curation Fellow. This Fellows Program offers hands-on experience with exposure to creating and evaluating data management plans, crafting data curation standard operating plans and policies, cataloging data sets for preservation, data format migration and other data services, all within the context of providing access to an entire research package which will contain data and reports. Fellows will also conduct their own research, as well as participate in networking and outreach. This is your opportunity to have a direct impact on the future of data management and curation practices in a federal agency.

NTL operates within the Bureau of Transportation Statistics (BTS) at the US Department of Transportation (US DOT). Established in 1998, NTL operated within the US DOT Library until 2015 when the libraries separated. Today, NTL serves a vital role collecting, disseminating, and preserving transportation information. The staff of eight provide a range of services including virtual reference and coordination of the nation’s transportation knowledge networks for information professionals. In 2016, NTL strengthened their commitment to data management within the library and throughout BTS. The NTL Fellowship program grew out of that commitment.

I ended up as NTL’s first Fellow as the result of a series of chain reactions beginning with a database course in library school and ending with a growing desire to switch career focuses from serving the public directly to serving the public through data management. I came to realize I wanted to be part of this moment in data. Of course, data science isn’t new, but this moment is special as large amounts of data become easier to access and share, and data visualization becomes ever more present in how we interact with information. It was during the fall 2016 Library of Congress Collections As Data symposium that I became deeply interested in the intersection between traditional humanities and data. I could see how learning to collect, organize, analyze, present, and preserve data would become increasingly important for humanities fields, not just STEM.

U.S. Map of Annual Freight Tonnage by mode

Tonnage on Highways, Railroads and Inland Waterways: 2002. U.S. Department of Transportation, Federal Highway Administration, Freight Analysis Framework, Version 2.2, 2007.

It was this new curiosity that led me to the NTL Fellowship Program, where I could increase my skills in data management on the job. Information professionals come from varied professional backgrounds, which is part of why a career in librarianship is so exciting. I majored in history and never would have believed that 10 years into my career I’d find myself on a data-driven path working at a 100% digital library. As a person for whom numbers are a challenge, it’s an endless source of amazement to me that my colleagues are statisticians, economists, and GIS specialists. I have found collaborating with colleagues outside my immediate skill set to be an asset; we learn from one another’s professional perspectives and work styles.

Over this first year I will complete rotations with all areas of the library: reference, cataloging, data management, and systems operations. During years two and three I will work on a data-focused project of my choosing. The Fellowship is meant to be mutually beneficial: an opportunity for NTL to bring new ideas and experiences into the library, and an educationally focused chance for the Fellow to pursue training and projects that will build their skills. Already I’ve had challenging and fruitful experiences, including coursework, field trips, and networking. Most importantly, I’ve had the chance to immediately contribute to real projects at NTL, including website usability and designing a data management plan for incoming data sets.

The DM/DC Fellowship is paid, and open to applicants with a Master’s degree in Library Science, Information Science, Computer Science, or related field. View the fellowship details here. Applications are due by 5:00 PM ET, Monday, September 25, 2017.

If you’re looking for a challenge and the chance to collaborate with innovative, passionate, and driven professionals, this opportunity is for you. Just consider, how many fellowships give you the opportunity to eat lunch on a decommissioned nuclear shipping vessel as part of your welcome?

Nuclear Ship Savannah on the water in 1962

Nuclear Ship Savannah, the first commercial nuclear power cargo vessel, en route to the World’s Fair in Seattle, 1962. Photo from U.S. National Archives and Records Administration.

Copyright Office IT modernization progress report / District Dispatch

Stack of academic books about copyright. Stock photo.
The library community has been urging Congress not to derail the Copyright Office (CO)’s ongoing IT modernization by relocating it now or in the future. Rights holders, on the other hand, have been lobbying hard for CO independence. Some of them argue that the CO should become independent of the Library of Congress (LC) because the CO’s technology needs are inherently different. In fact, both seem to have similar needs: record management, searchable databases, archiving, online transactions, security, storage availability and data integrity, to name a few.

I am happy to report, according to a new Copyright Office report submitted to the House Committee on Appropriations, “Modified U.S. Copyright Office Provisional IT Modernization Plan,” IT modernization at the CO is well underway. Much credit goes to Librarian of Congress Dr. Carla Hayden, who was confirmed in July 2016. One of the first things that Dr. Hayden did was restructure her management team so the head of the Library’s Office of the Chief Information Officer (OCIO) now reports directly to her, thus allowing the Librarian to manage modernization for the entire LC, including the CO. As a result, the House Committee on Appropriations asked the CO to modify its 2016 Provisional Plan, which assumed that modernization would be “managed from within the Copyright Office.” The modified plan makes clear that coordination between LC and CO is now the new normal.

And real progress has been made. The last year has been spent studying the many complex elements of the CO’s copyright recordation function. Currently, a manual system is used to record who holds the exclusive rights to a work, with a six- to 10-month lag time for integrating new records. (So, if the work you are interested in is not an orphan work, the CO can tell you who holds its copyright only as of late 2016.)

The work plan has three phases over the next several years. The existing systems for registration (the ancient eCo system), public records catalog (which currently limits retrieval of search sets to 10,000 results – ouch!) and the statutory licensing systems are slated to be scrapped for new systems and significant upgrades. The CO has a fiduciary responsibility to distribute royalty fees to rights holders, so any system also must be able to ingest and examine licensing data. Of course, accomplishing these goals will take several years and there will be a period in which the old and new systems will operate concurrently, but enhancements will be implemented in as little as two years.

While other national priorities have moved copyright review legislation to Congress’ back burner, the OCIO and the CO are making progress on much-needed IT modernization. By working together, the OCIO can work towards the IT goals of the CO as well as provide needed IT support. Meanwhile, LC, OCIO and CO should be congratulated for the progress made thus far, which already has cut two years off earlier IT improvement projections.

So, if relocating the Copyright Office makes little technical, functional or economic sense, it’s hard to say how it will be any more sensible as time and improvements march on.

The post Copyright Office IT modernization progress report appeared first on District Dispatch.