Planet Code4Lib

Back-to-school mobile snapshot / Erin White

This week I took a look at mobile phone usage on the VCU Libraries website for the first couple weeks of class and compared that to similar time periods from the past couple years.

2015

Here’s some data from the first week of class through today.

Note that mobile is 9.2% of web traffic. To round some numbers, 58% of those devices are iPhones/iPods and 13% are iPads. So we’re looking at about 71% of mobile traffic (about 6.5% of all web traffic) from Apple devices. Dang. After that, it’s a bit of a long tail of other device types.

To give context, about 7.2% of our overall traffic came from the Firefox browser. So we have more mobile users than Firefox users.

2015 mobile device breakdown

2014

Mobile jumped to 9% of all traffic this year. This is partially due to our retiring our mobile-only website in favor of a responsive web design. As with the other years, at least 2/3 of the mobile traffic comes from iOS devices.

2014 mobile device breakdown

2013

Mobile was 4.7% of all traffic; iOS was 74% of mobile traffic; tablets, amazingly, were 32% of all mobile traffic.

I have one explanation for the relatively low traffic from iPhone: at the time, we had a separate mobile website that was catching a lot of traffic for handheld devices. Most phone users were being automatically redirected there.

2013 mobile device breakdown

Observations

Browser support

Nobody’s surprised that people are using their phones to access our sites. When we launched the new VCU Libraries website last January, the web team built it with a responsive web design that could accommodate browsers of many shapes and sizes. At the same time, we decided which desktop browsers to leave behind – like Internet Explorer 8 and below, which we also stopped fully supporting when we launched the site. Looking at stats like this helps us figure out which devices to prioritize/test most with our design.

Types of devices

VCU Libraries mobile circa 2011

Though it’s impossible to test on every device, we have focused most of our mobile development on iOS devices, which seems to be a direction we should keep going, as it catches a majority of our mobile users. It would also be useful for us to look at larger-screen Android devices, though (any takers?). With virtual testing platforms like BrowserStack at our disposal we can test on many types of devices. But we should also look at ways to test with real devices and real people.

Content

Thinking broadly about strategy, making special mobile websites/m-dots doesn’t make sense anymore. People want full functionality of the web, not an oversimplified version with only so-called “on-the-go” information. Five years ago when we debuted our mobile site, this might’ve been the case. Now people are doing everything with their phones–including writing short papers, according to our personas research a couple years ago. So we should keep pushing to make everything usable no matter the screen.

Library groups keep up fight for net neutrality / District Dispatch

"Internet Open" sign

From Flickr

Co-authored by Larra Clark and Kevin Maher

Library groups are again stepping to the front lines in the battle to preserve an open internet. The American Library Association (ALA), Association of College and Research Libraries (ACRL), Association of Research Libraries (ARL) and the Chief Officers of State Library Agencies (COSLA) have requested the right to file an amici curiae brief supporting the respondent in the case of United States Telecom Association (USTA) v. Federal Communications Commission (FCC) and United States of America. The brief would be filed in the US Court of Appeals for the District of Columbia Circuit, which also has decided two previous network neutrality legal challenges. ALA also is opposing efforts by Congressional appropriators to defund FCC rules.

Legal brief to buttress FCC rules, highlight library values

The amici request builds on library and higher education advocacy throughout the last year supporting the development of strong, enforceable open internet rules by the FCC. As library groups, we decided to pursue our own separate legal brief to best support and buttress the FCC’s strong protections, complement the filings of other network neutrality advocates, and maintain visibility for the specific concerns of the library community. Each of the amici parties will have quite limited space to make its arguments (likely 4,000-4,500 words), so particular library concerns (rather than broad shared concerns related to free expression, for instance) are unlikely to be addressed by other filers and demand a separate voice. The FCC also adopted in its Order a standard that library and higher education groups specifically brought forward: a standard for future conduct that reflects the dynamic nature of the internet and internet innovation, extending protections against questionable practices on a case-by-case basis.

Based on conversations with the FCC general counsel and lawyers for aligned advocates, we plan to focus our brief on supporting the future conduct standard (formally referenced, starting at paragraph 133 of the Order, as the “no unreasonable interference or unreasonable disadvantage standard for internet conduct”) and explaining why it is important to our community. We will also re-emphasize the negative impact of paid prioritization on our community and our users if the bright-line rules adopted by the FCC are not sustained, and ultimately make our arguments through the lens of the library mission and of promoting research and learning activities.

As the library group motion states, we argue that FCC rules are “necessary to protect the mission and values of libraries and their patrons, particularly with respect to the rules prohibiting paid prioritization.” Also, the FCC’s general conduct standard is “an important tool in ensuring the open character of the Internet is preserved, allowing the Internet to continue to operate as a democratic platform for research, learning and the sharing of information.”

USTA and amici opposed to FCC rules filed their briefs July 30, and the FCC filing is due September 16. Briefs supporting the FCC must be filed by September 21.

Congress threatens to defund FCC rules

ALA also is working to oppose Republican moves to insert defunding language in appropriations bills that could effectively block the FCC from implementing its net neutrality order. Under language included in both the House and Senate versions of the Financial Services and General Government Appropriations Bill, the FCC would be prohibited from spending any funds towards implementing or enforcing its net neutrality rules during FY2016 until specified legal cases and appeals (see above!) are resolved. ALA staff and counsel have been meeting with Congressional leaders to oppose these measures.

The Obama Administration criticized the defunding move in a letter from Office of Management and Budget (OMB) Director Shaun Donovan stating, “The inclusion of these provisions threatens to undermine an orderly appropriations process.” While not explicitly threatening a Presidential veto, the letter raises concern with appropriators’ attempts at “delaying or preventing implementation of the FCC’s net neutrality order, which creates a level playing field for innovation and provides important consumer protections on broadband service…”

Neither the House nor the Senate version of the funding measure has received floor consideration. The appropriations process faces a bumpy road in the coming weeks as House and Senate leaders seek to iron out differing funding approaches and thorny policy issues before the October 1 start of the new fiscal year. Congress will likely need to pass a short-term continuing resolution to keep the government open while discussions continue. House and Senate Republican leaders have indicated they will work to avoid a government shut-down. Stay tuned!

The post Library groups keep up fight for net neutrality appeared first on District Dispatch.

DPLA Archival Description Working Group / DPLA

The Library, Archives, and Museum communities have many shared goals: to preserve the richness of our culture and history, to increase and share knowledge, to create a lasting record of human progress.

However, each of these communities approaches these goals in different ways. For example, description standards vary widely among these groups. The library typically adopts a 1:1 model where each item has its own descriptive record. Archives and special collections, on the other hand, usually describe materials in the aggregate as a collection. A single record, usually called a “finding aid,” is created for the entire collection. Only the very rare or special item typically warrants a description all its own. So the archival data model typically has one metadata record for many objects (or a 1:n ratio).

At DPLA, our metadata application profile and access platform have been centered on an item-centric library model for description: one metadata record for each individual digital object. While this method works well for most of the items in DPLA, it doesn’t translate to the way many archives are creating records for their digital objects. Instead, these institutions are applying an aggregate description to their objects.

Since DPLA works with organizations that use both the item-level and aggregation-based description practices, we need a way to support both. The Archival Description Working Group will help us get there.

The group will explore solutions to support varying approaches to digital object description and access and will produce a whitepaper outlining research and recommendations. While the whitepaper recommendations will be of particular use to DPLA or other large-scale aggregators, any data models or tools advanced by the group will be shared with the community for further development or adoption.

The group will include representatives from DPLA Hubs and Contributing Institutions, as well as national-level experts in digital object description and discovery. Several members of the working group have been invited to participate, but DPLA is looking for a few additional members to volunteer. As a member of the working group, active participation in conference calls is required, as well as a willingness to assist with research and writing.

If you are interested in being part of the Archival Description Working Group, please fill out the volunteer application form by 9/13/15. Three applicants will be chosen to be a part of the working group, and others will be asked to be the first reviewers of the whitepaper and any deliverables. An announcement of the full group membership will be made by the end of the month.

3D Printing Partnerships: Tales Of Collaboration, Prototyping, And Just Plain Panic / LITA



*Photo taken from Flickr w/Attribution CC License: http://bit.ly/1UnoxIN

Many institutions have seen the rise of makerspaces within their libraries, but it’s still difficult to get a sense of how embedded they truly are within the academic fabric of their campuses and how they contribute to student learning. Libraries have undergone significant changes in the last five years, shifting from repositories to learning spaces, from places to experiences. It is within these new directions that the makerspace movement has risen to the forefront and begun to pave the way for truly transformative thinking and doing. Educause defines a makerspace as “a physical location where people gather to share resources and knowledge, work on projects, network, and build” (ELI 2013). These types of spaces are being embraced by the arts as well as the sciences and are quickly being adopted by the academic community because “much of the value of a makerspace lies in its informal character and its appeal to the spirit of invention” as students take control of their own learning (ELI 2013).

Nowhere is this spirit more alive than in entrepreneurship where creativity and innovation are the norm. The Oklahoma State University Library recently established a formal partnership with the School of Entrepreneurship to embed 3D printing into two pilot sections of its EEE 3023 course with the idea that if successful, all sections of this course would include a making component that could involve more advanced equipment down the road. Students in this class work in teams to develop an original product from idea, to design, to marketing. The library provides training on coordination of the design process, use of the equipment, and technical assistance for each team. In addition, this partnership includes outreach activities such as featuring the printers at entrepreneurship career fairs, startup weekends and poster pitch sessions. We have not yet started working with the classes, so much of this will likely change as we learn from our mistakes and apply what worked well to future iterations of this project.

This is all well and good, but how did we arrive at this stage of the process? The library first approached the School of Entrepreneurship with an idea for collaboration, but as we discovered, simply saying we wanted to partner would not be enough. We didn’t have a clear idea in mind, and the discussions ended without a concrete action plan. Fast forward to the summer, when the library was approached and asked about something that had been mentioned in the meeting: a makerspace. Were we interested in splitting the cost and piloting a project with a course? The answer was a resounding yes.

We quickly met several times to discuss exactly what we meant by “makerspace”, and we decided that 3D printing would be a good place to start. We drafted an outline of the equipment needed: three MakerBot Replicator 5th Generation printers and one larger Z18, along with the accompanying accessories and warranties. This information was gathered based on the collective experiences of the group, along with a few quick website searches to establish what other institutions were doing.

Next, we turned our attention to discussing the curriculum. While creating learning outcomes for making is certainly part of the equation, we had a very short time frame to get this done, so we opted for two sets of workshops for students, with homework in between, culminating in a certification to enable them to work on their product. The first workshop will walk them through using Blender to create an original design at a basic level; the second is designed to have them try out the printers themselves. In between workshops, they will watch videos and have access to a book to help them learn as they go. The certification at the end will consist of each team coming in and printing something (small) on their own, after which they will be cleared to work on their own products. Drop-in assistance as well as consultation assistance will also be available, and we are determining the best way to queue requests as they come in, knowing that we might have jobs printing overnight, while others may come in at the very last minute.

Although, as mentioned, we have just started on this project, we’ve learned several valuable lessons already that are worth sharing; they may sound obvious, but they are still important to highlight:

  1. Be flexible! Nothing spells disaster like a rigid plan that cannot be changed at the last minute. We wanted a website for the project, but we didn’t have time to create one. We had to wait until we received the printers to train ourselves on how they worked so that we could turn around and train the students. We are adapting as we go!
  2. Start small. Even two sections are proving to be a challenge with 40+ students all descending on a small space with limited printers. We hope they won’t come to blows, but we may have to play referee as much as consultant. There are well over 30 sections of this course that will present a much bigger challenge should we decide to incorporate this model into all of them.
  3. Have a plan in place, even if you end up changing it. We are now realizing that there are three main components to this collaboration, all of which need a point person and support structure: tech support, curriculum, and outreach. There are four separate departments in the library (Research and Learning Services, Access Services, Communications, and IT) working together to make this a successful experience for all involved, not to mention our external partners.

Oh yes, and there’s the nagging thought at the end of each day-please, please, let this work. Fingers crossed!

Using Thoth as a Real-Time Solr Monitor and Search Analysis Engine / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Damiano Braga and Praneet Mhatre’s session on how Trulia uses Thoth and Solr for real-time monitoring and analysis.

Managing a large and diversified Solr search infrastructure can be challenging, and there is still a lack of good tools that can help monitor the entire system and help the scaling process. This session will cover Thoth: an open source real-time Solr monitor and search analysis engine that we wrote and currently use at Trulia. We will talk about how Thoth was designed, why we chose Solr to analyze Solr, and the challenges that we encountered while building and scaling the system. Then we will talk about some useful Thoth features, like integration with Apache ActiveMQ and Nagios for real-time paging; generation of reports on query volume, latency, and time-period comparisons; and the Thoth dashboard. Following that, we will summarize our application of machine learning algorithms to the process of query analysis and pattern recognition, and its results. Finally, we will talk about the future directions of Thoth, opportunities to expand the project with new plug-ins, and integration with SolrCloud.

Damiano is part of the search team at Trulia, where he also helps manage the search infrastructure and create internal tools to help the scaling process. Prior to Trulia, he studied and worked at the University of Ferrara (Italy), where he completed his master’s degree in Computer Science Engineering. Praneet works as a Data Mining Engineer on Trulia’s Algorithms team. He works on property data handling algorithms, stats and trends generation, comparable homes, and other data-driven projects at Trulia. Before Trulia, he got his bachelor’s degree in Computer Engineering from VJTI, India, and his master’s in Computer Science from the University of California, Irvine.
Thoth – Real-time Solr Monitor and Search Analysis Engine: Presented by Damiano Braga & Praneet Mhatre, Trulia from Lucidworks
Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Using Thoth as a Real-Time Solr Monitor and Search Analysis Engine appeared first on Lucidworks.

Lucene Revolution Presents, Inside Austin(‘s) City Limits: Stump The Chump! / SearchHub

ONE NIGHT ONLY! THURSDAY OCTOBER 15TH! LIVE AT THE HILTON AUSTIN! STUMP! THE! CHUMP!

It’s that time of year again folks…

Six weeks from today, Stump The Chump will be coming to Austin Texas at Lucene/Solr Revolution 2015.

If you are not familiar with “Stump the Chump” it’s a Q&A style session where “The Chump” (that’s me) is put on the spot with tough, challenging, unusual questions about Solr & Lucene — live, on stage, in front of hundreds of rowdy convention goers, with judges (who have all had a chance to review and think about the questions in advance) taking the opportunity to mock The Chump (still me) and award prizes to people whose questions do the best job of “Stumping The Chump”.

If that sounds kind of insane, it’s because it kind of is.

You can see for yourself by checking out the videos from past events like Lucene/Solr Revolution Dublin 2013 and Lucene/Solr Revolution 2013 in San Diego, CA. (Unfortunately no video of Stump The Chump is available from Lucene/Solr Revolution 2014: D.C. due to audio problems.)

Information on how to submit questions is available on the conference website.

I’ll be posting more details as we get closer to the conference, but until then you can subscribe to this blog (or just the “Chump” tag) to stay informed.

The post Lucene Revolution Presents, Inside Austin(‘s) City Limits: Stump The Chump! appeared first on Lucidworks.

bento_search 1.4 released / Jonathan Rochkind

bento_search is a ruby gem that provides standardized ruby API and other support for querying external search engines with HTTP API’s, retrieving results, and displaying them in Rails. It’s focused on search engines that return scholarly articles or citations.

I just released version 1.4.

The main new feature is a round-trippable JSON serialization of any BentoSearch::Results or Items. This serialization captures internal state, suitable for a round-trip, such that if you’ve changed configuration related to an engine between dump and load, you get the new configuration after load.  Its main use case is a consumer that is also ruby software using bento_search. It is not really suitable for use as an API for external clients, since it doesn’t capture full semantics, but just internal state sufficient to restore to a ruby object with full semantics. (bento_search does already provide a tool that supports an Atom serialization intended for external client API use).

It’s interesting that once you start getting into serialization, you realize there’s no one true serialization; it depends on the use case. I needed a serialization that really was just of internal state, for a round trip back to ruby.

bento_search 1.4 also includes some improvements to make the specialty JournalTocsForJournal adapter a bit more robust. I am working on an implementation of JournalTocs fetching that needed the JSON round-trippable serialization too, for an Umlaut plug-in. Stay tuned.


Filed under: General

Link roundup September 3, 2015 / Harvard Library Innovation Lab

Goodbye summer

You can now buy Star Wars’ adorable BB-8 droid and let it patrol your home | The Verge

World Airports Voronoi

Stephen Colbert on Making The Late Show His Own | GQ

See What Happens When Competing Brands Swap Colors | Mental Floss

The Website MLB Couldn’t Buy

Studying the Altmetrics of Zotero Data / Zotero

In April of last year, we announced a partnership with the University of Montreal and Indiana University, funded by a grant from the Alfred P. Sloan Foundation, to examine the readership of reference sources across a range of platforms and to expand the Zotero API to enable bibliometric research on Zotero data.

The first part of this grant involved aggregating anonymized data from Zotero libraries. The initial dataset was limited to items with DOIs, and it included library counts and the months that items were added. For items in public libraries, the data also included titles, creators, and years, as well as links to the public libraries containing the items. We have been analyzing this anonymized, aggregated data with our research partners in Montreal, and now are beginning the process of making that data freely and publicly available, beginning with Impactstory and Altmetric, who have offered to conduct preliminary analysis (we’ll discuss Impactstory’s experience in a future post).

In our correspondence with Altmetric over the years, they have repeatedly shown interest in Zotero data, and we reached out to them to see if they would partner with us in examining the data. The Altmetric team that analyzed the data consists of about twenty people with backgrounds in English literature and computer science, including former researchers and librarians. Altmetric is interested in any communication that involves the use or spread of research outputs, so in addition to analyzing the initial dataset, they’re eager to add the upcoming API to their workflow.

The Altmetric team parsed the aggregated data and checked it against the set of documents known to have been mentioned or saved elsewhere, such as on blogs and social media. Their analysis revealed that approximately 60% of the items in their database that had been mentioned in at least one other place, such as on social media or news sites, had at least one save in Zotero. The Altmetric team was pleased to find such high coverage, which points to the diversity of Zotero usage, though further research will be needed to determine the distribution of items across disciplines.

The next step forward for the Altmetric team involves applying the data to other projects and tools, such as the Altmetric bookmarklet. The data will be useful in understanding the impact of scholarly communication, because conjectures about reference manager data can be confirmed or refuted, and the data can be studied to gain a better understanding of what it represents and how best to interpret it.

Based on this initial collaboration, Zotero developers are verifying and refining the aggregation process in preparation for the release of a public API and dataset of anonymized, aggregated data, which will allow bibliometric data to be highlighted across the Zotero ecosystem and enable other researchers to study the readership of Zotero data.

Matching names to VIAF / Thom Hickey

The Virtual International Authority File (VIAF) currently has about 28 million entities created by a merge of three dozen authority files from around the world.  Here at OCLC we are finding it very useful in controlling names in records.  In the linked data world we are beginning to experience, ‘controlling’ means assigning URIs (or at least identifiers that can easily be converted to URIs) to the entities.  Because of ambiguities in VIAF and the bibliographic records we are matching it to, the process is a bit more complicated than you might imagine. In fact, our first naive attempts at matching were barely usable.  Since we know others are attempting to match VIAF to their files, we thought a description of how we go about it would be welcome (of course, if your file consists of bibliographic records and they are already in WorldCat, then we’ve already done the matching).  While a number of people have been involved in refining this process, most of the analysis and code was done by Jenny Toves here in OCLC Research over the last few years.

First some numbers: The 28 million entities in VIAF were derived from 53 million source records and 111 million bibliographic records. Although we do matching to other entities in VIAF, this post is about matching against VIAF's 24 million corporate and personal entities.  The file we are matching it to (WorldCat) consists of about 400 million bibliographic records (at least nominally in MARC-21), each of which have been assigned a work identifier before the matching described below. Of the 430 million names in author/contributor (1XX/7XX) fields in WorldCat we are able to match 356 million (or 83%).  If those headings were weighted by how many holdings are associated with them, the percentage controlled would be even higher, as names in the more popular records are more likely to have been subjected to authority control somewhere in the world.

It is important to understand the issues raised when pulling together the source files that VIAF is based on.  While we claim that better than 99% of the 54 million links that VIAF makes between source records are correct, that does not mean that the resulting clusters are 99% perfect.  In fact, many of the more common entities represented in VIAF will have not only a ‘main’ VIAF cluster, but also one or more smaller clusters derived from authority records that we were unable to bring into the main cluster because of missing, duplicated or ambiguous information.  Another thing to keep in mind is that any relatively common name that has one or more famous people associated with it can be expected to have some misattributed titles (this is true for even the most carefully curated authority files of any size).

WorldCat has many headings with subfield 0's ($0s) that associate an identifier with the heading. This is very common in records loaded into WorldCat by some national libraries, such as French and German, so one of the first things we do in our matching is look for identifiers in $0's which can be mapped to VIAF.  When those mappings are unambiguous we use that VIAF identifier and are done.

The rest of this post is a description of what we do with the names that do not already have a usable identifier associated with them.  The main difficulties arise when there either are multiple VIAF clusters that look like good matches or we lack enough information to make a good match (e.g. no title or date match).  Since a poor link is often worse than no link at all, we do not make a link unless we are reasonably confident of it.

First we extract information about each name of interest in each of the bibliographic records:

  • Normalized name key:
    • Extract subfields a,q and j
    • Expand $a with $q when appropriate
    • Perform enhanced NACO normalization on the name
  • $b, $c's, $d, $0's, LCCNs, DDC class numbers, titles, language of cataloging, work identifier

The normalized name key does not include the dates ($d) because they are often not included in the headings in bibliographic records. The $b and $c are so variable, especially across languages, that they are also ignored at this point.  The goal is to have a key which will bring together variant forms of the name without pulling too many different entities together. After preliminary matching we do matching with more precision, and $b, $c and $d are used for that.

Similar normalized name keys are generated from the names in VIAF clusters.
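For illustration, here is a minimal Python sketch of this kind of key building. The subfield handling and the normalization routine are simplified assumptions rather than the actual OCLC code; real enhanced NACO normalization treats punctuation, diacritics and special characters with much more care.

    import re
    import unicodedata

    def naco_normalize(text):
        # Hypothetical, much-simplified stand-in for enhanced NACO normalization:
        # strip diacritics, drop punctuation, collapse whitespace, uppercase.
        text = unicodedata.normalize('NFKD', text)
        text = ''.join(c for c in text if not unicodedata.combining(c))
        text = re.sub(r'[^\w\s]', ' ', text)
        text = re.sub(r'\s+', ' ', text)
        return text.strip().upper()

    def name_key(field):
        # field: dict of subfield code -> list of values from a 1XX/7XX heading,
        # e.g. {'a': ['Smith, John'], 'q': ['(John Alexander)'], 'd': ['1950-']}
        parts = list(field.get('a', []))
        parts += field.get('q', [])   # expand $a with $q when appropriate
        parts += field.get('j', [])   # attribution qualifier, if any
        # $d (dates), $b and $c are deliberately left out of the key.
        return naco_normalize(' '.join(parts))

With a key like this, “Smith, John” and “Smith, John, 1950-” reduce to the same string, which is the point: dates and qualifiers are compared later, at the scoring stage.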

When evaluating matches we have a routine that scores the match based on criteria about the names (a rough sketch in code follows the list):

  • Start out with '0'
    • A negative value implies the names do not match
    • A 0 implies the names are compatible (nothing to indicate they can't represent the same entity), but nothing beyond that
    • Increasing positive values imply increasing confidence in the match
  • -1 if dates conflict*
  • +1 if a begin or end date matches
  • +1 if both begin and end dates match
  • +1 if begin and end dates are birth and death dates (as opposed to circa or flourished)
  • +1 if there is at least one title match
  • +1 if there is at least one LCCN match
  • -3 if $b's do not match
  • +1 if $c's match
  • +1 if DDCs match
  • +1 if the match is against a preferred form

Here are the stages we go through.  At each stage proceed to the next if the criteria are not met:

  • If only one VIAF cluster has the normalized name from the bibliographic record, use that VIAF identifier
  • Collapse bibliographic information based on the associated work identifiers so that they can share name dates, $b and $c, LCCN, DDC
    • Try to detect fathers/sons in same bibliographic record so that we don’t link them to the same VIAF cluster
  • If a single best VIAF cluster (better than all others) exists – use it
    • Uses dates, $b, $c, titles, preferred form of name to determine best match as described above
  • Try the previous rule again adding LCC and DDC class numbers in addition to the other match points (as matches were made in the previous step, data was collected to make this easier)
    • If there is a single best candidate, use it
    • If more than one best candidate – sort candidate clusters based on the number of source records in the clusters. If there is one cluster that has 5 or more sources and the next largest cluster has 2 or less sources, use the larger cluster
  • Consider clusters where the names are compatible, but not exact name matches
    • Candidate clusters include those where dates and/or enumeration do not exist either in the bibliographic record or the cluster
    • Select the cluster based on the number of sources as described above
  • If only one cluster has an LC authority record in it, use that one
  • No link is made

Fuzzy Title Matching

Since this process is mainly about matching names, and titles are used only to resolve ambiguity, the process described here depends on a separate title matching process.  As part of OCLC’s FRBR matching (which happens after the name matching described here) we pull bibliographic records into work clusters, and each bibliographic record in WorldCat has a work identifier associated with it based on these clusters.  Once we can associate a work identifier with a VIAF identifier, that work identifier can be used to pull in otherwise ambiguous missed matches on a name.  Here is a simple example:

Record 1:

    Author: Smith, John

    Title: Title with work ID #1

Record 2:

    Author: Smith, John

    Title: Another title with work ID #1

Record 3:

    Author: Smith, John

    Title: Title with work ID #2

In this case, if we were able to associate the John Smith in record #1 to a VIAF identifier, we could also assign the same VIAF identifier to the John Smith in record #2 (even though we do not have a direct match on title), but not to the author of record #3. This lets us use all the variant titles we have associated with a work to help sort out the author/contributor names.
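A minimal sketch of that inheritance step in Python; the record shapes are invented for illustration, and the real process of course operates over the full WorldCat/VIAF infrastructure rather than an in-memory list.

    def propagate_viaf_ids(records):
        # records: list of dicts, each with a normalized author 'name_key',
        # a 'work_id', and possibly a 'viaf_id' assigned by the direct
        # matching steps described above.
        assigned = {}
        for rec in records:
            if rec.get('viaf_id'):
                key = (rec['name_key'], rec['work_id'])
                assigned.setdefault(key, set()).add(rec['viaf_id'])
        for rec in records:
            if rec.get('viaf_id'):
                continue
            candidates = assigned.get((rec['name_key'], rec['work_id']), set())
            if len(candidates) == 1:
                # the same name on another record of the same work already has
                # an identifier, so this heading inherits it
                rec['viaf_id'] = next(iter(candidates))
        return records

Applied to the three records above, records 1 and 2 share a work identifier, so once record 1’s John Smith is linked to VIAF, record 2’s heading inherits the same identifier, while record 3 is left for the other matching steps.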

Of course this is not perfect.  There could be two different John Smiths associated with a work (e.g. father and son), so occasionally titles (even those that appear to be properly grouped in a work) can lead us astray.

That's a sketch of how the name matching process operates.  Currently WorldCat is updated with this information once per month and it is visible in the various linked data views of WorldCat.

--Th & JT

*If you want to understand more about how dates are processed, our code{4}lib article about Parsing and Matching Dates in VIAF describes that in detail.

Seeking Comment on Migration Checklist / Library of Congress: The Signal

The NDSA Infrastructure Working Group’s goals are to identify and share emerging practices around the development and maintenance of tools and systems for the curation, preservation, storage, hosting, migration, and similar activities supporting the long term preservation of digital content. One of the ways the IWG strives to achieve their goals is to collaboratively develop and publish technical guidance documents about core digital preservation activities. The NDSA Levels of Digital Preservation and the Fixity document are examples of this.

Birds in Pen

Birds. Ducks in pen. (Photo by Theodor Horydczak, 1920) (Source: Horydczak Collection Library of Congress Prints and Photographs Division, http://hdl.loc.gov/loc.pnp/thc.5a37506)

The latest addition to this guidance is a migration checklist. The IWG would like to share a draft of the checklist with the larger community in order to gather comments and feedback that will ultimately make this a better and more useful document. We expect to formally publish a version of this checklist later in the Fall, so please review the draft below and let us know by October 15, 2015 in the comments below or in email via ndsa at loc dot gov if you have anything to add that will improve the checklist.

Thanks, in advance, from your IWG co-chairs Sibyl Schaefer from University of California, San Diego, Nick Krabbenhoeft from Educopia and Abbey Potter from Library of Congress. Another thank you to the former IWG co-chairs Trevor Owens from IMLS and Karen Cariani from WGBH, who led the work to initially develop this checklist.

Good Migrations: A Checklist for Moving from One Digital Preservation Stack to Another

The goal of this document is to provide a checklist for things you will want to do or think through before and after moving digital materials and metadata forward to new digital preservation systems/infrastructures. This could entail switching from one system to another system in your digital preservation and storage architecture (various layers of hardware, software, databases, etc.). This is a relatively expansive notion of system. In some cases, organizations have adopted turn-key solutions whereby the requirements for ensuring long term access to digital objects are taken care of by a single system or application. However, in many cases, organizations make use of a range of built and bought applications and core functions of interfaces to storage media that collectively serve the function of a preservation system. This document is intended to be useful for migrations between comprehensive systems as well as for situations where one is swapping out individual components in a larger preservation system architecture.

Issues around normalization of data or of moving content or metadata from one format to another are out of scope for this document. This document is strictly focused on checking through issues related to moving fixed digital materials and metadata forward to new systems/infrastructures.

Before you Move:

  1. Review the state of data in the current system, clean up any data inconsistencies or issues that are likely to create problems on migration and identify and document key information (database naming conventions, nuances and idiosyncrasies in system/data structures, use metrics, etc.).
  2. Make sure you have fixity information for your objects and make sure you have a plan for how to bring that fixity information over into your new system. Note that different systems may use different algorithms/instruments for documenting fixity information, so check to make sure you are comparing the same kinds of outputs (a minimal checksum-manifest sketch follows this list).
  3. Make sure you know where all your metadata/records for your objects are stored and that, if you are moving that information, you have plans in place to ensure its integrity.
  4. Check/validate additional copies of your content stored in other systems; you may need to rely on some of those copies for repair if you run into migration issues.
  5. Identify any dependent systems using API calls into your system or other interfaces which will need to be updated and make plans to update, retire, or otherwise notify users of changes.
  6. Document feature parity and differences between the new and old system and make plans to change/revise and refine workflows and processes.
  7. Develop new documentation and/or training for users to transition from the old to the new system.
  8. Notify users of the date and time the system will be down and not accepting new records or objects. If the process will take some time, give users a plan that sets expectations for what level of service will be provided at what point, and take the necessary steps to protect the data you are moving forward during that downtime.
  9. Have a place/plan for where to put items that need ingestion while doing the migration.  You may not be able to tell people to just stop and wait.
  10. Decide on what to do with your old storage media/systems. You might want to keep them for a period just in case, reuse them for some other purpose or destroy them. In any event it should be a deliberate, documented decision.
  11. Create documentation recording what you did and how you approached the migration (any issues or failures that arose) to provide provenance information about the migration of the materials.
  12. Test migration workflow to make sure it works – both single records and bulk batches of varying sizes to see if there are any issues.
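To illustrate the fixity point in item 2 above, here is a minimal Python sketch of a checksum manifest that could be generated before the move and compared afterwards. The directory-walking approach and the default algorithm are examples only; the important part is to use the same algorithm on both sides of the comparison.

    import hashlib
    from pathlib import Path

    def make_manifest(root, algorithm='sha256'):
        # Walk a directory tree and record a checksum for every file.
        manifest = {}
        for path in sorted(Path(root).rglob('*')):
            if path.is_file():
                digest = hashlib.new(algorithm)
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        digest.update(chunk)
                manifest[str(path.relative_to(root))] = digest.hexdigest()
        return manifest

    def compare_manifests(before, after):
        # Report anything missing or changed after the migration.
        missing = set(before) - set(after)
        changed = {p for p in set(before) & set(after) if before[p] != after[p]}
        return missing, changed

Running make_manifest against the old storage before the migration and against the new storage after it, then feeding both results to compare_manifests, gives a list of anything missing or altered that needs to be repaired from copies in other systems (as described in the post-migration steps below).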

After you Migrate

  1. Check your fixity information to ensure that your new system has all your objects intact.
  2. If any objects did not come across correctly, as identified by comparing fixity values, then repair or replace the objects via copies in other systems. Ideally, log this kind of information as events for your records.
  3. Check to make sure all your metadata has come across, and spot check to make sure it hasn’t been mangled.
  4. Notify your users of the change and again provide them with new or revised user documentation.
  5. Record what is done with the old storage media/systems after migration.
  6. Assemble all documentation generated and keep with other system information for future migrations.
  7. Establish timeline and process for reevaluating when future migrations should be planned for (if relevant).

Relevant resources and tools:

This post was updated 9/3/2015 to fix formatting and add email information.

Last chance to support libraries at SXSW / District Dispatch

South by Southwest welcome banner.

From Flickr

A couple of weeks ago, the ALA Washington Office urged support for library programs at South by Southwest (SXSW). The library community’s footprint at this annual set of conferences and activities has expanded in recent years, and we must keep this trend going! Now is your last chance to do your part, as public voting on panel proposals will end at 11:59 pm (CDT) this Friday, September 4th [Update: Now Monday, September 7th]. SXSW received more than 4,000 submissions this year—an all-time record—so we need your help more than ever to make library community submissions stand out. You can read about, comment on, and vote for the full slate of proposed panels involving the Washington Office here.

Also, the SXSW library “team” that connects through the lib*interactive Facebook group and #liblove has compiled a list of library programs that have been proposed for all four SXSW gatherings. Please show your support for all of them. Thanks!

The post Last chance to support libraries at SXSW appeared first on District Dispatch.

Get Involved in the National Digital Platform for Libraries / LITA

Editor’s note: This is a guest post by Emily Reynolds and Trevor Owens.

Recently IMLS has increased its focus on funding digital library projects through the lens of our National Digital Platform strategic priority area. The National Digital Platform is the combination of software applications, social and technical infrastructure, and staff expertise that provides library content and services to all users in the U.S. In other words, it’s the work many LITA members are already doing!

Participants at IMLS Focus: The National Digital Platform

As libraries increasingly use digital infrastructure to provide access to digital content and resources, there are more and more opportunities for collaboration around the tools and services that they use to meet their users’ needs. It is possible for each library in the country to leverage and benefit from the work of other libraries in shared digital services, systems, and infrastructure. We’re looking at ways to maximize the impact of our funds by encouraging collaboration, interoperability, and staff training. We are excited to have this chance to engage with and invite participation from the librarians involved in LITA in helping to develop and sustain this national digital platform for libraries.

National Digital Platform convening report

Earlier this year, IMLS held a meeting at the DC Public Library to convene stakeholders from across the country to identify opportunities and gaps in existing digital library infrastructure nationwide. Recordings of those sessions are now available online, as is a summary report published by OCLC Research. Key themes include:


Engaging, Mobilizing and Connecting Communities

  • Engaging users in national digital platform projects through crowdsourcing and other approaches
  • Establishing radical and systematic collaborations across sectors of the library, archives, and museum communities, as well as with other allied institutions
  • Championing diversity and inclusion by ensuring that the national digital platform serves and represents a wide range of communities

Establishing and Refining Tools and Infrastructure

  • Leveraging linked open data to connect content across institutions and amplify impact
  • Focusing on documentation and system interoperability across digital library software projects
  • Researching and developing tools and services that leverage computational methods to increase accessibility and scale practice across individual projects

Cultivating the Digital Library Workforce

  • Shifting to continuous professional learning as part of library professional practice
  • Focusing on hands-on training to develop computational literacy in formal library education programs
  • Educating librarians and archivists to meet the emerging digital needs of libraries and archives, including cross-training in technical and other skills

We’re looking to support these areas of work with the IMLS grant programs available to library applicants.

IMLS Funding Opportunities

IMLS has three major competitive grant programs for libraries, and we encourage the submission of proposals related to the National Digital Platform priority to all three. Those programs are:

  • National Leadership Grants for Libraries (NLG): The NLG program is specifically focused on supporting our two strategic priorities, the National Digital Platform and Learning in Libraries. The most competitive proposals will advance some area of library practice on a national scale, with new tools, research findings, alliances, or similar outcomes. The NLG program makes awards up to $2,000,000, with funds available for both project and planning grants.
  • Laura Bush 21st Century Librarian Program (LB21): The LB21 program supports professional development, graduate education and continuing education for librarians and archivists. The LB21 program makes awards up to $500,000, and like NLG supports planning as well as project grants.
  • Sparks! Ignition Grants for Libraries: Sparks! grants support the development, testing, and evaluation of promising new tools, products, services, and practices. They often balance broad potential impact with an element of risk or innovation. The Sparks! program makes awards up to $25,000.

These programs can fund a wide range of activities. NLG and LB21 grants support projects, research, planning, and national forums (where grantees can hold meetings to gather stakeholders around a particular topic). The LB21 program also has a specific category for supporting early career LIS faculty research.

Application Process and Deadlines

Over the past year, IMLS piloted an exciting new model for our grant application process, which this year will be in place for both the NLG and LB21 programs. Rather than requiring a full application from every applicant, only a two-page preliminary proposal is due at the deadline. After a first round of peer review, a small subset of applicants will be invited to submit full proposals, and will have the benefit of the peer reviewers’ comments to assist in constructing the proposal. The full proposals will be reviewed by a second panel of peer reviewers before funding decisions are made. The Sparks! program goes through a single round of peer review, and requires the submission of a full proposal from all applicants.

The LB21 and NLG programs will both have a preliminary proposal application deadline on October 1, 2015, as well as an additional application deadline in February, 2016.

Are you considering applying for an IMLS grant for your digital library project? Do you want to discuss which program might be the best fit for your proposal? We’re always happy to chat, and love hearing your project ideas, so please email us at ereynolds@imls.gov (Emily) and tjowens@imls.gov (Trevor).

How Bloomberg Executes Search Analytics with Apache Solr / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Steven Bower’s session on how Bloomberg uses Solr for search analytics.

Search at Bloomberg is not just about text, it’s about numbers, lots of numbers. In order for our clients to research, measure and drive decisions from those numbers, we must provide flexible, accurate and timely analytics tools. We decided to build these tools using Solr, as Solr provides the indexing performance, filtering and faceting capabilities needed to achieve the flexibility and timeliness required by the tools. To perform the analytics required, we developed an Analytics component for Solr. This talk will cover the Analytics Component that we built at Bloomberg, some use cases that drove it, and then dive into the features/functionality it provides.

Steven Bower has worked for 15 years in the web/enterprise search industry, first as part of the R&D and Services teams at FAST Search and Transfer, Inc. and then as a principal engineer at Attivio, Inc. He has participated in or led the delivery of hundreds of search applications and now leads the search infrastructure team at Bloomberg LP, providing a search-as-a-service platform for 80+ applications.
Search Analytics Component: Presented by Steven Bower, Bloomberg L.P. from Lucidworks
Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Bloomberg Executes Search Analytics with Apache Solr appeared first on Lucidworks.

Access testimonial / William Denton

I submitted a testimonial about the annual Access conference about libraries and technology:

The first time I went to Access was 2006 in Ottawa. I was one year out of library school. I was unemployed. I paid my own way. I didn’t want to miss it. Everyone I admired in the library technology world was going to be there. They were excited about it, and said how much they loved the conference every year. When I got there, the first morning, I thought, “These are my people.” I left admiring a lot of new acquaintances. Every year I go, I feel the same way and the same thing happens.

All true, every word. (That conference was where Dan Chudnov and I chatted over a glass of wine, which made it all the better.)

Here’s more reason why I like Access: the 2015 conference is in Toronto next week, and I’m running a hackfest about turning data into music. This is what my proposal looked like:

Music, Code, Data

I emailed them a JPEG. They accepted the proposal. That’s my kind of conference.

I also have to mention the talk Adam Taves and I did in Winnipeg at Access 2010: After Launching Search and Discovery, Who Is Mission Control?. “A Tragicomedy in 8 or 9 Acts.” It’s a rare conference where you can mix systems librarianship with performance art.

But of all the write-ups I’ve done of anything Access-related, I think 2009’s DIG: Hackfest, done about Cory Doctorow when I was channelling James Ellroy, is the best:

Signs: “Hackfest.” We follow. People point. We get to the room. There are people there. They have laptops. They sit. They mill around. They stand. They talk. “Haven’t seen you since last year. How’ve you been?”

Vibe: GEEK.

Cory Doctorow giving a talk. No talks at Hackfest before. People uncertain. What’s going on? Cory sitting in chair. Cory working on laptop. Cory gets up. Paper with notes scribbled on it. He talks about copyright. He talks about freedom. He talks about how copyright law could affect US.

He vibes geek. He vibes cool.

MarcEdit: Build New Field Tool / Terry Reese

I’m not sure how I’ve missed creating something like this for so long, but it took a question from a cataloger to help me crystallize the need for a new feature.  Here was the question:

Add an 856 url to all records that uses the OCLC number from the 035 field, the title from 245 $a, and the ISSN, if present. This will populate an ILLiad form with the publication title, ISSN, call number (same for all records), and OCLC number. Although I haven’t worked it in yet, the link in our catalog will also include instructions to “click for document delivery form” or something like that.

In essence, the user was looking to generate a link within some records – however, the link would need to be made up of data pulled from different parts of the MARC record.  It’s a question that comes up all the time, and in many cases, the answer I generally give points users to the Swap Field Function – a tool designed around moving data between fields.  For fields that are to be assembled from data in multiple fields, multiple swap field operations would need to be run.  The difference here was how the data from the various MARC fields listed above needed to be presented.  The swap field tool moves data from one subfield to another, whereas this user was looking to pull data from various fields and reassemble that data using a specific data pattern.  And in thinking about how I would answer this question, it kind of clicked: we need a new tool.

Build New Field:

The build new field tool is the newest global editing tool being added to the MarcEditor tool kit.  The tool will be available in the Tools menu:

[Screenshot: the Build New Field option in the MarcEditor Tools menu]

And will be supported in the Task Automation list.  The tool is designed around this notion of data patterns – the idea that, rather than moving data between fields, some field data needs to be created via a specific set of data patterns.  The example provided by the user asking this question was:

  • http://illiad.mylibrary.org/illiad.dll?Action=10&Form=22&PhotoJournalTitle=[Title from 245$a]&&ISSN=[ISSN from 022$a]&CallNumber=[CallNumber from 099$a]&ESPNumber=[oclc number from the 035]

While the swap field could move all this data around, the tool isn’t designed to do this level of data integration when generating a new field.  In fact, none of MarcEdit’s present global editing tasks are configured for this work.  To address this gap, I’ve introduced the Build New Field tool:

[Screenshot: the Build New Field dialog]

The Build New Field tool utilizes data patterns to construct a new MARC field.  Using the example above, a user could create a new 856 by utilizing the following pattern:

=856  41$uhttp://illiad.mylibrary.org/illiad.dll?Action=10&Form=22&PhotoJournalTitle={245$a}&ISSN={022$a}&CallNumber={099$a}&ESPNumber={035$a(OcLC)}

Do you see the pattern?  This tool allows users to construct their field, replacing the variable data to be extracted from their MARC records using the mnemonic structure: {field$subfield}.  Additionally, in the ESPNumber tag, you can see that in addition to the field and subfield, qualifying information was also included.  The tool allows users to provide this information, which is particularly useful when utilizing fields like the 035 to extract control numbers.
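To illustrate what this kind of pattern expansion involves, here is a rough Python sketch of the idea (not MarcEdit’s actual implementation). Each {field$subfield} placeholder is treated as a lookup into the record, the optional qualifier decides which of several repeated fields to use, and an escape flag mirrors the URL-encoding option described below; the record structure is hypothetical.

    import re
    from urllib.parse import quote

    def build_field(record, pattern, escape=False):
        # record: hypothetical structure mapping a tag to a list of fields,
        # each field a dict of subfield code -> value, e.g.
        # {'245': [{'a': 'Some title /'}], '035': [{'a': '(OcLC)ocm12345678'}]}
        def expand(match):
            tag, code, qualifier = match.group(1), match.group(2), match.group(3)
            for field in record.get(tag, []):
                value = field.get(code, '')
                if qualifier and qualifier not in value:
                    continue          # e.g. only use an 035 containing "(OcLC)"
                return quote(value) if escape else value
            return ''
        # placeholders look like {245$a} or {035$a(OcLC)}
        return re.sub(r'\{(\d{3})\$(\w)(?:\(([^)]*)\))?\}', expand, pattern)

In this sketch, a record whose 035 $a reads “(OcLC)ocm12345678” would have that value substituted for {035$a(OcLC)}, while any other 035s (say, a local system number) would be skipped.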

Finally, the new tool provides two additional options.  For items like proxy development, data extracted from the MARC record will need to be URL encoded.  By checking the “Escape data from URL” option, all MARC data extracted and utilized within the data pattern will be URL encoded.  Leaving this item unchecked will allow the tool to capture the data as presented within the record. 

The second option, “Replace Existing Field or Add New one if not Present”, tells the tool what to do if the field exists.  If left unchecked, the tool will create a new field whether or not the field defined in the pattern already exists.  If you check this option, the tool will replace any existing field data, or create a new field if one doesn’t exist, for the field defined by your pattern.

Does that make sense?  This function will be part of the next MarcEdit release, so if you have questions or comments, let me know.

–tr

Bookmarks for September 2, 2015 / Nicole Engard

Today I found the following resources and bookmarked them on Delicious.

  • Thimble by Mozilla
    Thimble is an online code editor that makes it easy to create and publish your own web pages while learning HTML, CSS & JavaScript.
  • Google Coder
    a simple way to make web stuff on Raspberry Pi

Digest powered by RSS Digest

The post Bookmarks for September 2, 2015 appeared first on What I Learned Today....

ALA urges FCC to include internet in the Lifeline program / District Dispatch

FCC Building in Washington, D.C.

FCC Building in Washington, D.C.

This week the American Library Association (ALA) submitted comments with the Federal Communications Commission in its Lifeline modernization proceeding. As it has done with its other universal service programs, including most recently with the E-rate program, the Commission sought input from a wide variety of stakeholders on how best to transition a 20th century program to one that meets the 21st century needs of, in this case, low-income consumers.

Lifeline was established in 1985 to help make phone service more affordable for low-income consumers, but it has received little attention with regard to today’s most pressing communication need: access to broadband. ALA’s comments wholeheartedly agree with the Commission that broadband is no longer a “nice-to-have,” but a necessity to fully participate in civic society. We are clearly on record with the Commission describing the myriad library services (which may be characterized by The E’s of Libraries®) that are not only dependent themselves on access to broadband, but that provide patrons with access to the wealth of digital resources so that libraries may indeed transform communities. We well understand the urgency of making sure everyone, regardless of geographic location or economic circumstances, has access to broadband and the internet as well as the ability to use it.

In addition to making broadband an eligible service in the Lifeline program, the Commission asks questions related to addressing the “homework gap,” which refers to families with school-age children who do not have home internet access, leaving these kids with extra challenges to school success. Other areas the Commission is investigating include whether the program should adopt minimum standards of service (for telephone and internet); whether it should be capped at a specific funding level; and how to encourage more service providers to participate in the program.

Our Lifeline comments reiterate the important role libraries have in connecting (and transforming) communities across the country and call on the Commission to:

  • Address the homework gap as well as similar hurdles for vulnerable populations, including people with disabilities;
  • Consider service standards that are reasonably comparable to the consumer marketplace, are regularly evaluated and updated, and to the extent possible fashioned to anticipate trends in technology;
  • Allow libraries that provide WiFi devices to Lifeline-eligible patrons to be eligible for financial support for the connectivity of those devices; and
  • Address the affordability barrier to broadband access through the Lifeline program, but continue to identify ways it can also promote broadband adoption.

We also reiterate the principles (pdf) outlined by The Leadership Conference on Civil and Human Rights and supported by ALA that call for universality of service for eligible households, program excellence, choice and competition, innovation, and efficiency, transparency and accountability.

Now that the comments are filed, we will mine the public comment system to read through other stakeholder comments and consult with other national groups in preparing reply comments (we get an opportunity to respond to other commenters as well as add details or more information on our own proposals). Reply comments are due to the Commission September 30. So as always with the Commission, there is more to come, which includes in-person meetings if warranted. Also, as always, many thanks to the librarians in the field and those who are also members of ALA committees who provided input and advice.

The post ALA urges FCC to include internet in the Lifeline program appeared first on District Dispatch.

Jobs in Information Technology: September 2, 2015 / LITA

New vacancy listings are posted weekly on Wednesday at approximately 12 noon Central Time. They appear under New This Week and under the appropriate regional listing. Postings remain on the LITA Job Site for a minimum of four weeks.

New This Week:

Head Librarian, Science Library and Director of Scholarly Communications, Princeton University, Princeton, NJ

Sales Representative, Backstage Library Works, Midwest Region

Analyst Programmer III, Oregon State University Libraries & Press, Corvallis, OR

Web Product Manager and Usability Specialist, Massachusetts Institute of Technology, Cambridge, MA

Interlibrary Loan/Reference Librarian, Pittsburgh Theological Seminary, Pittsburgh, PA

Health Sciences Librarian, Asst or Assoc Professor, SIU Edwardsville, Edwardsville, IL

University Archivist, University of North Carolina at Charlotte, Charlotte, NC

 

Visit the LITA Job Site for more available jobs and for information on submitting a job posting.

Libraries’ tech pipeline problem / Coral Sheldon-Hess

“We’ve got a pipeline problem, so let’s build a better pipeline.” –Bess Sadler, Code4Lib 2014 Conference (the link goes to the video)

I’ve been thinking hard (for two years, judging by the draft date on this post) about how to grow as a programmer, when one is also a librarian. I’m talking not so much about teaching/learning the basics of coding, which is something a lot of people are working really hard on, but more about getting from “OK, I finished yet another Python/Rails/JavaScript/whatever workshop” or “OK, I’ve been through all of Code Academy/edX/whatever”—or from where I am, “OK, I can Do Interesting Things™ with code, but there are huge gaps in my tech knowledge and vocabulary”—to the point where one could get a full-time librarian-coder position.

I should add, right here: I’m no longer trying to get a librarian-coder position*. This post isn’t about me, although it is, of course, from my perspective and informed by my experiences. This post is about a field I love, which is currently shooting itself in the foot, which frustrates me.

Bess is right: libraries need 1) more developers and 2) more diversity among them. Libraries are hamstrung by expensive, insufficient vendor “solutions.” (I’m not hating on the vendors, here; libraries’ problems are complex, and fragmentation and a number of other issues make it difficult for vendors to provide really good solutions.) Libraries and librarians could be so much more effective if we had good software, with interoperable APIs, designed specifically to fill modern libraries’ needs.

Please, don’t get me wrong: I know some libraries are working on this. But they’re too few, and their developers’ demographics do not represent the demographics of libraries at large, let alone our patron bases. I argue that the dearth and the demographic skew will continue and probably worsen, unless we make a radical change to our hiring practices and training options for technical talent.

Building technical skills among librarians

The biggest issue I see is that we offer a fair number of very basic learn-to-code workshops, but we don’t offer a realistic path from there to writing code as a job. To put a finer point on it, we do not offer “junior developer” positions in libraries; we write job ads asking for unicorns, with expert- or near-expert-level skills in at least two areas (I’ve seen ones that wanted strong skills in development, user experience, and devops, for instance).

This is unfortunate, because developing real fluency with any skill, including coding, requires practicing it regularly. In the case of software development, there are things you can really only learn on the job, working with other developers (ask me about Git, sometime); only, nobody seems willing to hire for that. And, yes, I understand that there are lots of single-person teams in libraries—far more than there should be—but many open source software projects can fill in a lot of that group learning and mentoring experience, if a lone developer is allowed to participate in them on work time. (OSS is how I am planning to fill in those skills, myself.)

From what I can tell, if you’re a librarian who wants to learn to code, you generally have two really bad options: 1) learn in your spare time, somehow; or 2) quit libraries and work somewhere else until your skills are built up. I’ve been down both of those roads, and as a result I no longer have “be a [paid] librarian-developer” on my goals list.

Option one: Learn in your spare time

This option is clown shoes. It isn’t sustainable for anybody, really, but it’s especially not sustainable for people in caretaker roles (e.g. single parents), people with certain disabilities (who have less energy and free time to start with), people who need to work more than one job, etc.—that is, people from marginalized groups. Frankly, it’s oppressive, and it’s absolutely a contributing factor to libtech’s largely male, white, middle to upper-middle class, able-bodied demographics—in contrast to the demographics of the field at large (which is also most of those things, but certainly not predominantly male).

“I’ve never bought this ‘do it in your spare time’ stuff. And it turns out that doing it in your spare time is terribly discriminatory, because … a prominent aspect of oppression is that you have more to do in less spare time.” – Valerie Aurora, during her keynote interview for Code4Lib 2014 (the link goes to the video)

“It’s become the norm in many technology shops to expect that people will take care of skills upgrading on their own time. But that’s just not a sustainable model. Even people who adore late night, just-for-fun hacking sessions during the legendary ‘larval phase’ of discovering software development can come to feel differently in a later part of their lives.” – Bess Sadler, same talk as above

I tried to make it work, in my last library job, by taking one day off every other week** to work on my development skills. I did make some headway—a lot, arguably—but one day every two weeks is not enough to build real fluency, just as fiddling around alone did not help me build the skills that a project with a team would have. Not only do most people not have the privilege of dropping to 90% of their work time, but even if you do, that’s not an effective route to learning enough!

And, here, you might think of the coding bootcamps (at more than $10k per) or the (free, but you have to live in NYC) Recurse Center (which sits on my bucket list, unvisited), but, again: most people can’t afford to take three months away from work, like that. And the Recurse Center isn’t so much a school (hence the name change away from “Hacker School”) as it is a place to get away from the pressures of daily life and just code; realistically, you have to be at a certain level to get in. My point, though, is that the people for whom these are realistic options tend to be among the least marginalized in other ways. So, I argue that they are not solutions and not something we should expect people to do.

Option two: go work in tech

If you can’t get the training you need within libraries or in your spare time, it kind of makes sense to go find a job with some tech company, work there for a few years, build up your skills, and then come back. I thought so, anyway. It turns out, this plan was clown shoes, too.

Every woman I’ve talked to who has taken this approach has had a terrible experience. (I also know of a few women who’ve tried this approach and haven’t reported back, at least to me. So my data is incomplete, here. Still, tech’s horror stories are numerous, so go with me here.) I have a theory that library vendors are a safer bet and may be open to hiring newer developers than libraries currently are, but I don’t have enough data (or anecdata) to back it up, so I’m going to talk about tech-tech.

Frankly, if we expect members of any marginalized group to go work in tech in order to build up the skills necessary for a librarian-developer job, we are throwing them to the wolves. In tech, even able-bodied straight cisgender middle class white women are a badly marginalized group, and heaven help you if you’re on any other axis of oppression.

And, sure, yeah. Not all tech. I’ll agree that there are non-terrible jobs for people from marginalized groups in tech, but you have to be skilled enough to get to be that choosy, which people in the scenario we’re discussing are not. I think my story is a pretty good illustration of how even a promising-looking tech job can still turn out horrible. (TLDR: I found a company that could talk about basic inclusivity and diversity in a knowledgeable way and seemed to want to build a healthy culture. It did not have a healthy culture.)

We just can’t outsource that skill-building period to non-library tech. It isn’t right. We stand to lose good people that way.

We need to develop our own techies—I’m talking code, here, because it’s what I know, but most of my argument expands to all of libtech and possibly even to library leadership—or continue offering our patrons sub-par software built within vendor silos and patched together by a small, privileged subset of our field. I don’t have to tell you what that looks like; we live with it, already.

What to do?

I’m going to focus on what you, as an individual organization, or leader within an organization, can do to help; I acknowledge that there are some systemic issues at play, beyond what my relatively small suggestions can reach, and I hope this post gets people talking and thinking about them (and not just to wave their hands and sigh and complain that “there isn’t enough money,” because doomsaying is boring and not helpful).

First of all, when you’re looking at adding to the tech talent in your organization, look within your organization. Is there a cataloger who knows some scripting and might want to learn more? (Ask around! Find out!) What about your web content manager, UX person, etc.? (Offer!) You’ll probably be tempted to look at men, first, because society has programmed us all in evil ways (seriously), so acknowledge that impulse and look harder. The same goes for race and disability and having the MLIS, which is too often a stand-in for socioeconomic class; actively resist those biases (and we all have those biases).

If you need tech talent and can’t grow it from within your organization, sit down and figure out what you really need, on day one, versus what might be nice to have, but could realistically wait. Don’t put a single nice-to-have on your requirements list, and don’t you dare lose sight of what is and isn’t necessary when evaluating candidates.

Recruit in diverse and non-traditional spaces for tech folks — dashing off an email to Code4Lib is not good enough (although, sure, do that too; they’re nice folks). LibTechWomen is an obvious choice, as are the Spectrum Scholars, but you might also look at the cataloging listservs or the UX listservs, just to name two options. Maybe see who tweets about #libtechgender and #critlib (and possibly #lismicroaggressions?), and invite those folks to apply and to share your linted job opening with their networks.

Don’t use whiteboard interviews! They are useless and unnecessarily intimidating! They screen for “confidence,” not technical ability. Pair-programming exercises, with actual taking turns and pairing, are a good alternative. Talking through scenarios is also a good alternative.

Don’t give candidates technology vocabulary tests. Not only is it nearly useless as an evaluation tool (and a little insulting); it actively discriminates against people without formal CS education (or, cough, people with CS minors from more than a decade ago). You want to know that they can approach a problem in an organized manner, not that they can define a term that’s easily Googled.

Do some reading about impostor syndrome, stereotype threat, and responsible tech hiring. Model View Culture’s a good place to start; here is their hiring issue.

(I have a whole slew of comments about hiring, and I’ll make those—and probably repeat the list above—in another post.)

Once you have someone in a position, or (better) you’re growing someone into a position, be sure to set reasonable expectations and deadlines. There will be some training time for any tech person; you want this, because something built with enough forethought and research will be better than something hurriedly duct-taped (figuratively, you hope) together.

Give people access to mentorship, in whatever form you can. If you can’t give them access to a team within your organization, give them dedicated time to contribute to relevant OSS projects. Send them to—just to name two really inclusive and helpful conferences/communities—Code4Lib (which has regional meetings, too) and/or Open Source Bridge.

 

So… that’s what I’ve got. What have I missed? What else should we be doing to help fix this gap?

 

* In truth, as excited as I am about starting my own business, I wouldn’t turn down an interview for a librarian-coder position local to Pittsburgh, but 1) it doesn’t feel like the wind is blowing that way, here, and 2) I’m in the midst of a whole slew of posts that may make me unemployable, anyway ;) (back to the text)

** To be fair, I did get to do some development on the clock, there. Unfortunately, because I wore so many hats, and other hats grew more quickly, it was not a large part of my work. Still, I got most of my PHP experience there, and I’m glad I had the opportunity. (back to the text)

 

How Twitter Uses Apache Lucene for Real-Time Search / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Michael Busch’s session on how Twitter executes real-time search with Apache Lucene. Twitter’s search engine serves billions of queries per day from different Lucene indexes, while appending hundreds of millions of tweets per day in real time. This session will give an overview of Twitter’s search architecture and recent changes and improvements that have been made. It will focus on the usage of Lucene and the modifications that have been made to it to support Twitter’s unique performance requirements. Michael Busch is an architect in Twitter’s Search & Content organization. He designed and implemented Twitter’s current search index, which is based on Apache Lucene and optimized for realtime search. Prior to Twitter, Michael worked at IBM on search and eDiscovery applications. Michael has been a Lucene committer and Apache member for many years.
Search at Twitter: Presented by Michael Busch, Twitter from Lucidworks
Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Twitter Uses Apache Lucene for Real-Time Search appeared first on Lucidworks.

025: WordPress for Libraries with Chad Haefele / LibUX

Chad Haefele sat with Amanda and me — Michael — to talk about his new book about WordPress for Libraries. If you’ve been paying attention then you know WordPress is our jam, so we were chomping at the bit.

WordPress for Libraries by Chad Haefele

You only have until tomorrow at the time of this posting, but if you jump on it you can enter to win a free copy.

Also, Chad has a lot to say about usability testing, especially using Optimal Workshop tools, as well as about organizations allocating a user experience design budget — and the inglorious end of Google Wave.

 

The post 025: WordPress for Libraries with Chad Haefele appeared first on LibUX.

New Report: “Open Budget Data: Mapping the Landscape” / Open Knowledge Foundation

We’re pleased to announce a new report, “Open Budget Data: Mapping the Landscape” undertaken as a collaboration between Open Knowledge, the Global Initiative for Financial Transparency and the Digital Methods Initiative at the University of Amsterdam.

The report offers an unprecedented empirical mapping and analysis of the emerging issue of open budget data, which has appeared as ideals from the open data movement have begun to gain traction amongst advocates and practitioners of financial transparency.

In the report we chart the definitions, best practices, actors, issues and initiatives associated with the emerging issue of open budget data in different forms of digital media.

In doing so, our objective is to enable practitioners – in particular civil society organisations, intergovernmental organisations, governments, multilaterals and funders – to navigate this developing field and to identify trends, gaps and opportunities for supporting it.

How public money is collected and distributed is one of the most pressing political questions of our time, influencing the health, well-being and prospects of billions of people. Decisions about fiscal policy affect everyone, determining everything from the resourcing of essential public services, to the capacity of public institutions to take action on global challenges such as poverty, inequality or climate change.

Digital technologies have the potential to transform the way that information about public money is organised, circulated and utilised in society, which in turn could shape the character of public debate, democratic engagement, governmental accountability and public participation in decision-making about public funds. Data could play a vital role in tackling the democratic deficit in fiscal policy and in supporting better outcomes for citizens.

The report includes the following recommendations:

  1. CSOs, IGOs, multilaterals and governments should undertake further work to identify, engage with and map the interests of a broader range of civil society actors whose work might benefit from open fiscal data, in order to inform data release priorities and data standards work. Stronger feedback loops should be established between the contexts of data production and its various contexts of usage in civil society – particularly in journalism and in advocacy.

  2. Governments, IGOs and funders should support pilot projects undertaken by CSOs and/or media organisations in order to further explore the role of data in the democratisation of fiscal policy – especially in relation to areas which appear to have been comparatively under-explored in this field, such as tax distribution and tax base erosion, or tracking money through from revenues to results.

  3. Governments should work to make data “citizen readable” as well as “machine readable”, and should take steps to ensure that information about flows of public money and the institutional processes around them are accessible to non-specialist audiences – including through documentation, media, events and guidance materials. This is a critical step towards the greater democratisation and accountability of fiscal policy.

  4. Further research should be undertaken to explore the potential implications and impacts of opening up information about public finance which is currently not routinely disclosed, such as more detailed data about tax revenues – as well as measures needed to protect the personal privacy of individuals.

  5. CSOs, IGOs, multilaterals and governments should work together to promote and adopt consistent definitions of open budget data, open spending data and open fiscal data in order to establish the legal and technical openness of public information about public money as a global norm in financial transparency.

Viewshare Supports Critical Thinking in the Classroom / Library of Congress: The Signal

This year I had the pleasure of meeting Dr. Peggy Spitzer Christoff, lecturer in Asian and Asian American Studies at Stony Brook University. She shared with me how she’s using the Library of Congress’ Viewshare tool to engage her students in an introduction to Asia Studies course. Peg talked about using digital platforms as a way to improve writing, visual and information literacy skills in her students. In this interview, she talks about why and how Viewshare is useful in connecting the students’ time “surfing the web” to creating presentations that require reflection and analysis.

Abbey: How did you first hear about Viewshare and what inspired you to use it in your classes?

Peg Christoff, Lecturer at Stony Brook University

Peg: I heard about it through the monthly Library of Congress Women’s History Discussion Group, about three years ago. At the time, Trevor Owens [former Library of Congress staff member] was doing presentations throughout the Library and he presented Viewshare to that group. It sounded like a neat way to organize information. Around the same time, I was developing the Department of Asian and Asian American Studies’ introductory (gateway) course for first and second year students at Stony Brook University. Faculty in our department were concerned that students couldn’t find Asian countries on a map and had very little understanding of basic information about Asia. I thought that developing a student project using Viewshare would enable each student to identify, describe and visually represent aspects of Asia of their choosing — as a launching pad for further exploration. Plus, I liked the idea of students writing paragraphs to describe each of the items they selected because it could help them become better writers. Finally, I wanted students to learn how to use an Excel spreadsheet in the context of a digital platform.

Abbey: So it sounds like the digital platforms project is allowing your students to explore a specific topic they may not be familiar with (i.e., Asian Studies) with a resource they are probably more familiar with (i.e., the web) while at the same time exposing them to basic data curation principles. Would you agree?

Peg: Yes. Combining these into one project has been so popular because we’ve broadened student interest in how collections are developed and organized.

Abbey: Why do you think Viewshare works well in the classroom?

Peg: Because students have the freedom to develop their own collections of Asian artifacts and, at the end of the semester, share their collections with each other. Students approach the assignment differently and it’s surprising to them (and me) to see how their interests in “Asia” change throughout the semester, as they develop their collections.

Abbey: Please walk us through how you approach teaching your students to use Viewshare in their assignments.

Peg: I introduce the Viewshare platform to engage students in critical thinking. The project requires students to select, classify, and describe the significance of Asian artifacts relating to subjects of common concern — education, health, religion and values, consumer issues, family and home, mobility, children, careers and work, entertainment and leisure, etc. Also, I want students to think about cultured spaces in India, Southeast Asia, China, Korea, Japan and Asian communities in the United States. I encourage students to consider the emotional appeal of the items, which could include anything from a photograph of the Demilitarized Zone (DMZ) in Korea, to ornamental jade pieces from China, to ancient religious texts from India, to anime from Japan. Food has a particularly emotional appeal, especially for college students.

Undergrad TAs have developed PowerPoint slides as “tutorials” on how to use Viewshare, which I post on Blackboard. We explore the website in class and everyone signs up for an account at the very beginning of the semester. The TA helps with troubleshooting. Four times throughout the semester, the students add several artifacts; I grade their written descriptions and the TA reviews their Excel spreadsheets to correct format problems. Then, around the last few weeks of the semester, the students upload their Excel spreadsheets into the Viewshare platform and generate maps, timelines, pie charts, etc. Here’s an example of a typical final project.

Example Final Project

Abbey: How have your students reacted to using Viewshare?

Peg: Sometimes they are frustrated when they can’t get the platform to load correctly. Almost always they enjoy seeing the final result and would like to work more on it — if we only had more time during the semester.

Abbey: Do you see any possibilities for making more use of Viewshare?

Peg: I’d like to keep track of the Asian artifacts the students select and how they describe them over long periods of time — to interpret changes in student interests. (We have a large Asian population on campus and over 50% of my students are either Asian or Asian American.)

Also, my department would like to use the Viewshare platform to illustrate a collection of Asian connections to Long Island.

Abbey: Anything else to add?

Peg: I think Viewshare is really ideal for student projects. And I have used Viewshare in academic writing to organize data and illustrate patterns. I just cited a Viewshare view in a footnote.

The Case for Open Tools in Pedagogy / LITA

Academic libraries support certain software by virtue of what they have available on their public computers, what their librarians are trained to use, and what instruction sessions they offer. Sometimes libraries don’t have a choice in the software they are tasked with supporting, but often they do. If the goal of the software support is to simply help students achieve success in the short term, then any software that the library already has a license for is fair game. If the goal is to teach them a tool they can rely on anywhere, then libraries must consider the impact of choosing open tools over commercial ones.

Suppose we have a student, we’ll call them “Student A”, who wants to learn about citation management. They see a workshop on EndNote, a popular piece of citation management software, and they decide to attend. Student A becomes enamored with EndNote and continues to grow their skills with it throughout their undergraduate career. Upon graduating, Student A gets hired and is expected to keep up with the latest research in their field, but suddenly they no longer have access to EndNote through their university’s subscription. They can either pay for an individual license, or choose a new piece of citation management software (losing all of their hard earned EndNote-specific skills in the process).

Now let’s imagine Student B who also wants to learn about citation management software but ends up going to a workshop promoting Zotero, an open source alternative to EndNote. Similar to Student A, Student B continues to use Zotero throughout their undergraduate career, slowly mastering it. Since Zotero requires no license to use, Student B continues to use Zotero after graduating, allowing the skills that served them as a student to continue to do so as a professional.

Which one of these scenarios do you think is more helpful to the student in the long run? By teaching our students to use tools that they will lose access to once outside of the university system, we are essentially handing them a ticking time bomb that will explode as they transition from student to professional, which happens to be one of the most vulnerable and stressful periods in one’s life. Any academic library that cares about the continuing success of their students once they graduate should definitely take a look at their list of current supported software and ask themselves, “Am I teaching a tool or a time bomb?”

Telling VIVO Stories at Duke University with Julia Trimmer / DuraSpace News

“Telling VIVO Stories” is a community-led initiative aimed at introducing project leaders and their ideas to one another while providing details about VIVO implementations for the community and beyond. The following interview includes personal observations that may not represent the opinions and views of Duke University or the VIVO Project. Carol Minton Morris from DuraSpace interviewed Julia Trimmer from Duke University to learn about Scholars@Duke.

Better Search with Fusion Signals / SearchHub

Signals in Lucidworks Fusion leverage information about external activity, e.g., information collected from logfiles and transaction databases, to improve the quality of search results. This post follows on my previous post, Basics of Storing Signals in Solr with Fusion for Data Engineers, which showed how to index and aggregate signal data. In this post, I show how to write and debug query pipelines using this aggregated signal information.

User clicks provide a link between what people ask for and what they choose to view, given a set of search results, usually with product images. In the aggregate, if users have winnowed the set of search results for a given kind of thing, down to a set of products that are exactly that kind of thing, e.g., if the logfile entries link queries for “Netgear”, or “router”, or “netgear router” to clicks for products that really are routers, then this information can be used to improve new searches over the product catalog.

The Story So Far

To show how signals can be used to improve search in an e-commerce application, I created a set of Fusion collections:

  • A collection called “bb_catalog”, which contains Best Buy product data, a dataset comprised of over 1.2M items, mainly consumer electronics such as household appliances, TVs, computers, and entertainment media such as games, music, and movies. This is the primary collection.
  • An auxiliary collection called “bb_catalog_signals”, created from a synthetic dataset over Best Buy query logs from 2011. This is the raw signals data, meaning that each logfile entry is stored as an individual document.
  • An auxiliary collection called “bb_catalog_signals_aggr” derived from the data in “bb_catalog_signals” by aggregating all raw signal records based on the combination of search query, field “query_s”, item clicked on, field “doc_id_s”, and search categories, field “filters_ss”.

All documents in collection “bb_catalog” have a unique product ID stored in field “id”. All items belong to one or more categories, which are stored in the field “categories_ss”.

The following screenshot shows the Fusion UI search panel over collection “bb_catalog”, after using the Search UI Configuration tool to limit the document fields displayed. The gear icon next to the search box toggles this control open and closed. The “Documents” settings are set so that the primary field displayed is “name_t”, the secondary field is “id”, and additional fields are “name_t”, “id”, and “category_ss”. The document in the yellow rectangle is a Netgear router with product id “1208844”.

bb_catalog

For collection “bb_catalog_signals”, the search query string is stored in field “query_s”, the timestamp is stored in field “tz_timestamp_txt”, the id of the document clicked on is stored in field “doc_id_s”, and the set of category filters are stored in fields “filters_ss” as well as “filters_orig_ss”.

The following screenshot shows the results of a search for raw signals where the id of the product clicked on was “1208844”.

bb_catalog

The collection “bb_catalog_signals_aggr” contains aggregated signals. In addition to the fields “doc_id_s”, “query_s”, and “filters_ss”, aggregated click signals contain fields:

  • “count_i” – the number of raw signals found for this query, doc, filter combo.
  • “weight_d” – a real-number used as a multiplier to boost the score of these documents.
  • “tz_timestamp_txt” – all timestamps of raw signals, stored as a list of strings.
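
Conceptually, the aggregation rolls raw signal documents up by the (query, document, filters) combination. The following Python sketch illustrates that idea only; it is not Fusion’s aggregation job, and the simple count-based weight is a placeholder:

from collections import defaultdict

def aggregate_signals(raw_signals):
    # Group raw click signals by (query_s, doc_id_s, filters_ss) and count them.
    groups = defaultdict(list)
    for s in raw_signals:
        key = (s["query_s"], s["doc_id_s"], tuple(s.get("filters_ss", [])))
        groups[key].append(s["tz_timestamp_txt"])
    aggregated = []
    for (query, doc_id, filters), timestamps in groups.items():
        aggregated.append({
            "query_s": query,
            "doc_id_s": doc_id,
            "filters_ss": list(filters),
            "count_i": len(timestamps),
            "weight_d": 0.1 * len(timestamps),   # placeholder weighting
            "tz_timestamp_txt": timestamps,
        })
    return aggregated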

The following screenshot shows aggregated signals for searches for “netgear”. There were 3 raw signals where the search query “netgear” and some set of category choices resulted in a click on the item with id “1208844”:

bb_catalog

Using Click Signals in a Fusion Query Pipeline

Fusion’s Query Pipelines take as input a set of search terms and process them into a Solr query request. The Fusion UI Search panel has a control which allows you to choose the processing pipeline. In the following screenshot of the collection “bb_catalog”, the query pipeline control is just below the search input box. Here the pipeline chosen is “bb_catalog-default” (circled in yellow):

bb_catalog

The pre-configured default query pipelines consist of 3 stages:

  • A Search Fields query stage, used to define common Solr query parameters. The initial configuration specifies that the 10 best-scoring documents should be returned.
  • A Facet query stage which defines the facets to be returned as part of the Solr search results. No facet field names are specified in the initial defaults.
  • A Solr query stage which transforms a query request object into a Solr query and submits the request to Solr. The default configuration specifies the HTTP method as a POST request.

In order to get text-based search over the collection “bb_catalog” to work as expected, the Search Fields query stage must be configured to specify the set of fields which contain relevant text. For the majority of the 1.2M products in the product catalog, the item name, found in field “name_t”, is the only field amenable to free text search. The following screenshot shows how to add this field to the Search Fields stage by editing the query pipeline via the Fusion 2 UI:

add search field, search term: ipad

The search panel on the right displays the results of a search for “ipad”. There were 1,359 hits for this query, which far exceeds the number of items that are an Apple iPad. The best scoring items contain “iPad” in the title, sometimes twice, but these are all iPad accessories, not the device itself.

Recommendation Boosting query stage

A Recommendation Boosting stage uses aggregated signals to selectively boost items in the set of search results. The following screenshot shows the results of the same search after adding a Recommendations Boosting stage to the query pipeline:

recommendations boost, search term: ipad

The edit pipeline panel on the left shows the updated query pipeline “bb_catalog-default” after adding a “Recommendations Boosting” stage. All parameter settings for this stage have been left at their default values. In particular, the recommendation boosts are applied to field “id”. The search panel on the right shows the updated results for the search query “ipad”. Now the three most relevant items are for Apple iPads. They are iPad 2 models because the click dataset used here is based on logfile data from 2011, and at that time, the iPad 2 was the most recent iPad on the market. There were more clicks on the 16GB iPads over the more expensive 32GB model, and for the color black over the color white.

Peeking Under the Hood

Of course, under the hood, Fusion is leveraging the awesome power of Solr. To see how this works, I show both the Fusion query and the JSON of the Solr response. To display the Fusion query, I go into the Search UI Configuration, change the “General” settings, and check the “Show Query URL” option. To see the Solr response in JSON format, I change the display control from “Results” to “JSON”.

The following screenshot shows the Fusion UI search display for “ipad”:

recommendations boost, under the hood

The query “ipad” entered via the Fusion UI search box is transformed into the following request sent to the Fusion REST-API:

/api/apollo/query-pipelines/bb_catalog-default/collections/bb_catalog/select?fl=*,score&echoParams=all&wt=json&json.nl=arrarr&sort&start=0&q=ipad&debug=true&rows=10

This request to the Query Pipelines API sends a query through the query pipeline “bb_catalog-default” for the collection “bb_catalog” using the Solr “select” request handler, where the search query parameter “q” has value “ipad”. Because the parameter “debug” has value “true”, the Solr response contains debug information, outlined by the yellow rectangle. The “bb_catalog-default” query pipeline transforms the query “ipad” into the following Solr query:
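
The same request can also be issued from a script. Here is a minimal sketch using Python’s requests library; the host, port, and credentials are assumptions for illustration, not values from this article:

import requests

FUSION = "http://localhost:8764"   # assumed Fusion host and port

params = {
    "q": "ipad", "fl": "*,score", "echoParams": "all", "wt": "json",
    "json.nl": "arrarr", "start": 0, "rows": 10, "debug": "true",
}
resp = requests.get(
    FUSION + "/api/apollo/query-pipelines/bb_catalog-default"
             "/collections/bb_catalog/select",
    params=params,
    auth=("admin", "password123"),   # assumed credentials
)
print(resp.json()["response"]["numFound"])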

"parsedquery": "(+DisjunctionMaxQuery((name_t:ipad)) 
id:1945531^4.0904393 id:2339322^1.5108471 id:1945595^1.0636971
id:1945674^0.4065684 id:2842056^0.3342921 id:2408224^0.4388061
id:2339386^0.39254773 id:2319133^0.32736558 id:9924603^0.1956079
id:1432551^0.18906432)/no_coord"

The outer part of this expression, “( … )/no_coord”, is a reporting detail, indicating Solr’s “coord scoring” feature wasn’t used.

The enclosed expression consists of:

  • The search: “+DisjunctionMaxQuery(name_t:ipad)”.
  • A set of selective boosts to be applied to the search results

The field name “name_t” is supplied by the set of search fields specified by the Search Fields query stage. (Note: if no search fields are specified, the default search field name “text” is used. Since the documents in collection “bb_catalog” don’t contain a field named “text”, this stage must be configured with the appropriate set of search fields.)

The Recommendations Boosting stage was configured with the default parameters:

  • Number of Recommendations: 10
  • Number of Signals: 100

There are 10 documents boosted, with ids ( 1945531, 2339322, 1945595, 1945674, 2842056, 2408224, 2339386, 2319133, 9924603, 1432551 ). This set of 10 documents represents documents which had at least 100 clicks where “ipad” occurred in the user search query. The boost factor is a number derived from the aggregated signals by the Recommendation Boosting stage. If those documents contain the term “name_t:ipad”, then they will be boosted. If those documents don’t contain the term, then they won’t be returned by the Solr query.
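
In other words, the stage appends a set of id^weight boost terms to the main query. As a rough sketch of that transformation (not the actual pipeline stage code), using the aggregated signal fields described earlier:

def boost_terms(aggregated_signals, num_recommendations=10):
    # Turn aggregated signal docs into Solr boost terms such as id:1945531^4.09,
    # keeping only the top-weighted documents.
    top = sorted(aggregated_signals, key=lambda d: d["weight_d"], reverse=True)
    return ["id:%s^%s" % (d["doc_id_s"], d["weight_d"])
            for d in top[:num_recommendations]]

# e.g. the boosted query is roughly: "ipad " + " ".join(boost_terms(signals))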

To summarize: adding in the Recommendations Boosting stage results in a Solr query where selective boosts will be applied to 10 documents, based on clickstream information from an undifferentiated set of previous searches. The improvement in the quality of the search results is dramatic.

Even Better Search

Adding more processing to the query pipeline allows for user-specific and search-specific refinements. Like the Recommendations Boosting stage, these more complex query pipelines leverage Solr’s expressive query language, flexible scoring, and lightning fast search and indexing. Fusion query pipelines plus aggregated signals give you the tools you need to rapidly improve the user search experience.

The post Better Search with Fusion Signals appeared first on Lucidworks.

Koha - 3.20.3, 3.18.10, 3.16.14 / FOSS4Lib Recent Releases

Package: Koha
Release Date: Monday, August 31, 2015

Last updated September 1, 2015. Created by David Nind on September 1, 2015.

Monthly maintenance releases for Koha.

See the release announcements for the details:

New Exhibitions from the Public Library Partnerships Project / DPLA

We are pleased to announce the publication of 10 new exhibitions created by DPLA Hubs and public librarian participants in our Public Library Partnerships Project (PLPP), funded by the Bill and Melinda Gates Foundation. Over the course of the last six months, curators from Digital Commonwealth, Digital Library of Georgia, Minnesota Digital Library, the Montana Memory Project, and Mountain West Digital Library researched and built these exhibitions to showcase content digitized through PLPP. Through this final phase of the project, public librarians had the opportunity to share their new content, learn exhibition curation skills, explore Omeka for future projects, and contribute to an open peer review process for exhibition drafts.

  • A History of US Public Libraries: http://dp.la/exhibitions/exhibits/show/history-us-public-libraries
  • Patriotic Labor: America During World War I
  • Best Foot Forward: http://dp.la/exhibitions/exhibits/show/shoe-industry-massachusetts
  • Quack Cures and Self-Remedies: Patent Medicine
  • Boom and Bust: The Industries That Settled Montana
  • Recreational Tourism in the Mountain West
  • Children in Progressive-Era America
  • Roosevelt's Tree Army: The Civilian Conservation Corps
  • Georgia's Home Front: World War II
  • Urban Parks in the United States

Congratulations to all of our curators and, in particular, our exhibition organizers: Greta Bahnemann, Jennifer Birnel, Hillary Brady, Anna Fahey-Flynn, Greer Martin, Mandy Mastrovita, Anna Neatrour, Carla Urban, Della Yeager, and Franky Abbott.

Thanks to the following reviewers who participated in our open peer review process: Dale Alger, Cody Allen, Greta Bahnemann, Alexandra Beswick, Jennifer Birnel, Hillary Brady, Wanda Brown, Anne Dalton, Carly Delsigne, Liz Dube, Ted Hathaway, Sarah Hawkins, Jenny Herring, Tammi Jalowiec, Stef Johnson, Greer Martin, Sheila McAlister, Lisa Mecklenberg-Jackson, Tina Monaco, Mary Moore, Anna Neatrour, Michele Poor, Amy Rudersdorf, Beth Safford, Angela Stanley, Kathy Turton, and Carla Urban.

For more information about the Public Library Partnerships Project, please contact PLPP project manager, Franky Abbott: franky@dp.la.

Momentum, we have it! / District Dispatch

The word Momentum displayed as a Newtons Cradle

Source: Real Momentum

As you may have read here, school libraries are well represented in S. 1177, the Every Child Achieves Act.  In fact, we were more successful with this bill than we have been in recent history, and this is largely due to your efforts in contacting Congress.

Currently, the House Committee on Education and Workforce (H.R. 5, the Student Success Act) and the Senate Committee on Health, Education, Labor and Pensions are preparing to go to “conference” in an attempt to work out differences between the two versions of the legislation and reach agreement on reauthorization of ESEA. ALA is encouraged that provisions included under S. 1177 would support effective school library programs. In particular, ALA is pleased that effective school library program provisions were adopted unanimously during HELP Committee consideration of an amendment offered by Senator Whitehouse (D-RI) and on the Senate floor with an amendment offered by Senators Reed (D-RI) and Cochran (R-MS).

ALA is asking (with your help!) that any conference agreement to reauthorize ESEA maintain the following provisions that were overwhelmingly adopted by the HELP Committee and the full Senate under S. 1177, the Every Child Achieves Act:

  1. Title V, Part H – Literacy and Arts Education – Authorizes activities to promote literacy programs that support the development of literacy skills in low-income communities (similar to the Innovative Approaches to Literacy program that has been funded through appropriations) as well as activities to promote arts education for disadvantaged students.
  2. Title I – Improving Basic Programs Operated by State and Local Educational Agencies – Under Title I of ESEA, State Educational Agencies (SEAs) and local educational agencies (LEAs) must develop plans on how they will implement activities funded under the Act.
  3. Title V, Part G – Innovative Technology Expands Children’s Horizons (I-TECH) – Authorizes activities to ensure all students have access to personalized, rigorous learning experiences that are supported through technology and to ensure that educators have the knowledge and skills to use technology to personalize learning.

Now is the time to keep the momentum going! Contact your Senators and Representative to let them know that you support the effective school library provisions found in the Senate bill and they should too!

A complete list of school library provisions found in S.1177 can be found here.

The post Momentum, we have it! appeared first on District Dispatch.

Call for Convenors / Access Conference

Do you want to be part of the magic of AccessYYZ? Well, aren’t you lucky? Turns out we’re  looking for some convenors!

Convening isn’t much work (not that we think you’re a slacker or anything)–all you have to do is introduce the name of the session, read the bio of the speaker(s), and thank any sponsors. Oh, and facilitate any question and answer segments. Which doesn’t actually mean you’re on the hook to come up with questions (that’d be rather unpleasant of us) so much as you’ll repeat questions from the crowd into the microphone. Yup, that’s it. We’ll give you a script and everything!

In return, you’ll get eternal gratitude from the AccessYYZ Organizing Committee. And also a high five! If you’re into that sort of thing. Even if you’re not, you’ll get to enjoy the bright lights and the glory that comes with standing up in front of some of libraryland’s most talented humans for 60 seconds. Sound good? We thought so.

You can dibs a session by filling out the Doodle poll.

Supporting ProseMirror inline HTML editor / Peter Sefton

The world needs a good, sane in-browser editing component, one that edits document structure (headings, lists, quotes etc) rather than format (font, size etc). I’ve been thinking for a while that an editing component based around Markdown (or Commonmark) would be just the thing. Markdown/Commonmark is effectively a spec for the minimal sensible markup set for documents; it’s more than adequate for articles, theses, reports etc. And it can be extended with document semantics.

Anyway, there’s a crowdfunding campaign going on for an editor called ProseMirror that does just that, and promises collaborative editing as well. It’s beta quality but looks promising, so I chipped in 50 Euros to try to get it over the line to be released as open source.

The author says:

Who I am

This campaign is being run by Marijn Haverbeke, author of CodeMirror, a widely used in-browser code editor, Eloquent JavaScript, a freely available JavaScript book, and Tern, which is an editor-assistance engine for JavaScript coding that I also crowd-funded here. I have a long history of releasing and maintaining solid open software. My work on CodeMirror (which you might know as the editor in Chrome and Firefox’s dev tools) has given me lots of experience with writing a fast, solid, extendable editor. Many of the techniques that went into ProseMirror have already proven themselves in CodeMirror.

There’s a lot to like with this editor - it has a nice floating toolbar that pops up at the right of the paragraph, with a couple of not-quite-standard behaviours that just might catch on. It mostly works, but has some really obvious bugs and usability issues, like when I try to make a nested list it makes commonmark like this:

* List item
* List item
* * List item

And it even renders the two bullets side by side in the HTML view. Even though that is apparently supported by commonmark, for a prose editor it’s just wrong. Nobody means two bullets unless they’re up to no good, typographically speaking.

The editor should do the thing you almost certainly mean. Something like:

* List item
* List item
  * List item

But, if that stuff gets cleaned up then this will be perfect for producing Scholarly Markdown, and Scholarly HTML. The $84 AUD means I’ll get priority on reporting a bug, assuming it reaches its funding goal.

Apache Solr for Multi-language Content Discovery Through Entity Driven Search / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Alessandro Benedetti’s session on using entity driven search for multi-language content discovery and search. This talk describes the implementation of a Semantic Search Engine based on Solr. Meaningfully structuring content is critical; Natural Language Processing and Semantic Enrichment are becoming increasingly important to improve the quality of Solr search results. Our solution is based on three advanced features:
  1. Entity-oriented search – Searching not by keyword, but by entities (concepts in a certain domain)
  2. Knowledge graphs – Leveraging relationships amongst entities: Linked Data datasets (Freebase, DbPedia, Custom …)
  3. Search assistance – Autocomplete and Spellchecking are now common features, but using semantic data makes it possible to offer smarter features, driving the users to build queries in a natural way.
The approach includes unstructured data processing mechanisms integrated with Solr to automatically index semantic and multi-language information. Smart Autocomplete will complete users’ queries with entity names and properties from the domain knowledge graph. As the user types, the system will propose a set of named entities and/or a set of entity types across different languages. As the user accepts a suggestion, the system will dynamically adapt following suggestions and return relevant documents. Semantic More Like This will find similar documents to a seed one, based on the underlying knowledge in the documents, instead of tokens.

Alessandro Benedetti is a search expert and semantic technology enthusiast working in the R&D division of Zaizi. His favorite work is in R&D on information retrieval, NLP and machine learning with a big emphasis on data structures, algorithms and probability theory. Alessandro earned his Masters in Computer Science with full grade in 2009, then spent six months with Universita’ degli Studi di Roma working on his master’s thesis around a new approach to improve semantic web search. Alessandro spent 3 years with Sourcesense as a Search and Open Source consultant and developer.
Multi-language Content Discovery Through Entity Driven Search: Presented by Alessandro Benedetti, Zaizi from Lucidworks
Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Apache Solr for Multi-language Content Discovery Through Entity Driven Search appeared first on Lucidworks.

Islandora 150 / Islandora

Last fall I issued a challenge to the community to help us get 100 dots on our Installations Map by the end of 2014. It was a tight goal, but we got there, and our map got a little more cluttered:

This year I'm asking the community to stretch a little further. We know there are far more sites out there than show on our list. I want to find at least 48 of them and put them on the map by the end of 2015, taking us to 150. For helping us get there, I'm offering up a chance to win one of five Islandora tuques or one of three ridiculously adorable Islandoracon lobster finger puppets.

How to help (and enter the draw):

  • Tell me about your Islandora repository and send a link
  • Tell me about your Islandora repository in development
  • Give me a new link to your repository if you are already on the list but not linked or at an outdated link.
  • Nominate a public-facing Islandora repository that is not already on the list.
  • Give some new information about your link, so we can tag it as:
    • Institutional Repository
    • Research Data
    • Digital Humanities
    • Consortium/Multisite

Email manez@islandora.ca with your new or updated Installation Map dots. If we hit 150 by January 1st, 2016, I will draw for the hats and lobsters and send out some New Year's prizes to the lucky winners. Thanks for your help!

Packaging Video DVDs for the Repository / Mark E. Phillips

For a while I’ve had two large boxes of DVDs that a partner institution dropped off with the hopes of having them added to The Portal to Texas History.  These DVDs were from oral histories conducted by the local historical commission from 1998-2002 and were converted from VHS to DVD sometime in the late 2000s.  They were interested in adding these to the Portal so that they could be viewed by a wider audience and also be preserved in the UNT Libraries’ digital repository.

So these DVDs sat on my desk for a while because I couldn’t figure out what I wanted to do with them.  I wanted to figure out a workflow that I could use from all Video DVD based projects in the future and it hurt my head whenever I started to work on the project.  So they sat.

When the partner politely emailed about the disks and asked about the delay in getting them loaded, I figured it was finally time to get a workflow figured out so that I could get the originals back to the partner.  I’m sharing the workflow that I came up with here because I didn’t see much prior information on this sort of thing when I was researching the process.

Goals:

I had two primary goals for the conversion workflow. First, I wanted to retain an exact copy of the disk that we were working with.  All of these videos were VHS to DVD conversions, most likely completed with a stand-alone recorder.  They had very simple title screens and lacked other features, but I figured that other kinds of Video DVD work in the future might include features I didn’t want to lose by just extracting the video.  The second goal was to pull the video off the DVD without introducing additional compression during the process. When these files get ingested into the repository and the final access system they will be converted into an mp4 container using the h.264 codec, so they will get another round of compression later.

With these two goals in mind here is what I ended up with.

For the conversion I used my MacBook Pro and SuperDrive.  I first created an iso image of the disc using the hdiutil command.

hdiutil makehybrid -iso -joliet -o image.iso /Volumes/DVD_VR/

Once this image was created, I mounted it by double-clicking on the image.iso file in the Finder.

I then loaded MakeMKV and created an MKV file from the video and audio on the disc that I was interested in.  The resulting mkv file would contain the primary video content that users will interact with in the future.  I saved this file as title00.mkv

MakeMKV screenshot

Once this step was completed I used ffmpeg to convert the mkv container to an mpeg container to add to the repository.  I could have kept the container as mkv but decided to move it over to mpeg because we already have a number of those files in the repository and no mkv files to date.  The ffmpeg command is as follows.

ffmpeg -i title00.mkv -vcodec copy -acodec copy -f vob -copyts -y video.mpg

Because the MakeMKV and ffmpeg commands are just muxing the video and audio and not compressing, they tend to process very quickly, in just a few seconds.  The most time-consuming part of the process is creating the iso in the first step.
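
The two scriptable steps here (imaging the disc and remuxing the MKV) lend themselves to a small wrapper script when there are many discs to process. Below is a minimal Python sketch that just shells out to the same hdiutil and ffmpeg commands shown above; the identifier and file names are placeholders taken from this project, and the MakeMKV rip in between is still done by hand in the GUI.

import subprocess
from pathlib import Path

def make_iso(dvd_volume: str, iso_path: Path) -> None:
    """Image a mounted Video DVD to an iso with hdiutil (macOS), mirroring the command above."""
    subprocess.run(
        ["hdiutil", "makehybrid", "-iso", "-joliet", "-o", str(iso_path), dvd_volume],
        check=True,
    )

def remux_mkv_to_mpg(mkv_path: Path, mpg_path: Path) -> None:
    """Remux the MakeMKV output into an mpeg container without re-encoding, mirroring the ffmpeg command above."""
    subprocess.run(
        ["ffmpeg", "-i", str(mkv_path), "-vcodec", "copy", "-acodec", "copy",
         "-f", "vob", "-copyts", "-y", str(mpg_path)],
        check=True,
    )

if __name__ == "__main__":
    item = "DI028_dodo_parker_1998-07-15"   # example identifier from the package layout below
    make_iso("/Volumes/DVD_VR/", Path(item + ".iso"))
    # ...rip the main title with MakeMKV by hand, saving title00.mkv, then:
    remux_mkv_to_mpg(Path("title00.mkv"), Path(item + ".mpg"))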

With all of these files now created I packaged them up for loading into the repository.  Here is what a pre-submission package looks like for a Video DVD using this workflow.

DI028_dodo_parker_1998-07-15/
├── 01_mpg/
│   └── DI028_dodo_parker_1998-07-15.mpg
├── 02_iso/
│   └── DI028_dodo_parker_1998-07-15.iso
└── metadata.xml

You can see that we place the mpg and iso files in separate folders, 01_mpg for the mpg and 02_iso for the iso file.  When we create the SIP for these files we will notate that the 02_iso format should not be pushed to the dissemination package (what we locally call an Access Content Package or ACP) so the iso file and folder will just live with the archival package.

This seemed to work for me to get these Video DVDs converted over and placed in the repository.  The workflow satisfied my two goals of retaining a full copy of the original disk as an iso and also getting a copy of the video from the disk in a format that didn’t introduce an extra compression step.  I think there is probably a way of getting from the iso straight to the mpg version, most likely with the handy ffmpeg (or possibly mplayer?), but I haven’t taken the time to look into that.
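
For what it’s worth, one untested possibility for going straight from the mounted iso to the mpg with ffmpeg is sketched below. It assumes a standard DVD-Video layout where the main title set is VTS_01, and it uses ffmpeg’s concat protocol with stream copy; treat the paths and the title-set number as assumptions to verify against a real disc rather than a tested recipe.

import glob
import subprocess

def iso_main_title_to_mpg(mounted_volume: str, mpg_path: str) -> None:
    """Untested sketch: remux the main title's VOB files straight from a mounted Video DVD iso.
    Assumes a standard layout where the main title set is VTS_01 (VTS_01_0.VOB is the menu,
    so it is skipped) and that file-level concatenation of the remaining VOBs is acceptable."""
    vobs = sorted(glob.glob(mounted_volume + "/VIDEO_TS/VTS_01_[1-9].VOB"))
    concat_input = "concat:" + "|".join(vobs)          # ffmpeg's concat protocol
    subprocess.run(
        ["ffmpeg", "-i", concat_input, "-c", "copy", "-f", "vob", "-y", mpg_path],
        check=True,
    )

# e.g. after mounting with `hdiutil attach image.iso`:
# iso_main_title_to_mpg("/Volumes/DVD_VR", "video.mpg")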

There is a downside to this way of handling Video DVDs: it will most likely take up twice the amount of storage as the original disk, so for a 4 GB Video DVD we will be storing 8 GB of data in the repository.  This would probably add up for a very large project, but that’s a worry for another day (and a worry that honestly gets smaller year after year).

I hope that this explanation of how I processed Video  DVDs for inclusion into our repository was useful to someone else.

Let me know what you think via Twitter if you have questions or comments.

Mining Events for Recommendations / SearchHub

Summary: The “EventMiner” feature in Lucidworks Fusion can be used to mine event logs to power recommendations. We describe how the system uses graph navigation to generate diverse and high-quality recommendations.

User Events

The log files that most web services generate are a rich source of data for learning about user behavior and modifying system behavior based on this. For example, most search engines will automatically log details on user queries and the resulting clicked documents (URLs). We can define a (user, query, click, time) record which records a unique “event” that occurred at a specific time in the system. Other examples of event data include e-commerce transactions (e.g. “add to cart”, “purchase”), call data records, financial transactions etc. By analyzing a large volume of these events we can “surface” implicit structures in the data (e.g. relationships between users, queries and documents), and use this information to make recommendations, improve search result quality and power analytics for business owners. In this article we describe the steps we take to support this functionality.

1. Grouping Events into Sessions

Event logs can be considered as a form of “time series” data, where the logged events are in temporal order. We can then make use of the observation that events close together in time will be more closely related than events further apart. To do this we need to group the event data into sessions.
A session is a time window for all events generated by a given source (like a unique user ID). If two or more queries (e.g. “climate change” and “sea level rise”) frequently occur together in a search session then we may decide that those two queries are related. The same would apply for documents that are frequently clicked on together. A “session reconstruction” operation identifies users’ sessions by processing raw event logs and grouping them based on user IDs, using the time-intervals between each and every event. If two events triggered by the same user occur too far apart in time, they will be treated as coming from two different sessions. For this to be possible we need some kind of unique ID in the raw event data that allows us to tell that two or more events are related because they were initiated by the same user within a given time period. However, from a privacy point of view, we do not need an ID which identifies an actual real person with all their associated personal information. All we need is an (opaque) unique ID which allows us to track an “actor” in the system.
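
As a concrete illustration of session reconstruction (not Fusion’s actual implementation), the sketch below groups time-ordered events by an opaque user ID and starts a new session whenever the gap between consecutive events exceeds a timeout; the 30-minute default and the event tuple layout are assumptions made for the example.

from collections import defaultdict
from datetime import datetime, timedelta

# Each event is (user_id, event_type, value, timestamp), e.g. ("u1", "query", "climate change", t)
Event = tuple[str, str, str, datetime]

def sessionize(events: list[Event], timeout: timedelta = timedelta(minutes=30)) -> list[list[Event]]:
    """Group events into per-user sessions, cutting a new session whenever the
    gap between two consecutive events from the same user exceeds `timeout`."""
    by_user: dict[str, list[Event]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e[3]):   # keep temporal order
        by_user[event[0]].append(event)

    sessions: list[list[Event]] = []
    for user_events in by_user.values():
        current = [user_events[0]]
        for prev, curr in zip(user_events, user_events[1:]):
            if curr[3] - prev[3] > timeout:
                sessions.append(current)               # too far apart in time: close the session
                current = []
            current.append(curr)
        sessions.append(current)
    return sessions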

2. Generating a Co-Occurrence Matrix from the Session Data

We are interested in entities that frequently co-occur, as we might then infer some kind of interdependence between those entities. For example, a click event can be described using a click(user, query, document) tuple, and we associate each of those entities with each other and with other similar events within a session. A key point here is that we generate the co-occurrence relations not just between the same field types e.g. (query, query) pairs, but also “cross-field” relations e.g. (query, document), (document, user) pairs etc. This will give us an N x N co-occurrence matrix, where N = all unique instances of the field types that we want to calculate co-occurrence relations for. Figure 1 below shows a co-occurrence matrix that encodes how many times different characters co-occur (appear together in the text) in the novel “Les Miserables”. Each colored cell represents two characters that appeared in the same chapter; darker cells indicate characters that co-occurred more frequently. The diagonal line going from the top left to the bottom right shows that each character co-occurs with itself. You can also see that the character named “Valjean”, the protagonist of the novel, appears with nearly every other character in the book.


Figure 1. “Les Miserables” Co-occurrence Matrix by Mike Bostock.

In Fusion we generate a similar type of matrix, where each of the items is one of the types specified when configuring the system. The value in each cell will then be the frequency of co-occurrence for any two given items e.g. a (query, document) pair, a (query, query) pair, a (user, query) pair etc.

For example, if the query “Les Mis” and a click on the web page for the musical appear together in the same user session then they will be treated as having co-occurred. The frequency of co-occurrence is then the number of times this has happened in the raw event logs being processed.
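
A toy version of that counting step might look like the following, where each session is flattened to a list of typed items (strings such as "query:les mis" or "doc:musical-page" are made-up identifiers) and every unordered pair seen in the same session is tallied:

from collections import Counter
from itertools import combinations

def cooccurrence(sessions: list[list[str]]) -> Counter:
    """Count how often each unordered pair of items appears in the same session.
    Items are typed strings, so cross-field pairs (query-document, user-query, ...)
    are counted alongside same-field pairs like query-query."""
    counts: Counter = Counter()
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            counts[(a, b)] += 1
    return counts

sessions = [
    ["user:u1", "query:les mis", "doc:musical-page"],
    ["user:u2", "query:les mis", "doc:musical-page", "doc:novel-page"],
]
print(cooccurrence(sessions)[("doc:musical-page", "query:les mis")])   # -> 2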

3. Generating a Graph from the Matrix

The co-occurrence matrix from the previous step can also be treated as an “adjacency matrix”, which encodes whether two vertices (nodes) in a graph are “adjacent” to each other i.e. have a link or “co-occur”. This matrix can then be used to generate a graph, as shown in Figure 2:


Figure 2. Generating a Graph from a Matrix.

Here the values in the matrix are the frequency of co-occurrence for those two vertices. We can see that in the graph representation these are stored as “weights” on the edge (link) between the nodes e.g. nodes V2 and V3 co-occurred 5 times together.

We encode the graph structure in a collection in Solr using a simple JSON record for each node. Each record contains fields that list the IDs of other nodes that point “in” at this record, or which this node points “out” to.

Fusion provides an abstraction layer which hides the details of constructing queries to Solr to navigate the graph. Because we know the IDs of the records we are interested in, we can generate a single boolean query where the individual IDs we are looking for are separated by OR operators, e.g. (id:3677 OR id:9762 OR id:1459). This means we only make a single request to Solr to get the details we need.
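
Purely as an illustration of the idea (the field names, example IDs, and collection name below are hypothetical, not Fusion’s actual schema or API), a node record and a single OR-ed neighbour lookup against Solr’s standard /select endpoint could look like this:

import requests

# A node record as it might be stored in the Solr collection (hypothetical field names):
node = {
    "id": "query:midnight club",
    "type": "query",
    "out_ids": ["doc:3677", "doc:9762", "doc:1459"],   # nodes this record points out to
    "out_weights": [12, 5, 3],                         # co-occurrence counts for those edges
    "in_ids": ["user:42"],                             # nodes that point in at this record
}

def fetch_neighbours(solr_url: str, neighbour_ids: list[str]) -> list[dict]:
    """Fetch all neighbour records in one request by OR-ing their IDs together."""
    query = "id:(" + " OR ".join('"%s"' % i for i in neighbour_ids) + ")"
    resp = requests.get(solr_url + "/select",
                        params={"q": query, "wt": "json", "rows": len(neighbour_ids)})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]

# e.g. fetch_neighbours("http://localhost:8983/solr/eventminer", node["out_ids"])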

In addition, the fact that we are only interested in the neighborhood graph around a start point means the system does not have to store the entire graph (which is potentially very large) in memory.

4. Powering Recommendations from the Graph

At query/recommendation time we can use the graph to make suggestions on which other items in that graph are most related to the input item, using the following approach:

  1. Navigate the co-occurrence graph out from the seed item to harvest additional entities (documents, users, queries).
  2. Merge the list of entities harvested from different nodes in the graph so that the more lists an entity appears in the more weight it receives and the higher it rises in the final output list.
  3. Weights are based on the reciprocal rank of the overall rank of the entity. The overall rank is calculated as the sum of the rank of the result the entity came from and the rank of the entity within its own list, as sketched below.
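
A compact sketch of that merge-and-weight step (an illustration of the scheme described above, not Fusion’s code) follows; the item names are made up.

from collections import defaultdict

def merge_by_reciprocal_rank(result_lists: list[list[str]]) -> list[tuple[str, float]]:
    """Merge ranked entity lists harvested from different graph nodes.
    An entity's overall rank is the rank of the list it came from plus its rank
    within that list; its weight is the sum of reciprocal overall ranks, so
    entities that appear in several lists rise toward the top."""
    scores: dict[str, float] = defaultdict(float)
    for list_rank, entities in enumerate(result_lists, start=1):
        for entity_rank, entity in enumerate(entities, start=1):
            scores[entity] += 1.0 / (list_rank + entity_rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

lists = [["doc:A", "doc:B"], ["doc:B", "doc:C"], ["doc:B", "doc:A"]]
print(merge_by_reciprocal_rank(lists))   # doc:B ends up ranked first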
 

The following image shows the graph surrounding the document “Midnight Club: Los Angeles” from a sample data set:


Figure 3. An Example Neighborhood Graph.

Here the relative size of the nodes shows how frequently they occurred in the raw event data, and the size of the arrows is a visual indicator of the weight or frequency of co-occurrence between two elements.

For example, we can see that the query “midnight club” (blue node on bottom RHS) most frequently resulted in a click on the “Midnight Club: Los Angeles Complete Edition Platinum Hits” product (as opposed to the original version above it). This is the type of information that would be useful to a business analyst trying to understand user behavior on a site.

Diversity in Recommendations

For a given item, we may only have a small number of items that co-occur with it (based on the co-occurrence matrix). By adding in the data from navigating the graph (which comes from the matrix), we increase the diversity of suggestions. Items that appear in multiple source lists then rise to the top. We believe this helps improve the quality of the recommendations & reduce bias. For example, in Figure 4 we show some sample recommendations for the query “Call of Duty”, where the recommendations are coming from a “popularity-based” recommender i.e. it gives a large weight to items with the most clicks. We can see that the suggestions are all from the “Call of Duty” video game franchise:


Figure 4. Recommendations from a “popularity-based” recommender system.

In contrast, in Figure 5 we show the recommendations from EventMiner for the same query:


Figure 5. Recommendations from navigating the graph.

Here we can see that the suggestions are now more diverse, with the first two being games from the same genre (“First Person Shooter” games) as the original query.

In the case of an e-commerce site, diversity in recommendations can be an important factor in suggesting items to a user that are related to their original query, but which they may not be aware of. This in turn can help increase the overall CTR (Click-Through Rate) and conversion rate on the site, which would have a direct positive impact on revenue and customer retention.

Evaluating Recommendation Quality

To evaluate the quality of the recommendations produced by this approach we used CrowdFlower to get user judgements on the relevance of the suggestions produced by EventMiner. Figure 6 shows an example of how a sample recommendation was presented to a human judge:


Figure 6. Example relevance judgment screen (CrowdFlower).

Here the original user query (“resident evil”) is shown, along with an example recommendation (another video game called “Dead Island”). We can see that the judge is asked to select one of four options, which is used to give the item a numeric relevance score:

  1. Off Topic
  2. Acceptable
  3. Good
  4. Excellent

In this example the user might judge the relevance for this suggestion as “good”, as the game being recommended is in the same genre (“survival horror”) as the original query. Note that the product title contains no terms in common with the query i.e. the recommendations are based purely on the graph navigation and do not rely on an overlap between the query and the document being suggested. In Table 1 we summarize the results of this evaluation:
Table 1. Evaluation summary:

  • Items: 1000
  • Judgements: 2319
  • Users: 30
  • Avg. Relevance (1 – 4): 3.27

Here we can see that the average relevance score across all judgements was 3.27 i.e. “good” to “excellent”.

Conclusion

If you want an “out-of-the-box” recommender system that generates high-quality recommendations from your data please consider downloading and trying out Lucidworks Fusion.

The post Mining Events for Recommendations appeared first on Lucidworks.

Michigan becomes the latest Hydra Partner / Hydra Project

We are delighted to announce that the University of Michigan has become the latest formal Hydra Partner.  Maurice York, their Associate University Librarian for Library Information Technology, writes:

“The strength, vibrancy and richness of the Hydra community is compelling to us.  We are motivated by partnership and collaboration with this community, more than simply use of the technology and tools. The interest in and commitment to the community is organization-wide; last fall we sent over twenty participants to Hydra Connect from across five technology and service divisions; our showing this year will be equally strong, our enthusiasm tempered only by the registration limits.”

Welcome Michigan!  We look forward to a long collaboration with you.

Update on the Library Privacy Pledge / Eric Hellman

The Library Privacy Pledge of 2015, which I wrote about previously, has been finalized. We got a lot of good feedback, and the big changes have focused on the schedule.

Now, any library, organization, or company that signs the pledge will have 6 months to implement HTTPS from the effective date of their signature. This should give everyone plenty of margin to do a good job on the implementation.

We pushed back our launch date to the first week of November. That's when we'll announce the list of "charter signatories". If you want your library, company or organization to be included in the charter signatory list, please send an e-mail to pledge@libraryfreedomproject.org.

The Let's Encrypt project will be launching soon. They are just one certificate authority that can help with HTTPS implementation.

I think this is a very important step for the library information community to take, together. Let's make it happen.

Here's the finalized pledge:

The Library Freedom Project is inviting the library community - libraries, vendors that serve libraries, and membership organizations - to sign the "Library Digital Privacy Pledge of 2015". For this first pledge, we're focusing on the use of HTTPS to deliver library services and the information resources offered by libraries. It’s just a first step: HTTPS is a privacy prerequisite, not a privacy solution. Building a culture of library digital privacy will not end with this 2015 pledge, but committing to this first modest step together will begin a process that won't turn back.  We aim to gather momentum and raise awareness with this pledge; and will develop similar pledges in the future as appropriate to advance digital privacy practices for library patrons.

We focus on HTTPS as a first step because of its timeliness. The Let's Encrypt initiative of the Electronic Frontier Foundation will soon launch a new certificate infrastructure that will remove much of the cost and technical difficulty involved in the implementation of HTTPS, with general availability scheduled for September. Due to a heightened concern about digital surveillance, many prominent internet companies, such as Google, Twitter, and Facebook, have moved their services exclusively to HTTPS rather than relying on unencrypted HTTP connections. The White House has issued a directive that all government websites must move their services to HTTPS by the end of 2016. We believe that libraries must also make this change, lest they be viewed as technology and privacy laggards, and dishonor their proud history of protecting reader privacy.

The 3rd article of the American Library Association Code of Ethics sets a broad objective:

We protect each library user's right to privacy and confidentiality with respect to information sought or received and resources consulted, borrowed, acquired or transmitted.

It's not always clear how to interpret this broad mandate, especially when everything is done on the internet. However, one principle of implementation should be clear and uncontroversial:

Library services and resources should be delivered, whenever practical, over channels that are immune to eavesdropping.

The current best practice dictated by this principle is as follows:
Libraries, and vendors that serve libraries and library patrons, should require HTTPS for all services and resources delivered via the web.

The Pledge for Libraries:

1. We will make every effort to ensure that web services and information resources under direct control of our library will use HTTPS within six months. [ dated______ ]

2. Starting in 2016, our library will assure that any new or renewed contracts for web services or information resources will require support for HTTPS by the end of 2016.

The Pledge for Service Providers (Publishers and Vendors):

1. We will make every effort to ensure that all web services that we (the signatories) offer to libraries will enable HTTPS within six months. [ dated______ ]

2. All web services that we (the signatories) offer to libraries will default to HTTPS by the end of 2016.

The Pledge for Membership Organizations:

1. We will make every effort to ensure that all web services that our organization directly control will use HTTPS within six months. [ dated______ ]

2. We encourage our members to support and sign the appropriate version of the pledge.

There's a FAQ available, too. All this will soon be posted on the Library Freedom Project website.

Link roundup August 30, 2015 / Harvard Library Innovation Lab

This is the good stuff.

Rethinking Work

Putting Elon Musk and Steve Jobs on a Pedestal Misrepresents How Innovation Happens

Lamp Shows | HAIKU SALUT

Lawn Order | 99% Invisible

Cineca DSpace Service Provider Update / DuraSpace News

From Andrea Bollini, Cineca

It has been a hot and productive summer here at Cineca: we have carried out several DSpace activities, together with the go-live of the National ORCID Hub to support the adoption of ORCID in Italy [1][2].

iSchool / Ed Summers

As you can see, I’ve recently changed things around here at inkdroid.org. Yeah, it’s looking quite spartan at the moment, although I’m hoping that will change in the coming year. I really wanted to optimize this space for writing in my favorite editor, and making it easy to publish and preserve the content. Wordpress has served me well over the last 10 years and up till now I’ve resisted the urge to switch over to a static site. But yesterday I converted the 394 posts, archived the Wordpress site and database, and am now using Jekyll. I haven’t been using Ruby as much in the past few years, but the tooling around Jekyll feels very solid, especially given GitHub’s investment in it.

Honestly, there was something that pushed me over the edge to do the switch. Next week I’m starting in the University of Maryland iSchool, where I will be pursuing a doctoral degree. I’m specifically hoping to examine some of the ideas I dredged up while preparing for my talk at NDF in New Zealand a couple years ago. I was given almost a year to think about what I wanted to talk about – so it was a great opportunity for me to reflect on my professional career so far, and examine where I wanted to go.

After I got back I happened across a paper by Steven Jackson called Rethinking Repair, which introduced me to what felt like a very new and exciting approach to information technology design and innovation that he calls Broken World Thinking. In hindsight I can see that both of these things conspired to make returning to school at 46 years of age look like a logical thing to do. If all goes as planned I’m going to be doing this part-time while also working at the Maryland Institute for Technology in the Humanities, so it’s going to take a while. But I’m in a good spot, and am not in any rush … so it’s all good as far as I’m concerned.

I’m planning to use this space for notes about what I’m reading, papers, reflections etc. I thought about putting my citations, notes into Evernote, Zotero, Mendeley etc, and I may still do that. But I’m going to try to keep it relatively simple and use this space as best I can to start. My blog has always had a navel gazy kind of feel to it, so I doubt it’s going to matter much.

To get things started I thought I’d share the personal statement I wrote for admission to the iSchool. I’m already feeling more focus than when I wrote it almost a year ago, so it will be interesting to return to it periodically. The thing that has become clearer to me in the intervening year is that I’m increasingly interested in examining the role that broken world thinking has played in both the design and evolution of the Web.

So here’s the personal statement. Hopefully it’s not too personal :-)


For close to twenty years I have been working as a software developer in the field of libraries and archives. As I was completing my Masters degree in the mid-1990s, the Web was going through a period of rapid growth and evolution. The computer labs at Rutgers University provided me with what felt like a front row seat to the development of this new medium of the World Wide Web. My classes on hypermedia and information seeking behavior gave me a critical foundation for engaging with the emerging Web. When I graduated I was well positioned to build a career around the development of software applications for making library and archival material available on the Web. Now, after working in the field, I would like to pursue a PhD in the UMD iSchool to better understand the role that the Web plays as an information platform in our society, with a particular focus on how archival theory and practice can inform it. I am specifically interested in archives of born digital Web content, but also in what it means to create a website that gets called an archive. As the use of the Web continues to accelerate and proliferate it is more and more important to have a better understanding of its archival properties.

My interest in how computing (specifically the World Wide Web) can be informed by archival theory developed while working in the Repository Development Center under Babak Hamidzadeh at the Library of Congress. During my eight years at LC I designed and built both internally focused digital curation tools as well as access systems intended for researchers and the public. For example, I designed a Web based quality assurance tool that was used by curators to approve millions of images that were delivered as part of our various digital conversion projects. I also designed the National Digital Newspaper Program’s delivery application, Chronicling America, that provides thousands of researchers access to over 8 million pages of historic American newspapers every day. In addition, I implemented the data management application that transfers and inventories 500 million tweets a day to the Library of Congress. I prototyped the Library of Congress Linked Data Service which makes millions of authority records available using Linked Data technologies.

These projects gave me hands-on, practical experience using the Web to manage and deliver Library of Congress data assets. Since I like to use agile methodologies to develop software, this work necessarily brought me into direct contact with the people who needed the tools built, namely archivists. It was through these interactions over the years that I began to recognize that my Masters work at Rutgers University was in fact quite biased towards libraries, and lacked depth when it came to the theory and praxis of archives. I remedied this by spending about two years of personal study reading about archival theory and practice, with a focus on appraisal, provenance, ethics, preservation and access. I also became a participating member of the Society of American Archivists.

During this period of study I became particularly interested in the More Product Less Process (MPLP) approach to archival work. I found that MPLP had a positive impact on the design of archival processing software since it oriented the work around making content available, rather than on often time consuming preservation activities. The importance of access to digital material is particularly evident since copies are easy to make, but rendering can often prove challenging. In this regard I observed that requirements for digital preservation metadata and file formats can paradoxically hamper preservation efforts. I found that making content available sooner rather than later can serve as an excellent test of whether digital preservation processing has been sufficient. While working with Trevor Owens on the processing of the Carl Sagan collection we developed an experimental system for processing born digital content using lightweight preservation standards such as BagIt in combination with automated topic model driven description tools that could be used by archivists. This work also leveraged the Web and the browser for access by automatically converting formats such as WordPerfect to HTML, so they could be viewable and indexable, while keeping the original file for preservation.

Another strand of archival theory that captured my interest was the work of Terry Cook, Verne Harris, Frank Upward and Sue McKemmish on post-custodial thinking and the archival enterprise. It was specifically my work with the Web archiving team at the Library of Congress that highlighted how important it is for record management practices to be pushed outwards onto the Web. I gained experience in seeing what makes a particular web page or website easier to harvest, and how impractical it is to collect the entire Web. I gained an appreciation for how innovation in the area of Web archiving was driven by real problems such as dynamic content and social media. For example I worked with the Internet Archive to archive Web content related to the killing of Michael Brown in Ferguson, Missouri by creating an archive of 13 million tweets, which I used as an appraisal tool, to help the Internet Archive identify Web content that needed archiving. In general I also saw how traditional, monolithic approaches to system building needed to be replaced with distributed processing architectures and the application of cloud computing technologies to easily and efficiently build up and tear down such systems on demand.

Around this time I also began to see parallels between the work of Matthew Kirschenbaum on the forensic and formal materiality of disk based media and my interests in the Web as a medium. Archivists usually think of the Web content as volatile and unstable, where turning off a web server can result in links breaking, and content disappearing forever. However it is also the case that Web content is easily copied, and the Internet itself was designed to route around damage. I began to notice how technologies such as distributed revision control systems, Web caches, and peer-to-peer distribution technologies like BitTorrent can make Web content extremely resilient. It was this emerging interest in the materiality of the Web that drew me to a position in the Maryland Institute for Technology in the Humanities where Kirschenbaum is the Assistant Director.

There are several iSchool faculty that I would potentially like to work with in developing my research. I am interested in the ethical dimensions to Web archiving and how technical architectures embody social values, which is one of Katie Shilton’s areas of research. Brian Butler’s work studying online community development and open data is also highly relevant to the study of collaborative and cooperative models for Web archiving. Ricky Punzalan’s work on virtual reunification in Web archives is also of interest because of its parallels with post-custodial archival theory, and the role of access in preservation. And Richard Marciano’s work on digital curation, in particular his recent work with the NSF on Brown Dog, would be an opportunity for me to further my experience building tools for digital preservation.

If admitted to the program I would focus my research on how Web archives are constructed and made accessible. This would include a historical analysis of the development of Web archiving technologies and organizations. I plan to look specifically at the evolution and deployment of Web standards and their relationship to notions of impermanence, and change over time. I will systematically examine current technical architectures for harvesting and providing access to Web archives. Based on user behavior studies I would also like to reimagine what some of the tools for building and providing access to Web archives might look like. I expect that I would spend a portion of my time prototyping and using my skills as a software developer to build, test and evaluate these ideas. Of course, I would expect to adapt much of this plan based on the things I learn during my course of study in the iSchool, and the opportunities presented by working with faculty.

Upon completion of the PhD program I plan to continue working on digital humanities and preservation projects at MITH. I think the PhD program could also qualify me to help build the iSchool’s new Digital Curation Lab at UMD, or similar centers at other institutions. My hope is that my academic work will not only theoretically ground my work at MITH, but will also be a source of fruitful collaboration with the iSchool, the Library and larger community at the University of Maryland. I look forward to helping educate a new generation of archivists in the theory and practice of Web archiving.

Learn About Islandora at the Amigos Online Conference / Cherry Hill Company

On September 17, 2015, I'll be giving the presentation "Bring Your Local, Unique Content to the Web Using Islandora" at the Amigos Open Source Software and Tools for the Library and Archive online conference. Amigos is bringing together practitioners from around the library field who have used open source in projects at their library. My talk will be about the Islandora digital asset management system, the fundamental building block of the Cherry Hill LibraryDAMS service.

Every library has content that is unique to itself and its community. Islandora is open source software that enables libraries to store, present, and preserve that unique content to their communities and to the world. Built atop the popular Drupal content management system and the Fedora digital object repository, Islandora powers many digital projects on the...


How Shutterstock Searches 35 Million Images by Color Using Apache Solr / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Shutterstock engineer Chris Becker’s session on how they use Apache Solr to search 35 million images by color. This talk covers some of the methods they’ve used for building color search applications at Shutterstock using Solr to search 40 million images. A couple of these applications can be found in Shutterstock Labs – notably Spectrum and Palette. We’ll go over the steps for extracting color data from images and indexing them into Solr, as well as looking at some ways to query color data in your Solr index. We’ll cover some issues such as what relevance means when you’re searching for colors rather than text, and how you can achieve various effects by ranking on different visual attributes.

At the time of this presentation, Chris was the Principal Engineer of Search at Shutterstock, a stock photography marketplace selling over 35 million images, where he’s worked on image search since 2008. In that time he’s worked on all the pieces of Shutterstock’s search technology ecosystem, from the core platform to relevance algorithms, search analytics, image processing, similarity search, internationalization, and user experience. He started using Solr in 2011 and has used it for building various image search and analytics applications.
Searching Images by Color: Presented by Chris Becker, Shutterstock from Lucidworks
Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr, on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post How Shutterstock Searches 35 Million Images by Color Using Apache Solr appeared first on Lucidworks.

DPLA Welcomes Four New Service Hubs to Our Growing Network / DPLA

The Digital Public Library of America is pleased to announce the addition of four Service Hubs that will be joining our Hub network. The Hubs represent Illinois, Michigan, Pennsylvania and Wisconsin.  The addition of these Hubs continues our efforts to help build local community and capacity, and further efforts to build an on-ramp to DPLA participation for every cultural heritage institution in the United States and its territories.

These Hubs were selected from the second round of our application process for new DPLA Hubs.  Each Hub has a strong commitment to bring together the cultural heritage content in their state to be a part of DPLA, and to build community and data quality among the participants.

In Illinois, the Service Hub responsibilities will be shared by the Illinois State Library, the Chicago Public Library, the Consortium of Academic and Research Libraries of Illinois (CARLI), and the University of Illinois at Urbana Champaign. More information about the Illinois planning process can be found here. Illinois plans to make available collections documenting coal mining in the state, World War II photographs taken by an Illinois veteran and photographer, and collections documenting rural healthcare in the state.

In Michigan, the Service Hub responsibilities will be shared by the University of Michigan, Michigan State University, Wayne State University, Western Michigan University, the Midwest Collaborative for Library Services and the Library of Michigan.  Collections to be shared with the DPLA cover topics including the history of the Motor City, historically significant American cookbooks, and Civil War diaries from the Midwest.

In Pennsylvania, the Service Hub will be led by Temple University, Penn State University, University of Pennsylvania and Free Library of Philadelphia in partnership with the Philadelphia Consortium of Special Collections Libraries (PACSCL) and the Pennsylvania Academic Library Consortium (PALCI), among other key institutions throughout the state.  More information about the Service Hub planning process in Pennsylvania can be found here.  Collections to be shared with DPLA cover topics including the Civil Rights Movement in Pennsylvania, Early American History, and the Pittsburgh Iron and Steel Industry.

The final Service Hub, representing Wisconsin will be led by Wisconsin Library Services (WiLS) in partnership with the University of Wisconsin-Madison, Milwaukee Public Library, University of Wisconsin-Milwaukee, Wisconsin Department of Public Instruction and Wisconsin Historical Society.  The Wisconsin Service Hub will build off of the Recollection Wisconsin statewide initiative.  Materials to be made available document the American Civil Rights Movement’s Freedom Summer and the diversity of Wisconsin, including collections documenting the lives of Native Americans in the state.

“We are excited to welcome these four new Service Hubs to the DPLA Network,” said Emily Gore, DPLA Director for Content. “These four states have each led robust, collaborative planning efforts and will undoubtedly be strong contributors to the DPLA Hubs Network.  We look forward to making their materials available in the coming months.”

The March on Washington: Hear the Call / DPLA

Fifty-two years ago this week, more than 200,000 Americans came together in the nation’s capital to rally in support of the ongoing Civil Rights movement. It was at that march that Martin Luther King Jr.’s iconic “I Have A Dream” speech was delivered. And it was at that march that the course of American history was forever changed, in an event that resonates with protests, marches, and movements for change around the country decades later.

Get a new perspective on the historic March on Washington with this incredible collection from WGBH via Digital Commonwealth. This collection of audio pieces, 15 hours in total, offers uninterrupted coverage of the March on Washington, recorded by WGBH and the Educational Radio Network (a small radio distribution network that later became part of National Public Radio). This type of coverage was unprecedented in 1963, and offers a wholly unique view on one of the nation’s most crucial historic moments.

In this audio series, you can hear Martin Luther King Jr.’s historic speech, along with the words of many other prominent civil rights leaders–John Lewis, Bayard Rustin, Jackie Robinson, Roy Wilkins,  Rosa Parks, and Fred Shuttlesworth. There are interviews with Hollywood elite like Marlon Brando and Arthur Miller, alongside the complex views of the “everyman” Washington resident. There’s also the folk music of the movement, recorded live here, of Joan Baez, Bob Dylan, and Peter, Paul, and Mary. There are the stories of some of the thousands of Americans who came to Washington D.C. that August–teachers, social workers, activists, and even a man who roller-skated to the march all the way from Chicago.

Hear speeches made about the global nonviolence movement, the labor movement, and powerful words from Holocaust survivor Joachim Prinz. Another notable moment in the collection is an announcement of the death of W.E.B. Du Bois, one of the founders of the NAACP and an early voice for civil rights issues.

These historic speeches are just part of the coverage, however. There are fascinating, if more mundane, announcements, too, about the amount of traffic in Washington and issues with both marchers’ and commuters’ travel (though they reported that “north of K Street appears just as it would on a Sunday in Washington”). Another big, though less notable, issue of the day, according to WGBH reports, was food poisoning from the chicken in boxed lunches served to participants at the march. There is also information about the preparation for the press, which a member of the march’s press committee says included more than 300 “out-of-town correspondents.” This was in addition to the core Washington reporters, radio stations like WGBH, TV networks, and international stations from Canada, Japan, France, Germany and the United Kingdom. These types of minute details and logistics offer a new window into a complex historic event, bringing together thousands of Americans at the nation’s capital (though, as WGBH reported, not without its transportation hurdles!).

At the end of the demonstration, you can hear for yourself a powerful pledge, recited from the crowd, to further the mission of the march. It ends poignantly: “I pledge my heart and my mind and my body unequivocally and without regard to personal sacrifice, to the achievement of social peace through social justice.”

Hear the pledge, alongside the rest of the march as it was broadcast live, in this inspiring and insightful collection, courtesy of WGBH via Digital Commonwealth.

Banner image courtesy of the National Archives and Records Administration.

A view of the March on Washington, showing the Reflecting Pool and the Washington Monument. Courtesy of the National Archives and Records Administration.
