Planet Code4Lib

LinkedIn’s Galene Search Architecture Built on Apache Lucene / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting LinkedIn engineers Diego Buthay and Sriram Sankar’s session on how they run their search architecture for the massive social network.

LinkedIn’s corpus is a richly structured professional graph comprised of 300M+ people, 3M+ companies, 2M+ groups, and 1.5M+ publishers. Members perform billions of searches, and each of those searches is highly personalized based on the searcher’s identity and relationships with other professional entities in LinkedIn’s economic graph. And all this data is in constant flux as LinkedIn adds more than 2 members every second in over 200 countries (2/3 of whom are outside the United States). As a result, we’ve built a system quite different from those used for other search applications. In this talk, we will discuss some of the unique systems challenges we’ve faced as we deliver highly personalized search over semi-structured data at massive scale.

Diego (“Mono”) Buthay is a staff engineer at LinkedIn, where he works on the back-end infrastructure for all of LinkedIn’s search products. Before that, he built the search-as-a-service platform at IndexTank, which LinkedIn acquired in 2011. He has BS and MS degrees in computer software engineering from the University of Buenos Aires. Sriram Sankar is a principal staff engineer at LinkedIn, where he leads the development of its next-generation search architecture. Before that, he led Facebook’s search quality efforts for Graph Search, and was a key contributor to Unicorn. He previously worked at Google on search quality and ads infrastructure. He is also the author of JavaCC, a leading parser generator for Java. Sriram has a PhD from Stanford University and a BS from IIT Kanpur.

Galene – LinkedIn's Search Architecture: Presented by Diego Buthay & Sriram Sankar, LinkedIn from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The Cavalry Shows Up in the IoT War Zone / David Rosenthal

Back in May I posted Time For Another IoT Rant. Since then I've added 28 comments about the developments over the last 132 days, or more than one new disaster every 5 days. Those are just the ones I noticed. So it's time for another dispatch from the front lines of the IoT war zone on which I can hang reports of the disasters to come.  Below the fold, I cover yesterday's happenings on two sectors of the front line.

Let's start with the obvious fact that good wars have two sides, the guys with the black hats (Boo!) and the guys with the white hats (Yay!). So far, the white hats have been pretty much missing in action. But now, riding over the hill in the home router sector of the front lines, comes the white-hat cavalry!

Is the opposite of malware benware? If so, Symantec has found "highly virulent" benware called "Ifwatch" infecting "more than 10,000 Linux-based routers, mostly in China and Brazil":
Ifwatch software is a mysterious piece of “malware” that infects routers through Telnet ports, which are often weakly secured with default security credentials that could be open to malicious attack. Instead, Ifwatch takes that opportunity to set up shop, close the door behind it, and then prompts users to change their Telnet passwords, if they are actually going to use the port.

According to Symantec’s research, it also has code dedicated to removing software that has entered the device with less altruistic intentions. Ifwatch finds out and removes “well-known families of malware targeting embedded devices,”
How awesome is it that the titanic struggle between good and evil is taking place inside your home router, so you have a ringside seat?

Meanwhile, in the enterprise router sector, the black hats advanced. Dan Goodin at Ars Technica reports that a backdoor infecting Cisco VPNs steals customers’ network passwords:
Attackers are infecting a widely used virtual private network product sold by Cisco Systems to install backdoors that collect user names and passwords used to log in to corporate networks, security researchers said. ... The attacks appear to be carried out by multiple parties using at least two separate entry points. Once the backdoor is in place, it may operate unnoticed for months as it collects credentials that employees enter as they log in to company networks.
That's the news from the war zone yesterday. Stay tuned for more in the comments.

Archival Description Working Group Members / DPLA

We are pleased to announce the membership of the Archival Description Working Group:

  • Jodi Allison-Bunnell, Orbis Cascade Alliance
  • Mark Custer, Yale University
  • Bradley Daigle, University of Virginia
  • Jackie Dean, University of North Carolina at Chapel Hill
  • Max Eckard, University of Michigan
  • Ben Goldman, The Pennsylvania State University
  • Kris Keisling, University of Minnesota
  • Leigh Grinstead, LYRASIS
  • Adrian Turner, California Digital Library

In addition, we have appointed an Advisory Board that will help the workgroup by reviewing drafts before public release and providing feedback on workplans and tools. Advisory Board members include:

  • Shawn Averkamp, New York Public Library
  • Erin Hawkins, World Digital Library, Library of Congress
  • Sheila McAlister, Digital Library of Georgia
  • Sandra McIntyre, Mountain West Digital Library
  • Anne Van Camp, Smithsonian Institution

We were excited that so many people volunteered to help us with the group and regret that we can’t include everyone. We will share the group’s progress through social media, and those who filled out the volunteer form will be asked to help review and comment on the draft of the white paper and any other deliverables once the working group and advisory board have developed a first draft.

DPOE Plants Seed for Statewide Digital Preservation Effort in California / Library of Congress: The Signal

The following is a guest post by Barrie Howard, IT project manager at the Library of Congress.

The Digital Preservation Outreach and Education (DPOE) program is pleased to announce the successful completion of another train-the-trainer workshop in 2015. The most recent workshop took place in Sacramento, California, from September 22–25. This domestic training event follows closely behind two workshops held in Australia in late spring.

The Library of Congress partnered with the State Library of California to host the three-and-a-half day workshop to increase the knowledge and skills of working professionals, who are charged with providing long-term access to digital content. Planning and events management support were provided by the California Preservation Program (CPP), which provides consultation, information resources, and preservation services to archives, historical societies, libraries, and museums across the state.

Trainers and trainees at the DPOE workshop in Sacramento. Photo by Darla Gunning.

This cohort of trainers was highly energized at the completion of the workshop, and left the event buzzing with plans to band together to establish a statewide effort to guarantee long-term, enduring access to California’s collective cultural heritage captured in digital formats. The workshop’s train-the-trainer model inspired the participants to think about how they could work across jurisdictional and organizational boundaries to meet the needs of all California cultural heritage institutions, especially small organizations with very few staff.

CPP Steering Committee Chair Barclay Ogden set the stage by stating, “I’m looking forward to the DPOE workshop to position a cohort of California librarians, archivists, and history museum curators to educate and advocate for statewide digital preservation services. California’s smaller memory institutions need help with digital preservation.”

Left to right: Jacob Nadal, DPOE Anchor Instructor; George Coulbourne, DPOE Program Director; Stacey Wiens, DPOE Topical Trainer. Photo by Darla Gunning

DPOE Program Director George Coulbourne led the workshop. Veteran anchor instructors Mary Molinaro (University of Kentucky Libraries) and Jacob Nadal (The Research Collections and Preservation Consortium) and I joined the instructor team for the first time. We provided presentations throughout the week and facilitated hands-on activities.

The enthusiasm and vision captured at the workshop are a legacy, rather than merely an outcome, that participants carry with them as they join a vibrant network of practitioners in the broader digital preservation community. DPOE continues to nurture the network by providing an email distribution list so practitioners can share information about digital preservation best practices, services, and tools, and to surface stories about their experiences in advancing digital preservation. DPOE also maintains a training calendar as a public service to help working professionals discover professional development opportunities in the practice of digital preservation. The calendar is updated regularly, and includes training events hosted by DPOE trainers, as well as others.

Know When To Hold ’em … Know When To Run – Time Is Running Out To Stump The Chump / SearchHub

Are you a Gambler? Even if you aren’t, what are you waiting for?

There’s no ante and no buy-in needed to “go all in” for a nice pot of prize money in this year’s Stump The Chump contest at Lucene/Solr Revolution 2015 in Austin, Texas. But time is running out! There are only a few days left for you to submit your most challenging questions.

Even if you can’t make it to Austin to attend the conference, you can still participate. Check out the session information page for details on how to submit your questions.

To keep up with all the “Chump” related info, you can subscribe to this blog (or just the “Chump” tag).

White Dudes Giving Speeches / Ed Summers

Thank you for inviting me here today to be with you all here at MARAC. I’ll admit that I’m more than a bit nervous to be up here. I normally apologize for being a software developer right about now. But I’m not going to do that today…although I guess I just did. I’m not paying software developers any compliments by using them as a scapegoat for my public presentation skills. And the truth is that I’ve seen plenty of software developers give moving and inspiring talks.

So the reason why I’m a bit more nervous today than usual is because you are archivists. I don’t need to #askanarchivist to know that you think differently about things, in subtle and profound ways. To paraphrase Orwell: You work in the present to shape what we know about the past, in order to help create our future. You are a bunch of time travelers. How do you decide what to hold on to, and what to let go of? How do you care for this material? How do you let people know about it? You do this for organizations, communities and collectively for entire cultures. I want to pause a moment to thank you for your work. You should applaud yourselves. Seriously, you deserve it. Thunderous applause.

My Twitter profile used to say I was a “hacker for libraries”. I changed it a few years ago to “pragmatist, archivist and humanist”. But the reality is that these are aspirational… these are the things I want to be. I have major imposter syndrome about claiming to be an archivist… and that’s why I’m nervous.

Can you believe that I went through a Masters in Library & Information Science program without learning a lick about archival theory? Maybe I just picked the wrong classes, but this was truly a missed opportunity, both for me and the school. After graduating I spent some time working with metadata in libraries, then as a software developer at a startup, then at a publisher, and then in government. It was in this last role helping bits move around at the Library of Congress (yes, some of the bits did still move, kinda, sorta) that I realized how much I had missed about the theory of archives in school.

I found that the literature of archives and archival theory spoke directly to what I was doing as a software developer in the area of digital preservation. With guidance from friends and colleagues I read about archives from people like Hugh Taylor, Helen Samuels, “the Terrys” (Cook and Eastwood), Verne Harris, Ernst Posner, Heather MacNeil, Sue McKemmish, Randall Jimerson, Tom Nesmith and more. I started following some of you on Twitter to read the tea leaves of the profession. I became a member of SAA. I put some of the ideas into practice in my work. I felt like I was getting on a well traveled but not widely known road. I guess it was more like a path among many paths. It definitely wasn’t an information superhighway. I have a lot more still to learn.

So why am I up here talking to you? This isn’t about me right? It’s about we. So I would like to talk about this thing we do, namely create things like this:

Don’t worry I’m not really going to be talking about the creation of finding aids. I think that they are something we all roughly understand. We use them to manage physical and intellectual access to our collections right? Finding aids are also used by researchers to discover what collections we have. Hey, it happens. Instead what I would like to focus on in this talk is the nature of this particular collection. What are the records being described here?

Yes, they are tweets that the Cuban Heritage Collection at the University of Miami collected after the announcement by President Obama on December 17, 2014 that the United States was going to begin normalizing relations with Cuba. You can see information about what format the data is in, when the data was collected, how it was collected, how much data there is, and the rights associated with the data.

Why would you want to do this? What can 25 million tweets tell us about the public reaction to Obama’s announcement? What will they tell us in 10, 25 or 50 years? Natalie Baur (the archivist who organized this collection) is thinking they could tell us a lot, and I think she is right. What I like about Natalie’s work is that she has managed to fold this data collection work in with the traditional work of the archive. I know there were some technical hoops to jump through regarding data collection, but the social engineering required to get people working together as a team so that data collection leads to processing and then to product in a timely manner is what I thought was truly impressive. Natalie got in touch with Bergis Jules and me to help with some of the technical pieces since she knew that we had done some similar work in this area before. I thought I would tell you about how that work came to be. But if you take nothing else from my talk today take this example of Natalie’s work.

About a year ago I was at SAA in Washington, DC on a panel that Hillel Arnold set up to talk about Agency, Ethics and Information. Here’s a quote from the panel description:

From the Internet activism of Aaron Swartz to Wikileaks’ release of confidential U.S. diplomatic cables, numerous events in recent years have challenged the scope and implications of privacy, confidentiality, and access for archives and archivists. With them comes the opportunity, and perhaps the imperative, to interrogate the role of power, ethics, and regulation in information systems. How are we to engage with these questions as archivists and citizens, and what are their implications for user access?

My contribution to the panel was to talk about the Facebook Emotional Contagion study, and to try to get people thinking about Facebook as an archive. In the question and answer period someone (I wish I could remember his name) asked what archivists were doing to collect what was happening in social media and on the Web regarding the protests in Ferguson. The panel was on August 14th, just 5 days after Mike Brown was killed by police officer Darren Wilson in Ferguson, Missouri. It was starting to turn into a national story, but only after a large amount of protest, discussion and on-the-ground coverage, much of it happening in Twitter. Someone helpfully pointed out that just a few hours earlier Archive-It (the Internet Archive’s subscription service) had announced that it was seeking nominations of web pages to archive related to the events in Ferguson. We can see today that close to 981 pages were collected. 236 of those were submitted using the form that the Internet Archive made available.

But what about the conversation that was happening in Twitter? That’s what you’ve been watching a little bit of for the past few minutes up on the screen here. Right after the panel a group of people made their way to the hotel bar to continue the discussion. At some point I remember talking to Bergis Jules who impressed on me the importance of trying to do what we could to collect the torrent of conversation about Ferguson going on in Twitter. I had done some work collecting data from the Twitter API before and offered to lend a hand. Little did I know what would happen.

When we stopped this initial round of data collection we had collected 13,480,000 tweets that mentioned the word “ferguson” between August 10, 2014 and August 27, 2014.

You can see from this graph of tweets per day that there were definite cycles in the Twitter traffic. In fact the volume was so high at times, and we had started data collection 6 days late, that you can see there are periods where we weren’t able to get the tweets. You might be wondering what this data collection looks like. Before looking closer at the data let me try to demystify it a little bit for you.

Here is a page from the online documentation for Twitter’s API. If you haven’t heard the term API before it stands for Application Programming Interface, and that’s just a fancy name for a website that delivers up data (such as XML or JSON) instead of human readable web pages. If you have a Twitter app on your phone it most likely uses Twitter’s API to access the tweets of people you follow. Twitter isn’t the only place making APIs available: they are everywhere on the Web: Facebook, Google, YouTube, Wikipedia, OCLC, even the Library of Congress has APIs. In some ways if you make your EAD XML available on the Web it is a kind of API. I really hope I didn’t just mansplain what APIs are, that’s not what I was trying to do.

A single API can support multiple “calls” or questions that you can ask. Among its many calls, Twitter’s API has one that allows you to do a search and get back 100 tweets that match your query, plus a token to go and get the next 100. They let you ask this question 180 times every 15 minutes. If you do the math you can see that you can fetch 72,000 tweets an hour, or 1.7 million tweets per day. Unfortunately the API only lets you search the last 9 days of tweets, after which you can pay Twitter for data.
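
(To make that arithmetic concrete, here is a quick Python sketch of the back-of-the-envelope math. The 180 calls per 15-minute window and the 100 tweets per call are Twitter's documented limits for the version 1.1 search API; the rest is just multiplication.)

    # Back-of-the-envelope limits for Twitter's v1.1 search API.
    CALLS_PER_WINDOW = 180   # search calls allowed per rate-limit window
    WINDOW_MINUTES = 15      # length of the rate-limit window
    TWEETS_PER_CALL = 100    # maximum tweets returned by a single search call

    tweets_per_hour = CALLS_PER_WINDOW * (60 // WINDOW_MINUTES) * TWEETS_PER_CALL
    tweets_per_day = tweets_per_hour * 24

    print(tweets_per_hour)   # 72000
    print(tweets_per_day)    # 1728000, i.e. roughly 1.7 million tweets per day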

So what Bergis and I did was use a small Python program I had written previously called twarc to try to collect as many of the tweets as we could that had the word “ferguson” in them. twarc is just one tool for collecting data from the Twitter API.
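
(If you are curious what that collection looks like in code, here is a minimal sketch using twarc's Python interface, the Twarc class and its search() generator as found in twarc 1.x; the exact invocation has changed across versions, the credentials are placeholders you would get from Twitter's developer site, and ferguson.jsonl is just an example file name.)

    import json
    from twarc import Twarc  # pip install twarc

    # Placeholder credentials -- substitute your own Twitter API keys.
    t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

    # Page through the search API, keeping the original JSON,
    # one tweet per line, much as twarc does on the command line.
    with open("ferguson.jsonl", "w") as out:
        for tweet in t.search("ferguson"):
            out.write(json.dumps(tweet) + "\n")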

Another tool you can use from the comfort of your Web browser (no command line fu required) is the popular Twitter Archiving Google Sheet (TAGS). TAGS lets you collect data from the search API which it puts directly into a spreadsheet for analysis. This is super handy if you don’t want to parse the original JSON data returned by the Twitter API. TAGS is great for individual use.

And another option is the Social Feed Manager (SFM) project. SFM is a project started by George Washington University with support from IMLS and the National Historical Publications and Records Commission. I think SFM is doubly important to bring up today since the theme for the conference is Ingenuity and Innovation in Archives. NHPRC’s support for the SFM project has been instrumental in getting it to where it is today. SFM is an open source Web application that you download and set up at your institution and which users then log into using their Twitter credentials to set up data collection jobs. GW named it Social Feed Manager because they are in the process of adding other content sources such as Flickr and Tumblr. They are hoping that extending it in this way will provide an architecture that will allow interested people to add other social media sites, and contribute them back to the project. The other nice thing that both SFM and twarc do (but that TAGS does not) is collect the original JSON data from the Twitter API. In a world where original order matters I think this is an important factor to keep in mind.

JSON is an acronym for JavaScript Object Notation. There are lots of other formats for sending data around on the Web, but JSON has emerged as the de facto standard for APIs. This has largely been the result of its versatility and the fact that support for it is cooked into every Web browser that can run JavaScript.

So what’s in the JSON data for a tweet? Twitter is famous for its 140 character message limit. But the text of a tweet only accounts for about 2% of the JSON data that is made available by the Twitter API. Some people might call this metadata, but I’m just going to call it data for now, since this is the original data that Twitter collected and pushed out to any clients that are listening for it.

Also included in the JSON data are things like: the time that the tweet was sent, any hashtags present, geo coordinates for the user (if they have geo-location turned on in their preferences), urls mentioned, places recognized, embedded media such as images or videos, retweet information, reply to information, lots of information about the user sending the message, handles for other users mentioned, and the current follower count of the sender. And of course you can use the tweet ID to go back to the Twitter API to get all sorts of information such as who has retweeted or liked a tweet.

Here’s what the JSON looks like for a tweet. I’m not going to go into this in detail, but I thought I would at least show it to you. I suspect catalogers or EAD finding aid creators out there might not find this too scary to look at. JSON is much more expressive than the rows and columns of a spreadsheet because you can have lists of things, key/value pairs and hierarchical structures that don’t fit comfortably into a spreadsheet.
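
(For the programmers in the room, here is a small Python sketch of how a few of those fields are reached in practice; the field names are the standard ones in Twitter's version 1.1 tweet objects, and ferguson.jsonl is the hypothetical line-oriented file from the collection sketch above.)

    import json

    # Read the first tweet from a file with one JSON tweet per line.
    with open("ferguson.jsonl") as f:
        tweet = json.loads(f.readline())

    print(tweet["created_at"])                     # when the tweet was sent
    print(tweet["user"]["screen_name"])            # who sent it
    print(tweet["user"]["followers_count"])        # their follower count at the time
    print([h["text"] for h in tweet["entities"]["hashtags"]])      # hashtags
    print([u["expanded_url"] for u in tweet["entities"]["urls"]])  # urls mentioned
    print(tweet.get("coordinates"))                # geo coordinates, if enabled
    print(tweet.get("in_reply_to_status_id_str"))  # reply-to information, if any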

Ok, I can imagine some eyes glazing over at these mundane details so let’s get back to the Ferguson tweets. Do you remember that form that Archive-It put together to let people submit URLs to archive? You may remember that 236 URLs were submitted. Bergis and I were curious so we extracted all the URLs mentioned in the 13 million tweets, unshortened them, and then ranked them by the number of times they were mentioned. You can see a list of the top 100 shared links in that time period. Notice that at the time we also checked to see whether the Internet Archive had archived each page.
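
(Reduced to its essentials, that pipeline looks something like the sketch below: tally the expanded URLs in the collected JSON, then follow redirects to unshorten the most common ones. This is an illustration rather than the exact code we used; the real job also had to unshorten millions of t.co links and check each page against the Internet Archive.)

    import json
    from collections import Counter

    import requests  # pip install requests

    # Count every expanded URL mentioned in the collected tweets.
    counts = Counter()
    with open("ferguson.jsonl") as f:
        for line in f:
            tweet = json.loads(line)
            for u in tweet["entities"]["urls"]:
                counts[u["expanded_url"]] += 1

    # Follow redirects to unshorten the 100 most mentioned links.
    for url, n in counts.most_common(100):
        try:
            final = requests.head(url, allow_redirects=True, timeout=10).url
        except requests.RequestException:
            final = url
        print(n, final)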

We then took a look just within the first day of tweets that we had, to see if they looked any different. Look at #2 there, Racial Profiling Data/2013. It’s a government document from the Missouri Attorney General’s Office with statistics from the Ferguson Police Department. Let’s take a moment to digest those stats along with the 1,538 Twitter users who did that day.

Now what’s interesting is that the URL that was tweeted so many times then is already broken. And look, Internet Archive has it, but it was collected for the first time on August 12, 2014. Just as the conversation was erupting on Twitter. Perhaps this URL was submitted by an archivist to the form ArchiveIt put together. Or perhaps someone recognized the importance of archiving it and submitted it directly to the Internet Archive using their Save Now form.

The thing I didn’t mention earlier is that we found 417,972 unique, unshortened URLs. Among them were 21,457 YouTube videos. Here’s the fourth most shared YouTube video, that ended up being seen over half a million times.

As Bergis said in July of this year as he prepared for a class about archiving social media at Archival Education and Research Initiative (AERI):

Bergis was thinking specifically about events like the Watts Riots in Los Angeles where 34 people were killed and 3,438 arrested.

Of course the story does not end there. As I mentioned I work at the Maryland Institute for Technology in the Humanities at the University of Maryland. We aren’t an archive or a library, we’re a digital humanities lab that is closely affiliated with the University library. Neil Fraistat, the director of MITH, immediately recognized the value of doing this work. He not only supported me in spending time on it with Bergis, but also talked about the work with his colleagues at the University.

When there was a Town Hall meeting on December 3, 2014 we were invited to speak along with other faculty, students and the University Police Commissioner. The slides you saw earlier of popular tweets during that period were originally created for the Town Hall. I spoke very briefly about the data we collected and invited students and faculty who were interested in working with the data to please get in touch. The meeting was attended by hundreds of students, and ended up lasting some 4 hours, with most of the time being taken up by students sharing stories from their experience on campus of harassment by police, concern about military grade weapons being deployed in the community, and insight into the forms of institutionalized racism that we all still live with today. It was an incredibly moving experience, and our images from the “archive” were playing the whole time as a backdrop.

After the Town Hall meeting Neil and a group of faculty on campus organized the BlackLivesMatter at UMD group, and a set of teach-ins at UMD where the regularly scheduled syllabus was set aside to discuss the events in Ferguson and Black Lives Matter more generally. Porter Olsen (who taught the BitCurator workshop yesterday) helped organize a series of sessions we call digital humanities incubators to build a community of practice around digital methods in the humanities. These sessions focused on tools for data collection, data analysis, and rights and ethical issues. We had Laura Wrubel visit from George Washington University to talk about Social Feed Manager. Trevor Munoz, Katie Shilton and Ricky Punzalan spoke about the rights issues associated with working with social media data. Josh Westgaard from the library spoke about working with JSON data. And Cody Buntain, Nick Diakopoulos and Ben Shneiderman helped us use tools like Python Notebooks and NodeXL for data analysis.

And of course, we didn’t know it at the time, but Ferguson was just the beginning. Or rather it was the beginning of a growing awareness of police injustice towards African Americans and people of color in the United States that began to be known as the BlackLivesMatter movement. BlackLivesMatter was actually started by Alicia Garza, Patrisse Cullors, and Opal Tometi after the acquittal of George Zimmerman in the Florida shooting death of Trayvon Martin two years earlier. But the protests on the ground in Ferguson, elsewhere in the US, and in social media brought international attention to the issue. Names like Aiyana Jones, Rekia Boyd, Jordan Davis, Renisha McBride, Dontre Hamilton, Eric Garner, John Crawford, led up to Michael Brown, and were followed by Tamir Rice, Antonio Martin, Walter Scott, Freddie Gray, Sandra Bland and Samuel Dubose.

Bergis and I did our best to collect what we could from these sad, terrifying and enraging events. The protests in Baltimore were of particular interest to us at the University of Maryland since it was right in our backyard. Our data collection efforts got the attention of Rashawn Ray who is a professor in sociology at the University of Maryland. He and his PhD student Melissa Brown were interested in studying how the discussion of Ferguson changed in four datasets we had collected: the initial killing of Michael Brown, the non-indictment of Darren Wilson, the Justice Department Report and then the one year anniversary. They have been exploring what the hashtags, images and text tell us about the shaping of narratives, sub-narratives and counter-narratives around the Black experience in the United States.

And we haven’t even accessioned any of the data. It’s sitting in MITH’s Amazon cloud storage. This really isn’t anybody’s fault but my own. I haven’t made it a priority to figure out how to get it into the University’s Fedora repository. In theory it should be doable. This is why I’m such a fan of Natalie’s work at the University of Miami that I mentioned at the beginning. Not only did she get the data into the archive, but she described it with a finding aid that is now on the Web, waiting to be discovered by a researcher like Rashawn.

So what value do you think social media has as a tool for guiding appraisal in Web archives? Would it be useful if you could easily participate in the conversation going on in your community and collect the Web documents that are important to them? Let me read you the first paragraph of a grant proposal Bergis wrote recently:

The dramatic rise in the public’s use of social media services to document events of historical significance presents archivists and others who build primary source research collections with a unique opportunity to transform appraisal, collecting, preservation and discovery of this new type of research data. The personal nature of documenting participation in historical events also presents researchers with new opportunities to engage with the data generated by individual users of services such as Twitter, which has emerged as one of the most important tools used in social activism to build support, share information and remain engaged. Twitter users document activities or events through the text, images, videos and audio embedded in or linked from their tweets. This means vast amounts of digital content is being shared and re-shared using Twitter as a platform for other social media applications like YouTube, Instagram, Flickr and the Web at large. While such digital content adds a new layer of documentary evidence that is immensely valuable to those interested in researching and understanding contemporary events, it also presents significant data management, rights management, access and visualization challenges.

As with all good ideas, we’re not alone in seeing the usefulness of social media in archival work. Ed Fox and his team just down the road at Virginia Tech have been working solidly on this problem for a few years and recently received an NSF grant to further develop their Integrated Digital Event Archiving and Library (IDEAL). Here’s a paragraph from their grant proposal:

The Integrated Digital Event Archive and Library (IDEAL) system addresses the need for combining the best of digital library and archive technologies in support of stakeholders who are remembering and/or studying important events. It extends the work at Virginia Tech on the Crisis, Tragedy, and Recovery network to handle government and community events, in addition to a range of significant natural or manmade disasters. It addresses needs of those interested in emergency preparedness/response, digital government, and the social sciences. It proves the effectiveness of the 5S (Societies, Scenarios, Spaces, Structures, Streams) approach to intelligent information systems by crawling and archiving events of broad interest. It leverages and extends the capabilities of the Internet Archive to develop spontaneous event collections that can be permanently archived as well as searched and accessed, and of the LucidWorks Big Data software that supports scalable indexing, analyzing, and accessing of very large collections.

Maybe you should have another Ed up here speaking! Or an archivist like Bergis. Here’s another project called iCrawl from the University of Hannover that is doing something very similar to the Virginia Tech work.

Seriously though, this has been fun. Before I leave you here are a few places you could go to get involved in and learn about this work.

  1. If you are an SAA member please join the conversation at the SAA Web Archiving discussion list. One of the cool things that happened on this discussion list last year was drafting a letter to Facebook that was sent by President Kathleen Roe.
  2. If you’re not an SAA member there’s a new discussion list called Web Archives. It’s just getting started, so it’s a perfect time to join.
  3. Bergis, Christie Peterson, Bert Lyons, Allison Jai O’Dell, Ryan Baumann and I have been writing short pieces about this kind of work on Medium in the On Archivy publication. If you have ideas, thought experiments, actual work, or commentary please write it on Medium and send us a request to include it.

And as Hillel Arnold pointed out recently:

Let’s get to work.

Seminar Week 6 / Ed Summers

This week we dove into some readings about information retrieval. The literature on the topic is pretty vast, so luckily we had Doug Oard on hand to walk us through it. The readings on deck were Liu (2009), Chapelle, Joachims, Radlinski, & Yue (2012) and Sanderson & Croft (2012). The first two of these had some pretty technical, mathematical components that were kind of intimidating and over my head. But the basic gist of both of them was understandable, especially after the context that Oard provided.

Oard’s presentation in class was Socratic: he posed questions for us to answer, which he helped answer as well, which led on to other questions. We started with what information, retrieval and research are. I must admit to being a bit frustrated about returning to this definitional game of information. It feels so slippery, but we basically agreed that it was a social construct and moved on to lower level questions such as: what is data, what is a database, what are the feature sets of information retrieval. The feature sets discussion was interesting because we basically have worked with three different feature sets: descriptions of things (think of catalog records), content (e.g. contents of books) and user behavior.

We then embarked on a pretty detailed discussion of user behavior, and how the technique of interleaving (mixing the results of two rankers and watching which one earns more clicks) lets computer systems adaptively tweak the many parameters that tune information retrieval algorithms based on user behavior. Companies like Google and Facebook have the user attention to be able to deploy these adaptive techniques to evolve their systems. I thought it was interesting to reflect on how academic researchers are then almost required to work with these large corporations in order to deploy their research ideas. I also thought it was interesting how having a large set of users who expect to use your product in a particular way might become a straitjacket of sorts, and perhaps over time lead to a calcification of ideas and techniques. This wasn’t a fully formed thought, but it seemed that this purely statistical and algorithmic approach to design lacked some creative energy that is fundamentally human – even though their technique had human behavior at its center as well.
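
(For my own notes, here is a rough Python sketch of the interleaving idea itself, in the team-draft flavor discussed in the Chapelle et al. reading: the two rankers' result lists are merged, and clicks on the merged list are credited back to whichever ranker contributed the clicked result. The document names and the click are made up.)

    import random

    def team_draft_interleave(list_a, list_b):
        """Merge two ranked lists, remembering which ranker contributed each result."""
        interleaved, credit = [], {}
        a, b = list(list_a), list(list_b)
        while a or b:
            # A coin flip decides which ranker picks first in this round.
            for name, source in random.sample([("A", a), ("B", b)], 2):
                # Skip results the other ranker has already contributed.
                while source and source[0] in credit:
                    source.pop(0)
                if source:
                    doc = source.pop(0)
                    interleaved.append(doc)
                    credit[doc] = name
        return interleaved, credit

    # Toy usage: credit a user's clicks on the merged list back to ranker A or B.
    results, credit = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d4", "d1"])
    wins = {"A": 0, "B": 0}
    for doc in ["d2"]:          # pretend the user clicked on d2
        wins[credit[doc]] += 1
    print(results, wins)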

I guess it’s nice to think that the voice of the individual matters, and we’re not just dumbing all our designs down to the lowest common denominator between us. I think this class successfully steered me away from the competitive space of information retrieval even though my interest in appraisal and web archives moves in that direction, with respect to focused crawling. Luckily a lot of the information retrieval research in this area has been done already, but what is perhaps lacking are system designs that incorporate the decisions of the curator/archivist more. If not I guess I can fall back on my other research area of the history of standards on the Web.


Chapelle, O., Joachims, T., Radlinski, F., & Yue, Y. (2012). Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS), 30(1), 6.

Liu, T.-Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225–331.

Sanderson, M., & Croft, W. B. (2012). The history of information retrieval research. Proceedings of the IEEE, 100(Special Centennial Issue), 1444–1451.

Seminar Week 5 / Ed Summers

In this week’s class we took a closer look at design methods and prototyping with readings from Druin (1999), Zimmerman, Forlizzi, & Evenson (2007) and a paper that fellow student Joohee picked out, Buchenau & Suri (2000). In addition to a discussion of the readings Brenna McNally from the iSchool visited us to demonstrate the Cooperative Inquiry method that was discussed in the Druin paper.

In a nutshell Cooperative Inquiry is a research methodology that Druin specifically designed to enable adults and children to work collaboratively as equals on design problems. The methodology grew out of work at the University of Maryland Kid Design Team. In the paper Druin specifically discusses two projects at UMD: KidPad and PETS and how cooperative inquiry drew upon the traditions of contextual inquiry and participatory design.

Brenna’s demonstration of the technique was both fun and instructive. It was really interesting to be a participant and then be asked to reflect on it in a meta way afterwards. Basically we were asked to generate some ideas about things we would like to do with digital cameras: being able to take a photo while driving, being able to take a picture quickly (like when a young child smiles), and taking pictures in your dreams (that was my suggestion). Then Brenna brought some prototyping materials: colored paper, string, pipe cleaners, and various sticky things, and asked us to prototype some solutions.

I wish I had taken some pictures. Jonathan and Diane’s digital dream catcher that you wore like a showercap was memorable. Joohee and I created a little device that could sit on top of your car and take pictures on voice command and beam them through the “dream cloud” to a picture frame device in your house. I found it difficult to make the leap into using the materials at hand to prototype, but Brenna helped by giving us examples of the types of prototyping she was looking for. Also, while we worked she was busily writing down different features that she noticed in our designs.

When we were done we each presented our ideas … and applauded each other (of course). Afterwards, Brenna went over some of the design elements she noticed, and highlighted ones she thought were interesting. We discussed some of them and decided on ones that would be worth digging into further. At this point a new cycle of prototyping would begin. I thought this demonstration really clearly showed the iterative nature of observation, ideation and prototyping that make up the method.


Buchenau, M., & Suri, J. F. (2000). Experience prototyping. In Proceedings of the 3rd conference on designing interactive systems: Processes, practices, methods, and techniques (pp. 424–433). Association for Computing Machinery.

Druin, A. (1999). Cooperative inquiry: Developing new technologies for children with children. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 592–599). Association for Computing Machinery.

Zimmerman, J., Forlizzi, J., & Evenson, S. (2007). Research through design as a method for interaction design research in HCI. In Proceedings of the SIGCHI conference on human factors in computing systems (pp. 493–502). Association for Computing Machinery.

Digital Privacy Toolkit for Librarians, a LITA webinar / LITA

Attend this important new LITA webinar:

Digital Privacy Toolkit for Librarians

Tuesday October 20, 2015
1:30 pm – 3:00 pm Central Time
Register Online, page arranged by session date (login required)

This 90 minute webinar will include a discussion and demonstration of practical tools for online privacy that can be implemented in library PC environments or taught to patrons in classes/one-on-one tech sessions, including browsers for privacy and anonymity, tools for secure deletion of cookies, cache, and internet history, tools to prevent online tracking, and encryption for online communications.

Attendees will:

Alison’s work for the Library Freedom Project and classes for patrons including tips on teaching patron privacy classes can be found at:

Alison Macrina

Alison Macrina is a librarian, privacy rights activist, and the founder and director of the Library Freedom Project, an initiative which aims to make real the promise of intellectual freedom in libraries by teaching librarians and their local communities about surveillance threats, privacy rights and law, and privacy-protecting technology tools to help safeguard digital freedoms. Alison is passionate about connecting surveillance issues to larger global struggles for justice, demystifying privacy and security technologies for ordinary users, and resisting an internet controlled by a handful of intelligence agencies and giant multinational corporations. When she’s not doing any of that, she’s reading.

Register for the Webinar

Full details
Can’t make the date but still want to join in? Registered participants will have access to the recorded webinar.


  • LITA Member: $45
  • Non-Member: $105
  • Group: $196

Registration Information:

Register Online, page arranged by session date (login required)
Mail or fax form to ALA Registration
call 1-800-545-2433 and press 5

Questions or Comments?

For all other questions or comments related to the course, contact LITA at (312) 280-4268 or Mark Beatty,

Implementing Apache Solr at Target / SearchHub

As we count down to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Target engineer Raja Ramachandran’s session on implementing Solr at one of the world’s largest retail companies.

Sending Solr into action on a high volume, high profile website within a large corporation presents several challenges — and not all of them are technical. This will be an open discussion and overview of the journey at Target to date. We’ll cover some of the wins, losses and ties that we’ve had while implementing Solr at Target as a replacement for a legacy enterprise search platform. In some cases the solutions were basic, while others required a little more creativity. We’ll cover both to paint the whole picture.

Raja Ramachandran is an experienced Solr architect with a passion for improving relevancy and acquiring data signals to improve search’s contextual understanding of its user.

Journey of Implementing Solr at Target: Presented by Raja Ramachandran, Target from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

Life-term for nation’s librarian running out? / District Dispatch

The hourglass is running out as Congress considers a term-limit for the Librarian of Congress.

Hard on the heels of the recent surprise announcement that the current Librarian of Congress, Dr. James Billington, would accelerate his retirement from year’s end (as announced in June) to September 30, the Senate last night approved legislation to limit the service of all future Librarians. Co-authored by all five members of the Senate’s Joint Committee on the Library, and passed without debate by unanimous consent on the day of its introduction, the “Librarian of Congress Succession Modernization Act of 2015” (S.2162) would establish a ten-year term for the post, renewable by the President upon Senate reconfirmation. Since the position was established in 1800, it has been held by just 13 Librarians of Congress appointed to life terms. Comparable House legislation is expected, but the timing of its introduction and consideration is uncertain.

The Senate’s action last night comes as the President is preparing to nominate Dr. Billington’s successor against the backdrop of two scathing reports by the Government Accountability Office detailing serious and pervasive inefficiencies and deficiencies of both the Library of Congress‘ and the U.S. Copyright Office’s information systems and (particularly in the case of the Library itself) management. Deputy Librarian David Mao is currently serving as Acting Librarian.

While no timetable for the President’s nomination of Dr. Billington’s successor has been announced, action by the White House (if not Senate confirmation) is expected before the end of this calendar year. In a letter to him last June, ALA President Courtney Young strongly urged President Obama to appoint a professional librarian to the post, a position since echoed by 30 other state-based library organizations.  This summer, in an OpEd published in Roll Call, ALA Office for Information Technology Policy Director Alan Inouye also emphasized the need for the next Librarian of Congress to possess a skill set tailored to the challenge of leading the institution into the 21st Century.

The post Life-term for nation’s librarian running out? appeared first on District Dispatch.

E-rate, broadband @ ARSL in Little Rock / District Dispatch

Official Tshirt of the 2015 ARSL Conference at Little Rock, Arkansas

Waiting for my connecting flight on the way back to D.C. from the 2015 conference of the Association for Rural & Small Libraries (ARSL) in Little Rock, I had plenty of time to reflect on the whirlwind of experiences that were packed into my day and a half at the conference. While the impetus for attending was the E-rate modernization proceeding we focused on for much of the last two years of our telecom work, an equally important outcome was to be immersed in the culture of librarians dedicated to their rural communities. Learning from the librarians at the conference will be critically important as our office investigates potential rural-focused advocacy work. And I would be remiss if I didn’t mention how much fun we had along the way.

I started the conference providing context to our policy presentation during which my colleague, Alan Inouye, gave an overview of the challenging and often murky work we do on behalf of libraries with decision makers at the national level. It may be counterintuitive to those of you who know E-rate to think of it as providing enlightenment on anything. However, as a case study for how a small association does policy—as compared to advocacy organizations that have separate budget lines for paying people to wait in lines for congressional hearings (go on, Google it)—E-rate makes a pretty good story. I am not known for brevity and our E-rate work lends itself to many intertwined and complex twists and turns between the countless phone calls, in-person meetings, and official filings with the Federal Communications Commission (FCC); collaborating (or not) with our coalitions; standing firm on behalf of libraries amidst the strong school focus; swaying the press; coordinating our library partners; and keeping ALA members informed. It was challenging to pick it all apart in my allotted 12 minutes (or in an acceptable blog post length). Interest piqued? Read more here.

Day 2 was E-rate and broadband day

I was privileged to attend sessions by my ALA E-rate Task Force colleagues. Amber Gregory, E-rate coordinator for Arkansas at the state library, presented a comprehensive yet digestible version of E-rate in “E-rate: Get your Share” and Emily Almond, IT Director for Georgia Public Library Service, put the “why bother with E-rate?” question into a bigger perspective with “Broadband 101.” My takeaways from these sessions were:

  1. E-rate is an important tool to make sure your library has the internet connection that allows your patrons to do what they need to do online.
  2. It’s time to think beyond the basics and E-rate can help you plan for your library’s future broadband needs.
  3. It’s ok to ask questions even if you don’t know exactly how to ask: We’re librarians and we love information and we love to share!
  4. There are people who can help.

Questions from the participants yielded more discussion about the challenges often felt by rural libraries that lag far behind the broadband speeds we know are necessary for many library services. The discussion also gave me more ideas for where we might focus our E-rate and broadband advocacy efforts in the near term, which I will take back to the E-rate Task Force and colleagues in D.C.

Making personal connections

The last event for me in Little Rock was perhaps the highlight. As discussed during our policy presentation, impact stories of libraries working in their local communities are essential to the work we do with national decision makers. We need to be able to show how libraries support national priorities in education, employment and economic development, and healthcare to name a few. Examples provide the color to the message we try to convey. Alan and I spent the evening listening to (and grilling?) a table full of librarians who shared with us the challenges and strengths of their rural libraries. They also touched on aspirations for additional services they might provide their communities. This was all to our benefit and we came away with many notes and are thankful for time well spent. Spending time with these librarians and at the conference is a good reminder of how important it is to get out of D.C. regularly to gather input and anecdotes that make our work that much richer and more impactful.

Tipsy Librarian–a special concoction popular at the ARSL conference.

Walking through the hotel lobby after dinner, I was reminded that while the topics we talked about during the session are critically important for rural communities and the long-term impact of libraries that serve them, it’s also important to connect with colleagues at conferences like ARSL’s. This was reinforced to me during conversations with librarians who are often dealing with few resources and on their own without significant support. The ARSL tradition of “dine-arounds” and, I believe, a new cocktail tradition created by the hotel are fun ways to create bonds that last beyond the conference. Another tidbit I tucked away for later use.

The post E-rate, broadband @ ARSL in Little Rock appeared first on District Dispatch.

Two In Two Days / David Rosenthal

Tuesday, Cory Doctorow pointed to "another of [Maciej Cegłowski's] barn-burning speeches". It is entitled What Happens Next Will Amaze You and it is a must-read exploration of the ecosystem of the Web and its business model of pervasive surveillance. I commented on my post from last May, Preserving the Ads?, pointing to it, because Cegłowski goes into much more of the awfulness of the Web ecosystem than I did.

Yesterday Doctorow pointed to another of Maciej Cegłowski's barn-burning speeches. This one is entitled Haunted by Data, and it is just as much of a must-read. Doctorow is obviously a fan of Cegłowski's and now so am I. It is hard to write talks this good, and even harder to ensure that they are relevant to stuff I was posting in May. This one takes the argument of The Panopticon Is Good For You, also from May, and makes it more general and much clearer. Below the fold, details.

I argued that the big data enthusiasts in the health industry were failing to, and probably had never even considered, ensuring that they had informed consent from their patients subjects victims as to the negative consequences of the inevitable leak of the big data that was being collected about them. Doctorow writes:
Maciej Cegłowski spoke ... about the toxicity of data -- the fact that data collected is likely to leak, and that data-leaks resemble nuclear leaks in that even the "dilute" data (metadata or lightly contaminated boiler suits and tools) are still deadly when enough of them leak out (I've been using this metaphor since 2008).
Cegłowski writes:
In particular, I'd like to draw a parallel between what we're doing and nuclear energy, another technology whose beneficial uses we could never quite untangle from the harmful ones. A singular problem of nuclear power is that it generated deadly waste whose lifespan was far longer than the institutions we could build to guard it. Nuclear waste remains dangerous for many thousands of years. The data we're collecting about people has this same odd property. Tech companies come and go, not to mention the fact that we share and sell personal data promiscuously. But information about people retains its power as long as those people are alive, and sometimes as long as their children are alive. No one knows what will become of sites like Twitter in five years or ten. But the data those sites own will retain the power to hurt for decades.
In a world where everything is tracked and kept forever, like the world we're for some reason building, you become hostage to the worst thing you've ever done. Whoever controls that data has power over you, whether or not they exercise it. And yet we treat this data with the utmost carelessness, as if it held no power at all.
It isn't just the data that lasts forever; the organization becomes addicted to the flow of data:
You can't just set up an elaborate surveillance infrastructure and then decide to ignore it. These data pipelines take on an institutional life of their own, and it doesn't help that people speak of the "data driven organization" with the same religious fervor as a "Christ-centered life". The data mindset is good for some questions, but completely inadequate for others. But try arguing that with someone who insists on seeing the numbers.
In the same way an addict always wants more drug, the organization always wants more data. Here is Cegłowski: 
The promise is that enough data will give you insight. ... There's a little bit of a con going on here. On the data side, they tell you to collect all the data you can, because they have magic algorithms to help you make sense of it. On the algorithms side, where I live, they tell us not to worry too much about our models, because they have magical data. ... The data collectors put their faith in the algorithms, and the programmers put their faith in the data.
And here is Doctorow:
Big Data's advocates believe that all this can be solved with more Big Data. This requires them to deny the privacy harms from collecting (and, inevitably, leaking) our personal information, and to assert without evidence that they can massage the data so that it can't be associated with the humans from whom it was extracted.
And, like the addict, the organization's effectiveness decays as the drug takes over:
The pharmaceutical industry has something called Eroom's Law (which is ‘Moore’s Law’ spelled backwards). It's the observation that the number of drugs discovered per billion dollars in research has dropped by half every nine years since 1950. ... This is astonishing, because the entire science of biochemistry has developed since 1950. Every step of the drug discovery pipeline has become more efficient, some by orders of magnitude, and yet overall the process is eighty times less cost-effective. This has been a bitter pill to swallow for the pharmacological industry. They bought in to the idea of big data very early on.
I hope this is enough to get you to read Cegłowski's talk; it's well worth your time. While you're there, read this one too.

DPLA Receives $250,000 from Anonymous Donor to Expand Technical Capabilities / DPLA

The Digital Public Library of America is thrilled to announce that an anonymous donor has committed to provide substantial support towards DPLA’s mission in the form of a $250,000 grant to strengthen DPLA’s technical capabilities. This grant will allow DPLA to expand its technology team to handle additional content ingestion and to implement important new features based around its platform and website.

Today’s grant represents the third investment in DPLA’s mission by this anonymous donor. In 2013 they contributed to the rapid scaling-up of DPLA’s Hub network, and in 2015 they provided support for DPLAfest 2015 in Indianapolis.

“It’s wonderful to have this incredible, ongoing support from someone who concurs with the Digital Public Library of America about the importance of democratizing access to our shared cultural heritage,” said Dan Cohen, DPLA’s Executive Director. “Increasing our technical capacity in this way will advance that mission immediately and substantially.”

The Digital Public Library of America strives to contain the full breadth of human expression, from the written word, to works of art and culture, to records of America’s heritage, to the efforts and data of science. Since launching in April 2013, it has aggregated more than 11 million items from 1,600 institutions. The DPLA is a registered 501(c)(3) non-profit.

Federal libraries and the national policy agenda / District Dispatch

Library of Congress

On Tuesday, I had the pleasure of meeting at the Library of Congress with the FEDLINK Advisory Board. My brief was a presentation and discussion of the National Policy Agenda for Libraries in the context of federal libraries and related institutions.

Federal libraries represent both a particular segment of the library community and an extensive and far-reaching one as well—including service to the general public. Those with the highest visibility and general name recognition include the Library of Congress and National Library of Medicine but in fact there are numerous libraries in the federal sector, including several hundred libraries in the armed forces. The latter includes the Navy General Library Program, 96 years old, with over a million sailor visits in the last fiscal year.

Following is the FEDLINK mission statement:

The Federal Library and Information Network (FEDLINK) is an organization of federal agencies working together to achieve optimum use of the resources and facilities of federal libraries and information centers by promoting common services, coordinating and sharing available resources, and providing continuing professional education for federal library and information staff. FEDLINK serves as a forum for discussion of the policies, programs, procedures and technologies that affect federal libraries and the information services they provide to their agencies, to the Congress, the federal courts and the American people.

FEDLINK celebrates 50 years of service this year. I’m pleased to have ALA’s Jessica McGilvray serving as our liaison to FEDLINK.

The policy and advocacy challenges for federal libraries have substantial commonalities with other library segments. The problem of higher-ups not understanding the true contributions of libraries resonated—and accordingly, suggested the need for all library managers to be marketing and sales people (and the consequent need for such education in master’s programs).

Many thanks to Blane Dessy of the Library of Congress for the invitation. ALA looks forward to continued and closer collaboration on these issues, and I am particularly committed to working cooperatively on the many federal library issues that intersect with ALA’s national policy work.

The post Federal libraries and the national policy agenda appeared first on District Dispatch.

Subscribe to the DuraSpace Quickbyte Video Channel / DuraSpace News

Winchester, MA  Information seekers can now find videos and broadcasts tailored to almost any interest or topic online. Reading long emails went the way of the dinosaurs as rich media viewed on our phones became the way many of us access news and information. DuraSpace has joined the fray by establishing the DuraSpace Quickbyte video series on YouTube.

Extra Extra! Chronicling America Posts its 10 Millionth Historic Newspaper Page / Library of Congress: The Signal

Talk about newsworthy! Chronicling America, an online searchable database of historic U.S. newspapers, has posted its 10 millionth page today. Way back in 2013, Chronicling America boasted 6 million pages available for access online.

The San Francisco call., October 12, 1902, Image 15

The site makes digitized newspapers published between 1836 and 1922 available through the National Digital Newspaper Program. It also includes a separate searchable directory of US newspaper records, describing more than 150,000 titles published from 1690 to the present and listing libraries that have physical copies in microfilm or original print. The site now features more than 74 terabytes of total data – from more than 1,900 newspapers in 38 states and territories and the District of Columbia.

For the past eight years, the site has grown in content and provided enhanced access. The NDNP data is in the public domain and available on the web for anyone to use. In addition, the web application supporting the Chronicling America web site is published as open-source software for others to implement and customize for their own digitized newspaper collections.

The technical aspects of the program are based around sustainable practices in digital preservation, including open and standardized file formats and metadata structures, technical validation, and the digital collection and inventory management tools developed at the Library.

New-York tribune., November 25, 1906, Image 17

“It’s very exciting to have created such a large collection of newspapers from so many places around the country covering a wide breadth of time,” said Deb Thomas, who manages the program for the Library of Congress. “We can see how individual communities understood the world around them in those decades.”

The goal for Chronicling America, Thomas said, is to have all 50 states plus U.S. territories represented in the archive, something she estimates may take about 10 more years. “The newspapers are the first draft of history,” she said. “That’s it – it has something for everyone in it. It’s not a specialized resource. It’s a record of community history and cultural history. That’s where we put it all.”

Chronicling America ( ) provides free and open access to more than 10 million pages of historic American newspapers selected by memory institutions in 38 states and territories so far. These states participate in the National Digital Newspaper Program, a joint program of the National Endowment for the Humanities and the Library of Congress. Read more about it at and follow us on Twitter @librarycongress #ChronAm #10million!

Read other Library of Congress blog posts recognizing this milestone:

New Self-Guided Curriculum for Digitization / DPLA

Through the Public Library Partnerships Project (PLPP), DPLA has been working with existing DPLA Service Hubs to provide digital skills training for public librarians and connect them sustainably with state and regional resources for digitizing, describing, and exhibiting their cultural heritage content.

During the project, DPLA collaborated with trainers at Digital Commonwealth, Digital Library of Georgia, Minnesota Digital Library, Montana Memory Project, and Mountain West Digital Library to write and iterate a workshop curriculum based on documented best practices. Through the project workshops, we used this curriculum to introduce 150 public librarians to the digitization process.

Now at the end of the project, we’ve made this curriculum available in a self-guided version intended for digitization beginners from a variety of cultural heritage institutions. Each module includes a video presentation, slides with notes in Powerpoint, and slides in PDF. Please feel free to share, reuse, and adapt these materials.

These modules follow the flow of the digitization process and each is presented by different members of the curriculum writing team. Many thanks to the hubs who collaborated to develop and test this curriculum, the PLPP participants for providing us with feedback on how to improve it, and the Bill & Melinda Gates Foundation for its generous funding.

To learn more about the Public Library Partnerships Project, please visit our project page or email

Header image courtesy of the University of Kentucky Libraries via Kentucky Digital Library.

Agile Development: Sprint Retrospective / LITA


In my last two posts I’ve discussed how to carry out sprint review and sprint planning meetings. This month we’ll look at the final component of the sprint boundary process, the sprint retrospective, which is where the team analyzes its inner workings.


The sprint retrospective is an opportunity for the development team to review their performance over the previous sprint, identify strengths and weaknesses, and modify processes to increase productivity and well-being.


The retrospective should take place near the end of the iteration. It usually follows the sprint review, and can be held immediately following, but some sort of boundary should be established (take a short break, change the room, etc.) to make it clear that these are two very different meetings with very different purposes. The length of the meeting will change from sprint to sprint; budget as much time as you think you will need to fully explore team performance. If there isn’t much of substance to discuss, you can always end the meeting early and gain hero status within the team.


This is the most intimate gathering of the three we have looked at so far. No one other than the core iteration team should be present. Select stakeholders (Product Owner, department managers) may be included for some part of the meeting in order to gather feedback on specific issues, but at its core the retrospective should be limited to the people who performed the work during the iteration. Peripheral stakeholders and authority figures can dampen the effectiveness of this meeting.

Meeting Agenda

The “traditional” retrospective agenda consists of a quantitative review of team iteration metrics, followed by each team member answering the following three questions to encourage dialogue:

  • What went right?
  • What went wrong?
  • What can be improved?

That’s as good a place to start as any, but your retrospective’s format should adapt to your team. Such a tightly-formatted agenda may cause some teams to fall into rote, uninspired contribution (“here, let me give you one of each and be done”), while more free-flowing conversations can fail to surface critical issues or avenues for improvement. You will want to provide enough structure to provoke meaningful exchanges, but not so much that it suppresses them. You know your team better than anyone else, so it’s up to you to identify the format that fits best.

The point of the meeting is to get your team into a critique space where everyone is comfortable sharing their thoughts on how to make the development process as efficient and effective as possible. Team members should avoid playing the blame game, but shouldn’t be afraid to point out behavior that detracts from team performance.

Of the three sprint boundary meetings, the retrospective is the hardest one to facilitate: it has the largest qualitative component, and it explores sensitive subjects like team dynamics and team member feelings. This is the meeting that will test a scrum master’s interpersonal and leadership skills the most, but it is also the one that will have the biggest impact on the development environment. When the user stories are flying fast and furious and time is at a premium, it’s easy to think of the retrospective as a luxury that the team may not be able to afford; however, it is crucial for every development team to set aside enough time to thoroughly analyze their own performance and identify the best potential avenues for meaningful and lasting change.

If you want to learn more about sprint retrospective meetings, you can check out the following resources:

I’ll be back next month to discuss how to build an agile organizational culture.

What strategies do you use to make your retrospectives fruitful? How do you encourage team members to be both forthright in their evaluations and open to criticism? How do you keep retrospectives from becoming exercises in finger-pointing and face-saving?

“BIS-Sprint-Final-24-06-13-05” image by Birkenkrahe (Own work), CC BY-SA 3.0, via Wikimedia Commons.

White Librarianship in Blackface: Diversity Initiatives in LIS / In the Library, With the Lead Pipe

Download PDF

In Brief:

Whiteness—an ideological practice that can extend beyond notions of racial supremacy to other areas of dominance—has permeated every aspect of librarianship, extending even to the initiatives we claim are committed to increasing diversity. This state of affairs, however, need not remain. This article examines the ways in which whiteness controls diversity initiatives in LIS, particularly in light of the application requirements set upon candidates. I then suggest ways to correct for whiteness in LIS diversity programs by providing mentorship to diverse applicants struggling to navigate the whiteness of the profession and concurrently working in solidarity to dismantle whiteness from within.1

Failure of Diversity Initiatives in LIS

It is no secret that librarianship has traditionally been and continues to be a profession dominated by whiteness (Bourg, 2014; Branche, 2012; Galvan, 2015; Hall, 2012; Honma, 2006), which is a theoretical concept that can extend beyond the realities of racial privilege to a wide range of dominant ideologies based on gender identity, sexual orientation, class, and other categories. In fact, recent years have seen LIS professional organizations and institutions striving to provide increasing numbers of diversity initiatives to help members from underrepresented groups enter and remain in librarianship (Gonzalez-Smith, Swanson, & Tanaka, 2014). The Association of Research Libraries (ARL) and Society of American Archivists conduct the Mosaic Program to attract diverse students to careers in archiving; the American Association of Law Libraries manages the George A. Strait Minority Scholarship to help fund library school for college graduates interested in law librarianship; the American Library Association (ALA) runs its Spectrum Scholars Program to provide scholarships to diverse LIS students and a corresponding Spectrum Leadership Institute to help prepare these students for successful careers in the library field. Examples abound of library organizations attempting to address the “problem of diversity” in the LIS field.

Nevertheless, these efforts are not making any meaningful difference. As one of my colleagues has so accurately put it: “We’re bringing [people] from underrepresented identity groups into the profession at same rate they’re leaving. Attrition [is] a problem” (Vinopal, 2015). With minority librarians leaving the profession as soon as they are recruited, what can be done to render our abundance of diversity initiatives truly effective? Why are these ambitious and numerous initiatives failing to have the desired effect? Shortly after discussing this very issue with a colleague over lunch, I received an email regarding the approaching deadline for the ARL Career Enhancement Program, which is aimed at placing diverse, early career librarians in internships with member libraries. Reading through the onerous application process, the realization hit me: Our diversity programs do not work because they are themselves coded to promote whiteness as the norm in the profession and unduly burden those individuals they are most intended to help.

Whiteness in LIS

Studying whiteness in LIS has yet to hit the mainstream of library scholarship, but there have been a number of critical and radical library scholars who have taken up the challenge of interrogating and troubling the whiteness of the profession (Bourg, 2014; Espinal, 2000; Galvan, 2015; Hall, 2012; Honma, 2006). These critical examinations highlight the many dimensions of any accurate definition of whiteness as an ideological practice. As Galvan (2015) so succinctly puts it, “whiteness . . . means: white, heterosexual, capitalist, and middle class.” Hall (2012) takes a different approach to defining the breadth of whiteness in LIS by differentiating it from the “black bodies” of LIS: “I would assert that [whiteness] is an issue, a question, that transcends race, ethnicity, any broad or limiting categorization and unites all librarians who identify or are identified as different” (p. 201). For these writers, whiteness refers not only to racial and ethnic categorizations but a complete system of exclusion based on hegemony. Likewise, in this article, I use “whiteness” to refer not only to the socio-cultural differential of power and privilege that results from categories of race and ethnicity; it also stands as a marker for the privilege and power that acts to reinforce itself through hegemonic cultural practice that excludes all who are different.

This system of exclusion functions primarily through the normativity of whiteness within librarian and larger societal culture. As Branche (2012) notes, “Whiteness and white normativity are embedded in U.S. library culture” (p. 205). The normativity of whiteness works insidiously, invisibly, to create binary categorizations of people as either acceptable to whiteness and therefore normal or different and therefore other. The invisible nature of whiteness is key to its power; when it is not named or interrogated, it can persist in creating a culture of exclusion behind the scenes of LIS practice (Espinal, 2001; Galvan, 2015; Honma, 2006). As Yeo and Jacobs (2006) note, “One must ask oneself if it would be possible to really achieve diversity without challenging our racist, homophobic and sexist consciousnesses that are so deeply imbedded that we don’t even recognize them?”

For example, whiteness as hegemonic practice is at work when a librarian of color is mistaken for a library assistant by white colleagues at a professional conference. Likewise, whiteness is at work when genderqueer librarians are forced to choose between binary gender groupings, neither of which apply to their identities, when using the restroom at work. Finally, whiteness is at work when a librarian from a working-class background in search of employment is told by well-meaning colleagues, “Just take a job anywhere and move,” when the unemployed librarian lacks the financial privilege to do so. This working of white normativity occurs without thought and intention but is still powerfully exclusionary and damaging to the profession.

A major contributor to the invisible normativity of whiteness in librarianship has been the fact that whiteness has played such a fundamental role in the profession from the start. Public libraries in the U.S. developed initially as sites of cultural assimilation and “Americanization” of immigrants needing to learn the mores of white society (Hall, 2012; Honma, 2006). Given the historical context, white normativity continues to be a hallmark of modern librarianship.

White normativity in LIS extends to the ways in which we discuss and address diversity in the profession. Rather than being framed as a shared goal for the common good, diversity is approached as a problem that must be solved, with diverse librarians becoming the objectified pawns deployed to attack the problem. With this white-centered thinking at the fore, many LIS diversity initiatives seem to focus primarily on increasing numbers and visibility without paying corresponding attention to retention and the lived experiences of underrepresented librarians surrounded by the whiteness of the profession (Gonzalez-Smith, Swanson, & Tanaka, 2014; Honma, 2006; Yeo & Jacobs, 2006). Focusing on numbers rather than the deeper issues of experience and structural discrimination allows the profession to take a self-congratulatory and complacent approach to the “problem of diversity” without ever overtly naming and addressing the issue of whiteness (Espinal, 2001; Honma, 2006).

In many ways, this article serves as an extension of Galvan’s (2015) examination of the practice of whiteness in LIS hiring and job recruitment. She identifies culture, conspicuous leisure, and access to wealth as barriers to entry for members from diverse backgrounds (Galvan, 2015). My research extends that framework to examine ways in which similar barriers come into play even before the hiring process—in diversity initiatives supposedly aimed at encouraging members of marginalized groups to pursue the education and training necessary for a career in librarianship.

“White” Diversity Initiatives

The profession is so imbued with whiteness, extending even to the ways in which we discuss and address diversity, it is no wonder that our myriad diversity initiatives are not working. When we recruit for whiteness, we will perpetuate whiteness in the profession, even when it comes in the form of a librarian with a diverse background. A look at the application requirements for a typical LIS diversity initiative demonstrates this point. In order to qualify for an internship through the ARL Career Enhancement Program, for example, applicants must submit:

  1. a completed application form;
  2. a resume;
  3. a 500-word essay detailing their professional interests and goals;
  4. an official letter of acceptance to an ALA-accredited MLIS program;
  5. official transcripts; and
  6. two letters of recommendation, one of which must be from a professor or employer.

Each of these requirements assumes that applicants are situated in positions of white, middle-class, cisgender normativity that allow for the temporal, financial, and educational privilege that fulfilling these criteria would require. Only an applicant with access to the privileges of whiteness would have the tools needed to engage in the requisite work and volunteer opportunities called for by the diversity program, have the high-level of educational achievement required, possess the close relationships with individuals of power needed for stellar recommendations, and be able to provide all the documentation necessary to complete their application through the online form. In many ways, this long list of requirements resembles the complex application processes of the most elite private institutions of higher education. Many public institutions, including almost all community colleges, do not require such detailed paperwork for matriculation into their undergraduate programs (see e.g., St. Petersburg College). These institutions take their public mission seriously to provide education to all members of the community. However, diversity initiatives in LIS that are meant to benefit members of underrepresented groups require lengthy applications that many individuals from diverse backgrounds may not be equipped to complete.

These applications are created particularly to recruit for whiteness and require the ability to play at whiteness in order to succeed. For example, applicants are required to submit resumes detailing their work experience, but an applicant from a working-class background may not have the requisite experience, either through work or volunteering, to place on a resume. Building a relevant resume assumes the applicant has the white, middle-class background that allows for early career professional work or volunteerism, whereas many applicants do not have that privilege (Galvan, 2015). It may also be the case that the applicant has plenty of work experience in low-wage jobs but is unaware of ways to frame that experience to reflect the transferable skills that relate to librarianship. Without the white-normative experience of applying for professional opportunities, the applicant will not know how to frame their resume to meet the requirements for the application and, because of this lack of knowledge, may decide not to apply at all.

Another example can be seen in the requirement of official transcripts. A genderqueer applicant who has since changed names and gender identities may not know how to navigate the legal and bureaucratic labyrinth of transferring their personal information from one name and identity to another. Because the transcripts must be official, the applicant will likely have to work with the educational institution, as well as the diversity program, to verify their identity. This process adds additional labor to the already onerous application process—labor that is not required of the white-normative, cisgender applicant—and could likely discourage the applicant from applying.

In both cases, an application process rooted in whiteness can have a chilling effect on the types of applicants who actually apply, creating a self-selection process that further promotes whiteness in the profession. Even for those applicants who successfully apply and are accepted into these diversity programs, playing at whiteness is still a requirement for career success. Programs like the ARL Career Enhancement Program assume that successful applicants possess the privileged free time, financial backing, and familial circumstances to allow them to relocate for these internships, residencies, or ALA-accredited library programs. Moreover, these diversity initiatives not only require whiteness for the application process but they also require continued whiteness to succeed in the profession (Galvan, 2015). Thus, those applicants who find success in these diversity programs are those who can successfully replicate necessary whiteness. As Espinal (2001) observes, “Many librarians of color have commented that they are more accepted if and when they look and act white” (p. 144). This means the inverse is also true: Those librarians not able to play successfully at whiteness will be continually excluded from the profession (Satifice, 2015).

This phenomenon is not unique to LIS. Writing about the technology sector, Kẏra (2014) notes, “When we talk about diversity and inclusion, we necessarily position marginalized groups as naturally needing to assimilate into dominant ones, rather than to undermine said structures of domination.” Jack (2015) makes a similar observation regarding elite undergraduate institutions matriculating underrepresented minority students—the “privileged poor”—from private high schools: “Elite colleges effectively hedge their bets: They recruit those already familiar with the social and cultural norms that pervade their own campuses.” Manipulating diversity programs to recruit for whiteness ensures that only those diverse candidates adept in whiteness will succeed.

My own experience serves as a prime example. I am a cisgender, heterosexual, middle-class black woman, raised by two highly educated parents who taught me from a young age the importance of playing at whiteness to achieve. I can specifically remember my mother admonishing me to “play the game and do what you want later” throughout my life. I have grown very adept at playing at whiteness; it has allowed me to complete a number of post-graduate degrees, spend time practicing corporate law at an award-winning global firm, and successfully transfer careers to a rewarding position in academic librarianship. This playing at whiteness also allowed me to apply for and successfully obtain a position as an ALA Spectrum Scholar in the 2012 cohort. Knowing how to replicate whiteness has served me well.

“Lifting as We Climb”

While my own ability to play at whiteness has served me in my career, it is a privilege that I know I cannot use selfishly. As my mother reminded me in a recent conversation about the issue of diversity in the professional world, “You play the game and give the white world what it wants just to get through the door. Then, once you’re inside, you blast that door wide open for others to follow you” (B. Evans Hathcock, personal communication, August 18, 2015). Just as the National Association of Colored Women exhorted fellow middle-class blacks to do in their motto “Lifting as We Climb” (Wormser, 2002), it is important that those of us in LIS with privilege—be it the privilege of actual whiteness or the privilege of skill in playing whiteness—serve as effective allies to those who do not. We need to make space for our diverse colleagues to thrive within the profession. In short, we need to dismantle whiteness from within LIS. We can best do that in two equally important ways: by modifying our diversity programs to attract truly diverse applicants and by mentoring early career librarians in both playing at and dismantling whiteness in LIS.

One of the first steps to washing away the blackface of white librarianship is to reframe diversity initiatives so that they attract and retain applicants from truly diverse backgrounds. When we recruit for whiteness, we will get whiteness; but when we recruit for diversity, we will truly achieve diversity. It is important to note that reworking application processes to accommodate applicants with different backgrounds and experiences in no way requires lowering standards. Talented applicants from truly diverse backgrounds—that is, backgrounds not functionally equivalent to standards of successful whiteness—exist and can be recruited and retained for these programs. To identify and attract them, however, requires framing application questions and required material in ways that make sense for the applicants’ experiences.

For example, instead of requiring that at least one or all letters of recommendation come from professors or former employers, it may be useful and more relevant to allow applicants to submit letters from community members or other acquaintances who can provide equally informed assessments of the applicant’s work and goals. Assuming that an applicant has the necessary relationship with a professor or supervisor means assuming that applicant attends school or works in a white, middle-class, cis-male environment where closeness with professors or supervisors is the norm. A diverse applicant may not have the opportunity to form those kinds of school and work relationships. However, that same applicant may know a staff member at the local public library who is well aware of the applicant’s career goals and the work they have put toward achieving them. The local library staff member would not qualify as either a professor or former employer but can still provide valuable insight into the qualifications of that particular applicant.

Dismantling whiteness from the infrastructure of our diversity programs is key, but it will take time. In the meantime, there are diverse individuals out there who wish to become and remain successful librarians. Thus, another important step in washing away the blackface of white librarianship involves teaching new librarians from diverse backgrounds how to navigate effectively the white system that we have. We also need to teach these new librarians how to dismantle whiteness’ stranglehold on the profession. Being a nonwhite librarian playing at whiteness is an isolating and lonely practice, so it is essential that new librarians from diverse backgrounds get the support they need and have safe spaces to go in the midst of this work.

Fortunately, there are a number of communities of radical and critical librarians who are willing to provide support, guidance, and mentorship in bringing true diversity and anti-racist practice to the profession. One colleague and fellow beneficiary of LIS diversity initiatives has created a mentorship group for students of color to help them navigate the realities of learning and working in a privileged space and to assist them in fulfilling the requirements of whiteness necessary to succeed (Padilla, 2015). Social media spaces, such as #critlib and #radlib on Twitter, provide public spaces for librarians to vent frustrations and share strategies for combating whiteness—comprising a range of hegemonic statuses, as defined above—in LIS. For those not comfortable with speaking out publicly, social media can also provide useful points of contact for more private, offline relationships and discussions aimed at combating whiteness in the profession. Even within our professional organizations, a number of caucuses and interest groups, including the Gay, Lesbian, Bisexual, and Transgender Round Table and the Asian Pacific American Librarians Association exist to help members of diverse identity groups find community in the midst of the whiteness of librarianship (Espinal, 2001; Gonzalez-Smith, Swanson, & Tanaka, 2014).

There are many ways for nonwhite librarians and library students to gain the support and knowledge they need to enter the doors of the profession and subsequently “blast them open.” Likewise, there are many practical ways more experienced librarians—from all backgrounds and levels of privilege—can help to fight whiteness in our diversity initiatives:

  • Volunteer to serve on ALA and workplace committees and working groups tasked with organizing LIS diversity initiatives and speak up regarding ways those initiatives can be modified to embrace a more diverse applicant pool.
  • Offer to take part in formal mentoring programs through professional associations or within your institution. Help library workers new to the profession to navigate the culture of whiteness in the profession at large and within your specific place of work. For example, the Association of College and Research Libraries’ Dr. E. J. Josey Spectrum Scholar Mentor Program pairs academic librarians with current Spectrum Scholars interested in academic librarianship, and mentor applications are always welcome.
  • Participate in informal mentoring with nonwhite library workers and students. With social media, it is possible to serve as an effective resource and ally for someone, even from miles away. Do what you can to let new colleagues from diverse backgrounds know that you are available as a resource for advice, to serve as a reference, etc.
  • Even if you are yourself new to the profession, you have a role to play. Develop relationships with more seasoned librarians who have demonstrated a commitment to inclusivity and learn from their experiences in the struggle. If you have privilege, begin speaking up for those who do not and signal boost their messages.

Fighting whiteness is hard work that requires additional labor from everyone. As Lumby and Morrison (2010) note, “It is therefore in the interest of all to address inequities, and not just in the interest of the apparently disadvantaged” (p. 12, citing Frankenburg, 1993).

Washing Away the White Librarianship in Blackface

Whiteness has permeated every aspect of librarianship, extending even to the initiatives we claim are committed to increasing diversity. We can, however, make meaningful and important changes. With continued critical study of whiteness and its effects on LIS, it is possible to redirect our thinking about diversity from a problem to be solved to a goal worth achieving. Moreover, we can and should develop real strategies for attaining that goal. The first step is to help diverse applicants navigate the whiteness of the profession and make a concerted effort to dismantle whiteness from within. In doing so, we can recreate the profession into one that truly embraces inclusivity. We can wash away our white librarianship in blackface.

Huge thank you to Annie Pho, Jennifer Vinopal, and Erin Dorney for reading, reviewing, and helping to revise this article. It is so much better having come across their desks. Unending gratitude to Betty Evans and Dewitt Hathcock for teaching me how to play the game successfully and raising me to be the radical I am today.

Works Cited

Branche, C. L. (2012). Diversity in librarianship: Is there a color line? In A. P. Jackson, J. C. Jefferson, Jr., & A. S. Nosakhere (Eds.), The 21st-century black librarian in America (pp. 203-206). Lanham, MD: Scarecrow Press.

Bourg, C. (2014, March 3). The unbearable whiteness of librarianship. Feral librarian. [Blog post]. Retrieved from

Espinal, I. (2001). A new inclusive vocabulary for inclusive librarianship: Applying whiteness theory to our profession. In L. Castillo-Speed (Ed.), The power of language/El poder de la palabra: Selected papers from the second REFORMA National Conference (pp. 131-149). Englewood, CO: Libraries Unlimited.

Frankenburg, R. (1993). White women, race matters: The social construction of whiteness. Minneapolis, MN: University of Minnesota Press.

Galvan, A. (2015). Soliciting performance, hiding bias: Whiteness and librarianship. In the Library with the Lead Pipe. Retrieved from

Gonzalez-Smith, I., Swanson, J., & Tanaka, A. (2014). Unpacking identity: Racial, ethnic, and professional identity and academic librarians of color. In N. Pagowsky & M. Rigby (Eds.), The librarian stereotype: Deconstructing perceptions and presentations of information work (pp. 149-173). Chicago, IL: Association of College and Research Libraries.

Hall, T. D. (2012). The black body at the reference desk: Critical race theory and black librarianship. In A. P. Jackson, J. C. Jefferson, Jr., & A. S. Nosakhere (Eds.), The 21st-century black librarian in America (pp. 197-202). Lanham, MD: Scarecrow Press.

Honma, T. (2006). Trippin’ over the color line: The invisibility of race in library and information studies. InterActions: UCLA Journal of Education and Information Studies, 1(2). Retrieved from

Jack, A. A. (2015, September 12). What the privileged poor can teach us. The New York Times. Retrieved from

Kẏra (2014, December 10). How to uphold white supremacy by focusing on diversity and inclusion. Model View Culture. Retrieved from

Lorde, A. (1984). Sexism: An American disease in blackface. In Sister Outsider (pp. 60-65). Trumansburg, NY: Crossing Press.

Lumby, J., & Morrison, M. (2010). Leadership and diversity: Theory and research. School Leadership & Management: Formerly School Organisation, 30(1), 3-17.

Padilla, T. [@thomasgpadilla]. (2015, August 18). @AprilHathcock we started a students of color group, tried to mentor incoming groups to privileged realities, req. of entrance. [Tweet]. Retrieved from

Satifice. (2015, September 10). It’s time to get personal, dirty, and downright nasty [Tumblr post]. Retrieved from

Vinopal, J. [@jvinopal]. (2015, August 18). @AprilHathcock we’re bringing ppl from underrepresented identity groups into profession at same rate they are leaving. Attrition a problem+. [Tweet]. Retrieved from

Wormser, R. (2002). Jim Crow stories: National Association of Colored Women. The rise and fall of Jim Crow. Retrieved from

Yeo, S., & Jacobs, J. R. (2006). Diversity matters? Rethinking diversity in libraries. Counterpoise, 9(2). Retrieved from

  1. The title of this article is a variation on a quote by librarian, scholar, and activist Audre Lorde (1984): “Black feminism is not white feminism in blackface.” In this article, I am arguing the opposite as it relates to diversity initiatives in LIS in that I posit that diverse librarianship as we conceive of it is in fact white librarianship in blackface.

Knowledge Continues to be Unlatched / Roy Tennant

I’ve written before about two similar efforts to open up books by having individuals ( ) or libraries (Knowledge Unlatched) pledge money until a certain total has been raised. They both continue to work at these efforts, and KU recently announced a new offering for libraries to consider.

The “KU Collection,” as it is dubbed, includes 78 new books in five subject areas (Anthropology, History, Literature, Media and Communications, and Politics) from 26 scholarly publishers. For a capped maximum of $3,891, which represents an average of just under $50 a book, libraries can participate in making this collection open to all. As more libraries pledge, the per-institution cost declines.

This is an interesting model, and the cost profile seems quite reasonable given current prices for scholarly monographs. Also, this likely provides welcome guaranteed income for scholarly presses that can struggle in an unpredictable market. The collection can be viewed, and pledges made, at . Libraries have until the end of January 2016 to make a pledge.

The Next Librarian of Congress? / Meredith Farkas

The New Republic

Late last week, I received an email from the culture editor at the New Republic about writing an article on the next Librarian of Congress. It was the first offer I’ve ever had to write for a non-library-centric publication and the New Republic has a political bent I really respect, so it was an offer I couldn’t refuse. It ended up being a really fun exercise in positive thinking and in articulating why regular people should actually care at all about who the next Librarian of Congress is. 

What could a truly great Librarian of Congress do in the 21st century? Maybe one who uses technologies beyond the fax machine? Maybe one who shares the values of librarians (or maybe even IS a librarian). Maybe one who knows how to run a complex organization and doesn’t berate their employees. Maybe one who knows that the best way forward in digitization and preservation of our nation’s history is all about collaboration. Maybe one who understands that the DMCA is ridiculously restrictive and needs to strike a better balance on the side of end-users, creatives involved in remix culture, and people who just want to tinker with the technologies they’ve legally purchased.

Clearly I could not be this snarky in the article, but I’m still pretty proud of it and you can read it here.

The Next Librarian of Congress Should Be an Actual Librarian

I’d love to hear your thoughts on the next Librarian of Congress! And if you’re interested in this issue, check out Jessamyn West’s amazing article on Medium, her terrific website Librarian of Progress, and the #nextLoC hashtag on Twitter.

I will not suggest that you read (nor will I link to) Siva Vaidyanathan’s Slate article, where he wrote “but the library needs more than a respected scholar or librarian. It needs a visionary who can leverage the position to lead us through some essential upgrades and debates that could push this vital institution into public consciousness.” Silly me. I thought librarians could be visionaries and leaders too.

Thanks so much to Jessamyn, my husband Adam, and my parents for reading over my draft this weekend and helping me make sure I didn’t write anything too stupid.

LITA Forum Student Registration Rate Available / LITA

2015 LITA Forum
Minneapolis, MN
November 12-15, 2015


LITA is offering a special student registration rate to the 2015 LITA Forum to a limited number of graduate students enrolled in ALA-accredited programs. The Forum will be held November 12-15, 2015 at the Hyatt Regency Minneapolis in Minneapolis, MN. To learn more about the Forum, visit .

In exchange for the discounted registration, students will assist the LITA organizers and the Forum presenters with the on-site operations for the Forum. We are anticipating an attendance of 300+ decision makers and implementers of new information technologies in libraries.

The selected students will be expected to attend the full LITA Forum, Friday noon through Saturday noon, but attending during the pre-conferences on Thursday afternoon and Friday morning is not required. They will be assigned a variety of duties, but will be able to attend the Forum programs, which include 3 keynote sessions, approximately 50 concurrent sessions, and 15 poster presentations, as well as many opportunities for social engagement.

The student rate is $180 – half the regular registration rate for LITA members. This rate includes a Friday night reception at the hotel, continental breakfasts, and Saturday lunch.

To apply for the student registration rate, please use and submit this form:

You will be asked to provide the following:

1) Complete contact information including email address,
2) The name of the school you are attending, and
3) 150 word (or less) statement on why you want to attend the LITA Forum

Please complete and submit this form no later than October 17, 2015.

Those selected for the student rate will be notified no later than October 23, 2015.

Check this link for Why Attend?

026: 20 Questions / LibUX

I — Michael — was grilled by #LIS3267 about what it’s like to do web design, the future of libraries, maintaining work/life balance in a highly disrupted field, and other slightly neurotic grad-student things. We thought it would be fun to hear Amanda’s take for this episode. No special guests, no specific topics. Enjoy!

Amanda L. Goodman edited this episode. Abigail Phillips (twitter) was super nice for the invitation to jabber at impressionable youthfolks. You can subscribe to LibUX on Stitcher, iTunes, or plug our feed right into your podcatcher of choice. Help us out and say something nice.

The post 026: 20 Questions appeared first on LibUX.

Five Questions for the Smithsonian Institution Archives’ Lynda Schmitz Fuhrig / Library of Congress: The Signal

The following is a guest post from Michael Neubert, a supervisory digital projects specialist at the Library of Congress.

In February of this year I wrote a post here about a collaborative effort of representatives of the National Archives and Records Administration (NARA), the Government Publishing Office (GPO), and the Library of Congress to work together in various ways on archiving federal government agency websites – Introducing the Federal Web Archiving Working Group.

Smithsonian Institution Building, 1000 Jefferson Drive, between Ninth & Twelfth Streets, Southwest, Washington, District of Columbia, DC. 1968. Library of Congress.

Since that time we have expanded participation beyond NARA, GPO, and the Library to additional federal agencies that focus more on harvesting their own sites than on harvesting the sites of other agencies: the National Library of Medicine, the Smithsonian Institution, and the Department of Health and Human Services. We plan to reach out to more soon. Because web archiving is still relatively new and the community of interested staff and managers at federal agencies is small, we have realized we have much to learn from one another about archiving federal sites.

Lynda Schmitz Fuhrig, electronic records archivist, is the representative to the Federal Web Archiving Working Group from the Smithsonian Institution Archives (SIA). SIA “captures, preserves, and makes available to the public the history of this extraordinary Institution. From its inception in 1846 to the present, the records of the history of the Institution—its people, its programs, its research, and its stories—have been gathered, organized, and disseminated so that everyone can learn about the Smithsonian. The history of the Smithsonian is a vital part of American history, of scientific exploration, and of international cultural understanding.” Since the late 1990s this has included archiving of the websites and social media presence for Smithsonian’s various museums, research centers, and offices.

Michael: Why does the Smithsonian Institution archive its own sites? What is your process?

Lynda: As the official recordkeeper of the Smithsonian, we document what the Institution does in terms of exhibits and program planning, construction of buildings, and many other aspects. Our websites and social media accounts also serve as the public face of the Smithsonian. Many of them contain significant content of historical and research value that is now not found elsewhere. These are considered records of the Institution. It is also interesting to see how websites evolve over time. It would be irresponsible of us as an archives to rely only upon other organizations to archive our websites.

We use the web crawling service from Archive-It to capture most of these sites. In addition to Archive-It hosting our web archives, we also retain copies of the files in our collections. We use some other tools to capture specific tweets or hashtags or sites that are a little more challenging due to how they are constructed and the dynamic nature of social media content.

In terms of public-facing websites, we try to capture them every 12 to 18 months. It is more frequent if a redesign is happening, and the archiving will happen before and after the update/refresh. In some instances, an archivist appraises the content on social media sites to determine whether it has been replicated and captured elsewhere. For example, a museum’s postings on Facebook and Twitter could be similar and don’t require frequent captures. We now have more than 400 websites and blogs and more than 600 social media accounts across the Institution, including Twitter, Facebook, Instagram, and YouTube.

Michael: You’ve been participating in the Federal Web Archiving Working Group since June 2015. What did you hope to learn or accomplish with this group and how is it going so far?

Lynda: I am hoping to learn from my colleagues about their experiences and challenges, as well as other tools or approaches they are implementing at their agencies regarding web archiving. It has been interesting to hear about the various collecting missions or directives at other government agencies.

Michael: When you talk to colleagues or managers at the Smithsonian Institution about web archiving, what is the reaction? How do they see the benefit of this activity?

Lynda: Many do understand the value of it since we reach more people globally via the web than visitors coming to our museums physically. Our websites and social media accounts do indeed document the history of the Institution. Many webmasters know it is important to contact us when they are getting ready to retire a website so we can get a capture and/or retrieve the actual files from a content management system. We also have made various presentations at the Institution about web archiving.

Michael: I can imagine someone suggesting that since the Smithsonian must “back up” its web servers that it seems redundant to archive the websites. How would you explain the difference?

Image of the Smithsonian Institution’s 1995 homepage. Credit: Smithsonian Institution Archives

Lynda: It is true that we back up our network servers at the Smithsonian, but backing up is not the same as archiving. By crawling sites we deem appropriate, we have a snapshot in time of the look and feel of a website. Backups serve the purpose of having duplicate files to rely upon in case of disaster or failure. Backups typically are only saved for a certain time period. The website archiving we do is kept permanently. If I wanted to see a site as it appeared on Oct. 9, 2012, there is a good chance the backup tape no longer exists, but if I crawled that site that day I will have those files.

Michael: We have talked about my next question before: what is your view on whether it makes sense to use web archiving to make complete copies of cultural heritage presentation sites, including the records and displays of digitized collection items?

Lynda: Our approach has been to exclude as many collection objects/images as possible from our crawls of the museum websites, as per Smithsonian Institution Archives policy. Of course, there are items that do get crawled because of the nature of the sites, and we usually have the main collections page. Physical collection items fall under the unit responsible for them, and they are something that we would never accession in the Archives.

Personally, I have mixed feelings about this since it is not a “complete” website capture then, especially since the images themselves are only representations online and not the actual object.

We do crawl exhibit websites that contain collection objects though.

This is something that researchers need to be aware of when using web archives. Typically, many website captures are not going to have everything either because of excluded content, blocked content, or dynamic content such as Flash elements or calendars that are generated by databases. Capturing the web is not perfect.

Another good prediction / David Rosenthal

After patting myself on the back about one good prediction, here is another. Ever since Dave Anderson's presentation to the 2009 Storage Architecture meeting at the Library of Congress, I've been arguing that for flash to displace disk as the bulk storage medium, flash vendors would have to make such enormous investments in new fab capacity that there would be no possibility of making an adequate return on them. Since the vendors couldn't make money on the investment, they wouldn't make it, and flash would not displace disk. Six years later, despite the arrival of 3D flash, that is still the case.

Source: Gartner & Stifel
Chris Mellor at The Register has the story in a piece entitled Don't want to fork out for NAND flash? You're not alone. Disk still rules. It's summed up in this graph of the bytes shipped by flash and disk vendors: total bytes shipped are growing rapidly, but the proportion that is flash is roughly stable. Flash is:
expected to account for less than 10 per cent of the total storage capacity the industry will need by 2020.
Stifel estimates that:
Samsung is estimated to be spending over $23bn in capex on its 3D NAND for an estimated ~10-12 exabytes of capacity.
If it is fully ramped in by 2018, it will make about 1% of what the disk manufacturers will that year. So the investment to replace that capacity would be $2.3T, which clearly isn't going to happen. Unless the investment to make a petabyte of flash per year is much less than the investment to make a petabyte of disk, disk will remain the medium of choice for bulk storage.
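To make the back-of-the-envelope arithmetic explicit, here is a minimal sketch; the $23bn and ~1% figures come from the quotes above, and everything else is an assumption rather than measured data:

```java
// Back-of-the-envelope check of the argument above, using the rough numbers
// quoted from Stifel. All figures are approximations, not measured data.
public class FlashVsDisk {
    public static void main(String[] args) {
        double samsungNandCapexBn = 23.0;  // ~$23bn of 3D NAND capex
        double shareOfDiskOutput = 0.01;   // ~1% of projected 2018 disk output
        // Scale the same capex-per-byte up to 100% of disk output.
        double capexToMatchDiskBn = samsungNandCapexBn / shareOfDiskOutput;
        System.out.printf("Capex to match disk output: ~$%.1fT%n",
                capexToMatchDiskBn / 1000.0); // prints ~$2.3T
    }
}
```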

Final whitepapers for establishing international and interoperable rights statements released / DPLA

Over the past fifteen months, representatives from the Europeana and DPLA networks, in partnership with Creative Commons, have been developing a collaborative approach to internationally interoperable rights statements that can be used to communicate the copyright status of cultural objects published via the DPLA and Europeana platforms.

The purpose of these rights statements is to provide end users of our platforms with easy to understand information on what they can and cannot do with digital items that they encounter via these platforms. Having standardized interoperable rights statements will also make it easier for application developers and other third parties to automatically identify items that can be re-used.

We also anticipate that these statements will be used by other cultural heritage aggregators across the globe, and see these statements as the initial effort towards international interoperability around standardized rights statements.

In May of this year, we released two draft white papers on the recommendations for standardized international rights statements, one on the rights statements and one on the technical framework to support the statements. Both white papers received a tremendous amount of community response.

After considering the community feedback and making significant edits to both white papers and the list of statements, we are pleased to share with you today the final versions that describe our recommendations for establishing a group of rights statements, and the enabling technical infrastructure. These recommendations include a list of shared rights statements that both the DPLA and Europeana can use depending on the needs of our respective organizations.

Recommendations for standardized international rights statements

This paper describes the need for a common standardized approach. Based on the experience of both of our organizations and community feedback, we have described the principles we think any international approach to providing standardized rights statements needs to meet. Together we propose a list of ten new rights statements that can be used in situations where the licenses and legal tools offered by Creative Commons cannot be applied. The statements whitepaper and recommended list of statements can be found here.

Requirements for the technical infrastructure

In order to ensure that the new rights statements can be used by institutions around the world, we are planning to host the new rights statements in their own namespace: The whitepaper describing the technical framework can be found here. We have recently issued an RFP to assist us in building the technical infrastructure and anticipate launching the website in early 2016.

Speeding up core search / State Library of Denmark

Issue a query, get back the top-X results. It does not get more basic than that with Solr. So it would be a great win if we could improve on it, right? Truth be told, the answer is still “maybe”, but read on for some thoughts, test code and experimental results.

Getting the top-X results

  • The X in top-X is the maximum result set size we want back from a request.
  • A result set is the documents actually returned – this can be the same size as or smaller than X.
  • hits is the number of documents that matched the query – this can be much higher than the size of the result set.

When a top-X search is performed, Solr uses a Priority Queue to keep track of the result set. Such a queue is often implemented as a heap and so it is with Solr: The relevant class in the source is HitQueue.

A ScoreDoc represents a single result document, in the form of a document ID and a score (and sometimes a shard index, but we ignore that for now). The HitQueue holds ScoreDocs and is a simple beast: add a new ScoreDoc to a queue that is not yet full and it is simply inserted; add one to a full queue and it is only kept if its score beats the lowest-scoring ScoreDoc already there, which is then removed.
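In outline, the collection logic looks something like this (a minimal sketch of the idea in plain Java; it mirrors the behaviour described above, not Solr's actual HitQueue code):

import java.util.Comparator;
import java.util.PriorityQueue;

// Minimal sketch of bounded top-X collection; not Solr's HitQueue.
class TopXCollector {
    record Hit(int docID, float score) {}

    // The "worst" hit (lowest score; on ties, highest docID) sits at the head of the queue.
    private static final Comparator<Hit> WORST_FIRST =
        Comparator.comparingDouble((Hit h) -> h.score())
                  .thenComparing((Hit h) -> h.docID(), Comparator.reverseOrder());

    private final int topX;
    private final PriorityQueue<Hit> queue = new PriorityQueue<>(WORST_FIRST);

    TopXCollector(int topX) { this.topX = topX; }

    void collect(int docID, float score) {
        Hit hit = new Hit(docID, score);
        if (queue.size() < topX) {
            queue.add(hit);                                     // not full: always insert
        } else if (WORST_FIRST.compare(hit, queue.peek()) > 0) {
            queue.poll();                                       // full: replace the current worst
            queue.add(hit);
        }
    }
}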

Potential problems with the current implementation

  • A simple binary heap has poor memory locality: This means that an insert into the heap often results in multiple memory accesses scattered in the heap structure. This is not a problem for a tiny heap as everything will be in the CPU L2 cache, but for larger heaps that means a lot of uncached memory accesses.
  • The HitQueue in Solr uses ScoreDoc objects as elements in the heap. When the HitQueue is created, it is filled with dummy-ScoreDocs, called sentinels. There are multiple problems with that.
    • Each ScoreDoc object takes up 28 bytes on the heap. If 30 concurrent requests ask for top-2,500,000, that takes up about 2GB of RAM, no matter what the actual result size is. I mention those specific numbers as that was the case for a person on the Solr IRC channel (the arithmetic is spelled out just after this list).
    • Each ScoreDoc object is temporary, which means a lot of allocations and a lot of work for the garbage collector to clean up. In the previous case of top-2,500,000, the JVM was doing stop-the-world garbage collections half the time.
    • The use of sentinel objects means that the heap is pre-filled with elements that will always be less than any real elements. Adding an element means sifting it from the top of the heap to the bottom. Not a problem for small heaps, but with larger ones it means unnecessary memory thrashing until the heap is full of new elements.
  • If the queue is not filled to its max size (the X in top-X), the ongoing maintenance of a heap structure is not the most efficient solution. In that case, it would be better to simply collect all the elements and do a merge-sort or similar when they are to be delivered in order.
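The arithmetic behind the 2GB figure mentioned above (my own back-of-the-envelope):

$30 \times 2{,}500{,}000 \times 28\,\mathrm{bytes} \approx 2.1\,\mathrm{GB}$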

To sum it up: Solr’s queue in theory works well when requesting a small number of results, but poorly for large numbers. And indeed, that is part of the shared experience with Solr.

Lately CursorMark has been added to Lucene & Solr, which allows for efficient paging of results. One could say that this renders the optimization of requests for large result sets a moot point, but hey, it’s a challenge.

Experimental improvements

Switching to a better heap algorithm would be a worthy experiment, but that is hard so we will skip that for now. Instead we will do away with all those pesky ScoreDoc objects.

The two relevant parts of a ScoreDoc are docID (an integer) and score (a float). The sorting order of two ScoreDocs is determined primarily by the score and secondarily by the docID. There are a lot of such comparisons when using a heap.

In Java, the bits of a float can be extracted by Float.floatToRawIntBits, which produces an integer. This is a very fast operation. Interestingly enough, the sort order of two positive floats is preserved in the integer representations. This means that a ScoreDoc can be packed into a single long with Float.floatToRawIntBits(score) << 32 | docID (widening the int bits to a long before shifting) and that two such longs are directly comparable.
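As a rough illustration, the packing and unpacking could look like this (my sketch of the idea; it assumes non-negative docIDs and scores, and is not the actual patch):

// Pack a (docID, score) pair into one long.
class PackedScoreDoc {
    // The int bits must be widened to a long before shifting, and the docID
    // masked so a sign bit cannot bleed into the score half.
    static long pack(int docID, float score) {
        return (((long) Float.floatToRawIntBits(score)) << 32) | (docID & 0xFFFFFFFFL);
    }

    static float score(long packed) {
        return Float.intBitsToFloat((int) (packed >>> 32));
    }

    static int docID(long packed) {
        return (int) packed;   // low 32 bits
    }
}
// Two packed values compare primarily by score and secondarily by docID,
// e.g. Long.compare(pack(1, 0.5f), pack(2, 0.7f)) < 0.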

With each element as a long, the object-guzzling ScoreDoc[] in the heap turns into a long[]. Besides taking only 8 bytes/element instead of 28, there is a much higher chance of CPU L2 cache hits as the longs in a long[] are packed tightly together on the heap.

Handling the case of a not-fully-filled heap is simple: just add the elements to the array at the next free space. If the array gets full, heapify it and continue using it as a heap. If the queue does not run full, use the faster merge-sort to deliver the results in order.
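A bare-bones sketch of that strategy (illustrative only; the class and method names are mine, not the actual implementation's):

import java.util.Arrays;

// Append into a long[] while the queue is not full; only fall back to
// heap maintenance once it fills up.
class PackedTopX {
    private final long[] elements;
    private int size = 0;

    PackedTopX(int topX) { elements = new long[topX]; }

    void insert(long packed) {
        if (size < elements.length) {
            elements[size++] = packed;                  // not full: just append
            if (size == elements.length) {              // just filled: build a min-heap once
                for (int i = size / 2 - 1; i >= 0; i--) siftDown(i);
            }
        } else if (packed > elements[0]) {              // full: only keep if it beats the minimum
            elements[0] = packed;
            siftDown(0);
        }
    }

    // Deliver results in descending order; if the queue never filled,
    // a plain sort of the used prefix is all that is needed.
    long[] resultsDescending() {
        long[] out = Arrays.copyOf(elements, size);
        Arrays.sort(out);
        for (int i = 0; i < out.length / 2; i++) {      // reverse ascending -> descending
            long t = out[i]; out[i] = out[out.length - 1 - i]; out[out.length - 1 - i] = t;
        }
        return out;
    }

    private void siftDown(int i) {
        while (true) {
            int left = 2 * i + 1, right = left + 1, smallest = i;
            if (left < size && elements[left] < elements[smallest]) smallest = left;
            if (right < size && elements[right] < elements[smallest]) smallest = right;
            if (smallest == i) return;
            long t = elements[i]; elements[i] = elements[smallest]; elements[smallest] = t;
            i = smallest;
        }
    }
}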

Explorative experimentation

Which are just fancy words for poor methodology. Unfortunately the HitQueue used in Solr is not easily replaceable (it extends PriorityQueue, which has a lot of non-overridable methods). So to get a sense of which ideas are worth pursuing, we turn to micro benchmarks. Shalin Shekhar Mangar suggested using JMH. Sound advice, which is postponed for now in favour of “just run it a lot of times, alternating between implementations”.

The test bench is simple: start a bunch of Threads, each running a test in parallel. Each test instantiates a queue implementation, fills it with random docID/score-pairs, then empties it. To guard against noisy neighbours, all threads test the same implementation and finish fully before switching to the next implementation.

For completeness, a Solr HitQueue sans sentinels is also tested. Spoiler: Turning off the sentinel looks like an extremely easy win for large top-X requests.

Experimental results – the original hypothesis

The idea was to improve on large top-X, so let’s look at top-1M. The test was run with 1, 4 & 16 threads on a quad-core i7 laptop and with varying numbers of hits. The raw output of testPQPerformanceReport1M follows:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1     1000000        10    15.06 100%      0.50   3%■     0.97   6% 
      1     1000000       100    13.56 100%      0.47   3%■     0.97   7% 
      1     1000000      1000    13.28 100%      0.56   4%■     1.20   9% 
      1     1000000     10000    21.21 100%      1.28   6%■     1.58   7% 
      1     1000000    100000    86.04 100%     18.43  21%      8.34  10%■
      1     1000000   1000000   954.23 100%    447.40  47%     87.39   9%■
      1     1000000  10000000  3245.70 100%   3544.06 109%    895.89  28%■

      4     1000000        10    27.31 100%      1.47   5%■     3.35  12% 
      4     1000000       100    25.68 100%      1.58   6%■     2.97  12% 
      4     1000000      1000    25.79 100%      1.45   6%■     3.04  12% 
      4     1000000     10000    33.42 100%      2.27   7%■     2.95   9% 
      4     1000000    100000   119.99 100%     19.50  16%     11.52  10%■
      4     1000000   1000000  1456.82 100%    576.17  40%    134.46   9%■
      4     1000000  10000000  5934.11 100%   4278.37  72%   1385.38  23%■

     16     1000000        10   131.92 100%      3.26   2%■     9.79   7% 
     16     1000000       100   120.81 100%      4.08   3%■     8.76   7% 
     16     1000000      1000   124.63 100%      3.01   2%■    10.30   8% 
     16     1000000     10000   162.60 100%      4.68   3%■    10.49   6% 
     16     1000000    100000   485.46 100%     84.27  17%     32.81   7%■
     16     1000000   1000000  4702.79 100%   1787.32  38%    368.57   8%■
     16     1000000  10000000 16563.52 100%  10964.12  66%   4197.17  25%■

Below each implementation (Sentinel, No_Sentinel and Packed) are two columns: How many milliseconds it took to initialize, fill & empty the queue, followed by how long it took relative to the vanilla Solr Sentinel implementation (this is of course 100% for the Sentinel implementation itself). The fastest implementation for any given test is marked with a black box ■.

  • Sentinel starts out relatively slow for small numbers of Hits and gets dog-slow when the number of Hits reaches 1M+.
  • No_Sentinel is a lot better for the smaller hit counts (it does not have to do all the initialization) and markedly better up to 1M Hits.
  • Packed is slower than No_Sentinel for the smaller hit counts, but as the number of hits rises, it pulls markedly ahead. There really is no contest beyond 100K hits, where Packed is 3-10x as fast as vanilla Solr Sentinel.
    Notice how there is a relative performance hit when the number of Hits exceeds the queue size for Packed. This is where it switches from “just fill the array sequentially, then merge-sort at the end” to maintaining a heap.

Experimental results – small top-X requests

A superior implementation for large top-X requests is nice, and it is certainly possible to choose the implementation based on the value of top-X. But let’s check to see how it behaves for small top-X requests. Top-10, for instance:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1          10        10     0.01 100%      0.01  49%      0.00  45%■
      1          10       100     0.01 100%      0.01  85%      0.01  81%■
      1          10      1000     0.01 100%■     0.02 204%      0.02 219% 
      1          10     10000     0.05 100%■     0.05 101%      0.08 152% 
      1          10    100000     0.59 100%      0.44  74%      0.37  62%■
      1          10   1000000     4.41 100%      4.53 103%      3.64  83%■
      1          10  10000000    42.15 100%     45.37 108%     36.33  86%■

      4          10        10     0.00 100%■     0.00 154%      0.01 329% 
      4          10       100     0.00 100%      0.01 348%      0.00  85%■
      4          10      1000     0.01 100%      0.01 120%      0.01  79%■
      4          10     10000     0.05 100%■     0.07 138%      0.08 154% 
      4          10    100000     0.47 100%      0.46  99%■     0.55 117% 
      4          10   1000000     4.71 100%      6.29 134%      4.45  95%■
      4          10  10000000    72.23 100%     60.09  83%     56.29  78%■

     16          10        10     0.00 100%      0.00  42%      0.00  39%■
     16          10       100     0.00 100%      0.00  72%      0.00  60%■
     16          10      1000     0.01 100%■     0.01 109%      0.01 128% 
     16          10     10000     0.08 100%      0.09 112%      0.07  80%■
     16          10    100000     1.48 100%      1.32  89%      1.12  76%■
     16          10   1000000    17.63 100%     18.74 106%     16.97  96%■
     16          10  10000000   207.65 100%    212.93 103%    192.81  93%■

Measurements for small numbers of Hits are somewhat erratic. That is to be expected, as those tests are very fast (< 1ms) to complete, so tiny variations in machine load or garbage collection have a large effect.

  • Sentinel is as expected very fast, compared to the top-1M test. No surprise there as asking for top-10 (or top-20) is the core request for Solr. The relative speed stays reasonably constant as the number of Hits grows.
  • No_Sentinel is a bit slower than Sentinel. That can be explained by a slightly different code path for insertion of elements – this should be investigated as there is a possibility of an easy optimization.
  • Packed is very interesting: Although it is not consistently better for the lower numbers of Hits (do remember we are still talking sub-millisecond times here), it is consistently a bit faster for the larger ones. There might not be a need for choosing between implementations.


Replacing the ScoreDoc-object-based HitQueue in Solr with a bit-packing equivalent is a winning strategy in this test: the speed-up is substantial in some scenarios and the memory usage is less than a third of vanilla Solr’s.

The road ahead

This should of course be independently verified, preferably on other machine architectures. It should also be investigated how to handle the case where shardIndex is relevant and, more importantly, discussed how to adjust the Lucene/Solr search code to use an alternative HitQueue implementation.

Dr. Yoonmo Sang: OITP Research Associate / District Dispatch

Dr. Yoonmo Sang, newly appointed Research Associate with OITP.

I’m pleased to announce that Dr. Yoonmo Sang will be working with ALA’s Office for Information Technology Policy (OITP) as a Research Associate effective October 5, 2015. This fall, Dr. Sang will build on the work from his doctoral dissertation “Copyright and the Future of Digital Culture: Application of the First Sale Doctrine to Digital Copyrighted Works.” One of his major activities is to assist ALA in developing a formal articulation and refinement of its copyright strategy for the digital age. In addition, Yoonmo will be participating in presentations and discussions on digital copyright, telecommunications, and the digital divide.  He will also be involved in media studies within the Washington area.

Yoonmo recently completed his Ph.D. in Media Studies at the University of Texas at Austin, after completing other degrees in journalism, law, and media studies at the University of Wisconsin at Milwaukee and Hanyang University, Seoul, South Korea. Dr. Sang is widely published in outlets that include the International Journal of Communication, Telematics and Informatics, American Behavioral Scientist, Speech & Communication, Computers in Human Behavior, Journal of Media Law, Ethics, and Policy, Journal of Medical Systems, and the Korean Journal of Broadcasting & Telecommunications Research. In January 2016, he will be joining the faculty of Howard University.

I look forward to Dr. Sang’s contributions to ALA and the national library community as well as to overall information policy work here in Washington, D.C.

The post Dr. Yoonmo Sang: OITP Research Associate appeared first on District Dispatch.

Islandora Vagrant 2.0 and Islandora Vagrant Base Box / Islandora

Hi folks- 

I cut a new release of Islandora Vagrant on October 1st. You can grab it here, and the full changelog is available here

Big changes here! 

We've addressed issue #27, which will speed up creating an Islandora Vagrant box significantly. Basically, we've split the repository into two: Islandora Vagrant and Islandora Vagrant Base Box 

Islandora Vagrant downloads and installs all the Islandora Foundation Drupal modules, libraries, and any libraries they require. 

Islandora Vagrant Base Box is the base system without the Islandora Foundation Drupal modules, libraries, and the libraries they require. 

The first time you build Islandora Vagrant after this release, it will take a bit because it is downloading the base box from Atlas. After that, the builds should be significantly faster since we're not building and compiling everything for each build. 

Many thanks to Logan Cox and Mark Cooper for their contributions on this release! 

As always, you can grab the latest and greatest by cloning the repo and running from there :-) 



Taking a Deep Breath after a Systems Migration / ACRL TechConnect

I have been mostly absent from ACRL Tech Connect this year because the last nine months have been spent migrating to a new library systems platform and discovery layer. As one of the key members of the implementation team, I have devoted more time to meetings, planning, development, more meetings, and more planning than any other part of my job has required thus far. We have just completed the official implementation project and are regular old customers by now. At this point I finally feel I can take a deep breath and step back to think about the past nine months in a holistic manner to glean some lessons learned from this incredible professional opportunity that was also incredibly challenging at times.

In this post I won’t go into the details of exactly which system we implemented and how, since it’s irrelevant to the larger discussion. Rather I’d like to stay at a high level to think about what working on such a project is like for a professional working with others on a team and as an individual trying to make things happen. For those who are curious about the details of the project, including management and process, those will be detailed in a forthcoming book chapter in Exploring Discovery (ALA Editions) edited by Ken Varnum. I will also be participating in an AL Live episode on this topic on October 8.

A project like this doesn’t come as a surprise. My library had been planning a move to a new platform for a number of years, and ran an extremely inclusive process when selecting one. When we found out that we would be able to go ahead with the implementation, I knew that I would have the opportunity to lead the implementation of the new discovery layer on the technical side, as well as coordinate much of the effort on the user outreach and education side. That was an exciting and terrifying role, since while it was far less challenging technically to my mind than working on the data migration, it would be the most public piece of the project. In addition, it quickly became clear that our multi-campus situation wasn’t going to fit neatly into the built-in solutions in the products, which required a great deal of additional work to understand the interoperability of the products and how they interacted with other systems. Ultimately it was a great education, but in the thick of it the work seemed to have no end in sight.

To that end, I wanted to share some of the lessons I learned from this process both as a leader and a member of a team. Of course, many of these are widely applicable to any project, whether it’s in a library systems department or any work place.

Someone has to say the obvious thing

One of the joys of doing something that is new to everyone is that the dread of impostor syndrome is diminished. If no one knows the answer, then no one can look like an idiot for not knowing, after all. Yet that is not always clear to everyone working on the project, and as the leader it’s useful to make it clear you have no idea how something works when you don’t, or, if something is “simple” to you, to still say exactly how it works to make sure everyone understands. Assuming others already know the obvious thing means forgetting your own path to learning, in which it was helpful to hear the simple thing stated clearly, perhaps several times. Besides the obvious implications of people not understanding how something works, it robs them of a chance to investigate something of interest and become a real contributor. Try not to make other people have to admit they have no idea what you’re talking about, whether or not you think they should have known it. This also forces you to actually know what you’re talking about. Teaching something is, after all, the best way to learn it.

Don’t answer questions all the time

Human brains can be rather pathetic moment to moment even if they do all right in the end. A service mentality leads (or in some cases requires) us to answer questions as fast as we can, but it’s better to give the correct answer or the well-considered answer a little later than answer something in haste and get the answer wrong or say something in a poor manner. If you are trying to figure out things as you go along, there’s no reason for you to know anything off the top of your head. If you get a question in a meeting and need to double check, no one would be surprised. If you get an email at 5:13 PM after a long day and need to postpone even thinking about the answer until the following day, that is the best thing for your sanity and for the success of the project both.

Keep the end goal in mind, and know when to abandon pieces

This is an obvious insight, but crucial to feeling like you’ve got some control of the process. We tend to think of way more than we can possibly accomplish in a timeframe, and continual re-prioritization is essential. Some features you were sold on in the sales demo end up being lackluster, and other features you didn’t know existed will end up thrilling you. Competing opportunities and priorities will always exist. Good project management can account for those variables and still keep the core goals central and happening on time. But that said…

Project management is not a panacea

For the whole past nine months I’ve had a vision that with perfect project management everything could go perfectly. This has crept into all areas of my life and made me imagine that I could project manage my way to perfection in my life with a toddler (way too many variables) or my house (110 year old houses are nearly as tricky as toddlers). We had excellent project management support from the vendor as well as internally, but I kept seeing room for improvement in everything. “If only we had foreseen that, we could have avoided this.” “If only I had communicated the action items more clearly after that meeting, we wouldn’t be so behind.” We actually learned very late in our project that other libraries undertaking similar projects had hired a consultant to do nothing but project management on the library side, which seemed like a very good idea, though we managed all right without one. In any event, a project manager wouldn’t have changed some of the most challenging issues, which didn’t have anything to do with timelines or resources but with differences in approach and values between departments and libraries. Everyone wants the “best” for the users, but the “best” for one person doesn’t work at all for another. Coming to a compromise is the right way to handle this; there’s no way to avoid conflict and the resulting change in the plan.

Hopefully we all get to experience projects in our careers of this magnitude, whether technical or not. Anything that shifts an institution to something new that touches everyone is something to take very seriously. It’s time-consuming and stressful because it should be! Nevertheless, managing time and stress is key to ensure that you view the work as thrilling rather than diminishing.

Quantifying Performance Gains When Batching Indexing Updates to Solr / SearchHub

Batching when indexing is good:

For quite some time it’s been part of the lore that one should batch updates when indexing from SolrJ (the post tool too, but I digress). I recently had the occasion to write a test that put some numbers to this general understanding. As usual, YMMV. The interesting bit isn’t the absolute numbers, it’s the relative differences. I thought it might be useful to share the results.


Well, the title says it all: batching when indexing is good. The biggest percentage jump is the first order of magnitude, i.e. batching 10 docs instead of 1. Thereafter, while the throughput increases, the jump from 10 -> 100 isn’t nearly as dramatic as the jump from 1 -> 10. And this is particularly acute with small numbers of threads.

I have heard anecdotal reports of incremental improvements when going to 10,000 documents/packet, so I urge you to experiment. Just don’t send a single document at a time and wonder why “Indexing to Solr is sooooo slooooowwwww”.
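For illustration, batching with SolrJ looks roughly like this (a hedged sketch using the 4.x-style HttpSolrServer mentioned below; the URL, batch size and field names are arbitrary choices, not my actual test program):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        final int batchSize = 1000;
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);

        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title_s", "document " + i);
            batch.add(doc);

            if (batch.size() >= batchSize) {
                server.add(batch);   // one HTTP request for the whole packet
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);       // don't forget the final partial packet
        }
        server.commit();
        server.shutdown();
    }
}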

Note that by just throwing a lot of client threads at the problem, one can make up for the inefficiencies of small batches. This illustrates that the majority of the time spent in the small-batch scenario is establishing the connection and sending the documents over the wire. For up to 20 threads in this experiment, though, throughput increases with the packet size. And I didn’t try more than 20 threads.

All these threads were run from a single program, it’s perfectly reasonable to run multiple client programs instead if the data can be partitioned amongst them and/or you’d rather not deal with multi-threading.

This was not SolrCloud. I’d expect these general results to hold though, especially if CloudSolrClient (CloudSolrServer in 4.x/5x) were used.

Minor rant:

Eventually, you can max out the CPUs on the Solr servers. At that point, you’ve got your maximum possible throughput. Your query response time will suffer if you’re indexing and querying at the same time, of course. I had to slip this comment in here because it’s quite often the case that people on the Solr User’s list ask “Why is my indexing slow?”. 90+ percent of the time it’s because the client isn’t delivering the documents to Solr fast enough and Solr is just idling along using 10% of the CPU. And there’s a very simple way to figure that out… comment out the line in your program that sends docs to Solr, usually a line like:
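// an illustrative SolrJ call; your variable and collection names will differ
server.add(docList);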


Anyway, enough ranting, here are the results, I’ll talk about the environment afterward:

Nice tabular results:

As I mentioned, I stopped at 20 threads. You might increase throughput with more threads, but the general trend is clear enough that I stopped. The rough doubling from 1 to 2 threads indicates that Solr is simply idling along most of the time. Note that by the time we get to 20 threads, the increase is not linear with respect to the number of threads and eventually adding more threads will not increase throughput at all.

Threads    Packet Size    Docs/second

     20              1          5,714
     20             10         16,666
     20            100         18,450
     20          1,000         20,408

      2              1            767
      2             10          4,201
      2            100          7,751
      2          1,000          9,259

      1              1            382
      1             10          2,369
      1            100          5,319
      1          1,000          5,464

Test environment:

  • Solr is running a single node on a Mac Pro with 64G of memory; 16G is given to Solr. That said, indexing isn’t a very memory-heavy operation, so the memory allocated to Solr is probably not much of an issue.
  • The files are being parsed locally on a Macbook Pro laptop, connected by a Thunderbolt cable to the Mac Pro.
  • The documents are very simple, there is only a single analyzed field. The rest of the fields are string or numeric types. There are 30 or so short string fields, a couple of integer fields and a date field or two. Hey, it’s the data I had available!
  • There are 200 files of 5,000 documents each for a total of 1M documents.
  • The index always started with no documents.
  • This is the result of a single run at each size.
  • There is a single HttpSolrServer being shared amongst all the threads on the indexing client.
  • There is no query load on this server.

How the program works:

There are two parameters that vary with each run,  number of threads to fire up simultaneously and number of Solr documents to put in each packet sent to Solr.

The program then recursively descends from a root directory and every time it finds a JSON file it passes that file to a thread in a FixedThreadPool that parses the documents out of the JSON file, packages them up in groups and sends them to Solr. After all files are found, it waits for all the threads to finish and reports throughput.

I felt the results were consistent enough that running a statistically valid number of tries and averaging across them all and, you know, doing a proper analysis wasn’t time well spent.


Batch documents when using SolrJ ;). My purpose here was to give some justification to why updates should be batched; just saying “it’s better” has much less immediacy than seeing a 1,400% increase in throughput (1 thread, the difference between 1 doc/packet and 1,000 docs/packet).

The gains would be less dramatic if Solr was doing more work I’m sure. For instance, if instead of a bunch of un-analyzed fields you threw in 6 long text fields with complex regex analysis chains that used back-references, the results would be quite different. Even so, batching is still recommended if at all possible.

And I want to emphasize that this was on a single, non SolrCloud node since I wanted to concentrate entirely on the effects of batching. On a properly set-up SolrCloud system, I’d expect the aggregate indexing process to scale nearly linearly with the number of shards in the system when using the CloudSolrClient (CloudSolrServer in 4x).

The post Quantifying Performance Gains When Batching Indexing Updates to Solr appeared first on

prohibitions / Andromeda Yelton

Last night I made myself a whiskey sour and curled up to start watching Ken Burns’ documentary on Prohibition.

The early activists, as I’d known, as you likely knew, were women. They were the ones who had to bear the costs of alcohol-fueled domestic violence, of children with no other caregivers, of families without economic support (and in a world where both childcare expectations and restrictions on women’s labor force participation reduced their capacity to provide that support). And they needed the costs to stop, but they didn’t have a rhetorical or legal space to advocate for themselves, so they advocated for the children, for God, and against the conduct of men.

They had some early and dramatic successes — how could it be anything short of terrifying to have two hundred women kneeling in prayer and singing hymns outside the door to your saloon? — but ultimately the movement was unsuccessful before the founding of the Anti-Saloon League, where the pictures all flip from women to men, and the solutions are political, using suffrage (a tool women didn’t have) and targeting specific politicians (who are best targeted through conversations women couldn’t have, in places women couldn’t go). After prayer and hatchets failed — after women constrained to situate themselves within or dramatically against a tiny range of acceptable conduct failed — the political machine, and the men eligible to be part of it, succeeded.

Things have changed less than I might have hoped.

Really rockin ARSL conference in Little Rock / District Dispatch

I just returned from the 2015 conference of the Association of Rural & Small Libraries (ARSL) held in Little Rock. Wow! The enthusiasm, energy, creativity, and collegiality of these librarians are extraordinary—and especially so, given the very limited resources that many of them have available to serve their communities. And ARSL itself is on a fine upward trajectory, concluding their successful conference with record attendance exceeding 500 participants. ARSL truly lived up to the conference theme of “Rockin in Little Rock.” Also, we were pleased to augment the ALA divisional presence from the Young Adult Library Services Association (YALSA) and the Public Library Association (PLA).

My colleague Marijke Visser and I talked about our work at a session entitled “Information Policy: We’re from Washington, and Yes, Here to Help You.” We provided an overview of national information policy and advocacy, steeped in concepts and learnings from the Policy Revolution! initiative, using our recent E-rate work as a case study to explicate the nuts and bolts of the practice of public policy in Washington.

We are exploring the possibilities with respect to rural areas as a priority focus in the coming two years for the Policy Revolution! initiative. Indeed, that was the main impetus for Marijke and me to go to the ARSL conference. We spent an evening with a number of thoughtful librarians from the ARSL community, including newly-installed ARSL President Jet Kofoot. They had a number of ideas for us to consider as we develop a strategy, as well as feedback on some of our ideas and questions, confirming some things we thought would resonate, but also offering some critical feedback.

I greatly increased my knowledge about small and rural libraries at conference sessions. For example, I learned a lot about stealth (passive) programming—activities in which users themselves contribute much of the work. The classic example is the library’s summer reading program—the library designs the program, but users check out, read, and record the books they read. There are many variations of this theme such as a “1000 Books before Kindergarten” program. But there are a host of other programs that include ways to share favorite comic books, local self-published books, scavenger hunts, and jigsaw puzzles.

Perhaps the newest insight for me is that small and rural libraries, often operating with very lean budgets and resources (and space!), become creative by necessity. These libraries may well represent an underestimated resource in contemplating the future of libraries.

We have worked with ARSL in the past, notably collaborating with them on last year’s monumental E-rate proceeding. We look forward to a closer working relationship in the future.

The post Really rockin ARSL conference in Little Rock appeared first on District Dispatch.

Islandora Show and Tell: Vassar Digital Library / Islandora

This Show & Tell has been in the making ever since Joanna DiPasquale presented at the Islandora Conference on how Vassar has started to store and display panoramic images in Islandora. Vassar has a pretty impressive array of collections, spanning images, videos, audio, and books that stretch the presentation of a "book" in Islandora. All of this is mediated through some small but significant customizations, such as modifications to the Internet Archive book viewer to display translated pages and feed the MODS record to a 'details' tab. Vassar also runs a custom Solr configuration that issues a transitive SPARQL query against the resource index when a record is ingested or updated. The query reports the collection tree of the object, which then gets written into Solr, creating a new facet and enabling a “search just this collection” feature. You can find many of these customizations on Joanna DiPasquale's GitHub.

Vassar is also one of several Islandora sites making use of the Islandora OAI module to easily harvest their records for the DPLA. They also deserve our thanks for being the initial sponsors of the Islandora Book Batch module, which has greatly streamlined the process of ingesting books for the entire community.

Those are the technical details. As for content, I like to judge a collection by what happens when I throw "cats" into a search box, and I was not disappointed. The results are mainly from diaries, but the quotes (easy to grab because of great OCR) are wonderful, such as:

Vassar has the greatest number of cats around. There are pretty cats and homely cats. There is one half-blind cat, and one three footed cat. The cats with whom we are best acquanted are a large black cat and a gray and white cat. The black cat is a great favorite of Stematz's. She has often been in here and has made herself quite at home. The gray and white cat was here all one day last week, and we didn't know but she'd taken up her abode here. Over on the north corridor are a gray cat and two kittens, which belong to Miss Jones. The kittens are very pretty and nice, and have very noble titles, Julius Caesar and Tiberius Gracchus.

-Wyman, Anne (Southworth). Diary, 1878-1880 (page 84)

For the poor homely cats, or:

When does Mrs. Foote go? She doth much deceive [me]. She does not go. Lately she catches mice. Let me relate an incident! A mouse finds his way to the shrine of her cupboard. The mouse don't know much. How should she get him out! Allie Wright has a cat. Hattie is stationed at the cupboard door to hold it tight while Mrs. F. goes for the cat. The cat is brought & introduced unceremoniously "Now, kitty, get it quick!" (Mew). "Come kitty" (Mew) "Have you got it kitty?" (Mew) "There I guess she's got it". (Mew). 

Did she? O, sad sequel. O, blighted hopes. The mouse is still alive. 

- Bromley, Frances M. Diary, 1872 (page 282)

Because "O sad sequel. O blighted hopes. The mouse is still alive," is a phrase that needs to be worked into modern conversation far more often. Or:

He saw a cross between a monkey and a cat, and a cat and a rabbit.

- John Burroughs Journal, 1888-1889 (page 38)

Because that's an amazing family tree.

And that's just the cats. Vassar Digital Library's collections extend to some pretty amazing places, which I will let Joanna DiPasquale describe:

What is the primary purpose of your repository? Who is the intended audience?

Vassar College’s digital repository aims to provide access to a wide variety of digital materials to our faculty and students as well as to researchers around the world.  We have documents, letters, diaries, oral histories, papers, and even some art images.  As our website states, “Through our digital collections, we aim to provide access to high-quality digital content generated by the Libraries for research and study, as open as possible; support the teaching, learning, and research needs of the College; preserve at-risk or fragile physical collections through digitization, or at-risk born-digital collections through reformatting; expose hidden, less-used physical collections through access to digital surrogates; and foster experimental, cutting-edge, and innovative projects through technology.”  

Why did you choose Islandora?

We chose Islandora for both philosophical and practical reasons. Philosophically, I am a strong open-source proponent, and when we were looking at software for our nascent digital library, Islandora clearly fit the bill: it is an open-source product that merges two open-source products (Drupal and Fedora Commons) in a streamlined way. It also has a very active user community that drives new features, and our needs were fairly close to those features on the roadmap. But we had some practical concerns: since we are a small liberal arts college with very limited digital library staff, we also needed vendor support services for things we could not do ourselves. The availability of companies such as discoverygarden (DGI) helped enormously when we chose Islandora.

Which modules or solution packs are most important to your repository?

We use a combination of the Book Content Model solution pack with the Islandora Importer, Book Batch, IA Book Viewer, and Islandora Solr modules most frequently. We have a significant amount of multi-page objects (letters, diaries, minutes, etc.) that are either OCRed or manually transcribed, and these modules bring these objects to life. We import items into our repository and, because of Islandora, we can provide a very readable, full-text-searchable, page-by-page view (even with zoomable images!) of each complex object.

One other thing to note is that the Forms capabilities of Islandora are amazing. I couldn’t work without this critical functionality. I am able to provide incredibly complicated metadata forms to our lab manager or student workers so that they can make changes to our records, and they are able to focus exclusively on content without ever worrying about how to make those changes.

What feature of your repository are you most proud of?

One of my favorite collections is the Albert Einstein Digital Collection at Vassar College Libraries, made possible by a grant from Dr. Georgette Bennett in honor of Dr. Leonard Polonsky CBE. It contains our “2-up” text viewer for multilingual content, and I think that this viewer represents the amazing combination of liberal arts need with large-scale repository functionality.  

The repository features letters from Einstein about more social and political issues, rather than purely scientific ones, so it really documents a lesser-known aspect of Einstein’s life.  But the content had been underutilized because much of it was in German, and many students could not read the materials.  Vassar is an undergraduate institution, so when we digitized the collection, we believed it was incredibly important to provide both the German transcription of each letter as well as an English translation for our students.  We worked with the wonderful translators from Caltech to provide these materials, and then with help from Discovery Garden, we built a new feature into the IA Book Viewer to show the German and English datastreams side-by-side.  So now the repository searches and displays both German and English, which aids researchers that bring different language skills to their work.

Who built/developed/designed your repository (i.e., who was on the team?)

Generally, I am responsible for the repository; I’ve developed its functionality beyond core Islandora, designed the user interface, created processing scripts, developed new modules, etc.  I also work with our amazing digital lab manager, Sharyn Cadogan, who is not only an extraordinary imaging specialist but also a terrific project manager and supervisor to a team of students.  When we started down the Islandora road, Discovery Garden provided much-needed expertise for core Islandora, Fedora, and the entire application stack – they installed, tested, and got our repository off the ground, and we still maintain close ties with them.  Our central information technology group on campus generously provides server space for this important campus resource, for which I am grateful!  

Do you have plans to expand your site in the future?

Always, always.  We are poised to scale up in interesting and amazing ways, and I’m looking forward to the future.  In the past four years, we have digitized and made available more than 60,000 pieces of content across multiple collections.  This content has generally focused on material from our Archives & Special Collections Library, and we will continue to build collections with that focus.  But we are also moving into more digital scholarship areas, and many of the modules and features of Islandora Scholar will most likely find their way into Vassar’s repository as this area increases.

What is your favourite object in your collection to show off?

There are so many great items and collections!  Here are a few that I love:

  • One of my favorite collections contains the personal papers of the U.S. suffragist Susan B. Anthony, which provides interesting materials about abolition as well as suffrage. 
  • One of our most beautiful items is the 1685 early anatomy book by Govard Bidloo, Anatomia humani corporis.  
  • Did you know that Albert Einstein often wrote poetry to friends and family? (I didn’t!) This is such a sweet poem, and it shows off the “2-up”/multilingual feature of the IA Book Viewer.    


Stylish Writing - Jill Lepore / Ed Summers

For one of my classes this semester we’ve been reading Stylish Academic Writing by Helen Sword. The goal of the class is to help PhD students learn about the value of research, with a particular focus on accessible research that makes a significant difference in a particular community and (hopefully) the world. Too often valuable research results are packaged up in dry containers that are generally inaccessible to other members of the field, and to all the smart and interested people outside of the academic community. We’re reading Sword’s book to learn some techniques for helping make this happen.

For this week’s class we were asked to select a piece of “stylish academic writing” and briefly discuss what we liked about it.

In a way I’m cheating. Many consider The New Yorker to be the epitome of stylish writing, and its pages are no stranger to academia. So selecting an article from there seems like an easy win, right? Well, yes – but the trick (for me) was finding something that was actually relevant to my interest in researching Web archives. Fortunately, I knew about Jill Lepore’s article The Cobweb already, and it was a natural fit since I had already done a bit of blog writing in response to her article.

Jill Lepore is a professor of American history at Harvard University. She has written about the need to bridge the gap between academic writing and the larger world of publishing and the Web. So she clearly cultivates accessibility and compelling narrative in her own writing. But I really didn’t know about her work before running across The Cobweb, which was circulating around in the discussion of nerdy folks I follow on Twitter who are interested in Web archives.

One of the most notable things about The New Yorker is its iconic use of cartoons. They immediately drag you in, if you are draggable, and this one worked on me:

Ok, an image isn’t really writing, and this one was created by an artist named Harry Campbell, not Lepore herself. But presentation matters, and collaboration matters even more. Another thing that matters for engagement is a good title. Cobweb is a near perfect title since it references the World Wide Web with a metaphor of disrepair or inattention – all in one word. Nicely played Lepore.

However, there are lots and lots of other artfully crafted words. The thing I really admire in this piece is Lepore’s ability to discuss a fairly mundane technical topic (the preservation of Web content) in terms that are both immediately accessible, while also making the topic relevant to a completely different non-technical domain, the field of history.

The footnote, a landmark in the history of civilization, took centuries to invent and to spread. It has taken mere years nearly to destroy. A footnote used to say, “Here is how I know this and where I found it.” A footnote that’s a link says, “Here is what I used to know and where I once found it, but chances are it’s not there anymore.” It doesn’t matter whether footnotes are your stock-in-trade. Everybody’s in a pinch. Citing a Web page as the source for something you know—using a URL as evidence—is ubiquitous. Many people find themselves doing it three or four times before breakfast and five times more before lunch. What happens when your evidence vanishes by dinnertime?

This paragraph is a great example of Lepore in action. The evidence vanishing by dinnertime references a just previous discussion about how a post to a Russian social-media site by a Ukrainian separatist leader was deleted two hours later. The post included video of a plane being shot down, which was believed to be a Ukrainian transport plane…but turned out to be Malaysia Airlines Flight 17.

This paragraph connects the practice of history and the mechanics of the simple footnote to things we do everyday, and which I am doing in this post – linking to things on the Web in order to cite them. The whimsical and ordinary use of lunch and dinnertime in the description juxtaposed with the tragedy of the 283 people who were killed when this plane was shot down works to highlight this discontinuity between the ephemeral nature of the Web and the work of the historian.

Here’s another paragraph that I particularly liked, and which I was trying to recall recently (I’m putting it on the Web here so I can hopefully remember next time):

The Web dwells in a never-ending present. It is—elementally—ethereal, ephemeral, unstable, and unreliable. Sometimes when you try to visit a Web page what you see is an error message: “Page Not Found.” This is known as “link rot,” and it’s a drag, but it’s better than the alternative. More often, you see an updated Web page; most likely the original has been overwritten.

I think the first sentence is one of the best and shortest descriptions of the peculiar medium of the Web I’ve ever read. Lepore is able to balance what often appears to be a fatal flaw in the technology of the Web with its real strengths: immediacy and currency.

This post wasn’t meant to be a complete review of Cobweb but just to highlight a few things I really like about it for class. There’s a lot more good stuff in there, so be sure to check it out if you are interested in the Web and history. Maybe I’ll add more over time as I inevitably return to reading Cobweb in the future…as long as it’s still there. I guess there’s always the print copy somewhere if they forget to renew their DNS registration, or the whole Internet goes up in flames due to a world energy crisis and cyberwarfare. Maybe?

Pushing back against network effects / David Rosenthal

I've had occasion to note the work of Steve Randy Waldman before. Today, he has a fascinating post up entitled 1099 as Antitrust that may not at first seem relevant to digital preservation. Below the fold I trace the important connection.

All over this blog (e.g. here) you will find references to W. Brian Arthur's Increasing Returns and Path Dependence in the Economy because it pointed out the driving forces, often called network effects, that cause technology markets to be dominated by one, or at most a few, large players. This is a problem for digital preservation, and for society in general, for both economic and technical reasons. The economic reason is that these natural but unregulated monopolies extract rents from their customers. The technical reason is that they make the systems upon which society depends brittle, subject to sudden, catastrophic and hard-to-recover-from failures.

It is, therefore, important to find ways to push back against the network effects that force consolidation. But it is extremely difficult to find such ways. Waldman's post looks at doing this in the "sharing economy", specifically in the case of Uber:
Right now, the greatest danger to the rest of us from “sharing economy” platforms like Uber is that these platforms benefit from network effects that render them “winner-take-all”. Today’s apparent innovators are really contesting a tournament to become tomorrow’s monopolists. The outcome we should be hoping to achieve is neither to strangle these products in their cribs (they are often great products that create real efficiencies), nor to permit wannabe monopolists to win their prize. We should want competitive marketplaces in the products these platforms provide.
The key to Uber's business model is their claim that the drivers are not Uber employees, even if their entire income comes via Uber:
Much of the network effect that might render Uber-like platforms anticompetitive derives from density of suppliers. Customers flock to the platform that has the densest, richest set of offerings. When suppliers pick just one, they prefer to work for the platform that has the most customers. So once one platform pulls ahead, a cycle may kick in, virtual or vicious depending on your perspective, leading to a single dominant platform. But if suppliers “multihome”, if pretty much all of them sell through pretty much all of the networks, this “market cramp” can be interrupted and multiple platforms might survive. ... But ... platforms’ incentives are to make it hard as possible for suppliers to be promiscuous.
Waldman's suggestion is that instead of directly regulating these platforms via antitrust law, whose enforcement has become lackadaisical, they should be regulated indirectly, via the tax code. If, to be taxed as self-employed (1099) rather than employed by Uber, its drivers were:
required [to] multihome ... platforms’ incentives would reverse. They would face a choice of bearing much higher per-supplier costs (as “suppliers” become employees), or of insisting that suppliers also do business elsewhere. All of a sudden Uber would need Lyft and Lyft would need Uber ... Reclassification of suppliers as employees would serve as an antitrust doomsday device for sharing economy platforms.
Ever since the antitrust action against Microsoft produced only token consequences, attempting to directly regulate technology markets to prevent monopolization has seemed futile. But Waldman shows a way in which indirect pressure could be more effective.

For example, Amazon is effectively a monopoly in the cloud. Suppose insurance companies started pricing in the monoculture risk this imposes on Amazon's business customers. Companies would have an incentive to have backup systems in other vendors' clouds. The cloud industry would have incentives to standardize APIs and implement effective fail-over mechanisms. Amazon might end up a bit less dominant, and society might end up a bit less brittle.

Perhaps this isn't a very realistic example. But it is worth following Waldman in thinking about indirect as opposed to direct ways to push back against the negative aspects of network effects.

We screwed up: identities in loosely-coupled systems / Dan Scott

A few weeks ago, I came to the startling and depressing realization that we had screwed up. It started when someone I know and greatly respect ran into me in the library and said "We have a problem".

I'm the recently appointed Chair of our library and archives department, so being approached about a problem isn't surprising. However, the severity of the problem was.

Here's what happened: the person in question had asked for a group study key at the circulation desk, and handed over the university photo ID card to check the item out. The library staff person noted that the name on the photo ID card didn't match the name in the library system. Even though the photo was an exact match, the staff person refused to check out the item to the patron.

The next day, after the person who suffered that indignity approached me, I was able to update the name for the account in the library system in about a minute. While apologizing profusely. And I had to explain why our system had failed this person. A few years back we were able to start automatically polling our university's LDAP server for new university accounts and immediately create the corresponding library system account, with a unique barcode, and update the LDAP account with that new barcode. That removed an entire set of (essentially duplicated) paperwork that new students and faculty used to have to fill out to get a university photo ID card, as well as reduced the amount of personally identifiable information held in our library system to the bare minimum of name, email address, and university ID number.

However, we have never been able to poll the university LDAP server for updates. Admittedly, my primary interest in updates was to synchronize accounts when students become alumni, or staff retire, etc., but in retrospect the ability to synchronize name changes (and email addresses, which are often derived from names) is blindingly obvious and absolutely necessary. When a person goes through the effort of changing their name, they are changing their identity in a very meaningful, significant fashion. To have the identity they have consciously abandoned resurface in various systems is (at best) frustrating, but can also be utterly demeaning. This is not the experience we want for our patrons.

In retrospect, at least two problems have surfaced with this incident:

  1. The name attached to the account in the library system should have matched the name on the card.
  2. In a conflict between systems, given a choice between believing the person in front of you or one of the systems, staff should respect the person in front of them and note the problem for someone to follow up on.

I've held initial conversations with our university IT department to try and figure out strategies for closing that synchronization gap. In the short term, I'm willing to handle identity changes in a purely manual way (having the Registrar notify me when a change needs to be made). We have also reminded staff to defer to people rather than systems, as the people who make and maintain the systems are fallible (mea culpa).

In the slightly longer term, I'm building the synchronization piece so that we can trigger an update for an individual account at any given time. And I'm posting this in the hopes that it might prompt you to consider your various loosely-coupled systems and the identity management for the accounts within, just in case there are some synchronization gaps that you might be able to close. Because our patrons deserve respect, in person, and in the systems we design to serve them.
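For anyone in a similar situation, here is a hedged sketch of what that per-account lookup and update might look like in Java using JNDI. The LDAP host, base DN, attribute names, and the updatePatron() helper are placeholders for illustration, not details of our actual systems:

import java.util.Hashtable;

import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class AccountSync {
    public static void syncOne(String universityId) throws Exception {
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://ldap.example.edu:389");
        InitialDirContext ctx = new InitialDirContext(env);

        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        controls.setReturningAttributes(new String[] {"displayName", "mail"});

        NamingEnumeration<SearchResult> results = ctx.search(
            "ou=people,dc=example,dc=edu",
            "(employeeNumber=" + universityId + ")", controls);
        if (results.hasMore()) {
            Attributes attrs = results.next().getAttributes();
            String name = (String) attrs.get("displayName").get();
            String email = (String) attrs.get("mail").get();
            updatePatron(universityId, name, email);  // push the current identity into the ILS
        }
        ctx.close();
    }

    // Placeholder: this is where the library system's patron update API would be called.
    private static void updatePatron(String id, String name, String email) {
        System.out.printf("Would update %s -> %s <%s>%n", id, name, email);
    }
}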

Computer Assisted Appraisal in Web Archives / Ed Summers

If you prefer this post is also available as a PDF.

In 2008 Google estimated that it had 1 trillion unique URLs in its index (Alpert & Hajaj, 2008). When I looked today (7 years later) the Internet Archive’s home page announced that it has archived 438 billion Web pages. It’s an astounding achievement, but the Web has certainly grown many times over since 2008. Also, it’s important to note the difference in terminology: URL versus Web page. A Web page has a unique URL, or address, but the content of a Web page can change over time. Capturing the record of documents as they change over time is essential for Web archives. So by design, there are many duplicate URLs included in the 438 billion Web pages that the Internet Archive has collected. If you ignore the duplicates and the fact that the Web has grown, it looks like the Internet Archive has archived 43.8% of the Web. But if you consider the growth of the Web, the duplicates that are present in the archive, and the fact that Google removes URLs from its index when documents disappear, the actual percentage of the Web that is preserved must be much, much lower.

As more and more information is made available on the Web how do archivists decide what to collect, and when? The Internet Archive’s Heritrix bots walk from link to link on the Web archiving what they can. Members of the International Internet Preservation Consortium (IIPC) run their own crawls of specific parts of the Web: either country domains like the .uk top-level-domain, or specific websites that have been deemed within scope of their collection development policy. These policies inform the appraisal of whether particular Web content is deemed worth adding to an archive. Archivists are aware of how these appraisal decisions shape the archive over time, and by extension also shape what we know of our past. Appraisal, or deciding what to save, and what not to save, is difficult in the face of so much information.

This annotated bibliography provides a view into the emerging field of computer assisted appraisal in Web archives. How can computers assist archivists in the selection of content for archiving? Similarly, how can archivists guide the appraisal and crawling of Web content? There are two primary themes that emerge in this review: identification and evaluation. This review is not meant to be complete, but rather to be suggestive of a field of study at the intersection of archival and computer science.

1. Finding Content

The following papers discuss ways of discovering relevant content on the Web. Particular attention has been paid to approaches that incorporate social media into appraisal decisions.

Jiang, J., Yu, N., & Lin, C.-Y. (2012). FoCUS: Learning to crawl Web forums. WWW 2012 Companion.

As more content goes on the Web, researchers of all kinds are increasingly interested in analyzing Web forums in order to extract structured data, question/answer pairs, product reviews and opinions. Forum crawling is non-trivial because of paging mechanisms that can result in many duplicate links (different URLs for the same document), which can consume large amounts of time and resources. For example, the researchers found that 47% of URLs listed in sitemaps and feeds were duplicates in a sample of 9 forums.

Jiang et al. detail a procedure for automatically detecting the structure of forum websites, and their URL types, in order to guide a Web crawler. The research goal is to save time, and improve coverage compared to breadth-first and other types of crawlers. The process is to automatically learn Index-Thread-Page-Flipping (ITF) regular expressions for identifying the types of pages in Web forums, and then use these patterns during the Web crawl.

The researchers studied the structure of 40 different Web forum software platforms to find common patterns in page layout/structure as well as URL and page types. For example, timestamps on pages in chronological and reverse chronological order are good indicators of thread and index pages respectively. Also, paging elements can be identified by noticing links with longer than usual URLs combined with short numeric anchor text. A training set for four different web forums was fed into a Support Vector Machine classifier, which was then used to generate ITF regular expressions for each site.
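
To make the idea concrete, here is a toy illustration, not the learned patterns from the paper, of what ITF-style URL classification might look like once regular expressions for a given forum have been generated; the patterns below are invented for a hypothetical forum layout.

    import re

    # Hypothetical ITF patterns for one imaginary forum platform. FoCUS learns
    # these per site with a classifier; here they are simply hard-coded.
    ITF_PATTERNS = [
        ("thread",        re.compile(r"/thread/\d+/?$")),
        ("page_flipping", re.compile(r"[?&]page=\d+$")),
        ("index",         re.compile(r"/forum/\d+/?$")),
    ]

    def classify_forum_url(url):
        """Return the first ITF type whose pattern matches, or None."""
        for url_type, pattern in ITF_PATTERNS:
            if pattern.search(url):
                return url_type
        return None

    # A crawler would follow index and page-flipping links to enumerate threads,
    # fetch each thread page once, and skip URLs that match none of the patterns.
    print(classify_forum_url("http://example.com/thread/4521"))      # thread
    print(classify_forum_url("http://example.com/forum/12?page=3"))  # page_flipping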

To analyze their procedure they selected nine different types of forum software and ran three types of crawlers over each: a generic crawler, an entry point crawler and a structure driven crawler. The measured effectiveness and coverage were reported for each combination. Experimental results found that the structure driven crawler significantly outperformed the other types of crawlers. The authors note that these results have bearing on other types of similarly structured sites such as question/answer sites and blogs. They also hope to improve the 97% coverage by handling JavaScript paging mechanisms which were present in 2% of the forums tested.

On the surface this paper doesn’t seem to have much to do with automated appraisal in Web archives. But the authors demonstrate that attention to the detail and structure of websites can improve efficiency and accuracy in document collection. Forums, blogs and question/answer sites are very common on the Web, and represent unique and high value virtual spaces where actual people congregate and share opinions on focused topics. As such they are likely candidates for appraisal, especially in social science and humanities focused Web archives. The ability to automatically identify forums on particular topics as part of a wider web crawl could be a significant feature when deciding where to focus archiving resources. In addition, this work presents important heuristics for identifying duplicate content, which is useful for knowing what not to collect, as we will see later in Kanhabua, Niederée, & Siberski (2013).

Gossen, G., Demidova, E., & Risse, T. (2015). ICrawl: Improving the freshness of web collections by integrating social web and focused web crawling. In Proceedings of the Joint Conference on Digital Libraries. Association for Computing Machinery. Retrieved from

This paper draws upon a significant body of work into focused Web crawling and specifically work done as part of the ARCOMEM project. Gossen et al. provide an important analysis of how the integration of social media streams, in this case Twitter, can significantly augment the freshness and relevance of archived Web documents. In addition they provide a useful description of their system and its open source technical components to help others to build on their work.

The analysis centers on measuring relevancy and freshness of archived content in two contemporary Web crawls related to the Ebola outbreak and the conflict in the Ukraine. In each case four different Web crawls were run: unfocused, focused, Twitter-based and integrated. Each type of crawl begins with a seed URL and wanders outwards collecting more results. The focused crawler uses its own link prioritization queue to determine which pages get collected first. The Twitter-based crawler simply crawls whatever URLs are mentioned in relevant tweets from the Twitter API. The integrated crawler is a combination of the focused and Twitter-based crawlers, and represents the main innovation of this paper.
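
As a rough sketch of the integration idea (the weights and scoring below are my own assumptions, not iCrawl's actual implementation), a crawl frontier can prioritize URLs by blending a topical relevance score with a signal of how actively a URL is being shared on Twitter:

    import heapq

    class Frontier:
        """Toy crawl frontier that prefers URLs which are both topically
        relevant and currently being shared on Twitter. The weights and the
        normalisation of tweet counts are arbitrary assumptions."""

        def __init__(self, topic_weight=0.7, social_weight=0.3):
            self.topic_weight = topic_weight
            self.social_weight = social_weight
            self._heap = []
            self._seen = set()

        def add(self, url, topic_score, tweet_count):
            if url in self._seen:
                return
            self._seen.add(url)
            social_score = min(tweet_count / 10.0, 1.0)  # crude normalisation
            priority = (self.topic_weight * topic_score
                        + self.social_weight * social_score)
            heapq.heappush(self._heap, (-priority, url))  # min-heap, so negate

        def next_url(self):
            return heapq.heappop(self._heap)[1] if self._heap else None

    frontier = Frontier()
    frontier.add("http://example.org/ebola-outbreak-update", 0.9, 25)
    frontier.add("http://example.org/unrelated-story", 0.2, 0)
    print(frontier.next_url())  # the relevant, heavily shared page comes first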

The results show that the Twitter-based crawler is able to return the freshest results, with the integrated crawler coming in second. However, the integrated crawler performed best at returning the most relevant results. Freshness on the Web is difficult to measure since it involves knowing when a page was first published, and there is no consistent metadata for that. The researchers devised some heuristics for determining creation time, and eliminated pages from the study for which freshness couldn’t be determined. The relevancy measure is also used by the prioritization queue, so in some ways I am concerned that relevancy was only measuring itself. But it is interesting that relevancy was improved while factoring in the Twitter stream. One area of related research that could build on this work is how feedback from archivists or curators could influence the system.

Yang, S., Chitturi, K., Wilson, G., Magdy, M., & Fox, E. A. (2012). A study of automation from seed URL generation to focused web archive development: The CTRnet context. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 341–342). Association for Computing Machinery.

This paper introduces the idea of using social media streams, in this case Twitter, to determine a list of seed URLs to archive in time sensitive situations such as natural disasters and other crises. In these time sensitive situations it is difficult for archivists to build a list of potential seed URLs to harvest, because of the large amount of new content being published on the Web in a very short time period. The goal was to prototype and test a system that could run with minimum human intervention. This was the first reference I could find of using Twitter in this way to augment web archiving, which was discussed more fully in Gossen et al. (2015).

The authors created a prototype Python Django application that manages the workflow from relevant tweets, to URL extraction, to Web crawling with Heritrix, to data extraction. The external service TwapperKeeper, which is no longer available today, was used to collect the Twitter data. Details about the data extraction did not seem to be included in the paper. The study used 5 different contemporary events to study the precision of the system: the Virginia Tech shooting, a Measles outbreak, a typhoon in the Philippines, violence in Sudan, and Emergency Preparedness. The results showed that precision varied depending on the type of query used. In some cases a query picked up unrelated Web content because it was unintentionally broad. The paper mentioned a filtering component for reducing spam, but did not discuss it in detail. It also gives precision results without really discussing their method for measuring it. But there was a poster that accompanied the paper, so perhaps these details could be found there.
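
One concrete step in such a pipeline, shown here only as an illustrative sketch (the file names are assumptions and the field names follow the classic Twitter API v1.1 payload), is turning collected tweets into a seed list that Heritrix can read:

    import json

    def seed_urls_from_tweets(path):
        """Read a file of one-JSON-object-per-line tweets (Twitter API v1.1
        style) and return the de-duplicated expanded URLs they link to."""
        seeds = set()
        with open(path, encoding="utf-8") as fh:
            for line in fh:
                tweet = json.loads(line)
                for link in tweet.get("entities", {}).get("urls", []):
                    url = link.get("expanded_url")
                    if url:
                        seeds.add(url)
        return sorted(seeds)

    # Heritrix reads its seed list from a plain text file, one URL per line.
    with open("seeds.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(seed_urls_from_tweets("tweets.jsonl")) + "\n")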

The paper does a nice job of introducing a new idea (social media streams in Web archiving) and sets the stage for future work in terms of how to filter out spam and measure precision. Similar to Gossen et al. (2015), it hints at future work that could integrate an archivist or curator who can influence the direction of the crawl as part of the process.

Pereira, P., Macedo, J., Craveiro, O., & Madeira, H. (2014). Time-aware focused web crawling. In Advances in Information Retrieval (pp. 534–539). Springer.

As discussed in Gossen et al. (2015), determining the time a Web page was published, or its freshness, can be surprisingly difficult. When you view a Web page it is most likely being served directly from the originating Web server on the Internet, or perhaps from an intermediary cache that has a sufficiently recent copy. However, it is useful to be able to determine the age of a page, especially when ordering search results, and also for appraising a given web page in an archival setting.

Pereira discusses a technique for crawling the Web in a time-aware way. Most previous work on focused web crawling has involved topic analysis (the text in the page and its similarity to the desired topic). This paper details a process for determining the age of a given Web page (temporal segmentation), and then integrating those results into a Web crawler’s behavior (temporal crawling).
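
As a very rough sketch of the second half of that process (the temporal segmentation here is reduced to pulling four-digit years out of the page text, which is far cruder than what the paper describes), a crawler could apply a time restriction like this before expanding a page's outlinks:

    import re

    YEAR = re.compile(r"\b(?:19|20)\d{2}\b")

    def years_mentioned(text):
        """Crude stand-in for temporal segmentation: collect the four-digit
        years that appear in the page text."""
        return {int(match) for match in YEAR.findall(text)}

    def passes_time_restriction(text, window):
        """Keep (and expand links from) a page only if it mentions at least one
        year inside the window of interest, e.g. (1939, 1945) for a crawl on
        World War 2."""
        return any(window[0] <= year <= window[1] for year in years_mentioned(text))

    print(passes_time_restriction("The siege began in 1941 and ...", (1939, 1945)))       # True
    print(passes_time_restriction("Posted on 2014-02-03 about gardening", (1939, 1945)))  # False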

The paper describes an experiment that compares the results of crawling two topics (World War 2 and September the 11th) by crawling outwards from Portuguese Wikipedia pages, using two different techniques: no time restriction and a time restriction. The results indicate that the crawl with a time-restriction performs significantly better over time, however the shape of the results is different for each topic.

The authors admit that the results are preliminary, and that their project is a proof of concept. Unfortunately the authors don’t appear to provide any source code for their prototype. It would be interesting to compare the time-based crawling with more traditional topic-based crawling, and perhaps consider a hybrid approach that would allow both approaches to be used in a single crawl.

2. Evaluating Content

Once content is identified and retrieved there are a set of factors that can be considered to help inform a preservation decision about the content. Metrics that can be generated without significant human intervention are important to highlight, as are systems that allow interventions from an archivist to shape the appraisal process.

Lyle, J. A. (2004). Sampling the umich.edu domain. In 4th International Web Archiving Workshop, Bath, UK (Vol. 2). Retrieved from

This paper is part historical overview of sampling in traditional archival appraisal, and part a study of sampling in the Web archive records for the umich.edu domain. Lyle provides an excellent overview of passive and active appraisal methods, and how they’ve been employed to help shape archival collections. Active appraisal largely came about as the result of an overabundance of records in the post-World War 2 era. However, a partial shift back to passive appraisal was observed as electronic records became more prevalent, storage costs plummeted, and it became conceivable to think of collecting everything. In addition it became possible to automatically crawl large amounts of Web content given the structure of the World Wide Web. At the same time there was a movement towards active appraisal, where archivists became more involved with record creation, to ensure that electronic documents use particular formats and have standard metadata.

Lyle also discusses the benefits and drawbacks of several different types of document sampling methods: purposive, systematic, random and mixed-mode sampling. The intent is to use these methods on records of high evidential value, but not on records of high informational value. The distinction between informational and evidential value introduced by Schellenberg isn’t clearly made, and Lyle questions whether informational documents are always more valuable to the record than evidential documents.

In the second half of the paper Lyle documents the results of a Web crawl of the umich.edu domain performed by the Internet Archive. This focused crawl was performed for the purposes of this study, and identified four million URLs. Only 87% of these URLs were working (not broken links) and almost half were deemed to be duplicates. An analysis of different types of sampling was performed on the resulting 1.5 million documents using information from crawl logs: size of the document and size of the URL. The study looked specifically at bias in the sample results, and found that stratified random sampling worked best; although details of how the bias in results was ascertained were not discussed.
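
For readers who want to picture what that looks like in practice, here is a toy stratified random sample over crawl-log records, stratified by document size; the size thresholds and per-stratum sample size are arbitrary choices of mine, not the ones Lyle used.

    import random
    from collections import defaultdict

    def stratify_by_size(records):
        """Group crawl-log records into size strata (thresholds are arbitrary).
        Each record is a dict with at least a 'size_bytes' key."""
        strata = defaultdict(list)
        for rec in records:
            if rec["size_bytes"] < 10_000:
                strata["small"].append(rec)
            elif rec["size_bytes"] < 100_000:
                strata["medium"].append(rec)
            else:
                strata["large"].append(rec)
        return strata

    def stratified_sample(records, per_stratum=100, seed=42):
        """Draw the same number of documents from every stratum so that small,
        medium, and large documents are all represented in the sample."""
        rng = random.Random(seed)
        sample = []
        for stratum in stratify_by_size(records).values():
            sample.extend(rng.sample(stratum, min(per_stratum, len(stratum))))
        return sample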

In the discussion of the results Lyle surmises that sampling is a useful way to get an idea of what sub-collections are present in a large set of Web documents, rather than a criterion for accessioning itself. He notes that to some extent the whole process was a bit suspect, since the crawl itself potentially had inherent bias: the chosen entry point, the algorithm for link discovery, and the structure of the graph of documents. The author’s specific conclusions are somewhat unclear but indicate that more work is needed to study sampling in Web archives. Sampling could be a useful appraisal tool to discover the shape of collections, but is not an explicit mechanism for determining whether to preserve or destroy a particular document. The design of such a sampling tool that could inform appraisal decisions and be integrated with Web crawl results could be a valuable area of future work.

Kenney, A. R., McGovern, N. Y., Botticelli, P., Entlich, R., Lagoze, C., & Payette, S. (2002). Preservation risk management for web resources: Virtual remote control in Cornell’s Project Prism. D-Lib Magazine, 8(1). Retrieved from

This paper uses the technique of risk management to identify factors that can be used in the appraisal of Web documents. These factors center around the document itself, the document’s immediate context in the Web (its links), the website that the document is a part of, and the institutional context that the website is situated in. Some of the document and contextual factors are reflected in later work by Banos, Kim, Ross, & Manolopoulos (2013) such as format, standards, accessibility and metadata. Particularly interesting metrics mentioned by the authors are monitoring factors such as inbound and outbound links over time, and the shape of the website graph. These measures of change can be used to determine the rate at which a site is being maintained. The assumption is that a site that is not being maintained is more of a preservation risk.

A general move is made in this paper to reposition archivists from being custodians of content to being active managers of digital objects on the network. This effort seems to be worthwhile, especially if it is sustained. The discussion would have benefited from references to the existing literature on post-custodial archives that was available at the time. One somewhat discordant part of the paper is that the two project links featured prominently at the top of the article do not go to the PRISM project page, which is still available. Also, in hindsight the criticisms of the Internet Archive seem overly dismissive. The paper would have been better situated in terms of opportunities for establishing a community of practice.

Kanhabua, N., Niederée, C., & Siberski, W. (2013). Towards concise preservation by managed forgetting: Research issues and case study. In Proceedings of the 10th International Conference on Preservation of Digital Objects, iPres (Vol. 2013). Retrieved from

Appraisal is often thought about in terms of what artifacts to preserve or save for the future. But implicit in every decision to save is also a decision not to forget. Consequently, it’s also possible to look at appraisal as decisions about what can be forgotten. In this paper Kanhabua and her colleagues at the L3S Research Center investigate processes for making these types of decisions, or managed forgetting, which is materialized in the form of forgetting actions such as aggregation, summarization, revised search, ranking behavior, elimination of redundancy and deletion.

The article provides a useful entry point into the literature about human memory in the field of cognitive psychology. It also highlights several jumping off points for HCI discussions about designing systems and devices for managing memory. But the primary focus of the paper is on the interaction between information management systems and archival information systems: the first which is used to access information, and the second being the stores of content that can be accessed.

In order to describe how the act of forgetting is present in these systems the authors used historical snapshots of public bookmarks made available by the BibSonomy social bookmarking project. The 15 BibSonomy snapshots taken at different periods of time provide a view into when users have chosen to bookmark a particular resource, as well as when that resource has been deleted. Their analysis determined that there was a correlation between a user’s delete ratio and the number of bookmarks they created, but not between the user’s delete ratio and the total number of bookmarks they possessed.

The authors admit that they are still in a very early phase of research into the idea of managed forgetting. I think the paper does a nice job of articulating why this way of looking at appraisal matters, and provides an example of one possible study that could be done in this area. I think it would have been useful to discuss a little bit more about how the choice of BibSonomy as a platform to study could have potentially influenced (but not invalidated) the results. It would be interesting to take another social bookmarking site like Pinboard or Digg and see if a similar correlation holds. The implications of managed forgetting for building digital preservation and access systems seem like a very viable area of research, and I hope to see more of it.

Banos, V., Kim, Y., Ross, S., & Manolopoulos, Y. (2013). CLEAR: A credible method to evaluate website archivability. International Journal on Digital Libraries, 1–23.

CLEAR stands for Credible Live Evaluation of Archive Readiness, a process for measuring the archivability of a particular website. The paper provides a method for generating an archivability score based on a set of five archivability facets: accessibility, standards compliance, cohesion, performance and metadata. The authors created a working prototype called ArchiveReady that you can find on the Web and use to evaluate Websites manually or automatically with their API.
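
To give a feel for how such a score can be used, here is a toy weighted average over the five facets; the weights and threshold below are placeholders of my own, not the published CLEAR formula, and the ArchiveReady API is not reproduced here.

    # Toy archivability score: a weighted average over the five CLEAR facets.
    # The weights and the accession threshold are arbitrary placeholders.
    FACET_WEIGHTS = {
        "accessibility":        0.25,
        "standards_compliance": 0.25,
        "cohesion":             0.20,
        "performance":          0.15,
        "metadata":             0.15,
    }

    def archivability_score(facet_scores):
        """facet_scores maps each facet name to a value between 0 and 1."""
        total = sum(FACET_WEIGHTS.values())
        return sum(FACET_WEIGHTS[f] * facet_scores[f] for f in FACET_WEIGHTS) / total

    scores = {"accessibility": 0.9, "standards_compliance": 0.7,
              "cohesion": 0.8, "performance": 0.6, "metadata": 0.4}
    if archivability_score(scores) >= 0.7:  # threshold chosen by the archivist
        print("greenlight for accession")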

The motivation for the work on CLEAR was traced back to previous work in New Zealand on the Web Curator Tool (WCT) and in the UK on the Web At Risk project, which made quality assurance part of the archiving process. Quality assurance was found to be particularly time-consuming, which slowed down timely processing. Banos et al.’s goal with CLEAR is to provide a measure of archivability that allows archivists to select a quality threshold that Web content must meet to be greenlighted for accession into an archive.

The paper includes useful details about the technical system: Python, Flask, Backbone and MySQL for the Web application, Redis for managing parallel processing, and JHOVE for file identification. In addition, the precise formula for generating the CLEAR metric was clearly described. An analysis of CLEAR results compared with quality assurance results from a human curator would have been useful. It would be interesting to see if automated and human appraisal decisions are correlated, and also what threshold values might reasonably be used.

SalahEldeen, H. M., & Nelson, M. L. (2013). Carbon dating the web: Estimating the age of web resources. In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 1075–1082). International World Wide Web Conferences Steering Committee. Retrieved from

As Gossen et al. (2015) also discuss, it is often important to identify when a document was first added to the Web. The age of Web documents, or their freshness, is important for digital library research, as well as for making informed appraisal decisions (Yang et al., 2012). Determining the age of Web documents can be difficult when the page itself lacks an indicator of when it was created. Metadata such as the Last-Modified HTTP header are not typically reliable as a source for the creation date, since publishers often change it to encourage reindexing by search engines and to influence cache behavior. Therefore alternative methods need to be invented.

SalahEldeen and Nelson show how trails of references, citations and social media indicators can be used to estimate the creation time for a Web document. The paper describes, and also demonstrates (through a prototype application), a mechanism for estimating Web page creation time using backlink discovery (Google), social media sharing of the document (Twitter using the Topsy API), archival versions available (Memento Aggregator API) and URL shortening services (Bitly).
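
Stripped of the per-source plumbing, the core of the estimate is simply the earliest timestamp any source can vouch for: the page must have existed by then. The sketch below assumes those per-source lookups have already been made, and the sample values are invented.

    from datetime import datetime

    def estimate_creation_date(evidence):
        """Given the earliest timestamp each source reports for a URL (or None
        when a source knows nothing about it), estimate the page's creation
        date as the earliest of them."""
        observed = [ts for ts in evidence.values() if ts is not None]
        return min(observed) if observed else None

    evidence = {
        "first_memento":   datetime(2011, 3, 2),   # earliest archived snapshot
        "first_tweet":     datetime(2011, 2, 27),  # earliest share seen on Twitter
        "first_backlink":  None,                   # backlink source had nothing
        "first_short_url": datetime(2011, 3, 5),   # first URL shortening observed
    }
    print(estimate_creation_date(evidence))  # 2011-02-27 00:00:00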

The authors tested their system by creating a corpus of 1,200 Web documents from popular media sites with clearly marked creation times, and then used their algorithm to guess the creation time. The results showed that they were able to determine the correct creation date in 75% of the cases. However, the Google backlinks and Bitly short URLs had little effect on the result. The determining factors were the archival snapshots available from Web archives and the references found in Twitter. A future area of research would be to identify other potential social media indicators such as ones from Facebook, Instagram and Twitter. Do these exhibit similar behavior? Also, it would be interesting to see how well the process works on a comparison baseline of pages that are not from large media outlets and may not be as well represented on Twitter.

Brunelle, J. F., Kelly, M., SalahEldeen, H., Weigle, M. C., & Nelson, M. L. (2014). Not all mementos are created equal: Measuring the impact of missing resources. In Proceedings of the 14th ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 321–330). IEEE Press.

Brunelle et al. observe that the nature of the Web has changed significantly in the last 15 years. Specifically, there is an increasing amount of dynamic content being made available with JavaScript and content from Web service APIs. This content has historically been difficult to archive because of the asymmetry between the technology used to archive content (the Web crawler) and the technology used for accessing archived content (the Web browser). Their study aims to measure 1) the degree to which this hypothesis is true and 2) the degree to which this impacts the experience of using archived content over time.

The first part of the study uses two sets of 1000 URLs: one being Bitly URLs found in Twitter, and the other being a sample of URLs found in ArchiveIt collections. These two sets of URLs were deemed to be quite different in terms of their source and type. A measure of URL complexity was used to characterize the URLs from each source, which showed that URLs obtained through Twitter were significantly more complex than those from ArchiveIt. Three different archiving tools (wget, Heritrix and WebCite) were then used to archive the URLs, and the results were compared using an instance of the Wayback Machine. The Wayback Machine ran on a server disconnected from the Internet in order to highlight potential leakage (URLs that targeted the live Web instead of the archive). A headless Web browser (PhantomJS) was used to measure the types of requests and their results. Results showed that the Twitter dataset is much more difficult to archive, and that this is the result of reliance on JavaScript for complete rendering.

The second part of the study looks at the combined set of Twitter and ArchiveIt URLs, and identifies ones that are available for the 2005-2012 time period. Mementos, or snapshots of the pages, were then retrieved from the Internet Archive and the number of requests coming from HTML vs JavaScript was measured. The authors were able to show that between 2005-2012 there was a 14.7% increase in JavaScript use. More striking was the finding that over the same period the number of missing resources due to JavaScript rose from 39% to 73.1%.
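
The measurement itself boils down to tallying failed embedded-resource requests by what initiated them. The study instrumented PhantomJS for this; the sketch below works over a simplified request log whose format is purely an assumption for illustration.

    from collections import Counter

    def missing_by_initiator(request_log):
        """request_log is a list of dicts like
        {"url": ..., "status": 404, "initiator": "javascript" or "html"}.
        Returns how many failed requests each kind of initiator produced."""
        failures = Counter()
        for req in request_log:
            if req["status"] >= 400:
                failures[req["initiator"]] += 1
        return failures

    log = [
        {"url": "http://example.org/app.js",     "status": 200, "initiator": "html"},
        {"url": "http://api.example.org/feed",   "status": 404, "initiator": "javascript"},
        {"url": "http://example.org/style.css",  "status": 200, "initiator": "html"},
        {"url": "http://cdn.example.org/widget", "status": 503, "initiator": "javascript"},
    ]
    print(missing_by_initiator(log))  # Counter({'javascript': 2})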

The format of this paper was somewhat hard to digest in that it really felt like two separate studies in one. The results were significant for Web archive crawlers, which must integrate JavaScript execution in order to create full-fidelity captures of websites. In addition, the study highlighted the need for easy-to-use curator tools that help identify leakage in the Web archive content. Ideally there should be a solution which does not require the archivist to run a local Web archive server (Wayback Machine) with the Web archive data held locally. The implications for Web publishers were also significant, if archivability and accessibility of their web content are of interest. Another avenue to explore would be an archivability metric that could be derived through an analysis of the page, which could be useful when appraising content from a Web crawl.


Alpert, J., & Hajaj, N. (2008, July). We knew the web was big. Retrieved from

Approaching Join Index in Apache Lucene / SearchHub

As we countdown to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Mikhail Khludnev’s session on joins and block-joins in Lucene.

Lucene works great with independent text documents, but real-life problems often require handling relations between documents. Aside from several workarounds, like term encodings, field collapsing or term positions, we have two mainstream approaches to handle document relations: join and block-join. Both have their downsides. Join lacks performance, while block-join makes index updates really expensive, since it requires wiping a whole block of related documents.

This session presents an attempt to apply a join index, borrowed from the RDBMS world, to address the drawbacks of both join approaches currently present in Lucene. We will look into the idea per se, possible implementation approaches, and review the benchmarking results.

Mikhail has years of experience building backend systems for the retail industry. His interests span from general systems architecture, API design, and performance engineering all the way to testing approaches. For the last few years he has worked on an eCommerce search platform, extending Lucene and Solr, contributing back to the community, and speaking at Lucene Revolution and other conferences.

Approaching Join Index: Presented by Mikhail Khludnev, Grid Dynamics from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Approaching Join Index in Apache Lucene appeared first on

Building a Large Scale SEO/SEM Application with Apache Solr / SearchHub


As we countdown to the annual Lucene/Solr Revolution conference in Austin this October, we’re highlighting talks and sessions from past conferences. Today, we’re highlighting Rahul Jain’s session on indexing large scale SEO/SEM data.

Search engine optimization (SEO) is the process of affecting the visibility of a website or a web page in a search engine’s natural or un-paid (organic) search results, while search engine marketing (SEM) is a form of Internet marketing that involves the promotion of websites by increasing their visibility in search engine results pages (SERPs) through optimization and advertising. We are working on building an SEO/SEM application where an end user searches for a keyword or a domain and gets all the insights about it, including search engine ranking, CPC/CPM, search volume, number of ads, competitor details, etc., in a couple of seconds. To have this intelligence, we get huge web data from various sources, and after intensive processing it amounts to as much as 40 billion records/month in a MySQL database, with 4.6 TB of compressed index data in Apache Solr.

Due to the large volume, we faced several challenges in improving indexing performance and search latency and in scaling the overall system. In this session, I will talk about several of our design approaches to import data faster from MySQL, tricks & techniques to improve indexing performance, Distributed Search, DocValues (a life saver), Redis, and the overall system architecture.”

Rahul Jain is a freelance Big Data/Search consultant from Hyderabad, India, where he helps organizations scale their big-data/search applications. He has 7 years of experience in the development of Java and J2EE based distributed systems, with 2 years of experience working with big data technologies (Apache Hadoop/Spark) and search/IR systems (Lucene/Solr/Elasticsearch). In his previous assignments, he was associated with Aricent Technologies and Wipro Technologies Ltd. in Bangalore, where he worked on the development of multiple products. He is a frequent speaker and has given several talks/presentations on topics in the search/IR domain at various meetups and conferences.

Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rahul Jain from Lucidworks

Join us at Lucene/Solr Revolution 2015, the biggest open source conference dedicated to Apache Lucene/Solr on October 13-16, 2015 in Austin, Texas. Come meet and network with the thought leaders building and deploying Lucene/Solr open source search technology. Full details and registration…

The post Building a Large Scale SEO/SEM Application with Apache Solr appeared first on

People probably scroll right past your above-the-fold content / LibUX

The Q2 Mobile Overview Report by ScientiaMobile dropped over the summer, flush with data about the use of and behavior surrounding mobile devices across 12 billion requests. I drummed up a brief but good-to-chew-on writeup about the numbers that caught my attention. My favorite is this one:

21% of people start scrolling before the page finishes loading. They might just scroll past the upcoming events or new resources at the top of the page that libraries are trying to promote. Whoops. Chalk that up to “above the fold” irony.

Read the “3 Numbers about Mobile Usage that Impact Libraries” over at Public Libraries Online.

The post People probably scroll right past your above-the-fold content appeared first on LibUX.

Great Library UX Ideas published at Weave / LITA

Matthew Reidsma of Grand Valley State University, Editor-in-Chief of Weave, the Journal of Library User Experience, announced today the publication of the submissions from the winner and the first two runners-up of the 2015 Great Library UX Ideas Under $100 contest.

In June 2015, LITA President Rachel Vacek’s Program Planning Team partnered with Weave to hold a contest for great, affordable UX ideas for libraries. The winner received some fabulous prizes, but the committee had trouble choosing just one of the entries for recognition. Therefore they chose a winner and two runners-up for the 2015 Great Library UX Ideas Under $100.

Congratulations to all the winners:

  • Conny Liegl, Designer for Web, Graphics and UX, Robert E. Kennedy Library at California Polytechnic State University
  • Rebecca Blakiston, User Experience Librarian, University of Arizona Libraries
  • Shoshana Mayden, Content Strategist, University of Arizona Libraries
  • Nattawan Wood, Administrative Associate, University of Arizona Libraries
  • Aungelique Rodriguez, Library Communications Student Assistant, University of Arizona Libraries
  • Beau Smith, Usability Testing Student Assistant, University of Arizona Libraries
  • Tao Zhang, Digital User Experience Specialist, Purdue University Libraries
  • Marlen Promann, Graduate Research Assistant, Purdue University Libraries

Weave’s primary purpose is to provide a forum where practitioners of UX in libraries (wherever they are, whatever their job title is) can have discussions that increase and extend our understanding of UX principles and research. This is our primary aim: to improve the practice of UX in libraries, and in the process, to help libraries be better, more relevant, more useful, more accessible places.

For questions or comments related to LITA programs and activities, contact LITA at (312) 280-4268 or Mark Beatty,

Managing iPads – The Volume Purchase Program / LITA

Photo (c) John Klima

This is part 2 in a series of managing iPads in the library. Part 1 (about the physical process of maintaining devices) was posted back in August. Part 3 (how to manage the software aspect of your devices) will come out next month.

If you’re going to offer iPad services to your patrons—either as a part of programming/instruction or as items they can check out and take home—you’re going to want some way to get apps in bulk. If you’re only looking at free apps, then you’ll want to wait for the next post, where I talk about how to get apps onto devices. But if you’re going to use paid apps (which is really what you want to do, right?), then read on.

You could set up each iPad individually and add a credit card/gift card to each one and buy apps as you needed them. That might not be too onerous if you’re managing a handful of devices. What if you have more than 20? What if you have hundreds? Then you’ll want a different solution.

Thankfully Apple has a solution called the Volume Purchase Program (VPP). You’ll notice there are two links on that page: one for Education and one for Business. If you’re an academic or school library you can probably use the Education link (you might need to work with someone in your finance department to get things set up). If you’re at a public library, like I am, you’ll have to use the Business link. If you’re not sure which you should use, Apple defines institutions eligible for the Education program as:

Any K-12 institution or district or any accredited, degree-granting higher institution in the U.S. may apply to participate.

If you qualify for the Education VPP you’ll get discounts on app purchases (typically in volumes of 20 or more) and you’ll also be able to purchase books for classrooms through the iBooks store. Apple has a wonderful guide on how the VPP program works for education. A Business VPP account doesn’t get the discounts that an Education account does but it can still be used to buy apps in bulk and buy books from the iBooks store.

The process of creating the account is roughly the same for either the Education or Business VPP. First, you need to verify that you are authorized to enroll your institution in a VPP account. In my case this involved using a verified email address and then accepting terms and conditions on behalf of my library. It’s more complicated for an Education VPP account and you can read the details on the link above. The Business VPP account has a fairly comprehensive faq for any questions not covered in this post.

After that you create a special Apple ID that works as an administrator of the VPP account. This ID will only be used to purchase apps/books through the VPP. You can have as many administrators as you want, but I find having only one or two works best so that you can better manage how the VPP account is used. Too many people working on the same thing tends to end up with them inadvertently working against each other.

Quick note: if you are a Business VPP user, you cannot set yourself up as tax exempt (assuming you are a tax-exempt institution). All is not lost, however. You can submit your email receipts to Apple to be reimbursed for taxes after you send them your paperwork showing that you are a tax-exempt institution. The process, despite being an extra step, works pretty well. I email my claims to Apple and we get a check for the taxes within a few weeks.

The whole process of creating a VPP account is pretty straightforward*. It makes the whole process of managing multiple iPads/iPhones a lot easier so it’s worth doing. All that’s left at this point is getting your purchased apps onto the devices.

When you buy apps in bulk, you’re given a list of redemption codes to download. I use Apple’s Configurator to deploy apps and manage devices. With the release of iOS 9, Apple is rolling out Mobile Device Management, and I’ll address both of those in the next post. Honestly, the VPP is one of the easier pieces of managing multiple iPads, but it’s a step you need to take.

Jump in the comments if you have follow-up questions!

* If you run into any problems, contact Apple support. They are super helpful and will get you the answers you need.